[Host] Welcome back, everyone, to All Things Data. In this block we have two talks. Before we jump into introducing the next talk, just a reminder that we have a Discord channel. There are some really good conversations there about the talks, and links to open source packages and other things people are talking about, so jump on in and say hello. We've also got people tuning in remotely who are joining in there.

So, to kick us off, we've got Alex Ware, and I'm just going to read a little intro here. Alex is a software engineer working for Geoscape Australia, based in Canberra. She previously spent several years working as a data engineer in the Australian Public Service. She's a co-organizer of the Canberra Python User Group and a co-organizer of the upcoming Django Girls Canberra workshop. She's passionate about big data, clean code, and supporting those with a marginalized experience of gender in tech. So with that, please give a big round of applause for Alex Ware, who's going to be talking about an introduction to PySpark.

[Alex Ware] Hello! Hopefully this microphone is working okay. Today I'm going to be giving an introduction to PySpark. In my talk I'm going to try to answer some questions, such as: what actually is PySpark? Can it really solve all of my data problems? And, possibly most important, are you sure I can't just use pandas instead?

We sort of covered who I am a little bit already: I'm a software engineer at Geoscape Australia, working predominantly with geospatial data. A very quick call-out for my employer: our claim to fame is G-NAF, the Geocoded National Address File, but we also do a whole bunch of other geospatial products. I know there's been a bit of discussion about geospatial stuff today, so if you're interested in things like data around property, cadastre, buildings, roads,
solar, and trees, maybe check us out. I was previously in the public service. I won't talk about that too much, other than to say it's where I first started using PySpark. And the workshop has actually happened now: it happened a week ago, it was a lot of fun, and thank you to Django Girls for supporting us in running it.

Okay, the really important disclaimer after that slide is that I don't speak for any current or former employers; I'm here entirely presenting my own opinions. And, very importantly, I don't have any links, major or minor or any at all, to Apache. I do not speak for them in any way. I am just a hobbyist who played around with the library and got talked into giving a talk at PyCon, as happens to the best of us.

Cool. As part of this talk I'm going to be giving a couple of code examples, and to do that I needed some data. Very fortunately, Brisbane City Council releases a bunch of information about the library checkouts that happen over a three-day period each month. They've been doing this since about the start of 2020, so there's a fair bit of data there, and I'm going to use some of it for my presentation.

So, let's start with pandas. Hopefully this is all pretty familiar so far. (Actually, it's very bright up here, so I can't see faces too well.) But yes, a basic example: we read in some data from July of this year, we select some columns, or fields, and we look at those rows. So far, so good.

We can do the same thing with PySpark. Now, you might be looking at this code example and saying, "Hey Alex, what's that thing going on with the SparkSession?" and I'm going to go, "That's a great question, and I'm not going to answer it yet." But if you look at the two lines underneath it, it's very similar to pandas: we read in the same data, select the columns, and we show it, with a very similar output.
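The slide code isn't captured in the transcript, but the pair of examples looked roughly like this. The file name and column names are stand-ins for the Brisbane data set, not the real ones:

```python
import pandas as pd

# Read the July checkout data and peek at a few columns.
df = pd.read_csv("library-checkouts-july.csv")
print(df[["title", "language", "age"]].head())
```

And the PySpark version, which reads almost the same apart from that mysterious session object:

```python
from pyspark.sql import SparkSession

# The SparkSession: the great question we're not answering yet.
spark = SparkSession.builder.getOrCreate()

# Same data, same columns, very similar output.
df = spark.read.csv("library-checkouts-july.csv", header=True)
df.select("title", "language", "age").show()
```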
Maybe we want to do something a little bit more interesting with our data and pull out some statistics. I actually don't write pandas very often, so if I've committed any cardinal sins, please forgive me, but hopefully it looks mostly how it's meant to look: getting out a row count, looking at how many different languages books have been checked out in, and what the breakdown is by the different age categories. Juvenile is the most popular for this slice across the Brisbane libraries, but interestingly, the adult category is pretty much up there as well.

And, possibly more interesting, let's look at the PySpark version. Pretty much all of the functions do what they say on the tin: we've got our count at the top, which tells us the number of rows in our data frame; we can select and filter down our data frame to get those distinct languages; and we can do a groupBy and a count on those age categories. So at this point it should at least be becoming a little bit obvious that, if pandas is Python pretending to be R to some extent, this part of the PySpark library is Python pretending to be SQL, which I quite like. I find it quite intuitive: you can make some assumptions about what you should be able to do based on your knowledge of SQL, and translate that across into Python. Cool.
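Again as a rough sketch of what was on the slide, with the same stand-in column names as before:

```python
from pyspark.sql import functions as F

# Row count for the data frame.
print(df.count())

# How many distinct languages have books been checked out in?
print(df.select("language").distinct().count())

# Breakdown of checkouts by age category.
df.groupBy("age").count().orderBy(F.desc("count")).show()
```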
Let's go with one more scenario. In this case we're going to try to group any checkouts that occur within five seconds of each other in the same library. I'm going to make the assumption that if it's happening close together in time, the same person did it. Now, obviously you might have different checkout machines in the same library, so it's not perfect, but it's going to let us play around with the data a little bit.

And we've got another code example. Don't worry too much if you can't really read this; it isn't one I'm going to go through in a lot of detail. This is more about proving that I wrote the code and that it is possible. It's also a little bit of my love letter to window functions, because I think they're fantastic, I love using them, and I liked that I got to use them in this example. I'll move on pretty quickly, but feel free, if you're curious about this or any other element, to come find me afterwards. I'm also going to be putting a lot of this up on GitHub, so you can find it later.

But if we run this, we can find the group with the largest number of checkouts, and we can go have a look at it. It's this one, which might be a little bit hard to read, but I love it, because I get to imagine some kid had just the best 68 seconds of their life as they borrowed out everything ever: there's Wings of Fire, Miles Morales, half of Anh Do's back catalog. And really, why would we work with data except to find things like this? My personal favourite: I don't know if you know how fastbacks work, but basically it's a category in the library where you get exactly a week to read that book and you are not allowed to renew it. The degree of optimism I get to imagine this kid has about their week... I love that so much.
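For the curious, here is one way to do that five-second grouping with window functions. This is a sketch of the approach rather than the exact slide code, and the timestamp and library column names are stand-ins:

```python
from pyspark.sql import Window, functions as F

# Order checkouts within each library by time.
w = Window.partitionBy("library").orderBy("checkout_time")

grouped = (
    df
    # Seconds since the previous checkout at the same library.
    .withColumn(
        "gap",
        F.col("checkout_time").cast("long") - F.lag("checkout_time").over(w).cast("long"),
    )
    # A gap over five seconds (or no previous row at all) starts a new group.
    .withColumn("starts_group", (F.col("gap").isNull() | (F.col("gap") > 5)).cast("int"))
    # A running sum of those flags gives every group its own id per library.
    .withColumn("group_id", F.sum("starts_group").over(w))
)

# Which single burst of borrowing was the biggest?
grouped.groupBy("library", "group_id").count().orderBy(F.desc("count")).show(1)
```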
Cool, but back to the point of the talk: what should I use? Who doesn't love a benchmark? Luckily I found one that someone made before me, so you can go check it out. This evaluation actually happened back in 2021, and all of these libraries have changed to some degree since then, so this is really just meant to show that they're all relatively similar on small data sets.

But if we go to a larger data set, and this is an important point, one of the reasons you might be interested in learning more about PySpark might have less to do with an innate interest in learning a new topic, and more to do with your current tools having stopped working for your data set because it got too big. So that's kind of a big selling point.

And maybe at this point you're like, "Okay, fair enough, I'm curious, I want to learn a bit more." Let's go have a look under the hood. This is the count function, and we see that when we call count on a data frame, it immediately calls something else and asks it for the count, and that something has a little "j" prefixing it, which makes us think that maybe, under the hood, there's something happening that's not totally Pythonic.

Then we go to collect. A collect is where we say, "Hey, I want this data frame as a list of records," and a list is very Python; this has to be far more Python. You know, this will be great. And... yes. Who doesn't love seeing sockets and pickling in their Python library?

Yeah. So, I'm told it works. A lot of people smarter than me have worked on this, I believe it works, and I try not to think too hard about the exact moment when data is being thrown back and forth in the worst game of catch ever. As far as I'm aware it works and we get the data out in Python. It does appear, though, that a Java library is entirely hiding under my Python library. Slightly concerning.
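To give a flavour of what that looks like, here is a simplified paraphrase of the sort of thing you find in the PySpark source; it is not the verbatim code:

```python
# DataFrame.count: the Python method mostly just delegates to the JVM,
# where self._jdf is a py4j handle to the underlying Java/Scala DataFrame.
def count(self):
    return int(self._jdf.count())

# DataFrame.collect is where the sockets and pickling come in: the JVM
# serialises the rows and streams them to Python over a local socket,
# where they are deserialised into Row objects. (Paraphrased; the real
# implementation has a fair bit more machinery around it.)
```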
And at this point I have to come clean and say that I have actually come to a Python conference, ostensibly to talk about a Python library, that is really just the smallest amount of Python wrapped around a Scala library. So, I'm sorry. You can, for the most part, pretend the Scala doesn't exist, except when it throws errors that are massive Java stack traces, which is slightly ominous coming from your Python code. I like this image, the star with the tiny little bit of python wrapped around it, because I think it's very illustrative of what's going on in this library.

But this does raise the obvious question: why? Why do we need a massive thing of Scala underneath our Python? What could it possibly be offering me that would justify this? And that's Spark. It's basically: what if we had functional programming on data, and we distributed it, and wouldn't that be great and fine and it would never cause any problems for anyone and it's fantastic. Cool. Sorry: distribution.

Now, it's important to touch on at this point that all of the code examples, everything I've been talking to up to now, is the driver program. We can see that lovely SparkContext I kind of dodged a little bit earlier. If you just want to get started and have a play around, and you're not too concerned about dealing with out-of-memory problems or really running a massive data set, you can get to this point and just have a play; you don't need to worry about the next part yet. You can do a pip install, maybe fiddle with a couple of Java settings, and you can open up a notebook and be running pretty quickly on your smaller library-checkout data set.
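Getting to that have-a-play stage really is small. Something like this, assuming you have a Java runtime installed for Spark to sit on:

```python
# pip install pyspark   (and make sure a JDK is on your PATH first)
from pyspark.sql import SparkSession

# A "cluster" that lives entirely in this one process: the driver and the
# workers are all local, using as many threads as you have cores.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("library-checkouts")
    .getOrCreate()
)
```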
But let's say you are interested in this idea: I want to throw some worker nodes at my problem, I want the benefit of being able to work with that larger data set and not spend three days waiting for my code to run. Great: we're going to stick a cluster manager in the middle. So we have our lovely diagram, and in terms of a dashboard it's going to look something like this. This one's pretty empty, but it's just there for illustration.

And basically the promise PySpark is making to us is: if you write this driver program, and you write all of your code using the library you've been given, it will handle all of the thinking about which data goes on which node, which node is doing which task, and how the processing is going to happen. Whether you believe the library is entirely up to you, but I will admit it does make life a lot easier sometimes.
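Moving to a cluster mostly changes how you launch the code rather than the code itself. A sketch, where the cluster manager's address is a made-up placeholder:

```python
# Launched with something like:
#   spark-submit --master spark://cluster-manager:7077 checkout_job.py
# (host and port here are placeholders for your own cluster manager)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("checkout-job").getOrCreate()

# From here the code looks the same as local mode; Spark decides which
# workers hold which partitions of the data and run which tasks.
```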
But if you don't totally believe the library when it says, "Hey, you never need to worry about it," we might want to know a little bit more about what's happening under the hood. To understand that, we have to go back to MapReduce, which is the thing that came before Spark. I'm going to move through this very, very quickly, but it's a nice bit of background.

MapReduce starts with some input data, splits the input data across the different nodes, applies a map to it, shuffles the data across the different nodes, applies a reduce to it, and outputs. It's lovely: it does exactly what it says on the tin, which makes it very simple and easy to remember, and it's super useful in the sense that it solves a problem around how we split out the data and the processing. The downsides are that if you want to do more complex things, you have to start stringing a lot of these together; those shuffling steps are going to take a while, and the input and output steps are going to start to come for you after a while. So eventually you are going to really see some issues with the runtime of your programs.

But we were all promised that Spark was going to do it better. So what does Spark do differently? Spark takes this idea of things like the map and the reduce, those transformations, and says: what if they were lazy? Then, as we're building them all up, we can build this lovely, big, beautiful evaluation plan of all the different transformations we're going to want to do, a lovely big directed acyclic graph. And then we can start to lay it out and optimize across it. We can break it into stages, where a stage is any processing we can do before we have to shuffle the data. So maybe we've got the parallelize, the filter, and the map; they can all happen with the data laid out across the nodes as it already is. This moving things around and reorganizing under the hood I think of a little bit like the relationship of SQL to relational algebra. I don't know if that's a totally accurate parallel, but it's useful for me in my head. Then we've got a stage two where, because we're going to reduce by key, we need to shuffle the data relative to the key we're reducing on. And maybe we have to shuffle again for a stage three, because we're going to do a join and we're joining on some different fields, potentially. So we have this lovely optimized graph that PySpark is producing for us.
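As a toy illustration of that shape of plan, this mirrors the stages just described using the lower-level RDD API; the data is made up:

```python
# Stage 1: parallelize, filter, and map can all run where the data sits.
lines = spark.sparkContext.parallelize(
    ["chermside,3", "garden city,1", "chermside,5", "toowong,2"]
)
pairs = (
    lines
    .filter(lambda line: not line.endswith(",1"))  # drop single checkouts
    .map(lambda line: (line.split(",")[0], int(line.split(",")[1])))
)

# Stage 2: reduceByKey forces a shuffle so that each key lands on one node.
totals = pairs.reduceByKey(lambda a, b: a + b)

# Stage 3: the join can force another shuffle, this time on the join key.
regions = spark.sparkContext.parallelize([("chermside", "north"), ("toowong", "west")])
joined = totals.join(regions)

# Everything above is still just the plan; only the action below runs it.
print(joined.collect())
```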
There is one thing that might come up, and how you use Spark determines whether you hit it sooner or later: you might get an out-of-memory error at some point. That's going to be really annoying, because you're going to be like, "Hey, I just started using this library because it promised it would solve all of my memory issues and I would never have to worry about that ever again." And yes. The problem is that this particular out-of-memory issue is actually that the plan got too big, so you either need to simplify the plan or allocate more memory in config. It's a very easy fix, and there's a bunch of Stack Overflow questions and answers about it; it's just a really weird thing to look up, so I like to flag it early, before you spend half a day trying to figure out what just went wrong and why your program is complaining at you.
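The config side of that fix is a one-liner at session-build time; the memory value has to be in place before the JVM starts, and the 4g here is just an example figure:

```python
from pyspark.sql import SparkSession

# More driver memory gives a big evaluation plan more headroom.
spark = (
    SparkSession.builder
    .master("local[*]")
    .config("spark.driver.memory", "4g")
    .getOrCreate()
)
```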
Cool. So we've got our lovely transformations, we've got all those stages of processing, but they're all lazy, so at some point we're going to have to throw some actions into the mix, which is when we actually want the result. These are eager: they force the evaluation of the plan.

And you might be sitting there thinking, "Great, I've got you. I'm going to have my lovely plan, I'm going to do my stages one, two, three of processing, and then I'm going to do my first action, which is going to be a count, because I want to know how many rows of data my output is going to have. I'm going to have my show, because I want a couple of examples of what my output looks like. And then I'm going to save out: saveAsTable, save as CSV, whatever I'm going to do." And that's the plan.

Well, yes, but your processing might end up looking more like this. Because, for all of the optimization that happens on the transformation side, Spark is going to be really eager as soon as it hits an action. So it's going to do all the processing when you ask it for a count, and it'll give you that count result. Then it will potentially do all of that processing again and show your results. Then it'll do all of that processing a third time, and then it'll save out the data. So if your programs are running really slowly, if it feels like they're taking maybe three to four times longer than they probably should, this might be what's happening.

Again, in terms of introductory stuff, I'm not going to go hugely in depth into memory management, but it's worth flagging that it's something you will need to think about sooner or later, and the "sooner" is going to be when things start taking way longer than you think they should. You definitely need to be thinking about this to avoid that sort of repeated processing across the stages.

As an example, I found this Stack Overflow question, and I love it, because someone has basically asked, "Hey, I put some checkpoints in, and I have all these skipped stages; is that helping my performance?" And the answer is yes. We can break this down and see the concepts we've been talking about up to this point. Each row is an action that has happened. It might be a little bit hard to see, but in the second column from the right we can see the different stages of processing, and near the top they've got seven stages of processing happening but eleven skipped. That's because they're holding the data that's been processed up to that point in memory; they've told Spark, "Hey, I want you to reuse this bit, don't forget about it the second it's gone." And right at the end we've got the tasks, which is the count of the individual tasks that happened across all of the worker nodes across all of the stages, so that number tends to be pretty big. Cool.
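That "reuse this bit" instruction is the cache and persist family of calls. A small sketch of the pattern that avoids the three-fold re-processing above, with the same stand-in column names:

```python
# Build the plan once, then pin the intermediate result in memory.
result = df.filter(df.age == "juvenile").groupBy("library").count().cache()

result.count()                       # first action: runs the plan, fills the cache
result.show(5)                       # served from the cache, stages show as skipped
result.write.csv("juvenile-counts")  # likewise, no third full re-computation
```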
So, in terms of what else I wanted to cover. Oh yes: there was going to be a whole version of this talk where we just ranted about Java stack traces in my Python program for ages, but full credit to the people working on PySpark, the error messages have gotten substantially better over the last few years, so there is substantially less ranting from me on that front. Shout out to them.

Otherwise: because we have the transformations being lazy and then the eager evaluation at the actions, depending on the complexity of your error it's pretty likely that an error in your plan is going to be thrown at the point of evaluation, at that count or at that show, but the cause could be a fair bit higher up in the plan, and so it is worth looking there. Not too scary, but I have had a colleague who didn't initially have a heap of programming experience, coming in from SQL, and they were doing a really good job of figuring out where the error was being thrown, putting statements around it and trying to explore it, just not realizing that the cause was substantially higher up, because of the way the lazy and the eager interact with each other.
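A contrived sketch of that effect. The bug lives on the transformation line, but the traceback points at the action; the "copies" column is another hypothetical:

```python
from pyspark.sql import functions as F, types as T

# Bug: dividing by a field that can be zero. This line runs "fine",
# because it only adds a step to the plan.
inverse = F.udf(lambda n: 1 / n, T.DoubleType())
planned = df.withColumn("inverse", inverse(F.col("copies").cast("int")))

# The exception (a Java-flavoured Py4JJavaError wrapping the Python
# ZeroDivisionError) only surfaces here, at the action.
planned.show()
```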
potentially save you there in 577 00:19:38,520 --> 00:19:41,160 terms of saying no I actually just I 578 00:19:39,780 --> 00:19:43,460 want to save the processing at this 579 00:19:41,160 --> 00:19:46,260 point and use it from there 580 00:19:43,460 --> 00:19:48,419 that most confusingly comes up when you 581 00:19:46,260 --> 00:19:50,160 start joining a data frame to itself and 582 00:19:48,419 --> 00:19:54,120 you aren't getting any results from the 583 00:19:50,160 --> 00:19:57,480 join because it each side of that data 584 00:19:54,120 --> 00:19:58,860 frame is taking its own source data so 585 00:19:57,480 --> 00:20:00,600 that can be a bit of an odd one to track 586 00:19:58,860 --> 00:20:03,419 down again and if you sort of shows in 587 00:20:00,600 --> 00:20:05,100 they'll get different results as well uh 588 00:20:03,419 --> 00:20:06,539 cool the other way to handle that is to 589 00:20:05,100 --> 00:20:08,280 just put like an order by or be more 590 00:20:06,539 --> 00:20:11,160 selective in terms of how you if you are 591 00:20:08,280 --> 00:20:13,260 doing a subset of data 592 00:20:11,160 --> 00:20:15,059 um cool 593 00:20:13,260 --> 00:20:17,220 so 594 00:20:15,059 --> 00:20:18,419 what should I use 595 00:20:17,220 --> 00:20:21,720 um this is going to depend really 596 00:20:18,419 --> 00:20:24,360 heavily on you and I unfortunately can't 597 00:20:21,720 --> 00:20:25,740 recommend uh one way or the other there 598 00:20:24,360 --> 00:20:27,299 are a bunch of tools out there but 599 00:20:25,740 --> 00:20:29,520 things you might want to consider how 600 00:20:27,299 --> 00:20:31,320 much data do you have uh what does your 601 00:20:29,520 --> 00:20:34,260 current code base look like 602 00:20:31,320 --> 00:20:36,600 um so in my current job we work with a 603 00:20:34,260 --> 00:20:38,700 lot of geospatial data Pi spark doesn't 604 00:20:36,600 --> 00:20:40,200 have good geospatial capability there is 605 00:20:38,700 --> 00:20:42,480 Apache Sedona which just came out of 606 00:20:40,200 --> 00:20:44,340 incubation in I think March of this year 607 00:20:42,480 --> 00:20:46,559 if anyone here knows anything about it 608 00:20:44,340 --> 00:20:48,059 please come and find me I am very very 609 00:20:46,559 --> 00:20:50,220 curious to learn more about what is 610 00:20:48,059 --> 00:20:51,000 going on in that space 611 00:20:50,220 --> 00:20:52,740 um 612 00:20:51,000 --> 00:20:54,780 other questions how much time do you 613 00:20:52,740 --> 00:20:56,460 want to spend on infrastructure that can 614 00:20:54,780 --> 00:20:58,080 also be substituted for how much money 615 00:20:56,460 --> 00:21:00,299 how much headaches what's the capability 616 00:20:58,080 --> 00:21:01,919 of the people in your team do you want 617 00:21:00,299 --> 00:21:04,740 to ask someone else buy a different 618 00:21:01,919 --> 00:21:06,240 product uh I'm sure there are people at 619 00:21:04,740 --> 00:21:08,820 this conference who would probably very 620 00:21:06,240 --> 00:21:10,559 willingly sell you some stuff uh who is 621 00:21:08,820 --> 00:21:12,000 working on your code and any personal 622 00:21:10,559 --> 00:21:13,799 preferences I actually just really like 623 00:21:12,000 --> 00:21:16,559 using pi spark as just like noodling 624 00:21:13,799 --> 00:21:17,820 around on my laptop for small projects I 625 00:21:16,559 --> 00:21:19,980 find it a little bit more intuitive than 626 00:21:17,820 --> 00:21:23,000 pandas but that's going to be a total 627 00:21:19,980 --> 00:21:23,000 personal 
So: what should I use? This is going to depend really heavily on you, and I unfortunately can't recommend one way or the other; there are a bunch of tools out there. But here are things you might want to consider. How much data do you have? What does your current code base look like? In my current job we work with a lot of geospatial data, and PySpark doesn't have good geospatial capability. There is Apache Sedona, which just came out of incubation in, I think, March of this year; if anyone here knows anything about it, please come and find me, because I am very, very curious to learn more about what's going on in that space. Other questions: how much time do you want to spend on infrastructure? (You can also substitute "time" there for "money", or "headaches".) What's the capability of the people in your team? Do you want to ask someone else, and buy a product instead? I'm sure there are people at this conference who would very willingly sell you some stuff. Who is working on your code? And any personal preferences: I actually really like using PySpark just for noodling around on my laptop on small projects, and I find it a little bit more intuitive than pandas, but that's going to be a totally personal-preference thing.

So, in conclusion. What is PySpark? Hopefully I've answered that at least a little bit. Can it solve all of my data problems? Kind of, but you also get some really fun new ones. And are you sure I can't just use pandas instead? Well, the maintainers did actually bring out a pandas API a while back. I haven't personally used it, so I can't give you any advice on how well it works, but it is there. So if you don't like the look of the code I've been showing you, there is a whole other section of the library that you can definitely check out and have a look at.
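That pandas API ships inside recent versions of the library as pyspark.pandas. The gist, with the same stand-in file name as earlier:

```python
import pyspark.pandas as ps

# pandas-shaped code, Spark-shaped execution underneath.
psdf = ps.read_csv("library-checkouts-july.csv")
print(psdf[["title", "language", "age"]].head())
```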
Cool. The one other thing I wanted to cover: if you have sat through to this point and you're like, "Yep, that was pretty good, that was pretty informative, but Alex, I really want to know more about this, and I want to hear it from someone who actually knows what they're talking about," then I highly recommend checking out Holden Karau. She actually gave a presentation on PySpark at PyCon Australia back in 2017, so that's on YouTube, and she's given a bunch of other presentations about PySpark; I watched many of them when preparing for this talk, so I'm definitely very grateful for that. She's also written a bunch of books on Spark, so if you have an employer that enjoys buying textbooks, those are worth checking out. But yeah, otherwise, I think that's everything.

[Applause]

Sweet. I totally lost track of time on that one!

[Host] No, that was great, thank you so much for that talk. I think the problem of "I've run out of machine memory for analyzing my data, what do I do?" is a perennial one, so it's always good to have some good guidance on it, and this was a really good, practical demystifying of PySpark. So thank you so much. And I've just forgotten my... the gift we have here for you!

Thank you very much. If you have questions for Alex, please pop them in the Discord, or come and chat to Alex later on. And so, with that, can we have a big round of applause?

[Applause]