[Host] Welcome back, everyone, to All Things Data. In this block we have two talks. Before we jump into introducing the next talk, just a reminder that we have a Discord channel. There are some really good conversations there about the talks, and links to open source packages and other things people are talking about, so jump on in and say hello. We've also got people tuning in remotely who are joining in there.

So, to kick us off, we've got Alex Ware, and I'm just going to read a little intro here. Alex is a software engineer working for Geoscape Australia, based in Canberra. She previously spent several years working as a data engineer in the Australian Public Service. She's a co-organizer of the Canberra Python User Group and a co-organizer of the upcoming Django Girls Canberra workshop. She's passionate about big data, clean code, and supporting those with a marginalized experience of gender in tech. So with that, please give a big round of applause for Alex Ware, who's going to be talking about an introduction to PySpark.

[Alex Ware] Hello! Hopefully this microphone is working okay. Today I'm going to be giving an introduction to PySpark. In my talk I'm going to try to answer some questions, such as: what actually is PySpark? Can it really solve all of my data problems? And, possibly most important, are you sure I can't just use pandas instead?

We sort of covered who I am a little bit already: I'm a software engineer at Geoscape Australia, working predominantly with geospatial data. A very quick call-out for my employer: our claim to fame is G-NAF, the Geocoded National Address File, but we also do a whole bunch of other geospatial products. I know there's been a bit of discussion about geospatial stuff today, so if you're interested in things like data around property, cadastre, buildings, roads,
solar, and trees, maybe check us out. I was previously in the public service. I won't talk about that too much, other than to say it's where I first started using PySpark. And the workshop has actually happened now: it happened a week ago, it was a lot of fun, and thank you to Django Girls for supporting us in running it.

Okay, the really important disclaimer after that slide is that I don't speak for any current or former employers; I'm here entirely presenting my own opinions. And, very importantly, I don't have any links, major or minor or any at all, to Apache. I do not speak for them in any way. I am just a hobbyist who played around with the library and got talked into giving a talk at PyCon, as happens to the best of us.

Cool. As part of this talk I'm going to be giving a couple of code examples, and to do that I needed some data. Very fortunately, Brisbane City Council releases a bunch of information about the library checkouts that happen over a three-day period each month. They've been doing this since about the start of 2020, so there's a fair bit of data there, and I'm going to use some of it for my presentation.

So, let's start with pandas. Hopefully this is all pretty familiar so far. (Actually, it's very bright up here, so I can't see faces too well.) But yes, a basic example: we read in some data from July of this year, we select some columns, or fields, and we look at those rows. So far, so good.

We can do the same thing with PySpark. Now, you might be looking at this code example and saying, "Hey Alex, what's that thing going on with the SparkSession?" and I'm going to go, "That's a great question, and I'm not going to answer it yet." But if you look at the two lines underneath it, it's very similar to pandas: we read in the same data, select the columns, and we show it, with a very similar output.
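The slide code isn't captured in the transcript, but the pair of examples looked roughly like this. The file name and column names are stand-ins for the Brisbane data set, not the real ones:

```python
import pandas as pd

# Read the July checkout data and peek at a few columns.
df = pd.read_csv("library-checkouts-july.csv")
print(df[["title", "language", "age"]].head())
```

And the PySpark version, which reads almost the same apart from that mysterious session object:

```python
from pyspark.sql import SparkSession

# The SparkSession: the great question we're not answering yet.
spark = SparkSession.builder.getOrCreate()

# Same data, same columns, very similar output.
df = spark.read.csv("library-checkouts-july.csv", header=True)
df.select("title", "language", "age").show()
```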
Maybe we want to do something a little bit more interesting with our data and pull out some statistics. I actually don't write pandas very often, so if I've committed any cardinal sins, please forgive me, but hopefully it looks mostly how it's meant to look: getting out a row count, looking at how many different languages books have been checked out in, and what the breakdown is by the different age categories. Juvenile is the most popular for this slice across the Brisbane libraries, but interestingly, the adult category is pretty much up there as well.

And, possibly more interesting, let's look at the PySpark version. Pretty much all of the functions do what they say on the tin: we've got our count at the top, which tells us the number of rows in our data frame; we can select and filter down our data frame to get those distinct languages; and we can do a groupBy and a count on those age categories. So at this point it should at least be becoming a little bit obvious that, if pandas is Python pretending to be R to some extent, this part of the PySpark library is Python pretending to be SQL, which I quite like. I find it quite intuitive: you can make some assumptions about what you should be able to do based on your knowledge of SQL, and translate that across into Python. Cool.
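Again as a rough sketch of what was on the slide, with the same stand-in column names as before:

```python
from pyspark.sql import functions as F

# Row count for the data frame.
print(df.count())

# How many distinct languages have books been checked out in?
print(df.select("language").distinct().count())

# Breakdown of checkouts by age category.
df.groupBy("age").count().orderBy(F.desc("count")).show()
```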
Let's go with one more scenario. In this case we're going to try to group any checkouts that occur within five seconds of each other in the same library. I'm going to make the assumption that if it's happening close together in time, the same person did it. Now, obviously you might have different checkout machines in the same library, so it's not perfect, but it's going to let us play around with the data a little bit.

And we've got another code example. Don't worry too much if you can't really read this; it isn't one I'm going to go through in a lot of detail. This is more about proving that I wrote the code and that it is possible. It's also a little bit of my love letter to window functions, because I think they're fantastic, I love using them, and I liked that I got to use them in this example. I'll move on pretty quickly, but feel free, if you're curious about this or any other element, to come find me afterwards. I'm also going to be putting a lot of this up on GitHub, so you can find it later.

But if we run this, we can find the group with the largest number of checkouts, and we can go have a look at it. It's this one, which might be a little bit hard to read, but I love it, because I get to imagine some kid had just the best 68 seconds of their life as they borrowed out everything ever: there's Wings of Fire, Miles Morales, half of Anh Do's back catalog. And really, why would we work with data except to find things like this? My personal favourite: I don't know if you know how fastbacks work, but basically it's a category in the library where you get exactly a week to read that book and you are not allowed to renew it. The degree of optimism I get to imagine this kid has about their week... I love that so much.
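For the curious, here is one way to do that five-second grouping with window functions. This is a sketch of the approach rather than the exact slide code, and the timestamp and library column names are stand-ins:

```python
from pyspark.sql import Window, functions as F

# Order checkouts within each library by time.
w = Window.partitionBy("library").orderBy("checkout_time")

grouped = (
    df
    # Seconds since the previous checkout at the same library.
    .withColumn(
        "gap",
        F.col("checkout_time").cast("long") - F.lag("checkout_time").over(w).cast("long"),
    )
    # A gap over five seconds (or no previous row at all) starts a new group.
    .withColumn("starts_group", (F.col("gap").isNull() | (F.col("gap") > 5)).cast("int"))
    # A running sum of those flags gives every group its own id per library.
    .withColumn("group_id", F.sum("starts_group").over(w))
)

# Which single burst of borrowing was the biggest?
grouped.groupBy("library", "group_id").count().orderBy(F.desc("count")).show(1)
```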
Cool, but back to the point of the talk: what should I use? Who doesn't love a benchmark? Luckily I found one that someone made before me, so you can go check it out. This evaluation actually happened back in 2021, and all of these libraries have changed to some degree since then, so this is really just meant to show that they're all relatively similar on small data sets.

But if we go to a larger data set, and this is an important point, one of the reasons you might be interested in learning more about PySpark might have less to do with an innate interest in learning a new topic, and more to do with your current tools having stopped working for your data set because it got too big. So that's kind of a big selling point.

And maybe at this point you're like, "Okay, fair enough, I'm curious, I want to learn a bit more." Let's go have a look under the hood. This is the count function, and we see that when we call count on a data frame, it immediately calls something else and asks it for the count, and that something has a little "j" prefixing it, which makes us think that maybe, under the hood, there's something happening that's not totally Pythonic.

Then we go to collect. A collect is where we say, "Hey, I want this data frame as a list of records," and a list is very Python; this has to be far more Python. You know, this will be great. And... yes. Who doesn't love seeing sockets and pickling in their Python library?

Yeah. So, I'm told it works. A lot of people smarter than me have worked on this, I believe it works, and I try not to think too hard about the exact moment when data is being thrown back and forth in the worst game of catch ever. As far as I'm aware it works and we get the data out in Python. It does appear, though, that a Java library is entirely hiding under my Python library. Slightly concerning.
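To give a flavour of what that looks like, here is a simplified paraphrase of the sort of thing you find in the PySpark source; it is not the verbatim code:

```python
# DataFrame.count: the Python method mostly just delegates to the JVM,
# where self._jdf is a py4j handle to the underlying Java/Scala DataFrame.
def count(self):
    return int(self._jdf.count())

# DataFrame.collect is where the sockets and pickling come in: the JVM
# serialises the rows and streams them to Python over a local socket,
# where they are deserialised into Row objects. (Paraphrased; the real
# implementation has a fair bit more machinery around it.)
```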
And at this point I have to come clean and say that I have actually come to a Python conference, ostensibly to talk about a Python library, that is really just the smallest amount of Python wrapped around a Scala library. So, I'm sorry. You can, for the most part, pretend the Scala doesn't exist, except when it throws errors that are massive Java stack traces, which is slightly ominous coming from your Python code. I like this image, the star with the tiny little bit of python wrapped around it, because I think it's very illustrative of what's going on in this library.

But this does raise the obvious question: why? Why do we need a massive thing of Scala underneath our Python? What could it possibly be offering me that would justify this? And that's Spark. It's basically: what if we had functional programming on data, and we distributed it, and wouldn't that be great and fine and it would never cause any problems for anyone and it's fantastic. Cool. Sorry: distribution.

Now, it's important to touch on at this point that all of the code examples, everything I've been talking to up to now, is the driver program. We can see that lovely SparkContext I kind of dodged a little bit earlier. If you just want to get started and have a play around, and you're not too concerned about dealing with out-of-memory problems or really running a massive data set, you can get to this point and just have a play; you don't need to worry about the next part yet. You can do a pip install, maybe fiddle with a couple of Java settings, and you can open up a notebook and be running pretty quickly on your smaller library-checkout data set.
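Getting to that have-a-play stage really is small. Something like this, assuming you have a Java runtime installed for Spark to sit on:

```python
# pip install pyspark   (and make sure a JDK is on your PATH first)
from pyspark.sql import SparkSession

# A "cluster" that lives entirely in this one process: the driver and the
# workers are all local, using as many threads as you have cores.
spark = (
    SparkSession.builder
    .master("local[*]")
    .appName("library-checkouts")
    .getOrCreate()
)
```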
But let's say you are interested in this idea: I want to throw some worker nodes at my problem, I want the benefit of being able to work with that larger data set and not spend three days waiting for my code to run. Great: we're going to stick a cluster manager in the middle. So we have our lovely diagram, and in terms of a dashboard it's going to look something like this. This one's pretty empty, but it's just there for illustration.

And basically the promise PySpark is making to us is: if you write this driver program, and you write all of your code using the library you've been given, it will handle all of the thinking about which data goes on which node, which node is doing which task, and how the processing is going to happen. Whether you believe the library is entirely up to you, but I will admit it does make life a lot easier sometimes.
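Moving to a cluster mostly changes how you launch the code rather than the code itself. A sketch, where the cluster manager's address is a made-up placeholder:

```python
# Launched with something like:
#   spark-submit --master spark://cluster-manager:7077 checkout_job.py
# (host and port here are placeholders for your own cluster manager)
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("checkout-job").getOrCreate()

# From here the code looks the same as local mode; Spark decides which
# workers hold which partitions of the data and run which tasks.
```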
But if you don't totally believe the library when it says, "Hey, you never need to worry about it," we might want to know a little bit more about what's happening under the hood. To understand that, we have to go back to MapReduce, which is the thing that came before Spark. I'm going to move through this very, very quickly, but it's a nice bit of background.

MapReduce starts with some input data, splits the input data across the different nodes, applies a map to it, shuffles the data across the different nodes, applies a reduce to it, and outputs. It's lovely: it does exactly what it says on the tin, which makes it very simple and easy to remember, and it's super useful in the sense that it solves a problem around how we split out the data and the processing. The downsides are that if you want to do more complex things, you have to start stringing a lot of these together; those shuffling steps are going to take a while, and the input and output steps are going to start to come for you after a while. So eventually you are going to really see some issues with the runtime of your programs.

But we were all promised that Spark was going to do it better. So what does Spark do differently? Spark takes this idea of things like the map and the reduce, those transformations, and says: what if they were lazy? Then, as we're building them all up, we can build this lovely, big, beautiful evaluation plan of all the different transformations we're going to want to do, a lovely big directed acyclic graph. And then we can start to lay it out and optimize across it. We can break it into stages, where a stage is any processing we can do before we have to shuffle the data. So maybe we've got the parallelize, the filter, and the map; they can all happen with the data laid out across the nodes as it already is. This moving things around and reorganizing under the hood I think of a little bit like the relationship of SQL to relational algebra. I don't know if that's a totally accurate parallel, but it's useful for me in my head. Then we've got a stage two where, because we're going to reduce by key, we need to shuffle the data relative to the key we're reducing on. And maybe we have to shuffle again for a stage three, because we're going to do a join and we're joining on some different fields, potentially. So we have this lovely optimized graph that PySpark is producing for us.
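As a toy illustration of that shape of plan, this mirrors the stages just described using the lower-level RDD API; the data is made up:

```python
# Stage 1: parallelize, filter, and map can all run where the data sits.
lines = spark.sparkContext.parallelize(
    ["chermside,3", "garden city,1", "chermside,5", "toowong,2"]
)
pairs = (
    lines
    .filter(lambda line: not line.endswith(",1"))  # drop single checkouts
    .map(lambda line: (line.split(",")[0], int(line.split(",")[1])))
)

# Stage 2: reduceByKey forces a shuffle so that each key lands on one node.
totals = pairs.reduceByKey(lambda a, b: a + b)

# Stage 3: the join can force another shuffle, this time on the join key.
regions = spark.sparkContext.parallelize([("chermside", "north"), ("toowong", "west")])
joined = totals.join(regions)

# Everything above is still just the plan; only the action below runs it.
print(joined.collect())
```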
There is one thing that might come up, and how you use Spark determines whether you hit it sooner or later: you might get an out-of-memory error at some point. That's going to be really annoying, because you're going to be like, "Hey, I just started using this library because it promised it would solve all of my memory issues and I would never have to worry about that ever again." And yes. The problem is that this particular out-of-memory issue is actually that the plan got too big, so you either need to simplify the plan or allocate more memory in config. It's a very easy fix, and there's a bunch of Stack Overflow questions and answers about it; it's just a really weird thing to look up, so I like to flag it early, before you spend half a day trying to figure out what just went wrong and why your program is complaining at you.
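The config side of that fix is a one-liner at session-build time; the memory value has to be in place before the JVM starts, and the 4g here is just an example figure:

```python
from pyspark.sql import SparkSession

# More driver memory gives a big evaluation plan more headroom.
spark = (
    SparkSession.builder
    .master("local[*]")
    .config("spark.driver.memory", "4g")
    .getOrCreate()
)
```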
Cool. So we've got our lovely transformations, we've got all those stages of processing, but they're all lazy, so at some point we're going to have to throw some actions into the mix, which is when we actually want the result. These are eager: they force the evaluation of the plan.

And you might be sitting there thinking, "Great, I've got you. I'm going to have my lovely plan, I'm going to do my stages one, two, three of processing, and then I'm going to do my first action, which is going to be a count, because I want to know how many rows of data my output is going to have. I'm going to have my show, because I want a couple of examples of what my output looks like. And then I'm going to save out: saveAsTable, save as CSV, whatever I'm going to do." And that's the plan.

Well, yes, but your processing might end up looking more like this. Because, for all of the optimization that happens on the transformation side, Spark is going to be really eager as soon as it hits an action. So it's going to do all the processing when you ask it for a count, and it'll give you that count result. Then it will potentially do all of that processing again and show your results. Then it'll do all of that processing a third time, and then it'll save out the data. So if your programs are running really slowly, if it feels like they're taking maybe three to four times longer than they probably should, this might be what's happening.

Again, in terms of introductory stuff, I'm not going to go hugely in depth into memory management, but it's worth flagging that it's something you will need to think about sooner or later, and the "sooner" is going to be when things start taking way longer than you think they should. You definitely need to be thinking about this to avoid that sort of repeated processing across the stages.

As an example, I found this Stack Overflow question, and I love it, because someone has basically asked, "Hey, I put some checkpoints in, and I have all these skipped stages; is that helping my performance?" And the answer is yes. We can break this down and see the concepts we've been talking about up to this point. Each row is an action that has happened. It might be a little bit hard to see, but in the second column from the right we can see the different stages of processing, and near the top they've got seven stages of processing happening but eleven skipped. That's because they're holding the data that's been processed up to that point in memory; they've told Spark, "Hey, I want you to reuse this bit, don't forget about it the second it's gone." And right at the end we've got the tasks, which is the count of the individual tasks that happened across all of the worker nodes across all of the stages, so that number tends to be pretty big. Cool.
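That "reuse this bit" instruction is the cache and persist family of calls. A small sketch of the pattern that avoids the three-fold re-processing above, with the same stand-in column names:

```python
# Build the plan once, then pin the intermediate result in memory.
result = df.filter(df.age == "juvenile").groupBy("library").count().cache()

result.count()                       # first action: runs the plan, fills the cache
result.show(5)                       # served from the cache, stages show as skipped
result.write.csv("juvenile-counts")  # likewise, no third full re-computation
```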
So, in terms of what else I wanted to cover. Oh yes: there was going to be a whole version of this talk where we just ranted about Java stack traces in my Python program for ages, but full credit to the people working on PySpark, the error messages have gotten substantially better over the last few years, so there is substantially less ranting from me on that front. Shout out to them.

Otherwise: because we have the transformations being lazy and then the eager evaluation at the actions, depending on the complexity of your error it's pretty likely that an error in your plan is going to be thrown at the point of evaluation, at that count or at that show, but the cause could be a fair bit higher up in the plan, and so it is worth looking there. Not too scary, but I have had a colleague who didn't initially have a heap of programming experience, coming in from SQL, and they were doing a really good job of figuring out where the error was being thrown, putting statements around it and trying to explore it, just not realizing that the cause was substantially higher up, because of the way the lazy and the eager interact with each other.
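A contrived sketch of that effect. The bug lives on the transformation line, but the traceback points at the action; the "copies" column is another hypothetical:

```python
from pyspark.sql import functions as F, types as T

# Bug: dividing by a field that can be zero. This line runs "fine",
# because it only adds a step to the plan.
inverse = F.udf(lambda n: 1 / n, T.DoubleType())
planned = df.withColumn("inverse", inverse(F.col("copies").cast("int")))

# The exception (a Java-flavoured Py4JJavaError wrapping the Python
# ZeroDivisionError) only surfaces here, at the action.
planned.show()
```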
potentially save you there in 577 00:19:38,520 --> 00:19:41,160 terms of saying no I actually just I 578 00:19:39,780 --> 00:19:43,460 want to save the processing at this 579 00:19:41,160 --> 00:19:46,260 point and use it from there 580 00:19:43,460 --> 00:19:48,419 that most confusingly comes up when you 581 00:19:46,260 --> 00:19:50,160 start joining a data frame to itself and 582 00:19:48,419 --> 00:19:54,120 you aren't getting any results from the 583 00:19:50,160 --> 00:19:57,480 join because it each side of that data 584 00:19:54,120 --> 00:19:58,860 frame is taking its own source data so 585 00:19:57,480 --> 00:20:00,600 that can be a bit of an odd one to track 586 00:19:58,860 --> 00:20:03,419 down again and if you sort of shows in 587 00:20:00,600 --> 00:20:05,100 they'll get different results as well uh 588 00:20:03,419 --> 00:20:06,539 cool the other way to handle that is to 589 00:20:05,100 --> 00:20:08,280 just put like an order by or be more 590 00:20:06,539 --> 00:20:11,160 selective in terms of how you if you are 591 00:20:08,280 --> 00:20:13,260 doing a subset of data 592 00:20:11,160 --> 00:20:15,059 um cool 593 00:20:13,260 --> 00:20:17,220 so 594 00:20:15,059 --> 00:20:18,419 what should I use 595 00:20:17,220 --> 00:20:21,720 um this is going to depend really 596 00:20:18,419 --> 00:20:24,360 heavily on you and I unfortunately can't 597 00:20:21,720 --> 00:20:25,740 recommend uh one way or the other there 598 00:20:24,360 --> 00:20:27,299 are a bunch of tools out there but 599 00:20:25,740 --> 00:20:29,520 things you might want to consider how 600 00:20:27,299 --> 00:20:31,320 much data do you have uh what does your 601 00:20:29,520 --> 00:20:34,260 current code base look like 602 00:20:31,320 --> 00:20:36,600 um so in my current job we work with a 603 00:20:34,260 --> 00:20:38,700 lot of geospatial data Pi spark doesn't 604 00:20:36,600 --> 00:20:40,200 have good geospatial capability there is 605 00:20:38,700 --> 00:20:42,480 Apache Sedona which just came out of 606 00:20:40,200 --> 00:20:44,340 incubation in I think March of this year 607 00:20:42,480 --> 00:20:46,559 if anyone here knows anything about it 608 00:20:44,340 --> 00:20:48,059 please come and find me I am very very 609 00:20:46,559 --> 00:20:50,220 curious to learn more about what is 610 00:20:48,059 --> 00:20:51,000 going on in that space 611 00:20:50,220 --> 00:20:52,740 um 612 00:20:51,000 --> 00:20:54,780 other questions how much time do you 613 00:20:52,740 --> 00:20:56,460 want to spend on infrastructure that can 614 00:20:54,780 --> 00:20:58,080 also be substituted for how much money 615 00:20:56,460 --> 00:21:00,299 how much headaches what's the capability 616 00:20:58,080 --> 00:21:01,919 of the people in your team do you want 617 00:21:00,299 --> 00:21:04,740 to ask someone else buy a different 618 00:21:01,919 --> 00:21:06,240 product uh I'm sure there are people at 619 00:21:04,740 --> 00:21:08,820 this conference who would probably very 620 00:21:06,240 --> 00:21:10,559 willingly sell you some stuff uh who is 621 00:21:08,820 --> 00:21:12,000 working on your code and any personal 622 00:21:10,559 --> 00:21:13,799 preferences I actually just really like 623 00:21:12,000 --> 00:21:16,559 using pi spark as just like noodling 624 00:21:13,799 --> 00:21:17,820 around on my laptop for small projects I 625 00:21:16,559 --> 00:21:19,980 find it a little bit more intuitive than 626 00:21:17,820 --> 00:21:23,000 pandas but that's going to be a total 627 00:21:19,980 --> 00:21:23,000 personal 
So: what should I use? This is going to depend really heavily on you, and I unfortunately can't recommend one way or the other; there are a bunch of tools out there. But here are things you might want to consider. How much data do you have? What does your current code base look like? In my current job we work with a lot of geospatial data, and PySpark doesn't have good geospatial capability. There is Apache Sedona, which just came out of incubation in, I think, March of this year; if anyone here knows anything about it, please come and find me, because I am very, very curious to learn more about what's going on in that space. Other questions: how much time do you want to spend on infrastructure? (You can also substitute "time" there for "money", or "headaches".) What's the capability of the people in your team? Do you want to ask someone else, and buy a product instead? I'm sure there are people at this conference who would very willingly sell you some stuff. Who is working on your code? And any personal preferences: I actually really like using PySpark just for noodling around on my laptop on small projects, and I find it a little bit more intuitive than pandas, but that's going to be a totally personal-preference thing.

So, in conclusion. What is PySpark? Hopefully I've answered that at least a little bit. Can it solve all of my data problems? Kind of, but you also get some really fun new ones. And are you sure I can't just use pandas instead? Well, the maintainers did actually bring out a pandas API a while back. I haven't personally used it, so I can't give you any advice on how well it works, but it is there. So if you don't like the look of the code I've been showing you, there is a whole other section of the library that you can definitely check out and have a look at.
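That pandas API ships inside recent versions of the library as pyspark.pandas. The gist, with the same stand-in file name as earlier:

```python
import pyspark.pandas as ps

# pandas-shaped code, Spark-shaped execution underneath.
psdf = ps.read_csv("library-checkouts-july.csv")
print(psdf[["title", "language", "age"]].head())
```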
Cool. The one other thing I wanted to cover: if you have sat through to this point and you're like, "Yep, that was pretty good, that was pretty informative, but Alex, I really want to know more about this, and I want to hear it from someone who actually knows what they're talking about," then I highly recommend checking out Holden Karau. She actually gave a presentation on PySpark at PyCon Australia back in 2017, so that's on YouTube, and she's given a bunch of other presentations about PySpark; I watched many of them when preparing for this talk, so I'm definitely very grateful for that. She's also written a bunch of books on Spark, so if you have an employer that enjoys buying textbooks, those are worth checking out. But yeah, otherwise, I think that's everything.

[Applause]

Sweet. I totally lost track of time on that one!

[Host] No, that was great, thank you so much for that talk. I think the problem of "I've run out of machine memory for analyzing my data, what do I do?" is a perennial one, so it's always good to have some good guidance on it, and this was a really good, practical demystifying of PySpark. So thank you so much. And I've just forgotten my... the gift we have here for you!

Thank you very much. If you have questions for Alex, please pop them in the Discord, or come and chat to Alex later on. And so, with that, can we have a big round of applause?

[Applause]