1 00:00:00,539 --> 00:00:03,539 foreign 2 00:00:13,940 --> 00:00:20,160 so let's get back on track so we're 3 00:00:17,699 --> 00:00:21,600 starting at 11 have 30 minutes and then 4 00:00:20,160 --> 00:00:25,439 at 10 minutes 5 00:00:21,600 --> 00:00:28,199 uh time to switch speakers uh now we 6 00:00:25,439 --> 00:00:30,359 have Sam Hames uh postdoctoral research 7 00:00:28,199 --> 00:00:33,120 fellow of school of language and culture 8 00:00:30,359 --> 00:00:36,420 of University of Queensland he's gonna 9 00:00:33,120 --> 00:00:39,540 talk about Stack Overflow analysis it's 10 00:00:36,420 --> 00:00:42,540 been very much in the uh in the tech 11 00:00:39,540 --> 00:00:45,840 world about Stack Overflow the traffic 12 00:00:42,540 --> 00:00:49,260 apparently is dropping because of AI so 13 00:00:45,840 --> 00:00:52,260 let's see what his talk is about and uh 14 00:00:49,260 --> 00:00:54,800 I want to welcome him so with you Sam 15 00:00:52,260 --> 00:00:54,800 blue 16 00:00:55,440 --> 00:00:59,340 so I'm going to talk about a recent 17 00:00:57,300 --> 00:01:01,620 history of Python uh because Python was 18 00:00:59,340 --> 00:01:04,920 launched in 19 first Python release in 19 00:01:01,620 --> 00:01:07,380 1991 I think and Stack Overflow started 20 00:01:04,920 --> 00:01:08,640 in 2008 so obviously there's a big bunch 21 00:01:07,380 --> 00:01:09,840 of time I'm not going to touch on in 22 00:01:08,640 --> 00:01:11,280 this talk so that's why I'm calling this 23 00:01:09,840 --> 00:01:12,720 a recent history 24 00:01:11,280 --> 00:01:15,180 so the 25 00:01:12,720 --> 00:01:16,799 kind of opening question I have for you 26 00:01:15,180 --> 00:01:19,520 is put up your hand if you've used Stack 27 00:01:16,799 --> 00:01:19,520 Overflow before 28 00:01:19,920 --> 00:01:23,340 okay I think that's pretty much 29 00:01:21,240 --> 00:01:24,900 everybody which is about what is 30 00:01:23,340 --> 00:01:26,220 expected and that's why I thought I 31 00:01:24,900 --> 
00:01:27,960 really wanted to look at this particular 32 00:01:26,220 --> 00:01:31,080 topic because I think 33 00:01:27,960 --> 00:01:32,400 Stack Overflow is a really interesting 34 00:01:31,080 --> 00:01:34,740 um 35 00:01:32,400 --> 00:01:36,360 tool community platform that everybody 36 00:01:34,740 --> 00:01:38,400 seems to use 37 00:01:36,360 --> 00:01:40,560 in fact their 38 00:01:38,400 --> 00:01:43,740 CEO talks about 100 million monthly 39 00:01:40,560 --> 00:01:46,500 visitors top 50 of all websites 40 00:01:43,740 --> 00:01:48,720 um 50 billion times accessed in the 41 00:01:46,500 --> 00:01:50,640 last years so that's that's the one hand 42 00:01:48,720 --> 00:01:53,040 so if you've ever searched for a 43 00:01:50,640 --> 00:01:54,420 technical problem on Google or any other 44 00:01:53,040 --> 00:01:56,579 search engine you've probably got a link 45 00:01:54,420 --> 00:01:58,079 to Stack Overflow at least once and I 46 00:01:56,579 --> 00:01:59,700 imagine that's how many of you ended up 47 00:01:58,079 --> 00:02:02,100 using Stack Overflow 48 00:01:59,700 --> 00:02:04,500 but the flip side of that 49 00:02:02,100 --> 00:02:05,820 is that Stack Overflow then occupies 50 00:02:04,500 --> 00:02:07,799 this really weird niche where it's a 51 00:02:05,820 --> 00:02:09,539 really popular site to visit but the 52 00:02:07,799 --> 00:02:11,520 actual demographics of the users are 53 00:02:09,539 --> 00:02:13,620 highly skewed 54 00:02:11,520 --> 00:02:15,900 uh male 55 00:02:13,620 --> 00:02:18,020 Western particularly white European and 56 00:02:15,900 --> 00:02:20,840 younger so those are from their 2022 57 00:02:18,020 --> 00:02:23,819 demographics survey 58 00:02:20,840 --> 00:02:27,540 by extension presumably English speaking 59 00:02:23,819 --> 00:02:29,819 but it's not a question they asked and 60 00:02:27,540 --> 00:02:31,379 their demographics are so skewed and 61 00:02:29,819 --> 00:02:34,080 they got so much bad press from their 62 
00:02:31,379 --> 00:02:36,180 2022 survey showing that 93% of their 63 00:02:34,080 --> 00:02:38,040 users were male that 64 00:02:36,180 --> 00:02:40,200 they got rid of almost all of their 65 00:02:38,040 --> 00:02:43,140 demographic questions for 2023 so 66 00:02:40,200 --> 00:02:44,580 there's something going on there 67 00:02:43,140 --> 00:02:47,220 I'm actually here today to talk about 68 00:02:44,580 --> 00:02:48,780 Python and looking at this as Stack 69 00:02:47,220 --> 00:02:50,700 Overflow and I'm going to draw some 70 00:02:48,780 --> 00:02:52,319 limited inferences to the rest of the 71 00:02:50,700 --> 00:02:54,180 world I'm not going to assume that Stack 72 00:02:52,319 --> 00:02:55,739 Overflow is all development 73 00:02:54,180 --> 00:02:58,800 although I'll probably slip up at least 74 00:02:55,739 --> 00:02:59,700 once and make that connection 75 00:02:58,800 --> 00:03:01,260 um 76 00:02:59,700 --> 00:03:03,000 the other interesting part about it is 77 00:03:01,260 --> 00:03:04,379 that Stack Overflow data is open all of 78 00:03:03,000 --> 00:03:07,440 the user-submitted content on Stack 79 00:03:04,379 --> 00:03:10,080 Overflow is licensed under one version 80 00:03:07,440 --> 00:03:13,739 of Creative Commons BY attribution 81 00:03:10,080 --> 00:03:16,260 ShareAlike so that's important and all of 82 00:03:13,739 --> 00:03:17,400 the Stack Exchange sites' data they 83 00:03:16,260 --> 00:03:18,720 actually have a real commitment to 84 00:03:17,400 --> 00:03:19,980 transparency in that they just release 85 00:03:18,720 --> 00:03:22,560 all of the data you can go to 86 00:03:19,980 --> 00:03:24,180 archive.org you can download it 87 00:03:22,560 --> 00:03:25,860 so the the question for me and kind of 88 00:03:24,180 --> 00:03:28,860 the motivation for this is how do we 89 00:03:25,860 --> 00:03:30,180 actually make use of that data to tell 90 00:03:28,860 --> 00:03:31,920 us something interesting about the 91 00:03:30,180 --> 00:03:34,379 
recent history of Python can we do 92 00:03:31,920 --> 00:03:37,220 something with this to show us something 93 00:03:34,379 --> 00:03:37,220 new 94 00:03:38,099 --> 00:03:43,980 the first problem if you actually go to 95 00:03:40,200 --> 00:03:47,159 look at the data 96 00:03:43,980 --> 00:03:48,420 the single posts XML file that holds all 97 00:03:47,159 --> 00:03:52,700 the questions and answers and a few 98 00:03:48,420 --> 00:03:52,700 other things is 95 gigabytes of XML 99 00:03:52,860 --> 00:03:56,400 so that's fun and that's not even 100 00:03:55,019 --> 00:03:58,260 including the comments that's not even 101 00:03:56,400 --> 00:04:00,060 there's also the full history of posts 102 00:03:58,260 --> 00:04:02,120 which is even bigger 103 00:04:00,060 --> 00:04:02,120 um 104 00:04:02,400 --> 00:04:05,760 and the other the other flip side of 105 00:04:03,840 --> 00:04:07,680 that is when I say post one of the 106 00:04:05,760 --> 00:04:11,220 interesting things about um 107 00:04:07,680 --> 00:04:12,900 Stack Exchange data is that their kind 108 00:04:11,220 --> 00:04:14,459 of unifying concept is posts but that 109 00:04:12,900 --> 00:04:16,620 doesn't necessarily map to what you see 110 00:04:14,459 --> 00:04:18,479 on the screen so this is the earliest 111 00:04:16,620 --> 00:04:22,800 question tagged python that I found in 112 00:04:18,479 --> 00:04:25,500 the data set it's about XML processing 113 00:04:22,800 --> 00:04:26,880 every well the for the purposes of this 114 00:04:25,500 --> 00:04:28,259 talk I'm interested in two kinds of 115 00:04:26,880 --> 00:04:31,320 posts there's a couple of others which 116 00:04:28,259 --> 00:04:34,440 I'm not interested in um 117 00:04:31,320 --> 00:04:36,720 this is a typical question on Stack 118 00:04:34,440 --> 00:04:38,220 Overflow we've got well actually this is 119 00:04:36,720 --> 00:04:39,600 atypical because it doesn't actually ask 120 00:04:38,220 --> 00:04:40,320 a question 121 00:04:39,600 --> 00:04:42,780 
um 122 00:04:40,320 --> 00:04:44,940 so we've got a this is the 123 00:04:42,780 --> 00:04:46,560 first post in this particular grouping 124 00:04:44,940 --> 00:04:48,540 it's the question that initiates 125 00:04:46,560 --> 00:04:50,759 everything it's got a title it's got a 126 00:04:48,540 --> 00:04:53,400 body it's got some tags it's by a 127 00:04:50,759 --> 00:04:56,160 particular person if we scroll down we 128 00:04:53,400 --> 00:04:57,960 see more answers each of these answers 129 00:04:56,160 --> 00:04:59,520 is another post so when you see a page 130 00:04:57,960 --> 00:05:01,919 like this which is a question and a 131 00:04:59,520 --> 00:05:04,259 bunch of answers this is a grouping of 132 00:05:01,919 --> 00:05:05,400 posts that are glued together because we 133 00:05:04,259 --> 00:05:07,639 know they're an answer to a particular 134 00:05:05,400 --> 00:05:07,639 question 135 00:05:07,800 --> 00:05:11,160 so the kind of starting point for me for 136 00:05:09,660 --> 00:05:12,240 dealing with all of this data is that we 137 00:05:11,160 --> 00:05:15,600 want to break this down into something 138 00:05:12,240 --> 00:05:17,880 more sensible and the approach I've 139 00:05:15,600 --> 00:05:19,020 taken is to look at posts as units and 140 00:05:17,880 --> 00:05:20,759 I'll do some breakdowns later 141 00:05:19,020 --> 00:05:23,039 differentiating between questions and 142 00:05:20,759 --> 00:05:25,020 answers the other thing I've done is 143 00:05:23,039 --> 00:05:27,600 broken out 144 00:05:25,020 --> 00:05:29,639 the body the question the body of the 145 00:05:27,600 --> 00:05:30,960 questions and answers the title of the 146 00:05:29,639 --> 00:05:33,000 question and I've pulled out all of the 147 00:05:30,960 --> 00:05:35,759 words from all of them I've also taken 148 00:05:33,000 --> 00:05:37,680 the extra step because it turns out 149 00:05:35,759 --> 00:05:39,539 code blocks are really hard to deal with 150 00:05:37,680 --> 00:05:40,979 in a text 
analytics fashion because 151 00:05:39,539 --> 00:05:43,620 there's so much stuff there are so many 152 00:05:40,979 --> 00:05:45,600 people who there are lots and lots of um 153 00:05:43,620 --> 00:05:47,400 tracebacks and error messages and 154 00:05:45,600 --> 00:05:49,320 displays of random shell output with 155 00:05:47,400 --> 00:05:50,520 lots of so you end up with it's really 156 00:05:49,320 --> 00:05:52,380 hard to deal with the code blocks 157 00:05:50,520 --> 00:05:53,520 directly so I've actually pulled them 158 00:05:52,380 --> 00:05:55,199 out and I haven't even looked at them 159 00:05:53,520 --> 00:05:58,820 for the purposes of this talk 160 00:05:55,199 --> 00:05:58,820 um maybe I'll look at them later 161 00:06:01,800 --> 00:06:06,780 so that's kind of how the what the 162 00:06:03,360 --> 00:06:08,460 underlying data structure looks like 163 00:06:06,780 --> 00:06:10,199 um if we start 164 00:06:08,460 --> 00:06:13,139 breaking it down and looking at trends 165 00:06:10,199 --> 00:06:15,060 so I just before this talk or sorry it 166 00:06:13,139 --> 00:06:18,600 was in the introduction to this talk 167 00:06:15,060 --> 00:06:21,840 uh there was a mention about traffic to 168 00:06:18,600 --> 00:06:23,880 Stack Overflow declining 169 00:06:21,840 --> 00:06:25,740 um I want to point out that you can draw 170 00:06:23,880 --> 00:06:26,940 no conclusions about it from this 171 00:06:25,740 --> 00:06:29,340 particular graph this is just 172 00:06:26,940 --> 00:06:30,780 illustrative of the trends in questions 173 00:06:29,340 --> 00:06:31,919 and all posts over time to Stack 174 00:06:30,780 --> 00:06:33,479 Overflow 175 00:06:31,919 --> 00:06:35,340 the reason I say you can't draw too many 176 00:06:33,479 --> 00:06:36,600 conclusions is that Stack Overflow is the 177 00:06:35,340 --> 00:06:38,100 biggest part of the Stack Exchange 178 00:06:36,600 --> 00:06:40,319 network but there are a bunch of other 179 00:06:38,100 --> 00:06:42,419 and 
have been continually starting new 180 00:06:40,319 --> 00:06:43,800 bunches of other sites so there's like a 181 00:06:42,419 --> 00:06:46,500 Unix Stack Exchange there's a system 182 00:06:43,800 --> 00:06:50,039 administrator Stack Exchange there's 183 00:06:46,500 --> 00:06:51,419 Windows and academia and travel 184 00:06:50,039 --> 00:06:53,220 so 185 00:06:51,419 --> 00:06:54,600 you've got to be careful interpreting a 186 00:06:53,220 --> 00:06:55,860 graph like this looking at just one part 187 00:06:54,600 --> 00:06:58,020 of the network when there's all these 188 00:06:55,860 --> 00:07:00,720 other parts and they seem to be making a 189 00:06:58,020 --> 00:07:02,460 very strategic choice to make 190 00:07:00,720 --> 00:07:04,560 targeted communities for different 191 00:07:02,460 --> 00:07:05,220 things 192 00:07:04,560 --> 00:07:06,479 um 193 00:07:05,220 --> 00:07:09,000 so yeah 194 00:07:06,479 --> 00:07:11,220 the other takeaway from this is that 195 00:07:09,000 --> 00:07:12,360 that's a lot of posts so we're talking 196 00:07:11,220 --> 00:07:15,960 about 197 00:07:12,360 --> 00:07:18,600 at peak 200,000 questions a month 198 00:07:15,960 --> 00:07:20,340 which that's a lot 199 00:07:18,600 --> 00:07:23,039 I mean I can't read two hundred thousand 200 00:07:20,340 --> 00:07:25,139 questions I can't like can't do much 201 00:07:23,039 --> 00:07:27,599 can't do much with that if I'm going 202 00:07:25,139 --> 00:07:29,599 through it through it manually 203 00:07:27,599 --> 00:07:29,599 um 204 00:07:30,840 --> 00:07:34,620 I wanted to just just to kind of assure 205 00:07:33,419 --> 00:07:36,780 you that there's lots of other stuff in 206 00:07:34,620 --> 00:07:38,340 Stack Overflow it's not just Python um 207 00:07:36,780 --> 00:07:40,979 Python 208 00:07:38,340 --> 00:07:42,419 if we take both the questions that the 209 00:07:40,979 --> 00:07:43,800 tags are assigned to and all of the 210 00:07:42,419 --> 00:07:45,419 answers to those questions as 
having 211 00:07:43,800 --> 00:07:47,039 that same tag which isn't completely 212 00:07:45,419 --> 00:07:48,180 correct but it's close enough 213 00:07:47,039 --> 00:07:50,520 for this 214 00:07:48,180 --> 00:07:53,400 um Python is the number two tag by 215 00:07:50,520 --> 00:07:55,979 number of posts just behind JavaScript 216 00:07:53,400 --> 00:07:57,419 and just ahead of Java not a 217 00:07:55,979 --> 00:07:58,979 popularity contest 218 00:07:57,419 --> 00:08:01,080 these are quantities that don't have any 219 00:07:58,979 --> 00:08:02,940 meaning I wouldn't take this to say that 220 00:08:01,080 --> 00:08:04,919 Python is the number two language and 221 00:08:02,940 --> 00:08:05,940 JavaScript is the number one doesn't 222 00:08:04,919 --> 00:08:08,520 mean that 223 00:08:05,940 --> 00:08:12,020 on Stack Overflow numerically Python is 224 00:08:08,520 --> 00:08:12,020 the second most tagged thing 225 00:08:16,979 --> 00:08:22,020 when we look at just Python 226 00:08:19,740 --> 00:08:23,400 I forgot the one the other one important 227 00:08:22,020 --> 00:08:25,319 part 228 00:08:23,400 --> 00:08:26,639 shifting all over the place so from now 229 00:08:25,319 --> 00:08:28,440 on everything I'm going to give you is 230 00:08:26,639 --> 00:08:29,940 as a relative proportion because it kind 231 00:08:28,440 --> 00:08:32,219 of cleans up otherwise every graph is 232 00:08:29,940 --> 00:08:33,599 going to have this shape I'm also going 233 00:08:32,219 --> 00:08:36,959 to chop the first and the last month off 234 00:08:33,599 --> 00:08:38,039 because they're not complete so just 235 00:08:36,959 --> 00:08:40,440 about everything I've got from now on 236 00:08:38,039 --> 00:08:44,539 will be relative so so you don't just 237 00:08:40,440 --> 00:08:44,539 see this on a bunch of different graphs 238 00:08:45,000 --> 00:08:48,240 if we look at Py if we just look at the 239 00:08:46,740 --> 00:08:49,800 python tag then it's that's where it 240 00:08:48,240 --> 
00:08:51,959 becomes really interesting 241 00:08:49,800 --> 00:08:53,820 um so in the last few years 242 00:08:51,959 --> 00:08:55,380 seven and a half percent of questions to 243 00:08:53,820 --> 00:08:57,959 Stack Overflow have been tagged with 244 00:08:55,380 --> 00:08:58,560 python seven and a half percent 245 00:08:57,959 --> 00:09:01,320 um 246 00:08:58,560 --> 00:09:04,260 so regardless of all of our discussion 247 00:09:01,320 --> 00:09:06,420 about the decline or not decline or is 248 00:09:04,260 --> 00:09:08,580 Stack Overflow declining is AI replacing 249 00:09:06,420 --> 00:09:10,080 it Stack Overflow is still a really 250 00:09:08,580 --> 00:09:11,339 important place to the Python community 251 00:09:10,080 --> 00:09:12,899 there are a lot of people asking 252 00:09:11,339 --> 00:09:15,180 questions there there are a lot of 253 00:09:12,899 --> 00:09:16,620 people answering questions there so 254 00:09:15,180 --> 00:09:18,480 regardless of all that other stuff and 255 00:09:16,620 --> 00:09:20,640 what the future may hold it is important 256 00:09:18,480 --> 00:09:22,260 for that particular reason 257 00:09:20,640 --> 00:09:25,940 keeping in mind all the things I just 258 00:09:22,260 --> 00:09:25,940 said about demographics earlier 259 00:09:27,660 --> 00:09:32,339 so tags are nice but they're actually 260 00:09:30,540 --> 00:09:34,140 really limited like Stack Overflow I 261 00:09:32,339 --> 00:09:35,580 believe only or Stack Exchange only lets 262 00:09:34,140 --> 00:09:37,620 you add five tags and they're intended 263 00:09:35,580 --> 00:09:39,360 more as how do we find people to answer 264 00:09:37,620 --> 00:09:42,180 this question not as an analytically 265 00:09:39,360 --> 00:09:44,279 useful thing it's not necessarily 266 00:09:42,180 --> 00:09:45,060 useful for finding out 267 00:09:44,279 --> 00:09:46,740 um 268 00:09:45,060 --> 00:09:48,240 what's going on there 269 00:09:46,740 --> 00:09:49,980 the other 270 00:09:48,240 --> 
00:09:52,080 so that I mean 271 00:09:49,980 --> 00:09:56,339 we're in a data track I'm sure you're 272 00:09:52,080 --> 00:09:57,899 all you're all thinking of ways to how 273 00:09:56,339 --> 00:09:59,820 you tackle this particular thing things 274 00:09:57,899 --> 00:10:00,660 you might do and there's heaps of things 275 00:09:59,820 --> 00:10:03,540 um 276 00:10:00,660 --> 00:10:04,980 so you could cluster questions and 277 00:10:03,540 --> 00:10:07,260 answers together to see if there's 278 00:10:04,980 --> 00:10:10,380 any interesting groupings you could 279 00:10:07,260 --> 00:10:12,959 um if you had importantly if you had 280 00:10:10,380 --> 00:10:16,140 some pre-determined questions you wanted 281 00:10:12,959 --> 00:10:18,060 to answer which I don't you could take a 282 00:10:16,140 --> 00:10:19,320 supervised machine learning approach or 283 00:10:18,060 --> 00:10:22,620 you could look at using large language 284 00:10:19,320 --> 00:10:23,940 models you could look at 285 00:10:22,620 --> 00:10:26,459 um 286 00:10:23,940 --> 00:10:28,560 topic modeling and LDA and that class 287 00:10:26,459 --> 00:10:29,700 of algorithms or I'm going to mention 288 00:10:28,560 --> 00:10:31,980 this one because it doesn't get enough 289 00:10:29,700 --> 00:10:33,240 attention you could also take a corpus 290 00:10:31,980 --> 00:10:36,899 linguistics approach and you could start 291 00:10:33,240 --> 00:10:38,100 looking at uh big lists of words over 292 00:10:36,899 --> 00:10:40,200 time and see if there's anything 293 00:10:38,100 --> 00:10:42,060 interesting in there or you could do a 294 00:10:40,200 --> 00:10:44,459 keyword detection and find out that 295 00:10:42,060 --> 00:10:45,660 Stack Overflow uses a heap of technical 296 00:10:44,459 --> 00:10:47,519 language compared to some other 297 00:10:45,660 --> 00:10:49,980 reference corpus which isn't super 298 00:10:47,519 --> 00:10:51,959 interesting to me um 299 00:10:49,980 --> 00:10:53,820 and because I'm an 
academic and this is 300 00:10:51,959 --> 00:10:54,959 partly both a research question and I'm 301 00:10:53,820 --> 00:10:56,579 interested in the methods and other 302 00:10:54,959 --> 00:10:58,980 things the actual answer I took was to 303 00:10:56,579 --> 00:11:00,839 yeah just no we'll do none of that let's 304 00:10:58,980 --> 00:11:04,320 start from scratch and do something 305 00:11:00,839 --> 00:11:05,820 else again because I'm actually aiming 306 00:11:04,320 --> 00:11:07,140 for a different audience necessarily 307 00:11:05,820 --> 00:11:09,360 than all of you but I think you'll find 308 00:11:07,140 --> 00:11:10,800 it interesting anyway so it's an open 309 00:11:09,360 --> 00:11:11,700 source package 310 00:11:10,800 --> 00:11:14,100 um 311 00:11:11,700 --> 00:11:17,399 it's open sourced and it's academic so 312 00:11:14,100 --> 00:11:19,560 there's like double no guarantees of any 313 00:11:17,399 --> 00:11:21,300 kind no warranties um 314 00:11:19,560 --> 00:11:25,620 if it breaks 315 00:11:21,300 --> 00:11:26,579 sorry uh not sorry honestly 316 00:11:25,620 --> 00:11:28,380 um 317 00:11:26,579 --> 00:11:30,839 but it is all open there you can have a 318 00:11:28,380 --> 00:11:33,899 look um this is also a follow-up because 319 00:11:30,839 --> 00:11:35,820 I gave a talk in 2019 that you don't 320 00:11:33,899 --> 00:11:38,100 always need NumPy and there is no NumPy 321 00:11:35,820 --> 00:11:40,220 in this 322 00:11:38,100 --> 00:11:40,220 um 323 00:11:40,980 --> 00:11:45,120 so the actual kind of 324 00:11:43,380 --> 00:11:47,100 what I wanted to get away from I wanted 325 00:11:45,120 --> 00:11:50,040 to the actual approach and inspiration 326 00:11:47,100 --> 00:11:52,800 for this was topic what if topic models 327 00:11:50,040 --> 00:11:55,620 but you could actually edit and interact 328 00:11:52,800 --> 00:11:57,959 and browse and dive through them so 329 00:11:55,620 --> 00:11:59,579 it's an as a kind of 330 00:11:57,959 --> 00:12:00,720 how do 
we do something useful that isn't 331 00:11:59,579 --> 00:12:02,820 just well here's a probability 332 00:12:00,720 --> 00:12:03,899 distribution of words and topics now 333 00:12:02,820 --> 00:12:05,459 what 334 00:12:03,899 --> 00:12:07,320 um 335 00:12:05,459 --> 00:12:08,940 so the actual solution I came up with is 336 00:12:07,320 --> 00:12:10,440 we take all of the words in all of the 337 00:12:08,940 --> 00:12:11,880 posts 338 00:12:10,440 --> 00:12:13,200 and we cluster them so instead of 339 00:12:11,880 --> 00:12:14,880 clustering the documents we're now 340 00:12:13,200 --> 00:12:16,079 clustering the words that occur in the 341 00:12:14,880 --> 00:12:16,980 documents 342 00:12:16,079 --> 00:12:18,720 um 343 00:12:16,980 --> 00:12:20,100 and it works better than I ever hoped 344 00:12:18,720 --> 00:12:23,519 it's really weird to me every time I 345 00:12:20,100 --> 00:12:25,920 look at a new data set and it just works 346 00:12:23,519 --> 00:12:28,500 um so these are ordered so these are the 347 00:12:25,920 --> 00:12:30,959 outputs of creating 1024 348 00:12:28,500 --> 00:12:32,880 clusters of words on the stack on all of 349 00:12:30,959 --> 00:12:34,680 the Stack Overflow posts I did all of 350 00:12:32,880 --> 00:12:36,360 this on my laptop 351 00:12:34,680 --> 00:12:39,120 um 352 00:12:36,360 --> 00:12:40,620 obviously the top things are the things 353 00:12:39,120 --> 00:12:42,839 that are usually removed by all of the 354 00:12:40,620 --> 00:12:44,459 data science people but like 355 00:12:42,839 --> 00:12:45,480 stop words 356 00:12:44,459 --> 00:12:48,120 um 357 00:12:45,480 --> 00:12:49,380 punctuation they're really punctuation 358 00:12:48,120 --> 00:12:51,180 is actually really important if you want 359 00:12:49,380 --> 00:12:53,220 to look at code but I'll 360 00:12:51,180 --> 00:12:54,120 not belabor that point 361 00:12:53,220 --> 00:12:57,180 um 362 00:12:54,120 --> 00:12:59,339 so these are ordered by 363 00:12:57,180 
--> 00:13:02,760 the number of documents that match so 364 00:12:59,339 --> 00:13:04,500 this cluster 894 matches 58 million 365 00:13:02,760 --> 00:13:06,480 documents because it has common 366 00:13:04,500 --> 00:13:08,459 punctuation and common words so it 367 00:13:06,480 --> 00:13:09,959 matches almost the entire data set and 368 00:13:08,459 --> 00:13:11,279 it matches every document that has one 369 00:13:09,959 --> 00:13:14,700 of these words 370 00:13:11,279 --> 00:13:16,440 as we go down we start to get 371 00:13:14,700 --> 00:13:17,820 you can see there's kind of a mixture of 372 00:13:16,440 --> 00:13:19,860 things so some of these are just the 373 00:13:17,820 --> 00:13:21,660 words of language like language is 374 00:13:19,860 --> 00:13:23,160 highly structured we can't just choose 375 00:13:21,660 --> 00:13:24,360 whatever we like we don't just throw out 376 00:13:23,160 --> 00:13:26,600 a bunch of keywords when we're talking 377 00:13:24,360 --> 00:13:29,959 to other people 378 00:13:26,600 --> 00:13:32,399 as we go down in descending frequency 379 00:13:29,959 --> 00:13:34,440 these clusters start to have more 380 00:13:32,399 --> 00:13:36,300 specific meaning in the context of 381 00:13:34,440 --> 00:13:40,980 programming so we're already down to 382 00:13:36,300 --> 00:13:43,760 types at cluster 570 and if we go 383 00:13:40,980 --> 00:13:47,120 all the way to the end 384 00:13:43,760 --> 00:13:51,240 you get clusters of 385 00:13:47,120 --> 00:13:55,019 a lot of clusters of Java errors and 386 00:13:51,240 --> 00:13:57,000 tracebacks which is interesting and some 387 00:13:55,019 --> 00:13:59,700 of this is partly 388 00:13:57,000 --> 00:14:02,160 driven by norms in that particularly 389 00:13:59,700 --> 00:14:04,139 early on not everything was wrapped in 390 00:14:02,160 --> 00:14:06,420 like not all console output was wrapped 391 00:14:04,139 --> 00:14:07,380 in a code block and it's I think it took 392 00:14:06,420 --> 00:14:09,540 a little 
while for some of those 393 00:14:07,380 --> 00:14:11,579 standard practices to 394 00:14:09,540 --> 00:14:15,000 um kind of 395 00:14:11,579 --> 00:14:17,880 become enforceable or become norms so if 396 00:14:15,000 --> 00:14:20,399 we actually so let's go back 397 00:14:17,880 --> 00:14:21,779 so the other the other the other kind of 398 00:14:20,399 --> 00:14:23,959 thing that I think this is important so 399 00:14:21,779 --> 00:14:23,959 let's 400 00:14:25,380 --> 00:14:28,980 let's rearrange everything by similarity 401 00:14:27,420 --> 00:14:31,200 to the word working and let's have a 402 00:14:28,980 --> 00:14:33,120 look at um some of the documents that 403 00:14:31,200 --> 00:14:34,500 match that particular word and we've 404 00:14:33,120 --> 00:14:36,720 sorted all of the clusters and all the 405 00:14:34,500 --> 00:14:38,880 words in those clusters by similarity to 406 00:14:36,720 --> 00:14:41,760 that word so we see that working is a 407 00:14:38,880 --> 00:14:44,779 very generic word in that it's used in 408 00:14:41,760 --> 00:14:48,899 many different contexts so we got 409 00:14:44,779 --> 00:14:51,000 Capistrano Flask in Docker um 410 00:14:48,899 --> 00:14:53,519 compiling assembly files in CMake so 411 00:14:51,000 --> 00:14:55,199 working is not a not necessarily a 412 00:14:53,519 --> 00:14:59,360 useful word if I want to look at Python 413 00:14:55,199 --> 00:14:59,360 but we could do something like 414 00:15:01,440 --> 00:15:05,660 let's search for pythonic in the posts 415 00:15:09,660 --> 00:15:12,420 ha that's even better I just realized 416 00:15:11,279 --> 00:15:14,279 that's even better than I thought in 417 00:15:12,420 --> 00:15:17,160 that it's saying 418 00:15:14,279 --> 00:15:18,600 people often say comprehensions are 419 00:15:17,160 --> 00:15:20,100 pythonic 420 00:15:18,600 --> 00:15:21,660 which is already interesting and that's 421 00:15:20,100 --> 00:15:22,440 probably a whole separate thing so the 422 00:15:21,660 --> 
00:15:23,699 rest of what I'm going to talk about 423 00:15:22,440 --> 00:15:25,320 today 424 00:15:23,699 --> 00:15:27,180 um so that's that's the thing you can 425 00:15:25,320 --> 00:15:28,199 have a look at it the idea is that so 426 00:15:27,180 --> 00:15:29,339 what I'm going to talk about for the 427 00:15:28,199 --> 00:15:31,440 rest of this is I'm going to talk about 428 00:15:29,339 --> 00:15:32,760 these clusters of words I'm going to 429 00:15:31,440 --> 00:15:35,399 talk about the trends in these clusters 430 00:15:32,760 --> 00:15:37,260 of words over time and I'm going to talk 431 00:15:35,399 --> 00:15:38,160 a bit about because I've also 432 00:15:37,260 --> 00:15:41,160 spent 433 00:15:38,160 --> 00:15:42,360 time reading documents that match these 434 00:15:41,160 --> 00:15:43,860 words as well so it's not just to look 435 00:15:42,360 --> 00:15:45,540 I'm not just looking at the words and 436 00:15:43,860 --> 00:15:47,459 telling you what they mean I'm looking 437 00:15:45,540 --> 00:15:49,860 at a sample of the underlying documents 438 00:15:47,459 --> 00:15:53,180 as well without reading 58 million of 439 00:15:49,860 --> 00:15:53,180 them because I haven't got time for that 440 00:15:53,820 --> 00:15:57,959 so 441 00:15:55,100 --> 00:16:00,779 let's start with 442 00:15:57,959 --> 00:16:02,579 whether this works or not so sometimes 443 00:16:00,779 --> 00:16:05,660 it's really really obvious what you're 444 00:16:02,579 --> 00:16:07,500 looking at when you have 445 00:16:05,660 --> 00:16:09,120 cryptocurrency people using really 446 00:16:07,500 --> 00:16:10,860 specific language and really specific 447 00:16:09,120 --> 00:16:13,320 keywords around technologies and 448 00:16:10,860 --> 00:16:16,860 approaches you get the rise and fall 449 00:16:13,320 --> 00:16:19,320 and rise and fall again of various 450 00:16:16,860 --> 00:16:22,980 cryptocurrency related projects so 451 00:16:19,320 --> 00:16:24,600 uh in blue is 452 00:16:22,980 
--> 00:16:27,480 I'd call it cryptocurrencies because 453 00:16:24,600 --> 00:16:29,760 it's well it's Ethereum and crypto uh 454 00:16:27,480 --> 00:16:31,320 sorry well well it's Ethereum and 455 00:16:29,760 --> 00:16:33,440 Bitcoin 456 00:16:31,320 --> 00:16:33,440 um 457 00:16:33,779 --> 00:16:39,839 green is NFTs somehow 458 00:16:37,320 --> 00:16:40,620 and so on so this one makes sense 459 00:16:39,839 --> 00:16:42,540 um 460 00:16:40,620 --> 00:16:45,240 sometimes though 461 00:16:42,540 --> 00:16:46,980 it gets a bit harder so if we if I 462 00:16:45,240 --> 00:16:48,420 picked out these three particular things 463 00:16:46,980 --> 00:16:51,120 which 464 00:16:48,420 --> 00:16:52,940 I guess the face reading is we've got 465 00:16:51,120 --> 00:16:56,759 the 466 00:16:52,940 --> 00:16:58,620 rise and gradual decline of Angular 467 00:16:56,759 --> 00:17:00,839 compared to React 468 00:16:58,620 --> 00:17:02,759 with the caveat that React has pulled in 469 00:17:00,839 --> 00:17:04,980 a few so I guess 470 00:17:02,759 --> 00:17:06,240 because this is it's just a numerical 471 00:17:04,980 --> 00:17:08,760 model 472 00:17:06,240 --> 00:17:10,140 native and component have kind of been 473 00:17:08,760 --> 00:17:12,480 glommed into React because there's so 474 00:17:10,140 --> 00:17:14,640 many people talking about React Native 475 00:17:12,480 --> 00:17:16,199 and components in React and what is 476 00:17:14,640 --> 00:17:18,600 actually a general keyword has been 477 00:17:16,199 --> 00:17:20,459 absorbed into the framework 478 00:17:18,600 --> 00:17:22,500 um 479 00:17:20,459 --> 00:17:25,620 and also depending on what you want to 480 00:17:22,500 --> 00:17:28,319 do you might say that HTML JavaScript 481 00:17:25,620 --> 00:17:31,080 CSS jQuery is too general that's too 482 00:17:28,319 --> 00:17:32,880 many things and if you do say that I 483 00:17:31,080 --> 00:17:35,340 completely agree with you 484 00:17:32,880 --> 00:17:38,000 and what I would 
tell you 485 00:17:35,340 --> 00:17:38,000 let's 486 00:17:44,760 --> 00:17:49,320 if you did if if you did so we could for 487 00:17:47,820 --> 00:17:52,380 example if we just wanted to separate 488 00:17:49,320 --> 00:17:55,320 JavaScript and jQuery as synonymous with 489 00:17:52,380 --> 00:17:57,720 JavaScript during the early part of um 490 00:17:55,320 --> 00:17:59,580 Stack Overflow we can just pull it out 491 00:17:57,720 --> 00:18:01,140 on its own and then we can use that as 492 00:17:59,580 --> 00:18:03,059 its own little thing separate to 493 00:18:01,140 --> 00:18:05,220 everything else and it's all dynamically 494 00:18:03,059 --> 00:18:08,460 updated so if we then wanted to look at 495 00:18:05,220 --> 00:18:11,120 this separately to everything else 496 00:18:08,460 --> 00:18:11,120 get this 497 00:18:13,200 --> 00:18:17,160 so oh yeah 498 00:18:15,179 --> 00:18:19,020 lots of questions about clicking buttons 499 00:18:17,160 --> 00:18:21,500 and getting that to work in HTML or in 500 00:18:19,020 --> 00:18:21,500 JavaScript 501 00:18:23,539 --> 00:18:29,220 so the clusters are useful 502 00:18:27,660 --> 00:18:31,679 but they don't absolve you from the work 503 00:18:29,220 --> 00:18:35,360 of reading and making changes and making 504 00:18:31,679 --> 00:18:35,360 choices and interpretation 505 00:18:35,520 --> 00:18:41,160 so let's let's look at Python um 506 00:18:39,360 --> 00:18:42,539 let's let's warm up with a nice easy one 507 00:18:41,160 --> 00:18:46,860 so everything I'm going to show you now 508 00:18:42,539 --> 00:18:48,179 is I've everything is a subset related 509 00:18:46,860 --> 00:18:50,840 to everything that's been tagged with 510 00:18:48,179 --> 00:18:54,000 python so I'm using the python tag as an 511 00:18:50,840 --> 00:18:55,620 approximate filter down to is this about 512 00:18:54,000 --> 00:18:58,500 Python or not and I say approximate 513 00:18:55,620 --> 00:19:00,480 because I've read heaps of ones where 514 00:18:58,500 
--> 00:19:02,400 the question the person who asked the 515 00:19:00,480 --> 00:19:03,539 question tags it python and then 516 00:19:02,400 --> 00:19:05,840 somebody gives 517 00:19:03,539 --> 00:19:08,280 an answer in a different language which 518 00:19:05,840 --> 00:19:10,080 maybe is not what the person asking the 519 00:19:08,280 --> 00:19:11,280 question wanted or maybe not 520 00:19:10,080 --> 00:19:12,840 um 521 00:19:11,280 --> 00:19:14,220 so we'll start off with a nice obvious 522 00:19:12,840 --> 00:19:15,840 one we've got 523 00:19:14,220 --> 00:19:17,520 all the questions related to asking 524 00:19:15,840 --> 00:19:19,860 questions and the important punctuation 525 00:19:17,520 --> 00:19:22,080 mark the question mark which indicates 526 00:19:19,860 --> 00:19:24,539 that you are asking a question almost 527 00:19:22,080 --> 00:19:26,640 every question tagged python has those 528 00:19:24,539 --> 00:19:28,220 and that's what you'd expect and also 529 00:19:26,640 --> 00:19:30,900 more importantly 530 00:19:28,220 --> 00:19:33,660 uh lots of the answers don't have that 531 00:19:30,900 --> 00:19:35,299 so the answers to questions are not 532 00:19:33,660 --> 00:19:37,380 themselves asking that many questions 533 00:19:35,299 --> 00:19:38,760 they might be using some of these other 534 00:19:37,380 --> 00:19:41,760 keywords I mean these other words 535 00:19:38,760 --> 00:19:44,160 because they're quite general but 536 00:19:41,760 --> 00:19:46,020 so I guess that makes sense and the 537 00:19:44,160 --> 00:19:48,179 trend is fairly flat over time I'm going 538 00:19:46,020 --> 00:19:51,260 to say trend but it 539 00:19:48,179 --> 00:19:51,260 it's more the vibes 540 00:19:51,900 --> 00:19:56,039 this is all vibes I'm more a qualitative 541 00:19:53,940 --> 00:19:57,660 person than a quantitative despite the 542 00:19:56,039 --> 00:19:59,780 computational stuff 543 00:19:57,660 --> 00:19:59,780 um 544 00:20:02,160 --> 00:20:07,020 sorry every time I
see guys in this I 545 00:20:04,140 --> 00:20:10,380 just sigh a bit because 546 00:20:07,020 --> 00:20:12,120 again back to the demographics and 547 00:20:10,380 --> 00:20:14,100 a whole separate problem 548 00:20:12,120 --> 00:20:16,080 um interestingly uh I think the 549 00:20:14,100 --> 00:20:19,200 exclamation mark in questions has a more 550 00:20:16,080 --> 00:20:21,600 positive emotional affect so thanks in 551 00:20:19,200 --> 00:20:23,160 advance for your help it's more positive 552 00:20:21,600 --> 00:20:24,720 or more 553 00:20:23,160 --> 00:20:26,940 it's not necessarily a negative thing 554 00:20:24,720 --> 00:20:29,100 like it might be in some other context 555 00:20:26,940 --> 00:20:31,620 um so lots of people have exclamation 556 00:20:29,100 --> 00:20:35,160 marks and thanks and 557 00:20:31,620 --> 00:20:36,240 welcomes and nice positive things 558 00:20:35,160 --> 00:20:37,740 um 559 00:20:36,240 --> 00:20:39,179 except in the answers people don't 560 00:20:37,740 --> 00:20:43,160 bother with that in the answers I think 561 00:20:39,179 --> 00:20:43,160 I think answers because I think 562 00:20:43,320 --> 00:20:46,799 uh 563 00:20:44,820 --> 00:20:48,539 I think it's a smaller people group of 564 00:20:46,799 --> 00:20:50,039 people giving more answers and they tend 565 00:20:48,539 --> 00:20:53,340 to be more 566 00:20:50,039 --> 00:20:56,220 focused and direct when giving an answer 567 00:20:53,340 --> 00:20:57,600 like I said vibes 568 00:20:56,220 --> 00:21:00,299 and the other one that's really 569 00:20:57,600 --> 00:21:01,620 interesting to me is and this to me also 570 00:21:00,299 --> 00:21:03,539 highlights one of the challenges with 571 00:21:01,620 --> 00:21:04,919 stack Overflow and talking and going 572 00:21:03,539 --> 00:21:07,440 from stack Overflow to the rest of 573 00:21:04,919 --> 00:21:10,400 programming is that I think stack 574 00:21:07,440 --> 00:21:10,400 Overflow is very 575 00:21:11,240 --> 00:21:16,080 transactional 
or Tactical 576 00:21:14,520 --> 00:21:17,700 a lot of questions are there because 577 00:21:16,080 --> 00:21:19,980 people have an immediate problem that 578 00:21:17,700 --> 00:21:22,620 they're trying to solve so and and in 579 00:21:19,980 --> 00:21:24,240 fact the site actively discourages more 580 00:21:22,620 --> 00:21:26,580 general questions like relating to 581 00:21:24,240 --> 00:21:28,980 bigger scale topics like architectural 582 00:21:26,580 --> 00:21:33,320 approaches testing approaches and things 583 00:21:28,980 --> 00:21:33,320 like that so a lot of the questions are 584 00:21:33,539 --> 00:21:38,940 I'm getting this error 585 00:21:35,940 --> 00:21:41,400 what do I do what am I doing wrong 586 00:21:38,940 --> 00:21:42,720 can you help me fix this so you've got 587 00:21:41,400 --> 00:21:44,880 to be really careful going from stack 588 00:21:42,720 --> 00:21:46,919 Overflow to the broader practice of 589 00:21:44,880 --> 00:21:48,419 programming because I'm sure many of you 590 00:21:46,919 --> 00:21:50,100 would agree with me in saying that 591 00:21:48,419 --> 00:21:51,780 programming is more than just writing a 592 00:21:50,100 --> 00:21:53,280 bit of code and debugging an immediate 593 00:21:51,780 --> 00:21:56,059 problem there's a lot of other things we 594 00:21:53,280 --> 00:21:56,059 do besides that 595 00:21:59,780 --> 00:22:04,679 now I'll just I'll give you one other 596 00:22:02,159 --> 00:22:06,240 warning the hardest part about this talk 597 00:22:04,679 --> 00:22:07,740 was to figure out what to talk about 598 00:22:06,240 --> 00:22:09,900 because 599 00:22:07,740 --> 00:22:12,659 I have a thousand and 24 clusters I have 600 00:22:09,900 --> 00:22:14,880 1024 Trends over time and I have 601 00:22:12,659 --> 00:22:16,740 intersection with with whatever I want 602 00:22:14,880 --> 00:22:18,360 in the data set 603 00:22:16,740 --> 00:22:21,140 so it was really hard just to focus down 604 00:22:18,360 --> 00:22:23,880 and if you do 
want like some really 605 00:22:21,140 --> 00:22:25,500 overwhelming grids of graphs relating to 606 00:22:23,880 --> 00:22:27,659 trends I've got heaps of them for you 607 00:22:25,500 --> 00:22:30,240 and I've tried to select a few 608 00:22:27,659 --> 00:22:32,039 interesting things more based on vibes 609 00:22:30,240 --> 00:22:33,179 and what you're interested in and the fact 610 00:22:32,039 --> 00:22:35,600 that this is a python community 611 00:22:33,179 --> 00:22:38,400 rather than anything else 612 00:22:35,600 --> 00:22:39,840 so I'm going through a lot of things so 613 00:22:38,400 --> 00:22:41,960 I spent a lot of time looking at graphs 614 00:22:39,840 --> 00:22:44,039 I spent a lot of time reading questions 615 00:22:41,960 --> 00:22:46,140 and I noticed there are a couple of 616 00:22:44,039 --> 00:22:47,640 recurring topics 617 00:22:46,140 --> 00:22:49,140 that I would call fundamental 618 00:22:47,640 --> 00:22:51,000 programming topics and they keep coming 619 00:22:49,140 --> 00:22:53,220 up both in questions and answers and 620 00:22:51,000 --> 00:22:56,100 they have a few the probably the most 621 00:22:53,220 --> 00:22:56,820 the reason I call them fundamentals 622 00:22:56,100 --> 00:22:58,740 um 623 00:22:56,820 --> 00:23:01,980 is that they're directly recognizable 624 00:22:58,740 --> 00:23:05,220 topics but they're used in so many 625 00:23:01,980 --> 00:23:07,080 different contexts that you can't just 626 00:23:05,220 --> 00:23:08,640 say well obviously they're always 627 00:23:07,080 --> 00:23:10,140 talking about this problem it's like no 628 00:23:08,640 --> 00:23:12,120 this is a fundamental thing everybody 629 00:23:10,140 --> 00:23:13,440 when you talk to a programmer or an 630 00:23:12,120 --> 00:23:15,059 experienced programmer they know this 631 00:23:13,440 --> 00:23:18,200 they assume you know this they're going 632 00:23:15,059 --> 00:23:18,200 to answer as if you know this 633 00:23:18,240 --> 00:23:24,419 and the
first one I will call the 634 00:23:20,760 --> 00:23:26,880 strings and textual types which again uh 635 00:23:24,419 --> 00:23:28,980 has been there from the beginning so 22 636 00:23:26,880 --> 00:23:31,679 percent of the 637 00:23:28,980 --> 00:23:33,059 questions even at the beginning when a 638 00:23:31,679 --> 00:23:36,000 python was not a big part of stack 639 00:23:33,059 --> 00:23:38,220 Overflow had things related to this and 640 00:23:36,000 --> 00:23:40,020 a few more keywords so 641 00:23:38,220 --> 00:23:41,460 strings and strings and textual types 642 00:23:40,020 --> 00:23:43,320 are a really fundamental thing if you 643 00:23:41,460 --> 00:23:44,460 can't work with them or you're confused 644 00:23:43,320 --> 00:23:46,320 by them or you're having trouble with 645 00:23:44,460 --> 00:23:47,700 them it's going to be hard to do it and 646 00:23:46,320 --> 00:23:49,380 there were so many fun and interesting 647 00:23:47,700 --> 00:23:51,000 ways people actually referred to Strings 648 00:23:49,380 --> 00:23:53,880 like 649 00:23:51,000 --> 00:23:55,559 um so the CSV is there because csvs are 650 00:23:53,880 --> 00:23:56,760 a textual data format 651 00:23:55,559 --> 00:23:58,320 um 652 00:23:56,760 --> 00:23:59,820 and there's so many questions about I 653 00:23:58,320 --> 00:24:01,140 need to go from this to this how do I do 654 00:23:59,820 --> 00:24:03,299 this conversion how do I change this 655 00:24:01,140 --> 00:24:05,460 format how do I do this stuff it's it's 656 00:24:03,299 --> 00:24:06,419 it's a really fundamental thing if you 657 00:24:05,460 --> 00:24:07,380 don't know this you're going to have 658 00:24:06,419 --> 00:24:09,000 struggle 659 00:24:07,380 --> 00:24:11,039 um I should also mention this is not an 660 00:24:09,000 --> 00:24:12,900 exhaustive list I've kind of 661 00:24:11,039 --> 00:24:14,400 there's a lot of fundamental things that 662 00:24:12,900 --> 00:24:15,780 you can probably think of that I won't 663 00:24:14,400 --> 
00:24:17,340 mention here today I just thought these 664 00:24:15,780 --> 00:24:19,380 particular ones look nice 665 00:24:17,340 --> 00:24:22,500 um 666 00:24:19,380 --> 00:24:25,799 I don't think I need I mean 667 00:24:22,500 --> 00:24:27,120 files and file systems really hard to do 668 00:24:25,799 --> 00:24:29,220 anything if you don't know how your file 669 00:24:27,120 --> 00:24:31,620 system works or you're struggling to 670 00:24:29,220 --> 00:24:33,059 understand how a path works or where 671 00:24:31,620 --> 00:24:34,860 your file is or what's my working 672 00:24:33,059 --> 00:24:36,120 directory 673 00:24:34,860 --> 00:24:38,760 um 674 00:24:36,120 --> 00:24:40,740 root is a keyword it's a fundamental 675 00:24:38,760 --> 00:24:42,240 part and I just realized it's what it's 676 00:24:40,740 --> 00:24:44,100 it's it's 677 00:24:42,240 --> 00:24:45,299 like everything in programming it's also 678 00:24:44,100 --> 00:24:46,559 overloaded because it means different 679 00:24:45,299 --> 00:24:48,539 things if you're talking about the root 680 00:24:46,559 --> 00:24:50,460 of a file system or a root user in a 681 00:24:48,539 --> 00:24:51,960 Unix system so 682 00:24:50,460 --> 00:24:53,640 one that's actually one of the beginning 683 00:24:51,960 --> 00:24:55,260 challenges beginners face is they don't 684 00:24:53,640 --> 00:24:57,000 even know the right words to use or they 685 00:24:55,260 --> 00:24:58,679 can't differentiate between 686 00:24:57,000 --> 00:25:02,120 the different meanings of the same word 687 00:24:58,679 --> 00:25:02,120 because we all use the same words 688 00:25:03,419 --> 00:25:06,320 variables 689 00:25:07,200 --> 00:25:11,100 hard to program if you don't have any 690 00:25:09,120 --> 00:25:11,760 variables 691 00:25:11,100 --> 00:25:13,500 um 692 00:25:11,760 --> 00:25:16,080 it's one of the nice things about python 693 00:25:13,500 --> 00:25:18,000 is how much you can do just in terms of 694 00:25:16,080 --> 00:25:19,500 
assignment and scope and 695 00:25:18,000 --> 00:25:20,220 function scope and all those other 696 00:25:19,500 --> 00:25:21,840 things 697 00:25:20,220 --> 00:25:23,640 um 698 00:25:21,840 --> 00:25:25,440 again iteration 699 00:25:23,640 --> 00:25:28,220 um 700 00:25:25,440 --> 00:25:30,840 with the with a bonus infinite Loop 701 00:25:28,220 --> 00:25:32,279 keyword where people are getting stuck 702 00:25:30,840 --> 00:25:35,000 because they're in a while loop that 703 00:25:32,279 --> 00:25:35,000 never exits 704 00:25:35,820 --> 00:25:41,340 arguments to functions some people get 705 00:25:39,059 --> 00:25:43,440 confused by keyword arguments or how to 706 00:25:41,340 --> 00:25:46,220 use keyword arguments or what's a 707 00:25:43,440 --> 00:25:46,220 callable anyway 708 00:25:47,760 --> 00:25:53,240 or I supplied this is what I supplied 709 00:25:50,279 --> 00:25:53,240 why is it not working 710 00:25:54,539 --> 00:25:58,380 uh 711 00:25:56,279 --> 00:25:59,880 built-in types 712 00:25:58,380 --> 00:26:01,799 super important 713 00:25:59,880 --> 00:26:04,500 um 714 00:26:01,799 --> 00:26:05,220 hard to work in Python without 715 00:26:04,500 --> 00:26:07,940 um 716 00:26:05,220 --> 00:26:07,940 a dictionary 717 00:26:10,440 --> 00:26:14,220 dates and times 718 00:26:12,360 --> 00:26:16,580 please don't ask me about time zones I 719 00:26:14,220 --> 00:26:16,580 don't know 720 00:26:18,600 --> 00:26:23,240 but this one and the next one are 721 00:26:20,580 --> 00:26:23,240 interesting together 722 00:26:24,419 --> 00:26:27,900 and I'm gonna so I'll tell you my theory 723 00:26:26,159 --> 00:26:30,600 now why why there is a decline in 724 00:26:27,900 --> 00:26:33,620 discussion of classes 725 00:26:30,600 --> 00:26:33,620 and I think that's because 726 00:26:34,679 --> 00:26:39,020 python is Shifting so 727 00:26:39,659 --> 00:26:43,020 python is in the position it is today I 728 00:26:42,179 --> 00:26:44,820 think and this is where I'm 729 00:26:43,020 --> 
00:26:46,080 extrapolating from stack Overflow 730 00:26:44,820 --> 00:26:48,179 to everything else I'm looking at these 731 00:26:46,080 --> 00:26:50,100 trends and what I know as a professional 732 00:26:48,179 --> 00:26:51,960 software developer 733 00:26:50,100 --> 00:26:54,960 um 734 00:26:51,960 --> 00:26:57,419 probably the most notable part is that 735 00:26:54,960 --> 00:26:59,400 python along with everything else it 736 00:26:57,419 --> 00:27:01,559 does has historically been python is a 737 00:26:59,400 --> 00:27:03,600 language for data science it has an 738 00:27:01,559 --> 00:27:05,820 entire data science ecosystem built on 739 00:27:03,600 --> 00:27:07,500 top of it so we see a real rise in 740 00:27:05,820 --> 00:27:09,059 prominence even as the volume and 741 00:27:07,500 --> 00:27:11,100 proportion of python questions are going 742 00:27:09,059 --> 00:27:14,120 up on stack Overflow the volume and 743 00:27:11,100 --> 00:27:17,520 proportion of uh scientific and 744 00:27:14,120 --> 00:27:19,620 scientific computing and data science 745 00:27:17,520 --> 00:27:21,419 related computing stuff is going up at 746 00:27:19,620 --> 00:27:22,440 the same time so I think there's a 747 00:27:21,419 --> 00:27:25,620 there's a 748 00:27:22,440 --> 00:27:27,000 a feedback loop there 749 00:27:25,620 --> 00:27:28,559 and I think part of the reason we're 750 00:27:27,000 --> 00:27:30,120 talking less about classes is more 751 00:27:28,559 --> 00:27:31,500 because 752 00:27:30,120 --> 00:27:33,539 um 753 00:27:31,500 --> 00:27:36,480 in the data science world you tend to 754 00:27:33,539 --> 00:27:39,059 spend less time constructing systems of 755 00:27:36,480 --> 00:27:40,500 objects you have data with a 756 00:27:39,059 --> 00:27:42,240 particular schema and that's your model 757 00:27:40,500 --> 00:27:44,299 of the world so you don't have an object 758 00:27:42,240 --> 00:27:47,940 you have a schema 759 00:27:44,299 --> 00:27:50,520 I did say vibes
I'll say it again 760 00:27:47,940 --> 00:27:52,940 um this is the one that surprised me the 761 00:27:50,520 --> 00:27:52,940 most because 762 00:27:53,279 --> 00:27:55,799 um 763 00:27:54,240 --> 00:27:57,840 I was not expecting this particular 764 00:27:55,799 --> 00:28:01,020 volume so 765 00:27:57,840 --> 00:28:03,480 15 percent of questions in the last few years 766 00:28:01,020 --> 00:28:05,880 tagged with python and the answers to 767 00:28:03,480 --> 00:28:07,980 those questions 15 percent of them 768 00:28:05,880 --> 00:28:10,020 touch on pandas and data frames so this 769 00:28:07,980 --> 00:28:12,120 is this is the data this is the data 770 00:28:10,020 --> 00:28:15,059 science in python community growing over 771 00:28:12,120 --> 00:28:17,039 time and there's a general consensus 772 00:28:15,059 --> 00:28:19,320 that pandas and data frames is a useful 773 00:28:17,039 --> 00:28:20,820 construct and everybody's using them and 774 00:28:19,320 --> 00:28:23,760 everybody and there's a lot of confusion 775 00:28:20,820 --> 00:28:25,500 about them I've got specific confusion 776 00:28:23,760 --> 00:28:29,419 about a few different aspects of pandas 777 00:28:25,500 --> 00:28:29,419 which I'm not going to talk about today 778 00:28:29,520 --> 00:28:33,900 um but we've also got machine learning 779 00:28:31,799 --> 00:28:37,140 in general 780 00:28:33,900 --> 00:28:38,880 I don't know why there's a dip there 781 00:28:37,140 --> 00:28:40,200 I got no if somebody's got some ideas 782 00:28:38,880 --> 00:28:41,400 I'd love to hear them I got I got 783 00:28:40,200 --> 00:28:42,840 nothing 784 00:28:41,400 --> 00:28:44,700 um 785 00:28:42,840 --> 00:28:46,200 I also don't know why there's a peak 786 00:28:44,700 --> 00:28:48,960 there but maybe I can blame large 787 00:28:46,200 --> 00:28:51,059 language models for that or thank them 788 00:28:48,960 --> 00:28:52,559 not sure if we trade around 789 00:28:51,059 --> 00:28:54,240 um we've also got the rise in deep 790
00:28:52,559 --> 00:28:56,340 learning so 791 00:28:54,240 --> 00:28:58,860 five percent of questions about 792 00:28:56,340 --> 00:29:00,900 tensorflow and Keras and tensor and 793 00:28:58,860 --> 00:29:04,200 PyTorch and their various things that's 794 00:29:00,900 --> 00:29:07,620 surprising to me that's cool 795 00:29:04,200 --> 00:29:09,179 and lastly plotting and visualization 796 00:29:07,620 --> 00:29:10,860 everybody wants to make plots because 797 00:29:09,179 --> 00:29:13,760 plots are cool visualizations are a really 798 00:29:10,860 --> 00:29:13,760 important part of practice 799 00:29:14,880 --> 00:29:17,820 which is because there's a Django 800 00:29:16,140 --> 00:29:20,120 conference right now again this is 801 00:29:17,820 --> 00:29:22,380 proportion and remember at the beginning 802 00:29:20,120 --> 00:29:25,440 stack Overflow is only one or two 803 00:29:22,380 --> 00:29:27,600 percent of questions about python um 804 00:29:25,440 --> 00:29:29,640 I want you to compare this to the rise 805 00:29:27,600 --> 00:29:33,539 and fall of angular I showed you there 806 00:29:29,640 --> 00:29:35,640 and to me this is this is Django is in 807 00:29:33,539 --> 00:29:37,380 it for the long haul 808 00:29:35,640 --> 00:29:39,299 still relevant still important still 809 00:29:37,380 --> 00:29:40,980 lots of people asking questions 810 00:29:39,299 --> 00:29:42,539 um 811 00:29:40,980 --> 00:29:43,860 and so on 812 00:29:42,539 --> 00:29:45,720 and in the interest of time I'm going to 813 00:29:43,860 --> 00:29:47,340 skip the last one 814 00:29:45,720 --> 00:29:49,740 oh sorry second last one and just 815 00:29:47,340 --> 00:29:52,380 mention notebooks are cool people more 816 00:29:49,740 --> 00:29:53,580 and more people are using them 817 00:29:52,380 --> 00:29:55,919 so 818 00:29:53,580 --> 00:29:57,960 to summarize uh 819 00:29:55,919 --> 00:29:59,340 in the context of stack Overflow I'd say 820 00:29:57,960 --> 00:30:00,960 that the relative growth in
python 821 00:29:59,340 --> 00:30:03,000 questions on stack Overflow is directly 822 00:30:00,960 --> 00:30:04,340 tied to python as a language for data 823 00:30:03,000 --> 00:30:06,600 science 824 00:30:04,340 --> 00:30:07,980 there's a lot of language fundamentals 825 00:30:06,600 --> 00:30:09,840 that haven't changed even if you're 826 00:30:07,980 --> 00:30:11,880 using the things in the python data 827 00:30:09,840 --> 00:30:13,860 science stack on top of that 828 00:30:11,880 --> 00:30:16,260 um 829 00:30:13,860 --> 00:30:17,940 but 830 00:30:16,260 --> 00:30:19,559 there's lots of really exciting things 831 00:30:17,940 --> 00:30:21,240 happening in the python ecosystem that 832 00:30:19,559 --> 00:30:24,360 build on top of python as a language so 833 00:30:21,240 --> 00:30:26,100 python as data science is not not just 834 00:30:24,360 --> 00:30:29,360 about python the language it's about all 835 00:30:26,100 --> 00:30:29,360 of the things built on top of it 836 00:30:30,240 --> 00:30:33,419 uh and I'm going to leave you with the 837 00:30:31,679 --> 00:30:36,059 metaphor of the nice philosophical 838 00:30:33,419 --> 00:30:38,440 summary python is 839 00:30:36,059 --> 00:30:46,500 um and just leave it there thank you 840 00:30:38,440 --> 00:30:48,360 [Applause] 841 00:30:46,500 --> 00:30:49,980 good 842 00:30:48,360 --> 00:30:52,440 thank you Sam 843 00:30:49,980 --> 00:30:55,679 very good presentation we would love to 844 00:30:52,440 --> 00:30:57,299 have more time but all good so we're now 845 00:30:55,679 --> 00:31:00,360 taking questions so is there any 846 00:30:57,299 --> 00:31:02,940 questions in interesting okay good so 847 00:31:00,360 --> 00:31:06,659 we'll have uh 10 minutes to change rooms 848 00:31:02,940 --> 00:31:09,659 and uh we'll continue here 849 00:31:06,659 --> 00:31:09,659 at 850 00:31:09,720 --> 00:31:14,520 what time is 851 00:31:11,399 --> 00:31:16,980 11 10 30 or 852 00:31:14,520 --> 00:31:19,380 11 40 in the next talk about the
3D 853 00:31:16,980 --> 00:31:20,400 visualization okay Sam thank you very 854 00:31:19,380 --> 00:31:25,939 much 855 00:31:20,400 --> 00:31:25,939 [Applause]