1 00:00:00,480 --> 00:00:03,480 foreign 2 00:00:08,519 --> 00:00:13,559 okay welcome everyone back to All Things 3 00:00:11,580 --> 00:00:16,199 data we're going to jump straight into 4 00:00:13,559 --> 00:00:19,680 the next talk and um you know we often 5 00:00:16,199 --> 00:00:22,320 talk about people process technology and 6 00:00:19,680 --> 00:00:24,060 uh our tools and as a Python conference 7 00:00:22,320 --> 00:00:28,439 you know we have a lot of tools coming 8 00:00:24,060 --> 00:00:31,260 up uh and um uh but today we're going to 9 00:00:28,439 --> 00:00:33,300 be talking about people and process and 10 00:00:31,260 --> 00:00:35,460 in particular around data contracts and 11 00:00:33,300 --> 00:00:37,440 how they can help us uh you know get 12 00:00:35,460 --> 00:00:38,579 more data quality and you know all that 13 00:00:37,440 --> 00:00:41,460 good stuff that we're actually trying to 14 00:00:38,579 --> 00:00:44,160 do with data so I'm 15 00:00:41,460 --> 00:00:46,379 uh really excited for uh the talk next 16 00:00:44,160 --> 00:00:48,480 from Ryan Collingwood and I'll just give 17 00:00:46,379 --> 00:00:51,000 a quick introduction to Ryan 18 00:00:48,480 --> 00:00:53,820 um he uh is a boundary spanner between 19 00:00:51,000 --> 00:00:56,039 business and IT concerns uh he's lived 20 00:00:53,820 --> 00:00:58,500 product management consultancy iteration 21 00:00:56,039 --> 00:01:00,300 management business analysis data 22 00:00:58,500 --> 00:01:02,039 analysis software developer quality 23 00:01:00,300 --> 00:01:04,619 assurance to where it started as IT 24 00:01:02,039 --> 00:01:06,060 support every day he practices a bit of 25 00:01:04,619 --> 00:01:08,100 all this to deliver pragmatic and 26 00:01:06,060 --> 00:01:10,380 maintainable solutions he loves the 27 00:01:08,100 --> 00:01:11,880 journey both as a contributor and a 28 00:01:10,380 --> 00:01:13,560 coordinator from the wrangling modeling 29 00:01:11,880 --> 00:01:15,119 sampling digging for
insights and all 30 00:01:13,560 --> 00:01:17,400 the important communication of insights 31 00:01:15,119 --> 00:01:19,260 and further actions and he gets a kick 32 00:01:17,400 --> 00:01:21,900 out of teaching others to fish the 33 00:01:19,260 --> 00:01:25,560 many water bodies of data whether they 34 00:01:21,900 --> 00:01:27,240 be ponds lakes or swamps so um let's 35 00:01:25,560 --> 00:01:28,799 hear put your hands together for Ryan 36 00:01:27,240 --> 00:01:31,759 Collingwood talking about data contracts 37 00:01:28,799 --> 00:01:31,759 consensus as code 38 00:01:32,700 --> 00:01:36,720 thank you thank you thank you um I'm 39 00:01:34,619 --> 00:01:39,119 really excited to be here at PyCon in 40 00:01:36,720 --> 00:01:41,280 Adelaide it's just fantastic energy and 41 00:01:39,119 --> 00:01:43,320 thank you for being here 42 00:01:41,280 --> 00:01:45,180 so who am I and what is my current 43 00:01:43,320 --> 00:01:46,740 context because I feel like context is 44 00:01:45,180 --> 00:01:49,079 really important to share whenever 45 00:01:46,740 --> 00:01:51,299 you're dispensing advice uh I'm 46 00:01:49,079 --> 00:01:52,920 currently working on the team of data 47 00:01:51,299 --> 00:01:55,140 and analytics at Oroton if you're 48 00:01:52,920 --> 00:01:57,000 unfamiliar with Oroton Oroton is 49 00:01:55,140 --> 00:01:59,700 Australia's at least by my reckoning 50 00:01:57,000 --> 00:02:03,420 oldest fashion brand Australian fashion 51 00:01:59,700 --> 00:02:05,219 brand founded in 1938 and we are a 52 00:02:03,420 --> 00:02:07,439 centralized team very small centralized 53 00:02:05,219 --> 00:02:10,080 team looking after a much bigger rest of 54 00:02:07,439 --> 00:02:12,000 the organization and our tech landscape 55 00:02:10,080 --> 00:02:14,220 is dominated by one or two monoliths 56 00:02:12,000 --> 00:02:16,620 namely our enterprise resource planning 57 00:02:14,220 --> 00:02:17,940 application and our POS system and 58 00:02:16,620 --> 00:02:19,920 it's
orbited by a number of little 59 00:02:17,940 --> 00:02:21,959 SaaSes you know so you know these two 60 00:02:19,920 --> 00:02:24,239 big planets and all these smaller ones 61 00:02:21,959 --> 00:02:27,120 orbiting around it and our data is 62 00:02:24,239 --> 00:02:29,040 mostly moved around in a batch fashion 63 00:02:27,120 --> 00:02:30,840 why I think you might care about this 64 00:02:29,040 --> 00:02:33,540 conversation so over here we have a 65 00:02:30,840 --> 00:02:36,180 reference diagram of a data stack 66 00:02:33,540 --> 00:02:38,160 if you are in the pinkish purplish 67 00:02:36,180 --> 00:02:39,959 bucket of data engineering you probably 68 00:02:38,160 --> 00:02:42,420 know all of the pains that I'm about to 69 00:02:39,959 --> 00:02:44,879 talk about if you are in the blue bucket 70 00:02:42,420 --> 00:02:47,040 typically the consumers of the magic 71 00:02:44,879 --> 00:02:48,540 that happens in the pink bucket you 72 00:02:47,040 --> 00:02:50,280 probably have felt the impacts of this 73 00:02:48,540 --> 00:02:52,920 pain and if you're in the green bucket 74 00:02:50,280 --> 00:02:54,599 well fantastic to have you here because 75 00:02:52,920 --> 00:02:57,540 maybe you don't know about all of this 76 00:02:54,599 --> 00:02:59,459 and we need more allies from that bucket 77 00:02:57,540 --> 00:03:00,660 before I get too far along with this I 78 00:02:59,459 --> 00:03:02,879 want to give a shout out to Andrew Jones 79 00:03:00,660 --> 00:03:04,860 Andrew Jones is to my knowledge the 80 00:03:02,879 --> 00:03:07,500 person that crystallized the term data 81 00:03:04,860 --> 00:03:09,720 contract within this current context uh 82 00:03:07,500 --> 00:03:11,280 published a book at the end of June uh 83 00:03:09,720 --> 00:03:12,720 was really helpful for me putting 84 00:03:11,280 --> 00:03:14,879 together my thoughts I just wanted to 85 00:03:12,720 --> 00:03:16,739 say thank you 86 00:03:14,879 --> 00:03:19,080 and before I get too far I just
love 87 00:03:16,739 --> 00:03:21,000 this quote it may be from a song that 88 00:03:19,080 --> 00:03:22,200 you recognize narrated by Baz Luhrmann 89 00:03:21,000 --> 00:03:24,060 it's actually from an article in the 90 00:03:22,200 --> 00:03:26,220 Chicago Tribune and the gist of it is 91 00:03:24,060 --> 00:03:28,920 take this advice from when it comes that 92 00:03:26,220 --> 00:03:30,420 comes from my context and uh if I could 93 00:03:28,920 --> 00:03:34,319 tell you anything about the future wear 94 00:03:30,420 --> 00:03:36,420 sunscreen so what are data contracts 95 00:03:34,319 --> 00:03:37,920 over here I have two quotes one of them 96 00:03:36,420 --> 00:03:39,599 is from Andrew Jones himself the other 97 00:03:37,920 --> 00:03:41,640 is from Atlan now there's a lot of words 98 00:03:39,599 --> 00:03:42,959 on screen let's not try and read all of 99 00:03:41,640 --> 00:03:45,180 them at once but what I want to draw 100 00:03:42,959 --> 00:03:48,120 your attention to is the mention of 101 00:03:45,180 --> 00:03:50,400 people that you have generator and 102 00:03:48,120 --> 00:03:52,080 consumer parties if you think about a 103 00:03:50,400 --> 00:03:54,000 contract in the real world it involves 104 00:03:52,080 --> 00:03:55,319 people and that's a core fundamental 105 00:03:54,000 --> 00:03:58,379 thing that we need to keep in our minds 106 00:03:55,319 --> 00:04:00,420 here the other is about an agreement 107 00:03:58,379 --> 00:04:02,220 so that's the blue text there and then 108 00:04:00,420 --> 00:04:04,680 there's a mention there about having a 109 00:04:02,220 --> 00:04:06,840 structure you know or or an interface by 110 00:04:04,680 --> 00:04:08,640 which you access this and then finally 111 00:04:06,840 --> 00:04:10,739 the the big kahuna at the end there at 112 00:04:08,640 --> 00:04:12,659 least for my situation right now there's 113 00:04:10,739 --> 00:04:14,760 a bit there about data quality or having 114 00:04:12,659 --> 00:04:16,500 certainty
about what you're going to 115 00:04:14,760 --> 00:04:19,260 get out of this 116 00:04:16,500 --> 00:04:21,900 so let's take a scenario we have team A 117 00:04:19,260 --> 00:04:24,180 team B team C and let's just say they're 118 00:04:21,900 --> 00:04:27,000 all working in the same organization 119 00:04:24,180 --> 00:04:29,220 data is flowing everyone's happy but 120 00:04:27,000 --> 00:04:32,100 what I haven't told you is that team C 121 00:04:29,220 --> 00:04:34,680 have set up a non-consensual API 122 00:04:32,100 --> 00:04:36,840 what I mean by that well team C have 123 00:04:34,680 --> 00:04:39,120 found a read replica of one of the 124 00:04:36,840 --> 00:04:41,100 databases controlled by team A and are 125 00:04:39,120 --> 00:04:43,080 pulling out data and using it for their 126 00:04:41,100 --> 00:04:47,340 own products downstream 127 00:04:43,080 --> 00:04:49,560 but then as always change happens 128 00:04:47,340 --> 00:04:50,820 team A their database goes through some 129 00:04:49,560 --> 00:04:52,560 changes 130 00:04:50,820 --> 00:04:54,479 team A had an understanding with team 131 00:04:52,560 --> 00:04:56,460 B so team B knew this was coming and 132 00:04:54,479 --> 00:04:57,780 they took the necessary steps to 133 00:04:56,460 --> 00:04:59,340 mitigate this change and so their 134 00:04:57,780 --> 00:05:01,560 product is 135 00:04:59,340 --> 00:05:02,880 carrying on still working whereas team C 136 00:05:01,560 --> 00:05:05,400 because they didn't have an agreement 137 00:05:02,880 --> 00:05:07,199 didn't have an understanding this change 138 00:05:05,400 --> 00:05:09,000 blindsided them and so their products 139 00:05:07,199 --> 00:05:10,440 are now impacted and potentially not 140 00:05:09,000 --> 00:05:13,199 working 141 00:05:10,440 --> 00:05:14,639 so if you've been working in that pink 142 00:05:13,199 --> 00:05:16,500 stack in the middle as I was saying in 143 00:05:14,639 --> 00:05:17,880 the beginning you know about this you've 144
00:05:16,500 --> 00:05:19,500 experienced this and maybe you've even 145 00:05:17,880 --> 00:05:22,199 caused this 146 00:05:19,500 --> 00:05:24,300 so what makes up a data contract let's 147 00:05:22,199 --> 00:05:26,220 let's get to it well a data contract 148 00:05:24,300 --> 00:05:28,440 could be as simple as this 149 00:05:26,220 --> 00:05:30,060 this is a yaml document that's 150 00:05:28,440 --> 00:05:31,620 describing an event 151 00:05:30,060 --> 00:05:33,840 it has some information about the 152 00:05:31,620 --> 00:05:36,479 contract itself you know the version who 153 00:05:33,840 --> 00:05:38,699 authored the contract it also has uh 154 00:05:36,479 --> 00:05:41,039 service level objectives so we're making 155 00:05:38,699 --> 00:05:43,500 a promise an expectation about 156 00:05:41,039 --> 00:05:45,419 how complete this data set will be how 157 00:05:43,500 --> 00:05:47,460 fresh it will be and then below you'll 158 00:05:45,419 --> 00:05:50,340 have a there's a collection of things 159 00:05:47,460 --> 00:05:52,560 that you might recognize as being schema 160 00:05:50,340 --> 00:05:54,840 but what I want to really draw your 161 00:05:52,560 --> 00:05:56,699 attention to is that a data contract is 162 00:05:54,840 --> 00:05:59,340 not just schema 163 00:05:56,699 --> 00:06:01,500 as I mentioned it's people if there was 164 00:05:59,340 --> 00:06:03,060 anything that I had to prioritize and 165 00:06:01,500 --> 00:06:06,060 don't take the ordering as a 166 00:06:03,060 --> 00:06:08,460 prioritization however I will say this 167 00:06:06,060 --> 00:06:10,860 if you don't recognize and understand 168 00:06:08,460 --> 00:06:12,960 the people as in the people who are 169 00:06:10,860 --> 00:06:15,060 creating the data and the people who are 170 00:06:12,960 --> 00:06:16,380 looking to consume the data and even the 171 00:06:15,060 --> 00:06:17,759 people who are responsible for 172 00:06:16,380 --> 00:06:19,740 maintaining the underlying 173 00:06:17,759 
--> 00:06:21,960 infrastructure if that is not captured 174 00:06:19,740 --> 00:06:24,600 and part of your data contract 175 00:06:21,960 --> 00:06:25,979 you probably don't have a data contract 176 00:06:24,600 --> 00:06:27,660 other things that you might want to have 177 00:06:25,979 --> 00:06:30,180 in your data contract are a schema we 178 00:06:27,660 --> 00:06:31,500 love schema schema is great uh contract 179 00:06:30,180 --> 00:06:33,539 governance so this could be things about 180 00:06:31,500 --> 00:06:35,460 say the version of the contract or the 181 00:06:33,539 --> 00:06:37,620 publishing state of the contract then 182 00:06:35,460 --> 00:06:39,300 there's semantics semantics is different 183 00:06:37,620 --> 00:06:41,699 from schema and we'll talk about that in 184 00:06:39,300 --> 00:06:44,400 a second but the tldr is schema for 185 00:06:41,699 --> 00:06:46,259 machines semantics for people Fair bit 186 00:06:44,400 --> 00:06:47,819 of overlap then there is a bit there 187 00:06:46,259 --> 00:06:50,100 about service level objectives because 188 00:06:47,819 --> 00:06:52,380 again if you are providing a contract 189 00:06:50,100 --> 00:06:54,419 another thing that is inherent contract 190 00:06:52,380 --> 00:06:55,319 is an expectation of being able to get a 191 00:06:54,419 --> 00:06:57,240 benefit 192 00:06:55,319 --> 00:06:59,039 then there's the governance of the data 193 00:06:57,240 --> 00:07:00,960 set so this might speak to how 194 00:06:59,039 --> 00:07:02,780 privileged is this information is it 195 00:07:00,960 --> 00:07:05,220 classified is it internal is it public 196 00:07:02,780 --> 00:07:07,020 and then finally maybe something about 197 00:07:05,220 --> 00:07:10,319 the mechanisms of transmission so this 198 00:07:07,020 --> 00:07:13,199 might cover how the data was captured 199 00:07:10,319 --> 00:07:14,699 um whether it was from a human entering 200 00:07:13,199 --> 00:07:16,800 data into an interface or was it a 201 00:07:14,699 
--> 00:07:18,539 sensor and then your expectations about 202 00:07:16,800 --> 00:07:20,340 how this data can and should be 203 00:07:18,539 --> 00:07:22,080 transmitted 204 00:07:20,340 --> 00:07:23,759 let's take a minute to take a slight 205 00:07:22,080 --> 00:07:24,960 detour about schema and semantics 206 00:07:23,759 --> 00:07:27,960 because this thing is near and dear to 207 00:07:24,960 --> 00:07:30,000 my heart as I said schema exists in my 208 00:07:27,960 --> 00:07:32,160 mind for the benefit of systems 209 00:07:30,000 --> 00:07:33,840 so you have a data type in your code 210 00:07:32,160 --> 00:07:36,300 it's an integer 211 00:07:33,840 --> 00:07:37,919 it's an integer because your system 212 00:07:36,300 --> 00:07:39,539 needs to make certain assumptions about 213 00:07:37,919 --> 00:07:42,060 it so that if you try and do an 214 00:07:39,539 --> 00:07:44,280 arithmetic operation on that variable it 215 00:07:42,060 --> 00:07:47,699 knows treat this like a number 216 00:07:44,280 --> 00:07:50,099 whereas semantics is for humans and 217 00:07:47,699 --> 00:07:52,380 human expectation 218 00:07:50,099 --> 00:07:53,940 let's take an email address from a 219 00:07:52,380 --> 00:07:56,340 schema perspective what is an email 220 00:07:53,940 --> 00:07:57,780 address it is a series of characters one 221 00:07:56,340 --> 00:07:59,699 after the other you might call it a 222 00:07:57,780 --> 00:08:02,099 string or if you're speaking in database 223 00:07:59,699 --> 00:08:04,199 parlance maybe it's a varchar right 224 00:08:02,099 --> 00:08:06,599 but for a person 225 00:08:04,199 --> 00:08:08,340 we don't see it that way we see it as 226 00:08:06,599 --> 00:08:11,220 this is a piece of information that has 227 00:08:08,340 --> 00:08:13,319 a very specific format that I can use to 228 00:08:11,220 --> 00:08:15,060 communicate to somebody and I have 229 00:08:13,319 --> 00:08:17,699 expectations that I can use this 230 00:08:15,060 --> 00:08:19,740 information to 
have a conversation right 231 00:08:17,699 --> 00:08:21,720 and while we're on the topic of email 232 00:08:19,740 --> 00:08:24,300 addresses there's a fantastic GitHub 233 00:08:21,720 --> 00:08:25,919 repo called awesome falsehoods if you 234 00:08:24,300 --> 00:08:27,360 want to have a crisis of confidence 235 00:08:25,919 --> 00:08:30,419 about what you think you understand 236 00:08:27,360 --> 00:08:32,159 about email addresses about names about 237 00:08:30,419 --> 00:08:34,020 any sort of what you might think is a 238 00:08:32,159 --> 00:08:35,940 standard data type go check out that 239 00:08:34,020 --> 00:08:39,599 repo and have your confidence utterly 240 00:08:35,940 --> 00:08:41,520 destroyed but that repo is really 241 00:08:39,599 --> 00:08:43,680 speaking to sort of the semantics you 242 00:08:41,520 --> 00:08:45,540 know there's as I said schema is about 243 00:08:43,680 --> 00:08:47,339 being able to store and retrieve 244 00:08:45,540 --> 00:08:49,519 information and do operations on the 245 00:08:47,339 --> 00:08:52,560 information by systems with confidence 246 00:08:49,519 --> 00:08:54,420 whereas semantics is ensuring that we as 247 00:08:52,560 --> 00:08:55,860 people make the right interpretations 248 00:08:54,420 --> 00:08:57,839 and assumptions about the data that 249 00:08:55,860 --> 00:08:59,700 we're looking at 250 00:08:57,839 --> 00:09:01,980 so what's the minimum amount of stuff 251 00:08:59,700 --> 00:09:03,720 that you need to be you know thinking 252 00:09:01,980 --> 00:09:05,880 about if you're looking to implement 253 00:09:03,720 --> 00:09:07,440 data contracts well first of all don't 254 00:09:05,880 --> 00:09:09,000 set out to implement data contracts set 255 00:09:07,440 --> 00:09:10,140 out to solve problems but let's say 256 00:09:09,000 --> 00:09:12,120 you're looking to solve problems with 257 00:09:10,140 --> 00:09:13,800 data contracts well again you're going 258 00:09:12,120 --> 00:09:15,660 to need someone 259 
00:09:13,800 --> 00:09:17,940 generating the data and someone who 260 00:09:15,660 --> 00:09:19,860 wants to consume it that's again people 261 00:09:17,940 --> 00:09:20,940 first and foremost 262 00:09:19,860 --> 00:09:22,980 then 263 00:09:20,940 --> 00:09:25,200 you're going to need to define the data 264 00:09:22,980 --> 00:09:27,600 contract now this is a this is from 265 00:09:25,200 --> 00:09:29,820 Andrew's book and in my mind is fairly 266 00:09:27,600 --> 00:09:31,019 optimistic as we'll see as we get 267 00:09:29,820 --> 00:09:33,120 towards the end of this talk where I 268 00:09:31,019 --> 00:09:34,560 bring it to my context there is this 269 00:09:33,120 --> 00:09:37,019 optimistic hope that your data 270 00:09:34,560 --> 00:09:38,640 generators will define the contract and 271 00:09:37,019 --> 00:09:40,620 then you know from that contract you can 272 00:09:38,640 --> 00:09:43,380 do things like provision an interface 273 00:09:40,620 --> 00:09:45,540 which represents a service that has its 274 00:09:43,380 --> 00:09:48,500 own database that the consumers can then 275 00:09:45,540 --> 00:09:51,060 interact with via that interface 276 00:09:48,500 --> 00:09:52,740 I've put an extra little database 277 00:09:51,060 --> 00:09:54,899 diagram up there because the guidance 278 00:09:52,740 --> 00:09:56,279 that I've seen in the literature is you 279 00:09:54,899 --> 00:09:58,500 don't want to write a data contract 280 00:09:56,279 --> 00:10:01,160 against say your operational database 281 00:09:58,500 --> 00:10:03,420 you want to have sort of a layer of 282 00:10:01,160 --> 00:10:05,399 abstraction like an interface over the 283 00:10:03,420 --> 00:10:08,339 top so maybe it's that you publish 284 00:10:05,399 --> 00:10:11,279 events of interest to a topic queue or 285 00:10:08,339 --> 00:10:13,500 you provide a database right which 286 00:10:11,279 --> 00:10:15,480 gets its definition from the contract 287 00:10:13,500 --> 00:10:17,580 itself or maybe
it's just a view that 288 00:10:15,480 --> 00:10:19,080 you offer up to your consumers but you 289 00:10:17,580 --> 00:10:21,959 don't want to write it directly against 290 00:10:19,080 --> 00:10:23,279 your concrete implementation because if 291 00:10:21,959 --> 00:10:25,440 you're needing to then change the 292 00:10:23,279 --> 00:10:26,940 internals you then have to rewrite the 293 00:10:25,440 --> 00:10:29,580 contract right so you want that level of 294 00:10:26,940 --> 00:10:31,800 abstraction so that is a key point not 295 00:10:29,580 --> 00:10:33,480 to be overlooked 296 00:10:31,800 --> 00:10:35,660 so you've got this data contract what 297 00:10:33,480 --> 00:10:37,980 can you do with it a couple of things 298 00:10:35,660 --> 00:10:40,500 from the data contract you could 299 00:10:37,980 --> 00:10:42,480 generate some JSON that you can then 300 00:10:40,500 --> 00:10:43,740 feed into say CloudFormation or 301 00:10:42,480 --> 00:10:47,040 Terraform or whatever your 302 00:10:43,740 --> 00:10:49,860 infrastructure-as-code tooling uh of choice is 303 00:10:47,040 --> 00:10:51,600 and then provision infrastructure so you 304 00:10:49,860 --> 00:10:54,360 could take a data contract and provision 305 00:10:51,600 --> 00:10:56,940 a BigQuery table that has a schema as 306 00:10:54,360 --> 00:10:58,860 defined in the data contract right 307 00:10:56,940 --> 00:11:00,899 you could take a data contract and 308 00:10:58,860 --> 00:11:02,279 generate JSON Schema so if you're not 309 00:11:00,899 --> 00:11:04,140 familiar JSON Schema is an 310 00:11:02,279 --> 00:11:06,300 opinionated way for describing entities 311 00:11:04,140 --> 00:11:09,120 and their attributes and from JSON 312 00:11:06,300 --> 00:11:10,800 Schema you could then generate APIs so 313 00:11:09,120 --> 00:11:13,500 you're giving your consumers not only 314 00:11:10,800 --> 00:11:14,940 access to your data but also the 315 00:11:13,500 --> 00:11:17,279 mechanism they don't have to write new
316 00:11:14,940 --> 00:11:19,320 code here's a library have one for 317 00:11:17,279 --> 00:11:20,579 Python have one for JavaScript off you 318 00:11:19,320 --> 00:11:22,200 go 319 00:11:20,579 --> 00:11:24,120 another thing you can do with data 320 00:11:22,200 --> 00:11:26,399 contracts and this is you know what 321 00:11:24,120 --> 00:11:29,459 brought me to the store is being able to 322 00:11:26,399 --> 00:11:31,620 define data quality checks all right 323 00:11:29,459 --> 00:11:33,000 there's three places broadly speaking 324 00:11:31,620 --> 00:11:34,740 where you can implement quality checks 325 00:11:33,000 --> 00:11:37,260 the first is at publishing time so if 326 00:11:34,740 --> 00:11:39,899 you have influence over the system that 327 00:11:37,260 --> 00:11:42,899 is capturing the information before that 328 00:11:39,899 --> 00:11:44,820 data is even written persisted you can 329 00:11:42,899 --> 00:11:46,800 do data quality checks to make sure that 330 00:11:44,820 --> 00:11:49,740 it's conforming right or at the 331 00:11:46,800 --> 00:11:51,540 infrastructure itself so say you pass it 332 00:11:49,740 --> 00:11:54,480 on to a message queue you can then 333 00:11:51,540 --> 00:11:56,880 identify does this event conform with my 334 00:11:54,480 --> 00:11:58,980 expectations no put it on a dead letter 335 00:11:56,880 --> 00:12:01,140 queue or manage it in some way and then 336 00:11:58,980 --> 00:12:03,120 finally there's after publishing now if 337 00:12:01,140 --> 00:12:04,620 you're living in a batch world this is 338 00:12:03,120 --> 00:12:05,700 probably going to be realistically where 339 00:12:04,620 --> 00:12:06,480 you're going to be doing a lot of your 340 00:12:05,700 --> 00:12:08,640 work 341 00:12:06,480 --> 00:12:10,140 the one downside to doing it after 342 00:12:08,640 --> 00:12:12,420 publishing is that if you have errant 343 00:12:10,140 --> 00:12:13,620 data it's out there all right so then 344 00:12:12,420 --> 00:12:15,959
you're going to have to think about okay 345 00:12:13,620 --> 00:12:17,399 what is our case management Incident 346 00:12:15,959 --> 00:12:19,440 Management around data that's 347 00:12:17,399 --> 00:12:22,260 potentially leaked and you know now 348 00:12:19,440 --> 00:12:23,579 there's Gremlins all over the kitchen 349 00:12:22,260 --> 00:12:26,279 so I'm going to bring it back to my 350 00:12:23,579 --> 00:12:28,260 context as I said in the beginning I 351 00:12:26,279 --> 00:12:29,279 have one or two big monoliths in my 352 00:12:28,260 --> 00:12:32,760 world and one of them is the enterprise 353 00:12:29,279 --> 00:12:34,800 resource planning system so for those of 354 00:12:32,760 --> 00:12:36,600 you who are not familiar with what an 355 00:12:34,800 --> 00:12:38,880 Erp represents just think of it as 356 00:12:36,600 --> 00:12:41,399 perhaps all the microservices that you 357 00:12:38,880 --> 00:12:43,040 might have but except in one box and one 358 00:12:41,399 --> 00:12:45,360 interface right 359 00:12:43,040 --> 00:12:48,660 so people from various functions you 360 00:12:45,360 --> 00:12:51,779 know from procurement from Finance from 361 00:12:48,660 --> 00:12:53,160 design they work in the Erp in various 362 00:12:51,779 --> 00:12:55,200 places and that information gets written 363 00:12:53,160 --> 00:12:56,820 out to the erp's database and that 364 00:12:55,200 --> 00:12:58,680 eventually Finds Its way into my world 365 00:12:56,820 --> 00:13:00,899 and the data Lake pretty standard bronze 366 00:12:58,680 --> 00:13:03,060 silver gold situation going on there and 367 00:13:00,899 --> 00:13:05,519 from that we as the data team pick it up 368 00:13:03,060 --> 00:13:07,800 clean it up conform it to a model that 369 00:13:05,519 --> 00:13:10,380 we then present in a variety of ways be 370 00:13:07,800 --> 00:13:13,440 it reporting be it you know sort of 371 00:13:10,380 --> 00:13:15,660 analysis and Discovery and 372 00:13:13,440 --> 00:13:17,760 that's the 
value chain of data in my 373 00:13:15,660 --> 00:13:19,920 organization but when I look at this 374 00:13:17,760 --> 00:13:21,480 again through the context of 375 00:13:19,920 --> 00:13:24,300 producers 376 00:13:21,480 --> 00:13:25,980 what I see are producer boundaries so 377 00:13:24,300 --> 00:13:28,079 there's a producer boundary all the way 378 00:13:25,980 --> 00:13:30,600 on the left where people are entering 379 00:13:28,079 --> 00:13:32,160 information into the ERP UI 380 00:13:30,600 --> 00:13:35,940 there's another producer boundary 381 00:13:32,160 --> 00:13:37,440 between the ERP database and where it 382 00:13:35,940 --> 00:13:38,959 gets picked up and ingested into the 383 00:13:37,440 --> 00:13:42,120 data lake and finally there's another 384 00:13:38,959 --> 00:13:43,920 producer boundary where we as 385 00:13:42,120 --> 00:13:46,560 the data team have now generated these 386 00:13:43,920 --> 00:13:47,820 opinionated gold standard models and 387 00:13:46,560 --> 00:13:49,380 surfacing up to the rest of the 388 00:13:47,820 --> 00:13:51,660 organization 389 00:13:49,380 --> 00:13:53,880 and the people on the right may also be 390 00:13:51,660 --> 00:13:54,839 some of the same people on the left so 391 00:13:53,880 --> 00:13:57,120 again 392 00:13:54,839 --> 00:14:00,000 in the beginning we had that nice linear 393 00:13:57,120 --> 00:14:01,860 diagram of producers and consumers 394 00:14:00,000 --> 00:14:04,139 they may actually be the same people in 395 00:14:01,860 --> 00:14:07,440 some situations 396 00:14:04,139 --> 00:14:10,019 so at these producer boundaries when I 397 00:14:07,440 --> 00:14:12,600 think about it these are inputs to a 398 00:14:10,019 --> 00:14:15,420 data contract so between the people 399 00:14:12,600 --> 00:14:17,880 working with and interacting with the 400 00:14:15,420 --> 00:14:20,220 ERP that's where we can get an 401 00:14:17,880 --> 00:14:21,839 understanding of the semantics 402 00:14:20,220 -->
00:14:24,060 one of the 403 00:14:21,839 --> 00:14:25,620 depending on your point of view perks of 404 00:14:24,060 --> 00:14:27,540 a lot of these Erp systems is that you 405 00:14:25,620 --> 00:14:29,880 can configure them in any way that you 406 00:14:27,540 --> 00:14:31,980 choose to often you're advised not to 407 00:14:29,880 --> 00:14:33,660 but no one listens to that they go ahead 408 00:14:31,980 --> 00:14:35,820 and configure it anyways 409 00:14:33,660 --> 00:14:37,560 or conversely 410 00:14:35,820 --> 00:14:39,540 they don't configure it and they just 411 00:14:37,560 --> 00:14:41,699 enforce Rules by convention 412 00:14:39,540 --> 00:14:43,260 so there is a understanding within the 413 00:14:41,699 --> 00:14:46,199 people operating the systems that okay 414 00:14:43,260 --> 00:14:48,180 this field over here don't populate it 415 00:14:46,199 --> 00:14:50,519 if these other things are true you know 416 00:14:48,180 --> 00:14:52,680 so if it's if it's a purchase order say 417 00:14:50,519 --> 00:14:54,480 for apparel don't fill in that field 418 00:14:52,680 --> 00:14:56,699 over there but if it's a purchase order 419 00:14:54,480 --> 00:14:58,560 say for materials fill in that field 420 00:14:56,699 --> 00:15:01,199 over there but that isn't enforced by 421 00:14:58,560 --> 00:15:03,779 the system right so that is where again 422 00:15:01,199 --> 00:15:05,459 the semantics of the data in the world 423 00:15:03,779 --> 00:15:07,079 that we're living in that's where I can 424 00:15:05,459 --> 00:15:08,459 get that information from having that 425 00:15:07,079 --> 00:15:11,940 conversation there 426 00:15:08,459 --> 00:15:13,920 similarly between the Erp database you 427 00:15:11,940 --> 00:15:16,440 know it materializing its information 428 00:15:13,920 --> 00:15:18,060 and its schema and its point of view I 429 00:15:16,440 --> 00:15:19,740 can then start to pick up on the schema 430 00:15:18,060 --> 00:15:22,440 and perhaps the service level 
objectives 431 00:15:19,740 --> 00:15:24,420 that I can expect profiling the data um 432 00:15:22,440 --> 00:15:26,459 I see it looks like there's a purchase 433 00:15:24,420 --> 00:15:27,779 order raised every day so maybe that's a 434 00:15:26,459 --> 00:15:29,579 test that I could write you know expect 435 00:15:27,779 --> 00:15:32,279 that we will see at least one record 436 00:15:29,579 --> 00:15:34,199 every day appearing in the table that 437 00:15:32,279 --> 00:15:35,519 represents purchase orders and then 438 00:15:34,199 --> 00:15:37,920 finally 439 00:15:35,519 --> 00:15:39,899 um I will take those two things in my 440 00:15:37,920 --> 00:15:41,459 context and then generate the checks and 441 00:15:39,899 --> 00:15:43,320 tests which I'm applying at that last 442 00:15:41,459 --> 00:15:45,000 boundary over there 443 00:15:43,320 --> 00:15:47,040 the reality that I have defined since 444 00:15:45,000 --> 00:15:49,500 starting this journey it's been about a 445 00:15:47,040 --> 00:15:51,240 month and a half now is 446 00:15:49,500 --> 00:15:54,000 the people at the very end also have 447 00:15:51,240 --> 00:15:57,480 their interpretation of the semantics so 448 00:15:54,000 --> 00:16:00,440 again that initial diagram that had this 449 00:15:57,480 --> 00:16:03,120 ideal linear you know left to right flow 450 00:16:00,440 --> 00:16:04,980 in reality it may be a little different 451 00:16:03,120 --> 00:16:07,079 so consider the people at the beginning 452 00:16:04,980 --> 00:16:08,820 and at the end because they may have 453 00:16:07,079 --> 00:16:11,760 slightly different but still significant 454 00:16:08,820 --> 00:16:13,440 interpretations of what the data means 455 00:16:11,760 --> 00:16:15,779 so how are we going to make this all 456 00:16:13,440 --> 00:16:18,899 happen well we're going to use awesome 457 00:16:15,779 --> 00:16:20,579 people that understand modeling that 458 00:16:18,899 --> 00:16:22,920 understand abstractions and understand 459 
00:16:20,579 --> 00:16:25,560 constraints and I think I'm looking at 460 00:16:22,920 --> 00:16:27,300 them we could even do it in code 461 00:16:25,560 --> 00:16:29,399 and you should definitely version 462 00:16:27,300 --> 00:16:31,680 control it 463 00:16:29,399 --> 00:16:33,660 but you might be thinking hold up a 464 00:16:31,680 --> 00:16:35,339 minute a couple of slides ago you showed 465 00:16:33,660 --> 00:16:37,079 me a YAML document and said this is a 466 00:16:35,339 --> 00:16:39,839 data contract why are we talking about 467 00:16:37,079 --> 00:16:42,600 code now a couple of reasons 468 00:16:39,839 --> 00:16:44,040 um predominantly 469 00:16:42,600 --> 00:16:46,320 documents 470 00:16:44,040 --> 00:16:48,540 you know be it a Word doc an Excel 471 00:16:46,320 --> 00:16:51,300 document even to a lesser extent a 472 00:16:48,540 --> 00:16:53,579 structured document like JSON or YAML 473 00:16:51,300 --> 00:16:55,320 they suffer in varying degrees from what 474 00:16:53,579 --> 00:16:57,360 I describe as the entanglement of 475 00:16:55,320 --> 00:16:59,100 meaning and representation so the 476 00:16:57,360 --> 00:17:00,839 simplest example I can give with this is 477 00:16:59,100 --> 00:17:02,880 say you've got a Word document you know 478 00:17:00,839 --> 00:17:04,020 a rich document that people can apply 479 00:17:02,880 --> 00:17:06,000 formatting to 480 00:17:04,020 --> 00:17:07,500 you can start typing a sentence a 481 00:17:06,000 --> 00:17:09,600 statement of fact 482 00:17:07,500 --> 00:17:11,459 later you can come back to that document 483 00:17:09,600 --> 00:17:13,079 and apply formatting that strikes 484 00:17:11,459 --> 00:17:15,600 through that sentence 485 00:17:13,079 --> 00:17:17,579 when reading that document you as a 486 00:17:15,600 --> 00:17:19,500 person interpret that oh that sentence 487 00:17:17,579 --> 00:17:22,319 is no longer valid it no longer applies 488 00:17:19,500 --> 00:17:23,880 however if you parse that document you 489
00:17:22,319 --> 00:17:25,919 may not pick up on that 490 00:17:23,880 --> 00:17:28,380 so again it's the entanglement of 491 00:17:25,919 --> 00:17:30,960 meaning and representation when we write 492 00:17:28,380 --> 00:17:32,820 code we say exactly what we mean because 493 00:17:30,960 --> 00:17:34,500 we have to 494 00:17:32,820 --> 00:17:36,120 another reason why you may want to 495 00:17:34,500 --> 00:17:38,340 consider doing this as code rather than 496 00:17:36,120 --> 00:17:40,620 a whole bunch of documents is finding 497 00:17:38,340 --> 00:17:42,840 references so when you have your code 498 00:17:40,620 --> 00:17:45,299 and you want to find where a particular 499 00:17:42,840 --> 00:17:46,380 variable has been used it's pretty 500 00:17:45,299 --> 00:17:48,120 straightforward these days we've got 501 00:17:46,380 --> 00:17:50,460 wonderful tools just right click find 502 00:17:48,120 --> 00:17:52,799 references there's all the places but if 503 00:17:50,460 --> 00:17:55,080 you're doing this with plain text you've 504 00:17:52,799 --> 00:17:57,720 got to do text matches 505 00:17:55,080 --> 00:17:58,860 and that just opens up a whole can of 506 00:17:57,720 --> 00:18:00,000 worms that you now have to figure 507 00:17:58,860 --> 00:18:01,320 through 508 00:18:00,000 --> 00:18:04,440 finally there's some other good stuff 509 00:18:01,320 --> 00:18:07,140 like we can test code and ultimately if 510 00:18:04,440 --> 00:18:09,720 you need to represent that knowledge as 511 00:18:07,140 --> 00:18:11,820 say a JSON or YAML document you can do 512 00:18:09,720 --> 00:18:13,980 that from code right 513 00:18:11,820 --> 00:18:15,960 and again speaking about refactoring 514 00:18:13,980 --> 00:18:18,720 text you might think Ryan but I am a 515 00:18:15,960 --> 00:18:21,419 wizard of regular expressions I got this 516 00:18:18,720 --> 00:18:23,220 I believe the same until I have to do it 517 00:18:21,419 --> 00:18:24,960 and then I find myself doing a whole 518 
00:18:23,220 --> 00:18:27,179 bunch of Googling or playing around with 519 00:18:24,960 --> 00:18:29,400 a regex tester right not knocking 520 00:18:27,179 --> 00:18:30,539 regular expressions they're fantastic I 521 00:18:29,400 --> 00:18:31,919 just think that my little brain can't 522 00:18:30,539 --> 00:18:33,600 keep you know track of them for long 523 00:18:31,919 --> 00:18:36,000 enough to you know remember what I need 524 00:18:33,600 --> 00:18:37,679 to do the next time I need them 525 00:18:36,000 --> 00:18:39,900 so coming back to my context what was 526 00:18:37,679 --> 00:18:41,520 considered as I said the first thing is 527 00:18:39,900 --> 00:18:44,100 I didn't set out to implement data 528 00:18:41,520 --> 00:18:47,280 contracts I had a realization that there 529 00:18:44,100 --> 00:18:49,080 was this misalignment around the 530 00:18:47,280 --> 00:18:50,520 expectations and the understanding of 531 00:18:49,080 --> 00:18:53,460 the information that was being entered 532 00:18:50,520 --> 00:18:54,960 in at the beginning of the journey 533 00:18:53,460 --> 00:18:57,059 on the slide that I showed you there 534 00:18:54,960 --> 00:18:58,200 and at the end there was a lack of 535 00:18:57,059 --> 00:19:01,140 knowledge there was gaps in our 536 00:18:58,200 --> 00:19:02,520 understanding so first identify a 537 00:19:01,140 --> 00:19:04,980 problem that you believe can be 538 00:19:02,520 --> 00:19:07,500 addressed by establishing and 539 00:19:04,980 --> 00:19:10,679 documenting your consensus 540 00:19:07,500 --> 00:19:12,059 next step is of that big problem find a 541 00:19:10,679 --> 00:19:13,559 small scope let's do something 542 00:19:12,059 --> 00:19:14,940 small first rather than trying to boil 543 00:19:13,559 --> 00:19:16,980 the ocean and make sure you've got 544 00:19:14,940 --> 00:19:18,240 allies people that are willing to work 545 00:19:16,980 --> 00:19:20,700 with you on this 546 00:19:18,240 --> 00:19:22,500 secondly 
identify what your constraints 547 00:19:20,700 --> 00:19:25,080 are every organization is different 548 00:19:22,500 --> 00:19:27,780 every situation is different you may 549 00:19:25,080 --> 00:19:29,460 have more engineers available to you you 550 00:19:27,780 --> 00:19:32,640 may have few you may have great 551 00:19:29,460 --> 00:19:34,919 engagement through business you may not 552 00:19:32,640 --> 00:19:37,559 the next thing I would say is keep it 553 00:19:34,919 --> 00:19:40,200 people and process centric right if you 554 00:19:37,559 --> 00:19:42,480 are looking to address a 555 00:19:40,200 --> 00:19:43,500 part of your world that doesn't involve 556 00:19:42,480 --> 00:19:45,539 people 557 00:19:43,500 --> 00:19:47,160 I would say that a data contract isn't 558 00:19:45,539 --> 00:19:49,320 necessarily the way to go about it if 559 00:19:47,160 --> 00:19:52,500 you're looking at interchange between 560 00:19:49,320 --> 00:19:53,760 system A and system B no people involved you 561 00:19:52,500 --> 00:19:56,400 could probably get done what you need to 562 00:19:53,760 --> 00:19:57,780 get done with an API yeah just keep it 563 00:19:56,400 --> 00:19:59,700 simple 564 00:19:57,780 --> 00:20:01,440 next because I was looking I had this 565 00:19:59,700 --> 00:20:03,299 harebrained idea of doing this all as code 566 00:20:01,440 --> 00:20:05,460 I had to think well I'm going to need 567 00:20:03,299 --> 00:20:07,380 some sort of way of modeling all of this 568 00:20:05,460 --> 00:20:08,280 right so I describe this as my meta 569 00:20:07,380 --> 00:20:10,440 schema 570 00:20:08,280 --> 00:20:11,760 and then finally and this is where I'm 571 00:20:10,440 --> 00:20:14,640 going to contradict myself in a couple 572 00:20:11,760 --> 00:20:17,220 of slides I was looking for ways to 573 00:20:14,640 --> 00:20:19,140 maximize uh opportunities for people to 574 00:20:17,220 --> 00:20:20,400 contribute 575 00:20:19,140 --> 00:20:22,320 let's talk about those guiding 576 
00:20:20,400 --> 00:20:24,900 principles the primary objective for me 577 00:20:22,320 --> 00:20:26,700 was establishing consensus and the first 578 00:20:24,900 --> 00:20:28,799 outcome that I was looking for was data 579 00:20:26,700 --> 00:20:30,840 tests as I mentioned you can use data 580 00:20:28,799 --> 00:20:33,059 contracts to create infrastructure 581 00:20:30,840 --> 00:20:34,679 create other tooling but for me it was 582 00:20:33,059 --> 00:20:37,020 really about I want to have confidence 583 00:20:34,679 --> 00:20:39,000 and be able to confidently say to other 584 00:20:37,020 --> 00:20:40,380 people that you can expect the data to 585 00:20:39,000 --> 00:20:42,299 have these characteristics and these 586 00:20:40,380 --> 00:20:45,059 guarantees 587 00:20:42,299 --> 00:20:46,380 then around that meta model when I was 588 00:20:45,059 --> 00:20:48,919 thinking about it what I wanted to 589 00:20:46,380 --> 00:20:51,840 capture was the meaning and 590 00:20:48,919 --> 00:20:54,000 understanding of our data from the UI 591 00:20:51,840 --> 00:20:56,100 all the way down to the database right 592 00:20:54,000 --> 00:20:58,320 it was really about just having that 593 00:20:56,100 --> 00:21:00,840 end-to-end lineage going on 594 00:20:58,320 --> 00:21:02,700 and understanding and recognizing that 595 00:21:00,840 --> 00:21:05,400 there is schema and there's semantics 596 00:21:02,700 --> 00:21:07,760 they're similar but not the same 597 00:21:05,400 --> 00:21:11,039 and truthfully I'm still figuring it out 598 00:21:07,760 --> 00:21:12,660 I was sitting in last night you know one 599 00:21:11,039 --> 00:21:14,160 not being able to fall asleep because I 600 00:21:12,660 --> 00:21:16,200 was worried about catching my flight and 601 00:21:14,160 --> 00:21:17,820 then two thinking I don't like the way 602 00:21:16,200 --> 00:21:20,520 I've named certain things and we'll get 603 00:21:17,820 --> 00:21:23,280 to that in a second here we go here it 604 
00:21:20,520 --> 00:21:25,380 is so here's a you know a representation 605 00:21:23,280 --> 00:21:27,299 of my meta schema for my data 606 00:21:25,380 --> 00:21:29,100 contracts as code so the black stickies 607 00:21:27,299 --> 00:21:30,600 are what I describe as contract 608 00:21:29,100 --> 00:21:32,460 governance so the version number the 609 00:21:30,600 --> 00:21:34,380 publishing status who are the people 610 00:21:32,460 --> 00:21:35,760 involved and any service level 611 00:21:34,380 --> 00:21:37,919 objectives that I'm trying to achieve 612 00:21:35,760 --> 00:21:40,440 for this data contract and then within 613 00:21:37,919 --> 00:21:44,520 the data contract I've gone for keeping 614 00:21:40,440 --> 00:21:47,340 it again centered around events so an 615 00:21:44,520 --> 00:21:48,960 event occurs say Supply created 616 00:21:47,340 --> 00:21:50,940 and for that event you may have one 617 00:21:48,960 --> 00:21:52,620 or more entities I don't like that word 618 00:21:50,940 --> 00:21:55,799 I'm still thinking of a better word but 619 00:21:52,620 --> 00:21:57,299 bear with me and an entity has one or 620 00:21:55,799 --> 00:21:59,220 more properties and a property and this 621 00:21:57,299 --> 00:22:02,159 is where I was keeping 622 00:21:59,220 --> 00:22:04,200 the representation of the semantic and 623 00:22:02,159 --> 00:22:07,080 the schema separate a property is a 624 00:22:04,200 --> 00:22:09,480 composite it has an attribute which is 625 00:22:07,080 --> 00:22:11,400 to speak of the semantics 626 00:22:09,480 --> 00:22:12,960 and then it has a source which speaks of 627 00:22:11,400 --> 00:22:15,900 the schema so I'm living in a relational 628 00:22:12,960 --> 00:22:17,880 world so my source is comprised of a 629 00:22:15,900 --> 00:22:19,440 table or column in a database but if 630 00:22:17,880 --> 00:22:22,080 you're working say with a document 631 00:22:19,440 --> 00:22:24,240 database that might be say a key a 632 
00:22:22,080 --> 00:22:26,820 record a collection or if it's hiding 633 00:22:24,240 --> 00:22:29,400 away in an Excel document it could be a 634 00:22:26,820 --> 00:22:31,860 workbook a sheet a cell right you know 635 00:22:29,400 --> 00:22:33,000 allowing myself dynamism for the future 636 00:22:31,860 --> 00:22:34,440 potentially 637 00:22:33,000 --> 00:22:36,960 the other thing I want to call out there 638 00:22:34,440 --> 00:22:39,780 is explicitly modeling the semantic type 639 00:22:36,960 --> 00:22:42,240 now the reason that I believe that this 640 00:22:39,780 --> 00:22:44,460 is worthwhile and necessary to do is the 641 00:22:42,240 --> 00:22:47,100 implementation of a semantic type let's 642 00:22:44,460 --> 00:22:49,500 say email address may differ wildly from 643 00:22:47,100 --> 00:22:53,700 one system to the other so if I have 644 00:22:49,500 --> 00:22:55,320 knowledge that in system A we have a simple 645 00:22:53,700 --> 00:22:57,659 you know attribute of the semantic type 646 00:22:55,320 --> 00:23:00,360 of email and I need to exchange to 647 00:22:57,659 --> 00:23:02,400 system B I can then write some code that 648 00:23:00,360 --> 00:23:03,900 does that translation for me because I 649 00:23:02,400 --> 00:23:06,179 have that hook that they're both 650 00:23:03,900 --> 00:23:09,179 semantically an email address 651 00:23:06,179 --> 00:23:12,360 for example but bear with me 652 00:23:09,179 --> 00:23:14,100 so how are we going about doing this 653 00:23:12,360 --> 00:23:15,720 I'll take a moment and give you a chance 654 00:23:14,100 --> 00:23:16,860 to look at this 655 00:23:15,720 --> 00:23:19,860 because you're probably seeing that 656 00:23:16,860 --> 00:23:21,419 Excel icon going Ryan what the hell 657 00:23:19,860 --> 00:23:23,039 you were just saying you're doing this as 658 00:23:21,419 --> 00:23:26,820 code but there's an Excel spreadsheet 659 00:23:23,039 --> 00:23:28,740 over there are you crazy probably no 660 00:23:26,820 --> 
00:23:31,559 again coming back to my constraints and 661 00:23:28,740 --> 00:23:34,200 principles you may recall I said I was 662 00:23:31,559 --> 00:23:36,720 looking for opportunities uh for other 663 00:23:34,200 --> 00:23:38,640 people to contribute so going on this 664 00:23:36,720 --> 00:23:40,200 journey I have a business analyst you 665 00:23:38,640 --> 00:23:42,720 know that has been linked to the team 666 00:23:40,200 --> 00:23:44,220 they're fantastic really couldn't sing 667 00:23:42,720 --> 00:23:45,539 high enough praises about them and when 668 00:23:44,220 --> 00:23:47,760 I was thinking about yeah let's do this 669 00:23:45,539 --> 00:23:49,559 as code I thought they could learn 670 00:23:47,760 --> 00:23:51,720 Python you know 671 00:23:49,559 --> 00:23:53,580 they're smart they'll get it but again 672 00:23:51,720 --> 00:23:55,440 the constraint was we needed to move 673 00:23:53,580 --> 00:23:57,600 quickly we don't have a lot of time to 674 00:23:55,440 --> 00:24:00,059 do what we need to get done or rather we 675 00:23:57,600 --> 00:24:02,580 need to prove value very quickly and 676 00:24:00,059 --> 00:24:04,020 thus I came to the realization of well 677 00:24:02,580 --> 00:24:06,059 if they're going to be doing a lot of 678 00:24:04,020 --> 00:24:07,260 the legwork talking to people and you 679 00:24:06,059 --> 00:24:10,080 know having these conversations about 680 00:24:07,260 --> 00:24:14,220 what does this mean and where does it go 681 00:24:10,080 --> 00:24:15,960 a spreadsheet's acceptable right so 682 00:24:14,220 --> 00:24:17,039 we have the spreadsheet where 683 00:24:15,960 --> 00:24:18,600 they're capturing all the knowledge 684 00:24:17,039 --> 00:24:19,380 about the events and what's involved in 685 00:24:18,600 --> 00:24:21,240 them 686 00:24:19,380 --> 00:24:23,820 that spreadsheet gets committed to a Git 687 00:24:21,240 --> 00:24:25,740 repository where then I've got a CI/CD 688 00:24:23,820 --> 00:24:29,880 pipeline set up 
where it reads in the 689 00:24:25,740 --> 00:24:32,340 spreadsheet generates my code models 690 00:24:29,880 --> 00:24:34,380 those models are tested and now we have 691 00:24:32,340 --> 00:24:36,539 new contracts expressed as code models 692 00:24:34,380 --> 00:24:38,159 and then soon in the near future from 693 00:24:36,539 --> 00:24:40,080 those code models I can then generate 694 00:24:38,159 --> 00:24:42,240 all sorts of things right now I'm just 695 00:24:40,080 --> 00:24:45,480 generating very simple SQL based tests 696 00:24:42,240 --> 00:24:48,780 so if an attribute is marked as it must 697 00:24:45,480 --> 00:24:50,640 always be present or it must be unique I 698 00:24:48,780 --> 00:24:52,380 can generate tests that automatically 699 00:24:50,640 --> 00:24:54,240 test that for me 700 00:24:52,380 --> 00:24:56,940 and you could do it you know using Great 701 00:24:54,240 --> 00:24:58,740 Expectations or Soda or any of the other 702 00:24:56,940 --> 00:25:01,620 testing libraries right now I've just 703 00:24:58,740 --> 00:25:03,240 implemented it as basic SQL because I have 704 00:25:01,620 --> 00:25:05,760 yet to make that decision about where to 705 00:25:03,240 --> 00:25:07,440 go next in terms of testing 706 00:25:05,760 --> 00:25:10,860 so who's going to do the work coming 707 00:25:07,440 --> 00:25:12,960 back to this optimistic diagram so 708 00:25:10,860 --> 00:25:14,159 hopefully these people the data 709 00:25:12,960 --> 00:25:17,220 generators 710 00:25:14,159 --> 00:25:18,720 but in my experience being part of a 711 00:25:17,220 --> 00:25:21,600 centralized team 712 00:25:18,720 --> 00:25:24,360 because I'm in the category of data 713 00:25:21,600 --> 00:25:26,700 consumers in reality but I'm having to 714 00:25:24,360 --> 00:25:29,460 drive that process so again your 715 00:25:26,700 --> 00:25:31,260 mileage may vary so if you go out and 716 00:25:29,460 --> 00:25:33,360 read the literature and you see the sort 717 00:25:31,260 --> 
00:25:34,740 of idealized state and you think well 718 00:25:33,360 --> 00:25:37,080 hold on that's not going to really be 719 00:25:34,740 --> 00:25:40,460 how it plays out for me don't worry 720 00:25:37,080 --> 00:25:40,460 I'm having this too 721 00:25:40,980 --> 00:25:45,419 so we're at a Python conference let's 722 00:25:43,020 --> 00:25:46,980 talk briefly about what we're using to 723 00:25:45,419 --> 00:25:49,320 do all this stuff 724 00:25:46,980 --> 00:25:51,840 so some helpful Python libraries pandas 725 00:25:49,320 --> 00:25:53,340 for reading in the spreadsheet what you 726 00:25:51,840 --> 00:25:54,900 may find is that if you do go this route 727 00:25:53,340 --> 00:25:56,520 and hand a spreadsheet to somebody and 728 00:25:54,900 --> 00:25:57,900 say please fill in the spreadsheet you 729 00:25:56,520 --> 00:26:00,000 know you give them a very opinionated 730 00:25:57,900 --> 00:26:02,100 structure you may get back what I 731 00:26:00,000 --> 00:26:04,200 describe as a sparsely populated Excel 732 00:26:02,100 --> 00:26:06,360 spreadsheet so the first column of my 733 00:26:04,200 --> 00:26:07,740 spreadsheet is the event and I had the 734 00:26:06,360 --> 00:26:09,900 expectation that they would write the 735 00:26:07,740 --> 00:26:12,539 event name on every row not the case 736 00:26:09,900 --> 00:26:14,820 they write it on the first row and then 737 00:26:12,539 --> 00:26:17,340 I have five blank rows and then the next 738 00:26:14,820 --> 00:26:19,260 event name appears so long story short 739 00:26:17,340 --> 00:26:21,240 pandas read in the spreadsheet and then 740 00:26:19,260 --> 00:26:22,679 forward fill and also pandas is great 741 00:26:21,240 --> 00:26:25,740 just for any other manipulations you 742 00:26:22,679 --> 00:26:29,220 need to do pydantic for defining my 743 00:26:25,740 --> 00:26:31,740 models what I love about Python amongst 744 00:26:29,220 --> 00:26:33,299 other things is Python is a what I would 745 00:26:31,740 --> 
00:26:35,400 describe as a gradually typed language 746 00:26:33,299 --> 00:26:36,779 working in Python if you're not sure 747 00:26:35,400 --> 00:26:38,460 what a variable is going to be you don't 748 00:26:36,779 --> 00:26:40,799 need to tell it hey this is going to be 749 00:26:38,460 --> 00:26:42,360 an int and if it isn't your 750 00:26:40,799 --> 00:26:44,700 application immediately goes kaboom 751 00:26:42,360 --> 00:26:47,100 right but however with the introduction 752 00:26:44,700 --> 00:26:49,260 of type hints and libraries like pydantic 753 00:26:47,100 --> 00:26:51,299 as you get more certainty about what 754 00:26:49,260 --> 00:26:54,600 you're going to be bringing in 755 00:26:51,299 --> 00:26:57,659 you can model it use it and off you go 756 00:26:54,600 --> 00:26:59,039 I'm using rope because the models that 757 00:26:57,659 --> 00:27:01,740 are initially generated are quite 758 00:26:59,039 --> 00:27:05,159 verbose and rope is a fantastic library 759 00:27:01,740 --> 00:27:07,440 for manipulating your Python code and 760 00:27:05,159 --> 00:27:09,659 effectively changing it so you use Python 761 00:27:07,440 --> 00:27:11,880 code to manipulate Python code and 762 00:27:09,659 --> 00:27:14,340 then pytest mypy and black just for 763 00:27:11,880 --> 00:27:16,080 hygiene and you know having confidence 764 00:27:14,340 --> 00:27:18,179 so talking about the generated model 765 00:27:16,080 --> 00:27:20,520 here is an example of one right as you 766 00:27:18,179 --> 00:27:23,940 can see it is very verbose there's a 767 00:27:20,520 --> 00:27:25,020 lot of inline variable declaration going 768 00:27:23,940 --> 00:27:27,000 on here 769 00:27:25,020 --> 00:27:28,679 and as you may recall one of the reasons 770 00:27:27,000 --> 00:27:30,120 why I wanted to do this as code is so that 771 00:27:28,679 --> 00:27:33,120 I could get the benefit of being able to 772 00:27:30,120 --> 00:27:35,100 do things like refactoring or finding 773 00:27:33,120 --> 
00:27:36,900 references 774 00:27:35,100 --> 00:27:38,220 what I would prefer is something that 775 00:27:36,900 --> 00:27:40,500 looks more like that 776 00:27:38,220 --> 00:27:42,419 because down below there 777 00:27:40,500 --> 00:27:45,240 I've highlighted a variable attribute 778 00:27:42,419 --> 00:27:47,400 company and because that attribute 779 00:27:45,240 --> 00:27:49,260 conceptually appears on 780 00:27:47,400 --> 00:27:51,000 multiple entities in my code because 781 00:27:49,260 --> 00:27:53,940 I've refactored it and extracted it as a 782 00:27:51,000 --> 00:27:57,179 variable I can now in my IDE right click 783 00:27:53,940 --> 00:27:58,860 say find references and on the very far 784 00:27:57,179 --> 00:28:02,039 right hand side you can see all the 785 00:27:58,860 --> 00:28:04,440 other places it has been referred to 786 00:28:02,039 --> 00:28:07,260 I've got a little Colab notebook here 787 00:28:04,440 --> 00:28:09,840 which demonstrates the use of rope to 788 00:28:07,260 --> 00:28:11,700 extract variables 789 00:28:09,840 --> 00:28:14,220 and refactor your code don't have the 790 00:28:11,700 --> 00:28:16,559 time to go through it all right now 791 00:28:14,220 --> 00:28:18,299 um I'm sharing this because while rope 792 00:28:16,559 --> 00:28:20,820 is a fantastic library its 793 00:28:18,299 --> 00:28:23,340 documentation as to how to apply it in 794 00:28:20,820 --> 00:28:26,340 terms of recipes I haven't really found 795 00:28:23,340 --> 00:28:28,200 a whole lot to help me out so have a 796 00:28:26,340 --> 00:28:31,919 look have a play but to explain this code 797 00:28:28,200 --> 00:28:33,720 very briefly I'm first finding in my 798 00:28:31,919 --> 00:28:35,940 abstract syntax tree the nodes because 799 00:28:33,720 --> 00:28:37,620 that's how your code is represented that 800 00:28:35,940 --> 00:28:39,900 I want to manipulate 801 00:28:37,620 --> 00:28:44,820 identifying their character 
position in 802 00:28:39,900 --> 00:28:47,279 the code and then I am telling um I'm 803 00:28:44,820 --> 00:28:49,500 defining a new variable name for it 804 00:28:47,279 --> 00:28:51,539 and then for each of these variables 805 00:28:49,500 --> 00:28:54,120 that I found I'm using rope to extract 806 00:28:51,539 --> 00:28:55,980 the variable telling rope to identify 807 00:28:54,120 --> 00:28:58,080 the changes it needs to make and then it 808 00:28:55,980 --> 00:28:59,760 makes the changes and thus I go from 809 00:28:58,080 --> 00:29:01,620 this very verbose code to this nicely 810 00:28:59,760 --> 00:29:04,100 condensed code 811 00:29:01,620 --> 00:29:06,480 so some further options as I said 812 00:29:04,100 --> 00:29:08,159 the code that is generated is quite 813 00:29:06,480 --> 00:29:09,419 verbose and there's some more things 814 00:29:08,159 --> 00:29:11,820 that I would like to implement to make 815 00:29:09,419 --> 00:29:13,620 it a little bit easier to work with 816 00:29:11,820 --> 00:29:15,120 um I won't go into all of this but if 817 00:29:13,620 --> 00:29:17,460 you have any thoughts about how I might 818 00:29:15,120 --> 00:29:20,700 achieve this I'd love to hear from you 819 00:29:17,460 --> 00:29:23,520 so some key takeaways you may be a data 820 00:29:20,700 --> 00:29:24,960 producer and you may not know it okay so 821 00:29:23,520 --> 00:29:25,980 make it worthwhile for people to 822 00:29:24,960 --> 00:29:28,620 register 823 00:29:25,980 --> 00:29:30,299 you know with you as a data consumer 824 00:29:28,620 --> 00:29:33,240 and a data contract is a way to go about 825 00:29:30,299 --> 00:29:34,740 doing that and then finally code is 826 00:29:33,240 --> 00:29:36,539 easier to refactor 827 00:29:34,740 --> 00:29:38,100 find references and generally maintain 828 00:29:36,539 --> 00:29:40,500 than the alternatives 829 00:29:38,100 --> 00:29:42,360 and with that thank you for your time 830 00:29:40,500 --> 00:29:44,539 and I look forward to 
catching up with 831 00:29:42,360 --> 00:29:44,539 you 832 00:29:49,799 --> 00:29:53,039 thank you Ryan that was really good uh 833 00:29:51,899 --> 00:29:54,840 yeah it turns out when you add people 834 00:29:53,039 --> 00:29:56,460 into the mix it gets a little bit 835 00:29:54,840 --> 00:29:58,380 squishier and messier and especially 836 00:29:56,460 --> 00:30:00,360 many of them it's a really good sort of 837 00:29:58,380 --> 00:30:01,919 actionable uh or you know solid 838 00:30:00,360 --> 00:30:03,840 strategies for digging into it we 839 00:30:01,919 --> 00:30:08,120 probably have time for maybe a quick one 840 00:30:03,840 --> 00:30:08,120 or two questions do we have any 841 00:30:10,620 --> 00:30:14,399 gonna end this 842 00:30:12,500 --> 00:30:17,360 can I 843 00:30:14,399 --> 00:30:17,360 there you go 844 00:30:27,659 --> 00:30:34,080 hi I was wondering with your contracts as 845 00:30:31,320 --> 00:30:37,500 code is it your intention that the 846 00:30:34,080 --> 00:30:39,179 code will test the inputs to a system in 847 00:30:37,500 --> 00:30:40,380 real time or just in the test 848 00:30:39,179 --> 00:30:42,120 environment or where are you going to 849 00:30:40,380 --> 00:30:44,460 deploy it potentially 850 00:30:42,120 --> 00:30:46,020 um so as I said my world is very batch 851 00:30:44,460 --> 00:30:48,840 driven right now 852 00:30:46,020 --> 00:30:50,640 we do have aspirations for moving some 853 00:30:48,840 --> 00:30:53,100 of our flows to be more event driven now 854 00:30:50,640 --> 00:30:54,720 so things that are critical say the 855 00:30:53,100 --> 00:30:56,279 activation of a gift card you know we 856 00:30:54,720 --> 00:30:57,899 want customers to be able to buy a gift 857 00:30:56,279 --> 00:31:00,059 card and use it immediately rather than 858 00:30:57,899 --> 00:31:01,919 waiting for the current you know 859 00:31:00,059 --> 00:31:04,380 uncomfortably long period of time that 860 00:31:01,919 --> 00:31:07,260 it takes to activate 
so in that scenario 861 00:31:04,380 --> 00:31:08,760 I do see a very real extension of what I 862 00:31:07,260 --> 00:31:11,340 have to be able to you know receive 863 00:31:08,760 --> 00:31:13,080 events test them validate them and you 864 00:31:11,340 --> 00:31:15,620 know pass them along should they they 865 00:31:13,080 --> 00:31:15,620 fit yeah 866 00:31:15,659 --> 00:31:20,279 cool alrighty thank you um we might call 867 00:31:18,179 --> 00:31:22,380 it there it's time for our break so 868 00:31:20,279 --> 00:31:25,020 firstly before we move on to the next 869 00:31:22,380 --> 00:31:28,679 things a little gift for you 870 00:31:25,020 --> 00:31:30,570 all right thank you very much and 871 00:31:28,679 --> 00:31:36,480 um one more round of applause for Ryan 872 00:31:30,570 --> 00:31:36,480 [Applause]
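
The meta schema described in the talk (an event has entities, an entity has properties, and a property pairs a semantic attribute with a schema source) and the simple SQL tests generated from it can be sketched roughly as below. This is only an illustration under stated assumptions, not the speaker's code: it uses standard-library dataclasses instead of pydantic so it runs without dependencies, and every class, field, event, table, and column name is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Attribute:
    """Semantic side of a property: what the value means (hypothetical model)."""
    name: str
    semantic_type: str       # e.g. "identifier" or "email_address"
    required: bool = False   # value must always be present
    unique: bool = False     # value must never repeat


@dataclass(frozen=True)
class Source:
    """Schema side of a property: where the value lives (relational case)."""
    table: str
    column: str


@dataclass(frozen=True)
class Property:
    attribute: Attribute
    source: Source


@dataclass
class Entity:
    name: str
    properties: List[Property] = field(default_factory=list)


@dataclass
class Event:
    name: str
    entities: List[Entity] = field(default_factory=list)


def generate_sql_checks(event: Event) -> List[str]:
    """Build one SQL query per declared constraint.

    Each query counts violations, so a result of 0 means the check passes.
    """
    checks: List[str] = []
    for entity in event.entities:
        for prop in entity.properties:
            table, column = prop.source.table, prop.source.column
            if prop.attribute.required:
                checks.append(
                    f"SELECT COUNT(*) FROM {table} WHERE {column} IS NULL"
                )
            if prop.attribute.unique:
                checks.append(
                    "SELECT COUNT(*) FROM ("
                    f"SELECT {column} FROM {table} "
                    f"GROUP BY {column} HAVING COUNT(*) > 1) AS duplicates"
                )
    return checks


# Hypothetical contract content: one event, one entity, one property.
order_number = Property(
    Attribute("order_number", "identifier", required=True, unique=True),
    Source("purchase_orders", "order_number"),
)
event = Event(
    "purchase_order_raised",
    [Entity("purchase_order", [order_number])],
)
checks = generate_sql_checks(event)
for query in checks:
    print(query)
```

Keeping the attribute (semantics) and the source (schema) as separate objects is what would let the same attribute be reused across entities and located with find references, which is the benefit the talk attributes to doing this as code. Swapping the dataclasses for pydantic models, or the SQL strings for Great Expectations or Soda checks, should not change the overall shape.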