The last few sessions today — the next one is Danh Phan. He's going to be joining us remotely... here we go, by the power of technology. Danh's keen on applying data engineering and machine learning tools to solve real-world business problems, and he's an open source contributor. Today he's going to be talking about building trusted data pipelines with Dagster, dbt and DuckDB, which are three technologies I'm personally super interested in. So would you give Danh a warm welcome, please.

Thank you, everyone. Can you hear me?

Yes, we can.

Thank you for coming to my talk on building 3D-tested data pipelines with Dagster, dbt and DuckDB — the "3D" here being the D in Dagster, the D in dbt and the D in DuckDB. Sorry I'm not able to join the conference face-to-face, and sorry my voice is not good today. Other than that, today's slides and the code demo are available online at the GitHub link shown here, so you can grab them later and play around if you have time.

A bit about myself: I'm currently working as a senior data engineer in the investment data systems team at HESTA — we're a super fund. I'm also an open source contributor, and I love camping. Previously I was a PhD candidate at Monash University doing machine learning for transport, and before that I worked as a data analyst, database designer and web developer. All views and opinions in this talk are my own.

All right. In this talk, first I'll cover the context of data engineering and why we need to test our data pipelines, and then I'll show a demo of how we can implement this using the 3D stack.

As data engineers, our job is to build high-quality data pipelines, whether ELT or ETL.
The general idea is that we move data from different sources: we extract it and load it somewhere — whether that's a data lake like S3 or other storage — and from there we move it into a data warehouse. In the warehouse we can use different tools to transform the data from raw data into clean data, and once we have clean data we can serve it to downstream applications, like BI dashboards or our machine learning models.

When we deploy our data pipelines to production, we want to strengthen them, because we know that over time they are likely to degrade or even break. That's why we really need robust and reliable data pipelines. We've had many interesting talks on these topics — on building robust, reliable pipelines using, say, Apache Spark (there was an interesting talk earlier on PySpark), and there are other talks on using Databricks and similar tools. They're very interesting, and I recommend watching them later if you have time.

The lesson from those talks is that we need to test our data pipelines — this is very important — and we also need to apply DevOps practices: code versioning with GitHub, for example, and CI/CD; infrastructure as code, using something like Terraform or CloudFormation; and containerised environments with Docker, so that development and production environments match. We should also add things like data lineage and data contracts with the upstream.
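As a concrete illustration of the CI/CD point — a minimal sketch only, assuming a dbt project laid out like the one later in this talk; the workflow file, paths and version tags are illustrative, not from the talk:

```yaml
# .github/workflows/ci.yml — run the pipeline's tests on every push (sketch)
name: pipeline-ci
on: [push]

jobs:
  dbt-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      # Assumes the DuckDB adapter used later in this talk, and a committed CI profile.
      - run: pip install dbt-duckdb
      - run: dbt deps --project-dir dbt
      - run: dbt build --project-dir dbt --profiles-dir dbt
```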
But in my opinion, a robust data pipeline is not enough. Robust and reliable pipelines — we absolutely need them — but in addition we also need to take control of our data quality. Because in the end, what we're building in our data lake or data warehouse is data that we serve to consumers: people like data analysts, BI developers and data scientists. So a trusted data pipeline, in addition to being robust, also needs to focus on data quality.

Here's the distinction: when we talk about a robust pipeline, we're talking about an engineering focus — strengthening the pipeline so it doesn't leak or break once deployed. With a trusted pipeline, we also focus on the data consumer: if data is like water, we want clean water — clean data.

If you work as a data engineer, you will often hear data-quality questions from other people — from data analysts, from data scientists. Has anybody asked you things like: does this table match the upstream table? Is this column always populated? How recently did our data arrive? Are the values in this column within the expected range? What are the summary statistics for that column? And so on. These questions are really difficult for data engineers to answer if we don't have good data-quality testing and measurement in place. So it is really critical that we test and measure our data quality.
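To make those questions concrete, here is a minimal sketch of the kind of ad-hoc SQL they turn into — the table and column names (raw_orders, stg_orders, order_date, status) are placeholders, not from the talk:

```sql
-- Does the staged table match the upstream raw table?
SELECT
  (SELECT COUNT(*) FROM raw_orders) AS raw_rows,
  (SELECT COUNT(*) FROM stg_orders) AS staged_rows;

-- How recently did data arrive?
SELECT MAX(order_date) AS latest_order FROM raw_orders;

-- Is a column always populated?
SELECT AVG(CASE WHEN status IS NULL THEN 0 ELSE 1 END) AS not_null_proportion
FROM raw_orders;
```

The rest of the talk is about declaring checks like these once, as tests, instead of re-answering them by hand.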
If we look at the modern data stack — this is just one example — we have an ingestion layer, using Python, Spark, or other tools like Airbyte, and we use this ingestion layer to move the data from the different sources into a data warehouse. The data warehouse can be Snowflake, Redshift, Databricks, or something else. As we move the data in, we can use a tool like dbt for data transformation, transforming the data from the landing or raw layer through staging and mart layers, where the mart layer is the clean data that can be served to BI tools like Tableau or Power BI, or used for data science and machine learning models in Python or R. We can also have an orchestration tool like Airflow or Dagster, and observability tooling like Soda or Monte Carlo. But the core of it, as you can see, is the ingestion layer; the warehouse layer, where the computation happens on some engine; the tool for data transformation; and then some orchestration.

In this talk we're going to walk through a small demo. For ingestion we'll use Python to move the data from Amazon S3 to a local folder, and from there import it into a DuckDB database. Then we'll use dbt to transform the raw data into a staging layer and then a mart layer. Within that we'll use dbt_utils and dbt_expectations, two dbt packages, for testing our data quality. And on top of that we have Dagster for orchestration. The nice thing about Dagster here: in dbt we do the transformation in SQL, while the ingestion layer is a pipeline in Python, and Dagster can work with both — it converts the Python pipeline and the SQL transformation pipeline into software-defined assets. I'll show you that in detail later.
So what is the 3D stack we have here? The first one, DuckDB, is an in-process SQL database management system. It is serverless — if you know SQLite, it's similar — so we don't need to run any kind of server the way we would for PostgreSQL or MySQL. And it's great because it's vectorised and column-oriented, so it's fast and very well suited to analytics workloads. We can use DuckDB as a small data warehouse, as in our case. Sitting on top of DuckDB is dbt, which we use for data transformation, and with it the various dbt packages for controlling and testing our data quality. And we also have Dagster, with its software-defined assets, which can manage and combine all of this into a very nice data lineage view of global assets — which I'll show you later.
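A minimal sketch of what "in-process and serverless" means in practice — the file and table names here are illustrative, not the demo's:

```python
import duckdb

# The whole database is a single local file; there is no server to run.
con = duckdb.connect("demo.duckdb")
con.execute("CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, status TEXT)")
con.execute("INSERT INTO orders VALUES (1, 'completed'), (2, 'shipped')")

# Columnar, vectorised execution makes analytical queries like this fast.
print(con.execute("SELECT status, COUNT(*) FROM orders GROUP BY status").fetchall())
con.close()
```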
On testing data quality, from a high level: between tables, we may want to test relationships — for example, was this product bought by this customer — or the relationship between a raw table and its clean table, such as whether they have the same shape. Within one table, we can check things like row count, column count and column existence. And at the column level, the tests depend on the kind of data in that column: if it's text, we can test its pattern; if it's categorical text, we can test whether values belong to a list of accepted values; if it's numeric, we can test the mean, max, median, or look for outliers; and if we have dates in the table, we can test recency and range. Alongside these there are generic tests: is this column not null, is it unique. We can use the dbt testing packages to cover all of these kinds of tests in our data pipelines.

The first, basic layer: dbt ships with four generic tests — unique, not_null, accepted_values and relationships. Here is one example of how to write a test. The model here is the orders table, and it has a column named order_id, which needs to be unique and not null, as the primary key. Another column is called status, and the status of an order should be one of four values. And a third column, customer_id, should have a relationship to the customers table, on its id column. That's one example of the four generic tests in dbt.
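Written out as dbt YAML, that example looks roughly like this — a sketch: the four accepted status values weren't legible from the talk, so the list here is a placeholder:

```yaml
# models/schema.yml (sketch)
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'returned']  # placeholder list
      - name: customer_id
        tests:
          - relationships:
              to: ref('customers')
              field: id
```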
But if we want to test more complex things, we need other packages, like dbt_utils and dbt_expectations. To use them in dbt, we just include the package in the packages.yml file, install it with the dbt deps command, and then run the tests with dbt test.

dbt_utils is a package from the dbt Labs team, and it ships with many more tests: generic ones like equal_rowcount, equality, at_least_one and not_constant, plus not_null_proportion, which is quite useful; and it also has various relationship and even cardinality tests. You can find more information at the link here, and in the references later as well.

For dbt_expectations, the idea is similar to the Great Expectations package in Python, and its tests are divided into different groups. For table shape, we can test whether a column exists in the table, the column count, row count and so on. For missing values and unique values, there are various tests as well. For a numeric column, we can test whether the values fall between given bounds. For a string or text column, we can test whether it matches some kind of pattern. And there are also some basic tests on distributions, like whether the column values fall within n standard deviations, which can be used to test for outliers in a numeric column. There is a lot more in the documentation, and I recommend you look at it at a later time.
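Pulling in both packages looks roughly like this — the version pins below are illustrative, not from the talk:

```yaml
# packages.yml (sketch)
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1        # illustrative pin
  - package: calogica/dbt_expectations
    version: 0.10.1       # illustrative pin
```

After editing packages.yml, dbt deps installs the packages, and their tests can then be referenced in schema files just like the built-in generic tests.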
So what we're going to build is this pipeline, where we extract the data — three tables: payments, orders and customers — using Python, and load them into raw tables in DuckDB. From there we use dbt to transform those three tables into a staging layer, and from the staging layer we transform them into the mart layer, which is the orders table and the customers table at the end. And this is how it looks as a data asset graph in Dagster, where we can view both the SQL pipeline and the Python pipeline together. Here's a closer view, which I'll show you in the demo as well — you can see that this part is from dbt, so it has the dbt icon on it.

All right, so now it's time for the demo; I'll switch over to the code. This code is available on GitHub, and you can find the link in the slides as well. Just a reminder: the pipeline we're going to build ingests with Python, loads into DuckDB, transforms with dbt, and then uses Dagster for orchestration. On the left here you can see we have two main folders, one for dbt and one for Dagster. First, let's go to dbt.

The data is three tables: raw_payments, raw_orders and raw_customers. After we download them, they'll be inside this folder. A quick look at the three tables: this one is the customers, with an id, first name and last name; we have the orders placed by the customers, with the user id, order id and order date; and then we have the payments for those orders.

Inside the dbt project we define the profile, which you can see here, and it links to the DuckDB database — which is actually just a single file, this .db file here. Later we can also have different targets for production, and deploy against a bigger data warehouse like Snowflake or Redshift. And as mentioned before, for testing our data quality we install the two packages, dbt_utils and dbt_expectations, via packages.yml.

Let's go into the models. We have the staging area, where we have the three staging tables, and in the schema file we can define the tests for our quality checks — for example, customer_id needs to be unique and not null, and the customer_id on orders must exist in the customers table. Likewise, the status in the staging orders table should be one of these five values — placed, shipped, completed and so on.

Similarly, in the mart we can also control our data quality. For example, here is a test on the first_name column of the customers table: it needs to be not null for at least 98 percent of rows. And there are a few other tests here: I want to check that the customers mart table has a similar shape to the staging customers table — sorry, a similar row count — and I also want to check that the table has seven columns.
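Pieced together, those mart-side checks look roughly like this in the schema file — a sketch following the demo's description; the model and column names are as described, the rest is reconstructed:

```yaml
# models/marts/schema.yml (sketch)
version: 2

models:
  - name: customers
    tests:
      - dbt_utils.equal_rowcount:
          compare_model: ref('stg_customers')
      - dbt_expectations.expect_table_column_count_to_equal:
          value: 7
    columns:
      - name: first_name
        tests:
          - dbt_utils.not_null_proportion:
              at_least: 0.98
```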
So now, if we want to test it, we just run dbt test, and it runs all the tests we've defined in our schema files. You can see we have 18 tests here, and they pass. For example, if I change the expected column count from 7 to 10 and run dbt test selecting the customers table, it will run the tests only for the customers table. You can see we have six tests and one error — the column-count-equals-10 test, when the table actually has seven columns. So if I change it back to 7 and test customers again, all six tests should succeed. Yep.

Another nice thing in dbt is that it can generate our documentation and graphs. After you generate the documentation, you can serve the docs — by the way, all these commands are already in the README as well. So, looking at the browser — can you hear me? — if you click on that link, it shows us the documentation, with the table information here as well. Another thing it gives us: if we click on this one and open the graph, we can see the pipeline. But as you can see, this pipeline is only the SQL in dbt — we cannot see the upstream pipeline in Python. And that's why we add Dagster.
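For reference, the dbt commands used in this part of the demo — all standard dbt CLI; the model name in the --select flag is the one from the demo:

```sh
dbt deps                     # install dbt_utils and dbt_expectations
dbt test                     # run all 18 schema tests
dbt test --select customers  # run only the customers model's tests
dbt docs generate            # build the documentation site
dbt docs serve               # serve the docs and lineage graph locally
```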
So let's switch to the second folder, dagster, and go into it. In here we define the software-defined assets, and the main code is here. But before that, let me start the Dagster web server, which runs the UI. Inside the Dagster code we define the assets, and we have two main types. The first is the dbt assets, which we load with load_assets_from_dbt_project, specifying the path to the dbt project folder — this one here. Dagster then reads the metadata from the dbt project and exposes each model as an asset. The other part is the Python assets, which load the data from S3 and import it into DuckDB — these define the raw data here. You can see it's just simple code that extracts the raw data: for example, for raw_orders, we read the data from S3, save it here, and then create the raw_orders table and insert it into the DuckDB file. In Dagster, we just need to add a decorator called asset, and it will pick that up and treat the whole pipeline as assets.
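A condensed sketch of those two asset types — the bucket path, file names and project paths below are illustrative stand-ins, not the repo's actual values (see the GitHub link for the real code):

```python
import duckdb
import pandas as pd
from dagster import asset
from dagster_dbt import load_assets_from_dbt_project

DUCKDB_FILE = "demo.duckdb"  # assumed name for the single-file database

@asset
def raw_orders() -> None:
    # Extract from S3 (placeholder path; reading s3:// URLs needs s3fs installed)
    # and land the table in DuckDB.
    df = pd.read_csv("s3://example-bucket/raw_orders.csv")
    con = duckdb.connect(DUCKDB_FILE)
    con.register("df", df)  # expose the DataFrame to SQL
    con.execute("CREATE OR REPLACE TABLE raw_orders AS SELECT * FROM df")
    con.close()

# Every dbt model in the project becomes a Dagster asset,
# downstream of the raw tables it sources from.
dbt_assets = load_assets_from_dbt_project(
    project_dir="../dbt", profiles_dir="../dbt"
)
```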
After we run Dagster, we can view it in the web UI here. If we click on the global asset lineage, you can see that we have our whole data pipeline, with both the Python pipeline and the SQL pipeline in dbt, and we can also materialise all these tables, which will actually run the whole pipeline in this graph, and then we can see the run as well. Also, if we click on the different tables as they start materialising, we can see all the details, metadata and other information here, along with the run log for the different steps we have.

And lastly — while we're waiting for this one to finish, I think it will need a couple of minutes — let's go back to the slides. I've just shown you one example of a way to implement a data pipeline with both Python and SQL parts, adding different tests with dbt_utils and dbt_expectations to control the data quality. But in the end, building trust is a journey, and we will need a good roadmap. Most important, in my opinion, is very strong leadership with the right data strategy — and, as a previous speaker also mentioned, data contracts and buy-in from the different data stakeholders and data consumers. From the technical perspective, we also need a well-designed data schema, and we need to control data quality upstream, on the way through — in the data warehouse or data lake — and downstream as well. Here are a couple of references I used for the talk. Thank you very much — and by the way, our team is hiring for a data engineer position.

Thank you, and let's come back to —

Thank you so much, Danh. I'm just looking here to see if
the Dagster run has already finished — if it's finished, it should show up right here... yeah, like that. That's great, thank you so much. These are all tools I've been really intrigued by — dbt especially has been one that I haven't managed to look into as much as I'd like to, so thank you so much for that; that was great. Obviously we're out of time now, so we don't have time for questions, but the next talk will be in 10 minutes, at ten past four. Thanks a lot, Danh, and we'll hopefully see you in person sometime soon. See ya.