The last few sessions today — the next one is Danh Phan. He's going to be joining us remotely... here we go, by the power of technology. Danh's keen on applying data engineering and machine learning tools to solve real-world business problems, and he's an open source contributor. Today he's going to be talking about building trusted data pipelines with Dagster, dbt and DuckDB, which are three technologies I'm personally super interested in. So would you give Danh a warm welcome, please.

Thank you, everyone. Can you hear me?

Yes, we can.

Thank you for coming to my talk on building 3D-tested data pipelines with Dagster, dbt and DuckDB — the "3D" here being the D in Dagster, the D in dbt and the D in DuckDB. Sorry I'm not able to join the conference face-to-face, and sorry my voice is not good today. Other than that, today's slides and the code demo are available online at the GitHub link shown here, so you can grab them later and play around if you have time.

A bit about myself: I'm currently working as a senior data engineer in the investment data systems team at HESTA — we're a super fund. I'm also an open source contributor, and I love camping. Previously I was a PhD candidate at Monash University doing machine learning for transport, and before that I worked as a data analyst, database designer and web developer. All views and opinions in this talk are my own.

All right. In this talk, first I'll cover the context of data engineering and why we need to test our data pipelines, and then I'll show a demo of how we can implement this using the 3D stack.

As data engineers, our job is to build high-quality data pipelines, whether ELT or ETL.
The general idea is that we move data from different sources: we extract it and load it somewhere — whether that's a data lake like S3 or other storage — and from there we move it into a data warehouse. In the warehouse we can use different tools to transform the data from raw data into clean data, and once we have clean data we can serve it to downstream applications, like BI dashboards or our machine learning models.

When we deploy our data pipelines to production, we want to strengthen them, because we know that over time they are likely to degrade or even break. That's why we really need robust and reliable data pipelines. We've had many interesting talks on these topics — on building robust, reliable pipelines using, say, Apache Spark (there was an interesting talk earlier on PySpark), and there are other talks on using Databricks and similar tools. They're very interesting, and I recommend watching them later if you have time.

The lesson from those talks is that we need to test our data pipelines — this is very important — and we also need to apply DevOps practices: code versioning with GitHub, for example, and CI/CD; infrastructure as code, using something like Terraform or CloudFormation; and containerised environments with Docker, so that development and production environments match. We should also add things like data lineage and data contracts with the upstream.
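As a concrete illustration of the CI/CD point — a minimal sketch only, assuming a dbt project laid out like the one later in this talk; the workflow file, paths and version tags are illustrative, not from the talk:

```yaml
# .github/workflows/ci.yml — run the pipeline's tests on every push (sketch)
name: pipeline-ci
on: [push]

jobs:
  dbt-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.11"
      # Assumes the DuckDB adapter used later in this talk, and a committed CI profile.
      - run: pip install dbt-duckdb
      - run: dbt deps --project-dir dbt
      - run: dbt build --project-dir dbt --profiles-dir dbt
```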
But in my opinion, a robust data pipeline is not enough. Robust and reliable pipelines — we absolutely need them — but in addition we also need to take control of our data quality. Because in the end, what we're building in our data lake or data warehouse is data that we serve to consumers: people like data analysts, BI developers and data scientists. So a trusted data pipeline, in addition to being robust, also needs to focus on data quality.

Here's the distinction: when we talk about a robust pipeline, we're talking about an engineering focus — strengthening the pipeline so it doesn't leak or break once deployed. With a trusted pipeline, we also focus on the data consumer: if data is like water, we want clean water — clean data.

If you work as a data engineer, you will often hear data-quality questions from other people — from data analysts, from data scientists. Has anybody asked you things like: does this table match the upstream table? Is this column always populated? How recently did our data arrive? Are the values in this column within the expected range? What are the summary statistics for that column? And so on. These questions are really difficult for data engineers to answer if we don't have good data-quality testing and measurement in place. So it is really critical that we test and measure our data quality.
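To make those questions concrete, here is a minimal sketch of the kind of ad-hoc SQL they turn into — the table and column names (raw_orders, stg_orders, order_date, status) are placeholders, not from the talk:

```sql
-- Does the staged table match the upstream raw table?
SELECT
  (SELECT COUNT(*) FROM raw_orders) AS raw_rows,
  (SELECT COUNT(*) FROM stg_orders) AS staged_rows;

-- How recently did data arrive?
SELECT MAX(order_date) AS latest_order FROM raw_orders;

-- Is a column always populated?
SELECT AVG(CASE WHEN status IS NULL THEN 0 ELSE 1 END) AS not_null_proportion
FROM raw_orders;
```

The rest of the talk is about declaring checks like these once, as tests, instead of re-answering them by hand.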
If we look at the modern data stack — this is just one example — we have an ingestion layer, using Python, Spark, or other tools like Airbyte, and we use this ingestion layer to move the data from the different sources into a data warehouse. The data warehouse can be Snowflake, Redshift, Databricks, or something else. As we move the data in, we can use a tool like dbt for data transformation, transforming the data from the landing or raw layer through staging and mart layers, where the mart layer is the clean data that can be served to BI tools like Tableau or Power BI, or used for data science and machine learning models in Python or R. We can also have an orchestration tool like Airflow or Dagster, and observability tooling like Soda or Monte Carlo. But the core of it, as you can see, is the ingestion layer; the warehouse layer, where the computation happens on some engine; the tool for data transformation; and then some orchestration.

In this talk we're going to walk through a small demo. For ingestion we'll use Python to move the data from Amazon S3 to a local folder, and from there import it into a DuckDB database. Then we'll use dbt to transform the raw data into a staging layer and then a mart layer. Within that we'll use dbt_utils and dbt_expectations, two dbt packages, for testing our data quality. And on top of that we have Dagster for orchestration. The nice thing about Dagster here: in dbt we do the transformation in SQL, while the ingestion layer is a pipeline in Python, and Dagster can work with both — it converts the Python pipeline and the SQL transformation pipeline into software-defined assets. I'll show you that in detail later.
So what is the 3D stack we have here? The first one, DuckDB, is an in-process SQL database management system. It is serverless — if you know SQLite, it's similar — so we don't need to run any kind of server the way we would for PostgreSQL or MySQL. And it's great because it's vectorised and column-oriented, so it's fast and very well suited to analytics workloads. We can use DuckDB as a small data warehouse, as in our case. Sitting on top of DuckDB is dbt, which we use for data transformation, and with it the various dbt packages for controlling and testing our data quality. And we also have Dagster, with its software-defined assets, which can manage and combine all of this into a very nice data lineage view of global assets — which I'll show you later.
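A minimal sketch of what "in-process and serverless" means in practice — the file and table names here are illustrative, not the demo's:

```python
import duckdb

# The whole database is a single local file; there is no server to run.
con = duckdb.connect("demo.duckdb")
con.execute("CREATE TABLE IF NOT EXISTS orders (order_id INTEGER, status TEXT)")
con.execute("INSERT INTO orders VALUES (1, 'completed'), (2, 'shipped')")

# Columnar, vectorised execution makes analytical queries like this fast.
print(con.execute("SELECT status, COUNT(*) FROM orders GROUP BY status").fetchall())
con.close()
```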
On testing data quality, from a high level: between tables, we may want to test relationships — for example, was this product bought by this customer — or the relationship between a raw table and its clean table, such as whether they have the same shape. Within one table, we can check things like row count, column count and column existence. And at the column level, the tests depend on the kind of data in that column: if it's text, we can test its pattern; if it's categorical text, we can test whether values belong to a list of accepted values; if it's numeric, we can test the mean, max, median, or look for outliers; and if we have dates in the table, we can test recency and range. Alongside these there are generic tests: is this column not null, is it unique. We can use the dbt testing packages to cover all of these kinds of tests in our data pipelines.

The first, basic layer: dbt ships with four generic tests — unique, not_null, accepted_values and relationships. Here is one example of how to write a test. The model here is the orders table, and it has a column named order_id, which needs to be unique and not null, as the primary key. Another column is called status, and the status of an order should be one of four values. And a third column, customer_id, should have a relationship to the customers table, on its id column. That's one example of the four generic tests in dbt.
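Written out as dbt YAML, that example looks roughly like this — a sketch: the four accepted status values weren't legible from the talk, so the list here is a placeholder:

```yaml
# models/schema.yml (sketch)
version: 2

models:
  - name: orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: status
        tests:
          - accepted_values:
              values: ['placed', 'shipped', 'completed', 'returned']  # placeholder list
      - name: customer_id
        tests:
          - relationships:
              to: ref('customers')
              field: id
```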
But if we want to test more complex things, we need other packages, like dbt_utils and dbt_expectations. To use them in dbt, we just include the package in the packages.yml file, install it with the dbt deps command, and then run the tests with dbt test.

dbt_utils is a package from the dbt Labs team, and it ships with many more tests: generic ones like equal_rowcount, equality, at_least_one and not_constant, plus not_null_proportion, which is quite useful; and it also has various relationship and even cardinality tests. You can find more information at the link here, and in the references later as well.

For dbt_expectations, the idea is similar to the Great Expectations package in Python, and its tests are divided into different groups. For table shape, we can test whether a column exists in the table, the column count, row count and so on. For missing values and unique values, there are various tests as well. For a numeric column, we can test whether the values fall between given bounds. For a string or text column, we can test whether it matches some kind of pattern. And there are also some basic tests on distributions, like whether the column values fall within n standard deviations, which can be used to test for outliers in a numeric column. There is a lot more in the documentation, and I recommend you look at it at a later time.
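Pulling in both packages looks roughly like this — the version pins below are illustrative, not from the talk:

```yaml
# packages.yml (sketch)
packages:
  - package: dbt-labs/dbt_utils
    version: 1.1.1        # illustrative pin
  - package: calogica/dbt_expectations
    version: 0.10.1       # illustrative pin
```

After editing packages.yml, dbt deps installs the packages, and their tests can then be referenced in schema files just like the built-in generic tests.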
So what we're going to build is this pipeline, where we extract the data — three tables: payments, orders and customers — using Python, and load them into raw tables in DuckDB. From there we use dbt to transform those three tables into a staging layer, and from the staging layer we transform them into the mart layer, which is the orders table and the customers table at the end. And this is how it looks as a data asset graph in Dagster, where we can view both the SQL pipeline and the Python pipeline together. Here's a closer view, which I'll show you in the demo as well — you can see that this part is from dbt, so it has the dbt icon on it.

All right, so now it's time for the demo; I'll switch over to the code. This code is available on GitHub, and you can find the link in the slides as well. Just a reminder: the pipeline we're going to build ingests with Python, loads into DuckDB, transforms with dbt, and then uses Dagster for orchestration. On the left here you can see we have two main folders, one for dbt and one for Dagster. First, let's go to dbt.

The data is three tables: raw_payments, raw_orders and raw_customers. After we download them, they'll be inside this folder. A quick look at the three tables: this one is the customers, with an id, first name and last name; we have the orders placed by the customers, with the user id, order id and order date; and then we have the payments for those orders.

Inside the dbt project we define the profile, which you can see here, and it links to the DuckDB database — which is actually just a single file, this .db file here. Later we can also have different targets for production, and deploy against a bigger data warehouse like Snowflake or Redshift. And as mentioned before, for testing our data quality we install the two packages, dbt_utils and dbt_expectations, via packages.yml.

Let's go into the models. We have the staging area, where we have the three staging tables, and in the schema file we can define the tests for our quality checks — for example, customer_id needs to be unique and not null, and the customer_id on orders must exist in the customers table. Likewise, the status in the staging orders table should be one of these five values — placed, shipped, completed and so on.

Similarly, in the mart we can also control our data quality. For example, here is a test on the first_name column of the customers table: it needs to be not null for at least 98 percent of rows. And there are a few other tests here: I want to check that the customers mart table has a similar shape to the staging customers table — sorry, a similar row count — and I also want to check that the table has seven columns.
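Pieced together, those mart-side checks look roughly like this in the schema file — a sketch following the demo's description; the model and column names are as described, the rest is reconstructed:

```yaml
# models/marts/schema.yml (sketch)
version: 2

models:
  - name: customers
    tests:
      - dbt_utils.equal_rowcount:
          compare_model: ref('stg_customers')
      - dbt_expectations.expect_table_column_count_to_equal:
          value: 7
    columns:
      - name: first_name
        tests:
          - dbt_utils.not_null_proportion:
              at_least: 0.98
```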
So now, if we want to test it, we just run dbt test, and it runs all the tests we've defined in our schema files. You can see we have 18 tests here, and they pass. For example, if I change the expected column count from 7 to 10 and run dbt test selecting the customers table, it will run the tests only for the customers table. You can see we have six tests and one error — the column-count-equals-10 test, when the table actually has seven columns. So if I change it back to 7 and test customers again, all six tests should succeed. Yep.

Another nice thing in dbt is that it can generate our documentation and graphs. After you generate the documentation, you can serve the docs — by the way, all these commands are already in the README as well. So, looking at the browser — can you hear me? — if you click on that link, it shows us the documentation, with the table information here as well. Another thing it gives us: if we click on this one and open the graph, we can see the pipeline. But as you can see, this pipeline is only the SQL in dbt — we cannot see the upstream pipeline in Python. And that's why we add Dagster.
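For reference, the dbt commands used in this part of the demo — all standard dbt CLI; the model name in the --select flag is the one from the demo:

```sh
dbt deps                     # install dbt_utils and dbt_expectations
dbt test                     # run all 18 schema tests
dbt test --select customers  # run only the customers model's tests
dbt docs generate            # build the documentation site
dbt docs serve               # serve the docs and lineage graph locally
```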
So let's switch to the second folder, dagster, and go into it. In here we define the software-defined assets, and the main code is here. But before that, let me start the Dagster web server, which runs the UI. Inside the Dagster code we define the assets, and we have two main types. The first is the dbt assets, which we load with load_assets_from_dbt_project, specifying the path to the dbt project folder — this one here. Dagster then reads the metadata from the dbt project and exposes each model as an asset. The other part is the Python assets, which load the data from S3 and import it into DuckDB — these define the raw data here. You can see it's just simple code that extracts the raw data: for example, for raw_orders, we read the data from S3, save it here, and then create the raw_orders table and insert it into the DuckDB file. In Dagster, we just need to add a decorator called asset, and it will pick that up and treat the whole pipeline as assets.
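A condensed sketch of those two asset types — the bucket path, file names and project paths below are illustrative stand-ins, not the repo's actual values (see the GitHub link for the real code):

```python
import duckdb
import pandas as pd
from dagster import asset
from dagster_dbt import load_assets_from_dbt_project

DUCKDB_FILE = "demo.duckdb"  # assumed name for the single-file database

@asset
def raw_orders() -> None:
    # Extract from S3 (placeholder path; reading s3:// URLs needs s3fs installed)
    # and land the table in DuckDB.
    df = pd.read_csv("s3://example-bucket/raw_orders.csv")
    con = duckdb.connect(DUCKDB_FILE)
    con.register("df", df)  # expose the DataFrame to SQL
    con.execute("CREATE OR REPLACE TABLE raw_orders AS SELECT * FROM df")
    con.close()

# Every dbt model in the project becomes a Dagster asset,
# downstream of the raw tables it sources from.
dbt_assets = load_assets_from_dbt_project(
    project_dir="../dbt", profiles_dir="../dbt"
)
```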
After we run Dagster, we can view it in the web UI here. If we click on the global asset lineage, you can see that we have our whole data pipeline, with both the Python pipeline and the SQL pipeline in dbt, and we can also materialise all these tables, which will actually run the whole pipeline in this graph, and then we can see the run as well. Also, if we click on the different tables as they start materialising, we can see all the details, metadata and other information here, along with the run log for the different steps we have.

And lastly — while we're waiting for this one to finish, I think it will need a couple of minutes — let's go back to the slides. I've just shown you one example of a way to implement a data pipeline with both Python and SQL parts, adding different tests with dbt_utils and dbt_expectations to control the data quality. But in the end, building trust is a journey, and we will need a good roadmap. Most important, in my opinion, is very strong leadership with the right data strategy — and, as a previous speaker also mentioned, data contracts and buy-in from the different data stakeholders and data consumers. From the technical perspective, we also need a well-designed data schema, and we need to control data quality upstream, on the way through — in the data warehouse or data lake — and downstream as well. Here are a couple of references I used for the talk. Thank you very much — and by the way, our team is hiring for a data engineer position.

Thank you, and let's come back to —

Thank you so much, Danh. I'm just looking here to see if
the Dagster run has already finished — if it's finished, it should show up right here... yeah, like that. That's great, thank you so much. These are all tools I've been really intrigued by — dbt especially has been one that I haven't managed to look into as much as I'd like to, so thank you so much for that; that was great. Obviously we're out of time now, so we don't have time for questions, but the next talk will be in 10 minutes, at ten past four. Thanks a lot, Danh, and we'll hopefully see you in person sometime soon. See ya.