2
00:00:09,500 --> 00:00:14,820
next session title is developing culture

3
00:00:12,840 --> 00:00:16,920
to write reliable and

4
00:00:14,820 --> 00:00:19,560
performant services at scale. Our

5
00:00:16,920 --> 00:00:22,560
speaker is Harshit.

6
00:00:19,560 --> 00:00:24,180
um go ahead take it away

7
00:00:22,560 --> 00:00:27,840
okay

8
00:00:24,180 --> 00:00:31,140
um hi everyone so I hope everyone is

9
00:00:27,840 --> 00:00:33,600
doing well today I will be speaking on

10
00:00:31,140 --> 00:00:35,940
developing culture to write reliable and

11
00:00:33,600 --> 00:00:38,520
performant services at scale, where I

12
00:00:35,940 --> 00:00:40,559
will be discussing

13
00:00:38,520 --> 00:00:41,879
observability like how it plays a

14
00:00:40,559 --> 00:00:44,160
crucial role in your software

15
00:00:41,879 --> 00:00:47,460
development cycle and how you can put

16
00:00:44,160 --> 00:00:50,160
that observability in place using Python,

17
00:00:47,460 --> 00:00:54,180
where your services are built using

18
00:00:50,160 --> 00:00:56,579
Python.

19
00:00:54,180 --> 00:00:58,739
I will also be highlighting the

20
00:00:56,579 --> 00:01:02,160
kind of culture which you can establish

21
00:00:58,739 --> 00:01:04,799
within your team or the organization

22
00:01:02,160 --> 00:01:08,640
so before starting the talk, a

23
00:01:04,799 --> 00:01:10,080
brief introduction about myself:

24
00:01:08,640 --> 00:01:14,060
currently I am a software engineer

25
00:01:10,080 --> 00:01:16,700
working at Blinkit. Blinkit is an

26
00:01:14,060 --> 00:01:19,500
Indian e-commerce company that is

27
00:01:16,700 --> 00:01:22,799
delivering products in minutes I am from

28
00:01:19,500 --> 00:01:25,320
India and I am a tech speaker and have

29
00:01:22,799 --> 00:01:26,820
been part of multiple conferences in the

30
00:01:25,320 --> 00:01:30,000
past as a speaker

31
00:01:26,820 --> 00:01:33,060
I'm an open source contributor, and also

32
00:01:30,000 --> 00:01:34,619
like in my free time I try to explore uh

33
00:01:33,060 --> 00:01:35,880
mostly the cloud native based open

34
00:01:34,619 --> 00:01:38,400
source organizations and try to

35
00:01:35,880 --> 00:01:40,860
contribute to them. Apart from that, I was

36
00:01:38,400 --> 00:01:42,360
a Google Summer of Code

37
00:01:40,860 --> 00:01:43,860
student during my undergraduate

38
00:01:42,360 --> 00:01:47,820
studies

39
00:01:43,860 --> 00:01:51,000
yeah so let's begin with uh first like

40
00:01:47,820 --> 00:01:53,399
why do we write software

41
00:01:51,000 --> 00:01:57,060
um we write software to basically solve

42
00:01:53,399 --> 00:01:59,399
problems right uh by solving problems we

43
00:01:57,060 --> 00:02:00,540
are kind of making life easier for the

44
00:01:59,399 --> 00:02:04,380
people

45
00:02:00,540 --> 00:02:06,119
so software helps make life

46
00:02:04,380 --> 00:02:08,959
easier for people. There are two

47
00:02:06,119 --> 00:02:11,220
kinds of software like if we can

48
00:02:08,959 --> 00:02:13,860
highlight it one is like the good

49
00:02:11,220 --> 00:02:16,319
software and the other is bad software.

50
00:02:13,860 --> 00:02:19,560
Good software implies that

51
00:02:16,319 --> 00:02:21,420
uh you have a good amount of users

52
00:02:19,560 --> 00:02:24,360
traffic coming and you have a good

53
00:02:21,420 --> 00:02:26,340
engagement on your platform uh which can

54
00:02:24,360 --> 00:02:27,900
directly impact your business profits in

55
00:02:26,340 --> 00:02:31,500
terms of Revenue

56
00:02:27,900 --> 00:02:33,959
bad software implies like your code is

57
00:02:31,500 --> 00:02:36,660
actually not working as expected and

58
00:02:33,959 --> 00:02:39,540
there are several

59
00:02:36,660 --> 00:02:41,580
downtimes, which cause business

60
00:02:39,540 --> 00:02:44,879
loss in terms of Revenue

61
00:02:41,580 --> 00:02:47,540
If your software is

62
00:02:44,879 --> 00:02:51,720
good software, you can expect such

63
00:02:47,540 --> 00:02:54,599
systems to have

64
00:02:51,720 --> 00:02:57,780
a number of monthly

65
00:02:54,599 --> 00:03:00,060
active users that increases linearly.

66
00:02:57,780 --> 00:03:01,440
see users are using your software and

67
00:03:00,060 --> 00:03:03,480
they are enjoying it they will recommend

68
00:03:01,440 --> 00:03:05,519
to others then this traffic will keep on

69
00:03:03,480 --> 00:03:07,319
increasing this kind of high throughput

70
00:03:05,519 --> 00:03:09,120
must be handled smoothly to make sure

71
00:03:07,319 --> 00:03:11,220
like people are enjoying and also you

72
00:03:09,120 --> 00:03:13,680
are making profits in your business but

73
00:03:11,220 --> 00:03:17,280
turns out like sometimes things don't go

74
00:03:13,680 --> 00:03:19,620
well as expected because uh here system

75
00:03:17,280 --> 00:03:21,659
reliability is important where High

76
00:03:19,620 --> 00:03:24,659
number of user expectations should be

77
00:03:21,659 --> 00:03:28,500
met without any uh frustrations which

78
00:03:24,659 --> 00:03:32,400
can cause loss to the business uh like

79
00:03:28,500 --> 00:03:35,400
in case of bad software uh where like

80
00:03:32,400 --> 00:03:40,200
production issues happen and the

81
00:03:35,400 --> 00:03:42,060
downtimes are severe,

82
00:03:40,200 --> 00:03:44,459
engineers try to understand the

83
00:03:42,060 --> 00:03:46,379
problem to resolve it. They try to

84
00:03:44,459 --> 00:03:49,500
understand the problem within

85
00:03:46,379 --> 00:03:52,159
their internal systems and if like they

86
00:03:49,500 --> 00:03:55,200
don't follow best practices,

87
00:03:52,159 --> 00:03:58,620
it's basically too late for them to

88
00:03:55,200 --> 00:04:01,319
understand what the root cause

89
00:03:58,620 --> 00:04:03,420
of it is, and it becomes hard for them to

90
00:04:01,319 --> 00:04:04,440
figure out what's going on and the

91
00:04:03,420 --> 00:04:07,680
problem is already happening in

92
00:04:04,440 --> 00:04:10,739
production systems so uh to bridge this

93
00:04:07,680 --> 00:04:13,980
kind of Gap by making complex systems

94
00:04:10,739 --> 00:04:16,919
more transparent uh we can set up

95
00:04:13,980 --> 00:04:19,799
monitoring of our systems based uh on

96
00:04:16,919 --> 00:04:21,720
observability-driven development.

97
00:04:19,799 --> 00:04:25,620
so uh

98
00:04:21,720 --> 00:04:27,419
as we can see uh like monitoring is not

99
00:04:25,620 --> 00:04:28,979
the same thing as observability like

100
00:04:27,419 --> 00:04:31,860
they are very much similar terms but

101
00:04:28,979 --> 00:04:34,560
they are kind of different like uh

102
00:04:31,860 --> 00:04:37,680
As per the Google

103
00:04:34,560 --> 00:04:39,660
SRE book uh monitoring systems must

104
00:04:37,680 --> 00:04:41,940
answer only two simple questions like

105
00:04:39,660 --> 00:04:44,460
what's broken and why

106
00:04:41,940 --> 00:04:46,139
uh so monitoring is kind of a crucial

107
00:04:44,460 --> 00:04:48,960
thing like it involves building

108
00:04:46,139 --> 00:04:50,580
dashboards setting alerts it lets you

109
00:04:48,960 --> 00:04:53,100
know how your microservices are

110
00:04:50,580 --> 00:04:55,979
performing and in the long term it helps

111
00:04:53,100 --> 00:04:59,400
you understand uh the traffic growth the

112
00:04:55,979 --> 00:05:01,919
trends and how service is uh utilizing

113
00:04:59,400 --> 00:05:02,759
the uh machine resources on which it is

114
00:05:01,919 --> 00:05:06,240
running

115
00:05:02,759 --> 00:05:08,220
now comes the part where, in an ecosystem,

116
00:05:06,240 --> 00:05:10,800
multiple systems are

117
00:05:08,220 --> 00:05:13,080
running. This kind of system is

118
00:05:10,800 --> 00:05:15,540
a distributed system. In

119
00:05:13,080 --> 00:05:17,460
distributed systems,

120
00:05:15,540 --> 00:05:19,500
monitoring becomes a big Challenge and

121
00:05:17,460 --> 00:05:22,680
requires a deep dive into the internal

122
00:05:19,500 --> 00:05:25,860
States of each system uh as per their

123
00:05:22,680 --> 00:05:29,639
external outputs uh this is where like

124
00:05:25,860 --> 00:05:32,460
uh kind of observability kicks in and it

125
00:05:29,639 --> 00:05:35,280
provides an additional aid to monitoring.

126
00:05:32,460 --> 00:05:39,900
consider like observability is kind of

127
00:05:35,280 --> 00:05:41,280
like a code fitness tracker,

128
00:05:39,900 --> 00:05:43,680
where like you are counting every

129
00:05:41,280 --> 00:05:45,960
heartbeat and the healthy usage activity

130
00:05:43,680 --> 00:05:49,680
making sure that your software stays in

131
00:05:45,960 --> 00:05:51,300
top shape like a marathon runner

132
00:05:49,680 --> 00:05:54,240
um if there is like no observability

133
00:05:51,300 --> 00:05:57,180
there is no monitoring

134
00:05:54,240 --> 00:06:00,840
so before uh deep diving into

135
00:05:57,180 --> 00:06:04,259
observability uh like it's important uh

136
00:06:00,840 --> 00:06:06,120
to know these terms uh what are the

137
00:06:04,259 --> 00:06:09,139
pillars of the observability because

138
00:06:06,120 --> 00:06:12,000
these will actually help you to avoid

139
00:06:09,139 --> 00:06:14,340
anti-patterns in your software and like

140
00:06:12,000 --> 00:06:16,800
they will be able to help you uh what

141
00:06:14,340 --> 00:06:18,660
kind of uh strategies were missed during

142
00:06:16,800 --> 00:06:21,960
a development phase because if you don't

143
00:06:18,660 --> 00:06:25,080
follow these uh like kinds of uh

144
00:06:21,960 --> 00:06:27,479
practice it reflects like uh the

145
00:06:25,080 --> 00:06:28,680
inability to meet the promised slas or

146
00:06:27,479 --> 00:06:31,319
difficulty in tracking the business

147
00:06:28,680 --> 00:06:34,020
metrics, and considering poor

148
00:06:31,319 --> 00:06:37,020
performance as a trade-off so

149
00:06:34,020 --> 00:06:38,580
uh these are kind of very much important

150
00:06:37,020 --> 00:06:40,680
pillars for the observability so the

151
00:06:38,580 --> 00:06:42,960
three pillars are logging,

152
00:06:40,680 --> 00:06:45,419
metrics, and traces. I will be

153
00:06:42,960 --> 00:06:48,120
discussing these one by one first uh

154
00:06:45,419 --> 00:06:50,220
before going into the culture uh

155
00:06:48,120 --> 00:06:52,440
building thing and uh like I will be

156
00:06:50,220 --> 00:06:54,900
highlighting this using small

157
00:06:52,440 --> 00:06:57,120
snippets of Python code.

158
00:06:54,900 --> 00:06:59,100
so looking at first the logging let's

159
00:06:57,120 --> 00:07:01,560
discuss about the logging so logging

160
00:06:59,100 --> 00:07:05,400
helps in understanding the behavior of

161
00:07:01,560 --> 00:07:07,139
the service at runtime;

162
00:07:05,400 --> 00:07:08,759
these are the recorded pieces of the

163
00:07:07,139 --> 00:07:11,580
information flowing through the service

164
00:07:08,759 --> 00:07:13,440
and they are kind of uh kind of

165
00:07:11,580 --> 00:07:17,240
typically saved in JSON format, where

166
00:07:13,440 --> 00:07:20,340
developers can use some kind of patterns

167
00:07:17,240 --> 00:07:22,080
to match and see how their service is

168
00:07:20,340 --> 00:07:24,000
performing depending on the use case

169
00:07:22,080 --> 00:07:26,340
like how they are using their logs.

170
00:07:24,000 --> 00:07:28,740
There are four log levels,

171
00:07:26,340 --> 00:07:31,319
as you can see: there is debug,

172
00:07:28,740 --> 00:07:33,240
info, warning, and error. Debug is

173
00:07:31,319 --> 00:07:36,419
for when you are doing some kind of

174
00:07:33,240 --> 00:07:38,520
debugging, and mostly these

175
00:07:36,419 --> 00:07:40,860
types of logs are used

176
00:07:38,520 --> 00:07:42,660
in local development. Then

177
00:07:40,860 --> 00:07:44,580
the second one is info,

178
00:07:42,660 --> 00:07:47,160
which is used for general

179
00:07:44,580 --> 00:07:49,740
purpose logging. Then comes the

180
00:07:47,160 --> 00:07:53,280
warning log, which is used to

181
00:07:49,740 --> 00:07:55,979
tell you: hey, this is not

182
00:07:53,280 --> 00:07:59,639
that critical, but can be

183
00:07:55,979 --> 00:08:02,880
troublesome in the near future. And the

184
00:07:59,639 --> 00:08:05,160
fourth one is the error log, which is

185
00:08:02,880 --> 00:08:07,740
mostly used to signify errors

186
00:08:05,160 --> 00:08:10,080
in the application. The best practice

187
00:08:07,740 --> 00:08:12,120
to organize your logs uh for a python

188
00:08:10,080 --> 00:08:15,780
based application is to have like first

189
00:08:12,120 --> 00:08:18,419
the module name to identify quickly from

190
00:08:15,780 --> 00:08:20,819
uh which module uh the error got

191
00:08:18,419 --> 00:08:22,319
reported uh then comes the timestamp

192
00:08:20,819 --> 00:08:25,800
which tells you at what

193
00:08:22,319 --> 00:08:28,139
time the log was reported. Then

194
00:08:25,800 --> 00:08:30,120
comes the process ID process ID is kind

195
00:08:28,139 --> 00:08:32,459
of optional it depends whether you want

196
00:08:30,120 --> 00:08:35,580
to add or not it is kind of helpful when

197
00:08:32,459 --> 00:08:38,339
your systems are running

198
00:08:35,580 --> 00:08:39,719
multiple processes and on the basis of

199
00:08:38,339 --> 00:08:42,839
that you can

200
00:08:39,719 --> 00:08:46,860
easily tell which

201
00:08:42,839 --> 00:08:49,140
process ID logged this. Then

202
00:08:46,860 --> 00:08:51,240
comes the log level uh log level

203
00:08:49,140 --> 00:08:52,980
again tells you whether it is

204
00:08:51,240 --> 00:08:54,600
info, warning, or error. And last,

205
00:08:52,980 --> 00:08:57,959
the message tells you in

206
00:08:54,600 --> 00:09:01,620
detail what's happening in

207
00:08:57,959 --> 00:09:05,040
the system.

208
00:09:01,620 --> 00:09:07,260
so here's a small example where I have

209
00:09:05,040 --> 00:09:09,480
set up a logger. You can see there's a

210
00:09:07,260 --> 00:09:14,700
configure logger where I have set a level of

211
00:09:09,480 --> 00:09:17,580
info. As I

212
00:09:14,700 --> 00:09:20,820
presented in my previous slide,

213
00:09:17,580 --> 00:09:22,740
the order of log levels is debug, info,

214
00:09:20,820 --> 00:09:24,660
warning, and error. So here I have

215
00:09:22,740 --> 00:09:27,420
set my level starting from the info level,

216
00:09:24,660 --> 00:09:30,540
which means that whenever I

217
00:09:27,420 --> 00:09:33,600
will be running my system in

218
00:09:30,540 --> 00:09:36,480
production it won't be logging those

219
00:09:33,600 --> 00:09:38,459
logs which are at debug level those are

220
00:09:36,480 --> 00:09:40,740
mostly for the local development purpose

221
00:09:38,459 --> 00:09:43,019
and they won't be getting

222
00:09:40,740 --> 00:09:46,560
stored on the production system so it

223
00:09:43,019 --> 00:09:47,779
will be logging from info to warning

224
00:09:46,560 --> 00:09:50,940
and then error
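
[Editor's example] The logger setup described here can be sketched with the standard library alone. This is a minimal sketch, not the code from the speaker's slide: the function name `configure_logger`, the logger name "demo", and the log messages are assumptions. The format string follows the field order suggested in the talk: module name, timestamp, process ID, log level, message.

```python
import logging
import sys


def configure_logger(name: str = "demo") -> logging.Logger:
    """Return a logger that emits INFO and above (hypothetical helper)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)  # DEBUG records are dropped at this level
    logger.propagate = False       # keep output on our own handler only
    if not logger.handlers:        # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler(sys.stdout)
        # Field order from the talk: module, timestamp, process ID, level, message
        handler.setFormatter(logging.Formatter(
            "%(module)s %(asctime)s %(process)d %(levelname)s %(message)s"
        ))
        logger.addHandler(handler)
    return logger


logger = configure_logger()
logger.debug("cache lookup details")    # suppressed: below INFO
logger.info("order service started")
logger.warning("retrying payment call")
logger.error("payment call failed")
```

With the level set to INFO, only the last three calls produce output, which matches the behavior described for production.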

225
00:09:47,779 --> 00:09:52,800
uh so you can see like at the right side

226
00:09:50,940 --> 00:09:54,480
on the top level you can see that's a

227
00:09:52,800 --> 00:09:57,420
standard format of the log which comes

228
00:09:54,480 --> 00:09:59,700
however uh the best practice is to

229
00:09:57,420 --> 00:10:02,160
use JSON-structured logs, which can

230
00:09:59,700 --> 00:10:05,220
be helpful in plotting the monitoring

231
00:10:02,160 --> 00:10:08,279
panels in form of graphs to understand

232
00:10:05,220 --> 00:10:11,220
the trend of your logs, such as

233
00:10:08,279 --> 00:10:12,720
the API calls. Again, as I

234
00:10:11,220 --> 00:10:15,540
mentioned earlier,

235
00:10:12,720 --> 00:10:18,300
developers can use these logs to match

236
00:10:15,540 --> 00:10:20,880
on some kind of pattern and they can uh

237
00:10:18,300 --> 00:10:24,240
aggregate the logs data on the basis of

238
00:10:20,880 --> 00:10:26,220
some time interval, that is, I

239
00:10:24,240 --> 00:10:28,680
would say, sample their data,

240
00:10:26,220 --> 00:10:34,019
and then they can create their own

241
00:10:28,680 --> 00:10:36,000
monitoring panels uh a good example of a

242
00:10:34,019 --> 00:10:39,660
monitoring panel can be like let's say

243
00:10:36,000 --> 00:10:43,800
you have an API which kind of logs uh

244
00:10:39,660 --> 00:10:45,240
status codes, 5xx, 4xx, or 2xx. You can

245
00:10:43,800 --> 00:10:47,399
plot this trend by using the status

246
00:10:45,240 --> 00:10:49,860
code of the API, which you are logging

247
00:10:47,399 --> 00:10:52,140
after request has been performed

248
00:10:49,860 --> 00:10:55,079
and you can like understand the behavior

249
00:10:52,140 --> 00:10:57,200
of your API, how many 4xx you are

250
00:10:55,079 --> 00:10:57,200
getting
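
[Editor's example] A rough sketch of JSON-structured logs and of the status-code aggregation a monitoring panel would plot. It assumes one JSON object per log line and a hypothetical extra field named `status_code`; a real log pipeline would do the aggregation in its own query language rather than in application code.

```python
import json
import logging
from collections import Counter


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "module": record.module,
            "timestamp": self.formatTime(record),
            "level": record.levelname,
            "message": record.getMessage(),
            # status_code is a hypothetical extra field set by the caller
            "status_code": getattr(record, "status_code", None),
        }
        return json.dumps(payload)


def count_status_classes(json_lines):
    """Group logged status codes into buckets such as 2xx, 4xx, 5xx."""
    counts = Counter()
    for line in json_lines:
        code = json.loads(line).get("status_code")
        if code is not None:
            counts[f"{code // 100}xx"] += 1
    return counts


formatter = JsonFormatter()
lines = []
for code in (200, 200, 404, 500):
    # Build records by hand so the example needs no running web service
    record = logging.LogRecord("api", logging.INFO, "api.py", 0,
                               "request done", None, None)
    record.status_code = code
    lines.append(formatter.format(record))

status_counts = count_status_classes(lines)
print(dict(status_counts))
```

Because every line is machine-readable, the same counting can be repeated per time interval to draw the trend of 4xx or 5xx responses.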

251
00:10:57,660 --> 00:11:04,079
so uh this is another concept like let's

252
00:11:01,560 --> 00:11:05,940
say uh you have a distributed

253
00:11:04,079 --> 00:11:08,100
environment. In distributed systems

254
00:11:05,940 --> 00:11:09,720
there can be a scenario where multiple

255
00:11:08,100 --> 00:11:12,120
instances are running and your

256
00:11:09,720 --> 00:11:15,899
system is handling a lot of requests

257
00:11:12,120 --> 00:11:18,180
from the users at scale now imagine you

258
00:11:15,899 --> 00:11:20,399
get an issue raised that one of the

259
00:11:18,180 --> 00:11:22,500
users is affected while using your

260
00:11:20,399 --> 00:11:25,140
software you want to debug the root

261
00:11:22,500 --> 00:11:27,899
cause of it. Directly looking into the

262
00:11:25,140 --> 00:11:30,060
requests will be very difficult

263
00:11:27,899 --> 00:11:32,640
to understand because imagine like you

264
00:11:30,060 --> 00:11:35,100
are getting millions of requests and you

265
00:11:32,640 --> 00:11:38,339
are checking just for one user or

266
00:11:35,100 --> 00:11:40,019
a small set of users

267
00:11:38,339 --> 00:11:43,440
who are being

268
00:11:40,019 --> 00:11:45,899
affected. So, to make sure you

269
00:11:43,440 --> 00:11:47,760
are looking into the right request uh we

270
00:11:45,899 --> 00:11:49,260
use the concept of the trace ID Trace

271
00:11:47,760 --> 00:11:52,620
IDs are actually helpful to track

272
00:11:49,260 --> 00:11:55,200
specific requests from start till the

273
00:11:52,620 --> 00:11:57,060
end, reflecting how your system

274
00:11:55,200 --> 00:11:59,220
processed that particular request, from when it

275
00:11:57,060 --> 00:12:00,839
was received by the system until the

276
00:11:59,220 --> 00:12:01,980
acknowledgement was sent to the client

277
00:12:00,839 --> 00:12:04,140
side

278
00:12:01,980 --> 00:12:06,660
at the request level, a trace ID is always

279
00:12:04,140 --> 00:12:11,160
unique. You can simply look up the

280
00:12:06,660 --> 00:12:13,140
trace ID for the user and fetch logs uh

281
00:12:11,160 --> 00:12:14,760
like and it will be very simple to

282
00:12:13,140 --> 00:12:18,300
understand like what's affecting the

283
00:12:14,760 --> 00:12:19,740
user. Here, this is a small

284
00:12:18,300 --> 00:12:23,640
piece of code where I'm trying to

285
00:12:19,740 --> 00:12:25,320
simulate two requests with

286
00:12:23,640 --> 00:12:28,380
different Trace IDs at the right side

287
00:12:25,320 --> 00:12:31,140
you can see uh the request one is

288
00:12:28,380 --> 00:12:33,660
one session, where

289
00:12:31,140 --> 00:12:35,660
uh the trace ID is unique for that

290
00:12:33,660 --> 00:12:39,420
particular session then the second

291
00:12:35,660 --> 00:12:41,579
request was like another kind of request

292
00:12:39,420 --> 00:12:44,940
and another session where the trace ID

293
00:12:41,579 --> 00:12:47,579
is unique for that case also.
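
[Editor's example] A minimal sketch of attaching a per-request trace ID to every log line, assuming the ID is generated when the request arrives and kept in a context variable. The names `trace_id_var`, `TraceIdFilter`, and `handle_request` are illustrative and not taken from the slide.

```python
import contextvars
import io
import logging
import uuid

# Holds the trace ID of the request currently being handled.
trace_id_var = contextvars.ContextVar("trace_id", default="-")


class TraceIdFilter(logging.Filter):
    """Copy the current trace ID onto every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = trace_id_var.get()
        return True


stream = io.StringIO()  # in-memory sink so the output can be inspected
handler = logging.StreamHandler(stream)
handler.addFilter(TraceIdFilter())
handler.setFormatter(
    logging.Formatter("%(levelname)s trace_id=%(trace_id)s %(message)s")
)

logger = logging.getLogger("trace_demo")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)


def handle_request(user: str) -> str:
    """Simulate one request: a new trace ID at entry, reused by every log line."""
    trace_id = uuid.uuid4().hex
    token = trace_id_var.set(trace_id)
    try:
        logger.info("request received for %s", user)
        logger.info("response sent to %s", user)
    finally:
        trace_id_var.reset(token)  # do not leak the ID into the next request
    return trace_id


first = handle_request("user-1")
second = handle_request("user-2")
print(stream.getvalue())
```

Searching the output for one trace ID returns exactly the lines of one request, which is what makes a single affected user easy to follow among many requests.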

294
00:12:44,940 --> 00:12:48,899
so here, adding a trace ID

295
00:12:47,579 --> 00:12:51,060
actually reduces the mean time

296
00:12:48,899 --> 00:12:52,800
to resolve a production issue.

297
00:12:51,060 --> 00:12:56,579
This is another best

298
00:12:52,800 --> 00:13:00,959
practice which you can use to kind of

299
00:12:56,579 --> 00:13:02,220
reduce the uh time to debug production

300
00:13:00,959 --> 00:13:04,500
issues

301
00:13:02,220 --> 00:13:06,600
but there are certain limitations of the

302
00:13:04,500 --> 00:13:08,940
logging like uh extensive logging can

303
00:13:06,600 --> 00:13:11,339
generate large volumes of data, leading

304
00:13:08,940 --> 00:13:13,139
to storage challenges and these storage

305
00:13:11,339 --> 00:13:15,899
challenges can gradually increase the

306
00:13:13,139 --> 00:13:19,260
cost of running the infra which is not a

307
00:13:15,899 --> 00:13:21,240
good thing uh careless logging practice

308
00:13:19,260 --> 00:13:23,100
can lead to sensitive information leaks

309
00:13:21,240 --> 00:13:25,700
and which can raise some security

310
00:13:23,100 --> 00:13:29,100
concerns which is again not a good idea
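
[Editor's example] One common way to reduce the risk of sensitive data ending up in logs is a redaction filter. This is only a sketch: the pattern below covers two hypothetical field names (`password`, `token`), and a real service would need its own list.

```python
import logging
import re

# Hypothetical pattern; real services need their own list of sensitive fields.
SENSITIVE = re.compile(r"(password|token)=\S+", re.IGNORECASE)


class RedactFilter(logging.Filter):
    """Mask sensitive key=value pairs before the record is formatted."""

    def filter(self, record: logging.LogRecord) -> bool:
        message = record.getMessage()  # message with arguments merged in
        record.msg = SENSITIVE.sub(r"\1=[REDACTED]", message)
        record.args = ()               # arguments are already merged
        return True


# Build a record by hand to show the effect without configuring handlers.
demo_record = logging.LogRecord("auth", logging.INFO, "auth.py", 0,
                                "login attempt password=%s", ("hunter2",), None)
RedactFilter().filter(demo_record)
print(demo_record.getMessage())
```

Attaching such a filter to a handler applies it to every record that handler emits.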

311
00:13:25,700 --> 00:13:32,160
uh log noise is another thing like if

312
00:13:29,100 --> 00:13:34,220
you don't, within your team,

313
00:13:32,160 --> 00:13:34,220
establish

314
00:13:34,220 --> 00:13:38,579
some standards, such as

315
00:13:38,579 --> 00:13:40,200
your logs should be in this kind of

316
00:13:40,200 --> 00:13:42,180
format, then they can be in a random format.

317
00:13:42,180 --> 00:13:44,639
They can sometimes be a bit noisy, with

318
00:13:44,639 --> 00:13:46,800
excess information, which is sometimes

319
00:13:46,800 --> 00:13:50,040
not helpful when you are debugging a

320
00:13:48,420 --> 00:13:52,800
production issue

321
00:13:50,040 --> 00:13:54,720
also logging does not provide a

322
00:13:52,800 --> 00:13:57,240
quantitative measurement of the system

323
00:13:54,720 --> 00:14:00,899
Behavior like which Quantum measurements

324
00:13:57,240 --> 00:14:03,180
are like the CPU or the memory or system

325
00:14:00,899 --> 00:14:05,760
requires and these things can actually

326
00:14:03,180 --> 00:14:08,279
help in the resource planning for

327
00:14:05,760 --> 00:14:09,660
running your systems at optimal infra

328
00:14:08,279 --> 00:14:12,079
cost

329
00:14:09,660 --> 00:14:15,180
so uh

330
00:14:12,079 --> 00:14:17,760
this is where metrics come to the

331
00:14:15,180 --> 00:14:20,040
rescue. Metrics are the

332
00:14:17,760 --> 00:14:21,779
quantitative measurement of the systems,

333
00:14:20,040 --> 00:14:24,540
to understand how the system is performing.

334
00:14:21,779 --> 00:14:27,139
They provide numerical and statistical

335
00:14:24,540 --> 00:14:30,480
insights making it easier to track

336
00:14:27,139 --> 00:14:33,360
performance detect anomalies and measure

337
00:14:30,480 --> 00:14:35,700
trends. They also help in resource

338
00:14:33,360 --> 00:14:39,360
planning as I mentioned like in my

339
00:14:35,700 --> 00:14:42,720
previous slide, where they

340
00:14:39,360 --> 00:14:46,139
can give you a better picture of how

341
00:14:42,720 --> 00:14:48,839
the

342
00:14:46,139 --> 00:14:52,680
system on which your

343
00:14:48,839 --> 00:14:55,199
instance runs is doing:

344
00:14:52,680 --> 00:14:58,079
how much CPU it is consuming, how much

345
00:14:55,199 --> 00:14:59,940
memory, and lots of other

346
00:14:58,079 --> 00:15:03,720
things,

347
00:14:59,940 --> 00:15:05,220
etc. Apart from these, there are

348
00:15:03,720 --> 00:15:07,199
four golden signals which are very

349
00:15:05,220 --> 00:15:09,240
much important for your software which I

350
00:15:07,199 --> 00:15:10,800
think I should cover. One is

351
00:15:09,240 --> 00:15:12,899
latency, which describes how the

352
00:15:10,800 --> 00:15:14,760
system is performing at a granular

353
00:15:12,899 --> 00:15:18,060
level and how long requests are taking

354
00:15:14,760 --> 00:15:19,800
to get processed by the server. Then

355
00:15:18,060 --> 00:15:23,880
traffic, or throughput, defines

356
00:15:19,800 --> 00:15:26,339
how many requests your system is

357
00:15:23,880 --> 00:15:29,339
receiving per minute or per

358
00:15:26,339 --> 00:15:31,560
second. Then comes the error rate,

359
00:15:29,339 --> 00:15:34,199
which again describes the 5xx

360
00:15:31,560 --> 00:15:36,600
errors in your application and that can

361
00:15:34,199 --> 00:15:38,160
be due to any recent deployment or can

362
00:15:36,600 --> 00:15:40,440
be a malfunction of an external

363
00:15:38,160 --> 00:15:42,860
service or the database on which your

364
00:15:40,440 --> 00:15:45,720
service is actually dependent

365
00:15:42,860 --> 00:15:47,279
Then comes saturation. Saturation

366
00:15:45,720 --> 00:15:50,579
is the main thing which tells you about

367
00:15:47,279 --> 00:15:53,639
the quantitative measurements: CPU,

368
00:15:50,579 --> 00:15:55,920
memory, disk IOPS, etc.
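
[Editor's example] To make the four golden signals concrete, here is a small sketch that derives them from a handful of request records. The record shape, the time window, and the CPU figures are made-up sample values; in practice an APM tool computes these for you.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Request:
    duration_ms: float
    status_code: int


def golden_signals(requests, window_seconds, cpu_used, cpu_total):
    """Compute latency, traffic, error rate and saturation for one window."""
    return {
        # Latency: how long requests take to be processed
        "latency_ms_avg": mean(r.duration_ms for r in requests),
        # Traffic: requests received per second in the window
        "throughput_rps": len(requests) / window_seconds,
        # Errors: share of requests that ended in a 5xx response
        "error_rate": sum(r.status_code >= 500 for r in requests) / len(requests),
        # Saturation: how full a resource is (CPU cores here)
        "saturation_cpu": cpu_used / cpu_total,
    }


# Made-up sample data for illustration only
sample = [Request(120, 200), Request(80, 200), Request(300, 500), Request(100, 404)]
signals = golden_signals(sample, window_seconds=2, cpu_used=3.0, cpu_total=4.0)
print(signals)
```

Averages hide slow outliers, so real dashboards usually also plot latency percentiles; the average is used here only to keep the sketch short.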

369
00:15:53,639 --> 00:15:57,660
so uh let's look into one of the

370
00:15:55,920 --> 00:16:00,660
examples so here is like one of the

371
00:15:57,660 --> 00:16:03,000
example which I have uh actually picked

372
00:16:00,660 --> 00:16:04,500
from the official documentation of New

373
00:16:03,000 --> 00:16:07,019
Relic. New Relic is a third-

374
00:16:04,500 --> 00:16:09,660
party tool which is used for plotting

375
00:16:07,019 --> 00:16:10,740
the metrics for your services. It's a

376
00:16:09,660 --> 00:16:14,100
kind of

377
00:16:10,740 --> 00:16:16,440
APM-based third-party tool. Here you

378
00:16:14,100 --> 00:16:19,260
can see like it gives a proper summary

379
00:16:16,440 --> 00:16:23,279
of your service throughput uh error

380
00:16:19,260 --> 00:16:26,940
rates, and how much time your

381
00:16:23,279 --> 00:16:28,860
overall service APIs are

382
00:16:26,940 --> 00:16:30,360
taking, the transaction time

383
00:16:28,860 --> 00:16:32,699
actually.

384
00:16:30,360 --> 00:16:36,139
so this is the overview of how

385
00:16:32,699 --> 00:16:39,120
the golden signals look.

386
00:16:36,139 --> 00:16:40,680
It can be more granular;

387
00:16:39,120 --> 00:16:44,820
your metrics can be improved in a better

388
00:16:40,680 --> 00:16:47,519
way. Here is

389
00:16:44,820 --> 00:16:49,980
a very small example to show how

390
00:16:47,519 --> 00:16:52,680
things are working so like as I

391
00:16:49,980 --> 00:16:55,440
mentioned uh so the New Relic thing uh

392
00:16:52,680 --> 00:16:57,720
you can not only just see the throughput

393
00:16:55,440 --> 00:17:00,480
error rate or the uh

394
00:16:57,720 --> 00:17:04,199
transactions but you can also

395
00:17:00,480 --> 00:17:09,419
see uh segments let's say you have one

396
00:17:04,199 --> 00:17:12,000
API you want uh to have metrics at some

397
00:17:09,419 --> 00:17:14,939
pieces of code for which your API is

398
00:17:12,000 --> 00:17:18,360
dependent on uh let's say uh this is one

399
00:17:14,939 --> 00:17:19,860
of the code flows. Here

400
00:17:18,360 --> 00:17:23,160
is the conference hall manager,

401
00:17:19,860 --> 00:17:24,780
where there are

402
00:17:23,160 --> 00:17:26,819
two methods, book and

403
00:17:24,780 --> 00:17:29,460
occupied. They tell you whether

404
00:17:26,819 --> 00:17:31,679
your conference hall is available for

405
00:17:29,460 --> 00:17:34,620
booking or not, or whether the conference hall is

406
00:17:31,679 --> 00:17:38,360
actually occupied or not. So,

407
00:17:34,620 --> 00:17:41,760
considering this is working at

408
00:17:38,360 --> 00:17:44,100
scale, say the traffic

409
00:17:41,760 --> 00:17:47,880
throughput is in millions,

410
00:17:44,100 --> 00:17:49,980
it gets very difficult to

411
00:17:47,880 --> 00:17:51,780
understand how much time this piece of

412
00:17:49,980 --> 00:17:54,539
code might be taking. This is where

413
00:17:51,780 --> 00:17:57,000
I can use these function traces,

414
00:17:54,539 --> 00:18:00,000
and I can get the average transaction

415
00:17:57,000 --> 00:18:02,640
calls and how much time it is taking.

416
00:18:00,000 --> 00:18:05,820
uh this is like just for understanding

417
00:18:02,640 --> 00:18:08,280
purpose example but uh if your service

418
00:18:05,820 --> 00:18:11,400
is having a business layer and you want

419
00:18:08,280 --> 00:18:14,640
to actually look into how

420
00:18:11,400 --> 00:18:16,919
your algorithm is doing this piece of

421
00:18:14,640 --> 00:18:19,320
work, how much time it is taking, how

422
00:18:16,919 --> 00:18:21,960
much memory is being utilized for X

423
00:18:19,320 --> 00:18:24,299
calls per minute then these kind of

424
00:18:21,960 --> 00:18:27,480
function traces are really helpful
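
[Editor's example] The idea behind a function trace can be sketched with a plain timing decorator. This is a standard-library stand-in, not New Relic's API: the decorator, the `TIMINGS` store, and the bodies of `book` and `occupied` are assumptions made for illustration, since the slide code is not part of the transcript.

```python
import functools
import time
from collections import defaultdict

# function name -> list of call durations in seconds
TIMINGS = defaultdict(list)


def function_trace(func):
    """Record how long each call takes (a stand-in, not an APM agent)."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            TIMINGS[func.__qualname__].append(time.perf_counter() - start)
    return wrapper


class ConferenceHallManager:
    """Hypothetical version of the class named in the talk."""

    def __init__(self):
        self._booked = set()

    @function_trace
    def occupied(self, hall: str) -> bool:
        return hall in self._booked

    @function_trace
    def book(self, hall: str) -> bool:
        if self.occupied(hall):
            return False
        self._booked.add(hall)
        return True


manager = ConferenceHallManager()
results = [manager.book("hall-a"), manager.book("hall-a")]
for name, durations in TIMINGS.items():
    print(name, "calls:", len(durations),
          "avg seconds:", sum(durations) / len(durations))
```

An APM agent does the same kind of bookkeeping and ships the numbers to a dashboard, so call counts and average durations can be viewed per function segment.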

425
00:18:24,299 --> 00:18:30,360
another example is using StatsD.

426
00:18:27,480 --> 00:18:33,419
StatsD is another tool

427
00:18:30,360 --> 00:18:37,679
which I can use in Python. Let's say I

428
00:18:33,419 --> 00:18:41,340
want, for some function, let's

429
00:18:37,679 --> 00:18:43,200
say, I want to

430
00:18:41,340 --> 00:18:45,419
be able to

431
00:18:43,200 --> 00:18:48,960
report the latency of a function running

432
00:18:45,419 --> 00:18:51,360
and then see it. You can use

433
00:18:48,960 --> 00:18:53,580
the latency calculation: the StatsD

434
00:18:51,360 --> 00:18:55,559
timer can be used to measure the

435
00:18:53,580 --> 00:18:57,539
latency of a particular

436
00:18:55,559 --> 00:19:00,480
function which is being run in your code.

437
00:18:57,539 --> 00:19:04,080
and apart from that, let's say there's

438
00:19:00,480 --> 00:19:05,340
an instance in your

439
00:19:04,080 --> 00:19:07,799
business, some logic in your

440
00:19:05,340 --> 00:19:10,140
business layer, where it is

441
00:19:07,799 --> 00:19:12,419
emitting some

442
00:19:10,140 --> 00:19:15,720
kind of messages or events to a

443
00:19:12,419 --> 00:19:17,880
queue. Let's say this queue is an SQS queue,

444
00:19:15,720 --> 00:19:19,440
so you want to know how

445
00:19:17,880 --> 00:19:21,240
many are the successful enqueues and how

446
00:19:19,440 --> 00:19:23,039
many are the failed enqueues. You can

447
00:19:21,240 --> 00:19:26,640
easily

448
00:19:23,039 --> 00:19:28,740
emit that kind of data using the

449
00:19:26,640 --> 00:19:32,340
StatsD client and you can just plot it

450
00:19:28,740 --> 00:19:34,740
wherever you plot metrics, and

451
00:19:32,340 --> 00:19:36,299
you can visualize your complete data how

452
00:19:34,740 --> 00:19:37,679
things are working and you can

453
00:19:36,299 --> 00:19:42,660
understand if there's something going

454
00:19:37,679 --> 00:19:45,419
wrong then you can uh like do some like

455
00:19:42,660 --> 00:19:48,600
can outline some action items on it

456
00:19:45,419 --> 00:19:51,120
so this was like about the metrics uh

457
00:19:48,600 --> 00:19:52,740
limitations are like there are some few

458
00:19:51,120 --> 00:19:56,280
kinds of limitations in the metrics as

459
00:19:52,740 --> 00:19:58,740
well uh so they they kind of provide a

460
00:19:56,280 --> 00:20:02,580
very limited context and they don't

461
00:19:58,740 --> 00:20:04,260
provide very rich context like uh like

462
00:20:02,580 --> 00:20:06,000
metrics mostly focus on the numerical

463
00:20:04,260 --> 00:20:08,580
values and they provide insights into

464
00:20:06,000 --> 00:20:10,980
like trends they kind of what they are

465
00:20:08,580 --> 00:20:13,620
lacking actually is the uh information

466
00:20:10,980 --> 00:20:15,539
necessary to fully understand like the

467
00:20:13,620 --> 00:20:17,640
reason behind these certain values why

468
00:20:15,539 --> 00:20:21,059
this is happening why latency is so much

469
00:20:17,640 --> 00:20:23,940
why uh my messages didn't get enqueued

470
00:20:21,059 --> 00:20:27,000
why it got failed uh these kinds of

471
00:20:23,940 --> 00:20:29,100
things which our metrics don't answer

472
00:20:27,000 --> 00:20:31,260
sometimes like another thing is like the

473
00:20:29,100 --> 00:20:33,000
metric overload thing uh where uh

474
00:20:31,260 --> 00:20:34,559
sometimes like tracking too many metrics

475
00:20:33,000 --> 00:20:36,179
can lead to some information overload

476
00:20:34,559 --> 00:20:37,200
making it difficult to focus on what's

477
00:20:36,179 --> 00:20:39,720
important

478
00:20:37,200 --> 00:20:42,240
and over optimization is like only

479
00:20:39,720 --> 00:20:44,940
solely dependent on the metrics decision

480
00:20:42,240 --> 00:20:48,900
making can lead to over optimization

481
00:20:44,940 --> 00:20:50,580
this is where uh events come uh in the

482
00:20:48,900 --> 00:20:51,720
picture so events are kind of the

483
00:20:50,580 --> 00:20:53,760
fundamental component of the

484
00:20:51,720 --> 00:20:56,160
observability but they slightly provide

485
00:20:53,760 --> 00:20:58,320
a different purpose compared to logs uh

486
00:20:56,160 --> 00:21:00,120
they kind of provide a they kind of

487
00:20:58,320 --> 00:21:01,740
provide a rich information like they

488
00:21:00,120 --> 00:21:05,520
will actually tell you like why the

489
00:21:01,740 --> 00:21:07,380
latency was increased and why the uh

490
00:21:05,520 --> 00:21:09,360
basically the

491
00:21:07,380 --> 00:21:11,760
messages which are getting enqueued why

492
00:21:09,360 --> 00:21:13,740
they are getting failed and so on so

493
00:21:11,760 --> 00:21:15,539
they kind of include metadata and

494
00:21:13,740 --> 00:21:19,140
structured data like timestamps event

495
00:21:15,539 --> 00:21:21,120
types additional attributes Etc so these

496
00:21:19,140 --> 00:21:23,580
kind like these events are actually

497
00:21:21,120 --> 00:21:25,380
helpful when you want to track even more

498
00:21:23,580 --> 00:21:27,299
granular level of business related

499
00:21:25,380 --> 00:21:30,299
events like how many users are able to

500
00:21:27,299 --> 00:21:32,940
view the products how many uh products

501
00:21:30,299 --> 00:21:35,039
are getting uh added to the cart for

502
00:21:32,940 --> 00:21:38,520
example like in case of e-commerce

503
00:21:35,039 --> 00:21:41,280
application uh Etc these events can be

504
00:21:38,520 --> 00:21:43,860
used for the analytical purpose to make

505
00:21:41,280 --> 00:21:45,600
decision making and drive business and

506
00:21:43,860 --> 00:21:48,299
it will also help you understand like

507
00:21:45,600 --> 00:21:51,600
what is actually uh impacting the

508
00:21:48,299 --> 00:21:54,059
business and how you can improve it

509
00:21:51,600 --> 00:21:56,280
these events can be pushed like in a

510
00:21:54,059 --> 00:21:57,539
columnar databases where you can which

511
00:21:56,280 --> 00:21:59,880
are actually used for the analytical

512
00:21:57,539 --> 00:22:01,679
purposes and you can understand the

513
00:21:59,880 --> 00:22:03,780
internal states of the application by

514
00:22:01,679 --> 00:22:06,059
querying on the large events data set

515
00:22:03,780 --> 00:22:08,280
some of the examples of the columnar

516
00:22:06,059 --> 00:22:10,380
databases are like Apache Cassandra and

517
00:22:08,280 --> 00:22:13,080
Amazon Redshift

518
00:22:10,380 --> 00:22:16,799
so uh one of the example of the events

519
00:22:13,080 --> 00:22:19,740
is like this uh where I will be uh like

520
00:22:16,799 --> 00:22:21,539
this is a kind of a schema uh for like

521
00:22:19,740 --> 00:22:23,940
booking events for the PyCon conference

522
00:22:21,539 --> 00:22:25,740
and uh it has this user ID event type

523
00:22:23,940 --> 00:22:28,200
action action will tell you about the

524
00:22:25,740 --> 00:22:29,880
order placed order canceled then there's

525
00:22:28,200 --> 00:22:34,799
ticket type student professional

526
00:22:29,880 --> 00:22:36,179
hobbyist whether so uh kind of uh this

527
00:22:34,799 --> 00:22:39,000
this is again like a small piece of code

528
00:22:36,179 --> 00:22:40,980
where I'm using a traditional RDS just

529
00:22:39,000 --> 00:22:42,780
for sake of example however like when

530
00:22:40,980 --> 00:22:45,539
you're working on the scale

531
00:22:42,780 --> 00:22:47,700
um I would recommend like uh columnar

532
00:22:45,539 --> 00:22:52,080
databases are like much better for this

533
00:22:47,700 --> 00:22:54,539
use case uh then comes like here at the

534
00:22:52,080 --> 00:22:56,159
right bottom you can see the uh ticket

535
00:22:54,539 --> 00:22:57,720
booking event I am creating where I'm

536
00:22:56,159 --> 00:23:00,900
passing the request context request

537
00:22:57,720 --> 00:23:03,720
context will be having the user ID and

538
00:23:00,900 --> 00:23:06,900
the uh other metadata which is required

539
00:23:03,720 --> 00:23:09,000
for the reporting the events and in the

540
00:23:06,900 --> 00:23:10,380
set event attributes I am kind of

541
00:23:09,000 --> 00:23:12,360
actually

542
00:23:10,380 --> 00:23:14,880
setting like what was the action whether

543
00:23:12,360 --> 00:23:16,860
the order was placed or the canceled or

544
00:23:14,880 --> 00:23:19,740
what was the ticket type was it student

545
00:23:16,860 --> 00:23:22,020
professional or hobbyist and at the end

546
00:23:19,740 --> 00:23:25,080
like I'm using the emit method which is

547
00:23:22,020 --> 00:23:29,640
kind of emitting my complete data into

548
00:23:25,080 --> 00:23:32,820
the RDS so this is like the overall uh

549
00:23:29,640 --> 00:23:34,080
example of the events uh now we have

550
00:23:32,820 --> 00:23:35,760
covered like almost all the three

551
00:23:34,080 --> 00:23:37,919
pillars of the observability now let's

552
00:23:35,760 --> 00:23:40,919
take a look into how to build and drive

553
00:23:37,919 --> 00:23:44,640
that culture within your team

554
00:23:40,919 --> 00:23:47,820
uh first thing comes like uh education

555
00:23:44,640 --> 00:23:50,820
like educating the team is uh very much

556
00:23:47,820 --> 00:23:53,640
important for uh like

557
00:23:50,820 --> 00:23:55,799
very much important you have to teach

558
00:23:53,640 --> 00:23:57,960
your team of the importance of the

559
00:23:55,799 --> 00:23:59,820
observability and how it contributes

560
00:23:57,960 --> 00:24:02,700
to building reliable and

561
00:23:59,820 --> 00:24:04,980
maintainable systems uh set clear goals

562
00:24:02,700 --> 00:24:07,020
within your team for the observability with

563
00:24:04,980 --> 00:24:09,659
and uh

564
00:24:07,020 --> 00:24:11,400
like discuss on things like what aspects

565
00:24:09,659 --> 00:24:13,799
of your system do you want to monitor

566
00:24:11,400 --> 00:24:16,380
and what key metrics are the business

567
00:24:13,799 --> 00:24:19,200
critical to make sure your service is up

568
00:24:16,380 --> 00:24:22,580
and running and doing like solving

569
00:24:19,200 --> 00:24:25,679
business problems as expected

570
00:24:22,580 --> 00:24:27,360
uh the second thing is like using

571
00:24:25,679 --> 00:24:29,880
right tools and standardize the events

572
00:24:27,360 --> 00:24:31,740
format like use of observability right

573
00:24:29,880 --> 00:24:34,320
like right tools for the observability

574
00:24:31,740 --> 00:24:36,179
is important it can be a good investment

575
00:24:34,320 --> 00:24:38,940
that uh which can help you capture

576
00:24:36,179 --> 00:24:41,100
events logs metrics effectively uh these

577
00:24:38,940 --> 00:24:43,740
I have already discussed in my few like

578
00:24:41,100 --> 00:24:45,780
previous slides uh choose tools that

579
00:24:43,740 --> 00:24:47,340
kind of support visualization data

580
00:24:45,780 --> 00:24:50,159
alerting and the analysis of

581
00:24:47,340 --> 00:24:53,280
observability data these tools should be

582
00:24:50,159 --> 00:24:55,020
like encouraged and so that developers

583
00:24:53,280 --> 00:24:57,059
can maintain their code and also

584
00:24:55,020 --> 00:24:59,159
instrument their code as well events

585
00:24:57,059 --> 00:25:02,220
should follow a standardized format this

586
00:24:59,159 --> 00:25:03,780
consistency aids in later analysis and

587
00:25:02,220 --> 00:25:05,659
troubleshooting during production issues

588
00:25:03,780 --> 00:25:09,780
very much easily

589
00:25:05,659 --> 00:25:11,400
uh add automated alerts on the basis of

590
00:25:09,780 --> 00:25:13,220
their thresholds like once you have

591
00:25:11,400 --> 00:25:16,799
multiple panels ready you can add

592
00:25:13,220 --> 00:25:19,679
automatic alerts to detect anomalies in

593
00:25:16,799 --> 00:25:22,320
your metrics or make sure because like

594
00:25:19,679 --> 00:25:24,179
uh to make sure like there's nothing uh

595
00:25:22,320 --> 00:25:26,460
production impacting as such

596
00:25:24,179 --> 00:25:29,179
those alerts should be also relevant for

597
00:25:26,460 --> 00:25:29,179
the team as well

598
00:25:29,240 --> 00:25:34,080
uh this is another important thing post

599
00:25:31,980 --> 00:25:35,640
incident reviews uh conducting post

600
00:25:34,080 --> 00:25:37,380
incident reviews is a good practice

601
00:25:35,640 --> 00:25:39,360
every production incident should have a

602
00:25:37,380 --> 00:25:41,580
report known as the RCA which stands for

603
00:25:39,360 --> 00:25:43,440
root cause analysis which tells about

604
00:25:41,580 --> 00:25:46,080
production issue how the system was working

605
00:25:43,440 --> 00:25:48,419
which component of the system failed how

606
00:25:46,080 --> 00:25:50,880
it got fixed and outlining the action

607
00:25:48,419 --> 00:25:53,159
items to prevent the issue in the future

608
00:25:50,880 --> 00:25:55,620
overall RCA helps in understanding the

609
00:25:53,159 --> 00:25:58,700
root causes and helps identifying the

610
00:25:55,620 --> 00:25:58,700
areas for the improvement

611
00:25:59,419 --> 00:26:06,480
uh at the last uh also lead by the

612
00:26:03,179 --> 00:26:07,740
example and celebrate success as a

613
00:26:06,480 --> 00:26:09,779
leader of the team you should

614
00:26:07,740 --> 00:26:12,960
demonstrate the observability practices in

615
00:26:09,779 --> 00:26:14,460
your own work show the value of

616
00:26:12,960 --> 00:26:16,380
observability through the real life

617
00:26:14,460 --> 00:26:19,440
examples and some success stories

618
00:26:16,380 --> 00:26:22,159
sharing Tech blogs or share some

619
00:26:19,440 --> 00:26:24,960
learnings which you have recently solved

620
00:26:22,159 --> 00:26:26,820
uh create some documentations and

621
00:26:24,960 --> 00:26:30,299
resources outline some of the best

622
00:26:26,820 --> 00:26:32,100
practices tools and their usage make it

623
00:26:30,299 --> 00:26:34,260
easy for like team members to access

624
00:26:32,100 --> 00:26:37,500
those docs and refer them whenever

625
00:26:34,260 --> 00:26:39,179
required at last like don't forget to

626
00:26:37,500 --> 00:26:42,419
celebrate success because you are doing

627
00:26:39,179 --> 00:26:44,159
a much of the hard work and celebrate

628
00:26:42,419 --> 00:26:45,900
where observability driven practices have

629
00:26:44,159 --> 00:26:49,080
actually led to the quicker production

630
00:26:45,900 --> 00:26:51,679
issue resolution or enhanced the system

631
00:26:49,080 --> 00:26:51,679
performance

632
00:26:52,279 --> 00:26:57,900
uh that's it uh I would like to like end

633
00:26:55,980 --> 00:27:00,179
this with a note like remember that

634
00:26:57,900 --> 00:27:02,460
building an observability driven culture

635
00:27:00,179 --> 00:27:04,080
takes time and commitment and it

636
00:27:02,460 --> 00:27:05,640
requires an ongoing effort and the

637
00:27:04,080 --> 00:27:08,480
continuous improvement

638
00:27:05,640 --> 00:27:08,480
thank you so much

639
00:27:09,050 --> 00:27:13,980
[Applause]

640
00:27:12,480 --> 00:27:16,799
thank you for your time

641
00:27:13,980 --> 00:27:19,559
we have space for exactly one question

642
00:27:16,799 --> 00:27:23,299
before we have to take a break

643
00:27:19,559 --> 00:27:23,299
um if someone wants to raise their hands

644
00:27:33,299 --> 00:27:37,440
can't see any questions in the audience

645
00:27:35,039 --> 00:27:38,540
currently uh thank you so much can we

646
00:27:37,440 --> 00:27:45,359
get another round of applause

647
00:27:38,540 --> 00:27:45,359
[Applause]