1 00:00:00,480 --> 00:00:03,480 foreign 2 00:00:09,500 --> 00:00:14,820 The next session's title is "Developing culture 3 00:00:12,840 --> 00:00:16,920 to write reliable and 4 00:00:14,820 --> 00:00:19,560 performant services at scale". Our 5 00:00:16,920 --> 00:00:22,560 speaker is Harshit. 6 00:00:19,560 --> 00:00:24,180 Go ahead, take it away. 7 00:00:22,560 --> 00:00:27,840 Okay. 8 00:00:24,180 --> 00:00:31,140 Hi everyone, I hope everyone is 9 00:00:27,840 --> 00:00:33,600 doing well. Today I will be speaking on 10 00:00:31,140 --> 00:00:35,940 developing a culture to write reliable and 11 00:00:33,600 --> 00:00:38,520 performant services at scale, where I 12 00:00:35,940 --> 00:00:40,559 will be discussing 13 00:00:38,520 --> 00:00:41,879 observability: how it plays a 14 00:00:40,559 --> 00:00:44,160 crucial role in your software 15 00:00:41,879 --> 00:00:47,460 development cycle and how you can add 16 00:00:44,160 --> 00:00:50,160 that observability using Python, 17 00:00:47,460 --> 00:00:54,180 where your services are built using 18 00:00:50,160 --> 00:00:56,579 Python. 19 00:00:54,180 --> 00:00:58,739 I will also be highlighting the 20 00:00:56,579 --> 00:01:02,160 kind of culture which you can establish 21 00:00:58,739 --> 00:01:04,799 within your team or organization. 22 00:01:02,160 --> 00:01:08,640 Before starting with the talk, a 23 00:01:04,799 --> 00:01:10,080 brief introduction about myself. 24 00:01:08,640 --> 00:01:14,060 Currently I am a software engineer 25 00:01:10,080 --> 00:01:16,700 working at Blinkit. Blinkit is an 26 00:01:14,060 --> 00:01:19,500 Indian e-commerce company which is 27 00:01:16,700 --> 00:01:22,799 delivering products in minutes. I am from 28 00:01:19,500 --> 00:01:25,320 India, and I am a tech speaker and have 29 00:01:22,799 --> 00:01:26,820 been part of multiple conferences in the 30 00:01:25,320 --> 00:01:30,000 past as a speaker. 31 
00:01:26,820 --> 00:01:33,060 I'm an open source contributor, and also 32 00:01:30,000 --> 00:01:34,619 in my free time I try to explore 33 00:01:33,060 --> 00:01:35,880 mostly the cloud-native open 34 00:01:34,619 --> 00:01:38,400 source organizations and try to 35 00:01:35,880 --> 00:01:40,860 contribute to them. Apart from that, I was 36 00:01:38,400 --> 00:01:42,360 a Google Summer of Code 37 00:01:40,860 --> 00:01:43,860 student during my undergraduate 38 00:01:42,360 --> 00:01:47,820 studies. 39 00:01:43,860 --> 00:01:51,000 So let's begin with, first, 40 00:01:47,820 --> 00:01:53,399 why do we write software? 41 00:01:51,000 --> 00:01:57,060 We write software basically to solve 42 00:01:53,399 --> 00:01:59,399 problems, right? By solving problems we 43 00:01:57,060 --> 00:02:00,540 are making life easier for 44 00:01:59,399 --> 00:02:04,380 people. 45 00:02:00,540 --> 00:02:06,119 So software is helping make life 46 00:02:04,380 --> 00:02:08,959 easier for people. There are two 47 00:02:06,119 --> 00:02:11,220 kinds of software, if we can 48 00:02:08,959 --> 00:02:13,860 put it that way: one is good 49 00:02:11,220 --> 00:02:16,319 software and the other is bad software. 50 00:02:13,860 --> 00:02:19,560 Good software implies that 51 00:02:16,319 --> 00:02:21,420 you have a good amount of user 52 00:02:19,560 --> 00:02:24,360 traffic coming in and you have good 53 00:02:21,420 --> 00:02:26,340 engagement on your platform, which can 54 00:02:24,360 --> 00:02:27,900 directly impact your business profits in 55 00:02:26,340 --> 00:02:31,500 terms of revenue. 56 00:02:27,900 --> 00:02:33,959 Bad software implies that your code is 57 00:02:31,500 --> 00:02:36,660 not working as expected and 58 00:02:33,959 --> 00:02:39,540 there are some 59 00:02:36,660 --> 00:02:41,580 downtimes, which are causing business 60 00:02:39,540 --> 
00:02:44,879 loss in terms of revenue. 61 00:02:41,580 --> 00:02:47,540 Considering your software is 62 00:02:44,879 --> 00:02:51,720 good software, you can expect such 63 00:02:47,540 --> 00:02:54,599 systems to have, 64 00:02:51,720 --> 00:02:57,780 well, monthly 65 00:02:54,599 --> 00:03:00,060 active users increasing linearly. 66 00:02:57,780 --> 00:03:01,440 Users are using your software and 67 00:03:00,060 --> 00:03:03,480 they are enjoying it, they will recommend it 68 00:03:01,440 --> 00:03:05,519 to others, and this traffic will keep on 69 00:03:03,480 --> 00:03:07,319 increasing. This kind of high throughput 70 00:03:05,519 --> 00:03:09,120 must be handled smoothly to make sure 71 00:03:07,319 --> 00:03:11,220 people are enjoying it and you 72 00:03:09,120 --> 00:03:13,680 are making profits in your business. But it 73 00:03:11,220 --> 00:03:17,280 turns out that sometimes things don't go 74 00:03:13,680 --> 00:03:19,620 as expected. Here system 75 00:03:17,280 --> 00:03:21,659 reliability is important, where the expectations of a high 76 00:03:19,620 --> 00:03:24,659 number of users should be 77 00:03:21,659 --> 00:03:28,500 met without any frustrations, which 78 00:03:24,659 --> 00:03:32,400 can cause loss to the business. 79 00:03:28,500 --> 00:03:35,400 In the case of bad software, where 80 00:03:32,400 --> 00:03:40,200 production issues happen and the 81 00:03:35,400 --> 00:03:42,060 downtimes are severe, 82 00:03:40,200 --> 00:03:44,459 engineers try to understand the 83 00:03:42,060 --> 00:03:46,379 problem to resolve it. They try to 84 00:03:44,459 --> 00:03:49,500 understand the problem within 85 00:03:46,379 --> 00:03:52,159 their internal systems, and if they 86 00:03:49,500 --> 00:03:55,200 don't follow best practices, 87 00:03:52,159 --> 00:03:58,620 it's basically too late for them to 88 00:03:55,200 
--> 00:04:01,319 understand what the root cause 89 00:03:58,620 --> 00:04:03,420 of it is, and it becomes hard for them to 90 00:04:01,319 --> 00:04:04,440 figure out what's going on while the 91 00:04:03,420 --> 00:04:07,680 problem is already happening in 92 00:04:04,440 --> 00:04:10,739 production systems. So to bridge this 93 00:04:07,680 --> 00:04:13,980 kind of gap by making complex systems 94 00:04:10,739 --> 00:04:16,919 more transparent, we can set up 95 00:04:13,980 --> 00:04:19,799 monitoring of our systems based on 96 00:04:16,919 --> 00:04:21,720 observability-driven development. 97 00:04:19,799 --> 00:04:25,620 So, 98 00:04:21,720 --> 00:04:27,419 as we can see, monitoring is not 99 00:04:25,620 --> 00:04:28,979 the same thing as observability. 100 00:04:27,419 --> 00:04:31,860 They are very similar terms, but 101 00:04:28,979 --> 00:04:34,560 they are different. 102 00:04:31,860 --> 00:04:37,680 As per the Google 103 00:04:34,560 --> 00:04:39,660 SRE book, monitoring systems should 104 00:04:37,680 --> 00:04:41,940 answer two simple questions: 105 00:04:39,660 --> 00:04:44,460 what's broken, and why? 106 00:04:41,940 --> 00:04:46,139 So monitoring is a crucial 107 00:04:44,460 --> 00:04:48,960 thing. It involves building 108 00:04:46,139 --> 00:04:50,580 dashboards and setting alerts. It lets you 109 00:04:48,960 --> 00:04:53,100 know how your microservices are 110 00:04:50,580 --> 00:04:55,979 performing, and in the long term it helps 111 00:04:53,100 --> 00:04:59,400 you understand the traffic growth, the 112 00:04:55,979 --> 00:05:01,919 trends, and how a service is utilizing 113 00:04:59,400 --> 00:05:02,759 the machine resources on which it is 114 00:05:01,919 --> 00:05:06,240 running. 115 00:05:02,759 --> 00:05:08,220 Now comes the part where, in an ecosystem 116 00:05:06,240 --> 00:05:10,800 where multiple systems are 117 00:05:08,220 --> 
00:05:13,080 running, this kind of system is 118 00:05:10,800 --> 00:05:15,540 a distributed system. In 119 00:05:13,080 --> 00:05:17,460 distributed systems, 120 00:05:15,540 --> 00:05:19,500 monitoring becomes a big challenge and 121 00:05:17,460 --> 00:05:22,680 requires a deep dive into the internal 122 00:05:19,500 --> 00:05:25,860 states of each system, as inferred from their 123 00:05:22,680 --> 00:05:29,639 external outputs. This is where 124 00:05:25,860 --> 00:05:32,460 observability kicks in, and it 125 00:05:29,639 --> 00:05:35,280 provides an addition to monitoring. 126 00:05:32,460 --> 00:05:39,900 Consider observability as 127 00:05:35,280 --> 00:05:41,280 a code fitness tracker, 128 00:05:39,900 --> 00:05:43,680 where you are counting every 129 00:05:41,280 --> 00:05:45,960 heartbeat and the healthy usage activity, 130 00:05:43,680 --> 00:05:49,680 making sure that your software stays in 131 00:05:45,960 --> 00:05:51,300 top shape like a marathon runner. 132 00:05:49,680 --> 00:05:54,240 If there is no observability, 133 00:05:51,300 --> 00:05:57,180 there is no monitoring. 134 00:05:54,240 --> 00:06:00,840 So before deep diving into 135 00:05:57,180 --> 00:06:04,259 observability, it's important 136 00:06:00,840 --> 00:06:06,120 to know these terms, what the 137 00:06:04,259 --> 00:06:09,139 pillars of observability are, because 138 00:06:06,120 --> 00:06:12,000 these will help you to avoid 139 00:06:09,139 --> 00:06:14,340 anti-patterns in your software, and 140 00:06:12,000 --> 00:06:16,800 they will help you see what 141 00:06:14,340 --> 00:06:18,660 kind of strategies were missed during 142 00:06:16,800 --> 00:06:21,960 the development phase. Because if you don't 143 00:06:18,660 --> 00:06:25,080 follow these kinds of 144 00:06:21,960 --> 00:06:27,479 practices, it shows up as the 145 00:06:25,080 --> 00:06:28,680 
inability to meet the promised SLAs, 146 00:06:27,479 --> 00:06:31,319 difficulty in tracking the business 147 00:06:28,680 --> 00:06:34,020 metrics, and accepting poor 148 00:06:31,319 --> 00:06:37,020 performance as a trade-off. So 149 00:06:34,020 --> 00:06:38,580 these are very important 150 00:06:37,020 --> 00:06:40,680 pillars of observability. The 151 00:06:38,580 --> 00:06:42,960 three pillars are logging, 152 00:06:40,680 --> 00:06:45,419 metrics and events. I will be 153 00:06:42,960 --> 00:06:48,120 discussing these one by one first, 154 00:06:45,419 --> 00:06:50,220 before going into the culture 155 00:06:48,120 --> 00:06:52,440 building part, and I will be 156 00:06:50,220 --> 00:06:54,900 highlighting them using small 157 00:06:52,440 --> 00:06:57,120 snippets of Python code. 158 00:06:54,900 --> 00:06:59,100 So, looking first at logging, let's 159 00:06:57,120 --> 00:07:01,560 discuss logging. Logging 160 00:06:59,100 --> 00:07:05,400 helps in understanding the behavior of 161 00:07:01,560 --> 00:07:07,139 the service during runtime. 162 00:07:05,400 --> 00:07:08,759 Logs are the recorded pieces of 163 00:07:07,139 --> 00:07:11,580 information flowing through the service, 164 00:07:08,759 --> 00:07:13,440 and they are 165 00:07:11,580 --> 00:07:17,240 typically saved in JSON format, where 166 00:07:13,440 --> 00:07:20,340 developers can use some kind of pattern 167 00:07:17,240 --> 00:07:22,080 to match and see how their service is 168 00:07:20,340 --> 00:07:24,000 performing, depending on the use case, 169 00:07:22,080 --> 00:07:26,340 on how they are using their logs. There 170 00:07:24,000 --> 00:07:28,740 are four log levels, 171 00:07:26,340 --> 00:07:31,319 as you can see: there is debug, 172 00:07:28,740 --> 00:07:33,240 info, warning and error. So debug stands 173 00:07:31,319 --> 00:07:36,419 for, 
when you are doing some kind of 174 00:07:33,240 --> 00:07:38,520 debugging, and mostly these 175 00:07:36,419 --> 00:07:40,860 types of logs are used 176 00:07:38,520 --> 00:07:42,660 in local development. Then 177 00:07:40,860 --> 00:07:44,580 the second one is info, 178 00:07:42,660 --> 00:07:47,160 which is used for general 179 00:07:44,580 --> 00:07:49,740 purpose logging. Then come the 180 00:07:47,160 --> 00:07:53,280 warning logs, which are used to 181 00:07:49,740 --> 00:07:55,979 tell you, hey, this is not 182 00:07:53,280 --> 00:07:59,639 that critical but can be 183 00:07:55,979 --> 00:08:02,880 troublesome in the near future. And the 184 00:07:59,639 --> 00:08:05,160 fourth one is the error log, which is 185 00:08:02,880 --> 00:08:07,740 mostly used to signify errors 186 00:08:05,160 --> 00:08:10,080 in the application. The best practice 187 00:08:07,740 --> 00:08:12,120 to organize your logs for a Python-based 188 00:08:10,080 --> 00:08:15,780 application is to have, first, 189 00:08:12,120 --> 00:08:18,419 the module name, to identify quickly from 190 00:08:15,780 --> 00:08:20,819 which module the error got 191 00:08:18,419 --> 00:08:22,319 reported. Then comes the timestamp, 192 00:08:20,819 --> 00:08:25,800 which will tell you at what 193 00:08:22,319 --> 00:08:28,139 time the log was reported. And then 194 00:08:25,800 --> 00:08:30,120 comes the process ID. The process ID is 195 00:08:28,139 --> 00:08:32,459 optional, it depends on whether you want 196 00:08:30,120 --> 00:08:35,580 to add it or not. It is helpful when 197 00:08:32,459 --> 00:08:38,339 your systems are running 198 00:08:35,580 --> 00:08:39,719 multiple processes, and on the basis of 199 00:08:38,339 --> 00:08:42,839 that you can 200 00:08:39,719 --> 00:08:46,860 easily tell which 201 00:08:42,839 --> 00:08:49,140 process ID 
logged this thing, etc. Then 202 00:08:46,860 --> 00:08:51,240 comes the log level. The log level 203 00:08:49,140 --> 00:08:52,980 again tells you if it is 204 00:08:51,240 --> 00:08:54,600 info, warning or error. And at the last, 205 00:08:52,980 --> 00:08:57,959 the message tells you 206 00:08:54,600 --> 00:09:01,620 in detail what's happening in 207 00:08:57,959 --> 00:09:05,040 the system. 208 00:09:01,620 --> 00:09:07,260 So here's a small example where I have 209 00:09:05,040 --> 00:09:09,480 set up a logger. You can see there's a 210 00:09:07,260 --> 00:09:14,700 configure logger, and I have set a level of 211 00:09:09,480 --> 00:09:17,580 info. As I 212 00:09:14,700 --> 00:09:20,820 presented in my previous slide, 213 00:09:17,580 --> 00:09:22,740 the order of log levels is debug, info, 214 00:09:20,820 --> 00:09:24,660 warning and error. Here I have 215 00:09:22,740 --> 00:09:27,420 set my level starting from the info level, 216 00:09:24,660 --> 00:09:30,540 so it means that whenever I 217 00:09:27,420 --> 00:09:33,600 am running my system 218 00:09:30,540 --> 00:09:36,480 in production, it won't be logging those 219 00:09:33,600 --> 00:09:38,459 logs which are at debug level. Those are 220 00:09:36,480 --> 00:09:40,740 mostly for local development purposes 221 00:09:38,459 --> 00:09:43,019 and they won't be getting 222 00:09:40,740 --> 00:09:46,560 stored on the production system. So it 223 00:09:43,019 --> 00:09:47,779 will be logging info, warning 224 00:09:46,560 --> 00:09:50,940 and then error. 225 00:09:47,779 --> 00:09:52,800 You can see at the right side, 226 00:09:50,940 --> 00:09:54,480 at the top, that's the 227 00:09:52,800 --> 00:09:57,420 standard format of the log which comes out. 228 00:09:54,480 --> 00:09:59,700 However, the best practice is to 229 00:09:57,420 --> 00:10:02,160 follow the 
JSON structured logs, which can 230 00:09:59,700 --> 00:10:05,220 be helpful in plotting 231 00:10:02,160 --> 00:10:08,279 monitoring panels in the form of graphs to understand 232 00:10:05,220 --> 00:10:11,220 the trends in your logs, like the status of 233 00:10:08,279 --> 00:10:12,720 the API calls. Again, as I 234 00:10:11,220 --> 00:10:15,540 mentioned earlier, 235 00:10:12,720 --> 00:10:18,300 developers can use these logs to match 236 00:10:15,540 --> 00:10:20,880 on some kind of pattern, and they can 237 00:10:18,300 --> 00:10:24,240 aggregate the log data on the basis of 238 00:10:20,880 --> 00:10:26,220 some time intervals, that is, I 239 00:10:24,240 --> 00:10:28,680 would say, sample their data, 240 00:10:26,220 --> 00:10:34,019 and then they can create their own 241 00:10:28,680 --> 00:10:36,000 monitoring panels. A good example of a 242 00:10:34,019 --> 00:10:39,660 monitoring panel: let's say 243 00:10:36,000 --> 00:10:43,800 you have an API which logs 244 00:10:39,660 --> 00:10:45,240 status codes, 5xx, 4xx or 2xx. You can 245 00:10:43,800 --> 00:10:47,399 plot this trend by using the status 246 00:10:45,240 --> 00:10:49,860 code of the API, which you are logging 247 00:10:47,399 --> 00:10:52,140 after the request has been performed, 248 00:10:49,860 --> 00:10:55,079 and you can understand the behavior 249 00:10:52,140 --> 00:10:57,200 of your API, how many 4xx you are 250 00:10:55,079 --> 00:10:57,200 getting. 251 00:10:57,660 --> 00:11:04,079 So this is another concept. Let's 252 00:11:01,560 --> 00:11:05,940 say you have a distributed 253 00:11:04,079 --> 00:11:08,100 environment. With distributed systems 254 00:11:05,940 --> 00:11:09,720 there can be a scenario where multiple 255 00:11:08,100 --> 00:11:12,120 instances are running and your 256 00:11:09,720 --> 00:11:15,899 system is handling a lot of requests 257 00:11:12,120 --> 00:11:18,180 from users at scale. Now imagine 
you 258 00:11:15,899 --> 00:11:20,399 get an issue raised that one of the 259 00:11:18,180 --> 00:11:22,500 users is getting affected while using your 260 00:11:20,399 --> 00:11:25,140 software, and you want to debug the root 261 00:11:22,500 --> 00:11:27,899 cause of it. Directly looking into the 262 00:11:25,140 --> 00:11:30,060 requests will be very difficult 263 00:11:27,899 --> 00:11:32,640 because, imagine, you 264 00:11:30,060 --> 00:11:35,100 are getting millions of requests and you 265 00:11:32,640 --> 00:11:38,339 are checking for just one user, or 266 00:11:35,100 --> 00:11:40,019 a set of users, who 267 00:11:38,339 --> 00:11:43,440 are being 268 00:11:40,019 --> 00:11:45,899 affected. So to make sure you 269 00:11:43,440 --> 00:11:47,760 are looking into the right request, we 270 00:11:45,899 --> 00:11:49,260 use the concept of the trace ID. Trace 271 00:11:47,760 --> 00:11:52,620 IDs are helpful to track 272 00:11:49,260 --> 00:11:55,200 specific requests from start to 273 00:11:52,620 --> 00:11:57,060 end, reflecting how your system 274 00:11:55,200 --> 00:11:59,220 processed that particular request, from when it 275 00:11:57,060 --> 00:12:00,839 was received by the system till the 276 00:11:59,220 --> 00:12:01,980 acknowledgement was sent to the client 277 00:12:00,839 --> 00:12:04,140 side. 278 00:12:01,980 --> 00:12:06,660 At request level the trace ID is always 279 00:12:04,140 --> 00:12:11,160 unique. You can simply look up the 280 00:12:06,660 --> 00:12:13,140 trace ID for the user and fetch the logs, 281 00:12:11,160 --> 00:12:14,760 and it will be very simple to 282 00:12:13,140 --> 00:12:18,300 understand what's affecting the 283 00:12:14,760 --> 00:12:19,740 user. This is a small 284 00:12:18,300 --> 00:12:23,640 piece of code where I'm trying to 285 00:12:19,740 --> 00:12:25,320 simulate two requests with 286 
00:12:23,640 --> 00:12:28,380 different trace IDs. At the right side 287 00:12:25,320 --> 00:12:31,140 you can see that request one is 288 00:12:28,380 --> 00:12:33,660 one session, where 289 00:12:31,140 --> 00:12:35,660 the trace ID is unique for that 290 00:12:33,660 --> 00:12:39,420 particular session. Then the second 291 00:12:35,660 --> 00:12:41,579 request is another request 292 00:12:39,420 --> 00:12:44,940 and another session, where the trace ID 293 00:12:41,579 --> 00:12:47,579 is unique for that case also. 294 00:12:44,940 --> 00:12:48,899 So here, adding a trace ID 295 00:12:47,579 --> 00:12:51,060 reduces the mean time 296 00:12:48,899 --> 00:12:52,800 to resolve a production issue. 297 00:12:51,060 --> 00:12:56,579 This is another best 298 00:12:52,800 --> 00:13:00,959 practice which you can use to 299 00:12:56,579 --> 00:13:02,220 reduce the time to debug production 300 00:13:00,959 --> 00:13:04,500 issues. 301 00:13:02,220 --> 00:13:06,600 But there are certain limitations of 302 00:13:04,500 --> 00:13:08,940 logging. Extensive logging can 303 00:13:06,600 --> 00:13:11,339 generate large volumes of data, leading 304 00:13:08,940 --> 00:13:13,139 to storage challenges, and these storage 305 00:13:11,339 --> 00:13:15,899 challenges can gradually increase the 306 00:13:13,139 --> 00:13:19,260 cost of running the infra, which is not a 307 00:13:15,899 --> 00:13:21,240 good thing. Careless logging practices 308 00:13:19,260 --> 00:13:23,100 can lead to sensitive information leaks, 309 00:13:21,240 --> 00:13:25,700 which can raise security 310 00:13:23,100 --> 00:13:29,100 concerns, which is again not good. 311 00:13:25,700 --> 00:13:32,160 Log noise is another thing. If, 312 00:13:29,100 --> 00:13:34,220 within your team, 313 00:13:32,160 --> 00:13:38,579 you don't 314 00:13:34,220 --> 00:13:40,200 kind 
of establish some standards, like 315 00:13:38,579 --> 00:13:42,180 your logs should be of this kind of 316 00:13:40,200 --> 00:13:44,639 format, then they can be in a random format and 317 00:13:42,180 --> 00:13:46,800 they can sometimes be a bit noisy with 318 00:13:44,639 --> 00:13:48,420 excess information, which is 319 00:13:46,800 --> 00:13:50,040 not helpful when you are debugging a 320 00:13:48,420 --> 00:13:52,800 production issue. 321 00:13:50,040 --> 00:13:54,720 Also, logging does not provide a 322 00:13:52,800 --> 00:13:57,240 quantitative measurement of the system 323 00:13:54,720 --> 00:14:00,899 behavior. Quantitative measurements 324 00:13:57,240 --> 00:14:03,180 are like the CPU or the memory your system 325 00:14:00,899 --> 00:14:05,760 requires, and these things can 326 00:14:03,180 --> 00:14:08,279 help in resource planning for 327 00:14:05,760 --> 00:14:09,660 running your systems at optimal infra 328 00:14:08,279 --> 00:14:12,079 cost. 329 00:14:09,660 --> 00:14:15,180 So 330 00:14:12,079 --> 00:14:17,760 this is where metrics come to the 331 00:14:15,180 --> 00:14:20,040 rescue. Metrics are the 332 00:14:17,760 --> 00:14:21,779 quantitative measurement of the systems, 333 00:14:20,040 --> 00:14:24,540 to understand how the system is performing. 334 00:14:21,779 --> 00:14:27,139 They provide numerical and statistical 335 00:14:24,540 --> 00:14:30,480 insights, making it easier to track 336 00:14:27,139 --> 00:14:33,360 performance, detect anomalies and measure 337 00:14:30,480 --> 00:14:35,700 trends. They also help in resource 338 00:14:33,360 --> 00:14:39,360 planning, as I mentioned in my 339 00:14:35,700 --> 00:14:42,720 previous slide, where they 340 00:14:39,360 --> 00:14:46,139 can give you a better picture of how 341 00:14:42,720 --> 00:14:48,839 the 342 00:14:46,139 --> 00:14:52,680 system on which your 343 00:14:48,839 --> 00:14:55,199 instance is running is doing, 344 00:14:52,680 
--> 00:14:58,079 how much CPU it is consuming, how much 345 00:14:55,199 --> 00:14:59,940 memory, and a lot of other 346 00:14:58,079 --> 00:15:03,720 things, 347 00:14:59,940 --> 00:15:05,220 etc. Apart from these, there are 348 00:15:03,720 --> 00:15:07,199 four golden signals which are very 349 00:15:05,220 --> 00:15:09,240 important for your software, which I 350 00:15:07,199 --> 00:15:10,800 think I should cover. One is 351 00:15:09,240 --> 00:15:12,899 latency, which describes how the 352 00:15:10,800 --> 00:15:14,760 system is performing at a granular 353 00:15:12,899 --> 00:15:18,060 level and how much time requests are taking 354 00:15:14,760 --> 00:15:19,800 to get processed by the server. Then 355 00:15:18,060 --> 00:15:23,880 traffic, or throughput, defines 356 00:15:19,800 --> 00:15:26,339 how many requests your system is 357 00:15:23,880 --> 00:15:29,339 receiving per minute or per 358 00:15:26,339 --> 00:15:31,560 second. Then comes the error rate, 359 00:15:29,339 --> 00:15:34,199 which is about the 5xx 360 00:15:31,560 --> 00:15:36,600 errors in your application, and that can 361 00:15:34,199 --> 00:15:38,160 be due to a recent deployment or 362 00:15:36,600 --> 00:15:40,440 the malfunctioning of an external 363 00:15:38,160 --> 00:15:42,860 service or the database on which your 364 00:15:40,440 --> 00:15:45,720 service is dependent. 365 00:15:42,860 --> 00:15:47,279 Then comes saturation. Saturation 366 00:15:45,720 --> 00:15:50,579 is the main thing which tells you about 367 00:15:47,279 --> 00:15:53,639 the quantitative measurements: CPU, 368 00:15:50,579 --> 00:15:55,920 memory, disk IOPS, etc. 369 00:15:53,639 --> 00:15:57,660 So let's look into one of the 370 00:15:55,920 --> 00:16:00,660 examples. Here is one 371 00:15:57,660 --> 00:16:03,000 example which I have picked 372 00:16:00,660 --> 00:16:04,500 from the 
official documentation of New 373 00:16:03,000 --> 00:16:07,019 Relic. New Relic is a third 374 00:16:04,500 --> 00:16:09,660 party tool which is used for plotting 375 00:16:07,019 --> 00:16:10,740 the metrics for your services. It's a 376 00:16:09,660 --> 00:16:14,100 kind of 377 00:16:10,740 --> 00:16:16,440 APM-based third-party tool. Here you 378 00:16:14,100 --> 00:16:19,260 can see it gives a proper summary 379 00:16:16,440 --> 00:16:23,279 of your service throughput, error 380 00:16:19,260 --> 00:16:26,940 rates, and how much time your 381 00:16:23,279 --> 00:16:28,860 overall service APIs are 382 00:16:26,940 --> 00:16:30,360 taking, the transaction time 383 00:16:28,860 --> 00:16:32,699 actually. 384 00:16:30,360 --> 00:16:36,139 So this is the overview of how 385 00:16:32,699 --> 00:16:39,120 the golden signals look. 386 00:16:36,139 --> 00:16:40,680 They can be more granular, 387 00:16:39,120 --> 00:16:44,820 your metrics can be refined further. 388 00:16:40,680 --> 00:16:47,519 Here, let's say, is 389 00:16:44,820 --> 00:16:49,980 a very small example to show how 390 00:16:47,519 --> 00:16:52,680 things work. As I 391 00:16:49,980 --> 00:16:55,440 mentioned, with New Relic 392 00:16:52,680 --> 00:16:57,720 you can not only see the throughput, 393 00:16:55,440 --> 00:17:00,480 error rate or 394 00:16:57,720 --> 00:17:04,199 transactions, but you can also 395 00:17:00,480 --> 00:17:09,419 see segments. Let's say you have one 396 00:17:04,199 --> 00:17:12,000 API and you want to have metrics for some 397 00:17:09,419 --> 00:17:14,939 pieces of code on which your API is 398 00:17:12,000 --> 00:17:18,360 dependent. Let's say this is one 399 00:17:14,939 --> 00:17:19,860 such code flow. 400 00:17:18,360 --> 00:17:23,160 This is the conference hall manager, 401 00:17:19,860 --> 
00:17:24,780 where there are 402 00:17:23,160 --> 00:17:26,819 two methods, book and 403 00:17:24,780 --> 00:17:29,460 occupied. Book tells you whether 404 00:17:26,819 --> 00:17:31,679 your conference hall is available for 405 00:17:29,460 --> 00:17:34,620 booking or not, and occupied whether the conference hall is 406 00:17:31,679 --> 00:17:38,360 actually occupied or not. So 407 00:17:34,620 --> 00:17:41,760 considering this is working at, 408 00:17:38,360 --> 00:17:44,100 let's say, a traffic 409 00:17:41,760 --> 00:17:47,880 throughput in the millions, 410 00:17:44,100 --> 00:17:49,980 things get very difficult to 411 00:17:47,880 --> 00:17:51,780 understand, like how much time this piece of 412 00:17:49,980 --> 00:17:54,539 code might be taking. This is where 413 00:17:51,780 --> 00:17:57,000 I can use these function traces, 414 00:17:54,539 --> 00:18:00,000 and I can get the average number of 415 00:17:57,000 --> 00:18:02,640 calls and how much time it is taking. 416 00:18:00,000 --> 00:18:05,820 This is just an example for understanding 417 00:18:02,640 --> 00:18:08,280 purposes, but if your service 418 00:18:05,820 --> 00:18:11,400 has a business layer and you want 419 00:18:08,280 --> 00:18:14,640 to look into how 420 00:18:11,400 --> 00:18:16,919 your algorithm is doing this piece of 421 00:18:14,640 --> 00:18:19,320 work, how much time it is taking, how 422 00:18:16,919 --> 00:18:21,960 much memory it is utilizing for X 423 00:18:19,320 --> 00:18:24,299 calls per minute, then these kinds of 424 00:18:21,960 --> 00:18:27,480 function traces are really helpful. 425 00:18:24,299 --> 00:18:30,360 Another example is using StatsD, 426 00:18:27,480 --> 00:18:33,419 where StatsD is another tool which 427 00:18:30,360 --> 00:18:37,679 I can use in Python. Let's say I 428 00:18:33,419 --> 00:18:41,340 want some function to have, 
let's 429 00:18:37,679 --> 00:18:43,200 say I want 430 00:18:41,340 --> 00:18:45,419 to 431 00:18:43,200 --> 00:18:48,960 report the latency of a running function 432 00:18:45,419 --> 00:18:51,360 and then see it. You can use 433 00:18:48,960 --> 00:18:53,580 a latency calculation: the StatsD 434 00:18:51,360 --> 00:18:55,559 timer can be used to measure the 435 00:18:53,580 --> 00:18:57,539 latency of a particular 436 00:18:55,559 --> 00:19:00,480 function which is being run in your code. 437 00:18:57,539 --> 00:19:04,080 Apart from that, let's say there's 438 00:19:00,480 --> 00:19:05,340 an instance in your 439 00:19:04,080 --> 00:19:07,799 business, some logic in your 440 00:19:05,340 --> 00:19:10,140 business layer, where it is 441 00:19:07,799 --> 00:19:12,419 emitting some 442 00:19:10,140 --> 00:19:15,720 kind of messages or events to a 443 00:19:12,419 --> 00:19:17,880 queue. Let's say this queue is your SQS queue, 444 00:19:15,720 --> 00:19:19,440 and you want to know how 445 00:19:17,880 --> 00:19:21,240 many are the successful enqueues and how 446 00:19:19,440 --> 00:19:23,039 many are the failed enqueues. You can 447 00:19:21,240 --> 00:19:26,640 easily 448 00:19:23,039 --> 00:19:28,740 emit that kind of data using the 449 00:19:26,640 --> 00:19:32,340 StatsD client and you can just plot it, 450 00:19:28,740 --> 00:19:34,740 wherever you plot your metrics, and 451 00:19:32,340 --> 00:19:36,299 you can visualize your complete data, how 452 00:19:34,740 --> 00:19:37,679 things are working, and you can 453 00:19:36,299 --> 00:19:42,660 understand if there's something going 454 00:19:37,679 --> 00:19:45,419 wrong. Then you can 455 00:19:42,660 --> 00:19:48,600 outline some action items on it. 456 00:19:45,419 --> 00:19:51,120 So this was 457 00:19:48,600 --> 00:19:52,740 about the metrics.
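A minimal sketch of the timer and the enqueue counters described above, assuming nothing beyond the Python standard library. `TinyStatsd`, `enqueue`, the `booking` prefix and the metric names are hypothetical and are not the code from the slides; the class only formats StatsD-style lines (`name:value|type`, with `c` for counters and `ms` for timers) and sends them over UDP to an assumed local agent on port 8125.

```python
import socket
import time
from contextlib import contextmanager


class TinyStatsd:
    """Hypothetical minimal client that speaks the StatsD text protocol over UDP."""

    def __init__(self, host="127.0.0.1", port=8125, prefix="booking"):
        self.addr = (host, port)  # assumed address of a local StatsD agent
        self.prefix = prefix
        self.sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        self.sent = []  # copy of every line, kept only so the sketch can be inspected

    def _send(self, name, value, kind):
        line = f"{self.prefix}.{name}:{value}|{kind}"
        self.sent.append(line)
        try:
            self.sock.sendto(line.encode("ascii"), self.addr)
        except OSError:
            pass  # emitting a metric should never break the request path

    def incr(self, name, count=1):
        # Counter: "c" is the StatsD counter type.
        self._send(name, count, "c")

    @contextmanager
    def timer(self, name):
        # Timer: reports elapsed wall-clock time in milliseconds ("ms").
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            self._send(name, f"{elapsed_ms:.3f}", "ms")


def enqueue(client, queue, message):
    """Push a message to a queue (a plain list stands in for SQS here)."""
    with client.timer("enqueue.latency"):
        try:
            if message is None:
                raise ValueError("empty message")
            queue.append(message)
            client.incr("enqueue.success")
            return True
        except ValueError:
            client.incr("enqueue.failed")
            return False
```

Calling `enqueue(TinyStatsd(), [], {"ticket": "student"})` emits one success counter and one latency timer; a dashboard can then plot successful against failed enqueues over time. In a real service a maintained StatsD client library would normally replace this hand-rolled class.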
There are a few 458 00:19:51,120 --> 00:19:56,280 kinds of limitations in metrics as 459 00:19:52,740 --> 00:19:58,740 well. They provide 460 00:19:56,280 --> 00:20:02,580 very limited context, they don't 461 00:19:58,740 --> 00:20:04,260 provide very rich context. 462 00:20:02,580 --> 00:20:06,000 Metrics mostly focus on numerical 463 00:20:04,260 --> 00:20:08,580 values and they provide insights into 464 00:20:06,000 --> 00:20:10,980 trends. What they are 465 00:20:08,580 --> 00:20:13,620 lacking is the information 466 00:20:10,980 --> 00:20:15,539 necessary to fully understand the 467 00:20:13,620 --> 00:20:17,640 reason behind certain values: why 468 00:20:15,539 --> 00:20:21,059 this is happening, why latency is so high, 469 00:20:17,640 --> 00:20:23,940 why my messages didn't get enqueued, 470 00:20:21,059 --> 00:20:27,000 why they failed. These are the kinds of 471 00:20:23,940 --> 00:20:29,100 things which metrics don't answer. 472 00:20:27,000 --> 00:20:31,260 Another thing is 473 00:20:29,100 --> 00:20:33,000 metric overload, where 474 00:20:31,260 --> 00:20:34,559 tracking too many metrics 475 00:20:33,000 --> 00:20:36,179 can lead to information overload, 476 00:20:34,559 --> 00:20:37,200 making it difficult to focus on what's 477 00:20:36,179 --> 00:20:39,720 important. 478 00:20:37,200 --> 00:20:42,240 And then over-optimization: 479 00:20:39,720 --> 00:20:44,940 decision making that depends solely on metrics 480 00:20:42,240 --> 00:20:48,900 can lead to over-optimization. 481 00:20:44,940 --> 00:20:50,580 This is where events come into the 482 00:20:48,900 --> 00:20:51,720 picture. Events are a 483 00:20:50,580 --> 00:20:53,760 fundamental component of 484 00:20:51,720 --> 00:20:56,160 observability, but they serve a slightly 485 00:20:53,760 --> 00:20:58,320 different purpose 
compared to logs uh 486 00:20:56,160 --> 00:21:00,120 they kind of provide a they kind of 487 00:20:58,320 --> 00:21:01,740 provide rich information like they 488 00:21:00,120 --> 00:21:05,520 will actually tell you like why the 489 00:21:01,740 --> 00:21:07,380 latency was increased and why the uh 490 00:21:05,520 --> 00:21:09,360 basically the 491 00:21:07,380 --> 00:21:11,760 messages we are enqueuing why 492 00:21:09,360 --> 00:21:13,740 they are failing and so on so 493 00:21:11,760 --> 00:21:15,539 they kind of include metadata and 494 00:21:13,740 --> 00:21:19,140 structured data like timestamps event 495 00:21:15,539 --> 00:21:21,120 types additional attributes etc so these 496 00:21:19,140 --> 00:21:23,580 kind like these events are actually 497 00:21:21,120 --> 00:21:25,380 helpful when you want to track even more 498 00:21:23,580 --> 00:21:27,299 granular level of business related 499 00:21:25,380 --> 00:21:30,299 events like how many users are able to 500 00:21:27,299 --> 00:21:32,940 view the products how many uh products 501 00:21:30,299 --> 00:21:35,039 are getting uh added to the cart for 502 00:21:32,940 --> 00:21:38,520 example like in case of e-commerce 503 00:21:35,039 --> 00:21:41,280 application uh etc these events can be 504 00:21:38,520 --> 00:21:43,860 used for the analytical purpose to support 505 00:21:41,280 --> 00:21:45,600 decision making and drive business and 506 00:21:43,860 --> 00:21:48,299 it will also help you understand like 507 00:21:45,600 --> 00:21:51,600 what is actually uh impacting the 508 00:21:48,299 --> 00:21:54,059 business and how you can improve it 509 00:21:51,600 --> 00:21:56,280 these events can be pushed like into 510 00:21:54,059 --> 00:21:57,539 columnar databases which 511 00:21:56,280 --> 00:21:59,880 are actually used for the analytical 512 00:21:57,539 --> 00:22:01,679 purposes and you can understand the 513 00:21:59,880 --> 00:22:03,780 internal states of the application
by 514 00:22:01,679 --> 00:22:06,059 querying on the large events data set 515 00:22:03,780 --> 00:22:08,280 some of the examples of the columnar 516 00:22:06,059 --> 00:22:10,380 databases are like Apache Cassandra and 517 00:22:08,280 --> 00:22:13,080 Amazon Redshift 518 00:22:10,380 --> 00:22:16,799 so uh one example of the events 519 00:22:13,080 --> 00:22:19,740 is like this uh where I will be uh like 520 00:22:16,799 --> 00:22:21,539 this is a kind of a schema uh for like 521 00:22:19,740 --> 00:22:23,940 booking events for the PyCon conference 522 00:22:21,539 --> 00:22:25,740 and uh it has this user ID event type 523 00:22:23,940 --> 00:22:28,200 action action will tell you about the 524 00:22:25,740 --> 00:22:29,880 order placed order canceled then there's 525 00:22:28,200 --> 00:22:34,799 ticket type student professional 526 00:22:29,880 --> 00:22:36,179 hobbyist whether so uh kind of uh this 527 00:22:34,799 --> 00:22:39,000 this is again like a small piece of code 528 00:22:36,179 --> 00:22:40,980 where I'm using a traditional RDS just 529 00:22:39,000 --> 00:22:42,780 for sake of example however like when 530 00:22:40,980 --> 00:22:45,539 you're working at scale 531 00:22:42,780 --> 00:22:47,700 um I would recommend like uh columnar 532 00:22:45,539 --> 00:22:52,080 databases are like much better for this 533 00:22:47,700 --> 00:22:54,539 use case uh then comes like here at the 534 00:22:52,080 --> 00:22:56,159 right bottom you can see the uh ticket 535 00:22:54,539 --> 00:22:57,720 booking event I am creating where I'm 536 00:22:56,159 --> 00:23:00,900 passing the request context request 537 00:22:57,720 --> 00:23:03,720 context will be having the user ID and 538 00:23:00,900 --> 00:23:06,900 the uh other metadata which is required 539 00:23:03,720 --> 00:23:09,000 for reporting the events and in the 540 00:23:06,900 --> 00:23:10,380 set event attributes I am kind of 541 00:23:09,000 --> 00:23:12,360 actually 542 00:23:10,380 -->
00:23:14,880 setting like what was the action whether 543 00:23:12,360 --> 00:23:16,860 the order was placed or canceled or 544 00:23:14,880 --> 00:23:19,740 what was the ticket type was it student 545 00:23:16,860 --> 00:23:22,020 professional or hobbyist and at the end 546 00:23:19,740 --> 00:23:25,080 like I'm using the emit method which is 547 00:23:22,020 --> 00:23:29,640 kind of emitting my complete data into 548 00:23:25,080 --> 00:23:32,820 the RDS so this is like the overall uh 549 00:23:29,640 --> 00:23:34,080 example of the events uh now we have 550 00:23:32,820 --> 00:23:35,760 covered like almost all the three 551 00:23:34,080 --> 00:23:37,919 pillars of the observability now let's 552 00:23:35,760 --> 00:23:40,919 take a look into how to build and drive 553 00:23:37,919 --> 00:23:44,640 that culture within your team 554 00:23:40,919 --> 00:23:47,820 uh first thing comes like uh education 555 00:23:44,640 --> 00:23:50,820 like educating the team is uh very much 556 00:23:47,820 --> 00:23:53,640 important for uh like 557 00:23:50,820 --> 00:23:55,799 very much important you have to teach 558 00:23:53,640 --> 00:23:57,960 your team the importance of the 559 00:23:55,799 --> 00:23:59,820 observability and how it 560 00:23:57,960 --> 00:24:02,700 contributes to building reliable and 561 00:23:59,820 --> 00:24:04,980 maintainable systems uh set clear goals 562 00:24:02,700 --> 00:24:07,020 within your team for the observability 563 00:24:04,980 --> 00:24:09,659 and uh 564 00:24:07,020 --> 00:24:11,400 like discuss on things like what aspects 565 00:24:09,659 --> 00:24:13,799 of your system do you want to monitor 566 00:24:11,400 --> 00:24:16,380 and what key metrics are business 567 00:24:13,799 --> 00:24:19,200 critical to make sure your service is up 568 00:24:16,380 --> 00:24:22,580 and running and doing like solving 569 00:24:19,200 --> 00:24:25,679 business problems as expected 570 00:24:22,580 --> 00:24:27,360 the second
thing is like using 571 00:24:25,679 --> 00:24:29,880 right tools and standardizing the events 572 00:24:27,360 --> 00:24:31,740 format like use of the right 573 00:24:29,880 --> 00:24:34,320 like right tools for the observability 574 00:24:31,740 --> 00:24:36,179 is important it can be a good investment 575 00:24:34,320 --> 00:24:38,940 that uh which can help you capture 576 00:24:36,179 --> 00:24:41,100 events logs metrics effectively uh these 577 00:24:38,940 --> 00:24:43,740 I have already discussed in my few like 578 00:24:41,100 --> 00:24:45,780 previous slides uh choose tools that 579 00:24:43,740 --> 00:24:47,340 kind of support data visualization 580 00:24:45,780 --> 00:24:50,159 alerting and the analysis of 581 00:24:47,340 --> 00:24:53,280 observability data these tools should be 582 00:24:50,159 --> 00:24:55,020 like encouraged so that developers 583 00:24:53,280 --> 00:24:57,059 can maintain their code and also 584 00:24:55,020 --> 00:24:59,159 instrument their code as well events 585 00:24:57,059 --> 00:25:02,220 should follow a standardized format this 586 00:24:59,159 --> 00:25:03,780 consistency aids in later analysis and 587 00:25:02,220 --> 00:25:05,659 troubleshooting during production issues 588 00:25:03,780 --> 00:25:09,780 very much easily 589 00:25:05,659 --> 00:25:11,400 uh add automated alerts on the basis of 590 00:25:09,780 --> 00:25:13,220 their thresholds like once you have 591 00:25:11,400 --> 00:25:16,799 multiple panels ready you can add 592 00:25:13,220 --> 00:25:19,679 automatic alerts to detect anomalies in 593 00:25:16,799 --> 00:25:22,320 your metrics or make sure because like 594 00:25:19,679 --> 00:25:24,179 uh to make sure like there's nothing uh 595 00:25:22,320 --> 00:25:26,460 production impacting as such 596 00:25:24,179 --> 00:25:29,179 those alerts should be also relevant for 597 00:25:26,460 --> 00:25:29,179 the team as well 598 00:25:29,240 --> 00:25:34,080 uh this is another important thing post 599
00:25:31,980 --> 00:25:35,640 incident reviews uh conducting post 600 00:25:34,080 --> 00:25:37,380 incident reviews is a good practice 601 00:25:35,640 --> 00:25:39,360 every production incident should have a 602 00:25:37,380 --> 00:25:41,580 report known as the RCA which stands for 603 00:25:39,360 --> 00:25:43,440 root cause analysis which tells about 604 00:25:41,580 --> 00:25:46,080 the production issue how the system was working 605 00:25:43,440 --> 00:25:48,419 which component of the system failed how 606 00:25:46,080 --> 00:25:50,880 it got fixed and outlines the action 607 00:25:48,419 --> 00:25:53,159 items to prevent the issue in the future 608 00:25:50,880 --> 00:25:55,620 overall RCA helps in understanding the 609 00:25:53,159 --> 00:25:58,700 root causes and helps in identifying the 610 00:25:55,620 --> 00:25:58,700 areas for improvement 611 00:25:59,419 --> 00:26:06,480 uh at last uh also lead by 612 00:26:03,179 --> 00:26:07,740 example and celebrate success as a 613 00:26:06,480 --> 00:26:09,779 leader of the team you should 614 00:26:07,740 --> 00:26:12,960 demonstrate the observability practices in 615 00:26:09,779 --> 00:26:14,460 your own work show the value of 616 00:26:12,960 --> 00:26:16,380 observability through the real life 617 00:26:14,460 --> 00:26:19,440 examples and some success stories 618 00:26:16,380 --> 00:26:22,159 sharing tech blogs or share some 619 00:26:19,440 --> 00:26:24,960 learnings which you have recently solved 620 00:26:22,159 --> 00:26:26,820 uh create some documentation and 621 00:26:24,960 --> 00:26:30,299 resources outline some of the best 622 00:26:26,820 --> 00:26:32,100 practices tools and their usage make it 623 00:26:30,299 --> 00:26:34,260 easy for like team members to access 624 00:26:32,100 --> 00:26:37,500 those docs and refer to them whenever 625 00:26:34,260 --> 00:26:39,179 required at last like don't forget to 626 00:26:37,500 --> 00:26:42,419 celebrate success because you are doing 627 00:26:39,179 -->
00:26:44,159 so much of the hard work and celebrate 628 00:26:42,419 --> 00:26:45,900 where observability driven practices have 629 00:26:44,159 --> 00:26:49,080 actually led to the quicker production 630 00:26:45,900 --> 00:26:51,679 issue resolution or enhanced the system 631 00:26:49,080 --> 00:26:51,679 performance 632 00:26:52,279 --> 00:26:57,900 uh that's it uh I would like to like end 633 00:26:55,980 --> 00:27:00,179 this with a note like remember that 634 00:26:57,900 --> 00:27:02,460 building an observability driven culture 635 00:27:00,179 --> 00:27:04,080 takes time and commitment and it 636 00:27:02,460 --> 00:27:05,640 requires an ongoing effort and the 637 00:27:04,080 --> 00:27:08,480 continuous improvement 638 00:27:05,640 --> 00:27:08,480 thank you so much 639 00:27:09,050 --> 00:27:13,980 [Applause] 640 00:27:12,480 --> 00:27:16,799 thank you for your time 641 00:27:13,980 --> 00:27:19,559 we have space for exactly one question 642 00:27:16,799 --> 00:27:23,299 before we have to take a break 643 00:27:19,559 --> 00:27:23,299 um if someone wants to raise their hands 644 00:27:33,299 --> 00:27:37,440 can't see any questions in the audience 645 00:27:35,039 --> 00:27:38,540 currently uh thank you so much can we 646 00:27:37,440 --> 00:27:45,359 get another round of applause 647 00:27:38,540 --> 00:27:45,359 [Applause]
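Editor's note: the slides are not part of this transcript, so here is a minimal sketch of the two ideas the talk walks through: timing a function and counting successful versus failed enqueues with a stats client, and building a ticket booking event from a request context before emitting it to a relational store. Everything named here is an assumption for illustration: `InMemoryStats` is a hypothetical stand-in for a real StatsD client (the `statsd` Python package offers comparable `incr` and `timer` operations), `TicketBookingEvent`, the metric names and the `booking_events` table are invented, and an in-memory SQLite database stands in for the RDS mentioned in the talk.

```python
import sqlite3
import time
from collections import defaultdict
from contextlib import contextmanager


class InMemoryStats:
    """Hypothetical stand-in for a StatsD client; keeps metrics in memory."""

    def __init__(self):
        self.counters = defaultdict(int)
        self.timings = defaultdict(list)

    def incr(self, name, count=1):
        self.counters[name] += count

    @contextmanager
    def timer(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record elapsed milliseconds even if the timed body raised.
            self.timings[name].append((time.perf_counter() - start) * 1000.0)


def enqueue(message, send, stats):
    """Send a message, timing the call and counting success vs. failure."""
    with stats.timer("queue.enqueue.latency_ms"):
        try:
            send(message)
        except Exception:
            stats.incr("queue.enqueue.failed")
            return False
    stats.incr("queue.enqueue.success")
    return True


class TicketBookingEvent:
    """Hypothetical event: request context plus event attributes."""

    def __init__(self, request_context):
        self.row = {
            "user_id": request_context["user_id"],
            "event_type": "ticket_booking",
            "action": None,
            "ticket_type": None,
        }

    def set_event_attributes(self, action, ticket_type):
        self.row["action"] = action
        self.row["ticket_type"] = ticket_type

    def emit(self, conn):
        # Write the complete event as one row; SQLite stands in for the RDS.
        conn.execute(
            "INSERT INTO booking_events "
            "(user_id, event_type, action, ticket_type, created_at) "
            "VALUES (:user_id, :event_type, :action, :ticket_type, :created_at)",
            {**self.row, "created_at": time.time()},
        )
        conn.commit()


def flaky_send(message):
    """Hypothetical queue sender that fails for one specific message."""
    if message == "bad":
        raise RuntimeError("queue unavailable")


stats = InMemoryStats()
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE booking_events "
    "(user_id TEXT, event_type TEXT, action TEXT, ticket_type TEXT, created_at REAL)"
)

enqueue("ok", flaky_send, stats)
enqueue("bad", flaky_send, stats)

event = TicketBookingEvent({"user_id": "u-123"})
event.set_event_attributes(action="order_placed", ticket_type="student")
event.emit(conn)
```

In a real service the counters and timings would be sent to a metrics backend and plotted on dashboards, and at scale the events would more likely land in a columnar store, as the talk recommends.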