Squash Production Defects Quickly - The Power of Structured Logging in Rails


Summarized using AI

John Gallagher • September 13, 2024 • Sarajevo, Bosnia and Herzegovina • Talk

In the talk titled "Squash Production Defects Quickly - The Power of Structured Logging in Rails," John Gallagher focuses on enhancing the observability of Rails applications using structured logging, a feature introduced in Rails 8. The session is framed around the common issue developers face: debugging production incidents without sufficient visibility into application performance and errors.

Key Points:
- Observability Challenges: Gallagher begins with a relatable anecdote of receiving urgent notifications about password reset issues, illustrating the frustrations faced when logs fail to provide insights into application defects.
- The Current State of Logging: Traditional text-based logging is disorganized and makes it hard to track down issues efficiently. Gallagher cites his survey of 50 engineers, in which half reported lacking visibility into their application's performance or errors every month, and a third every week.
- Five-Step Process to Improve Observability:
  1. Determine the question that needs answering.
  2. Identify the necessary data to address that question.
  3. Build instrumentation to collect the data.
  4. Utilize the logs and graphs derived from that data.
  5. Continuously improve the logging process.

- Introduction to Structured Logging: Structured logging allows for a more organized logging approach, making it easier to query and analyze logs than traditional, unstructured logs. Gallagher emphasizes the ability to filter by queue and group by job class and duration to find answers more quickly.
- Implementation with Semantic Logger: Gallagher advocates using the Semantic Logger library for structured logging in Rails apps. He explains how to format logs as JSON for better data manipulation and visualization in observability tools.
- Real-World Application: Applying structured logging, Gallagher shares a real example where he identified a job queue delay, caused by a scraper, that was holding up password reset emails, leading to a resolution within minutes rather than hours.
- Results Achieved: Over 18 months, implementing these strategies led to a 98% reduction in downtime and a 20-fold increase in the speed of fixing bugs.

In conclusion, Gallagher's talk empowers developers to transition from reactive to proactive incident management through structured logging, ultimately fostering a culture of observability in software development.

Overall, the talk stresses the importance of systematic logging in enhancing application performance and reducing the time and effort spent on debugging production issues.

Squash Production Defects Quickly - The Power of Structured Logging in Rails
John Gallagher • Sarajevo, Bosnia and Herzegovina • Talk

Date: September 13, 2024
Published: January 13, 2025

Rails apps can be a black box. Ever seen a defect where you just can't figure out what's going on? This talk will give you practical steps to improve the observability of your Rails app, cutting the time to understand and fix defects from hours or days down to minutes.

Rails 8 brings an exciting new feature: built-in structured logging. This talk will delve into the transformative impact of structured logging on application observability and fixing defects.

Structured logging, a cornerstone of observability, offers a more organized and queryable way to handle logs than traditional text-based logs. This session will guide you through the nuances of structured logging in Rails, demonstrating how it can be leveraged to gain better insights into your application's behavior.

This talk will be a deep technical dive into how to make structured logging work with an existing Rails app.

EuRuKo 2024

00:00:10.160 so I'm here to talk about how to squash
00:00:12.679 production bugs
00:00:14.839 quickly I work at a company called
00:00:17.039 Bigger Pockets and this is based on a
00:00:20.199 true story names have been changed to
00:00:23.240 protect the
00:00:24.960 innocent so stop me if this sounds
00:00:27.119 familiar it's Friday afternoon and I'm
00:00:29.800 on call and I get a Slack notification
00:00:33.200 it's from the support channel uh oh uh
00:00:36.800 and another one looks like customers
00:00:39.840 cannot receive their password reset
00:00:42.360 emails uh that's a bit unusual oh and
00:00:44.760 another one oh uh
00:00:46.879 okay uh let's look at the logs
00:00:50.120 then huh it's not very good is it um
00:00:54.480 what about the code let's check the
00:00:56.559 code uh that seems fine
00:01:00.600 uh and that doesn't really help me
00:01:02.199 meanwhile the CTO is getting a bit
00:01:06.600 anxious and uh customers are complaining
00:01:09.520 uh they're taking to Twitter they're
00:01:11.560 locked out and they can't get into their
00:01:14.400 accounts fine so I drop everything that
00:01:16.840 I'm doing and I pair with a
00:01:19.159 colleague and these mailer jobs are
00:01:21.320 getting sent in a background queue so
00:01:23.320 let's add some error logging so we add
00:01:26.439 the error logging we spend 20 minutes
00:01:28.680 trying to figure out where to put it and
00:01:30.680 whilst we're doing
00:01:32.119 that oh uh that's a bit
00:01:35.320 odd it resets itself everything's fine
00:01:38.920 again okay well I mean we'll just fix it
00:01:41.960 later right uh we'll put it on the
00:01:44.759 backlog uh one second that's not the
00:01:47.960 backlog yeah no that's not the backlog
00:01:50.759 either uh this is the backlog we'll put
00:01:52.960 it on the
00:01:54.000 backlog done uh and that's definitely
00:01:56.680 going to get done right because we all
00:01:58.520 know what happens to things on the
00:02:00.799 backlog uh but it was a one-off issue
00:02:03.360 okay no problem uh we'll just uh wait a
00:02:07.320 minute two months later uh oh yep it's
00:02:11.400 back again Groundhog Day same problem
00:02:15.920 reset emails uh support team are getting
00:02:18.360 kind of overwhelmed but we added the
00:02:21.920 logging cool so it's fine we'll just
00:02:25.840 search uh in the logs oh dear still no
00:02:30.200 error so we'll look at the Sidekiq
00:02:32.560 admin dashboard right that'll
00:02:35.360 um yeah that'll help us somehow I don't
00:02:38.720 know and so we got into this vicious
00:02:41.080 cycle we would add more logging and we
00:02:43.680 ended up just banging our heads against
00:02:45.239 a brick wall we would look at the bug an
00:02:48.040 hour it would kind of fix itself we'd add more
00:02:50.480 logging it goes to the backlog and
00:02:53.080 customers were getting annoyed and it
00:02:54.680 was a bit embarrassing so I started to
00:02:56.760 think maybe this is just us maybe we're
00:02:58.840 the only company who has this
00:03:00.760 problem so I asked 50 Engineers I put
00:03:04.319 out a survey asking them this
00:03:07.519 question how often do you encounter
00:03:09.400 issues where you lack visibility into
00:03:12.120 your application's performance or errors
00:03:15.040 and the result kind of shocked me now
00:03:18.120 this is only 50 people this is not
00:03:20.840 people who use rails even this is just
00:03:23.120 across all
00:03:24.080 languages 50% of people say they
00:03:27.120 experience this every month and a third
00:03:30.159 of people say they experience it every
00:03:32.439 week okay so maybe we're not on our
00:03:36.000 own so next I want to talk a bit about
00:03:38.680 feelings how does it feel when you're in
00:03:40.840 this situation well for me first
00:03:43.680 reaction is I'm annoyed this should be
00:03:46.560 easy isn't Ruby supposed to be this
00:03:49.239 joyful language this doesn't feel very
00:03:51.159 joyful we're in 2024 and I can't figure
00:03:53.799 out what's going wrong in a production
00:03:56.360 incident this is not great and then I
00:03:58.799 feel a bit sad
00:04:00.400 I kind of get a bit bored and I'm just
00:04:02.200 kind of going through the logs kind of
00:04:04.959 looking at the error tracking looking at
00:04:06.680 stack traces still can't figure it out
00:04:09.920 and then the time moves on and pressure
00:04:12.439 starts to kick in this is an urgent
00:04:14.560 issue it needs to be resolved and then I
00:04:17.160 start to think hang on maybe this is
00:04:18.799 just me maybe I'm just a rubbish
00:04:21.199 engineer and hang on I've been doing
00:04:23.160 this for more than 10 years I've got to
00:04:24.840 go to my manager and say oh sorry I
00:04:27.280 can't figure this out it's not great
00:04:30.680 so if you've ever felt like this I've
00:04:33.039 got some good news for you the problem
00:04:34.960 is not you and you're not
00:04:37.199 alone and in order to combat this I came
00:04:39.639 up with this five-step process and here
00:04:43.039 it is the first thing is what is the
00:04:46.320 question you want to ask you decide what
00:04:50.600 data would help you answer that question
00:04:53.440 you build some instrumentation and
00:04:55.520 deploy it you use the graphs or the logs
00:04:59.479 and finally you improve and it's a
00:05:01.440 little cycle you can just keep going
00:05:03.320 round and round this cycle and if you
00:05:05.160 turn it on its side it's kind of got
00:05:07.160 some nice little steps so I call this
00:05:10.160 the steps to observable software or
00:05:13.199 SOS for short so the first thing we're
00:05:16.440 going to do is what question do we want
00:05:19.080 answered in this case I've taken here
00:05:21.520 why are password reset emails not being
00:05:25.199 sent now that's a pretty generic
00:05:27.479 question it's pretty vague so we need to
00:05:29.960 come up with some hypotheses why might
00:05:32.319 they not be sent well here are
00:05:34.080 three reasons just off the top of my
00:05:35.720 head maybe they're failing maybe they're
00:05:37.440 timing out or maybe they're being
00:05:38.800 delayed well they're probably not
00:05:40.160 failing because we don't see any errors
00:05:41.919 in our error tracking software same
00:05:44.600 thing for timing out so my guess is that
00:05:47.560 somehow they're being
00:05:49.759 delayed so what might this look like so
00:05:52.680 here's the theory maybe this is what a
00:05:54.280 healthy queue looks like the green boxes
00:05:57.759 are the mailer jobs and the gray boxes
00:06:00.919 are other jobs so it's nice and healthy
00:06:02.919 they're interspersed so it won't be long
00:06:04.639 until somebody gets a mail sent to them
00:06:07.000 and then this might be maybe what's
00:06:09.720 going on that there's lots of other jobs
00:06:12.120 being created and that's pushing
00:06:14.240 password reset emails off the end of the
00:06:17.400 off the end of the the time Horizon if
00:06:19.280 you like and I've actually experienced
00:06:21.840 this I recently booked a flight with an
00:06:24.880 airline who I shall not name and the
00:06:27.759 exact same problem happened I signed up
00:06:29.680 for the account I gave them my email
00:06:31.919 address and they said
00:06:33.840 you'll get an activation email shortly I
00:06:36.479 waited 1 Hour 2 hours 3 hours I went to
00:06:38.800 bed I got up the next morning and there
00:06:41.360 it was in my inbox after 4 hours and uh
00:06:45.800 it timed out after 10 minutes so that
00:06:47.759 was really useful wasn't it anyway so
00:06:51.160 this might be what's happening so how
00:06:53.280 can we ask a more specific question
00:06:55.319 around that well which jobs are taking
00:06:59.840 time in that same queue okay I think we
00:07:03.879 can answer that through some logging and
00:07:05.479 through some graphs so let's
00:07:07.360 first of all decide which data to
00:07:10.080 collect and when I'm doing this I tend
00:07:11.879 to look at these four different
00:07:13.479 dimensions which is the event that we
00:07:16.000 want we want to look at jobs that have
00:07:19.240 been performed right I think of logs as
00:07:22.560 events tiny little events it's just
00:07:24.520 saying this
00:07:25.960 happened second of all we want to filter
00:07:28.240 by the job queue
00:07:30.400 we want to group by the job class and we
00:07:32.280 want to look at the duration of that
00:07:35.000 okay seems pretty reasonable so how do
00:07:38.160 we build that so I want to take you
00:07:40.360 through right from having no structured
00:07:42.479 logging and really no observability at
00:07:44.360 all through how I've done that so here
00:07:47.120 are the four main areas first of all I'm
00:07:49.120 going to talk a little bit about which
00:07:50.759 data type to use there's a lot of
00:07:52.360 misinformation about these out there
00:07:54.560 should we use plain text or structured
00:07:57.080 I'm pretty sure you can guess the answer
00:07:58.800 because it's included in the title of the
00:08:00.280 presentation and finally I want to talk
00:08:02.759 about which logging library to use and
00:08:04.360 how to customize it so here are the
00:08:06.639 three data types these are not pillars if
00:08:09.960 anybody says these are the three pillars
00:08:12.120 of observability close the browser tab and
00:08:14.759 go and look at somebody else these are
00:08:16.520 not pillars they're three data types
00:08:20.000 traces are the best and metrics are the
00:08:22.520 worst and logs are okay we're going to
00:08:25.680 start out with logs purely because
00:08:27.919 everybody in this room probably knows
00:08:29.879 what logging is and it's also being
00:08:32.599 improved in Rails 8 so let's be a bit
00:08:36.080 pragmatic here second of all should we have
00:08:38.399 plain or structured we're going to use
00:08:40.360 structured logs and let me explain why
00:08:43.320 so here's what a plain text log looks
00:08:45.640 like in code I'm sure you've seen this
00:08:47.720 logger.info a string with a little bit of
00:08:51.320 data interpolated in the string okay so
00:08:54.519 if we used it in our observability tool
00:08:57.320 whether that's New Relic Datadog AppSignal
00:08:59.640 or any one of these here's what it might
00:09:01.560 look like a little bit so let's say we
00:09:03.839 want to search for mailer jobs we'd have
00:09:06.200 to use a regular expression okay fuzzy
00:09:10.399 find cool so how do I now see any job
00:09:13.399 that's taken more than one
00:09:15.560 second Ah that's right I can't it's
00:09:18.880 impossible here's what a structured log
00:09:21.440 on the other hand looks like you've got
00:09:23.959 a logger and you've got a structured
00:09:26.720 thing which is in this case a hash so
00:09:28.920 it's a bit like just having a
00:09:30.399 database imagine if you only stored
00:09:32.480 strings in a database some people do it
00:09:35.519 uh it's not such a great idea you
00:09:37.279 couldn't filter search query by anything
00:09:40.360 so this gives you all these different
00:09:42.600 attributes here's what it now looks like
00:09:44.640 in our UI we can make columns with these
00:09:47.200 attributes in and then we can filter on
00:09:50.120 these
00:09:51.040 attributes here's how we can filter by
00:09:53.640 it and this is great this is what we
00:09:56.000 want nice
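
A minimal sketch of that contrast in code, assuming Semantic Logger (the job and attribute names here are hypothetical, not the slide's):

```ruby
require "semantic_logger"

logger = SemanticLogger["Jobs"]  # assumes an appender is configured (see later)

# Plain text: the job class and duration are trapped inside one string,
# so an observability tool can only regex-match them
logger.info("Performed SendPasswordResetJob in 1.2s")

# Structured: each attribute is a separate, queryable field
logger.info(
  "Job performed",
  job_class:   "SendPasswordResetJob",
  queue:       "mailers",
  duration_ms: 1200
)
```
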
00:09:58.519 so the next thing is which logging library there are many in
00:10:00.920 Ruby here are some of my criteria to
00:10:03.360 choose uh it's got to be able to do
00:10:05.880 structured logging obviously it's got to
00:10:08.000 integrate with rails great documentation
00:10:10.000 that's a bit of a personal standard of
00:10:12.640 mine and then it's got to be relatively
00:10:15.560 mature now
00:10:17.519 obviously we all know that we're
00:10:19.480 incredibly grateful to all these authors
00:10:21.279 for making open-source gems that being said
00:10:24.480 when you compare them all Semantic
00:10:26.600 Logger for me is the best option here
00:10:32.839 so we're going to take that we're going
00:10:34.640 to install the Rails bindings for
00:10:36.639 Semantic Logger we're going to format
00:10:39.480 those logs as
00:10:40.920 JSON and when we do those two things we
00:10:43.839 get some automatic logging of Rails out
00:10:47.399 of the box by default it's not
00:10:50.200 brilliant it's just some absolute
00:10:53.399 basics
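
In code, those two steps amount to roughly this (a sketch, not the talk's exact slide):

```ruby
# Gemfile
gem "rails_semantic_logger"

# config/environments/production.rb
# emit each Rails log line as a single JSON object
config.rails_semantic_logger.format = :json
```
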
00:10:56.079 so here are the libraries that it logs by default and of these we really
00:10:59.480 care about the active job logging and
00:11:02.399 here's what it looks like in our logs
00:11:04.680 this is kind of a JSON object so you
00:11:06.360 can see the attributes that it
00:11:09.920 outputs here's the event name here's the
00:11:13.480 job class and queue and finally here's the
00:11:17.040 duration so it's all good we can move on
00:11:19.880 with our lives we can just install it
00:11:22.320 and we're all good to go right well not
00:11:24.160 quite because this approach has many
00:11:26.360 different weaknesses first one is
00:11:28.920 there's no conventions there's some
00:11:31.120 missing attributes and API requests are
00:11:33.360 not logged and lots more besides but I
00:11:35.920 just want to focus on those three so
00:11:38.519 we're all here partly because Rails is
00:11:41.320 convention over configuration that was
00:11:43.360 its big selling point it's delivered on
00:11:45.440 that except I would argue for logs which
00:11:48.760 is a little bit kind of left in the
00:11:50.360 weeds so any of these attribute names
00:11:54.040 is fine right we can have our hash
00:11:56.079 structured however we want uh let's use
00:11:59.240 job_class no I like job.class no I
00:12:02.000 want message.class and so we end up
00:12:04.320 with a mess and we all know where that
00:12:06.399 kind of leads us so enter OpenTelemetry
00:12:09.680 we can use this amazing Cloud Native Computing
00:12:11.760 Foundation project now there is an OTel
00:12:14.360 library for Ruby but it's just not ready
00:12:16.440 for production use yet it's not that
00:12:18.240 mature instead we can look at the
00:12:20.120 semantic conventions for OpenTelemetry
00:12:23.279 which is on their
00:12:24.480 website here is an example the name of
00:12:27.320 the queue it's maybe not that semantic
00:12:30.199 messaging.destination.name we can
00:12:32.959 maybe alias that in our observability
00:12:35.040 tools to something a little bit more
00:12:37.560 friendly so just as a reminder here's
00:12:39.880 what Semantic Logger gives you out of the
00:12:41.480 box and here's what it would look like
00:12:43.320 with these semantic standards from
00:12:45.959 OpenTelemetry so let's change the code to
00:12:49.279 help us do that so first of all we're
00:12:51.800 going to find the log subscriber
00:12:54.440 within Semantic Logger that's a little
00:12:56.000 bit hacky we're going to copy and paste
00:12:58.279 that into our code and then we're going
00:13:00.959 to change how the event payload is
00:13:04.000 formatted in that job
00:13:06.839 subscriber so you can see here don't
00:13:08.560 worry about the details but I've
00:13:10.199 restructured the hash to rename the
00:13:12.800 attributes and then finally we swap it
00:13:15.720 out in an initializer and this is
00:13:18.920 what we
00:13:20.800 get okay pretty good
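
A sketch of what such a replacement can look like; rather than copying the gem's own subscriber, this attaches a fresh ActiveSupport::LogSubscriber, and the class name and payload keys are assumptions reconstructed from the description, not the talk's exact code:

```ruby
# config/initializers/job_log_subscriber.rb
# An ActiveJob log subscriber whose payload follows
# OpenTelemetry's messaging.* semantic conventions.
class OtelJobLogSubscriber < ActiveSupport::LogSubscriber
  # handles "perform.active_job" notifications
  def perform(event)
    job = event.payload[:job]
    # with Semantic Logger installed, the Rails logger accepts a payload hash
    logger.info(
      "Job performed",
      "messaging.destination.name" => job.queue_name,   # the queue
      "messaging.message.id"       => job.job_id,
      "code.namespace"             => job.class.name,   # the job class
      "duration_ms"                => event.duration.round(1)
    )
  end
end

# Detach the default ActiveJob subscriber first (its class name depends on
# your logging setup), then attach this one in its place.
OtelJobLogSubscriber.attach_to :active_job
```
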
00:13:23.800 so let's talk about missing attributes here are the absolute
00:13:27.399 Basics I would argue to be able to
00:13:29.920 instrument and observe a rails app and
00:13:32.440 none of these are given to us out of the
00:13:34.720 box so I'm just going to show you how to
00:13:36.959 do two of these headers and user agent
00:13:41.040 so we can do this by adding into
00:13:43.279 application.rb log_tags
00:13:47.519 there are two ways of doing this you can
00:13:49.040 do it through a
00:13:50.199 Lambda or you can do it through kind of
00:13:53.279 a plain hash approach the plain hash
00:13:55.360 approach will take the request and call
00:13:58.320 this method on it and rename it
00:14:00.800 basically tag the logs with those
00:14:02.880 tags and here's what we get we can see
00:14:05.959 the extra attributes down here fantastic
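
A sketch of both forms, relying on rails_semantic_logger's hash support for log_tags (the tag names here are my choice, not the slide's):

```ruby
# config/application.rb
# a Symbol value is a method called on each ActionDispatch::Request;
# a lambda receives the request and can compute anything
config.log_tags = {
  user_agent: :user_agent,                                # request.user_agent
  ip:         :remote_ip,                                 # request.remote_ip
  referer:    ->(request) { request.headers["Referer"] }  # lambda form
}
```
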
00:14:09.079 so then the final thing is we
00:14:11.639 need to talk about missing API requests
00:14:14.240 so any API request very often they have
00:14:17.199 these kind of plain text loggers which
00:14:18.959 will spew all of this nonsense to your
00:14:21.040 instrumentation tool which is useless so
00:14:23.120 let's assume we're using Faraday which
00:14:25.279 as we all know is the best HTTP client
00:14:28.680 yeah nod along and so we define some
00:14:31.720 middleware in
00:14:33.199 Faraday we're going to register that
00:14:35.759 middleware and then use it and so now we
00:14:39.120 get structured logs for our API requests
00:14:42.360 and here's what it looks
00:14:44.759 like
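
A minimal sketch of such middleware; the class name, registration key, and attribute names are assumptions rather than the talk's exact code:

```ruby
# lib/faraday/structured_logger.rb
class StructuredLogger < Faraday::Middleware
  # emit one structured log event per outbound HTTP request
  def call(env)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    @app.call(env).on_complete do |response_env|
      elapsed_ms = (Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000
      SemanticLogger["Faraday"].info(
        "API request",
        "http.request.method"       => env.method.to_s.upcase,
        "url.full"                  => env.url.to_s,
        "http.response.status_code" => response_env.status,
        "duration_ms"               => elapsed_ms.round(1)
      )
    end
  end
end

# register it, then use it when building a connection
Faraday::Middleware.register_middleware(structured_logger: StructuredLogger)

conn = Faraday.new(url: "https://api.example.com") do |f|
  f.use :structured_logger
end
```
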
00:14:46.560 cool so the next thing we need to do is
00:14:49.560 send it to our observability tool so the
00:14:52.320 nice thing about Semantic Logger is it
00:14:54.560 has appenders so you can send
00:14:56.959 your logs to single or multiple
00:14:59.720 sources and because AppSignal is a very
00:15:02.680 generous sponsor of this conference I've
00:15:04.720 given an example of AppSignal right
00:15:06.920 here you can tell this is all the code
00:15:08.680 to make it work it's pretty simple it's
00:15:11.120 pretty clean and here's what it might
00:15:13.519 look like in the AppSignal interface
00:15:16.440 you've got all your attributes on the
00:15:17.880 left and the values on the right wow
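
A sketch of the appender setup; the AppSignal wiring shown is an assumption, so check AppSignal's Semantic Logger docs for the exact form:

```ruby
# config/initializers/semantic_logger.rb
# Semantic Logger fans out to every registered appender, so JSON logs
# can go to stdout and AppSignal at the same time.
SemanticLogger.add_appender(io: $stdout, formatter: :json)

# wrap any Logger-compatible object as an appender -- here AppSignal's
# logger from the appsignal gem (assumes AppSignal is configured)
SemanticLogger.add_appender(logger: Appsignal::Logger.new("rails"), formatter: :json)
```
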
00:15:21.399 okay that was a lot if you want any more
00:15:23.759 information you can go to this QR code
00:15:25.800 I'll be showing the same QR code at the
00:15:27.639 end of the presentation this has all the
00:15:29.839 slides on and a few goodies for
00:15:32.519 you so the next bit is a bit where I get
00:15:35.759 really excited we've built this
00:15:37.160 instrumentation we've deployed it now we
00:15:39.360 need to look at the graphs and it's
00:15:41.560 graphs where this stuff really
00:15:43.680 starts to uh get interesting so as a
00:15:46.680 reminder here are our four
00:15:50.399 dimensions and we've now output
00:15:53.360 all of this to logs so let's set the
00:15:55.079 time range and we want to see all the
00:15:57.240 logs inside the queue okay cool we filter
00:16:02.839 by the queue next we want the jobs
00:16:06.399 performed so now we've added that filter
00:16:09.000 in you can see the logs have shrunk down
00:16:12.000 next thing we want to do is group by
00:16:13.759 duration of job
00:16:15.880 class and we got this nice little
00:16:18.480 graph pretty cool the final thing we
00:16:21.880 need to do is improve so I'm quite a
00:16:25.560 critical person I'm quite a
00:16:27.040 perfectionist so I like to look at what
00:16:29.160 I've done and just rip it to shreds in
00:16:30.800 my own brain so it might well be we've
00:16:33.160 sent the wrong attributes by accident it
00:16:35.720 might be something's gone wrong with the
00:16:37.120 deploy or it might just be we look at
00:16:39.120 the graph and say well this is kind of
00:16:40.880 okay but it's not really what I wanted
00:16:43.319 and so this is your opportunity to
00:16:44.639 iterate to improve it to deploy it again
00:16:47.639 so we're just moving these very small
00:16:49.360 increments instead of having this
00:16:51.079 three-month project to say log
00:16:53.240 everything we're just going to go in
00:16:54.680 little uh chunks here and then
00:16:56.639 we're going to show our colleagues and
00:16:58.240 my experience of this is once you start
00:17:00.040 to show your colleagues what's possible
00:17:01.959 they'll say things like I don't
00:17:03.759 really care about all of this logging
00:17:05.079 stuff John but what I do care about is
00:17:08.400 something else and then you can work
00:17:10.079 with them to add what they care about
00:17:12.120 and then your colleagues start to get
00:17:13.480 interested in this and this is how we've
00:17:15.480 created a culture of observability
00:17:18.199 within Bigger Pockets you can learn what
00:17:20.600 they want what the business
00:17:22.679 needs and so next just as some examples
00:17:25.520 we could add job latency or maybe
00:17:27.240 request ID or maybe IP now we've been
00:17:30.039 round this cycle many times at Bigger
00:17:31.960 Pockets and I want to show you a
00:17:33.720 snapshot of what our experience now
00:17:35.880 looks like with the previous
00:17:39.760 error so I'm still on call but this time
00:17:42.480 it's Friday morning not afternoon and I
00:17:45.400 get a little
00:17:47.200 notification this isn't from a support
00:17:49.240 channel or an annoyed customer or my
00:17:51.400 boss it's from a monitor within the
00:17:53.679 observability tool and it's saying hang
00:17:56.039 on the latency of this queue is rising
00:17:59.159 uh you might want to do something about
00:18:00.760 this so we click a link in that Slack
00:18:02.919 message and this is what we
00:18:04.880 see that's a bit strange what's that
00:18:07.080 blue line ah it's within the five-
00:18:09.559 minute queue oh wow it's up to 15
00:18:12.080 minutes duration now it should be
00:18:14.120 completing within five it's breached its
00:18:16.559 SLI that's not good so let's ask a
00:18:20.960 question which jobs are taking the most
00:18:24.440 time okay so that's something we can
00:18:27.480 answer now we look for the jobs
00:18:30.520 performed in that same
00:18:33.400 queue group by job class sum the
00:18:36.760 duration and this is what we get we can
00:18:39.120 see this massive blue bar rising I
00:18:41.799 wonder what that is we hover over ah
00:18:44.919 interesting it's the analytics update
00:18:47.640 user visits job that's our answer we're
00:18:50.360 now maybe 60 seconds into the
00:18:53.400 incident
00:18:55.120 interesting I wonder what's enqueuing
00:18:57.200 those jobs hm
00:18:59.400 let's search for jobs enqueued in that same
00:19:02.760 queue and let's group by HTTP resource
00:19:06.480 now this is something I added into the
00:19:08.200 logs this is a combination of the
00:19:09.600 controller and the
00:19:11.559 action let's Group by all logs and here
00:19:15.799 we go the green bar is obviously the
00:19:17.919 problem here and that is oh profile show
00:19:22.080 we're two minutes in we now know the
00:19:23.960 problem is something to do with profile
00:19:26.720 show cool now we have this whole world
00:19:30.000 of other questions we can answer ask
00:19:32.360 okay uh I wonder what IP is hitting that
00:19:34.480 action it's an interesting question
00:19:37.440 we're going to filter by request handled
00:19:40.480 this is another event I added in that's
00:19:42.799 just basically you can do it using the
00:19:44.360 log tags technique I showed you
00:19:47.039 earlier and we're going to look at
00:19:49.559 anything that hit the profile show
00:19:51.440 controller we're going to group by
00:19:54.360 IP huh that's kind of interesting
00:19:58.799 that's the IP address why is that IP
00:20:01.200 address hitting us so many times hm
00:20:04.240 probably because it's scraping us so
00:20:07.000 let's just recap what we've learned
00:20:08.640 there's a scraper at the IP address it
00:20:10.880 hits profile show that enqueues a whole load
00:20:14.039 of analytics jobs and that delays
00:20:16.440 password reset emails not a fictional
00:20:18.679 example a real example from a real
00:20:20.640 production
00:20:22.440 codebase cool so what's the fix it's
00:20:24.840 pretty obvious we go into our
00:20:26.520 infrastructure we use Cloudflare
00:20:29.200 and we block the IP and the really cool
00:20:31.919 thing is we can now look at the
00:20:33.919 graph oh okay it's coming down all
00:20:37.120 sorted five minutes in and we're all
00:20:39.799 done how does this feel it feels joyful
00:20:43.640 I'm happy I've spent five minutes I
00:20:46.799 can still retain the context of what I
00:20:48.400 was working on before no customers have
00:20:50.159 been impacted my boss is happy and we've
00:20:53.400 recovered the
00:20:55.200 website this is kind of a bit of a
00:20:57.280 superpower I can feel like I can tackle
00:20:59.760 really anything my Rails app throws at me
00:21:02.559 and it also becomes addictive the thing
00:21:05.080 I didn't expect when I started this
00:21:06.799 journey two years ago was the moment you
00:21:09.159 start asking questions you start to get
00:21:11.960 curious about completely unrelated
00:21:13.919 things in the app recently I was in a
00:21:16.200 dashboard and I saw a flood of 404s
00:21:19.480 that's a bit strange that's kind of like
00:21:22.240 a quarter of all our traffic is 404s
00:21:25.159 clicked down and realized ah somebody's
00:21:27.720 going through all our profile pages
00:21:30.440 with usernames starting aaaa aaab and
00:21:33.400 so on and we're getting all this flood
00:21:35.440 of 404s we did the same thing blocked
00:21:38.039 them and all of a sudden our response
00:21:40.400 times are 19% faster just by accident
00:21:44.600 another Tuesday at the office and so
00:21:47.760 here are the results that we've got over
00:21:49.159 18 months our downtime is reduced by
00:21:52.799 98% our 500 errors are reduced by 83% we
00:21:57.120 still have hundreds of them so we're not
00:21:59.120 there yet and most of all we can fix
00:22:02.039 bugs 20 times faster so I've got some
00:22:05.600 resources for you uh there is a Playbook
00:22:08.200 here you can subscribe to my newsletter
00:22:10.679 for free of course there's a uh a cheap
00:22:13.279 Sidekiq course there I'm also running
00:22:16.200 a six-week hands-on workshop in
00:22:19.360 October it's uh 60% off until 9 p.m.
00:22:23.320 tonight there are two plans there and the
00:22:26.640 code is EURUKO
00:22:29.159 and thank you for your time and
00:22:31.080 attention