API Optimization Tale: Monitor, Fix and Deploy on Friday

Summarized using AI


Maciej Rzasa • October 21, 2021 • online • Talk

In the talk 'API Optimization Tale: Monitor, Fix and Deploy on Friday', Maciej Rzasa shares his experience deploying code in a complex system during a migration from REST to GraphQL inside a large monolithic Rails application. The presentation details the risks involved in large, late deployments — including a disastrous mid-week deploy that took production down — and what eventually made it safe to ship on a Friday afternoon.

Key Points Discussed:

- Background and Context: Maciek introduces the team he’s part of, which is responsible for service extraction within a large Rails application of over a million lines of code. The billing extraction team’s goal is to transition specific parts of the monolith into independent services without disrupting ongoing operations.

- Step-by-Step Extraction Strategy: The team adopted a gradual extraction method, using feature flags to switch between the original REST API and their experimental GraphQL API. This approach included implementing safe fallbacks if failures occurred.

- Monitoring and Performance Issues: The need for fine-grained monitoring became evident when the team hit timeouts and performance bottlenecks after enabling the new API for a small share of production traffic. Alongside New Relic and Sentry, they logged every remote request — method name, arguments, stack trace, response time, and errors — to their Elastic stack to troubleshoot effectively.

- Optimization Techniques: Maciek outlines various technical optimizations employed, including:

  - Caching: To avoid excessive queries, the team implemented caching for billing records, which significantly reduced the number of requests to the service.

  - Utilizing GraphQL: By switching to GraphQL, they were able to customize queries, reducing the amount of data returned and thus optimizing response times.

  - Addressing Query Performance: Through careful analysis, they identified heavy queries and minimized them. For instance, they avoided repeated queries in a loop by preloading necessary data.

- Learning from Mistakes: The talk contrasts technical knowledge with domain knowledge and walks through the mistakes made during implementation, including a severe production outage caused by overly strict parameter sanitization — a failure the team learned to prevent by adding manual test scenarios on top of automated specs.

- Safe Deployment Culture: Maciek concludes with a discussion of fostering a safe deployment culture in which mistakes are part of the learning process, emphasizing that failures are acceptable as long as the author is committed to fixing them.

The overarching message is that effective API optimization and deployment strategies require vigilance, knowledge, and the capacity to iterate based on failures and successes. By applying known patterns in innovative environments, developers can navigate the complexities of modern software architecture successfully.

API Optimization Tale: Monitor, Fix and Deploy on Friday
Maciej Rzasa • online • Talk

Date: October 21, 2021

I saw a green build on a Friday afternoon. I knew I needed to push it to production before the weekend. My gut told me it was a trap. I had already stayed late to revert a broken deploy. I knew the risk.

In the middle of a service extraction project, we decided to migrate from REST to GraphQL and optimize API usage. My deploy was a part of this radical change.

Why was I deploying so late? How did we measure the migration effects? And why was I testing on production? I'll tell you a tale of small steps, monitoring, and old tricks in a new setting. Hope, despair, and broken production included.

This talk was delivered at EMEA on Rails, a virtual mega-meetup which took place on June 9, 2021.

EMEA on Rails 2021

00:00:18.400 …and Fridays. A couple of weeks before, I had managed to deploy faulty code and put our production service down, and, believe me or not, I had to stay late after work to fix it. And believe me or not, I know no better fear than fixing broken production after work. I knew the risk. But why was it so important to deploy that fast? What gave me the confidence to deploy code on a Friday? And last but not least, why the heck was I planning to test something on production? We'll come to all of this, but before we do, we need to take a few steps back.
00:01:08.799 Toptal is a freelancer network that helps companies hire freelancers and scale their teams. I'm not a freelancer there; I work in the Core Team, and the Core Team develops our core tools. The heart of our core tools is the Toptal Platform: a Rails application of over a million lines of code. There are thousands of users using it, and there are hundreds of engineers developing this platform on a day-to-day basis. So it is a monolith, and it is majestic.
00:01:52.720 I work in the billing extraction team, because our monolith has its majestic share of issues and we plan to extract parts of it into services. Our team, the billing extraction team, tries to take the billing code — the billing domain — and put it into a separate service. We are on the bleeding edge of service extraction in our company, and we have to do it without disrupting day-to-day operations and without making the life of our fellow engineers harder. So a big-bang extraction is out of the question; we decided to use a step-by-step approach to make it safe and to avoid disruption.
00:02:41.760 The first step was to extract the billing code into a Rails engine and still communicate with it using regular Ruby calls. The next step was to actually test the external communication — the network communication. We implemented an async API using Kafka; we won't be talking about that here, even though it's very interesting. And we also added synchronous communication over HTTP and REST, and that's the topic of this talk.
00:03:22.400 To make it safe, we added two important features. The first was a feature flag that decided whether any given request should be routed through REST or should use a direct call. And in case the REST call failed, we added a safe fallback: if the REST call failed, we used a direct call as the fallback. To end users the call might be slower, but it was still correct.
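As a rough illustration — not the actual Toptal code — feature-flag routing with a direct-call fallback could look something like the sketch below; FeatureFlag, BillingRestClient, and Billing::DirectQueries are assumed names.

```ruby
# Hypothetical sketch of feature-flag routing with a safe fallback.
# `FeatureFlag`, `BillingRestClient` and `Billing::DirectQueries` are
# assumed names used only for illustration.
class BillingQueryService
  def billing_records_for(product_ids)
    if FeatureFlag.enabled?(:billing_over_rest)
      begin
        BillingRestClient.billing_records(product_ids: product_ids)
      rescue StandardError => e
        # Safe fallback: slower for the end user, but still correct.
        Rails.logger.warn("Billing REST call failed, falling back: #{e.message}")
        Billing::DirectQueries.billing_records(product_ids: product_ids)
      end
    else
      # Flag off: keep using the direct in-process call.
      Billing::DirectQueries.billing_records(product_ids: product_ids)
    end
  end
end
```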
00:04:02.879 That was our first approach. To dive a little bit deeper, let's take a look at the models. We had a Product class and a BillingRecord class — let's focus only on those two. Before the extraction they were associated using a regular ActiveRecord has_many relation. After the extraction we still had to connect the data from Product, which stayed on the platform, and BillingRecord, which was moved to the billing service, but we couldn't do it with ActiveRecord associations. So we added a method to Product that fetched billing records through a BillingQueryService wrapping all the communication, and we replaced the BillingRecord ActiveRecord model with a plain old Ruby object connected to Product not by an association but by product id.
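Roughly, the shape described here might look like the following sketch; the names and fields are assumptions, not the real code.

```ruby
# BillingRecord is no longer an ActiveRecord model but a plain Ruby
# object linked to Product by product_id instead of an association.
BillingRecord = Struct.new(:id, :product_id, :amount, keyword_init: true)

class Product < ApplicationRecord
  # Replaces `has_many :billing_records`: fetches records through the
  # service wrapper instead of an ActiveRecord association.
  def billing_records
    BillingQueryService.new.billing_records_for([id])
  end
end
```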
00:05:17.919 So before we even started, we had a safe environment where we could ensure that in case of any issues with our API we would fall back to a direct call, and where we could decide whether to use the remote calls via feature flags, without deployments. So we deployed it and enabled the HTTP calls for a small percentage of the API — 10 or 20 percent — and we tried, and we failed miserably. (Let's see… maybe this time… no, it fails every time.) The performance of this solution was abysmal: even for a small fraction of the traffic we saw a lot of timeouts, the code was generally slow, and we knew we had to change it massively.
00:06:17.440 And I think this is quite a good moment for a software engineer, right? Everybody loves those optimization projects where you open New Relic, take a look at the traces, optimize, and see how it worked. And that's exactly what didn't happen, because we realized we couldn't just optimize the current approach in the middle of our massive service-extraction refactoring. We decided we needed another massive change: switching from the HTTP REST API to a GraphQL API. This change we also performed in small steps. We introduced another layer of feature flags that allowed us to switch between the correct although slow HTTP API and the experimental GraphQL API, and we also added a fallback, so that in case the GraphQL call was slow, we would call the billing logic directly.
00:07:26.080 In addition to our standard monitoring — errors reported to Sentry, performance in New Relic — we decided we needed something more fine-grained. So we logged every request to our Elastic stack: its method name, arguments, stack trace, response time and, if needed, an error. It gave us high visibility: we were able to spot which methods were slow and recreate those slow calls locally, we were able to see the source of slow calls or of multiple calls so we could notice when we had N+1 requests, and we were able to spot errors, reproduce them locally, and fix them with confidence.
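A hedged sketch of what such per-request logging could look like; the actual logger, the transport to Elastic, and the field names used at Toptal are not shown in the talk, so everything below is an assumption.

```ruby
# Wraps each remote billing call and logs method name, arguments, call
# site, response time and (optionally) the error as structured JSON.
module BillingApiLogger
  def self.instrument(method_name, args)
    started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    result = yield
    log(method_name, args, started, error: nil)
    result
  rescue StandardError => e
    log(method_name, args, started, error: e)
    raise
  end

  def self.log(method_name, args, started, error:)
    elapsed_ms = ((Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000).round
    Rails.logger.info(
      {
        event: "billing_api_call",
        method: method_name,
        args: args,
        caller: caller(2, 5),            # source of the call, to spot N+1s
        response_time_ms: elapsed_ms,
        error: error&.message
      }.to_json
    )
  end
end

# Usage:
# BillingApiLogger.instrument(:billing_records, product_ids: ids) do
#   BillingRestClient.billing_records(product_ids: ids)
# end
```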
00:08:27.599 So we started with those two strong pillars: a safe deployment environment and good monitoring that allowed us to control and to decide whether we really improved our solution. With those two strong pillars, let's dive into the main phase: the optimization phase. The first thing we noticed was that several views or Sidekiq jobs generated a lot of billing requests — it could be thousands of billing requests from rendering a single page or running a single Sidekiq job. It was unbearable for the billing service; it wasn't reasonable to serve that. So what happened there? It was exactly like an N+1 issue with database access. It was not easy to spot, because when we took a look at the job, it was just iterating over products in batches and calling business logic on them; but when we looked into the business logic, it was using billing records for products. So for every iteration in the job we were calling the billing service, and that generated the flood of requests — N+1 requests.
00:10:02.000 So what we did: we decided to resign from this active-record-like approach where we make an external request whenever we access data; instead, we should preload the data one level higher. So we implemented caching — we preloaded and cached the billing records inside the job. Let's take a look at this "cache billing records" method. It did three things: first, it fetched records from the billing query service; then it indexed them by product id; and using this index it assigned billing records to their respective products. Because of that, when we made the actual billing business-logic call, we weren't doing additional requests to the billing query service. So instead of calling the billing service once per iteration, we were calling it once per batch, which means that instead of making thousands of requests we made a dozen or half a dozen. That was definitely a big improvement. What we learned here is that if we have a flood of requests, we need to recreate what the database — or the ORM — does: we need to preload data, cache it, and join it with the data we have locally. What we implemented really resembles the hash join algorithm known from databases. So that was the first issue.
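A rough sketch of that batch-level caching, assuming a cached_billing_records accessor on Product and a hypothetical BillingBusinessLogic; the point is one remote call per batch plus an in-memory hash join by product id.

```ruby
# One remote call per batch, then an in-memory "hash join" by product_id.
def cache_billing_records(products)
  # 1. Fetch all billing records for the batch in a single remote call.
  records = BillingQueryService.new.billing_records_for(products.map(&:id))

  # 2. Index them by product_id (the "build" side of a hash join).
  records_by_product_id = records.group_by(&:product_id)

  # 3. Assign records to their products, so later business logic reads
  #    from memory instead of calling the billing service again.
  #    Assumes Product defines `attr_accessor :cached_billing_records`.
  products.each do |product|
    product.cached_billing_records = records_by_product_id.fetch(product.id, [])
  end
end

# Inside the job: one billing call per batch instead of one per product.
Product.where(active: true).in_batches(of: 500) do |batch|
  products = batch.to_a
  cache_billing_records(products)
  products.each { |product| BillingBusinessLogic.run(product) }
end
```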
00:12:04.320 Another issue we noticed was that there was a single field requested about a thousand times per day, and in the worst hours it was hundreds per hour. So it was a lot, and we weren't able to preload it, because it was requested across different requests. So what we did here was devise a very smart plan. We decided we could do smart caching: add this field to our Kafka events, build a projection on the platform side, and after backfilling historical data start using this projection — this read model — and then remove the billing query. That was smart, that was fancy, and I was really happy to start working on it.
00:12:57.760 But then, after the first few steps, Samuel — a colleague who is really into the billing domain — came and said: "Ah, guys, I guess we have this field on the platform side; let's check it." At that moment I was really annoyed, because wow, the Kafka approach is such a nice one, and we'd be switching to this other approach that's not so fancy — and this billing read model is so cool, we could even talk about it at a conference, right? But yeah, we decided to check Samuel's approach, and it turned out really well, because we really were able to use this local field. So instead of spending two weeks building the solution with Kafka, we spent two or three days — mostly talking with developers who know the billing domain and with product people, and running huge SQL queries comparing the data from those two sources — and we replaced the external call with a single database call. When we deployed it, the flood of requests was gone. And that was the better solution, because it was not only faster to build but also more maintainable: instead of many lines of code and asynchronous communication, we had a one-liner with a single database query. What I learned here is that while technical excellence can help us win optimization battles, domain knowledge can let us avoid those battles altogether. And that's really wise.
00:15:07.040 After those two issues we noticed another one. It wasn't about making too many queries; it was about making queries that were really, really slow — at least some of them. What was the reason? When we were building the REST API we wanted a minimal surface, so it was a universal API returning every piece of data that could possibly be needed on the platform side. What does that mean? It means we were returning about 40 fields and about three or four associations with potentially hundreds of records. Whenever we needed something on the platform side, we were fetching all of that, and for our worst products, with a lot of billing data, we were sending a lot of data over the network. And to make it even worse, we had filtering on the platform side — on the client side.
00:16:16.639 So what we did: we leveraged the power of GraphQL, and this was the moment when the decision to move to GraphQL really paid off. After implementing the GraphQL types on the server side, we were able to create customized queries fetching only the data that was really needed, without additional changes on the server side. So instead of fetching 40 fields and hundreds of associated objects, we were fetching four fields — and it was bloody fast. And of course we implemented filtering on the server side, to fetch only the data that is needed. It looked like it worked like a charm during our tests.
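For illustration only — the real billing schema isn't shown in the talk — a customized, underfetching query from the platform side might look roughly like this, with BillingGraphqlClient and all field names being assumptions.

```ruby
# Fetch only the four fields we need, with a server-side filter,
# instead of ~40 fields plus associations.
BILLING_RECORDS_QUERY = <<~GRAPHQL
  query($productIds: [ID!]!, $status: BillingStatus) {
    billingRecords(productIds: $productIds, status: $status) {
      id
      productId
      amount
      dueOn
    }
  }
GRAPHQL

records = BillingGraphqlClient.query(
  BILLING_RECORDS_QUERY,
  variables: { productIds: product_ids, status: "PENDING" }
)
```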
00:17:12.000 And I must say that while implementing this I was really into the process: we monitor, we decide how to optimize, we optimize, we deploy to production, and then we watch our stats get better. It was a really rewarding experience with a short feedback cycle, and I expected something similar here — even bigger than usual, because this change was significant. So I merged it and waited to see the stats. By the way, do you know what the hottest day last summer was, at least in the northern part of the globe? I remember it well: it was the 26th of July, a Tuesday.
00:18:12.000 I was waiting for the stats to get better. I enabled the feature flag and was casually checking Slack, when someone wrote that Sidekiq workers were consuming too much memory. I thought: oh, poor infra guys, they'll have to work on this. Then someone else said we had a failing staging deploy, and I thought: damn, someone was reckless and managed to merge faulty code, that's very bad. And then someone told me directly: man, I think it was your code that's faulty. And I thought: ah, it can't be, it's not possible, it's tested. But I started checking it, and then someone from the business side asked us if the platform was okay, and then someone told us: folks, the platform is down. And I knew it was my pull request that did it, and I felt cold sweat on my back. It was the hottest day last summer, Tuesday, the 26th of July. How bad was it? It was really, really bad. We consumed about 20 times more memory than usual, and CPU utilization was about 10 to 20 times higher than in normal times — normal times are so flat that they are almost invisible on those charts. And to make it worse, queries to Postgres took about a minute. A minute to query Postgres — that was quite terrible.
00:19:54.559 And then two good things happened. First, the infra folks reacted without my intervention: when they determined it was the faulty build, they reverted it and confirmed the platform had returned to normal operations, and then they assisted me in reverting it on master and deploying the new version, so we had a clean master again. And the second thing: the very next day, as I started preparing a fully-fledged fix, my manager approached me. You can imagine what a manager could say in this case — a little bit of blame casting, or at least "oh, be careful next time". What he said was: good job. I was really surprised — was that really for me? And he said: yeah, it was bad that the platform was down, but it happens, and as long as you stay to fix it, it's okay, it's a good job. And I realized it's also part of our safe deployment strategy: it's okay to make a mistake as long as you are committed to fixing it. That was a really good experience.
00:21:24.720 So what really happened? While implementing this feature I did a small refactoring and added parameter sanitization, and I was a little bit too eager: three types of parameters should have been allowed, but I allowed only one. So if someone requested billing records for a client, that parameter was set to nil because of the sanitization; and with no parameters, the call filtered nothing — we were asking for all billing records and then for all associated objects, and those are millions of records. That's why we killed Postgres, and then we killed our workers. I don't think we even reached the moment when all the objects were instantiated, serialized, and pushed over the network; I basically killed the servers first. The fix was pretty simple: if the list of parameters is empty, return nothing instead of returning everything. So we resigned from the SQL default of returning everything when there is no WHERE clause.
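A simplified sketch of that guard clause, assuming billing records are regular ActiveRecord models on the billing-service side; the class and parameter names are illustrative.

```ruby
# Billing-service side: an empty (over-sanitized) id list must return
# nothing, not everything.
class BillingRecordsQuery
  def call(product_ids: nil)
    ids = Array(product_ids).compact
    # Guard clause: without it, `where` with no conditions behaves like
    # SQL with no WHERE clause and returns every record.
    return BillingRecord.none if ids.empty?

    BillingRecord.where(product_id: ids)
  end
end
```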
00:22:52.559 How did it happen that this slipped through our safe deployment strategy, our CI, and so on? First of all, I was testing only one of those parameters in my unit tests, and it happened to be the parameter that was correctly sanitized. Second of all, we had a feature flag for this, and I switched it off when we noticed the production issues — but the bug had been introduced in the REST branch, the safe, battle-tested one, so I made it even worse. If I had switched to GraphQL it would have been better, but I didn't know that at the time. So that's how it happened.
00:23:45.360 Of course I fixed the sanitization. And what was the result of our fix? Here is the situation before the fix, here is the chasm while I was fixing the issue, and after the fix the response time is not only better but also more predictable — underfetching and custom queries really helped us here. But what I really learned is that no matter how much I trust myself, my unit tests, my feature flags, and my reviewers, I should always test manually. We have this field in the pull request template — "how to test this" — and in this case I simply put "trust the specs", and I think that was a mistake. I should have provided a simple test scenario in the Rails console with some expected log output or something, because I believe we could have caught it on development machines or during reviews. Without this manual test strategy, I failed — I broke production.
00:25:07.600 Okay, that was the bad part; let's do something a little better now, right? The fixes were committed. After all those changes we were able to enable the remote traffic for a bigger percentage of our requests, and we noticed that every Sunday we had 429 Too Many Requests errors reported by our clients. What happened every Sunday? We were first scheduling reminders and then sending them at 5 p.m., and while scheduling we were doing a lot of requests to the billing service. So we fixed it by preloading — something we knew well by then — and we deployed it and waited for Sunday, because we weren't able to reproduce the issue locally: it was generated by nginx in our production setup. And on Monday morning we saw the same issue. Why was that?
00:26:24.240 The error was raised not only during scheduling but also while sending the reminders. We thought we would be covered, because the reminders were sent in our talent's — our users' — time zones, so not all of them at the same 5 p.m.; we thought, okay, it's split into 24 buckets and it's fine. The issue is that a quarter of our talent live in the same time zone, so when the evening comes in that time zone, we were sending too many requests to billing. We couldn't move our talent to other time zones, so what I did was move the reminders a little bit: instead of sending them at 5 p.m. sharp, I jittered them, adding a jitter of two minutes, which we calculated should be okay. So we deployed it and waited for Sunday — and on Monday morning we saw the same error. Why was that? We had missed one more reminder. So we did the same, and on Monday morning: the same error again. It took a month to fix this, because we had to wait a week every time. And this time we realized it wasn't a problem with our domain knowledge; this time it was a technical detail of Sidekiq. We thought Sidekiq would run our scheduled jobs with one-second resolution, but in fact Sidekiq does it with roughly 15-second resolution. So instead of having about 100 small buckets of requests, we had eight big buckets of requests, and that was the issue. So we took a limiter from Sidekiq Enterprise, implemented window limiting, deployed it, and waited for Sunday. And it worked, and it was marvelous.
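A minimal sketch of the jitter idea, with assumed job and scope names; the Sidekiq Enterprise window limiter that ultimately fixed the problem is not shown here.

```ruby
# Instead of enqueueing every reminder at 5 p.m. sharp, give each job a
# random delay of up to two minutes. Note that Sidekiq's scheduler polls
# roughly every 15 seconds, so this alone spreads the jobs into only a
# handful of buckets -- which is why a rate limiter was still needed.
JITTER = 2 * 60 # seconds

talents_due_reminder.find_each do |talent|
  ReminderJob.perform_in(rand(0..JITTER), talent.id)
end
```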
00:28:31.600 And I must admit — I must confess — that it was during this saga that I saw a green build on a Friday and knew it had to be deployed before Sunday. I deployed it at 4 p.m. on Friday, it worked, and I went home at 5. To make it more interesting, I did it twice during this process — at the very beginning, in the first week, and at the very end — and both times it worked on a Friday afternoon and I was safe to go home. It was good to see it finally fixed.
00:29:18.640 So, to sum up what we did: we leveraged a safe deployment strategy and good monitoring to be able to apply a couple of optimization patterns. We used preloading to avoid N+1 queries, we moved filtering to the server side, we used custom queries to underfetch data — to fetch only what is needed — we tried to use local data, and we spread the load with jitter. When you take a look at this list it might sound familiar; I guess most of you have used those patterns, or at least seen them.
00:30:03.919 Let's take a closer look at them. Preloading to avoid N+1 — that's something every ORM does. Server-side filtering is quite common in Rails: moving from filtering in Ruby to a WHERE in the database. Both of these patterns we applied not in the database world but in the microservices world, for the GraphQL API. Using local data is a browser-optimization primer: avoid requests if at all possible. Underfetching is something that's built into SQL; again, we used GraphQL features to apply it in the service-oriented world. And spreading the load — I read about that some ten years ago in a High Scalability post about YouTube's architecture. So it's not that we invented something new.
00:31:10.960 So why do I keep you here for more than half an hour if we are applying something already known in a new setting? And why didn't we do it well from the beginning? The answer is threefold: it was about learning, simplicity, and trade-offs. We realized that it's really hard to spot a performance issue simply by staring at the code; you need to apply traffic of production shape to your application to see where the performance problems are, and in many cases that's really hard to do locally. The second part of the answer is simplicity: we started with the simplest thing that could possibly work — a boring REST solution with an API of minimal surface — and only then moved to something more sophisticated like GraphQL and custom queries. And finally, it's all about trade-offs.
00:32:26.799 When we were implementing the jitter, we had to convince our colleagues from the feature teams that spreading this load was really required and that we couldn't do everything at 5 p.m. sharp. What they really wanted — the requirement — was to send the reminders at five, not somewhere around five. We had to explain to them that this was impossible, because we would kill the billing service. So that was it: it was about determining which known patterns we should use in our novel setting, about understanding the trade-offs, and about applying our knowledge to our specific case.
00:33:18.240 I shared what we learned, I shared our story and our solutions, and I shared our mistakes in detail, so you don't have to repeat the mistakes that we made. Instead, you can go back to your work and commit your own mistakes — but then please come back to a meetup, or to a conference, or anything like this one, and tell your story, so that we can all learn from your mistakes. And if you wonder whether you are ready to make them — whether it's safe to make them — just ask yourself whether you can deploy your code on a Friday afternoon. Thank you; I'll answer any questions now.