00:00:18.400
and Fridays. A couple of weeks before, I managed to deploy faulty code and I
00:00:26.240
put our production service down, and believe it or not, I had to stay late
00:00:31.279
after work to fix it. And believe it or not, there are better Fridays than fixing
00:00:36.480
broken production after work. I knew the risk.
00:00:41.840
But why was it so important to deploy it that fast? And
00:00:49.280
what gave me the confidence to deploy code on Friday?
00:00:55.039
And last but not least, why the heck was I thinking of testing something on production?
00:01:02.879
We'll come to all of this. But before we do, we need to take a
00:01:08.799
few steps back. Our topic: Toptal is a freelancer network
00:01:14.240
that helps companies hire freelancers and scale their teams. I'm not a
00:01:19.600
freelancer there; I work in the Core Team, and the Core Team
00:01:25.280
develops our core tools. The heart of our core tools is the Toptal platform. It's a Rails
00:01:32.799
application of over a million lines of code. There are thousands of users using it,
00:01:40.320
and there are hundreds of engineers developing this platform
00:01:46.479
on a day-to-day basis. So it is a monolith, and it is majestic.
00:01:52.720
I work for the billing extraction team, because our monolith
00:01:59.600
has a majestic amount of issues, and we plan to extract parts of it
00:02:05.759
into services. Our team takes the billing code,
00:02:10.800
the billing domain, and puts it into a separate service. We are on the bleeding edge of
00:02:16.640
service extraction in our company, and we need to do this, but we can't disrupt
00:02:25.599
day-to-day operations, and we can't make the life of our
00:02:31.120
fellow engineers harder. So a big bang extraction is out of the question.
00:02:36.560
We decided to use a step-by-step approach, to make it safe
00:02:41.760
and to avoid disruption. So the first step was to extract the
00:02:48.560
billing code and put it into a Rails engine, while still communicating
00:02:53.840
with it using regular Ruby calls.
00:02:59.680
The next step was to actually test the external communication, the network
00:03:05.040
communication. So we implemented an async API using Kafka. We won't be
00:03:10.959
talking about this, even though it's very interesting, and
00:03:16.159
we also added synchronous communication over HTTP and REST, and that's the topic
00:03:22.400
of this talk. To make it safe, we added two important features. The first
00:03:28.640
of them was a feature flag that decided if any given request should be
00:03:35.680
routed through REST or should use a direct call.
00:03:41.360
And in case the REST call failed, we added a safe fallback: so
00:03:49.920
if the REST call failed, we were using a direct call as a
00:03:55.680
fallback. To end users the call might be slower, but it was still correct.
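As a rough sketch — Feature, BillingApiError, and the fetch methods are placeholder names, not our exact code — the flag plus fallback looked something like this:

    # Sketch of the flag + fallback routing (names are illustrative).
    class BillingQueryService
      def billing_records_for(product_ids)
        if Feature.enabled?(:billing_rest_api)  # toggled at runtime, no deploy needed
          fetch_via_rest(product_ids)
        else
          fetch_directly(product_ids)
        end
      rescue BillingApiError => e
        # The safe fallback: a failed REST call degrades to a direct Ruby call.
        Rails.logger.warn("billing REST call failed, falling back: #{e.message}")
        fetch_directly(product_ids)
      end
    end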
00:04:02.879
And it was our first approach.
00:04:08.720
To dive a little bit deeper, let's take a look at how the models looked.
00:04:16.000
We had a Product class and a BillingRecord class; let's focus only on those two. Before the
00:04:22.960
extraction, they were associated using regular Active Record
00:04:28.000
has_many relations. But after the extraction we still had to
00:04:34.400
connect the data from Product, which stayed on the platform, and BillingRecord, which was moved
00:04:39.759
to the billing service, and we couldn't do it with Active Record associations.
00:04:45.759
So we added a method to Product that was fetching
00:04:52.639
billing records using a BillingQueryService that was wrapping all the communication,
00:04:57.919
and we replaced the BillingRecord Active Record
00:05:03.199
model with a plain old Ruby object that was connected to Product
00:05:09.840
by product_id. And we deployed this.
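To make it concrete, here is a minimal sketch of the before and after; the record fields are assumptions, not the real billing schema:

    # Before the extraction: a regular Active Record association.
    class Product < ApplicationRecord
      has_many :billing_records
    end

    # After: BillingRecord is a plain old Ruby object built from service data,
    # and Product fetches it through the query service, matched by product_id.
    BillingRecord = Struct.new(:id, :product_id, :amount, :status, keyword_init: true)

    class Product < ApplicationRecord
      def billing_records
        BillingQueryService.new.billing_records_for(id)
      end
    end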
00:05:17.919
So before we even started, we had a safe environment where we could
00:05:25.280
ensure that in case there were any issues with our API we would fall back to a direct
00:05:30.800
call, and also that we could decide if we use those remote calls
00:05:38.400
using feature flags, without deployments. So we deployed it, and we enabled the
00:05:45.680
HTTP calls for a small percentage of the API calls, 10 or 20 percent. And we tried, and we
00:05:51.520
failed miserably. (Let's start... maybe this time... no — yeah, it fails every time.)
00:05:58.160
The performance of this solution was abysmal. Even for a small fraction of the traffic we
00:06:04.160
saw a lot of timeouts, the code was generally slow, and we knew we had
00:06:11.840
to change it massively. And I think it's quite a good
00:06:17.440
moment for a software engineer, right? Everybody loves those optimization
00:06:23.039
projects where you open New Relic and take a look at the traces, and then you
00:06:29.120
optimize and you see how it worked. And that's exactly what didn't happen, because we realized that we can't just
00:06:36.800
optimize the current approach in the middle of our massive refactoring
00:06:42.080
of service extraction. We decided that we needed another massive change, and that
00:06:48.479
was switching from the HTTP API to a GraphQL API.
00:06:54.000
And this change we also performed in small steps.
00:06:59.120
So we introduced another layer of feature flags that allowed us to switch
00:07:04.160
between the correct, although slow, HTTP API and the experimental
00:07:11.599
GraphQL API. And we also added a fallback,
00:07:16.880
so that in case a GraphQL call was slow, we would call the billing logic directly.
00:07:26.080
In addition to our standard monitoring — errors reported to Sentry,
00:07:32.240
performance with New Relic — we decided that we needed something more fine-grained.
00:07:38.240
So we logged every request to our ELK, to our Elastic stack,
00:07:47.120
and we logged its method name, arguments, stack trace, response time, and,
00:07:54.000
if needed, an error. And it gave us high visibility. We were able to spot
00:08:01.680
which methods were slow and recreate those slow calls
00:08:07.759
locally. We were able to see the source of those slow calls, or the source of
00:08:14.160
multiple calls, so that we were able to notice where we had N+1 queries.
00:08:20.800
We were able to spot the errors, reproduce those errors locally, and fix them with confidence.
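A minimal sketch of that per-request logging, assuming a generic StructuredLogger and illustrative field names:

    # Wrap every remote billing call and ship one structured document per call.
    def with_billing_logging(method_name, args)
      started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
      error = nil
      yield
    rescue => e
      error = e
      raise
    ensure
      elapsed_ms = ((Process.clock_gettime(Process::CLOCK_MONOTONIC) - started) * 1000).round
      StructuredLogger.info(
        method: method_name,           # which API method was called
        args: args.inspect,            # with which arguments
        backtrace: caller.first(10),   # where the call came from (helps spot N+1 sources)
        response_time_ms: elapsed_ms,  # how slow it was
        error: error&.message          # and the error, if any
      )
    end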
00:08:27.599
So we started with those two strong pillars:
00:08:33.200
we had a safe deployment environment, and we had good monitoring that allowed us to
00:08:38.959
control and to decide if we really improved our solution.
00:08:44.640
With those two strong pillars, let's dive into the main phase: the
00:08:50.480
optimization phase. The first thing that we noticed was that several views or
00:08:57.760
Sidekiq jobs generated a lot of billing requests. It
00:09:02.959
could be thousands of billing requests from a single page rendering or a single
00:09:08.480
Sidekiq job. It was unbearable for the billing service; it
00:09:14.959
wasn't reasonable to serve that. So what happened there?
00:09:20.880
It was exactly like an N+1
00:09:26.399
issue with database access.
00:09:31.519
It was not very easy to spot those issues, because when we took a look at the job,
00:09:38.320
it was just iterating over products in batches and calling business
00:09:43.600
logic on them. But then, when we looked into the business logic, it was
00:09:49.760
using billing records for products. So for every
00:09:54.880
iteration in the job we were calling the billing service, and that generated that
00:10:02.000
flood of requests: an N+1 of requests. So what we did:
00:10:07.279
we decided that we should move away from this active-object approach where
00:10:13.680
we do an external request whenever we access data;
00:10:18.720
we should preload it one level higher. So
00:10:23.839
we implemented caching: preloading and caching
00:10:28.959
records inside the job. Let's take a look at this cached
00:10:35.120
billing records method. It did three things. First of all, it was
00:10:40.480
fetching records from the billing query service; then it was
00:10:46.800
indexing them by product_id; and using this index,
00:10:55.200
it was assigning billing records to their respective products. And because of that,
00:11:00.800
when we did the actual billing business logic call,
00:11:07.279
we weren't doing additional requests to the billing query service. So instead of
00:11:13.760
calling the billing service once per iteration, we were calling it
00:11:19.200
once per batch. That means instead of making thousands of requests, we made a dozen or half a
00:11:27.040
dozen, and that was definitely a big improvement.
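A sketch of that cached billing records method — the three steps are the ones just described, while the service call and the billing_records writer are assumed names:

    # Preload once per batch instead of once per product.
    def cache_billing_records(products)
      # 1. Fetch records for the whole batch in a single request.
      records = BillingQueryService.new.billing_records_for(products.map(&:id))
      # 2. Index them by product_id (the "build" side of a hash join).
      by_product_id = records.group_by(&:product_id)
      # 3. Assign records to their respective products (the "probe" side).
      products.each do |product|
        product.billing_records = by_product_id.fetch(product.id, [])
      end
    end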
00:11:32.640
So what we learned here: if we have a flood of requests, we need to
00:11:38.079
recreate what the database, or the ORM,
00:11:45.200
does: we need to preload the data, cache it, and join it
00:11:50.480
with the data that we have locally. And actually, what we implemented really resembles the hash join algorithm
00:11:57.839
that's known from databases. So that was the first issue. Another issue
00:12:04.320
that we noticed was that there was a single field that was
00:12:11.600
requested about a thousand times per day,
00:12:16.639
and in the worst hours it was hundreds per hour.
00:12:21.760
So it was a lot. We weren't able to preload it, because it came in
00:12:27.519
different requests. So what we did here
00:12:33.440
was we devised a very smart plan. We decided that we could do smart caching: we could add
00:12:39.839
this field to our Kafka events, then we could build a projection on the platform
00:12:44.880
side, and after backfilling historical data we could start using this projection, this
00:12:50.639
read model, and then we could remove the billing query. That was smart, that was fancy, and I was really happy when I
00:12:57.760
started to do this. But then, after the first few days,
00:13:04.839
Samuel, our colleague who is really into the billing domain, came and said: ah, guys, I
00:13:10.720
guess we have this field on the platform side, let's check it. And at this
00:13:17.600
moment I was really annoyed, because wow, that's such a nice approach with
00:13:24.560
Kafka, and we'd switch to this other approach that's not so fancy; and
00:13:30.880
this billing read model is so cool we could even talk at a conference about it,
00:13:37.040
right? But yeah, we decided to check Samuel's approach,
00:13:43.120
and it was really good, because we were indeed able to use this local field.
00:13:50.560
So instead of spending two weeks building this solution with Kafka, we
00:13:55.680
spent two or three days, mostly talking with developers that are into the
00:14:01.760
billing domain and with product people, and making huge SQL queries comparing
00:14:08.959
the data from those two sources. And we replaced this external call with
00:14:14.639
a single database call. And when we deployed it, the flood of requests
00:14:22.320
was gone. And that was a better solution, because what we did was
00:14:27.519
not only faster but also more maintainable: instead
00:14:34.320
of many lines of code and asynchronous communication, we shipped a
00:14:39.760
one-liner with a single database query. What we learned here —
00:14:46.959
I learned that while technical excellence can
00:14:52.880
help us win optimization battles, domain knowledge
00:14:59.360
can let us avoid those battles altogether. And that's really wise.
00:15:07.040
After those two issues we noticed another one. It wasn't about
00:15:14.079
making too many queries; it was about making queries that were
00:15:20.399
really, really slow, at least some of them. What was the reason? When we were
00:15:26.560
building the REST API we wanted a minimal surface, so it was a universal API returning
00:15:35.199
every piece of data that could possibly be needed on the platform side. What
00:15:41.120
does that mean? It means that we were returning about 40 fields
00:15:46.639
and about three or four associations, with potentially hundreds of records.
00:15:53.279
Whenever we needed something on the platform side, we were
00:15:58.320
fetching all of those, and for our worst products, with a lot of billing
00:16:03.759
data, we were sending a lot of data over the network. And to make it even worse,
00:16:11.120
we had filtering on the platform side, the client side.
00:16:16.639
So what we did: we leveraged the power of GraphQL, and it
00:16:22.560
was the moment when the decision to move to GraphQL really paid off,
00:16:28.639
because after implementing the GraphQL types on the server side, we were able
00:16:36.639
to create customized queries fetching only the data that was really
00:16:42.320
needed, without additional changes on the server side. So instead of fetching 40
00:16:47.440
fields and hundreds of associated objects, we were fetching four fields,
00:16:53.519
and it was bloody fast. And of course we implemented filtering on the server side, to fetch only the
00:17:00.240
data that is needed. And it looked like it worked like a charm during our tests.
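As an illustration — the field and argument names are made up, not our real billing schema — a narrow query with server-side filtering could look like this:

    # One customized query instead of the universal 40-field response.
    SLIM_BILLING_QUERY = <<~GRAPHQL
      query($productIds: [ID!]!, $status: BillingStatus) {
        # filtering happens on the server; only four fields travel over the network
        billingRecords(productIds: $productIds, status: $status) {
          id
          productId
          amount
          status
        }
      }
    GRAPHQL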
00:17:06.559
And I must say that,
00:17:12.000
as I was implementing this, I really
00:17:17.199
got into this process: we monitor, then
00:17:22.319
we decide how to optimize, we optimize and deploy to production, and then we see
00:17:27.600
how our stats are getting better. And it was a really, really rewarding
00:17:33.360
experience, with a short feedback cycle. And I expected
00:17:39.200
something similar here, even bigger than usual, because this change
00:17:45.120
was significant. So I merged this, and I was waiting to see the stats.
00:17:53.120
By the way, do you know what was the hottest day last summer, at least in the northern
00:17:58.960
part of our globe? I remember it well: it was the 26th of July, a Tuesday.
00:18:12.000
I was waiting for the stats to get better. I enabled the feature flag, and I was casually checking Slack when
00:18:19.600
someone wrote that Sidekiq workers consume too much memory. I thought: oh,
00:18:25.200
poor infra guys, they have to work on this. And someone else said that we had a
00:18:30.640
failing staging deploy, and I thought: man, someone was reckless and managed to
00:18:37.280
merge faulty code, that's very bad. And then someone told me directly: man, I
00:18:45.440
think it was your code that's faulty. And I thought: ah, it can't be, it's not possible, it's tested.
00:18:53.600
But I started checking it, and then someone asked us if the platform was okay —
00:18:58.880
someone from the business side — and then someone said: folks, the
00:19:04.640
platform's down. And I knew it was my pull request that did it, and
00:19:11.600
I felt cold sweat on my back. It was the hottest day last summer,
00:19:18.000
Tuesday, the 26th of July. How bad was it? It was really, really bad. We consumed
00:19:25.520
about 20 times more memory than usual, and CPU utilization was about 10 to
00:19:31.200
20 times higher than in normal times — normal times are so flat that they are almost invisible
00:19:39.039
on those charts. And to make it worse, queries to Postgres
00:19:46.480
took about a minute. A minute to query Postgres — that was quite terrible.
00:19:54.559
And then two good things happened. First, the infra folks
00:20:00.960
reacted without my intervention: when they
00:20:07.679
decided that it was the faulty build, they reverted it, and the charts
00:20:13.200
showed that the platform returned to normal operations. And then they assisted me
00:20:19.600
in reverting this on master and deploying the new version, so we had a clean master again.
00:20:27.520
And the second thing: the very next day, I started preparing a fully fledged
00:20:33.039
fix, and my manager approached me. And you can imagine what a manager could
00:20:38.640
say in this case: a little bit of blame casting, or at least "oh, be careful
00:20:44.159
next time". And what he said was: good job.
00:20:50.240
I was really surprised — was it really for me? And he said: yeah, it was wrong that
00:20:56.000
the platform was down, but it happens, and as long as you stayed to fix it, it's
00:21:03.760
okay, it's a good job. And I realized it's also part of our safe deployment strategy: it's okay to
00:21:11.760
make a mistake, as long as you are committed to fixing it.
00:21:17.600
And that was a really, really good experience. So what really happened?
00:21:24.720
While implementing this feature I did a small refactoring, and I
00:21:31.600
added parameter sanitization. And I was a little bit too eager:
00:21:37.360
three types of parameters were supposed to be allowed, but I allowed only one.
00:21:42.840
And so if someone requested
00:21:48.240
billing records for a client, the parameter was set to nil because of this
00:21:54.080
sanitization, and with no parameters those calls
00:21:59.600
did nothing: we were asking for all billing records, and then for all
00:22:06.400
associated objects — and those are millions of records. And that's why
00:22:13.840
we killed Postgres, and then we killed our workers.
00:22:19.919
And I don't think we even reached the moment when all the objects were instantiated and serialized and pushed
00:22:26.799
to the network; we basically — I basically — killed the servers.
00:22:32.480
The fix was pretty simple: in case the list of
00:22:38.799
parameters is empty, return nothing instead of returning everything. So we moved away from the SQL
00:22:47.039
default of returning everything when there's no WHERE clause.
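A hypothetical reconstruction of both the bug and the fix, with assumed parameter names:

    # The sanitization was supposed to allow three params but kept only one,
    # so e.g. a client filter silently arrived as nil.
    def billing_records(params)
      filters = params.slice(:product_id)  # bug: :client_id and :status were dropped
      # The fix: with no surviving filters, return nothing instead of everything.
      return BillingRecord.none if filters.compact.empty?
      BillingRecord.where(filters)
    end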
00:22:52.559
How did it happen that it slipped through our safe deployment strategy, through our CI, and
00:22:59.440
so on? First of all,
00:23:06.240
I was testing only one of those parameters in my unit tests, and it happened to be the parameter that
00:23:12.880
was correctly sanitized. Second of all, we had a feature flag for this, and I
00:23:19.919
switched it off when we noticed production issues. But the bug was introduced in the REST
00:23:27.039
branch — in the safe branch, the battle-tested one — so I made it even worse:
00:23:33.360
if I had switched to GraphQL, it would have been better, but I didn't know that
00:23:38.559
then. So that's how it worked,
00:23:45.360
that's why it happened. Of course, I fixed the sanitization.
00:23:50.559
And what was the result of our fix, finally? Here's the situation before the
00:23:56.080
fix, here's the chasm when I was fixing the issue, and here's after the fix:
00:24:02.880
the response time is not only better, but
00:24:08.799
it's also more predictable. So underfetching and custom queries really
00:24:14.799
helped us here. But what I really learned here is that no matter how much I trust
00:24:20.880
myself, and my unit tests, and my feature flags, and my reviewers, I should always test manually. We have this rule,
00:24:29.120
this "how to test this" field in the pull request template,
00:24:36.000
and in this case I put simply "trust the specs", and I think it was a mistake. I should
00:24:42.559
have provided a simple test scenario in the Rails console, with some expected log output or
00:24:49.360
something, because I believe we could have
00:24:54.799
caught this on development machines or during reviews. Without this manual test strategy,
00:25:02.480
I failed — I broke production. Okay,
00:25:07.600
that was the bad part; let's do something better now, right? The
00:25:13.279
patches were committed. Now, after all those changes, we were able
00:25:28.000
to enable the remote traffic for a bigger percentage of our requests. And we noticed that every Sunday
00:25:35.120
we had 429 Too Many Requests errors
00:25:41.039
reported by our clients. What happened every Sunday? We were first scheduling
00:25:48.159
reminders and then sending them at 5 p.m., and when scheduling we were doing a lot
00:25:53.679
of requests to the billing service. So we fixed it by
00:25:59.360
preloading — it was something we knew well — and we deployed it. We waited for Sunday,
00:26:05.279
because we weren't able to reproduce it locally: the issue was generated by nginx
00:26:13.200
in our production setting. And on Monday morning we saw the same
00:26:18.960
issue. Why was that? The error
00:26:24.240
was raised not only during scheduling but also during sending the reminders. We thought
00:26:31.200
we would be covered, because the reminders were
00:26:36.960
sent in our talents', our users', time zones,
00:26:43.200
so not all of them at the same 5 p.m. And we thought: okay, we'll split it into
00:26:49.840
24 buckets and it will be okay. The issue is that a quarter of our talents live
00:26:56.400
in the same time zone, so when the evening comes in that time zone,
00:27:02.000
we were sending too many requests to billing. We couldn't move our talents to other
00:27:08.640
time zones, so I decided to move the reminders a little bit: instead of sending them at 5 p.m. sharp,
00:27:15.760
I jittered them, adding a jitter of two minutes, which we calculated should be okay.
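Something like this, with the job class name assumed:

    # Instead of 5 p.m. sharp, spread each send over a two-minute window.
    reminders.each do |reminder|
      SendReminderJob.perform_in(rand(0..120).seconds, reminder.id)
    end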
00:27:22.080
So we deployed it and waited for Sunday. On Monday morning we saw the same error.
00:27:29.120
Why was that? We had missed one more reminder. So we did the same, and
00:27:35.520
on Monday morning, the same error. And it took a month to fix this,
00:27:40.559
because we had to wait a week every time. This time
00:27:45.919
we realized that it wasn't a problem with our domain knowledge; this time it was a technical
00:27:52.399
detail of Sidekiq. We thought that Sidekiq would run our scheduled jobs with
00:27:59.279
one-second resolution, but in fact Sidekiq did it with about 15-second resolution.
00:28:05.200
So instead of having a hundred small buckets of requests, we had
00:28:11.919
eight big buckets of requests. And that was the issue. So we took a
00:28:17.840
rate limiter from Sidekiq Enterprise, implemented window limiting, and
00:28:25.600
deployed it and waited for Sunday. And it worked, and it was marvelous.
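Roughly like this — the limiter name and numbers are illustrative; the windowed limiter itself is the Sidekiq Enterprise feature:

    # Allow at most N billing calls per one-second window; jobs over the limit
    # get rescheduled instead of hammering the billing service all at once.
    BILLING_LIMITER = Sidekiq::Limiter.window('billing-reminders', 50, :second)

    class SendReminderJob
      include Sidekiq::Job

      def perform(reminder_id)
        BILLING_LIMITER.within_limit do
          Reminder.find(reminder_id).deliver!  # Reminder#deliver! is assumed
        end
      end
    end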
00:28:31.600
And I must admit, I must confess, that it was during this saga
00:28:39.600
that I saw a green build on Friday and knew that it had to be deployed
00:28:45.200
before Sunday. And I deployed it at 4 p.m. on Friday,
00:28:53.360
and it worked, and I went home at 5. And to
00:28:59.039
make it more interesting, I did it twice during this process: at the very
00:29:04.559
beginning, in the first week, and at the very end of the process. And both times it worked on a Friday afternoon,
00:29:12.080
and I was safe to go home. And it was good to see it finally fixed.
00:29:18.640
So, to sum up what we did: we
00:29:24.000
leveraged a safe deployment strategy and good monitoring to be able to apply
00:29:30.399
a couple of optimization patterns. We were using preloading to avoid N+1
00:29:37.520
queries; we moved filtering to the server side; we were using custom queries to underfetch data,
00:29:44.320
to fetch only what is needed; we tried to use local data; and we spread the load with jitter. And when you take a look at
00:29:51.919
this list, it might sound familiar. I guess that most of you have
00:29:58.000
used those patterns, or at least seen them. Let's
00:30:03.919
take a closer look at them. Preloading to avoid N+1 — that's
00:30:09.279
something that every ORM does. Server-side filtering
00:30:15.039
is quite common — in Rails, moving from filtering in Ruby to a where query. Both things we implemented not in the
00:30:22.480
database world but in the microservice world:
00:30:28.640
we applied both of those patterns to the GraphQL API.
00:30:34.080
Using local data — it's a browser
00:30:39.120
optimization primer: avoid requests if possible.
00:30:44.399
Underfetching is something that's built into SQL;
00:30:49.440
again, we used GraphQL features to apply it in the service-oriented world.
00:30:56.399
And spreading the load — I read about it some 10 years ago, in the High Scalability post about YouTube's
00:31:03.679
architecture. So it's not that we invented something new.
00:31:10.960
So why do I keep you here for more than half an hour,
00:31:18.640
if we are applying something known in a new setting? And why didn't we do it well from the
00:31:26.000
beginning? The answer here is threefold: it was about learning, simplicity, and
00:31:33.279
trade-offs. We realized that it's really hard to spot
00:31:38.960
performance issues simply by staring at the code. We realized that you need to
00:31:46.720
apply traffic of the production shape to your application to see where the performance
00:31:52.960
problems are, and in many cases it's really hard to do that locally. And
00:32:00.080
the second part of the answer is simplicity. We started with the simplest thing that could possibly
00:32:07.279
work: we started with a boring REST solution, and with an API of
00:32:13.600
small surface, of minimal surface, and only then did we move to something more
00:32:19.519
sophisticated, like GraphQL and custom queries. And finally, it's all about trade-offs.
00:32:26.799
When we were implementing the jitter, we had to convince our colleagues
00:32:33.519
from the feature teams that it's really required to spread this load, and that we
00:32:39.919
can't really do everything at 5 p.m. sharp, because
00:32:45.120
what they really wanted — the requirement — was to send the reminder at five; let's send it at five, not
00:32:52.559
sometime around five. We had to explain to them that that's impossible, because we would kill the
00:32:59.360
billing service. So that was it: it was about
00:33:05.120
determining which known patterns we should use in our novel setting; it was about understanding
00:33:11.679
the trade-offs and applying our knowledge to our specific case.
00:33:18.240
And I shared what we learned. I shared
00:33:23.519
our story, our solutions, and I shared details about our mistakes,
00:33:29.760
so you don't have to repeat the mistakes that we made.
00:33:35.039
Instead, you can go to your work and commit your own mistakes —
00:33:41.120
but then please come back to a meetup, or to a conference, or anything
00:33:48.399
like this one, and tell your story, so that we can all learn from your mistakes.
00:33:55.279
And if you wonder whether you are ready to do them, whether it's safe to do them,
00:34:01.200
just think if you can deploy your code on Friday afternoon.
00:34:08.240
Thank you, I'll answer any questions now.