Summarized using AI

On the benefits of slot scheduling

Julik Tarkhanov • September 11, 2024 • Sarajevo, Bosnia and Herzegovina • Talk

In this talk, Julik Tarkhanov discusses the benefits of slot scheduling in the context of background job management for a fintech application. The main topic is alleviating the load spikes from background jobs that degrade system performance and drive up operational costs. The presentation provides key insights into how slot scheduling can make workloads more predictable and reduce the need for substantial system scaling.

Key Points Discussed:

  • Traditional Scheduling Challenges:

    Scheduling tasks with cron or similar methods leads to significant load spikes when many tasks execute simultaneously. This not only hurts performance but also incurs higher costs due to increased resource scaling.

  • Introduction of Slot Scheduling:

    Slot scheduling distributes tasks evenly across a fixed number of slots derived from user IDs, which helps maintain consistent throughput and keeps queue depth low.

  • Modeling and Simulation:

    Julik introduces discrete simulations as a strategy to evaluate scheduling mechanisms and predict load behavior accurately. By simulating several job scenarios, teams can derive effective scheduling solutions without excessive real-world testing.

  • Implementing Slot Scheduling:

    With the slot scheduling system, each user's jobs are grouped into slots whose scheduling jobs are enqueued at staggered time intervals. This prevents saturating the job queue and improves system stability.

  • Metrics and Feedback Loop:

    The use of performance metrics to characterize job durations and throughput is crucial. It allows teams to assess and adjust scheduling strategies based on actual performance data.

  • Handling Jumbo Users:

    The talk addresses the issue of 'jumbo users,' whose workload is significantly larger than average. Special handling strategies, like isolating those users onto dedicated queues or clusters, can help maintain overall system performance.

Conclusions and Takeaways:

  • Slot scheduling is a practical solution for managing workloads in environments with numerous users, preventing spikes in resource needs while maintaining performance.
  • Using simulations can help validate assumptions and arrive at efficient solutions quickly, highlighting that sometimes straightforward approaches yield the best results.
  • The importance of monitoring and adjusting strategies in real-time cannot be overstated, as it leads to better predictive scaling and overall system management.

Overall, slot scheduling proves to be an effective method for optimizing task execution in a heavily loaded environment, making it a beneficial approach for tech teams looking to enhance the performance of their applications.

On the benefits of slot scheduling
Julik Tarkhanov • Sarajevo, Bosnia and Herzegovina • Talk

Date: September 11, 2024
Published: January 13, 2025

Ensuring smooth operation for thousands of users with regular workloads is hard. Using slot scheduling can help you make your workload and throughput more predictable and your users happy. Slot scheduling is a great medicine to alleviate load spikes on your background jobs cluster.

The traditional way of performing tasks for multiple users in the system is to schedule them using cron or similar. The issue, however, is that scheduling "all the work for all the users" into one time slot creates a large load spike. This is bad for your shared services - the DB, Redis, the APIs you call into - but also bad for your wallet, as you suddenly need to scale your system up by a large factor when the time comes to run those jobs. In our company we need to run sync jobs against external APIs for every user - tens of thousands of users - and we want those jobs to run at a predictable frequency. We previously selected "least-recently-synced" workloads first, but this came with the load spikes and all the accompanying issues.

Slot scheduling tackles the problem differently. It computes a modulo over the workload owner ID (like the user ID) and then buckets the user into one of a fixed number of slots. This allows us to run an approximately even number of tasks at any given point in time, providing for a near-flat throughput. It is better for the shared services, better for the queue - as jobs get portioned into the job queue in smaller batches, and do not stick around for long - and better for autoscaling, as there is much less of it needed.
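
As a rough illustration of the idea (a minimal sketch with assumed names, not code from the talk), the slot can be any stable function of the owner ID, such as a modulo:

    N_SLOTS = 8 # assumed, fixed number of slots

    # The same user always lands in the same slot, and integer IDs
    # spread approximately evenly across all slots.
    def slot_for(user_id)
      user_id % N_SLOTS
    end

    slot_for(42) # => 2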

We are also going to discuss a very viable strategy for designing systems like this - discrete simulations, which - when applied properly - can save hours of testing.

EuRuKo 2024

00:00:07.120 Hello, wonderful people! Wow, it's echoing in here. Wonderful to see you. I think my last EuRuKo was completely remote - remember, there used to be COVID and all that - so it's great to see you in person. A wonderful city: loving the views, the mountains, everything. Just excellent. My name is Julik. I've been coding in Ruby for about 20 years now, maybe more, and working with it professionally for maybe 12 or 13 years. I wanted to talk to you about the virtues of slot scheduling, and about what the gcd method is for. Maybe you have used it, maybe not, but you might discover it.

00:00:51.360 First, I wanted to make a quick break. I know there are members of the Ukrainian community listening in, and I know that the Ukrainian Ruby community is very vibrant. I loved visiting Kyiv; I spoke at Ruby Motion once, I think. I just wanted to say: let's not forget there is a war going on in Europe. It is an unjust war; Ukraine must be free. These two links are charities that I know are doing great work to help Ukrainians; if you can, help them. I know that lots of people from Russia do not speak openly on the topic for fear of repression; I try to stand up for them. Freedom to Ukraine.

00:01:46.880 Now, on to the virtues of slot scheduling. For people who don't want to sit here for too long, I have a TL;DR of the talk, all packaged in one slide; you can take a picture of it, walk away, and have conversations in the hall. That's the super-rapid version; now we go for the slow one.

00:02:02.920 I work at a company called Cheddar. We are a fledgling startup in the UK; we're a fintech, and what we do is help you manage your personal spend and your personal finances. We also do peer-to-peer payments on the UK market; if you are a UK resident or have a UK bank account, you might want to try out Cheddar, it's real neat. Cheddar is a mobile application; funnily enough, it is actually a Rails application with a Flutter front end - that's also something that exists. One of the things Cheddar does is integrate with a ton of banks: we get permission from users to access their transactions, we go out to the banks, we download the transactions to analyze them, and we show you your spend and some predictions. We're also able to pay you out cash back automatically; that's also a feature we support. All of these bank integrations - and we integrate with 23 banks just in the UK at the moment, which is actually a huge number - happen from our Rails monolith. It's all simple tech: a single app, a very barebones approach, Postgres for everything. And we are unique in the sense that we actually do those banking integrations ourselves.

00:03:27.680 People who are in the fintech space - I already spoke to a couple in the hallways - highlight that getting access to banks is always difficult: you have different APIs to contend with, authentication is hard, security is hard. We chose to do this integration ourselves anyway, which gives us certain benefits. When we do this transaction syncing, we call out to a bank on the user's behalf, and the bank returns us the transaction information. Pretty simple, except sometimes it's very slow, and sometimes certain banks are down. You also need to study at least 20 different ways to mess up OAuth and OAuth2; there is probably a book to be written about that. But the most important part is that banks update people's data not instantly, but usually once or twice a day, and once this data is updated at the bank, we want to get at it, preferably quickly.

00:04:29.440 Our way of getting at this data quickly was initially this: we would look for people whose data hadn't been updated in a while, query for those, and schedule some kind of synchronization task for them, running it multiple times a day depending on the day of the week, etc. And that would be pretty miserable, because doing it this way gives you very large spikes in your queue depth. You enqueue a huge number of tasks and your systems react: if you use autoscaling, the more jobs you have, the more nodes you spin up and the more money you pay. It's uneven load, which is just an issue you will be dealing with. The job that was doing this looked somewhat like this.

00:05:16.240 We use job-iteration from Shopify - probably some folks know about it - and that's basically how it would proceed. It's a way to destroy your own application: nobody can be as good at DoSing your application as yourself. You inject a ton of jobs into the queue; to select the jobs, you query the database (just use Postgres, right?); when you query the database, your CPU load grows; queue depth increases; the autoscaler says "oh, queue depth is large, we have a lot of jobs - more computers!". And what do more computers do? They hammer on the same table, of course, causing more CPU load and more congestion. You end up seeing a load pattern on your database that looks like this, which doesn't make you happy at all. There is really no issue running a pretty successful Rails app on, I would say, maybe a cluster of Raspberry Pis, as long as you have a good database server, because that's the thing that will hurt first.

00:06:21.080 So we were thinking: okay, how do we spread this load in such a way that we don't destroy ourselves every day? And then this article sprang to mind. If you don't know it, I suggest you give it a look; it's old and it's good. It says: simulate anything that involves more than one probability, probabilities over time, or queues. We have probabilities; we have probabilities over time, because banks fail intermittently; and we have queues. Exactly the thing.

00:06:53.479 So we thought: what if we were to model a situation with lots of jobs? It would go something like this. We create a fake job; we say this job takes a certain amount of time to complete; we mark the start time of the job and register when it is supposed to finish; and then we let this job produce some fake side effects - in our case, a job spooling more jobs. For a fake job that produces side effects you can also say, for example, "this job causes n units of load on your fake, non-existent model database". Then we run it from a simulator, which looks kind of like so. This is a concept related to discrete event simulation, but not quite: it is actually incremental time progression. We initialize our simulator and give it the current time - it's kind of like a game engine - and then, every time we increment the time, we check whether any of our jobs have completed; if they did, we take more jobs from the queue, and we record metrics.
00:08:04.280 that you want to know how long your jobs
00:08:07.599 take and just a number is obviously not
00:08:11.280 very nice and not very useful but we use
00:08:14.919 an APM and an APM will usually give you
00:08:17.639 some kind of percentiles for how long
00:08:19.599 your jobs take in our case it is app
00:08:22.360 signal been using using it for a long
00:08:24.319 time and very happy with it and it gives
00:08:28.479 us something like this
00:08:30.120 right so it gives us the mean and gives
00:08:32.800 us the 19th percentile uh by the way the
00:08:36.640 13.37 seconds this is calling out to
00:08:39.519 slow off servers at Banks this is not
00:08:42.200 because Ruby is slow it's because you
00:08:43.640 sit and wait right now um then the

00:08:48.519 Then the question comes: if we have percentiles, there should be a way to generate fake job durations based on them, right? But here's a problem: I haven't studied computer science, I haven't studied math - I've done art school, and I've even worked for a CTO who has done art school, ask me anything. So if we have a hunch that there must be an algorithm to do X, but we don't have it, what do we do? Well, we go to ChatGPT and we ask it a question. I'm not necessarily scared of LLMs replacing us as developers; I tend to see an LLM as a drunken pub companion. You can ask it a stupid question and it will hallucinate you some kind of idea, and it might just be inspirational. You should never take it at face value, you should not take it as truth, but it might hallucinate something which makes you think: maybe that could work. So after some discussion with the drunken pub companion we arrive at an algorithm called inverse transform sampling, which is apparently used in games - go figure. And having put in an inconspicuous comment that this is hallucinated math and I don't know what I'm doing, we get a way to generate values for our jobs which look pretty much like the real thing. From there we can do a thing like this: we run our simulation with sampled durations that satisfy this distribution, so the job durations are going to match, and then every hour we dump a huge number of fake jobs into the queue and let the simulator run for a while.
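
A sketch of what inverse transform sampling from a few known percentiles can look like - a piecewise-linear inverse CDF fed with uniform random values. The percentile points below are placeholders, not our real timings:

    # (percentile, duration in seconds) points, e.g. as read off an APM.
    PERCENTILE_POINTS = [[0.0, 0.5], [0.5, 4.2], [0.9, 13.4], [0.99, 30.0], [1.0, 45.0]]

    # Find the segment u falls into and interpolate linearly inside it.
    def sample_duration(rng = Random.new)
      u = rng.rand # uniform in [0, 1)
      (lo_p, lo_d), (hi_p, hi_d) =
        PERCENTILE_POINTS.each_cons(2).find { |(p1, _), (p2, _)| u >= p1 && u < p2 }
      lo_d + (hi_d - lo_d) * ((u - lo_p) / (hi_p - lo_p))
    end

    durations = Array.new(10_000) { sample_duration }
    # ~50% of values fall below 4.2s, ~90% below 13.4s, matching the input.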

00:10:38.760 As far as code goes, this is pretty much all there is to it. At the very minimum you need some way to record metrics, you need the simulator itself, and you need a sorted heap to simulate your queue - because if you're using a queue which supports priorities, for example, you want fast dequeue operations, and that's just faster to do in Ruby than with SQLite or whatnot. Then we run the simulator, and it does something like this. There is an important thing to note here: if you go for this monotonically increasing time type of simulation, your unit of work should be sized sensibly relative to your tick. For example, if you have jobs which take 200 milliseconds but the maximum resolution of your simulation is 1 second, you're not going to get very useful output. But if your job durations are on the order of 16 seconds, a minute, maybe two minutes, then you should be pretty much good to go.

00:11:47.320 So we run that thing; it simulates one week of our queue for this type of workload, and then it outputs things. In our case it looked rather like this: we tried different scheduling schemes. A simulation is just a bit of Ruby code that defines your jobs and the timings; you run it, and it outputs a very handy SQLite database with all the metrics, because on every tick you can record what state your simulation is in, and you can also create metrics like counters and samples from your simulated jobs. And then we just use a hacky thing where we try to write to SQLite as fast as possible. This slide is here to nerd-snipe Stephen Margheim - I don't know whether it's going to work or not - but this is basically how you insert stuff into SQLite if you don't care about consistency and just want it as fast as possible.
00:12:51.240 this uh now this then gives you the
00:12:54.959 information about the number of jobs
00:12:57.120 that have completed at every tick of the
00:12:59.639 simulation and we enter this exercise
00:13:02.040 with this premise right so we want every
00:13:05.000 user account to be synced with a
00:13:06.920 somewhat consistent delay we want no
00:13:09.360 huge Ingress on the Queue because we
00:13:11.079 were using good job and good job gets
00:13:13.600 way slow and loads the database way hard
00:13:16.880 when you have a lot of jobs lined up and
00:13:19.199 we want the throughput to be even
00:13:20.959 because we don't want this autoscaler
00:13:22.560 jumping all the time right we want to
00:13:24.360 know okay we will need that many
00:13:26.720 machines for that much time and ideally
00:13:29.920 that should be enough right plus
00:13:32.839 sometimes we have jobs which may step on
00:13:35.399 each other's toes because they use each
00:13:36.880 other's data right now how do we go
00:13:40.800 about this

00:13:43.079 Well, that's how we do it. This is, I would say, the single most underused class in the Ruby standard library, and the reason it's underrated is that it's not really random: it's actually a deterministic algorithm called the Mersenne Twister. If you seed it with a value, it will give you a predictable, deterministic sequence of random values as you trigger it over and over again; the sequence is predetermined. So imagine we want to enqueue jobs for a specific user - or, as a matter of fact, jobs for our entire user base - spread throughout a certain amount of time. We would do it like this. We want every user to get a specific delay; we want this delay to be known in advance; we don't want it to change; and we want the delays to be evenly distributed across all users. How do we do this? Really easy: that's all you need to get a delay which is derived from your user ID. And since you are using a random number generator, the values it outputs are going to be pretty uniformly distributed. Even if you have user IDs like 1, 2, 3, 4, 5, the values the random number generator gives you will have a very nice spread.
00:15:03.519 spread now if like us you are stuck with
00:15:07.360 using uid primary Keys which is a whole
00:15:10.279 different discussion but anyway what you
00:15:12.079 can do in there is first you convert
00:15:14.360 your uuid into bytes which you do using
00:15:17.519 these 25 statements listed here and then
00:15:21.120 you can convert these bytes into a very
00:15:23.079 very large integer and you can SE your
00:15:24.959 mereni twister with this very very large
00:15:27.680 integer but
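
One way to do that conversion (a sketch; the slide's exact statements aren't reproduced in the transcript) is to read the UUID's 32 hex digits as one large integer:

    # Strip the dashes and parse the hex digits as a single big integer.
    def seed_from_uuid(uuid)
      uuid.delete("-").to_i(16)
    end

    seed = seed_from_uuid("018f3c2e-5b7a-7c4d-9e1f-2a6b8c0d4e5f") # hypothetical ID
    Random.new(seed).rand(8 * 3600) # a stable delay for this record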

00:15:29.600 But if we just give every user's job a specific start time to spread those jobs out, we still haven't solved the problem of a very large ingress of jobs into the queue. For that, we figured: okay, what if we portion this out? It works like this. Imagine we have our user base and we split it into eight buckets, which we call slots. When the time comes to run the jobs for a specific user who is, for example, in slot zero, we enqueue a blue job - that dot at the very beginning - and that job, in turn, will spool only the jobs for the users of that slot and no more. Then, when the time rolls over to the next slot, a similar thing happens: we run a scheduling job for that slot, and it enqueues the jobs for users who land in slot one, and then slot two, slot three, slot four, and so on. The question then becomes: how many buckets - how many slots - should we have for this to be reasonable? We want as many as possible, because the more slots you have, the smaller the ingress on your queue at any given time. And it actually comes down to a very bad coding interview question, which I hope is not being asked at your company.

00:17:06.439 Anyhow: we have a certain cycle duration - say, 8 hours - and we want to fit our entire user base into those eight hours. How do we determine how many slots to have? Well, if we know we have a certain number of buckets - in our case, the number of possible values of a byte, which I'll get to soon - there is a standard library method for this, called greatest common divisor. Do not code it in interviews; that's exactly what it is for. Given n buckets and the cycle duration in seconds, it tells us how many slots to use if we want an integer number of slots.
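
A sketch of how the numbers might fall out under those assumptions (256 buckets, an 8-hour cycle; the slide's exact formula isn't reproduced in the transcript):

    buckets       = 256                        # possible values of one byte
    cycle_seconds = 8 * 3600                   # 28_800
    n_slots       = buckets.gcd(cycle_seconds) # => 128 slots
    slot_duration = cycle_seconds / n_slots    # => 225 seconds per slot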

00:17:51.360 Our entire workload then becomes three workloads, executed in steps. First we run the so-called bootstrap job, which knows how long the cycle is going to be; it computes how many slots it will use and when the slots are supposed to start, and it enqueues the per-slot jobs. Every per-slot job, when it runs, enqueues the sync job for each particular user. The bootstrapping looks kind of like this: we delay every per-slot job by the duration of a slot multiplied by which slot it is - slot one by one duration, slot two by two durations, and so on. And when we go to execute a slot, this is how it works: we select the user accounts for this particular slot, derive a delay for every user using the same trick - using the primary key of the record in the database - to also distribute the users evenly inside that slot, and then we just enqueue the jobs for those particular users. This has advantages, because certain queues, like SQS, will not let you look at jobs before your workers check them out to perform them; some queues get slower as you have more jobs lined up; and, specifically for us, queues backed by databases use indexes - indexes grow, indexes make things slower, and indexes lock. But it gets even more interesting.
00:19:40.200 we have a an accept payments job and it
00:19:42.559 does things with your account with the
00:19:44.960 monies right for which it needs to lock
00:19:47.400 your account for example it needs to
00:19:49.240 accept payments right but we also have a
00:19:52.039 generate reports job which also needs to
00:19:55.200 lock your account to perform some
00:19:58.200 calculations and to generate you some
00:19:59.760 reports like what is your spend over a
00:20:01.679 month or what is your spend for a week
00:20:03.400 or what the prediction is right now how
00:20:06.919 do we do how do we
00:20:08.640 use this scheduling method to uh to
00:20:12.520 facilitate these two jobs running
00:20:14.960 together at the same time right now we
00:20:17.000 started with a
00:20:18.200 line uh and that's like a time interval
00:20:20.679 which is split into multiple intervals
00:20:22.840 now what if we take that interval and we
00:20:24.960 bend it into a donut like so right and
00:20:28.679 for example we know that we place our
00:20:30.840 accept payments job somewhere here
00:20:33.120 somewhere between slot zero and Slot one
00:20:35.760 right and we know that the generate
00:20:37.720 reports job has to run as far as
00:20:40.960 possible from the accept payments job uh
00:20:43.919 so that they do not Collide for the same
00:20:45.919 account how do we do it we Trace what we
00:20:48.799 call a
00:20:49.840 reciprocal uh and we run the accept
00:20:53.400 payments job for all the blue users
00:20:55.799 right all the users who fall into the
00:20:58.080 who fall into to the slot zero but at
00:21:00.520 the same time we can run generate
00:21:02.280 reports job for the users who fall into
00:21:04.080 slot four and this makes sure that these
00:21:07.960 jobs do not step onto each other because
00:21:09.720 they will ideally they will never run at
00:21:11.679 the same time for the same user account
00:21:13.320 and there will be no lock contention in
00:21:15.400 there right now if we add another job
00:21:17.600 into the mix imagine we do some know
00:21:19.279 your customer job right same principle
00:21:22.640 we can split those jobs to be
00:21:25.120 equidistant from each other right and
00:21:26.919 then we run the performance kyc jobs for
00:21:29.799 users in slot 5 generate reports job for
00:21:32.320 people in slot three and the accept
00:21:34.559 payment jobs in slot slot
00:21:38.559 zero um so the next um step is to figure
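
A sketch of how those offsets can be derived by spacing k job kinds equidistantly around the ring of slots (assuming eight slots; rounding reproduces the slot 0 / slot 3 / slot 5 split from the talk):

    # Place job kind k at an offset of k/n_kinds of a full turn around the ring.
    def offset_slot(slot, kind_index, n_kinds, n_slots = 8)
      (slot + (kind_index * n_slots.fdiv(n_kinds)).round) % n_slots
    end

    offset_slot(0, 0, 3) # accept payments  => 0
    offset_slot(0, 1, 3) # generate reports => 3
    offset_slot(0, 2, 3) # perform KYC      => 5
    offset_slot(0, 1, 2) # the two-job reciprocal lands at => 4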

00:21:44.480 The next step is to figure out how to select your users for this kind of workflow, and of course we have this. It looks very simple - and it actually is - but it is a Postgres hack; under the covers it actually does this. That's how you turn your UUIDs into bytes in Postgres. Then we grab the last byte, which gives us 256 values - as many values as there are in a byte - and you get your buckets. Then we just perform a division, which rounds down, and that's how we can select everyone from our user base who belongs to a given slot. You also want to use the last byte because if you are smart and you're using UUIDv7, the random component is going to be at the end; that's why the last byte.
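
A sketch of that selection as an ActiveRecord scope, assuming eight slots (so the 256 byte values fold into groups of 32); uuid_send and get_byte are standard Postgres functions, while the scope name is illustrative:

    class User < ApplicationRecord
      # get_byte(uuid_send(id), 15) extracts the last of the UUID's 16 bytes;
      # integer division by 32 maps the 256 byte values onto 8 slots.
      scope :in_slot, ->(slot) {
        where("get_byte(uuid_send(id), 15) / 32 = ?", slot)
      }
    end

    User.in_slot(0) # everyone whose UUID's last byte is 0..31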

00:22:40.919 And where does this get us? The desire was to get even throughput, to ease the load on our systems, and to know, to begin with, that all those syncs complete in due time. And boy, did it work. The blue graph is our throughput on the background jobs; the left part is before we deployed this change, the right one is after. The spike is when we schedule notifications in bulk - sometimes those cause spikes, but they are predictable. This has a ton of advantages. You will likely pay less for compute, because when you spin up extra autoscaled instances, you spend time waiting for those instances to come up and go down; you become very reactive; your autoscaler can become too aggressive, so you need to be very careful with the parameters to keep it from overscaling. If you're using Postgres, you've also got to be very careful with autoscaling because of the number of connections; if you're not careful, you're going to need PgBouncer and the works. Even throughput also lets you make estimations: you can say, okay, at this time of night we have a throughput of X; how many boxes do we need to satisfy it? One of the things you can then do is pre-commit compute, which you can do in AWS, for instance: we know that every night we're going to need that many machines of type X, with projected growth so-and-so. Flat throughput is the ideal situation for most operational concerns - you want your throughput to be flat - and this kind of scheduling, despite the fact that it's discouraged and "stupid", works very well.

00:24:39.600 There is one thing, however, that I need to address: any company with a sizable enough user base will get the so-called jumbo users. That's the user who has 10x of everything: 10x as many widgets, 10x as many notifications, 10x as many comments; they post ten times as much data, etc. This kind of scheduling is not very good for those users, because their jobs will run longer than your median job and will throw off your scheduling a little bit. But there are other ways to deal with that. First of all, you should usually send your jumbo users flowers, because supposedly you're earning a ton of money with them. But you can also locate them on a separate cluster, for example, or send their workloads into a separate queue, which allows you to size your compute for those users specifically. So it's a solvable problem; it's just that they are bigger.

00:25:48.279 So: "stupid" solutions work, despite folks telling you otherwise, and simulators are fun. We had discussions on the team - should we try this heuristic for scheduling, or that one? - and in the end, writing a simulator, running those scenarios, and looking at the metrics they output told us that yes, we can totally do the stupid algorithm, it will totally work, and it will be the cheapest solution of all. Also, a tiny bit of statistics gets you a long way - only be careful with hallucinated math, please. And that's it, thank you very much. If you want to take a look at Cheddar - especially if you are in the UK - you can be our customer, and we also have some nice open source goodies on GitHub. Underneath you can find my blog and Twitter; I post a little bit sometimes. Thank you!