On the benefits of slot scheduling

In this talk, Julik Tarkhanov discusses the benefits of slot scheduling, particularly in the context of backend job management for a fintech application. The main topic revolves around alleviating load spikes from background jobs that can negatively impact system performance and operational costs. The presentation provides key insights into how slot scheduling can enhance workload predictability and reduce the need for substantial system scaling.

Key Points Discussed:

Traditional Scheduling Challenges:

Using methods like cron for scheduling tasks leads to significant load spikes when many tasks are executed simultaneously. This is detrimental not only to performance but also incurs higher costs due to increased resource scaling.
Introduction of Slot Scheduling:

Slot scheduling distributes tasks evenly across a limited number of slots derived from user IDs, which helps maintain consistent throughput and reduces queue depth.
Modeling and Simulation:

Julik introduces discrete simulations as a strategy to evaluate scheduling mechanisms and predict load behavior accurately. By simulating several job scenarios, teams can derive effective scheduling solutions without excessive real-world testing.
Implementing Slot Scheduling:

With the slot scheduling system, each user’s jobs are grouped into buckets managed efficiently at different time intervals. This prevents workload saturation in the job queue and enhances system stability.
Metrics and Feedback Loop:

The use of performance metrics to characterize job durations and throughput is crucial. It allows teams to assess and adjust scheduling strategies based on actual performance data.
Handling Jumbo Users:

The talk addresses the issue of 'jumbo users,' whose workload is significantly larger than average. Special handling strategies, like isolating those users onto dedicated queues or clusters, can help maintain overall system performance.

Conclusions and Takeaways:

Slot scheduling is a practical solution for managing workloads in environments with numerous users, preventing spikes in resource needs while maintaining performance.
Using simulations can help validate assumptions and arrive at efficient solutions quickly, highlighting that sometimes straightforward approaches yield the best results.
The importance of monitoring and adjusting strategies in real-time cannot be overstated, as it leads to better predictive scaling and overall system management.

Overall, slot scheduling proves to be an effective method for optimizing task execution in a heavily loaded environment, making it a beneficial approach for tech teams looking to enhance the performance of their applications.

On the benefits of slot scheduling
Julik Tarkhanov • Sarajevo, Bosnia and Herzegovina • Talk

Date: September 11, 2024
Published: January 13, 2025
Announced: unknown

Ensuring smooth operation for thousands of users with regular workloads is hard. Using slot scheduling can help you make your workload and throughput more predictable and your users happy. Slot scheduling is a great medicine to alleviate load spikes on your background jobs cluster.

The traditional way of performing tasks for multiple users in the system is to schedule them using cron or similar. The issue however, is that scheduling “all the work for all the users” into one time slot will create a large load spike. This is bad for your shared services - the DB, Redis, APIs you call into - but also bad for your wallet, as you will need to suddenly scale your system by a factor when the time comes to run those jobs. In our company we need to run sync jobs to external APIs for every user - tens of thousands of users, and we want those jobs to run with predictable frequency. While we previously selected “least-recently-synced” workloads first, this came with the load spikes and all the accompanying issues.

Slot scheduling tackles the problem differently. It computes a modulo over the workload owner ID (like the user ID) and then buckets the user into one of a fixed number of slots. This allows us to run an approximately even number of tasks at any given point in time, providing for a near-flat throughput. It is better for the shared services, better for the queue - as jobs get portioned into the job queue in smaller batches, and do not stick around for long - and better for autoscaling, as there is much less of it needed.

We are also going to discuss a very viable strategy for designing systems like this - discrete simulations, which - when applied properly - can save hours of testing.

EuRuKo 2024