How Slack Transformed Their Cron Job System to Support Millions of Users

May 31, 2024

Imagine you have a busy kitchen in a large restaurant. To keep things running smoothly, you need to ensure that tasks like chopping vegetables, cooking meals, and cleaning dishes happen at the right times. If not, chaos ensues. Slack faced a similar challenge with their cron jobs — automated tasks that need to run at specific times to keep the platform functioning seamlessly.

The Role of Cron Jobs at Slack

Just like a restaurant relies on its kitchen staff to prepare meals on time, Slack relies on cron jobs to ensure notifications, reminders, and updates are sent out promptly. As Slack grew, the number of these automated tasks (cron jobs) increased, making it harder to manage and maintain them.

The Challenges

Maintainability Issues: Imagine a kitchen where all the recipes are on a single piece of paper. As the number of recipes grows, it becomes harder to update, monitor, and manage them. For Slack, that piece of paper was a single crontab file listing every scheduled script.
Single Point of Failure: Relying on a single server meant that if it went down, the entire system could fail, much like a kitchen without a head chef.
Vertical Scaling Costs: Initially, Slack tried to beef up their “kitchen” (server) by adding more resources, but this approach became too expensive and inefficient.

Building a New Cron Execution System

Slack decided to rebuild their cron system from scratch, using a more scalable and reliable approach. Here’s how they did it:

Scheduled Job Conductor

Think of the Scheduled Job Conductor as the kitchen manager who decides when each task should be performed. This new service is written in Go and runs on Bedrock, Slack’s in-house platform built on top of Kubernetes. Its only job is to work out which cron jobs are due and to trigger them at the right time.
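To make the "kitchen manager" idea concrete, here is a minimal sketch of a scheduler that decides which jobs are due at each tick. It is not Slack's code: the `Job`, `Conductor`, and `simulate` names are hypothetical, and fixed intervals stand in for real cron expressions to keep the example short.

```go
package main

import (
	"fmt"
	"time"
)

// Job is a hypothetical scheduled task: a name plus a fixed interval.
// Real cron expressions are richer; an interval keeps the sketch short.
type Job struct {
	Name     string
	Interval time.Duration
	NextRun  time.Time
}

// Conductor holds the schedule and decides which jobs are due.
type Conductor struct {
	jobs []*Job
}

// Add registers a job whose first run is one interval after start.
func (c *Conductor) Add(name string, every time.Duration, start time.Time) {
	c.jobs = append(c.jobs, &Job{Name: name, Interval: every, NextRun: start.Add(every)})
}

// Due returns the names of jobs whose next run time has arrived and
// advances each of them to its following run.
func (c *Conductor) Due(now time.Time) []string {
	var due []string
	for _, j := range c.jobs {
		if !now.Before(j.NextRun) {
			due = append(due, j.Name)
			j.NextRun = j.NextRun.Add(j.Interval)
		}
	}
	return due
}

// simulate replays one-minute ticks (instead of waiting on a real clock)
// and returns the jobs that were due at each tick.
func simulate(ticks int) [][]string {
	start := time.Date(2024, 5, 31, 0, 0, 0, 0, time.UTC)
	c := &Conductor{}
	c.Add("send-reminders", time.Minute, start)
	c.Add("cleanup", 5*time.Minute, start)

	var out [][]string
	for i := 1; i <= ticks; i++ {
		out = append(out, c.Due(start.Add(time.Duration(i)*time.Minute)))
	}
	return out
}

func main() {
	for i, due := range simulate(5) {
		fmt.Printf("minute %d: %v\n", i+1, due)
	}
}
```

In this sketch the conductor only answers the question "what is due now?". What happens to a due job is a separate concern, which is the split the rest of the design relies on.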

Scalability with Kubernetes

By using Kubernetes, Slack can easily scale the number of “kitchen stations” (pods) based on the workload. This means they can dynamically add more resources when needed, ensuring that the system remains efficient and reliable.

Leader-Follower Architecture

In this system, there’s one “head chef” (leader pod) who schedules the jobs, while the other “chefs” (standby pods) are ready to take over if the head chef fails. This ensures rapid failover, and because only one pod schedules at any moment, the pods never have to coordinate to avoid scheduling the same job twice.
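The article does not say how the head chef is chosen, so the following is only a sketch of one common approach: a time-limited lease that the leader must keep renewing. The `Lease` type, pod names, and the 15-second TTL are all assumptions made for illustration, and an in-memory struct stands in for whatever shared store a real cluster would use.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Lease is a hypothetical in-memory stand-in for a shared lock record.
type Lease struct {
	mu      sync.Mutex
	holder  string
	expires time.Time
}

// TryAcquire makes pod the leader if the lease is free, has expired, or is
// already held by pod (a renewal). It reports whether pod is now the leader.
func (l *Lease) TryAcquire(pod string, now time.Time, ttl time.Duration) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.holder == "" || l.holder == pod || !now.Before(l.expires) {
		l.holder = pod
		l.expires = now.Add(ttl)
		return true
	}
	return false
}

// Holder returns the pod that currently holds the lease.
func (l *Lease) Holder() string {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.holder
}

// electionDemo walks through a failover and returns the leader after each step.
func electionDemo() []string {
	lease := &Lease{}
	t0 := time.Date(2024, 5, 31, 0, 0, 0, 0, time.UTC)
	ttl := 15 * time.Second
	var leaders []string

	// Step 1: pod-a starts first and becomes the leader.
	lease.TryAcquire("pod-a", t0, ttl)
	leaders = append(leaders, lease.Holder())

	// Step 2: pod-b tries while the lease is still valid and is rejected.
	lease.TryAcquire("pod-b", t0.Add(5*time.Second), ttl)
	leaders = append(leaders, lease.Holder())

	// Step 3: pod-a has crashed and stopped renewing; once the TTL has
	// passed, the standby pod-b takes over.
	lease.TryAcquire("pod-b", t0.Add(20*time.Second), ttl)
	leaders = append(leaders, lease.Holder())

	return leaders
}

func main() {
	fmt.Println(electionDemo())
}
```

The TTL is the trade-off to tune in a design like this: a shorter lease means faster failover, but the leader has to renew more often.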

Offloading Tasks to Job Queue

The conductor does not run the tasks itself. When a job is due, it hands the job off to a queue, and separate workers carry out the actual execution. This separation ensures that resource-intensive tasks don’t bog down the scheduling process.
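The split between scheduling and execution can be sketched with a Go channel standing in for the queue. This is an illustration under simplifying assumptions, not Slack's implementation: `runJobs` is a hypothetical helper, the queue lives in memory, and the "work" is just recording the job name.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// runJobs plays both roles: the scheduling side only enqueues job names,
// while a pool of worker goroutines does the execution. It returns the
// names of the completed jobs, sorted so the result is deterministic.
func runJobs(names []string, workers int) []string {
	queue := make(chan string, len(names))
	var (
		mu   sync.Mutex
		done []string
		wg   sync.WaitGroup
	)

	// Worker side: each worker pulls jobs until the queue is closed.
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for name := range queue {
				// A real worker would run the script here.
				mu.Lock()
				done = append(done, name)
				mu.Unlock()
			}
		}()
	}

	// Scheduling side: enqueue and move on, never execute.
	for _, n := range names {
		queue <- n
	}
	close(queue)

	wg.Wait()
	sort.Strings(done)
	return done
}

func main() {
	fmt.Println(runJobs([]string{"send-reminders", "cleanup", "digest-email"}, 2))
}
```

Because the enqueue step is cheap, a slow or heavy job ties up a worker rather than the scheduler, and the worker count can change without touching the scheduling code.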

Job Queue

Slack uses a powerful job queue system to handle the execution of these tasks. Think of it as a series of conveyor belts in a factory, where each belt handles a different type of job. This system processes around 9 billion jobs per day, ensuring that tasks are completed efficiently and reliably.

Using Vitess for Job Tracking

To keep track of all the jobs, Slack employs Vitess, a database system that scales MySQL horizontally. This is like having a digital recipe book that tracks the status of every dish being prepared in the kitchen. Each job run is recorded there, which prevents the same task from being started twice and makes it possible to monitor the progress of each job.

Conclusion

By rebuilding their cron job execution system, Slack transformed their “kitchen” into a highly efficient and scalable operation. This ensures that all tasks are completed on time, keeping the platform running smoothly for millions of users. Just like a well-managed kitchen ensures timely and delicious meals, Slack’s new system ensures reliable and timely notifications and updates.

References:

Executing Cron Scripts Reliably at Scale