Executing Cron Scripts Reliably at Scale

Reliable and Scalable Cron Execution at Slack

At Slack, cron scripts are responsible for executing reminders, sending email notifications, and managing databases, among other critical functions. As the number of scripts and the amount of data they process grew over time, the reliability and scalability of the existing cron execution system became a concern. This article describes how Slack designed and built a new system to execute cron scripts reliably at scale.

The Challenges of the Old System

Initially, Slack relied on a single node to execute all the cron scripts. This node had a copy of all the scripts and a crontab file with their schedules. As the number of scripts and the data they processed increased, the system’s reliability started to falter. Scaling up the node’s resources helped to some extent, but the node remained a single point of failure: any issue with provisioning, rotation, or configuration would bring the service to a halt, affecting critical Slack functionality.

Realizing the limitations of the existing system, Slack decided to build a new, more reliable cron execution service. The goal was to create a system that could handle the growing number of scripts and data while ensuring scalability and fault tolerance.

The New System Design

When designing the new cron execution service, Slack chose to leverage existing services to minimize the amount of custom-built components. The new system consists of three main components:

  1. A Golang service that mimics cron functionality using a Golang cron library.
  2. Slack’s Job Queue, an asynchronous compute platform that handles the execution of jobs.
  3. A Vitess table for deduplication and job tracking.

Golang Service for Scheduling

The Golang service decides when each cron script should run. It uses a Golang cron library that supports the same cron string format used on the original cron box, which made the migration simpler and less error-prone. For scalability and fault tolerance, Slack used Bedrock, their wrapper around Kubernetes, to scale up multiple pods easily. Only one pod is designated for scheduling at a time, while the others remain in standby mode. This setup allows another pod to take over scheduling and reduces the chances of downtime.
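
To make the cron string format concrete, here is a minimal sketch in Go of how a five-field cron expression can be checked against a point in time. It is not the library Slack uses (the article does not name it). The helpers fieldMatches, matchesAt, and matches are hypothetical, and only a small subset of the cron syntax is handled.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
	"time"
)

// fieldMatches reports whether one cron field accepts value.
// Supported forms (a deliberately small subset): "*", "*/n", a plain
// number, and comma-separated lists of those. start is the smallest
// legal value of the field (0 for minutes, 1 for months, and so on).
func fieldMatches(field string, value, start int) bool {
	for _, part := range strings.Split(field, ",") {
		switch {
		case part == "*":
			return true
		case strings.HasPrefix(part, "*/"):
			step, err := strconv.Atoi(strings.TrimPrefix(part, "*/"))
			if err == nil && step > 0 && (value-start)%step == 0 {
				return true
			}
		default:
			n, err := strconv.Atoi(part)
			if err == nil && n == value {
				return true
			}
		}
	}
	return false
}

// matchesAt checks a five-field expression in the order
// minute, hour, day-of-month, month, day-of-week (0 = Sunday).
// Simplification: all five fields must match. Standard cron instead
// ORs day-of-month and day-of-week when both are restricted.
func matchesAt(expr string, minute, hour, dom, month, dow int) bool {
	fields := strings.Fields(expr)
	if len(fields) != 5 {
		return false
	}
	values := []int{minute, hour, dom, month, dow}
	starts := []int{0, 0, 1, 1, 0}
	for i, f := range fields {
		if !fieldMatches(f, values[i], starts[i]) {
			return false
		}
	}
	return true
}

// matches is a convenience wrapper that takes a time.Time.
func matches(expr string, t time.Time) bool {
	return matchesAt(expr, t.Minute(), t.Hour(), t.Day(),
		int(t.Month()), int(t.Weekday()))
}

func main() {
	t := time.Date(2024, time.March, 4, 9, 30, 0, 0, time.UTC)
	fmt.Println(matches("*/15 9 * * *", t)) // every 15 minutes during hour 9
	fmt.Println(matches("0 0 * * *", t))    // midnight only
}
```

Because the expression format is unchanged, an existing crontab schedule string could in principle be passed to such a matcher (or to a real cron library) without being rewritten, which is the compatibility benefit described above.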

By offloading the memory and CPU-intensive work of running the scripts to Slack’s Job Queue, the scheduling pod can focus on its primary task. This approach eliminates the need for synchronization between multiple nodes and simplifies the overall system architecture.
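
The division of labor described here can be sketched as follows: a scheduler that, when a schedule fires, only enqueues a job and never runs the script itself, and that does nothing at all while it is on standby. This is an illustration under stated assumptions, not Slack’s implementation. The Job and Scheduler types, the leader flag, the script name, and the use of a Go channel as a stand-in for the Job Queue are all hypothetical. The article does not describe how the scheduling pod is chosen.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// Job is a hypothetical unit of work: one run of one cron script.
type Job struct {
	Script string
}

// Scheduler models a single pod. Only a pod whose leader flag is set
// enqueues work; how that flag gets set is outside this sketch.
type Scheduler struct {
	leader atomic.Bool
	queue  chan<- Job // stand-in for the real job queue
}

// Tick is called when a script's schedule fires.
// It reports whether a job was actually enqueued.
func (s *Scheduler) Tick(script string) bool {
	if !s.leader.Load() {
		return false // standby pod: take no action
	}
	select {
	case s.queue <- Job{Script: script}:
		return true
	default:
		return false // queue full; a real system would retry or alert
	}
}

func main() {
	queue := make(chan Job, 10)
	active := &Scheduler{queue: queue}
	standby := &Scheduler{queue: queue}
	active.leader.Store(true)

	// Both pods see the same schedule fire, but only one enqueues.
	fmt.Println(active.Tick("send_reminders"), standby.Tick("send_reminders"))

	// Simulated failover: the standby pod takes over scheduling.
	active.leader.Store(false)
	standby.leader.Store(true)
	fmt.Println(active.Tick("send_reminders"), standby.Tick("send_reminders"))

	fmt.Println("jobs enqueued:", len(queue))
}
```

Since Tick does no script work, the scheduling pod’s resource use stays small regardless of how heavy the scripts are, and no coordination between pods is needed beyond knowing which one is currently scheduling.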

Slack’s Job Queue

Slack’s Job Queue is a powerful asynchronous compute platform that processes billions of jobs per day. It consists of various queues that jobs flow through, ensuring durability and efficient execution. In the new cron execution service, each cron script is treated as a single job. By leveraging the capabilities of the Job Queue, Slack was able to offload the compute and memory-intensive tasks to an existing system that can handle the load effectively.

Using the Job Queue reduced the build time and maintenance effort, and because it already handles large workloads, it gave the cron service a reliable and scalable way to execute scripts.

Vitess Table for Deduplication and Job Tracking

To ensure that only one copy of a script is running at a time, Slack employed a Vitess table for deduplication and job tracking. In the previous cron system, flocks were used to manage locking in scripts, but there was a possibility of multiple copies running simultaneously for longer-running scripts. In the new system, each job execution is recorded as a new row in a table, and the job’s state is updated as it progresses through the system.

This table serves as a reference for checking whether a job is already running before triggering a new run. By querying the table for active jobs, Slack can prevent duplicate executions of the same script. The table also serves as the foundation for a simple web page that provides information about cron script execution, allowing users to track progress and identify any errors encountered during execution.
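
The check-before-trigger logic can be sketched with an in-memory stand-in for the table. This is a simplified illustration, not Slack’s schema: the RunTable and Run types, the state names, and the script name are assumptions, and a mutex takes the place of whatever atomicity the real database provides.

```go
package main

import (
	"fmt"
	"sync"
)

// State is a hypothetical job state; the article does not list the real ones.
type State string

const (
	StateEnqueued State = "enqueued"
	StateRunning  State = "running"
	StateDone     State = "done"
	StateFailed   State = "failed"
)

// Run is one row: a single execution of a script.
type Run struct {
	ID     int
	Script string
	State  State
}

// RunTable is an in-memory stand-in for the tracking table.
type RunTable struct {
	mu     sync.Mutex
	nextID int
	rows   []*Run
}

// TryStart records a new run unless the script already has an active
// (enqueued or running) row. It returns the new row ID and true on success.
func (t *RunTable) TryStart(script string) (int, bool) {
	t.mu.Lock()
	defer t.mu.Unlock()
	for _, r := range t.rows {
		if r.Script == script && (r.State == StateEnqueued || r.State == StateRunning) {
			return 0, false // duplicate: a run is still active
		}
	}
	t.nextID++
	t.rows = append(t.rows, &Run{ID: t.nextID, Script: script, State: StateEnqueued})
	return t.nextID, true
}

// SetState updates a row as the job progresses. It reports whether the row exists.
func (t *RunTable) SetState(id int, s State) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	for _, r := range t.rows {
		if r.ID == id {
			r.State = s
			return true
		}
	}
	return false
}

func main() {
	table := &RunTable{}

	id, ok := table.TryStart("send_reminders")
	fmt.Println("first trigger started:", ok)

	_, ok = table.TryStart("send_reminders")
	fmt.Println("second trigger while active started:", ok)

	table.SetState(id, StateDone)
	_, ok = table.TryStart("send_reminders")
	fmt.Println("trigger after completion started:", ok)
}
```

Because every run leaves a row behind with its final state, the same data can back a status page without any extra bookkeeping.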

The Benefits of the New System

The new cron execution service at Slack has brought several significant benefits:

  1. Reliability: Scheduling runs on Kubernetes with standby pods, and script execution runs on the Job Queue, so a problem with a single node no longer halts every cron script.
  2. Scalability: Multiple pods can be scaled up, and compute-intensive work is offloaded to the Job Queue, so the system can handle the growing number of scripts and data.
  3. Visibility: The Vitess table for deduplication and job tracking lets internal users track the execution status of their cron scripts. The simple web page shows script runs and helps identify errors or delays.
  4. Reduced maintenance effort: By building on existing services such as Kubernetes and the Job Queue, Slack reduced the build time and maintenance effort required for the new cron execution service.

The new system resolves the challenges of the old cron execution system and provides a foundation for future growth.

If you are passionate about working on systems like this, Slack is currently hiring! Apply now to join their team and contribute to building innovative solutions.

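Note

  • This article was automatically generated by AI (gpt-3.5-turbo).
  • This article is based on the following article posted on HackerNews:
    Executing Cron Scripts Reliably at Scale
  • If you believe there is a problem with the content of this automatically generated article, please let us know in the comments section.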
