How Toast Safely Deploys Background Workers

Written by Zach Walsh
Jun 17, 2024

Introduction

At Toast, we strive to build a system that makes deployments safe and easy. When operations are safe, our engineers can move quickly to ship new solutions and delight our customers.

Our architecture is made up of hundreds of distributed microservices. To decouple services, we make heavy use of events and asynchronous messaging for service-to-service communication. As a result, many of our services are background workers that pick up work from a queue or topic and pass messages along to another destination.

The safe deployment of these background workers presents an interesting challenge: how do you deploy a competing consumer with enough control to manage the risk? It must be possible to validate new deployments before they take on traffic, and to roll them back quickly if necessary. This post describes the system we built to satisfy those requirements, and its benefits.

Background Worker Deploys

Our messaging system of choice is Apache Pulsar™. Many teams at Toast operate services that act as Apache Pulsar™ consumers, pulling messages from topics. On startup, these services connect to our Apache Pulsar™ cluster and launch a thread pool that pulls down messages.

Development teams quickly adopted Apache Pulsar™ after its launch at Toast in 2019, implementing event-driven architectures to decouple their services. However, they had a common complaint: they were not satisfied with the process for deploying code changes to these Apache Pulsar™ consumers.

When deploying a change, teams would launch a new version of their service, and it would immediately start consuming messages. This presented some risk, as issues with the new deployment could cause problems for customers. Additionally, rollbacks were slow: the only way to stop a new version was to tear down all of its infrastructure, which could take many minutes. Roll-forwards also carried risk: teams had to fully destroy the old version of their service in order to shift all work onto the new release. After that point, rolling back required a full redeploy of the prior version, which could dramatically increase the mean time to resolution (MTTR) in an incident.

REST-ful Service Deploys

This stood in stark contrast to how teams deployed REST-ful services at Toast. Our service mesh control plane allows them to deploy a new version and keep it “inactive,” not processing any traffic.

This gives everyone an opportunity to check logs and metrics, see if alerts fire, and generally validate the health of their deployment. From there, they activate the new version and slowly elevate traffic to it, watching for regressions.

Finally, they deactivate the old version, leaving the infrastructure in place in case they need to do a rapid rollback.

Our Apache Pulsar™ solution lacked all of this safety and control, requiring teams to take extra care when making changes. They also (rightly) felt that an “inactive” service that was actively consuming messages was confusing. Until we built a similar solution for Apache Pulsar™, we would not see the adoption and velocity that we wanted.

Requirements

We wanted our Apache Pulsar™ consumers to behave as predictably and consistently in our service mesh as the rest of our platform services. As a result, we set out to build a tool that would provide the following functionality:

  • When a deployment is “activated” by the Toast control plane, it starts consuming messages

  • When a deployment is “deactivated” it stops consuming messages

  • There should be no performance impacts; extra load on the mesh or control plane is not acceptable

With these capabilities, teams could more safely deploy their Apache Pulsar™ consumers (and any other background workers they might own) without incurring any additional risk.

Solution

Apache Pulsar™ gives us a very convenient solution to this problem: Consumers can be paused and resumed with one method call. If we could call that method at the right time, we could give teams the operational control they needed. However, we had no way to notify each process of its change in status without polling our control plane API. We knew that such an approach wouldn’t scale, so we had to get more creative.
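
The pause-and-resume half of that idea can be sketched roughly as follows. This is an illustration under stated assumptions, not Toast's implementation: `PausableConsumer` and `ElevationGate` are hypothetical names, and the interface is declared locally (mirroring the `pause()` and `resume()` methods of the Pulsar Java client's `Consumer`) so that the example stands alone.

```kotlin
// Hypothetical stand-in for the two methods this sketch needs from the
// Pulsar Java client's Consumer: pause() and resume().
interface PausableConsumer {
    fun pause()
    fun resume()
}

// Hypothetical helper that keeps a consumer paused until its deployment is
// activated, and pauses it again when the deployment is deactivated.
class ElevationGate(private val consumer: PausableConsumer) {
    private var consuming = false

    init {
        // Assumption: the wrapped consumer starts out receiving messages and a
        // fresh deployment should be treated as inactive, so pause right away.
        consumer.pause()
    }

    // Applies a status reading; repeated identical readings are ignored so the
    // consumer is only touched when the status actually changes.
    @Synchronized
    fun onStatus(active: Boolean) {
        if (active == consuming) return
        if (active) consumer.resume() else consumer.pause()
        consuming = active
    }
}
```

The open question, which the rest of this section addresses, is how each process learns which value to pass to `onStatus`.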

Our ultimate approach leveraged another open-source component of our service mesh: Envoy®. 

Envoy at Toast

Envoy turned out to be the key to safely propagating service status information down into each running process without overwhelming our control plane. 

Envoy is a widely used reverse proxy that functions as the data plane in our service mesh. All of our service-to-service HTTP traffic, tens of thousands of requests per second at peak, flows through Envoy, which is typically deployed as a sidecar. In this context, we used Envoy to signal status changes to our background workers.

Implementation

Typically, Envoy serves as a proxy, sending each request to the service that can fulfill it. However, Envoy also supports a direct response action that lets it reply to a request itself (without proxying it elsewhere) when the response is known ahead of time.

That means that with a simple configuration like the following:

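// Envoy route that answers the status request itself instead of proxying it.
Route {
    match = RouteMatch {
        // Path the application polls for its elevation status.
        pathSpecifier = RouteMatch.PathSpecifier.Path("/sidecar/v1/elevation/active")
    }
    action = Route.Action.DirectResponse(
        DirectResponseAction {
            status = 200
            // Fixed body returned straight from the sidecar.
            body = DataSource {
                specifier = DataSource.Specifier.InlineString("{\"active\":true}")
            }
        }
    )
}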

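We can satisfy a service’s request for its status locally: a request to /sidecar/v1/elevation/active is answered by the sidecar with {"active":true}, without ever leaving the machine.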

As this endpoint is served locally on the same machine as the service, the application can call it as often as it needs without threatening the stability of the control plane. Then, when an engineer activates or deactivates a deployment, the control plane asynchronously updates this endpoint, and the application discovers the change the next time it polls.
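
From the application's side, reading that endpoint might look like the sketch below. It uses the JDK's built-in `java.net.http.HttpClient`. The port in `STATUS_URL` is an assumption (this post does not say where the sidecar listens), and `parseActive` and `fetchActive` are hypothetical helper names. Only the path and the `{"active":true}` body come from the route shown earlier.

```kotlin
import java.io.IOException
import java.net.URI
import java.net.http.HttpClient
import java.net.http.HttpRequest
import java.net.http.HttpResponse
import java.time.Duration

// Assumption: the sidecar listens on localhost:15001. The path is the one
// configured in the Envoy route.
val STATUS_URL = "http://localhost:15001/sidecar/v1/elevation/active"

val ACTIVE_FIELD = Regex("\"active\"\\s*:\\s*(true|false)")

// Extracts the "active" flag from a body such as {"active":true}.
// Returns null when the field is missing so callers can keep their last state.
fun parseActive(body: String): Boolean? =
    ACTIVE_FIELD.find(body)?.groupValues?.get(1)?.toBooleanStrict()

// Asks the local sidecar for this deployment's status. I/O failures
// (connection refused, timeout) and non-200 replies are reported as null.
fun fetchActive(client: HttpClient, url: String = STATUS_URL): Boolean? = try {
    val request = HttpRequest.newBuilder(URI.create(url))
        .timeout(Duration.ofSeconds(1))
        .GET()
        .build()
    val response = client.send(request, HttpResponse.BodyHandlers.ofString())
    if (response.statusCode() == 200) parseActive(response.body()) else null
} catch (e: IOException) {
    null
}
```

Returning null on failure, rather than guessing a status, lets the caller decide whether to keep its last known state while the sidecar is unreachable.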

From there, we wrote a small library that polls this locally-running endpoint. The library provides engineers a hook to respond in-process to any changes in their elevation status.
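
A minimal sketch of what such a polling library could look like is shown below. It is not the library itself: `ElevationWatcher`, its parameters, the one-second default interval, and the choice to treat a deployment as inactive until told otherwise are all assumptions made for illustration.

```kotlin
import java.util.concurrent.Executors
import java.util.concurrent.ScheduledExecutorService
import java.util.concurrent.TimeUnit

// Hypothetical watcher: `fetch` returns the current status (or null if the
// endpoint could not be read) and `onChange` is the hook handed to engineers.
class ElevationWatcher(
    private val fetch: () -> Boolean?,
    private val onChange: (Boolean) -> Unit,
) {
    // Assumption: a deployment counts as inactive until told otherwise.
    private var lastKnown = false

    // One polling step; fires the hook only when the status actually flips.
    @Synchronized
    fun pollOnce() {
        val current = fetch() ?: return // unreadable status: keep last state
        if (current != lastKnown) {
            lastKnown = current
            onChange(current)
        }
    }

    // Polls on a single daemon thread; the interval is an arbitrary choice.
    fun start(intervalMillis: Long = 1_000): ScheduledExecutorService {
        val scheduler = Executors.newSingleThreadScheduledExecutor { task ->
            Thread(task, "elevation-watcher").apply { isDaemon = true }
        }
        val poll = Runnable {
            // A scheduled task that throws is never run again, so swallow
            // failures here and try again on the next tick.
            try { pollOnce() } catch (e: Exception) { }
        }
        scheduler.scheduleWithFixedDelay(poll, 0, intervalMillis, TimeUnit.MILLISECONDS)
        return scheduler
    }
}
```

In a consumer service, `fetch` would read the local status endpoint and `onChange` would pause or resume the consumer. Because the hook fires only on transitions, the work done on each poll stays trivial.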

We integrated this into all of our Apache Pulsar™ consumers, which enabled us to safely activate and deactivate different versions of those consumers through the same tooling we use to control any service at Toast. This put Apache Pulsar™ consumers at parity with how REST-ful services are deployed, making those deploys much safer.

Results

Today, over 140 different services on our platform use this solution. Without any effort on the part of internal teams, deployments have become much safer, and teams can continue to use the tools they already know. We’ve seen massive adoption, with nearly 300 services connecting to Apache Pulsar™ either as a producer or a consumer of messages.

This platform functionality enables Toast engineers to deliver the features that restaurants and their guests need, enriching the food experience for all.

____________________________

1 Service mesh data plane vs. control plane | by Matt Klein | Envoy Proxy

2 Apache Pulsar™ is a trademark of the Apache Software Foundation

3 Envoy® is a registered trademark of the Linux Foundation