Rate limiting
Most software systems have to call external services to get some information or perform some action. Think of a ride-sharing system that may need to call a geospatial service and a billing service, among other integrations.
The case for rate limiting #
Suppose we’re building one of these services that will be consumed by other software. The service will expose some form of API that accepts requests. Even with good API design, our service will run into bottlenecks.
There are only so many requests we can process with decent quality. Even worse, if there are far too many requests, the service might fail outright. Beyond that, there’s also the cost factor: even if we can process many requests, there could be a cost limit.
In light of these factors, perhaps we should consider rate limiting!
What does rate limiting look like? #
Despite there being many ways to achieve rate limiting, they all share some form of the following:
- The limit itself – an upper bound on the number of requests within a particular timeframe
- A window – the time period the limit applies to; granularity varies from seconds to days
- An identifier – how to differentiate request callers (such as a user ID or IP address)
Once the limit is reached, the service can:
- simply deny further requests
- slow down the processing of further requests
- reduce the quality of service for excess requests
When requests are denied, the serving API generally responds with an HTTP 429 (Too Many Requests) status.
Common approaches #
Fixed window counter #
This is one of the simplest approaches. As each request comes in, a counter is incremented. If the counter reaches the limit, then no further requests are allowed. When the next time window begins, the counter resets.
This approach is simple to implement and easy to reason about. The downside is that all requests in a window are treated the same.
Suppose a burst begins just before the current window ends and concludes just after the next window begins. The burst can exhaust the next window’s counter almost immediately, which means all subsequent requests will be stuck until the window after next begins.
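The fixed window counter can be sketched in a few lines. This is an illustrative in-memory version (the class and parameter names are our own, not from any particular library); a production service would typically keep the counters in shared storage such as Redis.

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    """Allow at most `limit` requests per caller per `window_seconds`."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.counters = defaultdict(int)  # (caller, window index) -> count

    def allow(self, caller_id, now=None):
        now = time.time() if now is None else now
        # Identify the current window by integer division of the timestamp.
        window = int(now // self.window_seconds)
        key = (caller_id, window)
        if self.counters[key] >= self.limit:
            return False  # limit reached; deny this request
        self.counters[key] += 1
        return True
```

Note that the counter for a caller simply stops being consulted once its window passes, which is what makes this approach so easy to reason about.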
Sliding window log #
This approach requires keeping track of each request’s timestamp and also uses a time threshold. Like the other approaches, it still aims to limit to “R requests per T time unit”.
The request tracker uses a data structure sorted by timestamp. Note that the state tracking is scoped to the entity being rate limited (generally the requesting user), so there could be multiple request trackers. One common implementation is to use Redis sorted sets.
On each new incoming request:
- remove all requests from the tracker that are older than the time threshold
- add the current request to the tracker
- count all requests in the tracker
- if the count is at or below the allowed limit per window, process the current request
This is a little more complex to implement, but it helps even out request load, especially compared to a fixed window.
The main drawback is the state tracking overhead, which requires logic and also resource usage (such as RAM with Redis).
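The steps above can be sketched as follows. This in-memory version uses a plain Python list per caller in place of a Redis sorted set (names are illustrative, not from any library); note that denied requests still land in the log, matching the steps above.

```python
import time
from collections import defaultdict

class SlidingWindowLog:
    """Track per-caller request timestamps; allow at most `limit` per window."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window_seconds = window_seconds
        self.logs = defaultdict(list)  # caller -> request timestamps

    def allow(self, caller_id, now=None):
        now = time.time() if now is None else now
        cutoff = now - self.window_seconds
        # 1. Remove entries older than the time threshold.
        log = [t for t in self.logs[caller_id] if t > cutoff]
        # 2. Add the current request.
        log.append(now)
        self.logs[caller_id] = log
        # 3-4. Count the remaining entries and decide.
        return len(log) <= self.limit
```

With Redis, steps 1–3 map naturally onto `ZREMRANGEBYSCORE`, `ZADD`, and `ZCARD` on a per-caller sorted set keyed by timestamp.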
Sliding window counter #
This approach somewhat fuses the fixed window counter and sliding window log approaches.
It involves dividing up the limit window into smaller windows. For instance, if the limit window is 1 minute, then the smaller window can be 1 second.
As requests come in, instead of tracking each request’s timestamp, we will create a mini-counter for the small window that corresponds to the request’s timestamp.
Suppose a request comes in at 12:30:05.007, where our small window has a 1-second granularity. We will add 1 to the count of the small window corresponding to 12:30:05.
Then, similar to the sliding window log, we will look at all the counters in the time period between now and one limit window ago.
Some implementations vary here (e.g. using weighted averages), but the simplest is just to add up the values of all those counters. If there are fewer requests than the allowed limit per window, let the current request proceed.
The main advantage of this approach is that it functions similarly to the sliding window log, but achieves that with more efficient resource usage.
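Here is a sketch of the simple “add up all the mini-counters” variant described above (names and structure are our own; a weighted-average variant would differ). It keeps one counter per caller per small window instead of one entry per request.

```python
import time
from collections import defaultdict

class SlidingWindowCounter:
    """Sum per-second mini-counters over the limit window to decide."""

    def __init__(self, limit, window_seconds, bucket_seconds=1):
        self.limit = limit
        self.window_seconds = window_seconds
        self.bucket_seconds = bucket_seconds
        self.buckets = defaultdict(int)  # (caller, bucket index) -> count

    def allow(self, caller_id, now=None):
        now = time.time() if now is None else now
        bucket = int(now // self.bucket_seconds)
        span = self.window_seconds // self.bucket_seconds
        # Add up every mini-counter between now and one limit window ago.
        total = sum(
            self.buckets[(caller_id, b)]
            for b in range(bucket - span + 1, bucket + 1)
        )
        if total >= self.limit:
            return False
        self.buckets[(caller_id, bucket)] += 1
        return True
```

Compared to the log approach, storage is bounded by the number of small windows rather than the number of requests, which is where the resource savings come from.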
Token bucket #
The token bucket approach is quite common. It keeps a metaphorical “bucket” of tokens, where each token grants callers one request.
Each bucket corresponds to the entity being limited – for example, a user ID or IP address.
The tokens themselves are not inherently tied to time. If a token is available, the caller’s request will go through. Otherwise, the request is not processed.
There are two controllable constraints: how often new tokens are added and how many tokens the bucket can hold. Tuning these essentially controls what the effective rate limit is.
This approach provides a good balance between spreading out request load and also somewhat tolerating bursts.
The main difficulty with this approach is token management. Many services operate in a distributed environment, so synchronizing how many tokens are available and consuming tokens atomically is a hurdle.
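A single-process sketch of one bucket illustrates the two constraints mentioned above: the refill rate and the capacity (names are our own; this deliberately ignores the distributed-synchronization problem). In practice there would be one such bucket per limited entity.

```python
import time

class TokenBucket:
    """Refill `rate` tokens per second, up to `capacity`; one token per request."""

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum tokens the bucket can hold
        self.tokens = capacity    # start with a full bucket
        self.last = None          # timestamp of the previous call

    def allow(self, now=None):
        now = time.time() if now is None else now
        if self.last is not None:
            # Lazily refill based on elapsed time, capped at capacity.
            elapsed = now - self.last
            self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1  # consume a token for this request
            return True
        return False
```

The `capacity` controls how large a burst can be absorbed at once, while `rate` sets the sustained throughput; together they determine the effective limit.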
Concluding thoughts #
Although there are a few common rate limiting recipes, no two implementations are exactly alike. It’s important to choose the right rejection strategy, limit/window selection, and algorithm based on your particular service and expected usage.
It makes sense to consider a commonly used implementation before reinventing the wheel, but sometimes a variation is needed. SlashID went with a variation based on the generic cell-rate algorithm. Other organizations may tweak the above approaches or even come up with their own.