Modern applications need to serve traffic at a scale never seen before. Traffic has grown by leaps and bounds over the last couple of decades as the internet has spread across the world, with users now numbering in the billions. Even though traffic has increased, it seldom arrives in a uniform pattern. It has its own ebbs and flows, varying with the time of day, the day of the week, and even the month of the year. Our applications and APIs need to be ready to serve this traffic. Applications are usually built with the objective of scaling as traffic grows. However, there are always limits to how far an application can scale, so we usually need other techniques to control this traffic. One such approach is rate limiting.
What is rate limiting?
Rate limiting is a technique used to control the amount of incoming and outgoing traffic to or from a network resource, such as an API.
Why is Rate Limiting Important?
Preventing Abuse: APIs are often targets for abuse through excessive or malicious requests. Rate limiting helps mitigate this risk by capping the number of requests a client can make.
Ensuring Fair Usage: In multi-tenant systems, rate limiting ensures that all users have fair access to resources, preventing any single user from monopolizing the API.
Maintaining Performance and Availability: By controlling the load on the system, rate limiting helps maintain consistent performance and prevents downtime caused by overload. This, in turn, helps teams commit to their own SLAs.
Respecting throttling limits: Downstream dependencies such as databases or third-party services usually impose their own throttling limits. Applications need to rate limit themselves to stay within those limits.
Cost Management: For APIs that incur costs per request, rate limiting can help manage and predict expenses by capping usage.
Now let's explore how to rate limit an application. We will use queues for this purpose.
Rate Limiting using Queues!
We will consider a simple service running on a server that processes HTTP REST API calls, as shown below.
We need to rate limit this service so that we have granular control over the number of requests coming to this server. As of now, there is no control: as users and clients increase, the server is expected to respond to all of them. The server is completely at the mercy of the traffic. It either scales or fails!
The Architecture of Rate Limiting Pattern
To build an effective and scalable rate limiting system, we need to take the following steps:
Queue: We will use a message queue for this. The rationale behind using a queue is to persist new requests until the existing requests have been processed. Queues help decouple the application: the service listening to the queue can pick up messages whenever it is ready, while users can send as many requests as they like.
Helper Service 1: The incoming requests are HTTP, and we need to persist them as queue messages. Therefore, we will create a helper service that converts the HTTP requests into messages and sends those messages to the durable message queue, as sketched below.
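To make this concrete, here is a minimal sketch of what Helper Service 1 could look like in Python, assuming Flask for the HTTP front end and Azure Storage Queues for the durable queue. The route, queue name, and connection string are illustrative placeholders, not part of the original design.

```python
# Helper Service 1 (sketch): accepts HTTP requests and enqueues them.
# Assumes Flask and azure-storage-queue; names and the connection string
# are illustrative placeholders.
import json

from azure.storage.queue import QueueClient
from flask import Flask, request

app = Flask(__name__)
queue = QueueClient.from_connection_string(
    conn_str="<storage-connection-string>",   # placeholder
    queue_name="incoming-requests",           # durable queue holding raw requests
)

@app.route("/api/orders", methods=["POST"])
def enqueue_request():
    # Persist the HTTP request as a queue message instead of calling
    # the main service directly.
    message = json.dumps({
        "path": request.path,
        "body": request.get_json(silent=True),
        "headers": dict(request.headers),
    })
    queue.send_message(message)
    # 202 Accepted: the request has been queued, not yet processed.
    return {"status": "queued"}, 202
```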
Helper Service 2: We will add a separate service that listens to this queue at a particular rate and forwards the requests to our original service. This queue-triggered service can be configured to read messages at a specific rate by using time delays and by defining the batch of messages to be picked up in a single poll. This way, a specific rate of message consumption can be maintained, and messages can be forwarded at the same rate to our main application service, as the sketch below shows.
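Here is a rough sketch of Helper Service 2 as a simple polling loop, again assuming Azure Storage Queues plus the requests package. The batch size of 100 and the 1-second delay give roughly 100 requests/sec from this single instance; the main service URL is a made-up placeholder.

```python
# Helper Service 2 (sketch): drains the queue at a controlled rate and
# forwards each message to the main service. Batch size and delay set
# the rate: 100 messages per 1-second cycle ~= 100 requests/sec.
import json
import time

import requests
from azure.storage.queue import QueueClient

queue = QueueClient.from_connection_string(
    conn_str="<storage-connection-string>",   # placeholder
    queue_name="incoming-requests",
)
MAIN_SERVICE_URL = "http://main-service.internal"  # illustrative

BATCH_SIZE = 100      # messages picked up per polling cycle
DELAY_SECONDS = 1.0   # pause between cycles

while True:
    messages = queue.receive_messages(max_messages=BATCH_SIZE)
    for msg in messages:
        payload = json.loads(msg.content)
        # Forward the original request to the main service.
        requests.post(MAIN_SERVICE_URL + payload["path"], json=payload["body"])
        queue.delete_message(msg)  # remove only after successful forwarding
    time.sleep(DELAY_SECONDS)
```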
This is fine, but there is one problem. In a real-world application, a single instance of HelperService2 may not be sufficient to process the queue messages; it needs to scale out. But then our effective rate limit would also increase by a factor of the number of instances.
Locking: This is where locking comes into the picture. We need to logically partition the rate limit and assign a lock to each partition. We would need a Redis cache or some other storage, such as Azure Blob Storage, to work with locks/leases so that multiple instances of HelperService2 can coordinate.
Let's look at the following example to understand locking and coordination. We assume the rate limit to be achieved is 1,000 requests/sec.
We can create 10 blobs in an Azure Storage Account. Each blob represents a capacity of 100 requests/sec. This is just a logical partitioning of the rate limit, with each partition linked to a single blob.
Now, an instance of HelperService2 can poll the queue and fetch messages only when it holds a lease on a blob.
So, when multiple instances of HelperService2 are up and running, each will try to lease a blob for a certain duration.
As soon as it leases a blob, that instance can fetch 100 messages from the queue and process them with a delay of 1 second.
Each instance will try to lease as many blobs as possible, but since there are only 10 blobs, they will be distributed among the instances while keeping the overall rate limit intact. A sketch of this coordination follows.
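Below is a minimal sketch of how an instance of HelperService2 might coordinate through blob leases, assuming the azure-storage-blob package. The container name, blob names, 15-second lease duration, and the process_batch helper are all illustrative; the blobs partition-0 through partition-9 are assumed to have been created up front.

```python
# Lease-based coordination (sketch): an instance may consume its 100
# messages/sec slice only while it holds a lease on one of the 10 blobs.
import time

from azure.core.exceptions import HttpResponseError
from azure.storage.blob import BlobServiceClient

blob_service = BlobServiceClient.from_connection_string("<storage-connection-string>")
# The container and the blobs partition-0 .. partition-9 are created up front.
container = blob_service.get_container_client("rate-limit-partitions")

NUM_PARTITIONS = 10        # 10 blobs x 100 req/sec = 1000 req/sec overall
MESSAGES_PER_LEASE = 100   # capacity represented by one blob


def try_acquire_any_lease():
    """Try each partition blob in turn; return a lease if one is free, else None."""
    for i in range(NUM_PARTITIONS):
        blob = container.get_blob_client(f"partition-{i}")
        try:
            # A short lease expires on its own if this instance dies.
            return blob.acquire_lease(lease_duration=15)
        except HttpResponseError:
            continue  # blob is already leased by another instance
    return None


def process_batch(limit):
    """Placeholder for the Helper Service 2 loop above: fetch up to `limit`
    messages from the queue and forward them to the main service."""


while True:
    lease = try_acquire_any_lease()
    if lease is None:
        time.sleep(1)      # all partitions are taken; back off and retry
        continue
    try:
        process_batch(MESSAGES_PER_LEASE)
        time.sleep(1)      # hold this instance to 100 messages per second
    finally:
        lease.release()    # free the partition for other instances
```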
A similar thing can be achieved using a Redis cache with keys, as sketched below. This way, we have ensured that no matter how many requests come from clients and users, our server will receive a maximum of 1,000 requests/sec.
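For completeness, here is one possible Redis-based equivalent of the same partition lock, assuming the redis-py package: each of the 10 partitions is guarded by a key acquired with SET NX and an expiry, so a crashed instance releases its slice automatically. Key names, TTL, and connection details are illustrative.

```python
# Redis variant (sketch): the same 10 logical partitions, each guarded by a
# key acquired with SET NX and an expiry instead of a blob lease.
import redis

r = redis.Redis(host="localhost", port=6379)  # connection details are illustrative

NUM_PARTITIONS = 10  # 10 partitions x 100 req/sec = 1000 req/sec overall

def try_acquire_partition(instance_id):
    """Return the index of a free partition, or None if all are taken."""
    for i in range(NUM_PARTITIONS):
        # nx=True: set only if the key does not exist (i.e. the partition is free).
        # ex=15: auto-expire after 15s so a crashed instance frees its slice.
        if r.set(f"rate-limit:partition:{i}", instance_id, nx=True, ex=15):
            return i
    return None

# An instance that gets a partition index may consume 100 messages/sec,
# exactly as in the blob-lease loop above, and lets the key expire (or
# deletes it) when it is done.
```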
Conclusion
Rate limiting is an essential pattern for managing the performance, reliability, and security of APIs. By effectively controlling the flow of requests, it helps maintain service quality and user satisfaction. I hope this rate limiting pattern using queues helps readers build an architecture that gives them more granular control over the whole system.
If you liked this blog, don't forget to give it a like. Also, follow my blog and subscribe to my newsletter. I share content related to software development and scalable application systems regularly on Twitter as well.
Do reach out to me!!