Building a scalable, highly reliable, asynchronous user service

Here, I introduce some of the essential architectural concepts that we use to build a scalable, highly concurrent, and reliable user service. The concepts are illustrated using our implementation of GDPR functionality and the architecture we developed to handle Data Subject Rights (DSR).

Specifically:

  • Access & Portability: Upon users’ request, we are obliged to provide a report containing all information about the user.
  • Right to erasure: Users can request to have all information concerning them deleted.

Architecturally significant requirements

The General Data Protection Regulation (GDPR) came into effect in the European Union on May 25, 2018, requiring most businesses to implement new processes in order to stay compliant with the new data privacy rules. Showmax was no exception, and we introduced new customer-facing features to automate user GDPR requests.

As with any new features, a proper analysis had to be done before we could implement our solution. It was clear that the following issues had to be addressed:

  • Scalability
    • Our solution must be able to process thousands of requests without having any impact on the rest of the platform and user experience.
  • “Good enough” time efficiency
    • GDPR gives us one month to respond to customer requests. That means that we don’t really care if a report generation takes one minute, or one week.
  • Stability
    • Potential errors or outages in the solution must be recoverable.
  • Concurrency
    • Because a single report generation may take a long time, we must make sure that multiple requests can be processed at once.

The first architectural decision is asynchronicity. Doing things asynchronously solves the scalability issue, as processing the GDPR request does not block the response to the user, nor does it require immediate execution of the requested action.

Postponing the execution of tasks implies the need for a message queue and a worker. The worker consumes the queued messages and monitors the tasks currently being processed. Most modern web frameworks have libraries that provide an easy way to queue asynchronous tasks, which is straightforward if you have a monolithic application that stores all data in one place. In our microservice environment, the data is scattered across multiple databases in multiple services, and thus we had to go a different way.

Asynchronous execution

Sequence diagram capturing simplified flow of asynchronous execution
with the use of a message queue.

GDPR user rights represent a new area of user features, thus we created a new service, called GDPR user requests manager (later called requests manager). This service also acts as the worker servicing the requests queue.

Once a user submits a request, a message is stored in our message queue. The requests manager waits for these messages and processes them fully asynchronously. Thanks to this, we don’t need to be too concerned about temporary requests manager outages: no data is lost, and queued messages are simply processed later.
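
The flow above can be illustrated with a minimal sketch. It uses Python's in-memory `queue.Queue` as a stand-in for the broker, so it only demonstrates the decoupling, not durability; the names `submit_request`, `worker`, and the message fields are hypothetical and not taken from our code base.

```python
import queue
import threading

# Hypothetical in-memory stand-in for the broker queue. A real broker
# (RabbitMQ in our case) keeps the messages while the worker is down.
request_queue = queue.Queue()
processed = []

def submit_request(user_id, kind):
    """Enqueue the request and return immediately; nothing is processed here."""
    request_queue.put({"user_id": user_id, "kind": kind})
    return "accepted"

def worker():
    """Drain the queue until the None sentinel is seen."""
    while True:
        message = request_queue.get()
        if message is None:
            break
        processed.append(message)  # placeholder for the real processing

# Requests submitted while no worker is running simply wait in the queue.
submit_request(1, "access")
submit_request(2, "erasure")
assert processed == []

# Once a worker is (back) up, it works through the backlog.
request_queue.put(None)
thread = threading.Thread(target=worker)
thread.start()
thread.join()
```

The user-facing call returns as soon as the message is enqueued, which is what keeps request handling independent of how long the processing takes.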

At Showmax, we use the AMQP message passing protocol, and RabbitMQ as the message broker. Our wrapper library, written in Ruby, also provides features like message tracing and delayed execution. Sending an MQ message has another advantage: multiple consumers can listen for the same message. This can be set up to trigger multiple subsequent actions, like data processing, sending email notifications, or storing actions in an audit log system.
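
To make the "multiple consumers" idea concrete, here is a toy sketch of a publisher whose message reaches every bound consumer. The `Exchange` class is an in-memory illustration only, not the RabbitMQ or AMQP API, and the routing key `gdpr.request.created` is an invented example.

```python
class Exchange:
    """Toy stand-in for a broker exchange: every handler bound to a
    routing key receives its own copy of each published message."""

    def __init__(self):
        self._handlers = {}

    def bind(self, routing_key, handler):
        self._handlers.setdefault(routing_key, []).append(handler)

    def publish(self, routing_key, message):
        for handler in self._handlers.get(routing_key, []):
            handler(dict(message))  # each consumer gets an independent copy

audit_log = []
emails = []

exchange = Exchange()
# Two independent consumers react to the same event.
exchange.bind("gdpr.request.created", lambda m: audit_log.append(("audit", m["user_id"])))
exchange.bind("gdpr.request.created", lambda m: emails.append(f"confirmation for user {m['user_id']}"))

exchange.publish("gdpr.request.created", {"user_id": 42})
```

The publisher does not know who consumes the message, so a new follow-up action can be added by binding another consumer without touching the sender.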

Cross Service Synchronization

Upon receiving an MQ message, the requests manager saves the request into a PostgreSQL database in order to track the state of the request. There are multiple workers running in the requests manager service that look into the database for new requests and start processing them. With multiple workers, it’s possible that several workers start processing the same request at the same time, so cross-worker synchronization is needed. We handle this by creating mutexes stored in a distributed key-value store. We use Redis, as it provides speed and reliability.

Mutual exclusion

Sequence diagram capturing mutual exclusion achieved
using Redis as store for locks.
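
A common way to build such a mutex on Redis is the `SET key value NX PX milliseconds` command, which sets the key only if it does not exist and lets it expire on its own. The sketch below mimics those semantics with an in-memory class so that it runs without a Redis server; `FakeKeyValueStore`, `try_process`, the key format, and the 30-second TTL are all assumptions made for illustration, not a description of our actual locking code.

```python
import time
import uuid

class FakeKeyValueStore:
    """In-memory stand-in that mimics 'set if absent, with expiry'."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at)

    def set_if_absent(self, key, value, ttl_seconds):
        now = time.monotonic()
        current = self._data.get(key)
        if current is not None and current[1] > now:
            return False  # another worker holds an unexpired lock
        self._data[key] = (value, now + ttl_seconds)
        return True

    def delete_if_value(self, key, value):
        # Only the owner may release the lock. With a real Redis this
        # compare-and-delete has to be made atomic, e.g. via a script.
        current = self._data.get(key)
        if current is not None and current[0] == value:
            del self._data[key]
            return True
        return False

def try_process(store, request_id, worker_token, handler):
    """Process the request only if this worker wins the lock."""
    lock_key = f"lock:request:{request_id}"
    if not store.set_if_absent(lock_key, worker_token, ttl_seconds=30):
        return False
    try:
        handler(request_id)
        return True
    finally:
        store.delete_if_value(lock_key, worker_token)

store = FakeKeyValueStore()
handled = []
token_a, token_b = uuid.uuid4().hex, uuid.uuid4().hex

# Worker A holds the lock, so worker B's attempt on the same request is rejected.
store.set_if_absent("lock:request:7", token_a, ttl_seconds=30)
second_attempt = try_process(store, 7, token_b, handled.append)

# After A releases the lock, B can take over.
store.delete_if_value("lock:request:7", token_a)
third_attempt = try_process(store, 7, token_b, handled.append)
```

The expiry matters for reliability: if a worker dies while holding the lock, the key disappears after the TTL and another worker can pick the request up.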

Service Reliability

If a request for data access comes, the requests manager is responsible for generating the data access report. During the process, we access multiple databases, call multiple third-party APIs that may contain user information, and also query Elasticsearch for data that is not being persisted in relational databases (e.g. IP addresses of a user).

Should the generation of any part of the report fail, the parts already generated are deleted and the report’s state in the database is reverted to its previous value. The report generation is then retried automatically on the next run of a worker. This process makes report generation self-healing.
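
The revert-and-retry behaviour can be sketched as follows. The request is modelled as a plain dictionary and the data sources as callables; the state names (`new`, `generating`, `done`) and the function `generate_report` are hypothetical and only illustrate the idea.

```python
def generate_report(request, sources):
    """Collect all report parts; on any failure, drop the partial output
    and revert the state so that a later worker run retries the request."""
    previous_state = request["state"]
    request["state"] = "generating"
    parts = {}
    try:
        for name, fetch in sources.items():
            parts[name] = fetch(request["user_id"])
    except Exception:
        parts.clear()                      # delete what was already generated
        request["report"] = None
        request["state"] = previous_state  # the next run will pick it up again
        return False
    request["report"] = parts
    request["state"] = "done"
    return True

# A source that fails on the first call and succeeds afterwards.
calls = {"count": 0}

def flaky_search(user_id):
    calls["count"] += 1
    if calls["count"] == 1:
        raise ConnectionError("search backend unavailable")
    return ["203.0.113.7"]

sources = {"profile": lambda user_id: {"id": user_id}, "ip_addresses": flaky_search}
request = {"user_id": 1, "state": "new", "report": None}

first_run = generate_report(request, sources)
state_after_failure = request["state"]
second_run = generate_report(request, sources)
```

Because a failed run leaves the request exactly as it found it, retrying is always safe and no half-finished report is ever exposed.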

Orchestration

Erasure requests are not processed immediately, nor do they block the user interface. The requests are stored in the requests manager, which only emits an MQ message for the user cancellation service. That service takes care of data deletion.

Because of our microservices architecture, customer data is scattered across multiple services (or, more precisely, their databases) and third-party services. The user cancellation service sends multiple MQ messages to the services that contain customer data, and calls third-party APIs in order to delete data from those services.

MQ messages are a form of asynchronous communication, which leaves the user cancellation service with no information about whether the data was successfully deleted. So that the deletion task can be declared successful, each service sends an MQ message back confirming the erasure.

The user cancellation service tracks the state of the data erasure and, after it gets confirmation from all other services, the erasure request is marked as completed. There is one interesting problem, though. On the one hand, we must delete user data from the service that sends emails; on the other hand, we must send customers an email confirming that the data was deleted. The solution is to postpone the deletion from the emailing system so that there is enough time for the email to be delivered.
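
The confirmation tracking can be sketched as a small state holder. The class `ErasureTracker`, the state names, and the service names are invented for illustration; the point is only that the request completes when the last confirmation arrives, and that repeated confirmations are harmless.

```python
class ErasureTracker:
    """Tracks which services still have to confirm the erasure for one user."""

    def __init__(self, user_id, services):
        self.user_id = user_id
        self.pending = set(services)
        self.state = "in_progress" if self.pending else "completed"

    def confirm(self, service):
        # discard() makes duplicate or unexpected confirmations a no-op,
        # which helps if a confirmation message is delivered more than once.
        self.pending.discard(service)
        if not self.pending:
            self.state = "completed"
        return self.state

tracker = ErasureTracker(42, ["billing", "playback", "emailing"])
after_first = tracker.confirm("billing")
after_duplicate = tracker.confirm("billing")
after_second = tracker.confirm("playback")
after_last = tracker.confirm("emailing")
```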

To be able to delay erasure from the emailing system, we needed a way to postpone the MQ message that triggers the deletion. RabbitMQ can send delayed messages if you enable a delayed message plugin. However, after we investigated, we found that it wasn’t suitable for production use due to a few limitations. So, we created a new service that schedules the messages. It stores the MQ messages under a specific key in a relational database, and each message must contain information about when it should be sent. Stored messages are processed periodically and sent once their time has come.
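
A minimal sketch of such a scheduler follows, using an in-memory SQLite database in place of the real relational database. The table layout, the function names (`schedule`, `dispatch_due`), and the key format are assumptions for illustration; the actual service may differ in all of them.

```python
import sqlite3

def create_store():
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE scheduled_messages ("
        " message_key TEXT PRIMARY KEY,"
        " payload TEXT NOT NULL,"
        " send_at REAL NOT NULL)"
    )
    return db

def schedule(db, message_key, payload, send_at):
    # Scheduling the same key again replaces the earlier entry.
    db.execute(
        "INSERT OR REPLACE INTO scheduled_messages (message_key, payload, send_at)"
        " VALUES (?, ?, ?)",
        (message_key, payload, send_at),
    )
    db.commit()

def dispatch_due(db, now, publish):
    """Publish every message whose time has come and remove it from the store."""
    rows = db.execute(
        "SELECT message_key, payload FROM scheduled_messages"
        " WHERE send_at <= ? ORDER BY send_at",
        (now,),
    ).fetchall()
    for message_key, payload in rows:
        publish(payload)
        db.execute("DELETE FROM scheduled_messages WHERE message_key = ?", (message_key,))
    db.commit()
    return len(rows)

db = create_store()
sent = []
schedule(db, "erase-email:42", '{"user_id": 42}', send_at=1000.0)

early = dispatch_due(db, now=999.0, publish=sent.append)     # not due yet
on_time = dispatch_due(db, now=1000.0, publish=sent.append)  # due now
again = dispatch_due(db, now=2000.0, publish=sent.append)    # already sent
```

In a running service, `dispatch_due` would be called periodically with the current time. Note that this sketch publishes before deleting, so a crash between the two steps would cause the message to be sent again on the next run.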

Overall architecture

Simplified diagram of the overall solution.

Monitoring

When you’re developing a solution, keep in mind that you will also run and maintain it. Knowing whether the service is healthy is a critical part of operating it.

No monitoring could be done without extensive logging. If you follow our blog, you probably know that we at Showmax are obsessed with logging. In the case of GDPR, we generate fine-grained log information about actions triggered and steps executed. At Showmax we use Elastic stack (Elasticsearch, Logstash, Kibana) for processing, indexing, and visualizing the data.

For the GDPR project, we created dashboards in Kibana with visualizations and numeric metrics. While it is interesting to look at the dashboards, we also need to be notified when something goes wrong. For that, we use Elastalert. This tool lets you define rules with threshold numbers on top of the Elasticsearch data.

We focus on monitoring of the communication between services:

  • How many requests users submitted (how many times did they click a button)?
  • How many of those requests were received by the requests manager?
  • How many emails were sent or data reports generated?

All of these numbers must be in sync.
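
The "in sync" check can be expressed as a comparison of counts between consecutive stages. The sketch below is a plain Python illustration with made-up numbers and a hypothetical `find_mismatches` helper; it is not an ElastAlert rule. Since requests are processed asynchronously, a real check would compare counts over a time window that gives in-flight requests time to finish.

```python
def find_mismatches(counts, tolerance=0):
    """Compare each stage with the next one and report pairs whose
    counts differ by more than the allowed tolerance."""
    stages = list(counts.items())
    mismatches = []
    for (name_a, count_a), (name_b, count_b) in zip(stages, stages[1:]):
        if abs(count_a - count_b) > tolerance:
            mismatches.append((name_a, name_b, count_a - count_b))
    return mismatches

healthy = {"submitted": 120, "received": 120, "completed": 120}
broken = {"submitted": 120, "received": 118, "completed": 118}

healthy_result = find_mismatches(healthy)
broken_result = find_mismatches(broken)
```

In the second case, the mismatch points at the hop between the user-facing service and the requests manager, which is where the investigation would start.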

Data access requests visualization

Visualisation of data access requests submitted and finished.

Conclusion

On our journey to deliver Data Subject Rights, we created multiple new services for several new features, all of which strongly embrace asynchronous messages managed by RabbitMQ. The entire solution is highly scalable.

As a side product, we created a service that allows us to send delayed MQ messages. We already started leveraging this new solution for retry functionality in multiple places around our platform.
