First Aid for Microservices

Hello! In this article, I want to talk about common problems you may encounter in your microservices, and to give you tips for intervening in your services during such emergencies. If you’re ready, let’s get started!

Our applications are exposed to varying loads throughout the day, and problems can appear in the database or in the other microservices we use (usually in unexpected places). Bringing the system back to life as quickly as possible is very important to keep these issues from growing.

Damage Assessment

It is important to form a good guess about the source of the problem before making the first intervention, because an incorrect intervention can make the problem worse. Let’s look at some of the metrics that can be tracked at this stage and what we can infer from them.

Let’s assume we notice that the response time of one of our services has increased or its error rate has risen. First, we quickly run through the general checks (a sketch for pulling some of these numbers follows the list):

  • Error rates and response times of dependent services
  • The number of HTTP 4XX and 5XX errors in the responses
  • CPU Usage
  • Memory Usage
  • If you use streaming services such as Kafka or Kinesis, the consumer lag or iterator age metrics
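
As a concrete starting point, here is a minimal sketch of pulling such a snapshot from AWS CloudWatch with boto3, assuming the service sits behind an Application Load Balancer. The load balancer identifier is a hypothetical placeholder; adapt the namespace and dimensions to your own setup.

```python
# Minimal health snapshot from CloudWatch (assumes boto3 and an ALB).
# The "LoadBalancer" dimension value below is a hypothetical placeholder.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

def metric_sum(metric_name, minutes=15):
    """Sum a load balancer metric over the last `minutes` minutes."""
    now = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/ApplicationELB",
        MetricName=metric_name,
        Dimensions=[{"Name": "LoadBalancer", "Value": "app/my-service/0123456789"}],
        StartTime=now - timedelta(minutes=minutes),
        EndTime=now,
        Period=60,
        Statistics=["Sum"],
    )
    return sum(point["Sum"] for point in response["Datapoints"])

for name in ("HTTPCode_Target_4XX_Count", "HTTPCode_Target_5XX_Count"):
    print(name, metric_sum(name))
```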

First, we need to understand the scope of the problem. If there are problems in other services as well, there might be a common cause: in such cases, the source could be network permissions, an issue in a shared dependent service, or a general problem affecting your data center.

HTTP 4XX and 5XX errors can point to very different issues. For example, 4XX responses may involve request timeouts, authentication problems, or business-rule validation errors, while 5XX responses may involve unreachable services, failures in the database or cache layer, or errors returned from the database. For both 4XX and 5XX, looking at the logs should be one of the first things you do.
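
If your services write structured access logs, a rough tally of which endpoints produce the most errors can narrow the search quickly. This sketch assumes JSON log lines with hypothetical `status` and `path` fields; adjust the field names to your own log format.

```python
# Tally error responses per endpoint from JSON-formatted access logs.
# The "status" and "path" field names are assumptions about the log format.
import json
from collections import Counter

errors = Counter()
with open("access.log") as log_file:
    for line in log_file:
        entry = json.loads(line)
        if entry["status"] >= 400:  # assumes a numeric status field
            errors[(entry["status"], entry["path"])] += 1

for (status, path), count in errors.most_common(10):
    print(f"{count:>6}  {status}  {path}")
```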

Microservices consume CPU differently depending on the work they do, which means they can react differently at the same CPU level. For example, a service might stop responding because of another bottleneck even though its CPU usage is only 20%. It is important to learn how the service reacts through load tests and to know it well. On the other hand, even a single query can increase database CPU usage significantly. Knowing the operating logic and the weak spots of the components you use will make this easier. For example, Redis is very fast but processes commands on a single thread, so misuse can result in slow responses or even a blocked server. As with CPU usage, problems may also occur when memory runs short.
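
To make the Redis pitfall above concrete, here is a small sketch with the redis-py client: `KEYS` walks the whole keyspace in one blocking call and stalls every other client on a large dataset, while `SCAN` does the same work in small steps. The key pattern is hypothetical.

```python
# Illustrates the single-thread pitfall in Redis using redis-py.
# The "session:*" key pattern is a hypothetical example.
import redis

r = redis.Redis(host="localhost", port=6379)

# Risky on a large keyspace: one long, blocking command that stalls
# every other client until it finishes.
# keys = r.keys("session:*")

# Safer: iterate incrementally so other commands can run in between.
for key in r.scan_iter(match="session:*", count=500):
    print(key)
```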

Events may accumulate in the queue you use, either because too many events are produced or because consumers cannot process them fast enough. For example, if you have a queue for notifications and the consumer processing it works slowly, notifications will be delayed. Once thousands or millions of events have accumulated, dealing with them can be difficult. An important step is to check the metrics that show unprocessed messages, such as consumer lag in Kafka or the iterator age metric in Kinesis.
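
For Kafka, the lag of a consumer group can be measured directly, as in this sketch with the kafka-python client. The topic and group names are hypothetical; in practice you would more likely read this from your monitoring system.

```python
# Measure per-partition consumer lag for one topic with kafka-python.
# Topic and consumer group names are hypothetical.
from kafka import KafkaConsumer, TopicPartition

consumer = KafkaConsumer(
    bootstrap_servers="localhost:9092",
    group_id="notification-workers",
    enable_auto_commit=False,
)

partitions = [
    TopicPartition("notifications", p)
    for p in consumer.partitions_for_topic("notifications")
]
end_offsets = consumer.end_offsets(partitions)

for tp in partitions:
    committed = consumer.committed(tp) or 0
    print(f"partition {tp.partition}: lag {end_offsets[tp] - committed}")
```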

Even if the previous steps gave us an idea about the problem, I think reviewing the logs is still important. Logs may contain information about the source of the problem, and while we are reasoning from assumptions, they can provide evidence and shed some light.

First Intervention

We have quickly looked at the general condition of the system; now we need to make the first intervention to get it working again. The most immediate solutions that come to mind are deploying a new version or increasing the number of instances. Although these two weapons can solve many problems, they can also have negative effects.

For example, if the database CPU usage is very high, your instances may already be unable to complete their database operations. In that case, increasing the number of instances of your microservice is no solution and will even make things worse: the new instances will open more connections and send more queries, and a small war will break out between your instances. Instead of scaling out, it may be a better option to disable the features that put load on the system.
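
One way to do this is a runtime kill switch: gate the expensive, non-critical work behind a flag you can flip without a deploy. The sketch below stores the flag in Redis, but any shared configuration source works; the feature name and helper functions are hypothetical.

```python
# Hypothetical kill switch: shed database load by disabling a heavy,
# non-critical feature at runtime. The flag lives in Redis here, but any
# shared config store would do.
import redis

config = redis.Redis(host="localhost", port=6379)

def recommendations_enabled():
    # Disable during an incident with: SET feature:recommendations off
    # A missing key means "on", so the default behavior is unchanged.
    return config.get("feature:recommendations") != b"off"

def load_product(product_id):
    ...  # imagine a cheap primary-key lookup (critical path)

def load_recommendations(product_id):
    ...  # imagine the expensive join we want to be able to switch off

def get_product_page(product_id):
    page = {"product": load_product(product_id)}
    if recommendations_enabled():
        page["recommended"] = load_recommendations(product_id)
    return page
```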

This also applies to other components your service depends on: anything it communicates with synchronously, such as another microservice or a cache, falls into this category. If your timeout values are long, late responses from your dependencies put your microservice in great danger, because while it waits until the last moment for a database that cannot answer, it soon becomes unable to answer its own callers.
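
Tight timeouts turn a hung dependency into a quick, handleable error instead of tying up your worker threads. Here is a minimal sketch with the Python requests library, calling a hypothetical internal service:

```python
# Fail fast on a synchronous dependency instead of waiting until the
# last moment. The service URL is a hypothetical placeholder.
import requests

def fetch_profile(user_id):
    try:
        response = requests.get(
            f"https://profiles.internal/users/{user_id}",
            timeout=(0.5, 2.0),  # 500 ms to connect, 2 s to read
        )
        response.raise_for_status()
        return response.json()
    except requests.exceptions.RequestException:
        return None  # degrade gracefully rather than hang the caller
```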

If solutions such as stopping non-urgent queries and slowing down asynchronous work do not help, temporarily adding a read replica can be considered before scaling up the machine the database runs on. You can apply similar solutions, such as adding shards, to Redis or other systems that can expand horizontally.
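
The payoff of a read replica comes from routing lag-tolerant, read-only queries away from the primary. A minimal sketch with psycopg2 and hypothetical connection strings:

```python
# Route read-only queries to a replica and keep writes on the primary.
# Hostnames, table, and columns are hypothetical.
import psycopg2

primary = psycopg2.connect("host=db-primary.internal dbname=app")
replica = psycopg2.connect("host=db-replica.internal dbname=app")

def fetch_order_history(user_id):
    # Tolerates slight replication lag, so the replica is fine here.
    with replica.cursor() as cur:
        cur.execute("SELECT * FROM orders WHERE user_id = %s", (user_id,))
        return cur.fetchall()

def create_order(user_id, total):
    # Writes must always go to the primary.
    with primary.cursor() as cur:
        cur.execute(
            "INSERT INTO orders (user_id, total) VALUES (%s, %s)",
            (user_id, total),
        )
    primary.commit()
```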

When you encounter too many unprocessed events in the queue, you can start by looking at the logs. If we cannot get the system back up to speed quickly, processing priority events through a separate queue may make the problem less noticeable to users.
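
One rough way to do this while the backlog drains, sketched here with kafka-python, is to route urgent events to their own topic with a dedicated consumer. The topic names and event types are hypothetical.

```python
# Send urgent events to a separate, faster-draining topic while the main
# queue is backed up. Topic names and event types are hypothetical.
import json

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda event: json.dumps(event).encode("utf-8"),
)

URGENT_TYPES = {"otp", "payment-failed"}

def publish_notification(event):
    topic = ("notifications-priority"
             if event.get("type") in URGENT_TYPES
             else "notifications")
    producer.send(topic, event)
```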

Detailed Review and Solution

After experiencing a major problem, you can learn from it and take precautions to prevent it from happening again by asking questions such as the following:

  • How was the problem detected, and how could we detect it earlier?
  • How many people or how much data were affected?
  • How long did it take for the system to start working properly again?
  • What precautions can be taken to prevent the problem from recurring?

Although fully solving some problems may fall outside first aid, I believe that setting alarms for earlier detection of the problem is part of it. Finding a solution before the problem grows becomes much easier.
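
As an example of such an alarm, here is a sketch with boto3 that pages when 5XX responses exceed a threshold. The alarm name, dimensions, threshold, and SNS topic are all assumptions to adapt to your setup.

```python
# Alarm when the service returns too many 5XX responses (assumes boto3).
# All names, dimensions, and the SNS topic ARN below are hypothetical.
import boto3

cloudwatch = boto3.client("cloudwatch")

cloudwatch.put_metric_alarm(
    AlarmName="my-service-high-5xx",
    Namespace="AWS/ApplicationELB",
    MetricName="HTTPCode_Target_5XX_Count",
    Dimensions=[{"Name": "LoadBalancer", "Value": "app/my-service/0123456789"}],
    Statistic="Sum",
    Period=60,
    EvaluationPeriods=3,          # three bad minutes in a row
    Threshold=50,                 # tune to your traffic
    ComparisonOperator="GreaterThanThreshold",
    TreatMissingData="notBreaching",
    AlarmActions=["arn:aws:sns:eu-west-1:123456789012:oncall"],
)
```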

Experiencing the same problems over and over again can reduce the team’s motivation, so it is useful to set aside time for the definitive solutions identified during the first aid phase.

You can adapt the steps I mentioned to suit your needs when intervening in your microservices. The purpose of this article was to explain the reasoning behind the changes we make in critical moments and the risks of those changes. Solving such a problem is a very rapid learning process; although it can be stressful, it is an opportunity to gain experience. You can also read my other article about managing the cost of your microservices. Nevertheless, I wish you trouble-free days. 😄 Looking forward to the next articles 👋

Resources

https://docs.aws.amazon.com/elasticloadbalancing/latest/application/load-balancer-troubleshooting.html

https://aws.amazon.com/rds/features/read-replicas/

