The 10 Factors of a successful Microservice architecture
What does good look like for Microservices?
While trying to answer this question I came up with a Definition Of Done, i.e. if one can answer “yes” to all of the following points, then they are probably well on their way to a successful Microservices architecture.
- A Team can release a capability (a Microservice) multiple times a day, any day of the week, any time of the day, without causing disruption to the customer experience or to other Teams
- The only way Microservices interact with each other is by interface communication.
- Microservices are implemented following the 12 Factors and are therefore cloud-native and cloud-ready
- Microservices are deployed using a blue-green strategy
- Each Microservice can be audited and monitored as one service, even if there are multiple instances running at any point in time
- A Team deploys a set of components whose goal is to randomly disrupt production services
- Each Microservice offers the capability to perform A/B testing
- Each Microservice should have a “Limited Blast Radius”
- Microservices should auto-scale horizontally and limit degradation of service to a minimum in case of high request volumes
- Microservices should live in a self-healing ecosystem
Let’s examine each point in a bit more detail.
A Team can release a capability (a Microservice) multiple times a day, any day of the week, any time of the day, without causing disruption to the customer experience or to other Teams
We know that one of the symptoms that indicates the lack of a DevOps operating model is the inability to release into production in the middle of the week, at any time. In non-Agile organisations, production releases are generally performed out of business hours, require several hours (if not days) to complete and typically require the intervention of Heroes who yet again save the day.
We know that none of this is necessary. If an organisation has a Microservices architecture and can perform blue/green deployments, there is no reason why a production deployment could not be performed at any time of the day, any day of the week. With blue/green deployments a new capability (whether a functionality upgrade or a brand new capability) is deployed into production without the users noticing. Pivotal Cloud Foundry (PCF), for example, allows teams to deploy the same component to two different URLs (one containing “green”, one containing “blue”). If we release a change into production at the “blue” URL, production users won’t even notice the new functionality. We can then partner with a small number of friendly users, or direct a small slice of the user base to the blue URL, and observe how the new capability behaves. If it’s successful, which in a Lean Enterprise means it delivers the value that the business originally thought it would, then we can direct all green traffic to the blue capability, map green and blue to the same URL and start all over again.
The ability to release into production at any time is fundamental to a DevOps operating model and to a Lean Enterprise. It allows the Three Ways (fast flow from left to right, fast feedback and continuous experimentation and learning) and it allows Operations to have a normal life, to spend their week-ends with the family and the nights in bed, sleeping.
The only way Microservices interact with each other is by interface communication
The moment we introduce a physical dependency between a Microservice and an external library (whether another Microservice or a functional library) we have broken the Microservices architecture. Imagine that within a Product, several teams are putting together a concept that involves a shared domain. This can be anything from user registration, to shopping carts, to digital markets and so on. In Java, for example, it is not uncommon to have tools that automatically create a mapping between an interface payload and object graphs (e.g. JAXB). Now suppose that all teams within a certain Product needed to use the same model. It would be easy to fall for the temptation of extracting the domain model objects into an external library and then having each Microservice declare a dependency on that library in its build configuration file (e.g. a Maven dependency).
The problem would occur when a change was needed to the model by one Microservice. What would happen then? The external library would change but then each Microservice would need to be rebuilt and redeployed as a consequence. This breaks a fundamental concept of a healthy Microservice architecture: independence.
Independence is more important than the DRY principle when it comes to Microservices. Because of this, a certain amount of duplication is tolerated if it leads to independence. This is very different from a Service Oriented Architecture (SOA) where the principle is to share as much as possible. With Microservices components should share as little as possible.
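As a minimal sketch of “share the interface, not the library”, the two hypothetical services below each keep their own local view of an Order. They agree only on the JSON payload on the wire, so either team can evolve its model without forcing the other to rebuild (service and field names are illustrative):

```python
import json

class CheckoutOrder:
    """The checkout team's view: it only cares about id and total."""
    def __init__(self, payload: dict):
        self.order_id = payload["orderId"]
        self.total = payload["total"]

class ShippingOrder:
    """The shipping team's view: it only cares about id and address."""
    def __init__(self, payload: dict):
        self.order_id = payload["orderId"]
        self.address = payload["address"]

# The same interface payload feeds both models independently;
# there is no shared model library for either service to depend on.
wire_payload = json.loads(
    '{"orderId": "42", "total": 19.99, "address": "1 Main St"}'
)
checkout_view = CheckoutOrder(wire_payload)
shipping_view = ShippingOrder(wire_payload)
```

The duplication of `order_id` across both classes is exactly the tolerated trade-off: a little repetition buys each team the freedom to change and redeploy independently.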
Microservices are implemented following the 12 Factors and are therefore cloud-native and cloud-ready
The main characteristics of a Microservice are that it is independent, that it is exposed through an interface, and that it scales horizontally.
A number of people working at Heroku, a Platform as a Service (PaaS) offering, wrote the 12 Factors that enable modern application architectures; the list can be found at 12factor.net. In substance, apart from the operational side of how to write software (e.g. having an SCM tool, environment uniformity, continuous delivery and so on), the 12 Factors are important because they define the “rules of engagement” when writing Microservices.
To be viable, Microservices should be able to run in a Cloud-like environment. A PaaS, like Heroku or PCF, provides such an environment. The characteristics of a cloud environment boil down to virtually unlimited resources that can be automatically provisioned, and to a lack of reliance on anything physical, as the cloud is typically a virtual environment. So if an application relies on the file system to do its job, or on the state of a particular instance, it cannot be considered a Microservice, because that undermines the ability to auto-scale horizontally and to quickly recover from failure. Since Microservices must be cloud-ready and a single Microservice can be deployed as hundreds of instances, we need to assume that instances will eventually fail. This means that no instance can rely on its own state to function properly. Ideally Microservices should be stateless, or have as little state as possible.
It should be possible to deploy Microservices in a cloud environment natively, without having to make any changes to the code base.
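One concrete consequence of the 12 Factors is that configuration lives in the environment (Factor III), so the same artifact runs unchanged anywhere. A minimal sketch, where `DATABASE_URL` and `PORT` are hypothetical variable names of my choosing:

```python
import os

def load_config(env=os.environ):
    """Read all deployment-specific settings from the environment.

    Nothing is hard-coded, so the same build runs in dev, test and
    production without code changes."""
    return {
        "database_url": env.get("DATABASE_URL", "postgres://localhost/dev"),
        "port": int(env.get("PORT", "8080")),
    }

# No files written to local disk and no in-process session state:
# any instance can serve any request, which is what makes instances
# disposable and horizontal scaling safe.
config = load_config({"DATABASE_URL": "postgres://prod-db/app", "PORT": "9000"})
```

Because the function takes the environment as a parameter, the same code is trivially testable locally while defaulting to `os.environ` when deployed.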
Microservices are deployed using a blue/green strategy
How can we deploy changes into production any day of the week, any time of the day without causing disruption? The answer is to adopt a blue/green approach. What is it?
Let’s assume that an application is running in production. We will call it “green”. If a change to it is required, we can make the change and then deploy it into production under a non-production URL (we will call it blue). At this point we can either ask a small number of friendly and engaged users to test the new feature at the blue URL, run a series of automated tests against the blue URL, or direct a small number of unaware users to the blue URL and observe how the new capability behaves. If there’s something wrong, we can simply revert the changes at the blue URL and users won’t even notice. However, if the hypothesis has been successfully validated, we can now direct the rest of the traffic to the blue URL. There are various ways to accomplish this, e.g. map the green URL to the blue deployment and retire the blue URL. The point though is that users will not have been disrupted by the change. In the worst-case scenario, a small part of the Product didn’t work, but because it’s small, the rest of the Product would still work and, most importantly, the team in charge of that small part could fix it quickly and redeploy it (this is known as rolling forward).
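The traffic-shifting mechanics described above can be sketched as a tiny router. This is an illustrative model, not a real PCF or router API; the class and method names are mine:

```python
import random

class BlueGreenRouter:
    """Toy model of blue/green traffic shifting."""

    def __init__(self):
        self.blue_share = 0.0   # fraction of traffic sent to the new (blue) version

    def direct_canary(self, share):
        """Send a small slice of users to blue to validate the change."""
        self.blue_share = share

    def promote_blue(self):
        """Hypothesis validated: all traffic goes to blue."""
        self.blue_share = 1.0

    def rollback(self):
        """Something is wrong: users silently return to green."""
        self.blue_share = 0.0

    def route(self, rng=random.random):
        """Decide which deployment serves the next request."""
        return "blue" if rng() < self.blue_share else "green"

router = BlueGreenRouter()
router.direct_canary(0.05)   # 5% of users quietly try the new version
```

Note that rollback is just a routing change: the green deployment never went away, so reverting is instantaneous and invisible to users.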
Blue/green deployments allow teams to deploy Microservices at any time of the day, any day of the week, thus moving the operating model towards DevOps, with both developers and operations enjoying a good work-life balance and not having to perform heroic efforts.
Each Microservice can be monitored and audited as one service, even if there are multiple instances running at any point in time
With a monolith, it’s quite easy to find out what went wrong during a user journey. There are generally two or three places where the logs are kept (depending on how many clusters have been set up). Microservices add complexity because, as discussed earlier, one of their main characteristics is to scale horizontally. This means that there might be hundreds of instances of a Microservice, scattered across the cloud. Assuming the right architectural approach has been followed, a user journey might touch several of these instances. How do we monitor and audit a user journey that has potentially been served by tens of instances?
A good Microservices architecture must provide this capability, and through appropriate monitoring and logging tooling, it is possible. A monitoring tool should show the entire Microservices estate in a single dashboard, including those that are healthy and those that aren’t. Additionally, when writing Microservices, teams should apply the Circuit Breaker pattern described by Martin Fowler. The idea is that if a Microservice is unhealthy it should be taken out of service while the application keeps running (although possibly with limited capability). If the Microservice recovers, it is then added back to the pool.
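A minimal sketch of the Circuit Breaker idea follows; the thresholds are illustrative, and a production implementation (e.g. Netflix Hystrix) would add half-open trial logic, metrics and thread-safety:

```python
import time

class CircuitBreaker:
    """Toy circuit breaker: trip after repeated failures, short-circuit
    to a fallback while open, retry after a cool-down period."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None          # None means the circuit is closed

    def call(self, fn, fallback):
        # While open, skip the real call and degrade gracefully,
        # until the cool-down expires and we try the real call again.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                return fallback()
            self.opened_at = None      # cool-down over: give fn another chance
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()   # trip: take it out of service
            return fallback()
        self.failures = 0              # success closes the circuit
        return result
```

The key property is that once the circuit is open, the unhealthy dependency stops receiving traffic entirely, which is what lets it recover and keeps the caller responsive.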
At Pivotal, they have a policy whereby the support team won’t look at production incidents for the first few minutes, as there’s an assumption that the platform will self-recover. The Microservices that we architect should follow the same principle. A Microservices-enabled monitoring tool should show the Product health on a dashboard at any time and provide the ability to drill down into the various Microservices composing it.
Now to the user journey. The fact that we now have a distributed architecture thanks to Microservices should not impact our ability to audit and track user journeys, from the moment they originated to the moment they completed. A Microservices-enabled logging strategy and tool should provide the ability to follow all the steps of a user journey regardless of how many instances have contributed to it. There are various ways of doing this, e.g. through the use of correlation ids and logging tools that allow for syslog feeds and therefore aggregated logging.
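A minimal sketch of correlation-id propagation follows. The header name `X-Correlation-Id` is a common convention rather than a standard, and the handler here is a stand-in for a real request handler:

```python
import logging
import uuid

logging.basicConfig(format="%(message)s")
log = logging.getLogger("orders")

def handle_request(headers):
    """Reuse the caller's correlation id, or mint one at the system edge."""
    correlation_id = headers.get("X-Correlation-Id") or str(uuid.uuid4())

    # Every log line carries the id, so an aggregated log store
    # (e.g. Splunk) can reassemble one journey across many instances.
    log.info("correlation_id=%s step=validate-order", correlation_id)

    # Downstream calls forward the same header so the trail is unbroken
    # no matter how many Microservice instances the journey touches.
    downstream_headers = {"X-Correlation-Id": correlation_id}
    return downstream_headers

out = handle_request({"X-Correlation-Id": "abc-123"})
```

Searching the aggregated logs for a single correlation id then yields the full, ordered audit trail of that user journey.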
AppDynamics and Splunk are an example of monitoring and logging tools respectively that can be used within a Microservice architecture to have a single view of a service and a single audit trail for user journeys.
A Team deploys a set of components whose goal is to randomly disrupt production services
The Cloud is the natural environment for Microservices, as it offers capabilities such as auto-scaling and virtually unlimited capacity. However, when we architect something for the Cloud, we need to architect for failure. At some point a server will fail, and with it all the Microservices deployed on it. We also need to architect for failure with regard to dependencies on external systems (including other Microservices). Our Microservice should still be up and running (albeit perhaps not 100% functional) if an external system that it depends on is down. This is why concepts such as architecting for failure and stateless services are so important. At some point, one or more Microservice instances will disappear from under our feet, and our application should keep running when this happens.
Netflix, one of the pioneers of Microservice architectures, developed a suite of tools whose purpose is to randomly kill some production elements. The goal that Netflix imposed on itself was to make sure that its products would still function when the “Simian Army” went to battle. If you want to know more, you can read Netflix’s technology blog posts on the Simian Army.
If we code keeping in mind that the equivalent of a Simian Army will run randomly and randomly disrupt our production environment, the resulting services will be more reliable and stable, ultimately leading to an “always up” operating model and product.
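The idea can be sketched as a toy version of the Simian Army experiment: randomly kill instances from a pool, and check the product stays available as long as at least one healthy instance survives (function names and the pool are mine, for illustration only):

```python
import random

def unleash_monkey(instances, kill_probability, rng=random.random):
    """Return the instances that survive one random disruption round."""
    return [i for i in instances if rng() >= kill_probability]

def service_available(instances):
    # With stateless, horizontally-scaled services, one surviving
    # instance is enough to keep the service up.
    return len(instances) > 0

# A pool of four instances subjected to a 50% kill rate.
survivors = unleash_monkey(["i-1", "i-2", "i-3", "i-4"], kill_probability=0.5)
```

Running this kind of experiment continuously in production is what forces teams to design for disposable instances rather than hope servers stay up.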
Each Microservice offers the capability to perform A/B testing
Why do we want to adopt Microservices? The answer is not because the technology is “cool” but rather because a Microservice architecture enables teams to release independent functionalities whenever they want and this, in turn, allows the business to continuously validate their hypotheses.
The foundation of a Lean Enterprise is the scientific method. The business formulates hypotheses which, if delivered to production, should provide a certain value as an outcome. In the modern era value is not only dollars: it can be likes on Facebook, followers on Twitter, connections on LinkedIn or bitcoins. The point is that the business thinks that by delivering a certain capability to production they will obtain a certain value. The ability to validate these hypotheses as quickly as possible is fundamental to the success of any high-performing organisation, as generally only 5-10% of ideas generate value.
I’ve recently come across a great talk by Barry O’Reilly at the goto; conference in Chicago. Barry is the co-author of the book Lean Enterprise and therefore knows this stuff inside out.
During the talk Barry presented the equivalent of a story card for Hypothesis-Driven Delivery, which I describe below.
The card shows the equivalent of the guidelines for writing story cards, i.e.: As a <stakeholder>, I need <this capability> so that I can achieve <this business outcome>. For a Lean Enterprise, the business starts from a hypothesis that they believe will result in a certain outcome, and the capability is not Done until it has been measured in the hands of the customers (i.e. <we see a measurable signal>). It follows that the ability to conduct experiments is critical to business success, and that a Microservices architecture, together with a team-based organisational structure (e.g. the 2-Pizza Teams from Amazon or Squads/Tribes/Guilds from Spotify), is what makes this possible.
An effective Microservices architecture therefore must provide the ability to perform A/B testing, i.e.: if we try this we think it’ll deliver this outcome, if we try that we think it’ll provide this other outcome. A/B testing is possible thanks to a number of techniques, including the blue-green deployments discussed above.
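One common way to implement A/B cohorts is sticky assignment: hash the user id so each user always lands in the same variant, letting us compare measurable signals between the two groups. A minimal sketch (the experiment name and bucket split are illustrative):

```python
import hashlib

def assign_variant(user_id, experiment="new-checkout", b_share=0.5):
    """Deterministically assign a user to variant A or B.

    Hashing (experiment, user) means the same user always sees the same
    variant, and different experiments split users independently."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100 / 100.0   # deterministic value in [0, 1)
    return "B" if bucket < b_share else "A"
```

Because assignment needs no stored session state, it works unchanged across hundreds of stateless instances, which is exactly the property the 12 Factors demand.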
Each Microservice should have a “Limited Blast Radius”
Since we have decomposed a Product into a set of functionalities, each deployed as a Microservice, we want to make sure that when we occasionally get things wrong, e.g. a bug escapes into production, only a particular slice of the Product stops working, while users can still use the rest. This is the concept of a Limited Blast Radius. Moreover, since a team is typically in charge of one or more Microservices, fixing the issue should not take long.
For example, when you look at the Amazon product page, you’re looking at a number of Microservices, each providing a slice of the entire user experience. If the Rating microservice was down for some reason, you would still be able to purchase a product, although you wouldn’t be able to leave a rating for it.
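The Amazon-style page composition above can be sketched as follows: each slice of the page is fetched independently, and a failing slice degrades to a fallback instead of taking the whole page down (the service and slice names are hypothetical):

```python
def fetch_ratings():
    # Simulate the Ratings microservice being down.
    raise ConnectionError("ratings service is down")

def fetch_price():
    return {"price": 19.99}

def render_product_page():
    """Compose the page from independent slices; a failure in one slice
    is contained to that slice (a limited blast radius)."""
    page = {}
    for name, fetch, fallback in [
        ("price", fetch_price, {"price": None}),
        ("ratings", fetch_ratings, {"ratings": "unavailable"}),
    ]:
        try:
            page[name] = fetch()
        except Exception:
            page[name] = fallback     # degrade this slice only
    return page

page = render_product_page()
```

The purchase path (price) keeps working even though ratings are down, which is precisely the behaviour the Amazon example describes.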
Microservices should auto-scale horizontally and limit the degradation of service in case of high request volumes
One of the aspects that makes Microservices so attractive is their ability to run in the Cloud. This ability should be exploited to set up auto-scaling policies so that when, say, a Microservice’s CPU has been above 80% utilisation for more than 120 seconds, a new instance is created, and so on. Apart from the resulting ability to cope with changing volumes adequately, this also offers some defence against DDoS attacks.
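The scaling rule just described can be expressed as a simple policy check; the thresholds mirror the text, and the sampling interval is an assumption of mine:

```python
def should_scale_out(cpu_samples, threshold=0.80, sustained_seconds=120,
                     sample_interval=30):
    """Decide whether to add an instance.

    cpu_samples: recent CPU utilisation readings (0.0-1.0), oldest first,
    taken every `sample_interval` seconds. Scale out only when utilisation
    has stayed above `threshold` for the whole sustained window, so a
    momentary spike does not trigger a new instance."""
    needed = sustained_seconds // sample_interval
    recent = cpu_samples[-needed:]
    return len(recent) >= needed and all(s > threshold for s in recent)

# Four consecutive 30-second samples above 80% = 120 sustained seconds.
scale = should_scale_out([0.55, 0.85, 0.90, 0.88, 0.92])
```

Real platforms (PCF autoscaler, AWS Auto Scaling and the like) implement richer versions of this rule, including scale-in thresholds and cooldown periods.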
When volumes are very high (or maybe too high), the overall Product should still be available and users should still enjoy their experience, albeit perhaps with some limitations.
Microservices should live in a self-healing ecosystem
Microservices are a distributed architecture. This means that we don’t know where exactly each Microservice instance will live and we might not know the exact number (for example if we have indicated ranges in our auto-scaling policy). As mentioned above, the fact that Microservices live in a Cloud (whether internal, external or hybrid), means that we must accept that a number of them will crash at a certain point.
Whatever platform we choose for our Microservices, it should provide the capability both to remove unhealthy instances from the pool and to spin up new ones, automatically adding them to the pool. Platforms such as PCF and Heroku provide this kind of functionality, and this is what gives us peace of mind that our Product will have a certain degree of stability compared to more traditional applications where, say, once a cluster instance is down there isn’t much that can be done apart from trying to bring it up again.
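The self-healing behaviour the platform performs for us boils down to a reconciliation loop: compare the desired instance count with the healthy reality, cull the crashed instances and spawn replacements. A minimal sketch, with hypothetical instance names:

```python
def reconcile(pool, desired_count, is_healthy, spawn):
    """One pass of a self-healing loop.

    Removes unhealthy instances from the pool and spawns replacements
    until the desired count is restored."""
    healthy = [i for i in pool if is_healthy(i)]   # cull crashed instances
    while len(healthy) < desired_count:
        healthy.append(spawn())                    # replace them automatically
    return healthy

counter = {"n": 0}
def spawn_instance():
    counter["n"] += 1
    return f"new-{counter['n']}"

# "i-2" has crashed; the loop replaces it without human intervention.
pool = reconcile(["i-1", "i-2", "i-3"], desired_count=3,
                 is_healthy=lambda i: i != "i-2", spawn=spawn_instance)
```

Platforms run this loop continuously, which is why the Pivotal support team mentioned earlier can afford to wait a few minutes before touching an incident.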