This post is part 2 of a three-part series on this topic. Please find part 1 here and part 3 here.
Rolling Deployment - Recreate Strategy
The most commonly used deployment paradigm for stateless services is the 'Recreate' strategy. The idea is simple: every node in a horizontally scalable stateless service (powering an API or ETL/AWS Lambda/Azure Functions) can serve every call or invocation independently and idempotently (the same input parameters always yield the same result) and keeps no user- or request-specific state on its locally attached disk. So, instead of upgrading a node in place, we create a new node running the latest version of the software and destroy the existing one. As discussed in the previous part, the primary benefit this provides is Immutability, which reduces the chances of bugs caused by incorrectly upgraded dependencies, leftover configuration, volatile state from the previous version, and so on. The new version starts from a clean slate, with all of its required dependency versions installed and its configuration properties set exactly as the latest version expects. Since we did not update or modify any previously installed libraries or configuration but instead recreated everything from scratch, we can bootstrap the upgraded version with absolute reliability. Any issues that arise when starting the new version are bugs in the new version itself and indicate testing gaps rather than deployment problems.
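The recreate step can be sketched as below. This is a minimal illustration, not a real orchestrator; `create_node` and `destroy_node` are hypothetical stand-ins for a cloud provider's API.

```python
# Hypothetical sketch of the Recreate strategy for one node.
# create_node / destroy_node stand in for a real cloud provider API.

def create_node(version):
    """Provision a fresh node with the given software version installed."""
    return {"version": version, "state": "running"}

def destroy_node(node):
    """Tear the old node down; nothing from it is reused."""
    node["state"] = "terminated"

def recreate(old_node, new_version):
    # The replacement is built from scratch: fresh dependencies, fresh
    # configuration, no leftover state from the previous version.
    new_node = create_node(new_version)
    destroy_node(old_node)
    return new_node
```

The point is that the old node is never mutated in place; it is simply discarded once its replacement is up.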
The diagram above depicts the various stages of a Rolling Deployment using the Recreate strategy, applied to a load-balanced stateless API service.
Canary deployment is the mode where a single node of the N-node cluster serving your load-balanced stateless service is upgraded to the new version and starts receiving traffic. The primary idea is to test the new version of the software against production load and iron out any remaining issues before opening the flood-gates of production traffic on it. This is an absolute must-have when a very large cluster of nodes (>50) serves your service's production traffic. In my experience, rolling back to the stable version after a full-fleet upgrade, once you detect bugs in the new version, is a tedious and business-impacting process.
Once you are satisfied with the software in canary mode, you move on to upgrading the entire fleet of nodes in your service cluster to the latest version. This is termed rolling because you keep upgrading the software X nodes at a time until the entire fleet is upgraded. X can range from 1 node to as many nodes as your service can lose without causing business impact; as the service owner, you are the best judge of that.
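The rolling loop itself can be sketched as follows. This is a simplification that assumes load-balancer draining and health-checking between batches are handled elsewhere.

```python
# Minimal sketch of a rolling upgrade, X nodes per batch. Draining and
# health-checking between batches are elided for brevity.

def rolling_deploy(fleet, new_version, batch_size):
    """Upgrade the fleet in place, batch_size nodes per round."""
    for start in range(0, len(fleet), batch_size):
        for node in fleet[start:start + batch_size]:
            # Each node in the batch is recreated with the new version
            # while the rest of the fleet keeps serving traffic.
            node["version"] = new_version
        # A real orchestrator would health-check the batch here before
        # proceeding to the next one.
    return fleet

fleet = [{"id": n, "version": "v1"} for n in range(8)]
rolling_deploy(fleet, "v2", batch_size=2)
```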
The key feature here is Zero Downtime for the service. The load balancer keeps serving production traffic while the software is being upgraded. As depicted in the diagram above, the node drawn with dashed lines is being recreated (the previous node destroyed and a new one brought up), and while this is in progress the load balancer directs no traffic to it. The cluster is operating at 75% capacity, since only 3 of the 4 nodes are serving traffic at that moment.
Also, during the deployment the service operates in a mixed mode, where some nodes serve requests with version 2 code while others are still running version 1. You need to weigh this carefully; if it degrades the overall experience for your users, you should not employ this mechanism. In my experience, stateless services are agnostic to these issues while stateful services are not. The alternative is blue-green deployment, which we will discuss alongside stateful services since that is where it makes the most sense; but if your stateless service cannot tolerate mixed-mode results, you can employ that strategy with a stateless service too.
Choosing the number of nodes to upgrade in one go depends on the amount of traffic your service handles, the deployment start trigger, and other characteristics unique to your service. Generally you would trigger a deployment during non-peak hours and ensure all of it finishes (with cluster sizes of 2000+ nodes, I have seen deployments take 4-5 hours to complete) before the service hits peak hours again. If your service operates at 80% load during non-peak hours, you could upgrade 10-15% of the nodes at a time (set X = 10-15% of all nodes) to speed up the overall rolling deployment.
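A rough sizing rule, assuming load stays flat during the rollout: the batch can be at most the fleet's capacity headroom.

```python
# Back-of-the-envelope batch sizing: the nodes you can take out of
# rotation at once are bounded by your capacity headroom.

def max_batch_size(total_nodes, load_fraction):
    """Roughly how many nodes can be down without exceeding capacity."""
    headroom = 1.0 - load_fraction
    return round(total_nodes * headroom)

# At 80% non-peak load on a 2000-node fleet, at most ~400 nodes (20%)
# can be down at once; a 10-15% batch leaves a safety margin.
```

Staying well under the theoretical maximum is what leaves room for organic traffic spikes mid-rollout.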
With the advent of containerization (Docker, LXC), these technologies employ some clever optimizations to speed up deployment in recreate mode: they identify the diff and recreate just the required layers rather than the whole container, while still preserving the Immutability benefits.
Ramp Up / AB Testing Deployment
An additional variant, sitting somewhere between a full rolling deployment and a single-node canary deployment, is the ramp-up deployment or A/B-testing-mode deployment. In this mode only a portion of the N-node fleet is upgraded to the newer software version while the older version stays in the mix.
In ramp-up mode, the service serves both variants of the software idempotently, letting developers measure the impact of the newer version with limited exposure to production traffic while scaling it up gradually. The diagram above demonstrates ramp-up mode used to test a database migration from MySQL to Cassandra: it allows proper tuning of Cassandra, and any additional scaling and problem-mitigation strategies to be put in place, while observing the impact of gradually increasing traffic to version 2 as new nodes are added to the version 2 cluster.
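One common way to implement the ramp-up (an assumption here, since load balancers vary) is weight-based routing: a configurable fraction of requests goes to the v2 cluster, and the weight is raised gradually as confidence grows.

```python
import random

# Sketch of weight-based ramp-up routing: a configurable fraction of
# requests is sent to the v2 (e.g. Cassandra-backed) cluster, the rest
# to v1. The rng parameter is injectable for testing.

def route(v2_weight, rng=random.random):
    """Pick a target version for one request given the v2 traffic share."""
    return "v2" if rng() < v2_weight else "v1"
```

At a 10% weight, roughly one in ten requests exercises the new version; bumping the weight is the "ramp".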
In A/B testing mode, a similar setup is in place, but the load balancer needs auxiliary features to route traffic stickily to the same cluster. Here, the software developers want to expose a limited set of the production user base to a new feature set or changed experience and measure the impact of the experiment. The experiment could end in total abandonment of the feature and a move back to the older software version. In other words, this is not necessarily a software upgrade but more of an experiment, and hence it has more elaborate requirements around traffic-flow controls and rollback.
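Sticky routing is commonly implemented by hashing a stable user identifier, so the same user lands in the same bucket on every request. A sketch of that idea, with hypothetical names:

```python
import hashlib

# Sketch of sticky A/B assignment: hashing a stable user id yields the
# same bucket on every request, so a user consistently sees one variant.

def assign_variant(user_id, experiment_share):
    """Deterministically map a user to 'B' (experiment) or 'A' (control)."""
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100          # stable bucket in [0, 100)
    return "B" if bucket < experiment_share * 100 else "A"
```

Because the assignment is deterministic, rolling the experiment back only requires dropping the 'B' route; no per-user state has to be cleaned up.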
Things to Note
Service response idempotency is a key requirement for all of the strategies discussed here. Any of them will fail if your service relies on the previous or subsequent responses in a series of requests from the same user.
Consider this scenario while the service is under rolling deployment:

1. Request 1 from a user lands on a version 1 node of the service, which responds with response 1.
2. Request 2 from the same user lands on a version 2 node, leading to response 2.
3. Request 3 from the same user lands again on a version 1 node.

This flow of requests from the same user landing on nodes serving different versions of the software should have no bearing on response 3. As a service owner, that is what you have to ensure.
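A toy illustration of the requirement (the handlers are hypothetical): because both versions are stateless and each request carries all the input it needs, any interleaving of versions across a user's request sequence produces the same answers.

```python
# Toy illustration of mixed-mode safety: each request carries all the
# input the handler needs, so it does not matter which version served
# the user's previous request.

def handle_v1(request):
    return {"total": request["price"] * request["qty"]}

def handle_v2(request):
    # v2 may be implemented differently internally, but for the same
    # input it must return a result v1's consumers can rely on.
    return {"total": sum([request["price"]] * request["qty"])}

request = {"price": 5, "qty": 3}
# Any interleaving of v1/v2 across a user's requests yields the same
# responses, because no handler depends on earlier responses.
```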
Backward Compatibility / Breaking Changes
In no version of the software should you introduce a breaking change. This is an extension of the service idempotency discussed above: users whose requests were written against the older version of the code should not be impacted when the new code responds. A common scenario with APIs is the response format changing to include new or additional data.
Ensure that you version your service or API endpoints using a version number in the URI if your response format has to change:

GET /getData
GET /v2/getData
If the new change merely introduces additional data fields in the response, ensure that you still return all the older data fields expected by the consumers of the API.
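An additive change can be sketched like this (the field names are hypothetical): v2 keeps every field v1 consumers parse and only adds new ones.

```python
# Sketch of an additive (non-breaking) response change: v2 returns a
# strict superset of the v1 fields. Field names are hypothetical.

def get_data_v1(record):
    return {"id": record["id"], "name": record["name"]}

def get_data_v2(record):
    response = get_data_v1(record)            # every v1 field preserved
    response["created_at"] = record["ts"]     # new, additional data only
    return response

record = {"id": 1, "name": "a", "ts": "2024-01-01"}
```

Old consumers that read only the v1 fields keep working unchanged against v2 responses.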
Start with a clean API request-response protocol, with well-defined required and optional fields, and try not to change any existing behavior in newer code. A breaking change should be modeled as a new version rather than an update to existing behavior.
That brings us to the close of this part. In the next part we will take a closer look at stateful service deployment strategies.