This is the third and final post in a three-part series on this topic. Please find part 1 here and part 2 here.
Services in a software architecture that provide durability and persistence can be termed stateful services. Databases, caches, messaging platforms and object storage are all examples of stateful services. Of these, I have personally managed the Kafka messaging platform at Flipkart at company scale, where it acted as the backbone of all the back-office operations in the order management and delivery leg. Beyond that, there are many instances where we manage cache and database clusters powering our stateless services (APIs in general). Stateful services need extra care while being upgraded since they store and manage state, and any loss of that state could lead to immediate and irreversible business impact: loss of user trust, brand erosion and, in general, loss of functionality. If the data is the business (as in the case of a storage platform and/or services), this could mean loss of the business itself.
We can imagine a stateful service consisting of two parts, viz.
The actual bits stored durably on the physical disk, which are the data of interest for the service.
The software that manages the data stored on physical disk by organizing it in a manner that enables specialized and/or fast access of the persisted data. Additionally, this software also helps with backup and replication of the data to enable recovery in case of loss of disk/hardware. We discuss this in a later section. MySQL, Cassandra, Redis, Kafka are all examples of such stateful software.
Keeping this inner split of a stateful service in mind is key and will help us in the later part of this article.
Need for Upgrade
Stateful software may not always be developed in-house. Most of the time we pick up pre-built software from the open source community, or purchase software, and deploy it. At other times, we pick up the software, modify it to serve our custom needs and then deploy it. Generally, stateful systems do not need to be upgraded very often, but there are certain scenarios where a need to upgrade may arise, some of which are as follows -
The latest version of the stateful software may feature increased performance, general and storage-specific optimizations, newer features and enhancements, bug fixes, etc., thus improving the overall experience.
We need to scale up/down the existing deployment as per the changing needs of the software users.
We need to modify the data storage and retrieval strategies to deal with increasing scale viz. moving away from custom sharded to out-of-box sharding or vice-versa, increasing/decreasing number of partitions and/or replicas to improve access parallelism and/or reliability.
We need to migrate to a different data storage software to deal with newer requirements, e.g. MySQL to Cassandra.
Blue Green Deployment
At its core, it still follows the Recreate strategy discussed in the previous post and tries to leverage the general benefits of immutability that come with it. The core idea is as depicted and described below -
The first diagram depicts a stateful service with associated disk in each node serving the V1 of the service in blue.
The second diagram shows the bringing up of a new mirror cluster of similar size with V2 version of the stateful software. This is shown in green. This cluster consists of the completely replicated disk of the same data as stored in the currently active blue cluster.
The third diagram depicts the simple process of going live with the green cluster (V2) by switching the load balancer to point to the new cluster. The older cluster is completely razed once the green cluster is deemed healthy and bug-free after the switch.
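The three steps above can be sketched in a few lines of Python. This is only a minimal illustration: the `Cluster` and `LoadBalancer` classes are hypothetical in-memory stand-ins, not the API of any real load balancer, and the health flag stands in for the whole verification effort discussed in the caveats.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    """Hypothetical stand-in for a stateful cluster (name, version, health)."""
    name: str
    version: str
    healthy: bool = True


@dataclass
class LoadBalancer:
    """Hypothetical load balancer that routes all traffic to one cluster."""
    active: Cluster
    history: list = field(default_factory=list)

    def switch_to(self, target: Cluster) -> None:
        # Never point traffic at a cluster that has not passed verification.
        if not target.healthy:
            raise RuntimeError(f"refusing to switch: {target.name} is unhealthy")
        self.history.append(self.active.name)
        self.active = target


def blue_green_deploy(lb: LoadBalancer, green: Cluster) -> Cluster:
    """Point traffic at green and return the old cluster, to be razed later."""
    blue = lb.active
    lb.switch_to(green)
    return blue


blue = Cluster("blue", "V1")
green = Cluster("green", "V2")   # assumed to already hold a full replica of blue's data
lb = LoadBalancer(active=blue)
retired = blue_green_deploy(lb, green)
print(lb.active.version, "is live;", retired.name, "can be razed once green is verified")
```

Returning the old cluster instead of deleting it inside the function mirrors the point above: blue is razed only after green has been observed to be healthy.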
Things to Note / Caveats
The deployment may require twice the capacity to conduct successfully, since we need to create a complete mirror cluster with the latest code before the actual switch. We discuss some possible options to overcome this in the next section.
There will be some non-zero downtime while conducting the switch from the blue cluster to the green cluster. We need to let the clients know of the service unavailability to avoid data loss. The goal of the deployment strategy should be to bring this downtime as close to zero as possible. But true zero downtime for a stateful service may be extremely difficult and expensive to achieve in practice.
Verification of the green cluster cannot itself be done with production data, since any bug could, in the worst case, mean data loss for the business. We need to be extremely cautious and thorough with the unit, functional, integration, regression, stress and load tests before conducting the actual switch. Testing with production traffic, as in canary deployments of stateless services, is rarely practical here and may have grave business impact.
The previous point creates the need for a very exhaustive test suite for a successful stateful blue green deployment. Special attention has to be given to the regression, integration and stress/load testing suites to ensure that the new cluster is at least as performant as the previous one, if not more. If scaling up was the reason for the switch, the load tests are even more critical to correctly benchmark the green cluster in terms of performance and scalability.
The stateful software's mirroring, replication and redundancy features have to be amply leveraged in the deployment strategy to create the mirrored disk copies of the green cluster from the blue cluster. The speed at which replication completes is critical to keeping downtime to a minimum. We need to stop replication at some point to start the traffic flow to the new cluster.
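One way to decide when replication can be stopped is to gate the switch on replication lag. The sketch below is an assumption-laden illustration: `read_lag_bytes` is a hypothetical callable that you would back with whatever lag metric your stateful software exposes, and the samples are simulated.

```python
from typing import Callable


def wait_for_catch_up(read_lag_bytes: Callable[[], int],
                      max_lag_bytes: int = 0,
                      max_polls: int = 10) -> bool:
    """Poll replication lag; return True once green is within max_lag_bytes of blue.

    A real implementation would sleep between polls; that is omitted here
    so the sketch runs instantly.
    """
    for _ in range(max_polls):
        if read_lag_bytes() <= max_lag_bytes:
            return True
    return False


# Simulated lag samples shrinking as green catches up with blue.
samples = iter([4096, 1024, 128, 0])
caught_up = wait_for_catch_up(lambda: next(samples))

# A replica that never catches up must not be switched to.
stuck = wait_for_catch_up(lambda: 500, max_polls=3)
print(caught_up, stuck)
```

Bounding the number of polls keeps a stuck replica from holding the maintenance window open indefinitely; the caller can then abort and keep serving from blue.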
Additional hybrid strategies to go with Blue Green Deployments
We discuss certain possible strategies to mitigate or reduce the impact of twice capacity and non-zero downtime in the following lines.
Double Capacity: Provisioning twice capacity to conduct blue green deployments may not be a feasible option every time. We can leverage the following ideas to help with these issues.
Use the distinction between the physical disk and the stateful software managing the data to your benefit. You can configure and deploy your stateful software using an external JBOD disk for the actual data storage while having the software (MySQL, Cassandra) deployed on VMs or containers.
In this mode, you need not create a replica green cluster; instead, just spin up VMs/containers with the new software and mount the external disk on the new VM/container, thus achieving a software upgrade. You may be able to conduct a Rolling Deployment with this configuration.
The other caveats around testing may still apply. This may also require additional work or may not be completely useful when the reason to upgrade is scaling up. In such cases a complete data copy to a new high-capacity disk may still be warranted.
Using cloud options here needs to be weighed against the performance impact. Ensure the proximity of the data and the VMs/containers via electrically connected components such as JBOD. Avoid achieving this over the network, as performance is likely to be orders of magnitude poorer. Be cognizant of disk performance whichever way you try this out.
Another approach here could be to shard the storage cluster and upgrade one shard at a time, so that the whole cluster is upgraded with limited spare resources.
This can increase the overall upgrade time for the entire cluster, ranging from a few hours to possibly more than a day when other issues are encountered.
This can also introduce multiple downtimes before the entire cluster is upgraded.
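The capacity saving of the shard-at-a-time approach can be made concrete with a small sketch. The `Shard` record and the node counts below are hypothetical, and the single assignment to `version` stands in for the full mirror, verify, switch and raze cycle of each shard.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Shard:
    """Hypothetical shard record: name, size in nodes, software version."""
    name: str
    nodes: int
    version: str = "V1"


def upgrade_one_shard_at_a_time(shards: List[Shard], new_version: str) -> int:
    """Upgrade shards sequentially; return the peak number of spare nodes needed."""
    peak_spare_nodes = 0
    for shard in shards:
        # A green mirror is needed only for the shard currently being upgraded.
        peak_spare_nodes = max(peak_spare_nodes, shard.nodes)
        shard.version = new_version  # stands in for: mirror, verify, switch, raze
    return peak_spare_nodes


shards = [Shard("s0", 3), Shard("s1", 3), Shard("s2", 4)]
total_nodes = sum(s.nodes for s in shards)
spare = upgrade_one_shard_at_a_time(shards, "V2")
print(f"spare nodes needed: {spare} instead of {total_nodes}")
```

The spare capacity drops from the size of the whole cluster to the size of the largest shard, at the price of one switch (and potentially one downtime window) per shard.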
Non-Zero Downtime: As mentioned previously, the non-zero downtime can be reduced to as low as possible.
The diagram above depicts a scenario where the version 1 blue cluster is put into read-only mode while we conduct the actual switch to the green cluster, which has both reads and writes enabled.
This can possibly lead to a mode where all user GET calls to read data are honored and served while the cluster is switched from blue to green. The write calls can be enabled as soon as the switch happens.
The write block is primarily to ensure a consistent state of the data, since allowing writes to the blue cluster while the clusters are being switched can lead to loss of the writes that happen in the window of the switch and were not replicated.
Another alternative is to let the writes flow through and replicate them from blue to green after the switch. However, this may compromise consistency and ordering: new writes arriving at green may depend on earlier writes that went to blue before the switch, and that earlier state may not yet be available on green due to replication delay. You can choose to build for such complexities or avoid them completely. The decision depends on the cost-benefit analysis for your stateful software.
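The read-only switch described above can be sketched with an in-memory model. Everything here is a hypothetical stand-in: `Store` is a plain dictionary wrapper, and `pending` represents the writes that blue has accepted but that have not yet been replicated to green.

```python
class Store:
    """Hypothetical in-memory stand-in for one cluster's data."""

    def __init__(self, name: str):
        self.name = name
        self.data = {}
        self.read_only = False

    def get(self, key: str):
        return self.data.get(key)

    def put(self, key: str, value: str) -> None:
        if self.read_only:
            raise PermissionError(f"{self.name} is read-only during the switch")
        self.data[key] = value


def cut_over(blue: Store, green: Store, pending: list) -> Store:
    """Freeze writes on blue, drain replication into green, return the new active store."""
    blue.read_only = True               # 1. block writes; reads are still served by blue
    while pending:                      # 2. apply not-yet-replicated writes in order
        key, value = pending.pop(0)
        green.data[key] = value
    return green                        # 3. caller repoints traffic; green accepts writes


blue, green = Store("blue"), Store("green")
blue.put("order-1", "placed")
green.data["order-1"] = "placed"        # already replicated before the switch
blue.put("order-2", "placed")
pending = [("order-2", "placed")]       # written to blue, not yet on green

active = cut_over(blue, green, pending)
active.put("order-3", "placed")         # writes resume on green
```

Because blue stops accepting writes before the replication queue is drained, no write can land on blue after green has been declared the source of truth, which is exactly the data-loss window the write block is meant to close.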
Replication, Redundancy and Recovery: It is very important to have redundant data replicas to enable recovery in case of data loss, whether it arises in general or due to software deployment and upgrade processes.
The extent of replication and redundancy depends on the criticality of the data. You need to categorize the data/state and achieve the right balance for each category, since replication comes at a cost. E.g. user data is of utmost importance and may need three-way hot replication, while catalog data that was sourced from elsewhere and can be recreated may need only a single hot copy and a single cold archived copy.
Good support for replication and recovery in the stateful software is necessary to ensure quick recovery in case of failed deployments, to roll back to the previous healthy state in case the upgrade causes data corruption and, in general, to minimize or completely avoid data loss.
Ensure you have a good number of snapshots of the data, again based on criticality, but also keep archiving older snapshots to reduce storage costs.
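A snapshot policy of this kind can be expressed as a small planning function. The retention windows below are made-up numbers for illustration only; pick them based on the criticality of each data category in your own system.

```python
from datetime import date, timedelta


def plan_snapshots(snapshot_dates, today, hot_days):
    """Split snapshots into those kept hot and those to move to cheaper archive storage."""
    cutoff = today - timedelta(days=hot_days)
    hot = sorted(d for d in snapshot_dates if d >= cutoff)
    archive = sorted(d for d in snapshot_dates if d < cutoff)
    return hot, archive


# Hypothetical retention windows per data category (days kept hot).
HOT_DAYS = {"user": 30, "catalog": 7}

today = date(2024, 1, 31)
snapshots = [today - timedelta(days=n) for n in (1, 5, 10, 40)]

catalog_hot, catalog_archive = plan_snapshots(snapshots, today, HOT_DAYS["catalog"])
user_hot, user_archive = plan_snapshots(snapshots, today, HOT_DAYS["user"])
print(len(catalog_hot), "catalog snapshots stay hot;", len(user_hot), "user snapshots stay hot")
```

The function only plans the split; actually moving the archived snapshots to cold storage depends on the tooling of the stateful software in use.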
Over the course of the last three posts, including this one, we touched upon multiple facets of software deployment strategies for both stateless and stateful applications and services. With the advent of Functions as a Service offerings such as AWS Lambda / Azure Functions, and serverless architectures in general, we may need to model our software to be more idempotent, and with cloud storage solutions such as S3 / DynamoDB / Spanner / Kinesis, etc. there will be fewer worries around deployment strategies for both stateless and stateful services. But these ideas remain relevant and important, both as background knowledge for designing your software services and for when leveraging the cloud is cost prohibitive at the scale at which your business operates and you need to solve these problems yourselves.