18 March 2024

Failover

Failover is the mechanism used to route user operations to alternate spaces in case the target space (of the original operation) fails.

The component responsible for failover in GigaSpaces is the clustered proxy. When an operation on a space fails, the clustered proxy tries to locate an available and active space member. If it finds such a space, it re-invokes the operation on that space member, otherwise it throws the original exception.

Failover and Recovery Process

Failover and Recovery process includes the following steps:

Backup identifies that the primary space is not available and moving itself to a primary mode. For information on speeding up this step refer to Failure Detection.
Space Where GigaSpaces data is stored. It is the logical cache that holds data objects in memory and might also hold them in layered in tiering. Data is hosted from multiple SoRs, consolidated as a unified data model. proxy router discovers the new primary space and routs the request to it. For information on speeding up this step refer to Proxy Connectivity.
GSM Grid Service Manager. This is is a service grid component that manages a set of Grid Service Containers (GSCs). A GSM has an API for deploying/undeploying Processing Units. When a GSM is instructed to deploy a Processing Unit, it finds an appropriate, available GSC and tells that GSC to run an instance of that Processing Unit. It then continuously monitors that Processing Unit instance to verify that it is alive, and that the SLA is not breached. identifies the backup space is missing and provision new space into existing GSC Grid Service Container. This provides an isolated runtime for one (or more) processing unit (PU) instance and exposes its state to the GSM. - preferably empty GSC or based on predefined SLA. You can tune this step by modifying the GSM settings.
Backup started. It is looking for the look service and register itself.
Backup identifies that a primary space exists (goes through primary election process) and moves itself into backup mode. If there are multiple GSMs and bad network or wrong locators configuration – you might end up here with split-brain scenario where you have multiple primaries running.
Backup read existing space objects from the primary space (aka memory recovery). This might take some time with few million space objects. GigaSpaces using multiple threads with this activity. You can tune this by modifying the recovery batch size and also the amount of recovery threads.
Primary clears its redo log (During the above step) and starts to accumulate incoming destructive events (write , update , take) within its redo log. This is one of the reasons why the redo log size can have limited size in many cases. You can use the redo-log-capacity to configure the redo-log size. For example: If the recovery takes 30 sec and the client performs 1000 destructive operations/sec – your redo log size can be 30,000.
Backup completes reading space objects from primary.
Primary replicates redo log content to the backup (via the async replication channel) – In this point backup getting also sync replication events from primary. You can control the redo log replication speed using the async replication batch size.
Once the redo replication completed – Recovery done. Backup is ready to act as a primary.

To speed up the recovery time you can override the following properties:

cluster-config.groups.group.repl-policy.recovery-chunk-size - Objects per chunk (default is 200).
cluster-config.groups.group.repl-policy.recovery-thread-pool-size - Number of threads the backup is running to consume data from the primary (default is 4).

Failover with Blocking Operations and Transactions

Even if a client invokes a blocking operation, like take with timeout and the server fails after receiving the take arguments but before returning a result, the client gets an exception, and the failover process aborts. This is because in a clustered engine, unlike a non-clustered one, the server thread servicing the request waits for a final answer to be given (unless the request is timed out). If an operation is performed under a transaction and the target space that serviced the transaction has failed, the clustered proxy automatically aborts the transaction and throws a transaction exception back to the caller. (There is no point in re-invoking the operation to a different space, because the failed space member is a transaction participant. The transaction will ultimately be aborted by the transaction manager.) The caller catching the exception can start a new transaction and continue execution; later calls on the proxy are directed to an available space in the group.

Backup Space Reconciliation After Network Disconnection

Network disconnections and disruptions impact the availability of the distributed cluster. In such cases, the failure-detection mechanisms will attempt to heal the cluster by creating new instances to preserve availability. The failure-detection determines when an instance is unreachable. There is no real distinction between an instance which is abruptly terminated and an instance which is unreachable due to network problems. In the latter case the healing process may lead to extra Space instances when network connection is restored. When the extra Space instance is a primary, the Split-Brain mechanism kicks in. When the extra Space instance is a backup, reconciliation is done by determining if this backup Space is connected to a discovered primary Space. The discovered primary Space is either already connected to a backup Space or not. If the backup Space is not a "replication target" of the discovered primary Space, it will be shutdown.

When both Split-Brain and an extra Backup Space occur, both mechanisms work side-by-side. Extra backup Space instances will be shutdown and then a primary Space instance will be chosen out of the existing primaries. A backup Space will be provisioned if necessary.