29 May 2025

Consistency Biased

In distributed computing, consistency-biased leader election is established using an odd number of coordinators (usually 3). Having an odd number ensures that if network segmentation occurs, there will be no draw between the segments - the majority of coordinators (e.g. 2) will be in one segment, and the minority (e.g. 1) in the other. This allows each segment to play its role and ensure consistency:

If the minority segment holds leadership, it will relinquish it, and will suspend itself until it rejoins the majority.
The majority segment will select a new leader if needed, knowing that the minority will relinquish the previous leader.

In GigaSpaces, consistency-biased leader election is used when the space is deployed on an environment managed by a GigaSpaces Manager cluster. Each manager contains an embedded Apache Zookeeper Apache Zookeeper. An open-source server for highly reliable distributed coordination of cloud applications. It provides a centralized service for providing configuration information, naming, synchronization and group services over large clusters in distributed systems. The goal is to make these systems easier to manage with improved, more reliable propagation of changes. node (znode), and together they provide the necessary environment to ensure consistency.

What is Apache Zookeeper?

Apache ZooKeeper is a centralized service providing distributed synchronization, which can be used for various use cases in distributed systems such as leader election, configuration, distributed locks and more. It is highly reliable and widely used across the industry, both in open source projects such as Apache HBase and Apache Kafka Apache Kafka is a distributed event store and stream-processing platform. Apache Kafka is a distributed publish-subscribe messaging system. A message is any kind of information that is sent from a producer (application that sends the messages) to a consumer (application that receives the messages). Producers write their messages or data to Kafka topics. These topics are divided into partitions that function like logs. Each message is written to a partition and has a unique offset, or identifier. Consumers can specify a particular offset point where they can begin to read messages.), and by large companies such as Yahoo and Rackspace.

You don't have to download or setup Apache Zookeeper. It comes packaged with GigaSpaces, and is automatically started and monitored by the GigaSpaces Manager.

Usage

When a space is deployed on an environment managed by a Manager, consistency-biased leader election is automatically enabled.

Configuration

The default configuration is valid for most environments and applications. You can change it if you need to decrease/increase failover time (the time it takes from when a primary fails to when a backup accepts leadership in its place), using the following Space Where GigaSpaces data is stored. It is the logical cache that holds data objects in memory and might also hold them in layered in tiering. Data is hosted from multiple SoRs, consolidated as a unified data model. properties (can also be provided as a JVM Java Virtual Machine. A virtual machine that enables a computer to run Java programs as well as programs written in other languages that are also compiled to Java bytecode. System property):

Property	Description	Default
`space-config.leader-election.zookeeper.connection-timeout`	Connection timeout (in milliseconds)	5000
`space-config.leader-election.zookeeper.session-timeout`	Session timeout (in milliseconds)	15000
`space-config.leader-election.zookeeper.retry-timeout`	Retry policy maximum elapse timeout	Integer.MAX_VALUE
`space-config.leader-election.zookeeper.retry-interval`	Interval between retries (in milliseconds)	100

ZooKeeper connections have sessions that are maintained on each heartbeat. The connection timeout applies to an API call, while the session timeout applies to network partition incidents.

A new election takes place only in the presence of a ZooKeeper quorum. A backup Space in the quorum is elected primary when the primary Space session expires.

When a session expires, the primary Space suspends its activity until a quorum is reestablished. After the network partition is resolved, the primary Space resolves its state, terminating if a primary Space has already been elected.

A primary Space may resume activity only if its session has not yet expired. Otherwise it terminates and becomes re-instantiated as a backup Space by the managing GSM Grid Service Manager. This is is a service grid component that manages a set of Grid Service Containers (GSCs). A GSM has an API for deploying/undeploying Processing Units. When a GSM is instructed to deploy a Processing Unit, it finds an appropriate, available GSC and tells that GSC to run an instance of that Processing Unit. It then continuously monitors that Processing Unit instance to verify that it is alive, and that the SLA is not breached..

The failover time (of a backup Space until it is elected as primary) is a function of the session timeout plus the time it takes for the state to change. On a LAN network, this has been measured on average to be 35 seconds with the above default settings. This is twice as fast as Lookup-Service-based election.

A shorter failover time is not always advantageous. It may cause short network disconnections to trigger unnecessary failovers, which can affect system stability. Change the defaults only after careful consideration, and adjust the values to suit your network capabilities and applicative requirements.

Implementation

GigaSpaces uses the Apache Curator leader selector recipe, which implements a distributed lock with a notification mechanism using Apache Zookeeper.

The following occurs during leader election:

There is a znode, say “/participants/partitionX".
All participants of the election process create an ephemeral-sequential node on the same election path.
The node with the smallest sequence number is the leader.
Each “follower” node listens to the node with the next lower sequence number.
Upon leader removal, go to election path and find a new leader, or become the leader if it has the lowest sequence number.
Upon session expiration (disconnection), check the election state and go to election if needed. If there is a disconnection, the primary Space instance is moved to Quiese mode and will be restarted.
To protect partition data while an instance is restarted, that partition is marked as unhealthy in ZooKeeper. If after starting, a primary can’t be found, the recovery process will wait for the primary up to 5 minutes. If a primary is not found during that time, the restarted instance will become a primary, but there will be a risk of data loss.
The system parameter to configure this timeout is: com.gs.waiting-timeout-for-leader-if-unhealthy - the default is 5*60*1000.

Partition Split Brain Instances

The Apache Zookeeper leader selector prevents split-brain instances through quorum. If the primary Space is not in the majority, that Space is frozen (or quiesced) until the network is connected and the frozen primary Space is terminated automatically.