Service Level Agreement (SLA)

The GigaSpaces runtime environment (also referred to as the Service Grid) provides SLA-driven capabilities via the Grid Service Manager (GSM) and the Grid Service Container (GSC) runtime components. The GSC is responsible for running one or more Processing Units. The GSM is responsible for analyzing the deployment and provisioning the Processing Unit instances to the available GSCs.

The SLA definitions are only enforced when deploying the Processing Unit to the Service Grid, because this environment actively manages and controls the deployment using the GSM(s). When running within your IDE or in standalone mode, these definitions are ignored.

The SLA definitions can be provided as part of the Processing Unit package, or during the Processing Unit's deployment process. These definitions determine the number of Processing Unit instances that should run, and deploy-time requirements such as the amount of free memory or CPU, or the clustering topology for Processing Units that contain a Space. The GSM reads the SLA definitions, and deploys the Processing Unit to the available GSCs accordingly.

Defining the SLA for Your Processing Unit

The SLA contains the deployment criteria in terms of clustering topology (if the Processing Unit contains a Space) and deployment-time requirements. As noted above, it can be provided as part of the Processing Unit package or during the deployment process.

The SLA definition, whether it comes in a separate file or embedded inside the pu.xml file, is composed of a single <os-sla:sla> XML element. A sample SLA definition is shown below (you can use plain Spring definitions, or GigaSpaces-specific namespace bindings):

<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:os-sla="http://www.openspaces.org/schema/sla"
       xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
       http://www.openspaces.org/schema/sla http://www.openspaces.org/schema/17.0.0/sla/openspaces-sla.xsd">

    <os-sla:sla cluster-schema="partitioned" number-of-instances="2" number-of-backups="1" max-instances-per-vm="1" />
</beans>

The number of backups per partition is zero or one.

<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd">

    <bean id="SLA" class="org.openspaces.pu.sla.SLA">
        <property name="clusterSchema" value="partitioned" />
        <property name="numberOfInstances" value="2" />
        <property name="numberOfBackups" value="1" />
        <property name="maxInstancesPerVM" value="1" />
    </bean>
</beans>

The SLA definition shown above creates four instances of a Processing Unit using the partitioned space topology. It defines two partitions (number-of-instances="2"), each with one backup (number-of-backups="1"). In addition, it requires that a primary and a backup instance of the same partition not be provisioned to the same GSC (max-instances-per-vm="1").
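The instance count in this example follows from simple arithmetic: each partition runs one primary plus its backups. The following is a minimal sketch of that calculation. The class and method names are hypothetical and are not part of the GigaSpaces API.

```java
// Illustrative arithmetic only; SlaInstanceCount is a hypothetical helper,
// not a GigaSpaces class.
final class SlaInstanceCount {

    /** Each partition runs one primary plus numberOfBackups backups. */
    static int totalInstances(int numberOfInstances, int numberOfBackups) {
        if (numberOfInstances < 1) {
            throw new IllegalArgumentException("numberOfInstances must be at least 1");
        }
        // The documentation states that a partition has zero or one backup.
        if (numberOfBackups < 0 || numberOfBackups > 1) {
            throw new IllegalArgumentException("numberOfBackups must be 0 or 1");
        }
        return numberOfInstances * (1 + numberOfBackups);
    }

    public static void main(String[] args) {
        // number-of-instances="2", number-of-backups="1"
        System.out.println(totalInstances(2, 1)); // prints 4
    }
}
```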

It is up to the developer to configure the SLA correctly. Trying to deploy a Processing Unit with a cluster schema that requires backups without specifying numberOfBackups will cause the deployment to fail.

In older releases, the SLA definition also included dynamic runtime policies, such as creating additional Processing Unit instances based on CPU load, relocating a certain instance when the memory becomes saturated, etc. These capabilities are still supported, but have been deprecated in favor of the Administration and Monitoring API which supports the above and much more.

Defining the Space Cluster Topology

When a Processing Unit contains a Space, the SLA definition specifies the clustering topology for that Space. The cluster-schema attribute of the os-sla:sla XML element defines this, as shown above.

The Space's clustering topology is defined within the SLA definition, rather than in the Space definition within the pu.xml file, for the following reasons:

  • The clustering topology directly affects the number of instances of the Processing Unit, and is therefore considered part of its SLA. For example, a partitioned Space can have two primaries and two backups, and a replicated Space can have five instances. This means that the Processing Unit containing them will have the same number of instances.

  • Separating the clustering topology from the actual Space definition makes it possible to truly implement the "write once, scale anywhere" principle. You can run the same Processing Unit within your IDE for unit tests using the default single-Space-instance clustering topology, and then deploy it to the runtime environment and run the same Processing Unit with the real clustering topology (for example, with 4 partitions).

You can choose from several clustering topologies, including the partitioned, sync-replicated, and async-replicated cluster schemas.

From the client application's perspective (the one that connects to the Space from another process), the clustering topology is transparent in most cases.

The number-of-backups parameter should be used with the partitioned cluster schema. It is not supported with the sync-replicated or async-replicated cluster schema.

The number of backups per partition is zero or one.

SLA-Based Distribution and Provisioning

When deployed to the service grid, the Processing Unit instances are distributed based on the SLA definitions. These definitions form a set of constraints and requirements that should be met when a Processing Unit instance is provisioned to a specific container (GSC). The SLA definitions are considered during the initial deployment, relocation, and re-provisioning of an instance after failure.

Default SLA Definitions

If no SLA definition is provided either within the Processing Unit XML configuration or during deploy time, the following default SLA is used:

<os-sla:sla number-of-instances="1" />

<bean id="SLA" class="org.openspaces.pu.sla.SLA">
    <property name="numberOfInstances" value="1" />
</bean>

Maximum Instances per VM/Machine

The SLA definition allows you to define the maximum number of instances for a certain Processing Unit, either per JVM (GSC) or physical machine (regardless of the number of JVMs/GSCs that are running on it).

The max-instances parameter has different semantics when applied to a Processing Unit that contains a Space with primary-backup semantics (one that uses the partitioned cluster schema and defines at least one backup) than when applied to a Processing Unit that doesn't contain an embedded Space, or that contains a Space with no primary-backup semantics.

When applied to a Processing Unit that doesn't contain an embedded Space, or that contains a Space with no primary-backup semantics, the max-instances parameter defines the total number of instances that can be deployed on a single JVM or on a single machine.

When applied to a Processing Unit that contains a Space with primary-backup semantics, the max-instances parameter defines the total number of instances (which belong to the same primary-backup group or partition) that can be provisioned to a single JVM or a single machine.

The most common use of the max-instances parameter is for a Processing Unit that contains a Space with primary-backup semantics. Setting the value to 1 ensures that a primary and its backup(s) cannot be provisioned to the same JVM (GSC)/physical machine.

The max-instances-per-vm parameter defines the maximum number of instances per partition for a Processing Unit with an embedded Space. A partition may have a primary or backup instance. Setting max-instances-per-vm=1 means that a primary and a backup of the same partition are never provisioned to the same GSC. (Primary or backup instances of multiple different partitions can still be provisioned to the same GSC.) You cannot use this parameter to limit the number of instances from different partitions that a GSC may host.
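The per-partition nature of this constraint can be sketched as a simple check over the instances hosted by one GSC. This is an illustrative model only: the class, method, and the idea of representing a GSC as a list of partition IDs are assumptions made for the example, not the GigaSpaces provisioning code.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model for illustration; not the GigaSpaces provisioning logic.
final class MaxInstancesPerVmCheck {

    /**
     * @param partitionIdsOnGsc the partition ID of every instance hosted by one GSC
     * @param maxInstancesPerVm the value of max-instances-per-vm
     * @return true if no single partition has more than maxInstancesPerVm instances on this GSC
     */
    static boolean satisfies(List<Integer> partitionIdsOnGsc, int maxInstancesPerVm) {
        Map<Integer, Integer> countPerPartition = new HashMap<>();
        for (int partitionId : partitionIdsOnGsc) {
            // Count instances per partition; instances of other partitions do not interfere.
            int count = countPerPartition.merge(partitionId, 1, Integer::sum);
            if (count > maxInstancesPerVm) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Instances of three different partitions on one GSC: allowed.
        System.out.println(satisfies(List.of(1, 2, 3), 1)); // prints true
        // Primary and backup of partition 1 on the same GSC: rejected.
        System.out.println(satisfies(List.of(1, 1), 1));    // prints false
    }
}
```

Note that the check never looks at the total number of instances on the GSC, which mirrors the point that this parameter does not limit instances from different partitions.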

If you have enough GSCs (twice the number of partitions, when each partition has one backup), each GSC hosts a single instance. If you don't have enough GSCs, the primary and backup instances of the different partitions are distributed across all the existing GSCs, and the system tries to distribute the primary instances evenly across all the GSCs on all the machines. If you increase the number of GSCs after the initial deployment, you must "rebalance" the system, meaning redistribute the primaries across all the GSCs.

You can perform this activity via the API. Rebalancing the instances increases the capacity of the data grid (because it results in more GSCs hosting the data grid). The rebalance concept is based on the assumption that there are initially more partitions than GSCs. The ratio of partitions to GSCs is called the scaling factor; it indicates how far your data grid can expand without a shutdown and without increasing the number of partitions.

Example: Start with 4 GSCs and 20 partitions, each with one backup (40 instances). Each GSC initially hosts 10 instances. After one or more rebalancing operations, you can end up with 40 GSCs, each hosting a single instance. In this example, the capacity of the data grid grows tenfold without downtime, while the number of partitions remains the same. See the capacity planning section for more details.
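The numbers in this example can be reproduced with a few lines of arithmetic. The sketch below assumes an even distribution of instances across GSCs; the class and method names are hypothetical and exist only for this illustration.

```java
// Illustrative arithmetic for the example above; not a GigaSpaces API.
final class RebalanceExample {

    /** One primary plus the given number of backups per partition. */
    static int totalInstances(int partitions, int backupsPerPartition) {
        return partitions * (1 + backupsPerPartition);
    }

    /** Instances hosted by the most loaded GSC, assuming an even spread (ceiling division). */
    static int instancesPerGsc(int totalInstances, int gscs) {
        return (totalInstances + gscs - 1) / gscs;
    }

    public static void main(String[] args) {
        int total = totalInstances(20, 1);
        System.out.println(total);                      // prints 40
        System.out.println(instancesPerGsc(total, 4));  // prints 10 (initial deployment)
        System.out.println(instancesPerGsc(total, 40)); // prints 1 (fully expanded)
        // Growth in the number of GSCs hosting the grid: 40 / 4
        System.out.println(total / 4);                  // prints 10
    }
}
```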

The following is an example of setting the max-instances-per-vm parameter:

<os-sla:sla max-instances-per-vm="1" />

<bean id="SLA" class="org.openspaces.pu.sla.SLA">
    <property name="maxInstancesPerVM" value="1" />
</bean>

The following is an example of setting the maximum instances per machine parameter:

<os-sla:sla max-instances-per-machine="1" />

<bean id="SLA" class="org.openspaces.pu.sla.SLA">
    <property name="maxInstancesPerMachine" value="1" />
</bean>

Total Maximum Instances per VM

This refers to the total number of Processing Unit instances that can be instantiated within a GSC. It should not be confused with the max-instances-per-vm parameter, which controls how many instances of a single partition may run within a GSC and does not limit the total number of instances from different Processing Units, or other partitions, that can be provisioned to a GSC. To control the total maximum instances per VM, use the com.gigaspaces.grid.gsc.serviceLimit system property and set its value before running the GSC:

set GS_GSC_OPTIONS=-Dcom.gigaspaces.grid.gsc.serviceLimit=2

The default value of com.gigaspaces.grid.gsc.serviceLimit is 1.

Monitoring the Liveness of Processing Unit Instances

The GSM monitors the liveness of all the Processing Unit instances it provisioned to the GSCs. The GSM pings each instance in the cluster to see whether it is available.

You can control how often a Processing Unit instance is monitored by the GSM, and in case of failure, how many times the GSM will try again to ping the instance and how long it will wait between retry attempts.

This is done using the <os-sla:member-alive-indicator> element, which contains the following attributes:

Attribute | Description | Default
invocation-delay | How often (in milliseconds) an instance is monitored and verified to be alive by the GSM | 5000 (5 seconds)
retry-count | After an instance has been determined to not be alive, how many times to check it before giving up on it | 3
retry-timeout | After an instance has been determined to not be alive, the timeout interval between retries (in milliseconds) | 500 (0.5 seconds)

When a Processing Unit instance is determined to be unavailable, the GSM that manages its deployment tries to re-provision the instance on another GSC (according to the SLA definitions).

<os-sla:sla>
    <os-sla:member-alive-indicator invocation-delay="5000" retry-count="3" retry-timeout="500" />
</os-sla:sla>

<bean id="SLA" class="org.openspaces.pu.sla.SLA">
    <property name="memberAliveIndicator">
        <bean class="org.openspaces.pu.sla.MemberAliveIndicator">
            <property name="invocationDelay" value="5000" />
            <property name="retryCount" value="3" />
            <property name="retryTimeout" value="500" />
        </bean>
    </property>
</bean>
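To make the interplay of the three attributes concrete, the following sketch models one monitoring cycle as described in the attribute table: an initial check, followed by up to retry-count further checks spaced retry-timeout apart. It is a hypothetical illustration, not the GigaSpaces fault-detection implementation. The ping is a stand-in supplied by the caller, and the detection-time estimate assumes that each ping returns immediately.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch of the retry behavior; not the GigaSpaces implementation.
final class MemberAliveSketch {

    /**
     * Runs one monitoring cycle: a first check, then up to retryCount further
     * checks separated by retryTimeoutMillis, stopping at the first success.
     */
    static boolean isAlive(BooleanSupplier ping, int retryCount, long retryTimeoutMillis) {
        if (ping.getAsBoolean()) {
            return true;
        }
        for (int attempt = 1; attempt <= retryCount; attempt++) {
            try {
                Thread.sleep(retryTimeoutMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and give up
                return false;
            }
            if (ping.getAsBoolean()) {
                return true;
            }
        }
        return false;
    }

    /** Rough upper bound on detection time, assuming each ping returns immediately. */
    static long worstCaseDetectionMillis(long invocationDelay, int retryCount, long retryTimeout) {
        return invocationDelay + retryCount * retryTimeout;
    }

    public static void main(String[] args) {
        // Defaults from the table: 5000 ms delay, 3 retries, 500 ms between retries.
        System.out.println(worstCaseDetectionMillis(5000, 3, 500)); // prints 6500
    }
}
```

Under these assumptions, the default settings would report a failed instance roughly 6.5 seconds after it stops responding, at the latest. Real ping latency and network timeouts would add to that figure.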

Troubleshooting the Liveness Detection Mechanism

For troubleshooting purposes, you can lower the logging threshold of the relevant log category by modifying the log configuration file located under $GS_HOME/config/log/xap_logging.properties on the GSM. The default definition is as follows:

org.openspaces.pu.container.servicegrid.PUFaultDetectionHandler.level = INFO

You can change it to one of the following thresholds for more information:

Level | Description
CONFIG | Logs the configurations applied
FINE | Logs when a member is determined to be not alive
FINER | Logs when a member is indicated to be not alive (on each retry)
FINEST | Logs every fault detection attempt

Level FINE is generally sufficient for service-failure troubleshooting.
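The level names above match those of java.util.logging, where a threshold admits messages at that level and at every more severe level. The sketch below uses only the standard java.util.logging API to show which of the listed levels pass a FINE threshold. Treating the product's logging as java.util.logging is an assumption based on the level names, and setting the level programmatically is shown only for illustration; the documented approach is to edit the properties file.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Demonstrates java.util.logging thresholds; the helper class is hypothetical.
final class FaultDetectionLogLevel {

    static final String CATEGORY =
            "org.openspaces.pu.container.servicegrid.PUFaultDetectionHandler";

    /** Returns the logger for the fault-detection category with the given threshold. */
    static Logger withLevel(Level level) {
        Logger logger = Logger.getLogger(CATEGORY);
        logger.setLevel(level);
        return logger;
    }

    public static void main(String[] args) {
        Logger logger = withLevel(Level.FINE);
        Level[] levels = {Level.CONFIG, Level.FINE, Level.FINER, Level.FINEST};
        for (Level level : levels) {
            // With a FINE threshold, CONFIG and FINE pass; FINER and FINEST are filtered out.
            System.out.println(level + " loggable: " + logger.isLoggable(level));
        }
    }
}
```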