18 March 2024

Large Scale Deployment

Considerations

When designing a large cluster, several things need to be taken into account to ensure that the cluster can handle heavy loads and perform quickly and stably.

When we speak of a large cluster, we mean one numbering more than a few hundred. If this is the size of the cluster you intend to build, the following considerations are relevant to you.

Unregistering Spaces "Disappear" from LUS

This occurs when the process consumes a large amount of memory, causing extensive JVM GC spikes. The spikes result in high CPU usage and disrupt the LeaseRenewManager: a long GC pause or CPU stall causes the LeaseRenewManager to miss the default 4-second window for renewing the lease, which fires a space service un-registration event. When the LUS (Lookup Service) fires an event to unregister a space, the UI spaces tree node represents it with a specific icon, and specific logging is printed in the UI.

To avoid the unregistering of spaces, add resources (memory, CPU) or spaces, or tune the LeaseRenewal maxLeaseDuration and roundTripTime. These two values can be configured using the system properties:

// Default value for roundTripTime: 4 seconds (4000 ms)
-Dcom.gs.jini.config.roundTripTime=4000

// Default value for maxLeaseDuration: 8 seconds (8000 ms)
-Dcom.gs.jini.config.maxLeaseDuration=8000

It is recommended to increase these values to 40000 and 80000 respectively when a large cluster is used.
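As a sketch of that recommendation, the same two system properties can be passed with the larger values to the JVM that hosts the space. How the options reach the JVM (startup script, environment variable, or direct command line) depends on your deployment and is not shown here.

```
// roundTripTime raised from the 4-second default to 40 seconds
-Dcom.gs.jini.config.roundTripTime=40000

// maxLeaseDuration raised from the 8-second default to 80 seconds
-Dcom.gs.jini.config.maxLeaseDuration=80000
```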

Note that increasing these values delays the space's recognition of failover, since the active election infrastructure is based on space un-registration.

Minimize RMI Registry Overhead

Every space container starts an embedded RMIRegistry service, which creates a set of threads that consume resources.

If the RMIRegistry service is not used, or if a full replication cluster or a large cluster is used, it is recommended to disable the RMIRegistry service in the space container and in the GSC/GSM (Grid Service Container/Grid Service Manager).

Lookup Service

Do not start more than 2 Lookup Services per cluster. Preferably start these on your strongest machines.

Many Clients Accessing Space

When hundreds of clients need to find a space and perform operations on it, a few considerations need to be taken into account.

Cluster Availability Monitoring

When many clients monitor the availability of a cluster, it is recommended to increase the value of the Monitor thread to its maximum. Usually, when there is no failover or there are no backup-only spaces, the Monitor thread can safely be set to its maximum value, since clients interact directly with the space members. If a space member is detected as unavailable, the Detector thread is responsible for detecting when it becomes available again.

For more details, refer to Viewing Clustered Space Status.