Large Scale Deployment
Considerations
When designing a large cluster, several factors must be taken into account to ensure that the cluster can handle heavy loads and perform quickly and stably.
A large cluster here means one whose size exceeds a few hundred. If you intend to build a cluster of this size, the following considerations apply.
Unregistering Spaces "Disappear" from LUS
This occurs when the process consumes a large amount of memory, causing extensive JVM (Java Virtual Machine) GC spikes. The resulting high CPU usage disrupts the LeaseRenewManager: a long GC pause or CPU stall causes the LeaseRenewManager to miss the default 4-second window for renewing the lease, which fires a space service un-registering event. If the LUS (Lookup Service) fires an event to unregister a space, the UI spaces tree node represents it using a specific icon. Additionally, specific logging is printed out in the UI.
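To judge whether GC activity is large enough to threaten lease renewal, the time the JVM reports spending in garbage collection can be compared with the 4-second default mentioned above. The following is a minimal sketch that uses only the standard java.lang.management API. The class name GcPauseProbe and the allocation workload are hypothetical and not part of the product. Cumulative GC time is only a rough indicator, since it is a single long pause that makes a renewal late.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper, not part of the product: reports JVM-wide GC time.
final class GcPauseProbe {

    /** Total time in milliseconds the JVM reports having spent in GC so far. */
    static long totalGcMillis() {
        long total = 0;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long millis = gc.getCollectionTime(); // -1 when the collector does not report it
            if (millis > 0) {
                total += millis;
            }
        }
        return total;
    }

    /** GC time accumulated while the given task ran. */
    static long gcMillisDuring(Runnable task) {
        long before = totalGcMillis();
        task.run();
        return totalGcMillis() - before;
    }

    public static void main(String[] args) {
        long spent = gcMillisDuring(() -> {
            // Hypothetical workload: allocate short-lived arrays to provoke collections.
            for (int i = 0; i < 10_000; i++) {
                byte[] garbage = new byte[64 * 1024];
                garbage[0] = 1;
            }
        });
        System.out.println("GC time during task: " + spent + " ms (default renewal window: 4000 ms)");
    }
}
```

If the reported GC time regularly approaches the renewal window, adding memory or spaces is likely the more direct fix than tuning the lease settings.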
To avoid the unregistering of spaces, add resources (memory, CPU) or spaces, or tune the LeaseRenewal maxLeaseDuration and roundTripTime. These two values can be configured using the system properties:
//Default value for roundTripTime 4 seconds
-Dcom.gs.jini.config.roundTripTime=4000
//Default value for maxLeaseDuration 8 seconds.
-Dcom.gs.jini.config.maxLeaseDuration=8000
For a large cluster, it is recommended to increase these values to 40000 and 80000, respectively.
Increasing these values delays the space's recognition of failover, since the active election infrastructure is based on space un-registration.
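As a quick check that a JVM was actually started with the intended lease settings, the two system properties can be read back with the standard Long.getLong call. This is a minimal sketch. The class LeaseSettings and its methods are hypothetical helpers, and only the property names and the default and recommended values come from the text above.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of the product.
final class LeaseSettings {
    static final String ROUND_TRIP_TIME = "com.gs.jini.config.roundTripTime";
    static final String MAX_LEASE_DURATION = "com.gs.jini.config.maxLeaseDuration";

    /** Value of the system property in milliseconds, or the default if it is unset or not a number. */
    static long effectiveMillis(String property, long defaultMillis) {
        return Long.getLong(property, defaultMillis);
    }

    /** Builds the -D arguments to pass on the JVM command line. */
    static List<String> jvmArgs(long roundTripMillis, long maxLeaseMillis) {
        return Arrays.asList(
                "-D" + ROUND_TRIP_TIME + "=" + roundTripMillis,
                "-D" + MAX_LEASE_DURATION + "=" + maxLeaseMillis);
    }

    public static void main(String[] args) {
        // Defaults taken from the text: 4 seconds and 8 seconds.
        System.out.println("roundTripTime    = " + effectiveMillis(ROUND_TRIP_TIME, 4000L) + " ms");
        System.out.println("maxLeaseDuration = " + effectiveMillis(MAX_LEASE_DURATION, 8000L) + " ms");
        // Values recommended above for a large cluster.
        System.out.println("Large-cluster flags: " + String.join(" ", jvmArgs(40000L, 80000L)));
    }
}
```

Running the class inside a process that was started without the flags prints the defaults. Running it with the large-cluster flags prints 40000 and 80000.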
Minimize RMI Registry Overhead
Every space container starts an embedded RMIRegistry service, which creates a set of threads that consume resources.
If the RMIRegistry service is not used, or if a full replication cluster or a large cluster is used, it is recommended to disable the RMIRegistry service in the space container and in the GSC (Grid Service Container) and GSM (Grid Service Manager).
Lookup Service
Do not start more than 2 Lookup Services per cluster. Preferably start these on your strongest machines.
Many Clients Accessing Space
When running hundreds of clients that need to find a space and perform operations, a few considerations apply.
Cluster Availability Monitoring
When many clients monitor the availability of a cluster, it is recommended to increase the Monitor thread value to its maximum. Usually, when there is no failover or there are no backup-only spaces, the Monitor thread can safely be set to its maximum value, since clients interact directly with the space members. If a member is detected as unavailable, the Detector thread is responsible for detecting its re-availability.
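The division of work between the Monitor and Detector threads described above can be pictured with a small model. This is an illustrative sketch only, not the product's implementation: the class AvailabilityModel, the member names, and the liveness probe are all hypothetical, and each "pass" stands in for one cycle of the corresponding thread.

```java
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical model of the Monitor/Detector split; not the product's code.
final class AvailabilityModel {
    private final Set<String> available = new LinkedHashSet<>();
    private final Set<String> unavailable = new LinkedHashSet<>();
    private final Predicate<String> isAlive; // hypothetical liveness probe

    AvailabilityModel(Set<String> members, Predicate<String> isAlive) {
        this.available.addAll(members);
        this.isAlive = isAlive;
    }

    /** Monitor pass: checks members believed available and demotes those that fail. */
    void monitorPass() {
        move(available, unavailable, isAlive.negate());
    }

    /** Detector pass: checks members believed unavailable and promotes those that respond again. */
    void detectorPass() {
        move(unavailable, available, isAlive);
    }

    private static void move(Set<String> from, Set<String> to, Predicate<String> condition) {
        for (Iterator<String> it = from.iterator(); it.hasNext();) {
            String member = it.next();
            if (condition.test(member)) {
                it.remove();
                to.add(member);
            }
        }
    }

    Set<String> available() {
        return new LinkedHashSet<>(available);
    }

    Set<String> unavailable() {
        return new LinkedHashSet<>(unavailable);
    }
}
```

In this model, lengthening the interval between Monitor passes reduces probing load from many clients, at the cost of noticing a failed member later. That matches the trade-off in the text, where the maximum value is safe mainly when no failover is expected.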
For more details, refer to Viewing Clustered Space Status.