MemoryXtend for Disk (SSD/HDD)

Overview

The MemoryXtend disk storage driver delivers built-in, high-speed persistence that leverages local or attached SSD devices, all-flash-arrays (AFA), or any other disk or memory option. It delivers low latency write and read performance, as well as fast data recovery. The MemoryXtend disk storage driver is based on RocksDB, which is a persistent key/value store optimized for fast storage environments.

Architecture and Components

When configured for the disk storage driver, the MemoryXtend architecture tiers the storage of each Space partition instance across two components: a Space partition instance (managed JVM heap) and an embedded key/value store (the blobstore) as shown in the diagram below.

image

InsightEdge and XAP require Read/Write permissions to mounted devices/partitions.

Space Partition Instance

The Space partition instance is a JVM heap that acts as an LRU cache against the underlying blobstore. This tier in the architecture stores indexes, Space class metadata, transactions, replication redo logs, leases, and statistics. Upon a Space read operation, if the object exists in the JVM heap (i.e. a cache hit) it is immediately returned to the Space proxy client. Otherwise, the Space loads it from the underlying blobstore and places it on the JVM heap (known as a cache miss).

Blobstore

The blobstore is based on a log-structured merge tree architecture (similar to popular NoSQL databases such as: HBase, BigTable, or Cassandra). There are three main components in the blobstore:

  • MemTable: An in-memory data structure (residing on off-heap RAM) where all incoming writes are stored. When a MemTable fills up, it is flushed to an SST file on storage.
  • Log: A write-ahead log (WAL) that serializes MemTable operations to a persistent medium as log files. In the event of failure, WAL files can be used to recover the key/value store to its consistent state, by reconstructing the MemTable from the logs.
  • Sorted String Table (SST) files: SSTable is a data structure (residing on disk) that efficiently stores a large data footprint while optimizing for high throughput, and sequential read/write workloads. When a MemTable fills up, it is flushed to an SST file on storage and the corresponding WAL file can be deleted.
  • Built-in Cache: MemoryXtend comes with a built-in cache. This cache is part of the Space partition tier, and stores objects in their native form.

Basic Configuration

Any existing Space can be configured to integrate a blobstore with it. As with a typical Processing Unit, configuration is done using either the pu.xml configuration file, or in the code. For example:

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:os-core="http://www.openspaces.org/schema/core"
       xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
       http://www.openspaces.org/schema/core http://www.openspaces.org/schema/14.5/core/openspaces-core.xsd">

    <bean id="rocksDbBlobstore" class="com.gigaspaces.blobstore.rocksdb.config.RocksDBBlobStoreDataPolicyFactoryBean">
        <property name="paths" value="[/mnt/db1,/mnt/db2]"/>
        <property name="mappingDir" value="/tmp/mapping"/>
    </bean>

    <os-core:embedded-space id="space" space-name="mySpace">
        <os-core:blob-store-data-policy blob-store-handler="rocksDbBlobstore" persistent="true"/>
    </os-core:embedded-space>

    <os-core:giga-space id="gigaSpace" space="space"/>
</beans>
// Create RocksDB storage driver:
BlobStoreStorageHandler blobStore = new RocksDBBlobStoreConfigurer()
        .setPaths("[/mnt/db1,/mnt/db2]")
        .setMappingDir("/tmp/mapping")
        .create();
// Create space with that storage driver:
EmbeddedSpaceConfigurer spaceConfigurer = new EmbeddedSpaceConfigurer("mySpace")
        .cachePolicy(new BlobStoreDataCachePolicy()
                .setBlobStoreHandler(blobStore)
                .setPersistent(true));
GigaSpace gigaSpace = new GigaSpaceConfigurer(spaceConfigurer).gigaSpace();

The above pu.xml example is relevant only for the service grid environment. For information on how to configure the pu.xml for a Kubernetes environment, see Persistent Volume Storage in Kubernetes below.

The following table describes the configuration options used in rocksdb-blob-store above.

Property Description Default Use
paths A comma-separated array of mount paths used for each Space partition's blobstore. The number of paths in the array should correspond to the number of partition instances in the Space (primaries and backups). For instance, for a two-partition Space with no backups, /mnt/db1 corresponds to the first partition, and /mnt/db2 to the second one. required
mapping-dir A directory in the file system that contains the "partition to blobstore" mapping file. This file is automatically generated by MemoryXtend. If a partition is relocated, the mapping file is updated. required
central-storage Specifies whether the deployment strategy is for central storage (i.e. SAN configuration) or local storage on each grid machine (default). false optional
db-options Specifies the tuning parameters for the persistent data store in the underlying blobstore. This includes SST formats, compaction settings and flushes. See the Performance Tuning section for details. optional
data-column-family-options Specifies the tuning parameters for the LSM logic and memory tables. See the Performance Tuning section for details. optional
blob-store-handler Blobstore implementation. required

For an example of how to configure the on-heap cache properties, see the MemoryXtend overview topic.

Persistent Volume Storage in Kubernetes

This topic describes the Kubernetes storage model and how it is leveraged by GigaSpaces to support tiered storage that utilizes external persistent memory. This support enables using GigaSpaces's MemoryXtend feature for SSD in the Kubernetes environment.

Kubernetes Storage Model

The KubeGrid implementation of persistent volume storage utilizes the following Kubernetes entities:

  • Container - The Docker container that hosts the Space.
  • Pod - Hosts the container, and is essentially the Processing Unit instance, or service in the data grid.
  • storageClass - A Kubernetes wrapper around the physical storage volume that contains the storage profile. Provides the requested storage defined by the persistentVolumeClaim.
  • persistentVolumeClaim - The request for Kubernetes to create a persistentVolume. Used to attach (bind) the persistentVolume to the Data Pod.
  • persistentVolume - A Kubernetes wrapper around the storage that is bound to the Space.

To enable persistent volume storage, the container definition must include a volumeMount, which is a path that is mapped to the external persistent storage. If a volumeMount is defined, this indicates to the pod that external storage is required. The pod definition must include a persistentVolumeClaim, which defines the capacity of the external storage volume. The pod can then create a persistentVolumeClaim when it receives the volumeMount from the container.

When a persistentVolumeClaim is created, it communicates to the storageClass that external storage is needed. The storageClass dynamically creates a persistentVolume, which is then attached to the persistentVolumeClaim and its associated pod.

To summarize, the pod creates a persistentVolumeClaim, which interfaces with the storageClass to create a persistentVolume dedicated to that pod.

Deploying a Data Pod with MemoryXtend in Kubernetes

In order to deploy a Processing Unit with tiered storage in Kubernetes, both the pu.xml and pu Helm charts must be configured with the MemoryXtend definitions.

Defining the Processing Unit pu.xml

The pu.xml contains a paths property. This property defines the mount path for the persistent external storage, or blobstore, used by MemoryXtend.

When using a GigaSpaces product in a non-containerized environment, such as the service grid, then you must define a path for each Space instance. For example, when configuring MemoryXtend for a 2:1 high availability environment, four Space instances will be deployed so four paths must be defined in the paths array.

In the Kubernetes environment, Spaces reside in dedicated Data Pods that are completely separate from each other. Therefore, the same path can be used for each instance because each Space creates its own persistentVolumeClaim, and gets a separate persistentVolume from the storageClass.

The defined path must be identical to the volumeMount path in the Data Pod.

<bean id="propertiesConfigurer" class="org.springframework.beans.factory.config.PropertyPlaceholderConfigurer">
    <property name="properties">
        <props>
            <prop key="space-name">rocksdb</prop>
            <prop key="mx-path">opt/gigaspaces/memoryxtend</prop>
        </props>
    </property>
</bean>
	
<blob-store:rocksdb-blob-store id="myBlobStore" paths="[${mx-path}]" mapping-dir="${mx-path}/mapping/rocksdbmap"/>

<os-core:embedded-space id="space" name="${space-name}">
    <os-core:blob-store-data-policy blob-store-handler="myBlobStore" persistent="true"/>								
</os-core:embedded-space>

Installing the Helm Chart

After the pu.xml has been configured for the MemoryXtend persistent storage option, the next step is installing the Helm chart. Fetch the GigaSpaces Helm charts as described in Getting Started with GigaSpaces in Kubernetes, then use the helm install command to deploy the processing unit. The minimum configuration required for MemoryXtend volume is to set the enabled property to true, and the storage property to the required capacity of the external storage. For example:

helm install insightedge-pu --name my-pu --set memoryXtendVolume.enabled=true,memoryXtendVolume.volumeClaimTemplate.storage=10Gi,resourceUrl=http://<my-server>/path/to/my-pu.jar

The following table describes the properties in the memoryXtendVolume section of the Helm chart.

Property Description Default Value Required
memoryXtendVolume.enabled Indicates whether MemoryXtend for persistent memory is enabled. false Yes
memoryXtendVolume.volumeMount.name Name of the volumeMount. mx-volume No
memoryXtendVolume.volumeMount.mountPath Path that is mapped to the external persistent storage (must be identical to the volumeMount path in the Data Pod configuration). /opt/gigaspaces/memoryxtend No
memoryXtendVolume.volumeClaimTemplate.storage Capacity of the external persistent memory volume.   Yes
memoryXtendVolume.volumeClaimTemplate.storageClassName The storageClass from which the Data Pod is requesting the external persistent memory volume.   No
memoryXtendVolume.volumeClaimTemplate.accessModes Mode of access to the external persistent memory volume ReadWriteOnce No
memoryXtendVolume.volumeClaimTemplate.persistentVolumeReclaimPolicy Define what to do with the persistentVolume after the persistentVolumeClaim is deleted. Delete No

Off-Heap Cache

The data grid can cache the values of indexed fields in the process native (off-heap) memory. This optimizes operations that require only indexed values in order to execute, by fetching the data from off-heap instead of disk. The operations that benefit from this optimization are:

  • Read with projection and only indexed fields in query and projection - primary instance optimization
  • Take with only indexed fields in query - backup optimization
  • Clear with only indexed fields in query - primary and backup instance optimization

This feature increases the overall memory consumption of the Space by several bytes (depending on the indexed fields) per entry.

The off-heap cache is disabled by default. To enable it, simply set the off-heap-cache-memory-threshold property in rocksdb-blob-store to the amount of memory you want to allocate for off-heap caching, for example 20m. Use the following sizing units:

  • b - Bytes
  • k, kb - Kilobytes
  • m, mb - Megabytes
  • g, gb - Gigabytes

Before any operation that requires memory allocation (write, update, and initial load), the memory manager checks how much of the allocated memory has been used. If the threshold has been breached, an OffHeapMemoryShortageException is thrown. Read, take, and clear operations are always allowed.

If the used memory is below the configured threshold, then a large write operation may exceed the threshold without being blocked or throwing an exception. Users should take this into account when setting the maximum memory size. This behavior is similar to that of the regular memory manager.

The amount of used off-heap memory can be tracked using the following:

  • Metrics - The space_blobstore_off-heap-cache_used-bytes_total metric, as described on the Metrics page.
  • Administration API - Thru SpaceInstanceStatistics.getBlobStoreStatistics()
  • Web Management Console - In the Space instances view, right-click any of the columns in the table and add the Used Off-Heap Cache column.

When a Processing Unit is undeployed or a Space is killed, the off-heap memory manager frees the allocated off-heap memory.

See the following example:

    public static class TestSpaceClass {

        private Integer id;
        private Integer age;
        private Boolean active;

        public TestSpaceClass() {
        }

        @SpaceId
        public Integer getId() {
            return id;
        }

        public void setId(Integer id) {
            this.id = id;
        }

        @SpaceIndex
        public Integer getAge() {
            return age;
        }

        public void setAge(Integer age) {
            this.age = age;
        }

        public Boolean getActive() {
            return active;
        }

        public void setActive(Boolean active) {
            this.active = active;
        }
    }
gigaSpace.readMultiple(new SQLQuery<TestSpaceClass>(TestSpaceClass.class, "age = 3").setProjections("id", "age"));

To see whether your query can benefit from this optimization behavior, you can set com.gigaspaces.cache to FINE and look for the following log entry:

2017-11-28 07:57:49,115  FINE [com.gigaspaces.cache] - BlobStore - enabled optimization for query: SELECT * FROM com.gigaspaces.test.blobstore.rocksdb.AbstractRocksDBOptimizationTest$TestSpaceClass 
2017-11-28 07:57:49,115  FINE [com.gigaspaces.cache] - BlobStore - disabled optimization for query: SELECT * FROM com.gigaspaces.test.blobstore.rocksdb.AbstractRocksDBOptimizationTest$TestSpaceClass 

When set to FINER, you can see where the entries were fetched from. Look for the following log entry:

2017-11-28 07:57:49,117  FINER [com.gigaspaces.cache] - container [mySpace_container1_1:mySpace] Blobstore- entry loaded from off heap, uid=-1850388785^84^98^0^0
2017-11-28 07:57:49,117  FINER [com.gigaspaces.cache] - container [mySpace_container1_1:mySpace] Blobstore- entry loaded from disk, uid=-1850388785^84^98^0^0

Advanced Configuration

As part of fine-tuning the MemoryXtend functionality, you can control the balance between memory utilization and system performance. This is useful because there are two ways to update the value of an object in the off-heap memory:

  • Free up the old memory allocation, and reallocate new memory.
  • Overwrite the old value using the same allocated memory.

Deleting and reallocating memory takes longer then overwriting the same memory area, but if the size of the new value is smaller then the old value, then the overwrite option leaves part of the original allocated memory underutilized. You can use the BLOBSTORE_OFF_HEAP_MIN_DIFF_TO_ALLOCATE_PROP Space property to control when to trade off system performance for maximizing the memory usage.

This Space property works as follows: if the oldValueSize - newValueSize > blobstore.off-heap.update_threshold, then delete and reallocate memory for the object. Otherwise overwrite the old object value with the new object value.

The default threshold is 50B.

Example

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:os-core="http://www.openspaces.org/schema/core"
       xmlns:blob-store="http://www.openspaces.org/schema/off-heap-blob-store"
       xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
       http://www.openspaces.org/schema/core http://www.openspaces.org/schema/14.5/core/openspaces-core.xsd
       http://www.openspaces.org/schema/off-heap-blob-store http://www.openspaces.org/schema/14.5/off-heap-blob-store/openspaces-off-heap-blob-store.xsd">

    <blob-store:off-heap-blob-store id="offheapBlobstore" memory-threshold="20g"/>

    <os-core:embedded-space id="space" name="mySpace">
       <os-core:blob-store-data-policy persistent="false" blob-store-handler="offheapBlobstore"/>
       <os-core:properties>
          <props>
             <prop key="blobstore.off-heap.update_threshold">1M</prop>
          </props>
    </os-core:embedded-space>

    <os-core:giga-space id="gigaSpace" space="space"/>
</beans>
// Create off-heap storage driver:
String memoryThreshold = "5g";
BlobStoreStorageHandler blobStore = new OffHeapBlobStoreConfigurer()
        .setMemoryThreshold(memoryThreshold)
        .create();
// Create space with that storage driver:
String spaceName = "mySpace";
EmbeddedSpaceConfigurer spaceConfigurer = new EmbeddedSpaceConfigurer(spaceName)
        .cachePolicy(new BlobStoreDataCachePolicy()
                .setBlobStoreHandler(blobStore)
                .setPersistent(false));
spaceConfigurer.addProperty("blobstore.off-heap.update_threshold", "1M");
GigaSpace gigaSpace = new GigaSpaceConfigurer(spaceConfigurer).gigaSpace();

Deployment Strategies

Local Storage

This configuration allows each Space partition instance (primary or backup) to use a machine-local storage device (SSD/HDD). With this approach, data locality is leveraged so that the devices local to the machine are used for reads/writes. The local storage deployment strategy provides an easy way to implement a "local persistent store" (also known as native persistence) pattern.

image

Central Storage

This deployment strategy works well with storage area networks (SAN), which means that the disk drive devices are installed in a remote storage array but behave as if they're attached to the local machine. Most storage networks use the iSCSI or Fibre Channel protocol for communication between servers and disk drive devices. This configuration breaks the coupling between a partition instance and a local machine device, allowing seamless relocation of partitions across data grid machines.

image

Tiering storage between Space partition instances and attached storage can be applied across one or more storage arrays, as shown in the configurations below:

Single Storage Array

The following example deployes a 2 partitions space with a single backup (2,1) in the following manner:

  • /mnt/db1 will be mounted to the 1st primary.
  • /mnt/db2 will be mounted to the 2nd primary.
  • /mnt/db3 will be mounted to the 1st backup.
  • /mnt/db4 will be mounted to the 2nd backup.

<blob-store:rocksdb-blob-store id="myBlobStore" paths="[/mnt/db1,/mnt/db2,/mnt/db3,/mnt/db4]"/>

Dual Storage Array

The following example deploys a 2-partition Space with a single backup (2,1) in the following manner:

  • /mnt1/db1 will be mounted to the 1st primary.
  • /mnt1/db2 will be mounted to the 2nd primary.
  • /mnt2/db1 will be mounted to the 1st backup.
  • /mnt2/db2 will be mounted to the 2nd backup.

<blob-store:rocksdb-blob-store id="myBlobStore" paths="[/mnt1/db1,/mnt1/db2],[/mnt2/db1,/mnt2/db2]"/>

Performance Tuning

Persistent Data Store Tuning Parameters

The data grid uses the default DBOptions class com.com.gigaspaces.blobstore.rocksdb.config.XAPDBOptions.

A list of configuration properties can be found in the org.rocksdb.DBOptionsInterface API documentation.

Property Description Value
createIfMissing Configure whether to create the database if it is missing. This value is always overrode with true. true
maxBackgroundCompactions Specifies the maximum number of concurrent background compaction jobs, submitted to the default LOW priority thread pool.
If you're increasing this, also consider increasing the number of threads in the LOW priority thread pool
8
maxBackgroundFlushes Specifies the maximum number of concurrent background flush jobs.
If you're increasing this, also consider increasing the number of threads in the HIGH priority thread pool.
8
maxOpenFiles Number of open files that can be used by the database. You may need to increase this if your database has a large working set. When the value is set to -1, files that are opened are always kept open. -1
tableFormatConfig Set the configuration for the table format.
The default is BlockBasedTableConfig with
  • noBlockCache = opposite of useCache
  • blockCacheSize = cacheSize
  • blockSize = blockSize
  • filter = BloomFilter(10,false)
  • formatVersion = 0
The highlighted values are taken from the rocksdb-blob-store namespace / RocksDBBlobStoreConfigurer if provided, otherwise the following defaults are used:
  • useCache = true
  • cacheSize = 100MB
  • blockSize = 16KB
If a custom tableFormatConfig is provided, the values from the namespace/configurer are ignored.

LSM Logic and MemTables Tuning Parameters

Below are the values for the default class com.com.gigaspaces.blobstore.rocksdb.config.XAPColumnFamilyOptions.

A list of configuration properties can be found in the org.rocksdb.ColumnFamilyOptionsInterface API documentation.

Property Description Value
writeBufferSize Amount of data to build up in memory (backed by an unsorted log on disk)
before converting to a sorted on-disk file. Should be in bytes.
64 MB
levelZeroSlowdownWritesTrigger Soft limit on number of level-0 files. The data grid begins to slow down writes at this point.
A value < 0 means that no writing slowdown is triggered by the number of files in level-0.
8
maxWriteBufferNumber The maximum number of write buffers that are built up in memory. 4
targetFileSizeBase The target file size for compaction. Should be in bytes. 64 MB
softPendingCompactionBytesLimit The soft limit to impose on compaction. 0
hardPendingCompactionBytesLimit The hard limit to impose on compaction. 0
levelCompactionDynamicLevelBytes If true, RocksDB will pick the target size of each level dynamically. false
maxBytesForLevelBase The upper limit of the total size of level-1 files, in bytes. 512 MB
compressionPerLevel Sets the compression policy for each level. [NO_COMPRESSION,
NO_COMPRESSION,
SNAPPY_COMPRESSION]
mergeOperatorName Sets the merge operator to be used for merging two merge operands of the same key. put
fsync By default, each write returns after the data is pushed to the operating system. The transfer from operating system memory to the underlying persistent storage happens asynchronously. When sync is configured to true, each write operation doesn't return until the data being written has been pushed all the way to persistent storage.
This parameter should be set to true while storing data to file systems like ext3, which can lose files after a reboot.
If this property is set, the fsync that is passed to the rocksdb-blob-store namespace/RocksDBBlobStoreConfigurer (if any) is ignored.
false

Examples

<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xmlns:os-core="http://www.openspaces.org/schema/core"
       xmlns:blob-store="http://www.openspaces.org/schema/rocksdb-blob-store"
       xsi:schemaLocation="http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
       http://www.openspaces.org/schema/core http://www.openspaces.org/schema/14.5/core/openspaces-core.xsd
       http://www.openspaces.org/schema/rocksdb-blob-store http://www.openspaces.org/schema/14.5/rocksdb-blob-store/openspaces-rocksdb-blobstore.xsd">

    <bean id="dbOptions" class="com.gigaspaces.blobstore.rocksdb.config.XAPDBOptions">
        <!-- This will override the default of 8 -->
        <property name="maxBackgroundCompactions" value="4" />
        <property name="maxBackgroundFlushes" value="4" />
    </bean>
    
    <bean id="dataColumnFamilyOptions" class="com.gigaspaces.blobstore.rocksdb.config.XAPColumnFamilyOptions">
        <property name="maxBytesForLevelMultiplier" value="20"/>
    </bean>

    <blob-store:rocksdb-blob-store id="myBlobStore" fsync="false"
        cache-size-mb="100" block-size-kb="16"
        db-options="dbOptions" data-column-family-options="dataColumnFamilyOptions"
        paths="[/tmp/rocksdb,/tmp/rocksdb2]" mapping-dir="/tmp/mapping"/>

    <os-core:embedded-space id="space" space-name="mySpace" >
        <os-core:blob-store-data-policy blob-store-handler="myBlobStore"/>
    </os-core:embedded-space>

    <os-core:giga-space id="gigaSpace" space="space"/>
</beans>
RocksDBBlobStoreConfigurer configurer = new RocksDBBlobStoreConfigurer()
        .setPaths("[/tmp/rocksdb,/tmp/rocksdb2]")
        .setMappingDir("/tmp/mapping")
        .setCacheSizeMB(100)
        .setBlockSizeKB(16)
        .setFsync(false);

XAPDBOptions dbOptions = new XAPDBOptions();
configurer.setDBOptions(dbOptions); // optional

XAPColumnFamilyOptions dataColumnFamilyOptions = new XAPColumnFamilyOptions();
configurer.setDataColumnFamilyOptions(dataColumnFamilyOptions); // optional

RocksDBBlobStoreHandler rocksDBBlobStoreHandler = configurer.create();

BlobStoreDataCachePolicy cachePolicy = new BlobStoreDataCachePolicy();
cachePolicy.setBlobStoreHandler(rocksDBBlobStoreHandler);

EmbeddedSpaceConfigurer embeddedSpaceConfigurer = new EmbeddedSpaceConfigurer("mySpace");
embeddedSpaceConfigurer.cachePolicy(cachePolicy);

gigaSpace = new GigaSpaceConfigurer(embeddedSpaceConfigurer.space()).gigaSpace();