Configuring AnalyticsXtreme on the Server Side
AnalyticsXtreme is implemented in conjunction with a specific Space. This is part of the standard configuration of a pu.xml file.
For more information about pu.xml files and their properties, read the Configuration topic in The Processing Unit Service section of the developer guide.
Configuring AnalyticsXtreme on the server side involves the following steps:
- Defining the Space.
- Setting up the AnalyticsXtreme Manager.
- Configuring the data life cycle policy.
- Defining the data source and the data target.
- Exporting the AnalyticsXtreme Manager service for discovery (so that client applications can access it).
The Space definition is part of the standard pu.xml configuration as described in The Processing Unit Service section of the developer guide. The rest of the steps, which are specific to AnalyticsXtreme, are explained in the sections below.
Setting Up the AnalyticsXtreme Manager
As mentioned above, all of the relevant information about AnalyticsXtreme is provided to the data grid in a dedicated pu.xml file. The first step in the configuration process is creating an AnalyticsXtreme Manager bean, with a bean ID and class. In this part of the pu.xml, you can also modify the AnalyticsXtreme logging policy configuration if necessary.
In the following code snippet, taken from the full pu.xml file, the AnalyticsXtreme Manager is assigned a bean ID of ax-manager, and the logging policy has been changed from its default value.
<bean id="ax-manager" class="com.gigaspaces.analytics_xtreme.server.AnalyticsXtremeManagerFactory">
<property name="config">
<bean class="com.gigaspaces.analytics_xtreme.AnalyticsXtremeConfigurationFactoryBean">
<!-- Verbose is recommended for getting started, usually turned off in production -->
<property name="verbose" value="true"/>
<!-- more properties -->
</bean>
</property>
</bean>
Cold Starts
For persistence purposes, GigaSpaces maintains a record of the last confirmed data that was copied from the Space to the external data source. If you need to undeploy and then redeploy a Processing Unit, for example as part of a maintenance activity, you can choose either to perform a cold start of the Space (copy data to the external data source from the beginning), or to have the redeployed Processing Unit continue copying data from where it left off when it was undeployed.
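A minimal sketch of how a cold start might be enabled in the pu.xml, assuming the coldStart flag sits on the configuration bean alongside verbose, as the Global Properties table below suggests:

```xml
<bean id="ax-manager" class="com.gigaspaces.analytics_xtreme.server.AnalyticsXtremeManagerFactory">
    <property name="config">
        <bean class="com.gigaspaces.analytics_xtreme.AnalyticsXtremeConfigurationFactoryBean">
            <!-- Assumed placement: after a redeploy, copy data from the
                 batchDataSource to the batchDataTarget from the beginning -->
            <property name="coldStart" value="true"/>
        </bean>
    </property>
</bean>
```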
Global Properties
The global configuration contains a list of DataLifecyclePolicy properties, as well as the AnalyticsXtreme logging level.
Parameter | Description | Unit | Default Value | Required/Optional |
---|---|---|---|---|
verbose | Increases the log levels for both the client and the server to provide verbose AnalyticsXtreme information (useful for troubleshooting). | True/False | False | Optional |
coldStart | (Automatic data tiering only) After redeploying a Processing Unit, set the value to "true" if you want to copy data from the batchDataSource to the batchDataTarget from the beginning. Leave the default value to continue copying data from the last confirmed data that was copied from the batchDataSource to the batchDataTarget. | True/False | False | Optional |
Configuring the Data Life Cycle Policy
The next step in configuring AnalyticsXtreme is to define the data life cycle policy, which specifies how and when data is archived from the data grid (speed layer) to the external data storage (batch layer). This policy is configured in the AnalyticsXtreme Manager for each data object, or table.
You must define the following properties in the policy:
- typeName
- timeColumn
- speedPeriod
- batchDataSource
All other properties are optional, and you can leave the default values unless your specific environment has different requirements. See Table (Object) Properties below for a full list of the data life cycle policy properties and their descriptions.
The relationship between the data object and the data life cycle policy is one to one. You cannot have more than one policy per data object, and each policy applies to a single data object.
Defining the Data Source and Data Target
Part of configuring the data life cycle policy is defining the batchDataSource property. You can define it directly, or you can use a ref to point to the definition. If you are implementing AnalyticsXtreme in external data tiering mode, this is the only required property, because the data is simply evicted at the end of the life cycle.
If you are implementing AnalyticsXtreme in automatic data tiering mode, you must also define the batchDataTarget property using the same method. Towards the end of the speedPeriod, the data is moved to the target defined here.
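For external data tiering mode, a policy can therefore be sketched with only the required properties and no batchDataTarget. This variant reuses the class and bean names from the automatic-tiering example in this topic; the eviction-only combination itself is an assumption based on the description above:

```xml
<bean class="com.gigaspaces.analytics_xtreme.DataLifecyclePolicyFactoryBean">
    <property name="typeName" value="com.gigaspaces.demo.Trade"/>
    <property name="timeColumn" value="dateTimeTrade"/>
    <property name="speedPeriod" value="pt5h"/>
    <!-- No batchDataTarget: data is evicted at the end of the speedPeriod,
         and historical queries are served from the batchDataSource -->
    <property name="batchDataSource" ref="ax-datasource"/>
</bean>
```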
Example
The following code snippet from the pu.xml file shows a data life cycle policy that was configured for specific trading information that needs to be stored in the data grid for 5 hours, and then moved to external data storage. Both the data source and the data target have their full definitions in another area of the pu.xml file.
<list>
<!-- Data life cycle policy for Trade class -->
<bean class="com.gigaspaces.analytics_xtreme.DataLifecyclePolicyFactoryBean">
<property name="typeName" value="com.gigaspaces.demo.Trade"/>
<property name="timeColumn" value="dateTimeTrade"/>
<property name="speedPeriod" value="pt5h"/>
<property name="batchDataSource" ref="ax-datasource"/>
<property name="batchDataTarget" ref="ax-datatarget"/>
</bean>
<!-- Add a life cycle policy for each additional class -->
</list>
Configuring the Data Source
AnalyticsXtreme supports HDFS out of the box. However, you can add any external data source that is supported by Spark.
The following code snippet demonstrates how to configure AnalyticsXtreme to work with a Spark Hive data source.
<!-- Data source plugin based on Spark Hive -->
<bean id="ax-datasource" class="com.gigaspaces.analytics_xtreme.spark.SparkHiveBatchDataSourceFactoryBean">
<property name="sparkSessionProvider" ref="ax-sparkSessionFactory"/>
</bean>
The data source and the data target use the same sparkSessionFactory bean.
Configuring the Data Target
There are three possible implementations for the data target out of the box: JDBC, Spark, and Hive. Each is wrapped in a factory bean. The BatchDataTarget interface can be used for other data targets by implementing the feed method.
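The real com.gigaspaces.analytics_xtreme BatchDataTarget interface and the exact signature of feed are not spelled out in this topic, so the sketch below uses a hypothetical stand-in interface purely to illustrate the shape of a custom target; consult the API documentation for the actual contract.

```java
import java.util.ArrayList;
import java.util.List;

// CAUTION: this local interface is a stand-in. The signature of feed() is an
// assumption made for illustration only, not the real library contract.
interface BatchDataTarget {
    void feed(List<?> entries);
}

// A toy target that collects entries in memory; a real implementation would
// write each batch to an external store (JDBC, HDFS, and so on).
class InMemoryBatchDataTarget implements BatchDataTarget {
    private final List<Object> store = new ArrayList<>();

    @Override
    public void feed(List<?> entries) {
        // Called once per batch feed interval with the entries to persist.
        store.addAll(entries);
    }

    public int size() {
        return store.size();
    }
}
```

A custom target like this would then be registered as a bean and referenced from the policy's batchDataTarget property, in the same way as the factory-bean examples below.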
Spark Hive Batch Data Target Example
This example demonstrates the Spark Hive implementation.
<!-- Data target plugin based on Spark Hive -->
<bean id="ax-datatarget" class="com.gigaspaces.analytics_xtreme.spark.SparkHiveBatchDataTargetFactoryBean">
<property name="format" value="hive"/>
<property name="mode" value="append"/>
<property name="sparkSessionProvider" ref="ax-sparkSessionFactory"/>
</bean>
<!-- Spark session provider -->
<bean id="ax-sparkSessionFactory" class="org.insightedge.spark.SparkSessionProviderFactoryBean">
<property name="master" value="local[*]"/>
<property name="enableHiveSupport" value="true"/>
<property name="configOptions">
<map>
<entry key="hive.metastore.uris" value="thrift://hive-metastore:9083"/>
</map>
</property>
</bean>
JDBC Batch Data Target Example
This example demonstrates the JDBC implementation.
<bean id="ax-datasource" class="com.gigaspaces.analytics_xtreme.jdbc.JdbcBatchDataSourceFactoryBean">
<property name="connectionString" value="jdbc:hive2://hive-server:10000/;ssl=false"/>
</bean>
<bean id="ax-datatarget" class="com.gigaspaces.analytics_xtreme.jdbc.JdbcBatchDataTargetFactoryBean">
<property name="connectionString" value="jdbc:hive2://hive-server:10000/;ssl=false"/>
<property name="useLowerCase" value="true"/>
</bean>
Using a Different Apache Hive Version
Apache Spark is currently packaged with Apache Hive version 1.2.1. If you want to use a different Hive version for the batch data source or the batch data target, you need to do the following:
- Import the relevant JAR files.
- Add the following to the pu.xml under the sparkSessionFactory configOptions:
<property name="configOptions">
    <map>
        <entry key="hive.metastore.uris" value="thrift://hive-metastore:9083"/>
        <entry key="spark.sql.hive.metastore.version" value="2.3.2"/>
        <entry key="spark.sql.hive.metastore.jars" value="#{systemEnvironment['SPARK_HOME']}/jars/hive/*:#{systemEnvironment['SPARK_HOME']}/jars/*"/>
        <entry key="spark.ui.enabled" value="false"/>
    </map>
</property>
AnalyticsXtreme is tested and certified on Apache Hive version 2.3.2.
Table (Object) Properties
The DataLifecyclePolicy contains the following parameters. The first four properties are required. The others are optional, and it is recommended to leave the default values as is unless your business application environment has specific requirements.
Parameter | Description | Unit | Default Value | Required/Optional |
---|---|---|---|---|
typeName* | Name of the table/type/class to which this policy applies. | | | Required |
timeColumn** | Name of the column/property/field that contains the time data used to manage this policy. | | | Required |
speedPeriod | Time period or fixed timestamp. | | | Required |
batchDataSource | Endpoint for querying the batch layer. | | | Required |
batchDataTarget | Endpoint for feeding data to the batch layer. Only necessary when AnalyticsXtreme is implemented in automatic data tiering mode. | | | Optional |
timeFormat | Time format for the data in the timeColumn parameter. | See Java Time Pattern below | Java time format (see the Java API documentation) | Optional |
mutabilityPeriod | Period of time during which data can be updated. | Duration in time units or % of speedPeriod | 80% | Optional |
batchFeedInterval | Data is fed from the Space (speed layer) to the batch layer in these time-based intervals at the end of the speedPeriod, after the mutabilityPeriod has expired. | See Java Time Pattern below | pt1m | Optional |
batchFeedSize | Maximum data entries per batch feed interval. | Integer (number of entries) | 1000 | Optional |
evictionPollingInterval | Polling interval for querying and evicting each policy. | See Java Time Pattern below | pt1m | Optional |
evictionBuffer | Additional waiting period before evicting data from the Space (speed layer) after it was fed to the batch layer, so that long queries and clock differences don't cause errors or generate exceptions. | Duration in time units or % of speedPeriod | pt10m | Optional |
* This correlates to an object or JSON in the object store.
** This correlates to a property or entity in an object or JSON.
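To illustrate the optional properties, the Trade policy from the earlier example could be written out with explicit tuning values. The values shown simply restate the defaults from the table above; adjust them only if your environment requires it:

```xml
<bean class="com.gigaspaces.analytics_xtreme.DataLifecyclePolicyFactoryBean">
    <property name="typeName" value="com.gigaspaces.demo.Trade"/>
    <property name="timeColumn" value="dateTimeTrade"/>
    <property name="speedPeriod" value="pt5h"/>
    <property name="batchDataSource" ref="ax-datasource"/>
    <property name="batchDataTarget" ref="ax-datatarget"/>
    <!-- Optional tuning; these values restate the documented defaults -->
    <property name="mutabilityPeriod" value="80%"/>
    <property name="batchFeedInterval" value="pt1m"/>
    <property name="batchFeedSize" value="1000"/>
    <property name="evictionBuffer" value="pt10m"/>
</bean>
```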
Java Time Pattern
AnalyticsXtreme is a time-based feature, and therefore uses the Java time pattern (available as of Java 8) in the timeFormat property. In this pattern, every duration begins with the letter "p" (in either upper or lower case), and any duration of less than 24 hours must also include the letter "t" before its time components.
For example, pt15m represents a duration of 15 minutes, and pt10h represents a duration of 10 hours. A duration of 2 days should be written as p2d.
For more information about the Java time pattern, see the Java API documentation.
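These duration strings can be parsed with the standard java.time.Duration class, which is a quick way to sanity-check a value before putting it in the pu.xml (this is plain JDK behavior, not an AnalyticsXtreme API):

```java
import java.time.Duration;

public class DurationCheck {
    public static void main(String[] args) {
        // Duration.parse is case-insensitive, so the lower-case values used
        // in this topic (pt15m, pt10h, p2d) parse the same as PT15M etc.
        System.out.println(Duration.parse("pt15m").toMinutes()); // 15
        System.out.println(Duration.parse("pt10h").toHours());   // 10

        // Whole-day durations omit the "t": p2d is 48 hours
        System.out.println(Duration.parse("p2d").toHours());     // 48
    }
}
```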
Exporting the AnalyticsXtreme Manager Service
In order for client applications to access the speed layer, the AnalyticsXtreme Manager needs to be exported to enable discovery. This mechanism leverages the data grid's remoting mechanism, which registers the AnalyticsXtreme Manager as a remote service.
The following code snippet demonstrates how to implement this mechanism in the pu.xml.
<!-- Register the ax-manager bean as a remote service so clients can load configuration & stats from it -->
<os-remoting:service-exporter id="serviceExporter">
<os-remoting:service ref="ax-manager"/>
</os-remoting:service-exporter>
For more information about the data grid's remoting mechanism, see the Space-Based Remoting section of the developer guide.
Sample pu.xml File
The following pu.xml file shows all the above steps, including defining a Space.
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:context="http://www.springframework.org/schema/context"
xmlns:os-core="http://www.openspaces.org/schema/core"
xmlns:os-remoting="http://www.openspaces.org/schema/remoting"
xsi:schemaLocation="
http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context.xsd
http://www.openspaces.org/schema/core http://www.openspaces.org/schema/14.0/core/openspaces-core.xsd
http://www.openspaces.org/schema/remoting http://www.openspaces.org/schema/remoting/openspaces-remoting.xsd">
<context:annotation-config />
<os-core:annotation-support />
<os-core:embedded-space id="space" space-name="demo-space" />
<os-core:giga-space id="gigaSpace" space="space"/>
<bean id="ax-manager" class="com.gigaspaces.analytics_xtreme.server.AnalyticsXtremeManagerFactory">
<property name="config">
<bean class="com.gigaspaces.analytics_xtreme.AnalyticsXtremeConfigurationFactoryBean">
<!-- Verbose is recommended for getting started, usually turned off in production -->
<property name="verbose" value="true"/>
<property name="policies">
<list>
<ref bean="dataLifecyclePolicy"/>
</list>
</property>
<!-- more properties -->
</bean>
</property>
</bean>
<!-- Data life cycle policy for Trade class -->
<bean id="dataLifecyclePolicy" class="com.gigaspaces.analytics_xtreme.DataLifecyclePolicyFactoryBean">
<property name="typeName" value="com.gigaspaces.demo.Trade"/>
<property name="timeColumn" value="dateTimeTrade"/>
<property name="speedPeriod" value="pt5h"/>
<property name="batchDataSource" ref="ax-datasource"/>
<property name="batchDataTarget" ref="ax-datatarget"/>
</bean>
<!-- Register the ax-manager bean as a remote service so clients can load configuration & stats from it -->
<os-remoting:service-exporter id="serviceExporter">
<os-remoting:service ref="ax-manager"/>
</os-remoting:service-exporter>
<!-- Data source plugin based on Spark Hive -->
<bean id="ax-datasource" class="com.gigaspaces.analytics_xtreme.spark.SparkHiveBatchDataSourceFactoryBean">
<property name="sparkSessionProvider" ref="ax-sparkSessionFactory"/>
</bean>
<!-- Data target plugin based on Spark Hive -->
<bean id="ax-datatarget" class="com.gigaspaces.analytics_xtreme.spark.SparkHiveBatchDataTargetFactoryBean">
<property name="format" value="hive"/>
<property name="mode" value="append"/>
<property name="sparkSessionProvider" ref="ax-sparkSessionFactory"/>
</bean>
<!-- Spark session provider -->
<bean id="ax-sparkSessionFactory" class="org.insightedge.spark.SparkSessionProviderFactoryBean">
<property name="master" value="local[*]"/>
<property name="enableHiveSupport" value="true"/>
<property name="configOptions">
<map>
<entry key="hive.metastore.uris" value="thrift://hive-metastore:9083"/>
</map>
</property>
</bean>
<!-- To use jdbc instead of spark, replace ax-datasource, ax-datatarget and sparkSessionFactory with these beans
<bean id="ax-datasource" class="com.gigaspaces.analytics_xtreme.jdbc.JdbcBatchDataSourceFactoryBean">
<property name="connectionString" value="jdbc:hive2://hive-server:10000/;ssl=false"/>
</bean>
<bean id="ax-datatarget" class="com.gigaspaces.analytics_xtreme.jdbc.JdbcBatchDataTargetFactoryBean">
<property name="connectionString" value="jdbc:hive2://hive-server:10000/;ssl=false"/>
<property name="useLowerCase" value="true"/>
</bean>
-->
</beans>
Limitations
Hot deploy of policy changes/updates is not supported. The Processing Unit must be undeployed and then redeployed.
Monitoring
You can use the following metrics to monitor how AnalyticsXtreme is functioning.
Metric | Description | Type |
---|---|---|
evicted-entries | Total number of data entries evicted from the Space. | Long |
copied-entries | Total number of data entries copied to the batch layer (external data storage). | Long |
{type-name}_evicted-entries | Total number of type_name data entries evicted from the Space. | Long |
{type-name}_copied-entries | Total number of type_name entries copied to the batch layer. | Long |
{type-name}_confirmed-feed-timestamp | Last confirmed time stamp of when data entries of type_name were copied to the batch layer. | Time stamp |