Configuring AnalyticsXtreme on the Server Side
AnalyticsXtreme is implemented in conjunction with a specific Space. This is part of the standard configuration of a pu.xml file.
For more information about pu.xml files and their properties, read the Configuration topic in The Processing Unit Service section of the developer guide.
Configuring AnalyticsXtreme on the server side involves the following steps:
- Defining the Space.
- Setting up the AnalyticsXtreme Manager.
- Configuring the data life cycle policy.
- Defining the data source and the data target.
- Exporting the AnalyticsXtreme Manager service for discovery (so that client applications can access it).
The Space definition is part of the standard pu.xml configuration as described in The Processing Unit Service section of the developer guide. The rest of the steps, which are specific to AnalyticsXtreme, are explained in the sections below.
Setting Up the AnalyticsXtreme Manager
As mentioned above, all of the relevant information about AnalyticsXtreme is provided to the data grid in a dedicated pu.xml file. The first step in the configuration process is creating an AnalyticsXtreme Manager bean, with a bean ID and class. In this part of the pu.xml, you can also modify the AnalyticsXtreme logging policy configuration if necessary.
In the following code snippet, taken from the full pu.xml file, the AnalyticsXtreme Manager is assigned a bean ID of ax-manager, and the logging policy has been changed from its default value.
<bean id="ax-manager" class="com.gigaspaces.analytics_xtreme.server.AnalyticsXtremeManagerFactory">
<property name="config">
<bean class="com.gigaspaces.analytics_xtreme.AnalyticsXtremeConfigurationFactoryBean">
<!-- Verbose is recommended for getting started, usually turned off in production -->
<property name="verbose" value="true"/>
<!-- more properties -->
</bean>
</property>
</bean>
Cold Starts
For persistence purposes, GigaSpaces maintains a record of the last confirmed data that was copied from the Space to the external data source. If you need to undeploy and then redeploy a Processing Unit, for example as part of a maintenance activity, you can choose either to perform a cold start of the Space (copy data to the external data source from the beginning), or to have the redeployed Processing Unit continue copying data from where it left off when it was undeployed.
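A minimal sketch of how a cold start might be enabled in the pu.xml, assuming the coldStart flag sits on the configuration bean alongside verbose, as the Global Properties table below suggests:

```xml
<bean id="ax-manager" class="com.gigaspaces.analytics_xtreme.server.AnalyticsXtremeManagerFactory">
    <property name="config">
        <bean class="com.gigaspaces.analytics_xtreme.AnalyticsXtremeConfigurationFactoryBean">
            <!-- Assumed placement: after a redeploy, copy data from the
                 batchDataSource to the batchDataTarget from the beginning -->
            <property name="coldStart" value="true"/>
        </bean>
    </property>
</bean>
```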
Global Properties
The global configuration contains a list of DataLifecyclePolicy properties, as well as the AnalyticsXtreme logging level.
Parameter | Description | Unit | Default Value | Required/Optional |
---|---|---|---|---|
verbose | Increases the log levels for both the client and the server to provide verbose AnalyticsXtreme information (useful for troubleshooting). | True/False | False | Optional |
coldStart | (Automatic data tiering only) After redeploying a Processing Unit, set the value to "true" if you want to copy data from the batchDataSource to the batchDataTarget from the beginning. Leave the default value to continue copying data from the last confirmed data that was copied from the batchDataSource to the batchDataTarget. | True/False | False | Optional |
Configuring the Data Life Cycle Policy
The next step in configuring AnalyticsXtreme is to define the data life cycle policy, which specifies how and when data is archived from the data grid (speed layer) to the external data storage (batch layer). This policy is configured in the AnalyticsXtreme Manager for each data object, or table.
You must define the following properties in the policy:
- typeName
- timeColumn
- speedPeriod
- batchDataSource
All other properties are optional, and you can leave the default values unless your specific environment has different requirements. See Table (Object) Properties below for a full list of the data life cycle policy properties and their descriptions.
The relationship between the data object and the data life cycle policy is one to one. You cannot have more than one policy per data object, and each policy applies to a single data object.
Defining the Data Source and Data Target
Part of configuring the data life cycle policy is defining the batchDataSource property. You can define it directly, or you can use a ref to point to the definition. If you are implementing AnalyticsXtreme in external data tiering mode, this is the only required property, because the data is simply evicted at the end of the life cycle.
If you are implementing AnalyticsXtreme in automatic data tiering mode, you must also define the batchDataTarget property using the same method. Towards the end of the speedPeriod, the data is moved to the target defined here.
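For external data tiering mode, a policy can therefore be sketched with only the required properties and no batchDataTarget. This variant reuses the class and bean names from the automatic-tiering example in this topic; the eviction-only combination itself is an assumption based on the description above:

```xml
<bean class="com.gigaspaces.analytics_xtreme.DataLifecyclePolicyFactoryBean">
    <property name="typeName" value="com.gigaspaces.demo.Trade"/>
    <property name="timeColumn" value="dateTimeTrade"/>
    <property name="speedPeriod" value="pt5h"/>
    <!-- No batchDataTarget: data is evicted at the end of the speedPeriod,
         and historical queries are served from the batchDataSource -->
    <property name="batchDataSource" ref="ax-datasource"/>
</bean>
```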
Example
The following code snippet from the pu.xml file shows a data life cycle policy that was configured for specific trading information that needs to be stored in the data grid for 5 hours, and then moved to external data storage. Both the data source and the data target have their full definitions in another area of the pu.xml file.
<list>
<!-- Data life cycle policy for Trade class -->
<bean class="com.gigaspaces.analytics_xtreme.DataLifecyclePolicyFactoryBean">
<property name="typeName" value="com.gigaspaces.demo.Trade"/>
<property name="timeColumn" value="dateTimeTrade"/>
<property name="speedPeriod" value="pt5h"/>
<property name="batchDataSource" ref="ax-datasource"/>
<property name="batchDataTarget" ref="ax-datatarget"/>
</bean>
<!-- Add a life cycle policy for each additional class -->
</list>
Configuring the Data Source
AnalyticsXtreme supports HDFS out of the box. However, you can add any external data source that is supported by Spark.
The following code snippet demonstrates how to configure AnalyticsXtreme to work with a Spark Hive data source.
<!-- Data source plugin based on Spark Hive -->
<bean id="ax-datasource" class="com.gigaspaces.analytics_xtreme.spark.SparkHiveBatchDataSourceFactoryBean">
<property name="sparkSessionProvider" ref="ax-sparkSessionFactory"/>
</bean>
The data source and the data target use the same sparkSessionFactory bean.
Configuring the Data Target
There are three possible implementations for the data target out of the box: JDBC, Spark, and Hive. Each is wrapped in a factory bean. The BatchDataTarget interface can be used for other data targets by implementing the feed method.
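The real com.gigaspaces.analytics_xtreme BatchDataTarget interface and the exact signature of feed are not spelled out in this topic, so the sketch below uses a hypothetical stand-in interface purely to illustrate the shape of a custom target; consult the API documentation for the actual contract.

```java
import java.util.ArrayList;
import java.util.List;

// CAUTION: this local interface is a stand-in. The signature of feed() is an
// assumption made for illustration only, not the real library contract.
interface BatchDataTarget {
    void feed(List<?> entries);
}

// A toy target that collects entries in memory; a real implementation would
// write each batch to an external store (JDBC, HDFS, and so on).
class InMemoryBatchDataTarget implements BatchDataTarget {
    private final List<Object> store = new ArrayList<>();

    @Override
    public void feed(List<?> entries) {
        // Called once per batch feed interval with the entries to persist.
        store.addAll(entries);
    }

    public int size() {
        return store.size();
    }
}
```

A custom target like this would then be registered as a bean and referenced from the policy's batchDataTarget property, in the same way as the factory-bean examples below.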
Spark Hive Batch Data Target Example
This example demonstrates the Spark Hive implementation.
<!-- Data target plugin based on Spark Hive -->
<bean id="ax-datatarget" class="com.gigaspaces.analytics_xtreme.spark.SparkHiveBatchDataTargetFactoryBean">
<property name="format" value="hive"/>
<property name="mode" value="append"/>
<property name="sparkSessionProvider" ref="ax-sparkSessionFactory"/>
</bean>
<!-- Spark session provider -->
<bean id="ax-sparkSessionFactory" class="org.insightedge.spark.SparkSessionProviderFactoryBean">
<property name="master" value="local[*]"/>
<property name="enableHiveSupport" value="true"/>
<property name="configOptions">
<map>
<entry key="hive.metastore.uris" value="thrift://hive-metastore:9083"/>
</map>
</property>
</bean>
JDBC Batch Data Target Example
This example demonstrates the JDBC implementation.
<bean id="ax-datasource" class="com.gigaspaces.analytics_xtreme.jdbc.JdbcBatchDataSourceFactoryBean">
<property name="connectionString" value="jdbc:hive2://hive-server:10000/;ssl=false"/>
</bean>
<bean id="ax-datatarget" class="com.gigaspaces.analytics_xtreme.jdbc.JdbcBatchDataTargetFactoryBean">
<property name="connectionString" value="jdbc:hive2://hive-server:10000/;ssl=false"/>
<property name="useLowerCase" value="true"/>
</bean>
Using a Different Apache Hive Version
Apache Spark is currently packaged with Apache Hive version 1.2.1. If you want to use a different Hive version for the batch data source or the batch data target, you need to do the following:
- Import the relevant JAR files.
- Add the following to the pu.xml under the sparkSessionFactory configOptions:
<property name="configOptions">
    <map>
        <entry key="hive.metastore.uris" value="thrift://hive-metastore:9083"/>
        <entry key="spark.sql.hive.metastore.version" value="2.3.2"/>
        <entry key="spark.sql.hive.metastore.jars" value="#{systemEnvironment['SPARK_HOME']}/jars/hive/*:#{systemEnvironment['SPARK_HOME']}/jars/*"/>
        <entry key="spark.ui.enabled" value="false"/>
    </map>
</property>
AnalyticsXtreme is tested and certified on Apache Hive version 2.3.2.
Table (Object) Properties
The DataLifecyclePolicy contains the following parameters. The first four properties are required. The others are optional, and it is recommended to leave the default values as is unless your business application environment has specific requirements.
Parameter | Description | Unit | Default Value | Required/Optional |
---|---|---|---|---|
typeName* | Name of the table/type/class to which this policy applies. | | | Required |
timeColumn** | Name of the column/property/field that contains the time data used to manage this policy. | | | Required |
speedPeriod | Time period or fixed timestamp. | | | Required |
batchDataSource | Endpoint for querying the batch layer. | | | Required |
batchDataTarget | Endpoint for feeding data to the batch layer. Only necessary when AnalyticsXtreme is implemented in automatic data tiering mode. | | | Optional |
timeFormat | Time format for the data in the timeColumn parameter. | See Java Time Pattern below | Java time format (see the Java API documentation) | Optional |
mutabilityPeriod | Period of time during which data can be updated. | Duration in time units or % of speedPeriod | 80% | Optional |
batchFeedInterval | Data is fed from the Space (speed layer) to the batch layer in these time-based intervals at the end of the speedPeriod, after the mutabilityPeriod has expired. | See Java Time Pattern below | pt1m | Optional |
batchFeedSize | Maximum data entries per batch feed interval. | Integer (number of entries) | 1000 | Optional |
evictionPollingInterval | Polling interval for querying and evicting each policy. | See Java Time Pattern below | pt1m | Optional |
evictionBuffer | Additional waiting period before evicting data from the Space (speed layer) after it was fed to the batch layer, so that long queries and clock differences don't cause errors or generate exceptions. | Duration in time units or % of speedPeriod | pt10m | Optional |
* This correlates to an object or JSON in the object store.
** This correlates to a property or entity in an object or JSON.
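To illustrate the optional properties, the Trade policy from the earlier example could be written out with explicit tuning values. The values shown simply restate the defaults from the table above; adjust them only if your environment requires it:

```xml
<bean class="com.gigaspaces.analytics_xtreme.DataLifecyclePolicyFactoryBean">
    <property name="typeName" value="com.gigaspaces.demo.Trade"/>
    <property name="timeColumn" value="dateTimeTrade"/>
    <property name="speedPeriod" value="pt5h"/>
    <property name="batchDataSource" ref="ax-datasource"/>
    <property name="batchDataTarget" ref="ax-datatarget"/>
    <!-- Optional tuning; these values restate the documented defaults -->
    <property name="mutabilityPeriod" value="80%"/>
    <property name="batchFeedInterval" value="pt1m"/>
    <property name="batchFeedSize" value="1000"/>
    <property name="evictionBuffer" value="pt10m"/>
</bean>
```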
Java Time Pattern
AnalyticsXtreme is a time-based feature, and therefore uses the Java time pattern (available as of Java 8) in the timeFormat property. In this pattern, every duration begins with the letter "p" (in either upper or lower case), and any duration of less than 24 hours must also include the letter "t" before its time components.
For example, pt15m represents a duration of 15 minutes, and pt10h represents a duration of 10 hours. A duration of 2 days should be written as p2d.
For more information about the Java time pattern, see the Java API documentation.
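These duration strings can be parsed with the standard java.time.Duration class, which is a quick way to sanity-check a value before putting it in the pu.xml (this is plain JDK behavior, not an AnalyticsXtreme API):

```java
import java.time.Duration;

public class DurationCheck {
    public static void main(String[] args) {
        // Duration.parse is case-insensitive, so the lower-case values used
        // in this topic (pt15m, pt10h, p2d) parse the same as PT15M etc.
        System.out.println(Duration.parse("pt15m").toMinutes()); // 15
        System.out.println(Duration.parse("pt10h").toHours());   // 10

        // Whole-day durations omit the "t": p2d is 48 hours
        System.out.println(Duration.parse("p2d").toHours());     // 48
    }
}
```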
Exporting the AnalyticsXtreme Manager Service
In order for client applications to access the speed layer, the AnalyticsXtreme Manager needs to be exported to enable discovery. This mechanism leverages the data grid's remoting mechanism, which registers the AnalyticsXtreme Manager as a remote service.
The following code snippet demonstrates how to implement this mechanism in the pu.xml.
<!-- Register the ax-manager bean as a remote service so clients can load configuration & stats from it -->
<os-remoting:service-exporter id="serviceExporter">
<os-remoting:service ref="ax-manager"/>
</os-remoting:service-exporter>
For more information about the data grid's remoting mechanism, see the Space-Based Remoting section of the developer guide.
Sample pu.xml File
The following pu.xml file shows all the above steps, including defining a Space.
<?xml version="1.0" encoding="UTF-8"?>
<beans xmlns="http://www.springframework.org/schema/beans"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xmlns:context="http://www.springframework.org/schema/context"
xmlns:os-core="http://www.openspaces.org/schema/core"
xmlns:os-remoting="http://www.openspaces.org/schema/remoting"
xsi:schemaLocation="
http://www.springframework.org/schema/beans http://www.springframework.org/schema/beans/spring-beans.xsd
http://www.springframework.org/schema/context http://www.springframework.org/schema/context/spring-context.xsd
http://www.openspaces.org/schema/core http://www.openspaces.org/schema/14.0/core/openspaces-core.xsd
http://www.openspaces.org/schema/remoting http://www.openspaces.org/schema/remoting/openspaces-remoting.xsd">
<context:annotation-config />
<os-core:annotation-support />
<os-core:embedded-space id="space" space-name="demo-space" />
<os-core:giga-space id="gigaSpace" space="space"/>
<bean id="ax-manager" class="com.gigaspaces.analytics_xtreme.server.AnalyticsXtremeManagerFactory">
<property name="config">
<bean class="com.gigaspaces.analytics_xtreme.AnalyticsXtremeConfigurationFactoryBean">
<!-- Verbose is recommended for getting started, usually turned off in production -->
<property name="verbose" value="true"/>
<property name="policies">
<list>
<ref bean="dataLifecyclePolicy"/>
</list>
</property>
<!-- more properties -->
</bean>
</property>
</bean>
<!-- Data life cycle policy for Trade class -->
<bean id="dataLifecyclePolicy" class="com.gigaspaces.analytics_xtreme.DataLifecyclePolicyFactoryBean">
<property name="typeName" value="com.gigaspaces.demo.Trade"/>
<property name="timeColumn" value="dateTimeTrade"/>
<property name="speedPeriod" value="pt5h"/>
<property name="batchDataSource" ref="ax-datasource"/>
<property name="batchDataTarget" ref="ax-datatarget"/>
</bean>
<!-- Register the ax-manager bean as a remote service so clients can load configuration & stats from it -->
<os-remoting:service-exporter id="serviceExporter">
<os-remoting:service ref="ax-manager"/>
</os-remoting:service-exporter>
<!-- Data source plugin based on Spark Hive -->
<bean id="ax-datasource" class="com.gigaspaces.analytics_xtreme.spark.SparkHiveBatchDataSourceFactoryBean">
<property name="sparkSessionProvider" ref="ax-sparkSessionFactory"/>
</bean>
<!-- Data target plugin based on Spark Hive -->
<bean id="ax-datatarget" class="com.gigaspaces.analytics_xtreme.spark.SparkHiveBatchDataTargetFactoryBean">
<property name="format" value="hive"/>
<property name="mode" value="append"/>
<property name="sparkSessionProvider" ref="ax-sparkSessionFactory"/>
</bean>
<!-- Spark session provider -->
<bean id="ax-sparkSessionFactory" class="org.insightedge.spark.SparkSessionProviderFactoryBean">
<property name="master" value="local[*]"/>
<property name="enableHiveSupport" value="true"/>
<property name="configOptions">
<map>
<entry key="hive.metastore.uris" value="thrift://hive-metastore:9083"/>
</map>
</property>
</bean>
<!-- To use jdbc instead of spark, replace ax-datasource, ax-datatarget and sparkSessionFactory with these beans
<bean id="ax-datasource" class="com.gigaspaces.analytics_xtreme.jdbc.JdbcBatchDataSourceFactoryBean">
<property name="connectionString" value="jdbc:hive2://hive-server:10000/;ssl=false"/>
</bean>
<bean id="ax-datatarget" class="com.gigaspaces.analytics_xtreme.jdbc.JdbcBatchDataTargetFactoryBean">
<property name="connectionString" value="jdbc:hive2://hive-server:10000/;ssl=false"/>
<property name="useLowerCase" value="true"/>
</bean>
-->
</beans>
Limitations
Hot deploy of policy changes/updates is not supported. The Processing Unit must be undeployed and then redeployed.
Monitoring
You can use the following metrics to monitor how AnalyticsXtreme is functioning.
Metric | Description | Type |
---|---|---|
evicted-entries | Total number of data entries evicted from the Space. | Long |
copied-entries | Total number of data entries copied to the batch layer (external data storage). | Long |
{type-name}_evicted-entries | Total number of type_name data entries evicted from the Space. | Long |
{type-name}_copied-entries | Total number of type_name entries copied to the batch layer. | Long |
{type-name}_confirmed-feed-timestamp | Last confirmed time stamp of when data entries of type_name were copied to the batch layer. | Time stamp |