Configuring AnalyticsXtreme on the Client Side
After the server side has been configured, you need to set up your client application to use AnalyticsXtreme. You must configure the JDBC connection string to enable communication between the client and the data grid, and then you can run queries using the following:
- Spark/JDBC API from any client application.
- Web notebook - while you can configure any web notebook to work with AnalyticsXtreme, we recommend using the InsightEdge Zeppelin notebook that comes packaged with InsightEdge, which contains a custom JDBC interpreter.
For more information about the InsightEdge Apache Zeppelin notebook, see the Using the InsightEdge Web Notebook topic.
Configuring the JDBC Connection String
To configure the client side to work with AnalyticsXtreme, you need to add the
analyticsXtreme.enabled=true property to the JDBC connection string, as described in the SQL Driver topic, in the Running Analytics section of the developer guide. When this property is true, all queries (Spark, JDBC, etc.) pass through the AnalyticsXtreme engine.
Using Apache Spark with AnalyticsXtreme
InsightEdge provides a simple API to initialize Spark over AnalyticsXtreme. Initializing Spark to work with AnalyticsXtreme is a two-step process:
- Connect Spark to the data grid.
- Create a Spark Dataframe with AnalyticsXtreme enabled.
To implement the AnalyticsXtreme API, add the following Scala code snippet to any client code that will have to access AnalyticsXtreme. This code specifies the Space, initializes Spark, and creates the Spark Dataframe (per class).
import org.insightedge.spark.implicits.all._ import org.insightedge.spark.context.InsightEdgeConfig //spaceName is required val ieConfig = new InsightEdgeConfig(spaceName = "speedSpace") //spark context initialization spark.initializeInsightEdgeContext(ieConfig) //create the AX dataframe object for the specific class val analyticsXtremeDataframe: Dataframe = spark.analyticsXtreme("Trade")
For more information about Spark data source APIs such as the Dataframe, see the Spark documentation.
Spark Query Execution
The way in which Spark executes a query depends on how the Dataframe is initialized (as a table or as a raw query). Simple queries are identical in behavior, while queries that contain aggregations, such as
AVERAGE, have significant differences in behavior depending on whether they are submitted as a table and are therefore executed on the client side, or as a raw query and therefore executed on the server side. This second method is called predicate pushdown.
The following code snippet contains a simple query submitted to the Spark Dataframe as a table.
val volumeDataFrame = analyticsXtremeDataframe.select("volume").where("volume < 1000")
This next code snippet contains the same simple query, but submitted as a raw query.
val rawVolumeDataFrame = spark.read.analyticsXtreme("SELECT volume FROM Trade WHERE volume < 1000")
The behavior is the same in both cases, because only data retrieval is required without any additional action in order to return the required result.
Query with Aggregation
For a more complex query that includes aggregations, the method by which the query is submitted causes important behavioral differences.
In the code snippet below, the query is submitted to the Spark Dataframe as a table. Only the raw data is retrieved from the data grid; the client application has to perform the aggregation in order to finish executing the query and return results.
val aggregationDataFrame = analyticsXtremeDataframe.agg("volume -> avg", "endPrice -> max", "startPrice -> min")
When the query is submitted as a raw query, like in the following code snippet, Spark pushes the entire query (predicate) to the data grid. The query is executed from beginning to end in the data grid (both data retrieval and aggregation), and only the final results are returned to the client application. This predicate pushdown can significantly reduce the processing time of the query because it filters data out at the source, so that the query can complete faster and return only the required data back to the client application.
val rawAggregationDataFrame = spark.read.analyticsXtreme("SELECT AVERAGE(volume) as avg_volume, MAX(endPrice) as max_endPrice, MIN(endPrice) as min_endPrice FROM Trade")
Query with Joins
JOIN queries that meet the following requirements:
- The query is limited to two tables
- Both tables use the same timestamp
- Both tables have similar AnalyticXtreme policies
The following query types are supported:
Select * from t1 join t2 on t1.id=t2.id where...
Select * from t1, t2 where t1.id=t2.id
Select t1.<variable1> as <variable1>, t2.<variable2> as <var2> from t1, t2 where t1.id=t2.id
Select * from t1 join t2 on t1.id=t2.id and t1.<variableId>=t2.<variableId>
Select * from t1 join t2 on t1.id=t2.id where t1.id<x or t2.[varId]<y(where x>y)
Select * from t1 join t2 on t1.id=t2.id where t1.id<xand t2.[varId]<y(where x>y)
Select t1.column1, t2.column2 from t1 join t2 on t1.id=t2.id
The following query types are not supported:
- Diagonal queries between the speed and batch layers (only same-layer JOINs are executed)
- Outer layer JOIN (only inner JOIN is executed)
min from t1 join t2 on t1.id=t2.id
Select * from t1 join t2 ont1.id=t2.id order by t2.id desc
Supported SQL Query Functions
The following SQL keywords and operations can be used with AnalyticsXtreme:
AND/ORoperators to join two or more conditions in a
- Aggregate functions:
- All basic logical operations to create conditions: =, <>, <,>, >=, <=,
is [NOT] null,
ORDER BYfor multiple columns
GROUP BYfor multiple columns
- Column aliases
This feature is meant for querying data that is being fed by other sources into the target database(s). It cannot be used to alter the data (
The following SQL query functionality is not supported in this release:
- SQL keywords:
- Only the TIMESTAMP 'yyyy-MM-dd HH:mm:ss' format is supported in the SQL query