Configuring AnalyticsXtreme on the Client Side

After the server side has been configured, you need to set up your client application to use AnalyticsXtreme. You must configure the JDBC connection string to enable communication between the client and the data grid, and then you can run queries using the following:

  • Spark/JDBC API from any client application.
  • Web notebook - while you can configure any web notebook to work with AnalyticsXtreme, we recommend using the GigaSpaces Zeppelin notebook that comes packaged with GigaSpaces, which contains a custom JDBC interpreter.

For more information about the InsightEdge Apache Zeppelin notebook, see the Using the GigaSpaces Web Notebook topic.

Configuring the JDBC Connection String

To configure the client side to work with AnalyticsXtreme, you need to add the analyticsXtreme.enabled=true property to the JDBC connection string, as described in the GigaSpaces Legacy JDBC Driver topic, in the Integrating with BI Tools section of the developer guide. When this property is true, all queries (Spark, JDBC, etc.) pass through the AnalyticsXtreme engine.
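For example, a client application can open a standard JDBC connection with this property enabled. The following sketch is illustrative only; the exact connection URL format and driver class are described in the GigaSpaces Legacy JDBC Driver topic, and the speedSpace and Trade names match the Spark examples later in this topic.

import java.sql.DriverManager
import java.util.Properties

// Illustrative URL; see the GigaSpaces Legacy JDBC Driver topic for the exact
// format for your driver version. The essential part is setting the
// analyticsXtreme.enabled property to true, which routes all queries through
// the AnalyticsXtreme engine.
val props = new Properties()
props.setProperty("analyticsXtreme.enabled", "true")

val connection = DriverManager.getConnection("jdbc:gigaspaces:url:jini://*/*/speedSpace", props)
try {
  val statement = connection.createStatement()
  val resultSet = statement.executeQuery("SELECT volume FROM Trade WHERE volume < 1000")
  while (resultSet.next()) {
    println(resultSet.getLong("volume"))
  }
} finally {
  connection.close()
}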

Using Apache Spark with AnalyticsXtreme

GigaSpaces provides a simple API to initialize Spark over AnalyticsXtreme. Initializing Spark to work with AnalyticsXtreme is a two-step process:

  1. Connect Spark to the data grid.
  2. Create a Spark DataFrame with AnalyticsXtreme enabled.

To implement the AnalyticsXtreme API, add the following Scala code snippet to any client code that needs to access AnalyticsXtreme. This code specifies the Space, initializes Spark, and creates the Spark DataFrame (per class).

import org.insightedge.spark.implicits.all._
import org.insightedge.spark.context.InsightEdgeConfig
import org.apache.spark.sql.DataFrame

// spaceName is required
val ieConfig = new InsightEdgeConfig(spaceName = "speedSpace")

// Spark context initialization
spark.initializeInsightEdgeContext(ieConfig)

// Create the AnalyticsXtreme DataFrame object for the specific class
val analyticsXtremeDataframe: DataFrame = spark.analyticsXtreme("Trade")

For more information about Spark data source APIs such as the DataFrame, see the Spark documentation.

Spark Query Execution

The way in which Spark executes a query depends on how the DataFrame is initialized: as a table, or as a raw query. Simple queries behave identically in both cases, while queries that contain aggregations, such as AVG, differ significantly in behavior depending on whether they are submitted as a table (and therefore executed on the client side) or as a raw query (and therefore executed on the server side). The latter method is called predicate pushdown.

Simple Query

The following code snippet contains a simple query submitted to the Spark DataFrame as a table.

val volumeDataFrame = analyticsXtremeDataframe.select("volume").where("volume < 1000")

This next code snippet contains the same simple query, but submitted as a raw query.

val rawVolumeDataFrame = spark.read.analyticsXtreme("SELECT volume FROM Trade WHERE volume < 1000")

The behavior is the same in both cases because the query only retrieves data; no additional processing is needed to return the required result.

Query with Aggregation

For a more complex query that includes aggregations, the method by which the query is submitted causes important behavioral differences.

In the code snippet below, the query is submitted to the Spark DataFrame as a table. Only the raw data is retrieved from the data grid; the client application must perform the aggregation itself in order to finish executing the query and return results.

val aggregationDataFrame = analyticsXtremeDataframe.agg("volume" -> "avg", "endPrice" -> "max", "startPrice" -> "min")

When the query is submitted as a raw query, like in the following code snippet, Spark pushes the entire query (predicate) to the data grid. The query is executed from beginning to end in the data grid (both data retrieval and aggregation), and only the final results are returned to the client application. This predicate pushdown can significantly reduce the processing time of the query because it filters data out at the source, so that the query can complete faster and return only the required data back to the client application.

val rawAggregationDataFrame = spark.read.analyticsXtreme("SELECT AVG(volume) as avg_volume, MAX(endPrice) as max_endPrice, MIN(startPrice) as min_startPrice FROM Trade")

Query with Joins

AnalyticsXtreme supports using JOIN in queries that meet the following requirements:

  • The query is limited to two tables
  • Both tables have similar AnalyticsXtreme policies
  • Both tables use the same time thresholds in their policies

The following query types are supported:

  • SELECT * FROM t1 JOIN t2 ON t1.id=t2.id WHERE ...
  • SELECT * FROM t1, t2 WHERE t1.id=t2.id ...
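
For example, a supported inner JOIN can be submitted as a raw query through the same Spark API shown above. The t1 and t2 table names and the id join column in this sketch mirror the patterns in the list above; substitute the actual table names and keys from your data model.

val joinDataFrame = spark.read.analyticsXtreme("SELECT * FROM t1 JOIN t2 ON t1.id = t2.id WHERE t1.id > 100")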

Limitations

The following query types are not supported:

  • Diagonal queries between the speed and batch layers (only same-layer JOINs are executed)
  • Outer JOIN (only inner JOIN is executed)
  • Aggregations (sum, count, avg, max, min)
  • order by
  • group by
  • limit

Supported SQL Query Functions

The following SQL keywords and operations can be used with AnalyticsXtreme:

  • The SELECT SQL statement
  • AND/OR operators to join two or more conditions in a WHERE clause
  • NOT IN and NOT NULL
  • Aggregate functions: COUNT, MAX, MIN, SUM, AVG
  • All basic logical operations to create conditions: =, <>, <, >, >=, <=, [NOT] LIKE, IS [NOT] NULL, IN, BETWEEN
  • ORDER BY for multiple columns
  • GROUP BY for multiple columns
  • Column aliases
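
For example, the following raw query sketch combines several of the supported features (aggregate functions, column aliases, BETWEEN, GROUP BY, and ORDER BY). The symbol column is a hypothetical field on the Trade class, used here only for illustration.

val summaryDataFrame = spark.read.analyticsXtreme(
  "SELECT symbol, COUNT(*) AS trades, AVG(volume) AS avg_volume " +
  "FROM Trade WHERE volume BETWEEN 100 AND 1000 " +
  "GROUP BY symbol ORDER BY avg_volume")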

Limitations

This feature is intended for querying data that is fed into the target database(s) by other sources. It cannot be used to alter the data (INSERT, UPDATE, DELETE).

The following SQL query functionality is not supported in this release:

  • The HAVING SQL keyword
  • TIMESTAMP formats other than 'yyyy-MM-dd HH:mm:ss'
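
For example, a time-based filter must use the supported TIMESTAMP format, as in the following sketch. The tradeTime column is a hypothetical timestamp field on the Trade class, used here only for illustration.

val recentTrades = spark.read.analyticsXtreme("SELECT * FROM Trade WHERE tradeTime > TIMESTAMP '2019-01-01 00:00:00'")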