Developing Your First Spark Application

This topic explains how to create a GigaSpaces application that reads from and writes to the Data Grid. You should have a basic knowledge of Apache Spark.

For instructions on how to install a minimum GigaSpaces cluster setup and launch it, refer to Local Machine Setup.

Project Dependencies

GigaSpaces 14.5 runs on Spark 2.4 and Scala 2.11.8. These dependencies will be included when you depend on the GigaSpaces artifacts.

GigaSpaces .jars are not yet published to the Maven Central Repository. To install the Maven artifacts, run the following command from the $GS_HOME/insightedge/tools/maven directory:

./gs.sh maven install

For SBT projects, include the following:

resolvers += Resolver.mavenLocal
resolvers += "Openspaces Maven Repository" at "http://maven-repository.openspaces.org"

libraryDependencies += "org.gigaspaces.insightedge" % "insightedge-core" % "14.5.0" % "provided" exclude("javax.jms", "jms")

If you are building with Maven, add the following dependency:

<dependency>
    <groupId>org.gigaspaces.insightedge</groupId>
    <artifactId>insightedge-core</artifactId>
    <version>14.5.0</version>
    <scope>provided</scope>
</dependency>

GigaSpaces .jars are already packaged in the GigaSpaces distribution, and are automatically loaded with your application if you submit it with the insightedge-submit script or run it from the Web Notebook. As such, there is no need to pack them into your uber .jar. However, if you want to run Spark in local[*] mode, the dependencies should be declared with the compile scope.
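As a rough sketch of the local[*] case, the SBT dependency shown earlier can be declared without the "provided" qualifier, which leaves it in SBT's default compile scope. This is a build.sbt fragment only, and it assumes the artifacts were already installed into the local Maven repository with the command above.

```scala
// build.sbt fragment (sketch): same coordinates as above, but without "provided",
// so the dependency falls into SBT's default compile scope for local[*] runs.
resolvers += Resolver.mavenLocal

libraryDependencies += "org.gigaspaces.insightedge" % "insightedge-core" % "14.5.0" exclude("javax.jms", "jms")
```

With Maven, the equivalent change would be to remove the <scope>provided</scope> line, since compile is Maven's default scope.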

Developing a Spark Application

GigaSpaces provides an extension to the regular Spark API.

Read the Self-Contained Applications topic in the Apache Spark documentation if you are new to Spark.

InsightEdgeConfig is the starting point in connecting Spark with the Data Grid. Create the InsightEdgeConfig and the SparkContext:

import org.apache.spark.{SparkConf, SparkContext}
import org.insightedge.spark.context.InsightEdgeConfig
import org.insightedge.spark.implicits.all._

val sparkConf = new SparkConf()
    .setAppName("sample-app")
    .setMaster("spark://127.0.0.1:7077")
    .setInsightEdgeConfig(InsightEdgeConfig("demo"))
val sc = new SparkContext(sparkConf)

You must import org.insightedge.spark.implicits.all._ to enable the Data Grid-specific API.

demo is the name of the default Data Grid, which demo mode starts automatically.

When you are running Spark applications from the Web Notebook, InsightEdgeConfig is created implicitly with the properties defined in the Spark interpreter.

Modeling Data Grid Objects

Create a case class in Product.scala to represent a Product entity in the Data Grid:

import org.insightedge.scala.annotation._
import scala.beans.{BeanProperty, BooleanBeanProperty}

case class Product(
    @BeanProperty @SpaceId var id: Long,
    @BeanProperty var description: String,
    @BeanProperty var quantity: Int,
    @BooleanBeanProperty var featuredProduct: Boolean
) {
    def this() = this(-1, null, -1, false)
}

Saving to the Data Grid

To save a Spark RDD, use the saveToGrid method.

import scala.util.Random

val rdd = sc.parallelize(1 to 1000).map(i => Product(i, "Description of product " + i, Random.nextInt(10), Random.nextBoolean()))
rdd.saveToGrid()

Loading and Analyzing Data from the Data Grid

Use the gridRdd method of the SparkContext to view Data Grid objects as Spark RDDs.

val gridRdd = sc.gridRdd[Product]()
println("total products quantity: " + gridRdd.map(_.quantity).sum())
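Because the grid-backed RDD is used like any other Spark RDD, further analysis can be written with standard RDD operations. The following is a minimal sketch, not part of the GigaSpaces API: ProductStats and featuredSummary are hypothetical names introduced for illustration, and the code assumes the Product class defined above is available in the same package.

```scala
import org.apache.spark.rdd.RDD

// Hypothetical helper (not part of GigaSpaces): summarizes an RDD of Product objects.
object ProductStats {

  /** Returns (number of featured products, total quantity keyed by the featured flag). */
  def featuredSummary(products: RDD[Product]): (Long, Map[Boolean, Long]) = {
    // Count only the products marked as featured.
    val featuredCount = products.filter(_.featuredProduct).count()

    // Sum quantities separately for featured and non-featured products.
    val quantityByFlag = products
      .map(p => (p.featuredProduct, p.quantity.toLong))
      .reduceByKey(_ + _)
      .collect()
      .toMap

    (featuredCount, quantityByFlag)
  }
}
```

Assuming the SparkContext and implicits from the earlier steps are in scope, it could be called as ProductStats.featuredSummary(sc.gridRdd[Product]()). Taking a plain RDD[Product] as the parameter keeps the helper usable with RDDs that do not come from the Data Grid.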

Closing the Spark Context

When you are done, close the Spark context and all connections to the Data Grid with the following command:

sc.stopInsightEdgeContext()

Under the hood, this will call the regular Spark sc.stop() command, so there is no need to call it manually.
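The steps above can be assembled into a single self-contained application. The sketch below reuses only the calls shown in this topic; the object name ProductApp is a hypothetical choice, and the master URL and the demo Data Grid name are the sample values used earlier, so adjust them to your environment. Wrapping the work in try/finally is a design choice made here so that the context is closed even if a step fails.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.insightedge.scala.annotation._
import org.insightedge.spark.context.InsightEdgeConfig
import org.insightedge.spark.implicits.all._

import scala.beans.{BeanProperty, BooleanBeanProperty}
import scala.util.Random

// Data Grid entity, as modeled earlier in this topic.
case class Product(
    @BeanProperty @SpaceId var id: Long,
    @BeanProperty var description: String,
    @BeanProperty var quantity: Int,
    @BooleanBeanProperty var featuredProduct: Boolean
) {
    def this() = this(-1, null, -1, false)
}

// ProductApp is a hypothetical name for the application's main class.
object ProductApp {
    def main(args: Array[String]): Unit = {
        // Connect Spark to the "demo" Data Grid (sample values from this topic).
        val sparkConf = new SparkConf()
            .setAppName("sample-app")
            .setMaster("spark://127.0.0.1:7077")
            .setInsightEdgeConfig(InsightEdgeConfig("demo"))
        val sc = new SparkContext(sparkConf)

        try {
            // Save 1000 generated products to the Data Grid.
            val rdd = sc.parallelize(1 to 1000).map(i =>
                Product(i, "Description of product " + i, Random.nextInt(10), Random.nextBoolean()))
            rdd.saveToGrid()

            // Read the products back as an RDD and aggregate them.
            val gridRdd = sc.gridRdd[Product]()
            println("total products quantity: " + gridRdd.map(_.quantity).sum())
        } finally {
            // Closes the Data Grid connections and stops the Spark context.
            sc.stopInsightEdgeContext()
        }
    }
}
```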

Running Your Spark Application

After you have packaged a .jar, submit the Spark job with insightedge-submit, located in $GS_HOME/insightedge/bin, instead of spark-submit, as follows:

insightedge-submit --class com.insightedge.spark.example.YourMainClass --master spark://127.0.0.1:7077 path/to/jar/insightedge-examples.jar