Developing Your First Spark Application
This topic explains how to create a GigaSpaces application that can read from and write to the Data Grid. You should have a basic knowledge of Apache Spark.
For instructions on how to install a minimum GigaSpaces cluster setup and launch it, refer to Local Machine Setup.
Project Dependencies
GigaSpaces 14.5 runs on Spark 2.4 and Scala 2.11.8. These dependencies will be included when you depend on the GigaSpaces artifacts.
GigaSpaces .jars are not yet published to the Maven Central Repository. To install the Maven artifacts into your local Maven repository, run the following command from the $GS_HOME/insightedge/tools/maven directory:
./gs.sh maven install
For SBT projects, include the following:
resolvers += Resolver.mavenLocal
resolvers += "Openspaces Maven Repository" at "http://maven-repository.openspaces.org"
libraryDependencies += "org.gigaspaces.insightedge" % "insightedge-core" % "14.5.0" % "provided" exclude("javax.jms", "jms")
And if you are building with Maven:
<dependency>
    <groupId>org.gigaspaces.insightedge</groupId>
    <artifactId>insightedge-core</artifactId>
    <version>14.5.0</version>
    <scope>provided</scope>
</dependency>
GigaSpaces .jars are already packaged in the GigaSpaces distribution and are loaded automatically with your application when you submit it with the insightedge-submit script or run it from the Web Notebook. As such, there is no need to pack them into your uber .jar. However, if you want to run Spark in local[*] mode, declare the dependencies with the compile scope.
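For local[*] mode, the SBT declaration shown earlier could be adapted by dropping the "provided" qualifier, because compile is SBT's default configuration. The fragment below is a sketch of that change only, not a tested build definition; the coordinates and the exclusion are copied from the declaration above.

```scala
// build.sbt fragment (sketch): same coordinates as the "provided" declaration,
// but with no configuration qualifier, so SBT uses its default compile scope.
resolvers += Resolver.mavenLocal

libraryDependencies += "org.gigaspaces.insightedge" % "insightedge-core" % "14.5.0" exclude("javax.jms", "jms")
```

With Maven, the equivalent change would be to set the scope element to compile, or to omit it, since compile is also Maven's default scope.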
Developing a Spark Application
GigaSpaces provides an extension to the regular Spark API.
Read the Self-Contained Applications topic in the Apache Spark documentation if you are new to Spark.
InsightEdgeConfig is the starting point for connecting Spark with the Data Grid. Create the InsightEdgeConfig and the SparkContext:
import org.apache.spark.{SparkConf, SparkContext}
import org.insightedge.spark.context.InsightEdgeConfig
import org.insightedge.spark.implicits.all._

val sparkConf = new SparkConf()
  .setAppName("sample-app")
  .setMaster("spark://127.0.0.1:7077")
  .setInsightEdgeConfig(InsightEdgeConfig("demo"))
val sc = new SparkContext(sparkConf)
It is important to import org.insightedge.spark.implicits.all._ to enable the Data Grid specific API.
demo is the default Data Grid name that the demo mode starts automatically.
When you are running Spark applications from the Web Notebook, InsightEdgeConfig is created implicitly with the properties defined in the Spark interpreter.
Modeling Data Grid Objects
Create a case class Product.scala to represent a Product entity in the Data Grid:
import org.insightedge.scala.annotation._
import scala.beans.{BeanProperty, BooleanBeanProperty}

case class Product(
  @BeanProperty @SpaceId var id: Long,
  @BeanProperty var description: String,
  @BeanProperty var quantity: Int,
  @BooleanBeanProperty var featuredProduct: Boolean
) {
  def this() = this(-1, null, -1, false)
}
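To see what the @BeanProperty and @BooleanBeanProperty annotations contribute, the sketch below uses a simplified stand-in class. ProductSketch and BeanPropertyDemo are hypothetical names introduced for illustration, and the @SpaceId annotation is left out so that the example compiles with the Scala standard library alone. The annotations generate Java-style getters and setters, and the no-argument constructor lets the class be instantiated without supplying field values, which is presumably what the Data Grid relies on when it creates objects.

```scala
import scala.beans.{BeanProperty, BooleanBeanProperty}

// Simplified stand-in for Product: same fields, but without @SpaceId,
// so no GigaSpaces dependency is needed to compile it.
case class ProductSketch(
  @BeanProperty var id: Long,
  @BeanProperty var description: String,
  @BeanProperty var quantity: Int,
  @BooleanBeanProperty var featuredProduct: Boolean
) {
  // No-argument constructor with placeholder values
  def this() = this(-1, null, -1, false)
}

object BeanPropertyDemo {
  def main(args: Array[String]): Unit = {
    val p = new ProductSketch()      // uses the no-argument constructor
    p.setId(42L)                     // setter generated by @BeanProperty
    p.setDescription("Sample product")
    p.setQuantity(7)
    p.setFeaturedProduct(true)       // setter generated by @BooleanBeanProperty
    // @BooleanBeanProperty generates an is-prefixed getter instead of a get-prefixed one
    println(s"${p.getId} ${p.getDescription} ${p.getQuantity} ${p.isFeaturedProduct}")
  }
}
```

The fields are declared as var because setters are only generated for mutable fields.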
Saving to the Data Grid
To save a Spark RDD, use the saveToGrid method.
import scala.util.Random

val rdd = sc.parallelize(1 to 1000).map(i => Product(i, "Description of product " + i, Random.nextInt(10), Random.nextBoolean()))
rdd.saveToGrid()
Loading and Analyzing Data from the Data Grid
Use the gridRdd method of the SparkContext to view Data Grid objects as Spark RDDs.
val gridRdd = sc.gridRdd[Product]()
println("total products quantity: " + gridRdd.map(_.quantity).sum())
Closing the Spark Context
When you are done, close the Spark context and all connections to the Data Grid with the following command:
sc.stopInsightEdgeContext()
Under the hood, this calls the regular Spark sc.stop() command, so there is no need to call it manually.
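The steps in this topic can be assembled into a single application along the following lines. This is a minimal sketch rather than a reference implementation: SampleApp is a hypothetical object name, the try/finally wrapper is an added design choice, and the code assumes the GigaSpaces and Spark dependencies are on the classpath and that a Data Grid named demo is running. All Data Grid calls are the ones shown earlier in this topic.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.insightedge.scala.annotation._
import org.insightedge.spark.context.InsightEdgeConfig
import org.insightedge.spark.implicits.all._

import scala.beans.{BeanProperty, BooleanBeanProperty}
import scala.util.Random

// Data Grid entity, as defined in "Modeling Data Grid Objects"
case class Product(
  @BeanProperty @SpaceId var id: Long,
  @BeanProperty var description: String,
  @BeanProperty var quantity: Int,
  @BooleanBeanProperty var featuredProduct: Boolean
) {
  def this() = this(-1, null, -1, false)
}

// Hypothetical entry point (the name is an assumption for illustration)
object SampleApp {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf()
      .setAppName("sample-app")
      .setMaster("spark://127.0.0.1:7077")
      .setInsightEdgeConfig(InsightEdgeConfig("demo"))
    val sc = new SparkContext(sparkConf)

    try {
      // Write 1000 generated products to the Data Grid
      val rdd = sc.parallelize(1 to 1000).map { i =>
        Product(i, "Description of product " + i, Random.nextInt(10), Random.nextBoolean())
      }
      rdd.saveToGrid()

      // Read them back as an RDD and aggregate
      val gridRdd = sc.gridRdd[Product]()
      println("total products quantity: " + gridRdd.map(_.quantity).sum())
    } finally {
      // Closes the Data Grid connections and stops the Spark context,
      // even if the job above throws
      sc.stopInsightEdgeContext()
    }
  }
}
```

Wrapping the job in try/finally is optional; it only ensures the context is closed when a step fails.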
Running your Spark Application
After you have packaged a .jar, submit the Spark job using insightedge-submit, located in $GS_HOME/insightedge/bin, instead of spark-submit, as follows:
insightedge-submit --class com.insightedge.spark.example.YourMainClass --master spark://127.0.0.1:7077 path/to/jar/insightedge-examples.jar