Running a Spark Job in Kubernetes

The InsightEdge Platform provides first-class integration between Apache Spark and the XAP core data grid capability. This enables hybrid transactional/analytical processing by co-locating Spark jobs with low-latency data grid applications. InsightEdge includes a full Spark distribution.

Apache Spark 2.3.x has native Kubernetes support. Users can run Spark workloads in an existing Kubernetes 1.9+ cluster and take advantage of Apache Spark’s ability to manage distributed data processing tasks.

The Spark submission mechanism creates a Spark driver running within a Kubernetes pod. The driver creates executors running within Kubernetes pods, connects to them, and executes application code. Using InsightEdge, application code can connect to a Data Pod and interact with the distributed data grid.

This topic explains how to run the Apache Spark SparkPi example and the InsightEdge SaveRDD example, which is one of the basic Scala examples provided in the InsightEdge software package. Perform the following steps, described in the sections below, to run these examples in Kubernetes:

  1. Set the Spark configuration property for the InsightEdge Docker image.
  2. Get the Kubernetes Master URL for submitting the Spark jobs to Kubernetes.
  3. Configure the Kubernetes service account so it can be used by the Driver Pod.
  4. Deploy a data grid with a headless service (Lookup locator).
  5. Submit the Spark jobs for the examples.
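The steps above can be sketched as a dry run. This is a hedged outline rather than an exact transcript: the commands are echoed instead of executed, and the `<Kubernetes Master URL>` placeholder must be replaced with your cluster's actual URL.

```shell
# Dry-run outline of the workflow; nothing here contacts a cluster.
PLAN=$(cat <<'EOF'
kubectl cluster-info
kubectl create serviceaccount spark
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default
helm install insightedge --name demo
./insightedge-submit --master k8s://<Kubernetes Master URL> --deploy-mode cluster ...
EOF
)
echo "$PLAN"
```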

Setting Spark Configuration Property

InsightEdge provides a Docker image designed to be used in a container runtime environment, such as Kubernetes. This Docker image is used in the examples below to demonstrate how to submit the Apache Spark SparkPi example and the InsightEdge SaveRDD example.

The Spark configuration property spark.kubernetes.container.image is required when submitting Spark jobs for an InsightEdge application. Note how this property is applied in the examples in the Submitting Spark Jobs section:

--conf spark.kubernetes.container.image=gigaspaces/insightedge-enterprise:14.5

Getting the Kubernetes Master URL

You can get the Kubernetes master URL using kubectl. Type the following command to print out the URL that will be used in the Spark and InsightEdge examples when submitting Spark jobs to the Kubernetes scheduler.

kubectl cluster-info

Sample output:

Kubernetes master is running at https://<Kubernetes Master URL>:<port>
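If you script this step, the master URL can be extracted from the kubectl cluster-info line and given Spark's k8s:// prefix. A minimal sketch, using an illustrative URL rather than real cluster output:

```shell
# Illustrative cluster-info line; a real cluster prints its own URL.
INFO_LINE="Kubernetes master is running at https://203.0.113.10:6443"
MASTER_URL="${INFO_LINE##* }"       # keep the last whitespace-separated field (the URL)
SPARK_MASTER="k8s://${MASTER_URL}"  # Spark's --master value requires the k8s:// prefix
echo "$SPARK_MASTER"
```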

Configuring the Kubernetes Service Accounts

In Kubernetes clusters with RBAC enabled, users can configure the Kubernetes RBAC roles and service accounts that are used by the various Spark on Kubernetes components to access the Kubernetes API server.

Spark on Kubernetes supports specifying a custom service account for use by the Driver Pod via the spark.kubernetes.authenticate.driver.serviceAccountName configuration property, which is passed as part of the submit command. To create a custom service account, run the following kubectl command:

kubectl create serviceaccount spark

After the custom service account is created, you need to grant it a role. To grant a role to a service account, a RoleBinding or ClusterRoleBinding is needed; create one with the kubectl create rolebinding (or clusterrolebinding) command. For example, the following command creates a ClusterRoleBinding named spark-role that grants the edit ClusterRole to the spark service account you created above, in the default namespace:

kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=default:spark --namespace=default

After the service account has been created and configured, you can apply it in the Spark submit:

--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark
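Before submitting, you can check that the new service account is allowed to create pods. The sketch below only assembles the identity string; kubectl auth can-i is a standard kubectl subcommand, and the namespace and account names follow the example above:

```shell
# Build the full identity of the service account as the API server sees it.
NAMESPACE=default
SA=spark
IDENTITY="system:serviceaccount:${NAMESPACE}:${SA}"
# Against a live cluster you would run:
#   kubectl auth can-i create pods --as="$IDENTITY"
echo "$IDENTITY"
```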

Deploying the Data Grid on Kubernetes

Run the following Helm command in the command window to start a basic data grid called demo:

helm install insightedge --name demo

For the application to connect to the demo data grid, the name of the manager (the headless service of the Manager Pod) must be provided as the lookup locator. This is required when running on a full Kubernetes cluster (as opposed to minikube).
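The locator can be derived from the Helm release name, assuming the chart's naming pattern <release name>-insightedge-manager-hs used in the SaveRDD example in this topic:

```shell
# Derive the manager's headless service name from the Helm release name.
RELEASE=demo
LOCATOR="${RELEASE}-insightedge-manager-hs"
echo "$LOCATOR"
```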


Submitting Spark Jobs with InsightEdge Submit

The insightedge-submit script is located in the InsightEdge home directory, in insightedge/bin. This script is similar to the spark-submit command used by Spark users to submit Spark jobs. The following examples run both a pure Spark example and an InsightEdge example by calling this script.

SparkPi Example

Run the following InsightEdge submit script for the SparkPi example. This example specifies a JAR file with a specific URI that uses the local:// scheme. This URI is the location of the example JAR that is already available in the Docker image. If your application’s dependencies are all hosted in remote locations (like HDFS or HTTP servers), you can use the appropriate remote URIs, such as https://path/to/examples.jar.

./insightedge-submit \
--master k8s://<Kubernetes Master URL>:<port> \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.container.image=gigaspaces/insightedge-enterprise:14.5 \
local://<path to Spark examples JAR>

Refer to the Apache Spark documentation for more configuration options that are specific to Spark on Kubernetes. For example, to specify the Driver Pod name, add the following configuration option to the submit command:

--conf spark.kubernetes.driver.pod.name=<driver pod name>


SaveRDD Example

Run the following InsightEdge submit script for the SaveRDD example, which generates "N" products, converts them to RDD, and saves them to the data grid. This example has the following configuration:

  • The --master URL has the prefix k8s://<Kubernetes Master URL>:<port>.
  • The lookup locator is set to the headless service of the Manager Pod (<release name>-insightedge-manager-hs).
  • The example looks up the default Space, called demo.
  • In Kubernetes clusters with RBAC enabled, the service account must be set (e.g. serviceAccountName=spark).
  • The spark.kubernetes.container.image property is set to the desired Docker image (e.g. gigaspaces/insightedge-enterprise:14.5).

./insightedge-submit \
--master k8s://<Kubernetes Master URL>:<port> \
--deploy-mode cluster \
--name i9e-saveRdd \
--class org.insightedge.examples.basic.SaveRdd \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.container.image=gigaspaces/insightedge-enterprise:14.5 \
--conf spark.insightedge.space.lookup.locator=<release name>-insightedge-manager-hs \
local://<path to InsightEdge examples JAR>

Use the GigaSpaces CLI to query the number of objects in the demo data grid. The output should show 100,000 objects of type org.insightedge.examples.basic.Product.

Port 8090 is exposed as the load balancer port demo-insightedge-manager-service:30990/TCP, and should be specified as part of the --server option. Navigate to the product bin directory and type the following CLI command:

./<XAP-HOME>/bin/gs --server <load balancer host>:30990 info --type-stats demo
<XAP-HOME>\bin\gs --server <load balancer host>:30990 info --type-stats demo

Improved Configuration Options

These improvements are available starting from version 14.0.0 M13.

The following simplified configuration options can be used with the insightedge-submit script.

Space Name

The insightedge-submit script now accepts any Space name when running an InsightEdge example in Kubernetes.

To provide a Space name for the script, add the configuration property --conf spark.insightedge.space.name=<space name>.
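As a sketch, the option can be assembled in a shell variable. The property name spark.insightedge.space.name and the Space name demo are assumptions here, not output copied from a live environment:

```shell
# Compose the --conf argument for a custom Space name.
SPACE_NAME=demo                                    # hypothetical Space name
SPACE_CONF="spark.insightedge.space.name=${SPACE_NAME}"
echo "--conf ${SPACE_CONF}"
```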

For example, the Helm commands below will install the following stateful sets: testmanager-insightedge-manager, testmanager-insightedge-zeppelin, and testspace-demo-[i].

The InsightEdge submit command will submit the SaveRDD example with the testspace and testmanager configuration parameters.

$ helm install insightedge-manager --name testmanager

$ helm install insightedge-zeppelin --name testzeppelin

$ helm install insightedge-pu --name testspace --set manager.name=testmanager

$ ./insightedge-submit \
--master k8s://<Kubernetes Master URL>:<port> \
--deploy-mode cluster \
--name i9e-saveRdd \
--class org.insightedge.examples.basic.SaveRdd \
--conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
--conf spark.kubernetes.container.image=gigaspaces/insightedge-enterprise:14.5 \
--conf spark.insightedge.space.name=<space name> \
--conf spark.insightedge.space.lookup.locator=testmanager-insightedge-manager-hs \
local://<path to InsightEdge examples JAR>