Machine Learning
This section describes how to reuse ML models with the Data Grid. You can train your model, save it to the Data Grid and then reuse it in different Spark contexts.
Saving ML Model to the Data Grid
To save Spark ML/MLlib model to the Data Grid, use saveToGrid
method on your model. The model is serialized with the default Java serialization mechanism, so the models stored on Spark workers (i.e. those that have a reference to RDD or DataFrame) are not currently supported.
For example:
- you can persist
org.apache.spark.mllib.clustering.KMeansModel
that holds a local array of cluster centersclusterCenters: Array[Vector]
- you cannot persist
org.apache.spark.mllib.recommendation.MatrixFactorizationModel
that holds references touserFeatures: RDD[(Int, Array[Double])]
andproductFeatures: RDD[(Int, Array[Double])]
.
Code example:
val rdd = sc.parallelize(List(Vectors.dense(1.0, 1.0, 3.0), Vectors.dense(2.0, 0.0, 1.0), Vectors.dense(2.0, 1.0, 0.0)))
val k = 2
val maxIterations = 100
val model = KMeans.train(rdd, k, maxIterations)
model.saveToGrid(sc, "KMeansModel")
Loading ML Model from the Data Grid
To load an ML model from the Data Grid, use SparkContext.loadMLModel[R]
. The type parameter R
is an ML model class.
import org.insightedge.spark.implicits.all._
val model = sc.loadMLInstance[KMeansModel]("KMeansModel").get