scala - Apache Spark K-Means clustering - RDD for input
I'm trying to run Spark's k-means clustering on grouped data, and I'm getting a variety of errors when I try to cluster each group.
The input RDD looks like (userId: Long, coords: Seq[Vector]), i.e.:

    org.apache.spark.rdd.RDD[(Long, Seq[org.apache.spark.mllib.linalg.Vector])]

Each Vector contains x/y coordinates, i.e. pairs of Doubles. I want to identify the coordinate clusters for each userId, so I'm mapping over the RDD and trying to run k-means on each group:
    val userClusters = userCoordVectors.map { case (userId, coords) =>
      val clusters = 4
      val iterations = 30
      // need to convert coords to an RDD for input to k-means
      val parsedData = sc.parallelize(coords)
      // apply k-means
      val model = KMeans.train(parsedData, clusters, iterations)
      ... // etc
    }

But when I run this, I get an NPE on this line:

    val parsedData = sc.parallelize(coords)

The problem is that I have to convert coords into an RDD for the k-means operation.
On the other hand, if I collect the input RDD first, I don't get the NPE. Instead I get a Java heap error, presumably because I'm materialising the whole RDD:

    val userClusters = sc.parallelize(userCoordVectors.collect.map { ... })

Collecting the data in the RDD seems wrong here, and I'm assuming there ought to be a better way, but I don't know how else to make the parsedData line work.
Can anyone see an obvious mistake in how I'm trying to use k-means here, or suggest how to accomplish the goal of clustering the data within each group?
You cannot use a SparkContext or an RDD inside a function passed to an RDD operator. They cannot be serialized and sent over the network.
Matei Zaharia answered this here: http://apache-spark-user-list.1001560.n3.nabble.com/can-we-get-a-spark-context-inside-a-mapper-td9605.html
You can't use SparkContext inside a Spark task, so in this case you'd have to call some kind of local k-means library. One example you can try is Weka (http://www.cs.waikato.ac.nz/ml/weka/). You can load text files as an RDD of strings with SparkContext.wholeTextFiles and call Weka on each one.
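To make the idea concrete, here is a minimal sketch of that pattern using a hand-rolled local k-means (Lloyd's algorithm) instead of Weka, so it needs no SparkContext on the executors. The `LocalKMeans` object, the deterministic seeding from the first k distinct points, and the (x, y) pair representation are all my own illustrative choices, not from the original post:

```scala
// A local k-means that uses only plain Scala collections, so it can
// safely run inside an RDD operator such as mapValues.
object LocalKMeans {
  type Point = (Double, Double)

  // Squared Euclidean distance between two points.
  private def dist2(a: Point, b: Point): Double = {
    val dx = a._1 - b._1
    val dy = a._2 - b._2
    dx * dx + dy * dy
  }

  // Mean (centroid) of a non-empty sequence of points.
  private def mean(ps: Seq[Point]): Point =
    (ps.map(_._1).sum / ps.size, ps.map(_._2).sum / ps.size)

  // Lloyd's algorithm: seed deterministically from the first k distinct
  // points (for simplicity), then alternate assignment and update steps.
  def run(points: Seq[Point], k: Int, iterations: Int): Seq[Point] = {
    var centroids = points.distinct.take(k)
    for (_ <- 1 to iterations) {
      val assigned = points.groupBy(p => centroids.minBy(c => dist2(p, c)))
      // A centroid that attracted no points is kept unchanged.
      centroids = centroids.map(c => assigned.get(c).map(mean).getOrElse(c))
    }
    centroids
  }
}
```

In the Spark job this would then replace the nested `KMeans.train` call, e.g. `userCoordVectors.mapValues(coords => LocalKMeans.run(coords.map(v => (v(0), v(1))), 4, 30))`, keeping all per-user work local to the task.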
scala machine-learning apache-spark