scala - Apache Spark K-Means clustering - RDD for input




I'm trying to run Spark's k-means clustering on grouped data, but I'm getting a variety of errors when I try to cluster each group.

The input RDD looks like (userId: Long, coords: [Vector]), i.e.:

org.apache.spark.rdd.RDD[(Long, Seq[org.apache.spark.mllib.linalg.Vector])]

Each Vector contains x and y coordinates, i.e. pairs of doubles. I want to identify coordinate clusters for each userId, so I'm mapping over the RDD and trying to run k-means for each group:

val userClusters = userCoordVectors.map {
  case (userId, coords) =>
    val clusters = 4
    val iterations = 30
    // need to convert coords to an RDD for input into k-means
    val parsedData = sc.parallelize(coords)
    // apply k-means
    val model = KMeans.train(parsedData, clusters, iterations)
    ... etc
}

But when I run this, I get an NPE from the line:

val parsedData = sc.parallelize(coords)

The problem is that I have to convert coords into an RDD for the k-means operation.

On the other hand, if I collect the input RDD first, I don't get the NPE. Instead, I get a Java heap error, presumably because I'm materialising the whole RDD.

val userClusters = sc.parallelize(userCoordVectors.collect.map { ... })

Collecting the data in this RDD seems wrong here, so I'm assuming there ought to be a better way, but I don't know how else to get the parsedData line to work.

Can anyone see any obvious mistakes in how I'm trying to use k-means here, or suggest how to accomplish the goal of clustering the data within each group?

You cannot use SparkContext or an RDD inside a function passed to RDD operators. They cannot be serialized and sent over the network.

Matei Zaharia answered this here: http://apache-spark-user-list.1001560.n3.nabble.com/can-we-get-a-spark-context-inside-a-mapper-td9605.html

You can't use SparkContext inside a Spark task, so in this case you'd have to call some kind of local k-means library. One example you can try to use is Weka (http://www.cs.waikato.ac.nz/ml/weka/). You can then load your text files as an RDD of strings with SparkContext.wholeTextFiles and call Weka on each one.
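To illustrate the "local k-means inside the task" idea, here is a minimal sketch of what such a call might look like. It is not Weka and not MLlib's KMeans: it swaps in a small hand-written Lloyd's-algorithm loop with a naive seeding (the first k distinct points), purely so the example has no dependencies. The names LocalKMeans, cluster, sqDist, closest and mean are hypothetical, and the commented-out wiring at the end assumes the RDD from the question is called userCoordVectors.

```scala
// Hypothetical helper: a tiny local k-means (Lloyd's algorithm).
// It uses no SparkContext and no RDDs, so it can run inside a Spark task.
object LocalKMeans {
  type Point = Array[Double]

  // Squared Euclidean distance between two points of equal dimension.
  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Index of the centroid closest to p (the first one wins on ties).
  def closest(p: Point, centroids: Seq[Point]): Int =
    centroids.indices.minBy(i => sqDist(p, centroids(i)))

  // Component-wise mean of a non-empty group of points.
  def mean(points: Seq[Point]): Point = {
    val dim = points.head.length
    Array.tabulate(dim)(d => points.map(_(d)).sum / points.size)
  }

  // Returns the final centroids. Seeding is naive (first k distinct points),
  // so fewer than k centroids come back if there are fewer distinct points.
  def cluster(points: Seq[Point], k: Int, iterations: Int): Seq[Point] = {
    require(points.nonEmpty && k > 0, "need at least one point and k > 0")
    var centroids: Seq[Point] =
      points.map(_.toSeq).distinct.take(k).map(_.toArray)
    var i = 0
    var changed = true
    while (i < iterations && changed) {
      // Assign every point to its nearest centroid.
      val groups = points.groupBy(p => closest(p, centroids))
      // Move each centroid to the mean of its group; keep it if the group is empty.
      val updated = centroids.indices.map { c =>
        groups.get(c).map(mean).getOrElse(centroids(c))
      }
      changed = updated.zip(centroids).exists { case (a, b) => sqDist(a, b) > 1e-12 }
      centroids = updated
      i += 1
    }
    centroids
  }
}

// Assumed wiring into the question's RDD (needs Spark on the classpath):
// val userClusters = userCoordVectors.mapValues { coords =>
//   LocalKMeans.cluster(coords.map(_.toArray), 4, 30)
// }
```

Because the clustering runs on each user's Seq inside mapValues, nothing has to be collected to the driver and sc is never touched in the task. This only works if a single user's coordinates fit in one executor's memory, and a real library would give you better initialisation than this sketch does.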

scala machine-learning apache-spark
