scala - Apache Spark K-Means clustering - RDD for input
I'm trying to run Spark's k-means clustering on grouped data, and I'm getting a variety of errors when I try to cluster each group.
The input RDD looks like (userId: Long, coords: Seq[Vector]), i.e.:

    org.apache.spark.rdd.RDD[(Long, Seq[org.apache.spark.mllib.linalg.Vector])]

Each Vector contains x/y coordinates, i.e. pairs of Doubles. I want to identify the coordinate clusters for each userId, so I'm mapping over the RDD and trying to run k-means on each group:
    val userClusters = userCoordVectors.map { case (userId, coords) =>
      val clusters = 4
      val iterations = 30
      // need to convert coords to an RDD for input to k-means
      val parsedData = sc.parallelize(coords)
      // apply k-means
      val model = KMeans.train(parsedData, clusters, iterations)
      ... // etc
    }

But when I run this, I get an NPE on this line:

    val parsedData = sc.parallelize(coords)

The problem is that I have to convert coords into an RDD for the k-means operation.
On the other hand, if I collect the input RDD first, I don't get the NPE. Instead I get a Java heap error, presumably because I'm materialising the whole RDD:

    val userClusters = sc.parallelize(userCoordVectors.collect.map { ... })

Collecting the data in the RDD seems wrong here, and I'm assuming there ought to be a better way, but I don't know how else to make the parsedData line work.
Can anyone see an obvious mistake in how I'm trying to use k-means here, or suggest how to accomplish the goal of clustering the data within each group?
You cannot use a SparkContext or an RDD inside a function passed to an RDD operator. They cannot be serialized and sent over the network.
Matei Zaharia answered this here: http://apache-spark-user-list.1001560.n3.nabble.com/can-we-get-a-spark-context-inside-a-mapper-td9605.html
You can't use SparkContext inside a Spark task, so in this case you'd have to call some kind of local k-means library. One example you can try is Weka (http://www.cs.waikato.ac.nz/ml/weka/). You can load text files as an RDD of strings with SparkContext.wholeTextFiles and call Weka on each one.
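To make the idea concrete, here is a minimal sketch of that pattern using a hand-rolled local k-means (Lloyd's algorithm) instead of Weka, so it needs no SparkContext on the executors. The `LocalKMeans` object, the deterministic seeding from the first k distinct points, and the (x, y) pair representation are all my own illustrative choices, not from the original post:

```scala
// A local k-means that uses only plain Scala collections, so it can
// safely run inside an RDD operator such as mapValues.
object LocalKMeans {
  type Point = (Double, Double)

  // Squared Euclidean distance between two points.
  private def dist2(a: Point, b: Point): Double = {
    val dx = a._1 - b._1
    val dy = a._2 - b._2
    dx * dx + dy * dy
  }

  // Mean (centroid) of a non-empty sequence of points.
  private def mean(ps: Seq[Point]): Point =
    (ps.map(_._1).sum / ps.size, ps.map(_._2).sum / ps.size)

  // Lloyd's algorithm: seed deterministically from the first k distinct
  // points (for simplicity), then alternate assignment and update steps.
  def run(points: Seq[Point], k: Int, iterations: Int): Seq[Point] = {
    var centroids = points.distinct.take(k)
    for (_ <- 1 to iterations) {
      val assigned = points.groupBy(p => centroids.minBy(c => dist2(p, c)))
      // A centroid that attracted no points is kept unchanged.
      centroids = centroids.map(c => assigned.get(c).map(mean).getOrElse(c))
    }
    centroids
  }
}
```

In the Spark job this would then replace the nested `KMeans.train` call, e.g. `userCoordVectors.mapValues(coords => LocalKMeans.run(coords.map(v => (v(0), v(1))), 4, 30))`, keeping all per-user work local to the task.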
scala machine-learning apache-spark