PySpark MLlib pitfall: model predict + rdd map zip. Be especially careful when using zip!!!

The punchline first, quoting a StackOverflow answer (linked below): "Excluding some special cases this is guaranteed only if both RDDs have the same ancestor and there are not shuffles and operations potentially changing number of elements (filter, flatMap) between the common ancestor and the current state. Typically it means only map (1-to-1) transformations."

And this is the root of all evil: outside those cases, zip may not give you the result you want. In other words, the order after zip can be scrambled!!! I hit exactly this in a real project!!!

It all started because you cannot call model.predict directly inside a map in PySpark, even though the Scala API can (see the quote further below).
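A minimal sketch of the failing pattern (testData and model are assumed here: an RDD of LabeledPoint and a trained MLlib model such as DecisionTreeModel):

    # Calling model.predict inside a map ships the model object, which holds a
    # reference to the driver-side SparkContext, out to the workers.
    labels_and_preds = testData.map(lambda lp: (lp.label, model.predict(lp.features)))
    labels_and_preds.count()  # triggers: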

Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.

Instead of that, the official documentation recommends something like this:
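The recommended snippet is the familiar predict-then-zip pattern from the MLlib guide (testData is an RDD of LabeledPoint):

    # Predict over the whole features RDD at once on the driver side, then zip
    # the predictions back onto the labels.
    predictions = model.predict(testData.map(lambda x: x.features))
    labelsAndPredictions = testData.map(lambda lp: lp.label).zip(predictions)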

Unfortunately, even this documented approach did not work so well for me, because zip itself has surprising ordering behavior:

See: https://stackoverflow.com/questions/29268210/mind-blown-rdd-zip-method

This appears to imply that even the trivial a.map(f).zip(a) is not guaranteed to be equivalent to a.map(x => (f(x),x)). What is the situation when zip() results are reproducible?
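To make the distinction in that quote concrete, here is a small runnable sketch: pairing inside a single map is always consistent, while map-then-zip merely happens to agree when the lineage is strictly 1-to-1 (as it is in this toy example):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    f = lambda x: x * x
    a = sc.parallelize(range(8), 2)

    safe = a.map(lambda x: (f(x), x))  # pairing happens in one pass: always consistent
    risky = a.map(f).zip(a)            # relies on both sides lining up element-for-element

    print(safe.collect())
    print(risky.collect())  # agrees here only because `a` has a pure 1-to-1 lineage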

For contrast: with the Scala API, the recommended way of getting predictions for an RDD[LabeledPoint] from a DecisionTreeModel is to simply map over the RDD and call predict per point. That per-point pattern is exactly what PySpark forbids, which is why the zip pattern exists at all.

In the end, here is how I dealt with it:

1. Union the RDD with an empty RDD first, rdd = rdd.union(sc.parallelize([])); after that, map and zip produce correct results! See the sketch below.
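A sketch of fix 1 (rdd, model, and sc come from the surrounding job; rdd is the RDD of LabeledPoint whose ancestors included the shuffle and filter):

    # Fix 1: union with an empty RDD, then run the usual predict + zip pattern.
    rdd = rdd.union(sc.parallelize([]))
    predictions = model.predict(rdd.map(lambda lp: lp.features))
    labels_and_preds = rdd.map(lambda lp: lp.label).zip(predictions)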

zip is generally speaking a tricky operation. It requires both RDDs not only to have the same number of partitions but also the same number of elements per partition.

See: https://stackoverflow.com/questions/32084368/can-only-zip-with-rdd-which-has-the-same-number-of-partitions-error
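Both requirements from that quote are easy to trip over. A small demonstration (the failing calls are commented out so the snippet runs as-is):

    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()

    a = sc.parallelize(range(10), 2)
    b = sc.parallelize(range(10), 5)
    # a.zip(b).collect()  # ValueError: Can only zip with RDD which has the
    #                     # same number of partitions
    c = a.filter(lambda x: x % 2 == 0)  # same partition count, fewer elements each
    # a.zip(c).collect()  # fails at runtime: the paired partitions no longer
    #                     # hold the same number of elements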

2. Alternatively, collect the RDD to be predicted onto the driver machine and run model.predict there. It works, but it is the uglier approach! A sketch follows.
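A sketch of fix 2, viable only when the data fits in driver memory (rdd and model again come from the surrounding job):

    # Fix 2: collect to the driver and predict point by point. MLlib models
    # accept a single feature vector when called on the driver.
    local_points = rdd.collect()
    labels_and_preds = [(p.label, model.predict(p.features)) for p in local_points]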

The root cause: my ancestor RDD had gone through shuffle and filter operations! Using zip on the derived child RDD then went wrong (the data came out in scrambled order)!!! Incredibly frustrating; this problem cost me a whole day, but thank God it is finally solved. Amen!