vega
Tracking issue: Implementation of lacking core RDD ops
By core RDD ops we mean those which, in the original Apache Spark, come from SparkContext and/or the base RDD class and friends: SC:
- [x] range
- [x] filter
- [x] randomSplit
- [ ] sortBy
- [x] groupBy
- [x] keyBy
- [ ] zipPartitions
- [x] intersection
- [ ] pipe
- [x] zip
- [ ] subtract
- [ ] treeAggregate
- [ ] treeReduce
- [x] countApprox
- [x] countByValue
- [x] countByValueApprox
- [x] min and max
- [x] top
- [x] takeOrdered
- [x] isEmpty
Non-goals for this tracking issue are any I/O-related ops, as we are tracking those elsewhere and doing things a little bit differently:
- textFile
- wholeTextFiles
- binary files | binary records
- Hadoop* family of methods
Intersection completed in #66
range done in #82
@iduartgomez - Isn't substract a misspelling of subtract?
fixed @GavrielPlotke
What would the subtract operation entail? Can someone give an example?
Doc: https://spark.apache.org/docs/1.0.2/api/java/org/apache/spark/rdd/RDD.html#subtract(org.apache.spark.rdd.RDD)
Example:
- I have a list of customers that I want to advertise to
- I have a list of angry customers who have said "DON'T TALK TO ME!"

email_list_rdd = customers_rdd.subtract(angry_rdd)
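To make the semantics concrete, here is a minimal sketch in plain Python that mimics the behaviour on ordinary lists. It is not the vega or Spark API: the `subtract` helper and the sample names are hypothetical. It assumes the behaviour described in the Spark docs (keep every element of the left collection that does not appear in the right one). Spark makes no ordering guarantee for the result; this sketch happens to preserve the left-hand order.

```python
def subtract(left, right):
    """Return the elements of `left` that do not appear in `right`.

    Hypothetical helper illustrating the semantics only. Duplicates in
    `left` are kept as long as their value is absent from `right`.
    """
    excluded = set(right)  # set gives fast membership checks
    return [item for item in left if item not in excluded]


# Hypothetical sample data for the customer example above.
customers = ["ann", "bob", "cat", "ann", "dan"]
angry = ["bob", "eve"]  # "eve" is not a customer, so she has no effect

email_list = subtract(customers, angry)
print(email_list)  # ['ann', 'cat', 'ann', 'dan']
```

Note that elements present only in the right-hand collection are simply ignored, so the operation is not symmetric: `subtract(angry, customers)` would yield `['eve']`.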