Kai Huang

136 comments of Kai Huang

There are more discussions here: https://github.com/intel-analytics/arda-docker/issues/682 We need to confirm them before merging this PR.

Why must the number of partitions equal the number of workers? Repartition is expensive; if the number of partitions is already larger than the number of...

> > Why must the number of partitions equal the number of workers? Repartition is expensive; if the number of partitions is already larger than the...

Will coalesce result in unbalanced partitions? E.g. node1 has 9 partitions and node2 has 1 partition; after coalescing to 2 partitions, will each new partition have 5 smaller partitions or...
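The question above can be made concrete with a simplified, hypothetical model of what `coalesce(n)` does: it merges whole parent partitions into `n` new partitions without a shuffle, so individual records never move between partitions and record-level balance is not guaranteed (the real Spark `DefaultPartitionCoalescer` also weighs data locality, which this sketch ignores; `coalesce_groups` is an illustrative name, not a Spark API):

```python
# Simplified model of coalesce(n): whole parent partitions are grouped
# into n new partitions without a shuffle. Here a round-robin grouping
# is used purely for illustration; Spark's DefaultPartitionCoalescer
# additionally prefers grouping partitions that share a host.
def coalesce_groups(num_parent_partitions, num_target):
    groups = [[] for _ in range(num_target)]
    for p in range(num_parent_partitions):
        groups[p % num_target].append(p)
    return groups

# 10 parent partitions coalesced into 2 groups of 5 whole partitions
print(coalesce_groups(10, 2))
# → [[0, 2, 4, 6, 8], [1, 3, 5, 7, 9]]
```

Under this model, if node1 holds 9 partitions and node2 holds 1, each new partition is a bag of whole parent partitions, so the resulting data sizes can stay skewed.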

After changing to coalesce, the following error is thrown:

```
> format(target_id, ".", name), value)
E py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
E : org.apache.spark.scheduler.BarrierJobUnsupportedRDDChainException: [SPARK-24820][SPARK-24821]: Barrier execution mode...
```

@jason-dai @jenniew barrier can't be performed on an RDD that comes from coalesce... So do we still keep the repartition if rdd.getNumPartitions > num workers?

> take

Seems not? To reduce the number of partitions without a shuffle, use coalesce, which can't be combined with barrier. Increasing the number of partitions without a shuffle is unsupported: https://stackoverflow.com/questions/71070709/increase-the-number-of-partitions-without-repartition-on-hadoop

> Will coalesce result in unbalanced partitions? E.g. node1 has 9 partitions and node2 has 1 partition; after coalescing to 2 partitions, will each new partition have 5 smaller partitions...

Issue conclusion:
- The implementation of the Spark backend requires the number of data partitions to equal the number of workers (a limitation of the original design, and not possible to...