Junfan Zhang
Updated. 1. Added docs for these configs. 2. Explained why the startup-silent-period is disabled by default.
```
Error: Tests run: 6, Failures: 1, Errors: 0, Skipped: 0, Time elapsed: 17.998 s
```
PTAL @jerqi POC
> Maybe we just need to change the information that we want to log. This is not general enough, and duplicate params are scattered everywhere.
> Could you reproduce this problem? It occurs in our users' jobs, and I may have to write Spark test code to reproduce it. > Are the missing blocks disk storage...
> It may be a bug in our RSS. I once tried to fix similar problems in https://github.com/apache/incubator-uniffle/pull/40 and https://github.com/Tencent/Firestorm/pull/92. It looks like this problem is not the same as...
> > transIndexDataToSegments > > It seems that we read an incomplete index file. We should fail fast when encountering BufferUnderflowException instead of ignoring it. https://github.com/apache/incubator-uniffle/blob/07f70ed872b87107fc7028577bd9e66d8349fd6c/common/src/main/java/org/apache/uniffle/common/util/RssUtils.java#L224
> For MEMORY_HDFS or MEMORY_LOCALFILE, it's normal to read incomplete index data, so we choose to ignore it instead of failing fast. Oh, makes sense. > > > We should fail...
> Revisit: do we need to fail fast? As far as I know, the BufferUnderflowException is caused by an incomplete `FileBasedShuffleSegment`. In the localfile reader, it will always get the n...
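To make the fail-fast vs. ignore trade-off discussed above concrete, here is a minimal sketch. It is not the actual `RssUtils` code: the class name `IndexParseSketch`, the `Entry` type, and the record layout (offset, length, blockId) are all hypothetical and chosen only for illustration. It shows how a trailing partial record triggers `BufferUnderflowException`, and how a reader could either drop the partial record or surface the error.

```java
import java.nio.BufferUnderflowException;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class IndexParseSketch {
    // Hypothetical record layout (an assumption, not Uniffle's real format):
    // offset (long), length (int), blockId (long).
    public static final int RECORD_BYTES = Long.BYTES + Integer.BYTES + Long.BYTES;

    public static final class Entry {
        public final long offset;
        public final int length;
        public final long blockId;

        Entry(long offset, int length, long blockId) {
            this.offset = offset;
            this.length = length;
            this.blockId = blockId;
        }
    }

    // Parses as many complete records as possible from indexData.
    // failFast = true: an incomplete trailing record raises an exception.
    // failFast = false: the incomplete trailing record is silently dropped.
    public static List<Entry> parse(byte[] indexData, boolean failFast) {
        ByteBuffer buf = ByteBuffer.wrap(indexData);
        List<Entry> entries = new ArrayList<>();
        try {
            while (buf.hasRemaining()) {
                long offset = buf.getLong();
                int length = buf.getInt();
                long blockId = buf.getLong();
                // Only reached when all three fields were read successfully.
                entries.add(new Entry(offset, length, blockId));
            }
        } catch (BufferUnderflowException e) {
            if (failFast) {
                throw new IllegalStateException(
                    "Incomplete index data: " + indexData.length + " bytes", e);
            }
            // Lenient mode: keep the complete entries parsed so far.
        }
        return entries;
    }
}
```

In the lenient mode, a reader that sees an index file still being appended to returns the complete entries and can pick up the rest on a later read. The fail-fast mode is only safe when the index is expected to be complete at read time.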
After collecting all the failed tasks whose exception is related to `org.apache.uniffle.common.exception.RssException: Blocks read inconsistent`, I found that these exceptions are caused by `DEADLINE_EXCEEDED` on the gRPC remote call, which is...