Ethan Feng comments

Results: 39 comments of Ethan Feng

[CELEBORN-1470] CelebornInputStream should retry to get next chunk starting from returned chunk for no replication

Using a bitset to record returned chunks can solve this data loss scenario. Most partition splits won't be larger than 8 GB; I think the default value could...
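A minimal sketch of the bitset idea from the comment above, assuming chunks are addressed by a zero-based index. `ReturnedChunkTracker` and its methods are hypothetical names for illustration and are not taken from the Celeborn code base; the real change in `CelebornInputStream` may look quite different.

```java
import java.util.BitSet;

/** Hypothetical helper: remembers which chunks were already handed to the reader. */
public class ReturnedChunkTracker {
    private final int numChunks;
    private final BitSet returned;

    public ReturnedChunkTracker(int numChunks) {
        if (numChunks < 0) {
            throw new IllegalArgumentException("numChunks must be non-negative");
        }
        this.numChunks = numChunks;
        this.returned = new BitSet(numChunks);
    }

    /** Record that a chunk was returned, so a retry does not return it again. */
    public void markReturned(int chunkIndex) {
        if (chunkIndex < 0 || chunkIndex >= numChunks) {
            throw new IndexOutOfBoundsException("chunkIndex: " + chunkIndex);
        }
        returned.set(chunkIndex);
    }

    public boolean isReturned(int chunkIndex) {
        return returned.get(chunkIndex);
    }

    /** Index of the first chunk not yet returned, or -1 when all were returned. */
    public int nextChunkToFetch() {
        int next = returned.nextClearBit(0);
        return next < numChunks ? next : -1;
    }
}
```

On a retry, a reader built this way would resume from `nextChunkToFetch()` instead of chunk 0, which avoids both re-reading returned chunks and skipping chunks that were never returned.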

[CELEBORN-1620][CIP-11] Support passing worker tags via RequestSlots message

@s0nskar Did you forget to update the transport messages proto files?

[CELEBORN-1532][HELM] Make helm charts more customizable

@ChenYi015 Can you split this PR into several small PRs? Changing one feature in one PR can be easier for reviewers.

Update celeborn conf to add S3 in default and doc for policy

Hi, you can run the following command to refresh the docs.

```
UPDATE=1 build/mvn clean test -pl common -am -Dtest=none -DwildcardSuites=org.apache.celeborn.ConfigurationSuite
```

[CELEBORN-2191] Added factor of number of disks per group when calculating allocation ratio in LoadAwareSlot allocation

@AmandeepSingh285 As you can see, the disks are split into groups evenly, so why do you still want to add a disk count weight? ``` int groupSizeSize = (int) Math.ceil(usableDisks.size() /...

[CELEBORN-2191] Added factor of number of disks per group when calculating allocation ratio in LoadAwareSlot allocation

@AmandeepSingh285 Thanks for your enthusiasm about this PR, but this PR's functionality can be replaced by tuning the `diskGroupGradient`. I added some calculations down here to clarify that you just...

[CELEBORN-1536] Add option to toggle between human friendly vs single line logging

We can set logger levels for components. Maybe this won't be a problem.

[CELEBORN-2063] Parallelize the create partition writer in handleReserveSlots to speed up the reserveSlots RPC process time

Please share some stats about this PR. I was wondering whether it improves the Spark job's e2e time.

[CELEBORN-1549] Fix networkLocation persistence into Ratis

Just hold on a second. The field is skipped for a reason. Let me find out why the field is not included in the proto.

[CELEBORN-1549] Fix networkLocation persistence into Ratis

In `org.apache.celeborn.service.deploy.master.clustermeta.AbstractMetaManager#restoreMetaFromFile`, the abstract meta manager will try to resolve worker infos that are "DEFAULT_RACK".