alluxio Optimize the getBucket method

What changes are proposed in this pull request?

Refactor getBucket using listStatusPartial.

Why are the changes needed?

Using ListStatusPartial can avoid fetching a large amount of useless data, and can also relieve the pressure on the client and server.

Does this PR introduce any user facing changes?

No.

Sep 01 '22 10:09 Haoning-Sun

is this changing the default behavior? Changing the current behavior should be avoided usually.

Sep 02 '22 09:09 yyongycy

The result returned is unchanged, just to avoid requesting a lot of useless data.

Sep 02 '22 14:09 Haoning-Sun

@Jackson-Wang-7 Please take a look

Sep 05 '22 01:09 yyongycy

The result returned is unchanged, just to avoid requesting a lot of useless data.

The caller would see less data, am I right? If this is the case, that means the behavior is changed.

Sep 05 '22 01:09 yyongycy

The original logic would first request all the data before performing the filter, including limiting the number, so that the excess data would be discarded. The current implementation is only to avoid requesting excess data, but the actual returned result is the same.

Sep 05 '22 06:09 Haoning-Sun

When the delimiterParam is not empty also can use listStatusPartial, i will finish it.

Jan 12 '23 07:01 Haoning-Sun

This loop here https://github.com/Alluxio/alluxio/blob/79f43ef82036d90077c1c502e86edfd780255b9f/core/server/master/src/main/java/alluxio/master/file/DefaultFileSystemMaster.java#L1132-L1134 kill the performance of recursive listings, since it traverses the entire subtree on every call. But in the vast majority of cases it does nothing, there are only a few specific cases where this will trigger a metadata sync.

Eventually we plan to fix the metadata sync and remove this logic, but for now, I wonder if we can add a flag to disable this loop if the user doesn't want to metadata sync? wdyt @lucyge2022

Jan 13 '23 00:01 tcrain

This loop here

https://github.com/Alluxio/alluxio/blob/79f43ef82036d90077c1c502e86edfd780255b9f/core/server/master/src/main/java/alluxio/master/file/DefaultFileSystemMaster.java#L1132-L1134

kill the performance of recursive listings, since it traverses the entire subtree on every call. But in the vast majority of cases it does nothing, there are only a few specific cases where this will trigger a metadata sync. Eventually we plan to fix the metadata sync and remove this logic, but for now, I wonder if we can add a flag to disable this loop if the user doesn't want to metadata sync? wdyt @lucyge2022

since the syncMetadata is called before doing anything in liststatus call on the target listing path, what's the ramification by turning this off? what are the scenarios that the list will list a stale view from UFS?

Jan 13 '23 20:01 lucyge2022

A big picture question I'd like to ask is, how's the recursive ls call used? And what do you say the percentage of using ls vs delimiter and ls recursively from your application?

Jan 13 '23 20:01 lucyge2022

@lucyge2022 There are various usage scenarios, and I have not considered this issue from the usage scenarios. I refactored this method to use listStatusPartial which I think makes more sense. This avoids the unnecessary overhead of fetching large amounts of unneeded data.

Feb 15 '23 06:02 Haoning-Sun

@Jackson-Wang-7 Can you help review this code？

Feb 15 '23 06:02 Haoning-Sun

@lucyge2022 Please help take a look!

Feb 15 '23 11:02 yyongycy

One question, how much performance gain from this change? Had you measured this part?

Feb 16 '23 00:02 yyongycy

We recently added this flag to disable the areDescendantsLoaded check https://github.com/Alluxio/alluxio/blob/master/core/transport/src/main/proto/grpc/file_system_master.proto#L234-L239

I wonder if we can add this as a parameter? Or maybe a property key?

If it is disabled there are some cases where a nested directory might not load metadata, but it will let the partial listing have decent performance until the metadata sync code is updated.

Feb 16 '23 01:02 tcrain

since the syncMetadata is called before doing anything in liststatus call on the target listing path, what's the ramification by turning this off? what are the scenarios that the list will list a stale view from UFS?

The scenarios are pretty difficult to explain, but I can try.

The basic idea would be that a metadata load happened on a nested directory, but only at 1 level. Then when the root directory is listed recursively, and the time based sync is disabled, normally it will traverse the tree to find that this folder has not loaded its metadata and perform a load metadata on the root. If this is disabled, then it will not perform this check.

Feb 16 '23 01:02 tcrain

@yyongycy I have tested to get 1.5w pieces of data with recursive, and the performance has improved by nearly 40%.

Feb 16 '23 01:02 Haoning-Sun

@tcrain I will add a property for areDescendantsLoaded in getBucket. In fact, in the scenarios we need, sync metadata is a must, of course I have to consider more general scenarios.

Feb 16 '23 01:02 Haoning-Sun

@lucyge2022 @yyongycy Any other suggestions? I think the problem can be solved step by step, and the problem of pulling all files every time can be solved first, which is a very serious problem.

Feb 24 '23 15:02 Haoning-Sun

@lucyge2022 @yyongycy Any other suggestions? I think the problem can be solved step by step, and the problem of pulling all files every time can be solved first, which is a very serious problem.

sure agreed, will take a look again

Feb 24 '23 18:02 lucyge2022

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.

Mar 27 '23 15:03 github-actions[bot]

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.

Jun 16 '23 15:06 github-actions[bot]

alluxio alluxio copied to clipboard

Optimize the getBucket method

What changes are proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user facing changes?

alluxio
alluxio copied to clipboard