alluxio icon indicating copy to clipboard operation
alluxio copied to clipboard

Optimize the getBucket method

Open Haoning-Sun opened this issue 2 years ago • 22 comments

What changes are proposed in this pull request?

Refactor getBucket using listStatusPartial.

Why are the changes needed?

Using ListStatusPartial can avoid fetching a large amount of useless data, and can also relieve the pressure on the client and server.

Does this PR introduce any user facing changes?

No.

Haoning-Sun avatar Sep 01 '22 10:09 Haoning-Sun

is this changing the default behavior? Changing the current behavior should be avoided usually.

yyongycy avatar Sep 02 '22 09:09 yyongycy

The result returned is unchanged, just to avoid requesting a lot of useless data.

Haoning-Sun avatar Sep 02 '22 14:09 Haoning-Sun

@Jackson-Wang-7 Please take a look

yyongycy avatar Sep 05 '22 01:09 yyongycy

The result returned is unchanged, just to avoid requesting a lot of useless data.

The caller would see less data, am I right? If this is the case, that means the behavior is changed.

yyongycy avatar Sep 05 '22 01:09 yyongycy

The original logic would first request all the data before performing the filter, including limiting the number, so that the excess data would be discarded. The current implementation is only to avoid requesting excess data, but the actual returned result is the same.

Haoning-Sun avatar Sep 05 '22 06:09 Haoning-Sun

When the delimiterParam is not empty also can use listStatusPartial, i will finish it.

Haoning-Sun avatar Jan 12 '23 07:01 Haoning-Sun

This loop here https://github.com/Alluxio/alluxio/blob/79f43ef82036d90077c1c502e86edfd780255b9f/core/server/master/src/main/java/alluxio/master/file/DefaultFileSystemMaster.java#L1132-L1134 kill the performance of recursive listings, since it traverses the entire subtree on every call. But in the vast majority of cases it does nothing, there are only a few specific cases where this will trigger a metadata sync.

Eventually we plan to fix the metadata sync and remove this logic, but for now, I wonder if we can add a flag to disable this loop if the user doesn't want to metadata sync? wdyt @lucyge2022

tcrain avatar Jan 13 '23 00:01 tcrain

This loop here

https://github.com/Alluxio/alluxio/blob/79f43ef82036d90077c1c502e86edfd780255b9f/core/server/master/src/main/java/alluxio/master/file/DefaultFileSystemMaster.java#L1132-L1134

kill the performance of recursive listings, since it traverses the entire subtree on every call. But in the vast majority of cases it does nothing, there are only a few specific cases where this will trigger a metadata sync. Eventually we plan to fix the metadata sync and remove this logic, but for now, I wonder if we can add a flag to disable this loop if the user doesn't want to metadata sync? wdyt @lucyge2022

since the syncMetadata is called before doing anything in liststatus call on the target listing path, what's the ramification by turning this off? what are the scenarios that the list will list a stale view from UFS?

lucyge2022 avatar Jan 13 '23 20:01 lucyge2022

A big picture question I'd like to ask is, how's the recursive ls call used? And what do you say the percentage of using ls vs delimiter and ls recursively from your application?

lucyge2022 avatar Jan 13 '23 20:01 lucyge2022

@lucyge2022 There are various usage scenarios, and I have not considered this issue from the usage scenarios. I refactored this method to use listStatusPartial which I think makes more sense. This avoids the unnecessary overhead of fetching large amounts of unneeded data.

Haoning-Sun avatar Feb 15 '23 06:02 Haoning-Sun

@Jackson-Wang-7 Can you help review this code?

Haoning-Sun avatar Feb 15 '23 06:02 Haoning-Sun

@lucyge2022 Please help take a look!

yyongycy avatar Feb 15 '23 11:02 yyongycy

One question, how much performance gain from this change? Had you measured this part?

yyongycy avatar Feb 16 '23 00:02 yyongycy

We recently added this flag to disable the areDescendantsLoaded check https://github.com/Alluxio/alluxio/blob/master/core/transport/src/main/proto/grpc/file_system_master.proto#L234-L239

I wonder if we can add this as a parameter? Or maybe a property key?

If it is disabled there are some cases where a nested directory might not load metadata, but it will let the partial listing have decent performance until the metadata sync code is updated.

tcrain avatar Feb 16 '23 01:02 tcrain

since the syncMetadata is called before doing anything in liststatus call on the target listing path, what's the ramification by turning this off? what are the scenarios that the list will list a stale view from UFS?

The scenarios are pretty difficult to explain, but I can try.

The basic idea would be that a metadata load happened on a nested directory, but only at 1 level. Then when the root directory is listed recursively, and the time based sync is disabled, normally it will traverse the tree to find that this folder has not loaded its metadata and perform a load metadata on the root. If this is disabled, then it will not perform this check.

tcrain avatar Feb 16 '23 01:02 tcrain

@yyongycy I have tested to get 1.5w pieces of data with recursive, and the performance has improved by nearly 40%.

Haoning-Sun avatar Feb 16 '23 01:02 Haoning-Sun

@tcrain I will add a property for areDescendantsLoaded in getBucket. In fact, in the scenarios we need, sync metadata is a must, of course I have to consider more general scenarios.

Haoning-Sun avatar Feb 16 '23 01:02 Haoning-Sun

@lucyge2022 @yyongycy Any other suggestions? I think the problem can be solved step by step, and the problem of pulling all files every time can be solved first, which is a very serious problem.

Haoning-Sun avatar Feb 24 '23 15:02 Haoning-Sun

@lucyge2022 @yyongycy Any other suggestions? I think the problem can be solved step by step, and the problem of pulling all files every time can be solved first, which is a very serious problem.

sure agreed, will take a look again

lucyge2022 avatar Feb 24 '23 18:02 lucyge2022

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Mar 27 '23 15:03 github-actions[bot]

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Jun 16 '23 15:06 github-actions[bot]