alluxio
alluxio copied to clipboard
Optimize the getBucket method
What changes are proposed in this pull request?
Refactor getBucket using listStatusPartial.
Why are the changes needed?
Using ListStatusPartial can avoid fetching a large amount of useless data, and can also relieve the pressure on the client and server.
Does this PR introduce any user facing changes?
No.
is this changing the default behavior? Changing the current behavior should be avoided usually.
The result returned is unchanged, just to avoid requesting a lot of useless data.
@Jackson-Wang-7 Please take a look
The result returned is unchanged, just to avoid requesting a lot of useless data.
The caller would see less data, am I right? If this is the case, that means the behavior is changed.
The original logic would first request all the data before performing the filter, including limiting the number, so that the excess data would be discarded. The current implementation is only to avoid requesting excess data, but the actual returned result is the same.
When the delimiterParam
is not empty also can use listStatusPartial
, i will finish it.
This loop here https://github.com/Alluxio/alluxio/blob/79f43ef82036d90077c1c502e86edfd780255b9f/core/server/master/src/main/java/alluxio/master/file/DefaultFileSystemMaster.java#L1132-L1134 kill the performance of recursive listings, since it traverses the entire subtree on every call. But in the vast majority of cases it does nothing, there are only a few specific cases where this will trigger a metadata sync.
Eventually we plan to fix the metadata sync and remove this logic, but for now, I wonder if we can add a flag to disable this loop if the user doesn't want to metadata sync? wdyt @lucyge2022
This loop here
https://github.com/Alluxio/alluxio/blob/79f43ef82036d90077c1c502e86edfd780255b9f/core/server/master/src/main/java/alluxio/master/file/DefaultFileSystemMaster.java#L1132-L1134
kill the performance of recursive listings, since it traverses the entire subtree on every call. But in the vast majority of cases it does nothing, there are only a few specific cases where this will trigger a metadata sync. Eventually we plan to fix the metadata sync and remove this logic, but for now, I wonder if we can add a flag to disable this loop if the user doesn't want to metadata sync? wdyt @lucyge2022
since the syncMetadata is called before doing anything in liststatus call on the target listing path, what's the ramification by turning this off? what are the scenarios that the list will list a stale view from UFS?
A big picture question I'd like to ask is, how's the recursive ls call used? And what do you say the percentage of using ls vs delimiter and ls recursively from your application?
@lucyge2022 There are various usage scenarios, and I have not considered this issue from the usage scenarios. I refactored this method to use listStatusPartial which I think makes more sense. This avoids the unnecessary overhead of fetching large amounts of unneeded data.
@Jackson-Wang-7 Can you help review this code?
@lucyge2022 Please help take a look!
One question, how much performance gain from this change? Had you measured this part?
We recently added this flag to disable the areDescendantsLoaded check https://github.com/Alluxio/alluxio/blob/master/core/transport/src/main/proto/grpc/file_system_master.proto#L234-L239
I wonder if we can add this as a parameter? Or maybe a property key?
If it is disabled there are some cases where a nested directory might not load metadata, but it will let the partial listing have decent performance until the metadata sync code is updated.
since the syncMetadata is called before doing anything in liststatus call on the target listing path, what's the ramification by turning this off? what are the scenarios that the list will list a stale view from UFS?
The scenarios are pretty difficult to explain, but I can try.
The basic idea would be that a metadata load happened on a nested directory, but only at 1 level. Then when the root directory is listed recursively, and the time based sync is disabled, normally it will traverse the tree to find that this folder has not loaded its metadata and perform a load metadata on the root. If this is disabled, then it will not perform this check.
@yyongycy I have tested to get 1.5w pieces of data with recursive, and the performance has improved by nearly 40%.
@tcrain I will add a property for areDescendantsLoaded in getBucket. In fact, in the scenarios we need, sync metadata is a must, of course I have to consider more general scenarios.
@lucyge2022 @yyongycy Any other suggestions? I think the problem can be solved step by step, and the problem of pulling all files every time can be solved first, which is a very serious problem.
@lucyge2022 @yyongycy Any other suggestions? I think the problem can be solved step by step, and the problem of pulling all files every time can be solved first, which is a very serious problem.
sure agreed, will take a look again
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in two weeks if no further activity occurs. Thank you for your contributions.