incubator-uniffle
incubator-uniffle copied to clipboard
[Improvement] Add hdfs path health check to AppBalanceSelectStorageStrategy
What changes were proposed in this pull request?
The detection logic of abnormal paths is also added to the APP_BALANCE strategy.
Why are the changes needed?
Anomaly detection can be performed when selecting a remote path to avoid selecting a problematic path.
Does this PR introduce any user-facing change?
The parameters of the original IO_SAMPLE strategy are reused and the names are changed.
rss.coordinator.remote.storage.select.strategysupportAPP_BALANCEandIO_SAMPLE,APP_BALANCEselection strategy based on the number of apps,IO_SAMPLEselection strategy based on time consumption of reading and writing files.rss.coordinator.remote.storage.schedule.time, if user chooseIO_SAMPLE, file will be read and written at regular intervals.rss.coordinator.remote.storage.schedule.file.size, the size of each read / write HDFS file.rss.coordinator.remote.storage.schedule.access.times, number of times to read and write HDFS files.
How was this patch tested?
Passed uts.
Codecov Report
Merging #210 (82d36c4) into master (f1cb43f) will decrease coverage by
0.15%. The diff coverage is66.66%.
@@ Coverage Diff @@
## master #210 +/- ##
============================================
- Coverage 59.31% 59.16% -0.16%
+ Complexity 1345 1339 -6
============================================
Files 161 163 +2
Lines 8780 8810 +30
Branches 828 833 +5
============================================
+ Hits 5208 5212 +4
- Misses 3304 3332 +28
+ Partials 268 266 -2
| Impacted Files | Coverage Δ | |
|---|---|---|
| ...fle/coordinator/AbstractSelectStorageStrategy.java | 20.00% <20.00%> (ø) |
|
| ...nator/LowestIOSampleCostSelectStorageStrategy.java | 71.42% <69.35%> (-2.65%) |
:arrow_down: |
| ...apache/uniffle/coordinator/ApplicationManager.java | 83.80% <70.73%> (-6.47%) |
:arrow_down: |
| ...e/coordinator/AppBalanceSelectStorageStrategy.java | 76.00% <75.00%> (+3.65%) |
:arrow_up: |
| ...a/org/apache/uniffle/common/RemoteStorageInfo.java | 96.29% <100.00%> (+1.85%) |
:arrow_up: |
| ...rg/apache/uniffle/coordinator/CoordinatorConf.java | 97.05% <100.00%> (ø) |
|
| ...pache/uniffle/server/ShuffleServerGrpcService.java | 0.90% <0.00%> (-0.02%) |
:arrow_down: |
| ...orage/handler/impl/LocalFileServerReadHandler.java | 77.96% <0.00%> (ø) |
|
| ... and 2 more |
:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more
Like discussion at dev mail list, we will freeze the code and cut 0.6 version branch in September 15, we will not merge this pr before I cut 0.6 version branch, are you ok?
Like discussion at dev mail list, we will freeze the code and cut 0.6 version branch in September 15, we will not merge this pr before I cut 0.6 version branch, are you ok?
Like discussion at dev mail list, we will freeze the code and cut 0.6 version branch in September 15, we will not merge this pr before I cut 0.6 version branch, are you ok?
No problem, I got it.
For object storage, there may be more S3, but for cos, I don't know this very well. Do you have any good suggestions? @jerqi
For object storage, there may be more S3, but for cos, I don't know this very well. Do you have any good suggestions? @jerqi
Do the S3 have the concept of path? It only have the concept of bucket.
Could we support anomaly detection of hdfs files first?
Could we support anomaly detection of hdfs files first?
We should have more general interface. We can only implement part of them.
For object storage, there may be more S3, but for cos, I don't know this very well. Do you have any good suggestions? @jerqi
Do the S3 have the concept of
path? It only have the concept of bucket.
Yes, the s3 compatible filesystem of Hadoop has implemented the abstraction of Path, but it's slower than the s3 original api when using list files/dirs. I think cos is the same with s3.
By the way, I implemented the hadoop filesystem api for internal object store in the past, so have some experience on it.
For object storage, there may be more S3, but for cos, I don't know this very well. Do you have any good suggestions? @jerqi
Do the S3 have the concept of
path? It only have the concept of bucket.Yes, the s3 compatible filesystem of Hadoop has implemented the abstraction of
Path, but it's slower than the s3 original api when usinglistfiles/dirs. I think cos is the same with s3.By the way, I implemented the hadoop filesystem api for internal object store in the past, so have some experience on it.
We shouldn't relay on the filesystem of object storage too much. Some operations of object storage is slow. For example, list operation, rename operation and append operation.
We should have more general interface. We can only implement part of them.
I think we only need to provide a path that includes the remote, and the rest of the methods can be implemented by judging different storage systems.
We should have more general interface. We can only implement part of them.
I think we only need to provide a path that includes the remote, and the rest of the methods can be implemented by judging different storage systems.
My doubt point is that the concept of path is not general one for the storage.
So I did not use the concept of path for the time being, but used a string to represent.
So I did not use the concept of
pathfor the time being, but used a string to represent.
The concept of subsequent use of buckets or paths can be implemented through the readAndWrite interface.
So I did not use the concept of
pathfor the time being, but used a string to represent.The concept of subsequent use of buckets or paths can be implemented through the readAndWrite interface.
It's ok.
PTAL @jerqi
Maybe we should have a abstract class to define the common behaviors of SelectStorageStrategy.
This modification has done three things
- An abstract class
AbstractSelectStorageStrategyis added to provide a method for reading and writing hdfs paths. - Added logs of the sorted storage list.
- Removed a global variable of
isHealthyand changed it to a local variable.
Could you open this workflow for me? @jerqi
PTAL @jerqi
https://github.com/apache/incubator-uniffle/actions/runs/3213433147 There is a flay test. Could you help me fix it? test-reports-spark3.2.0 (1).zip
https://github.com/apache/incubator-uniffle/actions/runs/3213433147 There is a flay test. Could you help me fix it? test-reports-spark3.2.0 (1).zip
OK, I will fix it.