incubator-uniffle icon indicating copy to clipboard operation
incubator-uniffle copied to clipboard

[Improvement] Add hdfs path health check to AppBalanceSelectStorageStrategy

Open smallzhongfeng opened this issue 3 years ago • 17 comments

What changes were proposed in this pull request?

The detection logic of abnormal paths is also added to the APP_BALANCE strategy.

Why are the changes needed?

Anomaly detection can be performed when selecting a remote path to avoid selecting a problematic path.

Does this PR introduce any user-facing change?

The parameters of the original IO_SAMPLE strategy are reused and the names are changed.

  1. rss.coordinator.remote.storage.select.strategy support APP_BALANCE and IO_SAMPLE, APP_BALANCE selection strategy based on the number of apps, IO_SAMPLE selection strategy based on time consumption of reading and writing files.
  2. rss.coordinator.remote.storage.schedule.time , if user choose IO_SAMPLE, file will be read and written at regular intervals.
  3. rss.coordinator.remote.storage.schedule.file.size , the size of each read / write HDFS file.
  4. rss.coordinator.remote.storage.schedule.access.times, number of times to read and write HDFS files.

How was this patch tested?

Passed uts.

smallzhongfeng avatar Sep 11 '22 14:09 smallzhongfeng

Codecov Report

Merging #210 (82d36c4) into master (f1cb43f) will decrease coverage by 0.15%. The diff coverage is 66.66%.

@@             Coverage Diff              @@
##             master     #210      +/-   ##
============================================
- Coverage     59.31%   59.16%   -0.16%     
+ Complexity     1345     1339       -6     
============================================
  Files           161      163       +2     
  Lines          8780     8810      +30     
  Branches        828      833       +5     
============================================
+ Hits           5208     5212       +4     
- Misses         3304     3332      +28     
+ Partials        268      266       -2     
Impacted Files Coverage Δ
...fle/coordinator/AbstractSelectStorageStrategy.java 20.00% <20.00%> (ø)
...nator/LowestIOSampleCostSelectStorageStrategy.java 71.42% <69.35%> (-2.65%) :arrow_down:
...apache/uniffle/coordinator/ApplicationManager.java 83.80% <70.73%> (-6.47%) :arrow_down:
...e/coordinator/AppBalanceSelectStorageStrategy.java 76.00% <75.00%> (+3.65%) :arrow_up:
...a/org/apache/uniffle/common/RemoteStorageInfo.java 96.29% <100.00%> (+1.85%) :arrow_up:
...rg/apache/uniffle/coordinator/CoordinatorConf.java 97.05% <100.00%> (ø)
...pache/uniffle/server/ShuffleServerGrpcService.java 0.90% <0.00%> (-0.02%) :arrow_down:
...orage/handler/impl/LocalFileServerReadHandler.java 77.96% <0.00%> (ø)
... and 2 more

:mega: We’re building smart automated test selection to slash your CI/CD build times. Learn more

codecov-commenter avatar Sep 11 '22 14:09 codecov-commenter

Like discussion at dev mail list, we will freeze the code and cut 0.6 version branch in September 15, we will not merge this pr before I cut 0.6 version branch, are you ok?

jerqi avatar Sep 14 '22 10:09 jerqi

Like discussion at dev mail list, we will freeze the code and cut 0.6 version branch in September 15, we will not merge this pr before I cut 0.6 version branch, are you ok?

jerqi avatar Sep 14 '22 10:09 jerqi

Like discussion at dev mail list, we will freeze the code and cut 0.6 version branch in September 15, we will not merge this pr before I cut 0.6 version branch, are you ok?

No problem, I got it.

smallzhongfeng avatar Sep 14 '22 16:09 smallzhongfeng

For object storage, there may be more S3, but for cos, I don't know this very well. Do you have any good suggestions? @jerqi

smallzhongfeng avatar Sep 20 '22 11:09 smallzhongfeng

For object storage, there may be more S3, but for cos, I don't know this very well. Do you have any good suggestions? @jerqi

Do the S3 have the concept of path? It only have the concept of bucket.

jerqi avatar Sep 20 '22 12:09 jerqi

Could we support anomaly detection of hdfs files first?

smallzhongfeng avatar Sep 20 '22 17:09 smallzhongfeng

Could we support anomaly detection of hdfs files first?

We should have more general interface. We can only implement part of them.

jerqi avatar Sep 21 '22 02:09 jerqi

For object storage, there may be more S3, but for cos, I don't know this very well. Do you have any good suggestions? @jerqi

Do the S3 have the concept of path? It only have the concept of bucket.

Yes, the s3 compatible filesystem of Hadoop has implemented the abstraction of Path, but it's slower than the s3 original api when using list files/dirs. I think cos is the same with s3.

By the way, I implemented the hadoop filesystem api for internal object store in the past, so have some experience on it.

zuston avatar Sep 21 '22 06:09 zuston

For object storage, there may be more S3, but for cos, I don't know this very well. Do you have any good suggestions? @jerqi

Do the S3 have the concept of path? It only have the concept of bucket.

Yes, the s3 compatible filesystem of Hadoop has implemented the abstraction of Path, but it's slower than the s3 original api when using list files/dirs. I think cos is the same with s3.

By the way, I implemented the hadoop filesystem api for internal object store in the past, so have some experience on it.

We shouldn't relay on the filesystem of object storage too much. Some operations of object storage is slow. For example, list operation, rename operation and append operation.

jerqi avatar Sep 21 '22 10:09 jerqi

We should have more general interface. We can only implement part of them.

I think we only need to provide a path that includes the remote, and the rest of the methods can be implemented by judging different storage systems.

smallzhongfeng avatar Sep 21 '22 18:09 smallzhongfeng

We should have more general interface. We can only implement part of them.

I think we only need to provide a path that includes the remote, and the rest of the methods can be implemented by judging different storage systems.

My doubt point is that the concept of path is not general one for the storage.

jerqi avatar Sep 22 '22 02:09 jerqi

So I did not use the concept of path for the time being, but used a string to represent.

smallzhongfeng avatar Sep 22 '22 02:09 smallzhongfeng

So I did not use the concept of path for the time being, but used a string to represent.

The concept of subsequent use of buckets or paths can be implemented through the readAndWrite interface.

smallzhongfeng avatar Sep 22 '22 02:09 smallzhongfeng

So I did not use the concept of path for the time being, but used a string to represent.

The concept of subsequent use of buckets or paths can be implemented through the readAndWrite interface.

It's ok.

jerqi avatar Sep 22 '22 02:09 jerqi

PTAL @jerqi

smallzhongfeng avatar Sep 23 '22 13:09 smallzhongfeng

Maybe we should have a abstract class to define the common behaviors of SelectStorageStrategy.

jerqi avatar Sep 27 '22 08:09 jerqi

This modification has done three things

  1. An abstract class AbstractSelectStorageStrategy is added to provide a method for reading and writing hdfs paths.
  2. Added logs of the sorted storage list.
  3. Removed a global variable of isHealthy and changed it to a local variable.

smallzhongfeng avatar Oct 08 '22 13:10 smallzhongfeng

Could you open this workflow for me? @jerqi

smallzhongfeng avatar Oct 08 '22 13:10 smallzhongfeng

PTAL @jerqi

smallzhongfeng avatar Oct 09 '22 02:10 smallzhongfeng

https://github.com/apache/incubator-uniffle/actions/runs/3213433147 There is a flay test. Could you help me fix it? test-reports-spark3.2.0 (1).zip

jerqi avatar Oct 09 '22 08:10 jerqi

https://github.com/apache/incubator-uniffle/actions/runs/3213433147 There is a flay test. Could you help me fix it? test-reports-spark3.2.0 (1).zip

OK, I will fix it.

smallzhongfeng avatar Oct 09 '22 12:10 smallzhongfeng