velox icon indicating copy to clipboard operation
velox copied to clipboard

Improve OutputBufferManager initialization

Open yingsu00 opened this issue 1 year ago • 1 comments
trafficstars

Starting from C++11, the C++ standard guarantees that the initialization of function-local static variables is thread-safe. This is better than using a global mutex, especially for subsequent getInstance() calls. This is because the overhead of using a static local variable only needs to do a simple check to see if the variable has already been initialized, while for the global mutex case, all getInstance() calls need to aquire this lock exclusively.

yingsu00 avatar Oct 25 '24 06:10 yingsu00

Deploy Preview for meta-velox canceled.

Name Link
Latest commit 9d64dd414bb37083c57e58f741e756648e057107
Latest deploy log https://app.netlify.com/sites/meta-velox/deploys/67cfa214aac4a400088538f4

netlify[bot] avatar Oct 25 '24 06:10 netlify[bot]

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Nov 04 '24 15:11 facebook-github-bot

@yingsu00 CI is red. Please, take a look. We need a green CI to merge.

mbasmanova avatar Nov 04 '24 15:11 mbasmanova

@yingsu00 CI is red. Please, take a look. We need a green CI to merge.

CI is green now. Thanks for porting it.

yingsu00 avatar Nov 06 '24 05:11 yingsu00

Thanks for working on this!

ManManson avatar Nov 06 '24 10:11 ManManson

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Nov 06 '24 16:11 facebook-github-bot

@yingsu00 Seeing failures in Prestissimo. If this change is not backwards compatible it needs to be fixed.

fbcode/github/presto-trunk/presto-native-execution/presto_cpp/main/PeriodicTaskManager.cpp:239:55: error: no member named 'lock' in 'std::shared_ptr<facebook::velox::exec::OutputBufferManager>'
  239 |       velox::exec::OutputBufferManager::getInstance().lock()->numBuffers());
      |       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^
buck-out/v2/gen/fbcode/4ac7a641bc650342/velox/common/base/__velox_stats_reporter__/buck-headers/velox/common/base/StatsReporter.h:208:43: note: expanded from macro 'RECORD_METRIC_VALUE'
  208 |         reporter->addMetricValue((key), ##__VA_ARGS__);        \
      |                                           ^~~~~~~~~~~
1 error generated.

mbasmanova avatar Nov 06 '24 17:11 mbasmanova

@yingsu00 Seeing failures in Prestissimo. If this change is not backwards compatible it needs to be fixed.

fbcode/github/presto-trunk/presto-native-execution/presto_cpp/main/PeriodicTaskManager.cpp:239:55: error: no member named 'lock' in 'std::shared_ptr<facebook::velox::exec::OutputBufferManager>'
  239 |       velox::exec::OutputBufferManager::getInstance().lock()->numBuffers());
      |       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^
buck-out/v2/gen/fbcode/4ac7a641bc650342/velox/common/base/__velox_stats_reporter__/buck-headers/velox/common/base/StatsReporter.h:208:43: note: expanded from macro 'RECORD_METRIC_VALUE'
  208 |         reporter->addMetricValue((key), ##__VA_ARGS__);        \
      |                                           ^~~~~~~~~~~
1 error generated.

I'll send a PR to Presto repo to fix this.

yingsu00 avatar Nov 07 '24 01:11 yingsu00

@yingsu00 Seeing failures in Prestissimo. If this change is not backwards compatible it needs to be fixed.

fbcode/github/presto-trunk/presto-native-execution/presto_cpp/main/PeriodicTaskManager.cpp:239:55: error: no member named 'lock' in 'std::shared_ptr<facebook::velox::exec::OutputBufferManager>'
  239 |       velox::exec::OutputBufferManager::getInstance().lock()->numBuffers());
      |       ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ ^
buck-out/v2/gen/fbcode/4ac7a641bc650342/velox/common/base/__velox_stats_reporter__/buck-headers/velox/common/base/StatsReporter.h:208:43: note: expanded from macro 'RECORD_METRIC_VALUE'
  208 |         reporter->addMetricValue((key), ##__VA_ARGS__);        \
      |                                           ^~~~~~~~~~~
1 error generated.

I'll send a PR to Presto repo to fix this.

Draft Presto PR: https://github.com/prestodb/presto/pull/23970. How do you want to proceed? Once this Velox PR is merged, I can add an advance Velox commit in https://github.com/prestodb/presto/pull/23970. The "advance Velox" will need the hash of this commit.

yingsu00 avatar Nov 07 '24 07:11 yingsu00

@yingsu00 We cannot merge Velox PR that breaks Prestissimo. The changes need to be made in backward-compatible manner.

mbasmanova avatar Nov 07 '24 12:11 mbasmanova

@yingsu00 Some examples: #5953 #6459

mbasmanova avatar Nov 07 '24 12:11 mbasmanova

@yingsu00 Some examples: #5953 #6459

@mbasmanova Thanks for the examples, but could you elaborate a bit how to follow these examples? Since OutputbufferManager::getInstance() now returns shared_ptr, all callsites, including the ones in Presto, needs to remove the lock() from OutputbufferManager::getInstance().lock(). Did you mean you wanted to keep it returning weak_ptr? Returning weak_ptr is not quite appropriate here, as @Yuhta suggested. It makes sense to keep any class fields to this object a weak_ptr, e.g. in PartitionedOutput const std::weak_ptr<exec::OutputBufferManager> bufferManager_;, but it's better to return a shared_ptr for OutputbufferManager::getInstance().

Also I don't quite understand why merging this PR would break Presto given that Presto needs to advance Velox version to pick up this change. Presto won't break unless someone else advances Velox for Presto without merging https://github.com/prestodb/presto/pull/23970 together. To avoid that situation, I will add "Advance Velox" change to https://github.com/prestodb/presto/pull/23970 once I get the hash after this change is merged to Velox. Do you think if it works this way? Or do you have other suggestions? Appreciate it if you could elaborate more. Thanks!

yingsu00 avatar Nov 07 '24 23:11 yingsu00

@ManManson @mbasmanova @Yuhta Can anyone provide some advice how to proceed? Thank you!

yingsu00 avatar Nov 13 '24 01:11 yingsu00

The most straightforward way would be to split the work into several pieces as follows:

  1. Introduce std::shared_ptr<...> getInstanceStrong() and convert uses that are strictly confined to Velox internals, but leave public APIs (which are used by Prestissimo) as is.
  2. Upgrade Velox version in Presto C++ and convert all references to OutputBufferManager to use getInstanceStrong(). This ensures that we don't have references to the old method signature anymore.
  3. Make old getInstance() to also return std::shared_ptr<...> (so getInstance() and getInstanceStrong() become the same thing).
  4. Upgrade Velox version in Presto C++ once again and convert all uses of getInstanceStrong() to getInstance().
  5. Now we are free to finally drop the getInstanceStrong() method.

ManManson avatar Nov 13 '24 14:11 ManManson

The most straightforward way would be to split the work into several pieces as follows:

  1. Introduce std::shared_ptr<...> getInstanceStrong() and convert uses that are strictly confined to Velox internals, but leave public APIs (which are used by Prestissimo) as is.
  2. Upgrade Velox version in Presto C++ and convert all references to OutputBufferManager to use getInstanceStrong(). This ensures that we don't have references to the old method signature anymore.
  3. Make old getInstance() to also return std::shared_ptr<...> (so getInstance() and getInstanceStrong() become the same thing).
  4. Upgrade Velox version in Presto C++ once again and convert all uses of getInstanceStrong() to getInstance().
  5. Now we are free to finally drop the getInstanceStrong() method.

Cool! Thanks for the idea @ManManson. I have updated this PR to the state of 1) as you suggested. Please re-review.

yingsu00 avatar Nov 13 '24 22:11 yingsu00

@ManManson @Yuhta Will you folks be able to review this PR again?

yingsu00 avatar Nov 15 '24 06:11 yingsu00

Thank you @ManManson ! @Yuhta @mbasmanova Do you have other comments? If not I'm marking this PR to be "ready-to-merge" again. Once this one is merged I'll send follow up PRs as @ManManson suggested.

yingsu00 avatar Nov 19 '24 03:11 yingsu00

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Dec 04 '24 12:12 facebook-github-bot

@yingsu00 Ying, would you rebase so we can try landing this PR?

mbasmanova avatar Dec 04 '24 13:12 mbasmanova

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Dec 04 '24 21:12 facebook-github-bot

@yingsu00 Thank you for rebasing. Seeing CI failures. Would you take a look?

mbasmanova avatar Dec 04 '24 23:12 mbasmanova

Details

@mbasmanova Thanks for importing it! The test failures seems to be caused by https://github.com/facebookincubator/velox/pull/11737 which was exported and not reviewed. https://github.com/facebookincubator/velox/pull/11743 is supposed to fix it. We can wait it to be merged first then I'll rebase again.

yingsu00 avatar Dec 05 '24 01:12 yingsu00

@mbasmanova has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

facebook-github-bot avatar Dec 05 '24 21:12 facebook-github-bot

CI is still red.

mbasmanova avatar Dec 05 '24 23:12 mbasmanova

hi @mbasmanova All tests pass now except the Facebook internal ones:

Facebook Internal - Builds & Tests: 3 warning_suppressed, 3 warning, 6 warning_preexisting, 94 warning_nosignal, 893 advice_no_signal 3 warning_suppressed, 3 warning, 6 warning_preexisting, 94 warning_nosignal, 893 advice_no_signal internally at Meta. warning: infer-checkers-fbcode-diff-analysis-with-buck

I'm unable to view them though. Would it be possible to show me what the warnings are? I can fix them once I know what's wrong. Thank you!

yingsu00 avatar Dec 06 '24 22:12 yingsu00

@yingsu00 I just noticed this has been sitting here a while, could you rebase your change? I can reimport it then and see what issues if any there are in our internal tests.

kevinwilfong avatar Jan 03 '25 21:01 kevinwilfong

@yingsu00 Can you fix conflicts if you are still interested in landing this change ?

kgpai avatar Feb 12 '25 18:02 kgpai

CI has some failures. I wonder if they are related. @yingsu00 can you please confirm?

majetideepak avatar Feb 18 '25 23:02 majetideepak

@majetideepak There were unrelated issues in the CI pipeline. It is clean now.

czentgr avatar Feb 20 '25 22:02 czentgr

@pedroerp can you please help merge? Thanks.

majetideepak avatar Feb 20 '25 23:02 majetideepak