[GraphBolt] Make unique_and_compact deterministic

Open RamonZhou opened this issue 11 months ago • 7 comments

Description

Doc: https://docs.google.com/document/d/1fWX6GhQySniFCSmXnKw8mxZQ26WngD5jEi8vt2H41Ss

Checklist

Please feel free to remove inapplicable items for your PR.

  • [ ] The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
  • [ ] I've leveraged the tools to beautify the Python and C++ code.
  • [ ] The PR is complete and small. Read the Google eng practice (CL equals to PR) to understand more about small PRs. In DGL, we consider PRs with fewer than 200 lines of core code change to be small (examples, tests, and documentation can be exempted).
  • [ ] All changes have test coverage
  • [ ] Code is well-documented
  • [ ] To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
  • [ ] Related issue is referred in this PR
  • [ ] If the PR is for a new model/paper, I've updated the example index here.

Changes

RamonZhou avatar Mar 15 '24 07:03 RamonZhou

To trigger regression tests:

  • @dgl-bot run [instance-type] [which tests] [compare-with-branch]; For example: @dgl-bot run g4dn.4xlarge all dmlc/master or @dgl-bot run c5.9xlarge kernel,api dmlc/master

dgl-bot avatar Mar 15 '24 07:03 dgl-bot

Commit ID: 498b41563f65d3bb95daf14ac76ec0790069cd33

Build ID: 1

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

dgl-bot avatar Mar 15 '24 07:03 dgl-bot

LGTM, as per our discussion with @peizhou001.

mfbalin avatar Mar 15 '24 07:03 mfbalin

How much slower is this new implementation compared to the old one?

mfbalin avatar Mar 15 '24 07:03 mfbalin

Commit ID: b2242bc596c7f168337b3a296b57b1a6c1dbd2b7

Build ID: 2

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

dgl-bot avatar Mar 15 '24 07:03 dgl-bot

@mfbalin The time complexity should be the same. I also tested on my local machine; there's no significant slowdown in any of the pinned-cuda, cpu-cuda, and cuda-cuda modes.

RamonZhou avatar Mar 15 '24 08:03 RamonZhou

@RamonZhou Could you also test the advanced pyg example with cpu-pinned-cuda? That one stresses the CPU sampling and unique_and_compact the most.

mfbalin avatar Mar 15 '24 15:03 mfbalin

@mfbalin I tested it and the average epoch time is 7.94s before and 7.99s after.

RamonZhou avatar Mar 18 '24 05:03 RamonZhou

Thank you. Let's test it again before we merge it. The changes during the review process may impact performance.

mfbalin avatar Mar 18 '24 05:03 mfbalin

We could also add an option to choose which path to take, depending on whether the user needs deterministic results.

frozenbugs avatar Mar 18 '24 06:03 frozenbugs

Commit ID: 9d86ab56849b587abfd77ebc3220ab28d473633e

Build ID: 3

Status: ⚪️ CI test cancelled due to overrun.

Report path: link

Full logs path: link

dgl-bot avatar Mar 18 '24 07:03 dgl-bot

Commit ID: e4cfb3a617cb8c286a14eaf5b345bdc799aeb9c2

Build ID: 4

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

dgl-bot avatar Mar 18 '24 08:03 dgl-bot

Maybe not? Because it keeps the same speed while making the result stable, there is no reason a user would want non-deterministic results without a performance benefit.

peizhou001 avatar Mar 18 '24 09:03 peizhou001

@RamonZhou how is the performance now on the advanced PyG example?

mfbalin avatar Mar 18 '24 13:03 mfbalin

I see around 5% performance loss with this PR on my machine with the advanced PyG example (--mode=cpu-pinned-cuda).

mfbalin avatar Mar 18 '24 17:03 mfbalin

@RamonZhou it might be best to measure the performance with a smaller feature dimension, in case feature fetching is the bottleneck on your system, e.g. features = features[:, :32].

mfbalin avatar Mar 19 '24 02:03 mfbalin

@mfbalin I tested again with the latest master branch merged. It's 7.911s (before) vs. 7.919s (after).

RamonZhou avatar Mar 19 '24 06:03 RamonZhou

Commit ID: c52435d7bc8e2086211ab079d48ab736766cec56

Build ID: 5

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

dgl-bot avatar Mar 19 '24 09:03 dgl-bot

I did another test and the results are (features are sliced with [:, :32]):

  • Old version: 5.817s
  • Current PR (with the if inside the loop): 5.802s
  • Current PR (without the if inside the loop): 5.783s

So it seems that deleting the if might be faster!

RamonZhou avatar Mar 20 '24 04:03 RamonZhou

The fact that all the numbers are so close is suspicious to me. Could we test with no feature fetch and no forward pass?

mfbalin avatar Mar 20 '24 04:03 mfbalin

Without feature fetching and the forward pass: 4.319s (before) vs. 4.707s (after), about 9% slower.

RamonZhou avatar Mar 20 '24 05:03 RamonZhou

Do you want to look into implementing an atomic load and see if it helps before the CAS loop? Or do you want to merge this first and continue optimizing later?

mfbalin avatar Mar 20 '24 05:03 mfbalin

@mfbalin I tried to implement it, but I think the atomic load can just be a CompareAndSwap. For example, old_val = CompareAndSwap(&(hash_map_data[val_pos]), empty_key, value); is a way to load the value atomically. But in the end it's the same as the current code, since we are doing the same thing in the loop.

RamonZhou avatar Mar 22 '24 02:03 RamonZhou
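The CAS loop being discussed can be sketched roughly as follows. This is an illustration only, not the PR's actual code: the names (InsertKey, hash_map_data, kEmptyKey) are made up, and the real implementation in GraphBolt differs in its details. The key point for determinism is that a slot's winner is decided by key value and probe order, not by which thread happens to arrive first.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Sentinel for an unoccupied slot (illustrative; not DGL's actual constant).
constexpr int64_t kEmptyKey = -1;

// Linear-probing insert: atomically claim the first empty slot for `key`.
// Returns the slot index holding `key`, whether this call inserted it or an
// earlier call (possibly from another thread) already had.
int64_t InsertKey(int64_t* hash_map_data, size_t capacity, int64_t key) {
  size_t pos = static_cast<size_t>(key) % capacity;
  while (true) {
    // CAS: if the slot still holds kEmptyKey, replace it with `key` and get
    // kEmptyKey back; otherwise get whatever key already occupies the slot.
    int64_t old_val =
        __sync_val_compare_and_swap(&hash_map_data[pos], kEmptyKey, key);
    if (old_val == kEmptyKey || old_val == key) {
      return static_cast<int64_t>(pos);
    }
    pos = (pos + 1) % capacity;  // slot taken by a different key: keep probing
  }
}
```

Because the resulting key-to-slot mapping depends only on the key values, a later pass over the table can assign compact ids in a reproducible order.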

https://gcc.gnu.org/onlinedocs/gcc-4.1.0/gcc/Atomic-Builtins.html You can use __sync_fetch_and_or with the second argument set to 0 to perform an atomic load. CompareAndSwap is one of the most expensive atomic operations.

mfbalin avatar Mar 22 '24 02:03 mfbalin
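The suggested optimization can be sketched like this (again with illustrative names, not the PR's actual code): do a cheap atomic load first via __sync_fetch_and_or(ptr, 0), and only pay for the CAS when the slot looks empty. On repeated keys, the common case in sampled neighborhoods, the fast path skips the CAS entirely.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Sentinel for an unoccupied slot (illustrative; not DGL's actual constant).
constexpr int64_t kEmptyKey = -1;

// Linear-probing insert with a load-first fast path. Returns the slot index
// holding `key`.
int64_t InsertKeyWithLoad(int64_t* hash_map_data, size_t capacity,
                          int64_t key) {
  size_t pos = static_cast<size_t>(key) % capacity;
  while (true) {
    // Atomic load: OR-ing with 0 returns the current value unchanged.
    int64_t current = __sync_fetch_and_or(&hash_map_data[pos], 0);
    if (current == key) return static_cast<int64_t>(pos);  // no CAS needed
    if (current == kEmptyKey) {
      // Slot looks empty: try to claim it; another thread may race us.
      int64_t old_val =
          __sync_val_compare_and_swap(&hash_map_data[pos], kEmptyKey, key);
      if (old_val == kEmptyKey || old_val == key) {
        return static_cast<int64_t>(pos);
      }
      // Lost the race to a different key; fall through and probe.
    }
    pos = (pos + 1) % capacity;
  }
}
```

Whether this helps in practice depends on how often slots are revisited; the thread's measurements above suggest the gain was negligible in this workload.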

I see. I tried __sync_fetch_and_or and tested; the performance is about the same (4.717s).

RamonZhou avatar Mar 22 '24 03:03 RamonZhou

Commit ID: 4104c82bf59ea48afc3883e5abe1affd0a22328a

Build ID: 6

Status: ✅ CI test succeeded.

Report path: link

Full logs path: link

dgl-bot avatar Mar 25 '24 08:03 dgl-bot

Let's merge this first and optimize in the future.

RamonZhou avatar Mar 25 '24 08:03 RamonZhou