[GraphBolt] Make unique_and_compact deterministic
Description
Doc: https://docs.google.com/document/d/1fWX6GhQySniFCSmXnKw8mxZQ26WngD5jEi8vt2H41Ss
Checklist
Please feel free to remove inapplicable items for your PR.
- [ ] The PR title starts with [$CATEGORY] (such as [NN], [Model], [Doc], [Feature])
- [ ] I've leveraged the tools to beautify the Python and C++ code.
- [ ] The PR is complete and small; read the Google eng practice (CL equals PR) to understand more about small PRs. In DGL, we consider PRs with fewer than 200 lines of core code change to be small (examples, tests, and documentation can be exempted).
- [ ] All changes have test coverage
- [ ] Code is well-documented
- [ ] To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change
- [ ] Related issue is referred in this PR
- [ ] If the PR is for a new model/paper, I've updated the example index here.
Changes
To trigger regression tests:
- `@dgl-bot run [instance-type] [which tests] [compare-with-branch]`; for example, `@dgl-bot run g4dn.4xlarge all dmlc/master` or `@dgl-bot run c5.9xlarge kernel,api dmlc/master`
Commit ID: 498b41563f65d3bb95daf14ac76ec0790069cd33
Build ID: 1
Status: ⚪️ CI test cancelled due to overrun.
Report path: link
Full logs path: link
LGTM, as per our discussion with @peizhou001.
How much slower is this new implementation compared to the old one?
Commit ID: b2242bc596c7f168337b3a296b57b1a6c1dbd2b7
Build ID: 2
Status: ✅ CI test succeeded.
Report path: link
Full logs path: link
> How much slower is this new implementation compared to the old one?

@mfbalin The time complexity should be the same. I also tested on my local machine; there's no significant slowdown in any of the `pinned-cuda`, `cpu-cuda`, and `cuda-cuda` modes.
@RamonZhou Could you also test the advanced PyG example with `cpu-pinned-cuda`? That one stresses the CPU sampling and `unique_and_compact` the most.
@mfbalin I tested it and the average epoch time is 7.94s before and 7.99s after.
Thank you. Let's test it again before we merge it. The changes during the review process may impact performance.
We may also add an option to decide which path to take, depending on whether the user needs deterministic results or not.
Commit ID: 9d86ab56849b587abfd77ebc3220ab28d473633e
Build ID: 3
Status: ⚪️ CI test cancelled due to overrun.
Report path: link
Full logs path: link
Commit ID: e4cfb3a617cb8c286a14eaf5b345bdc799aeb9c2
Build ID: 4
Status: ✅ CI test succeeded.
Report path: link
Full logs path: link
> We may also add an option to decide which path to take, depending on whether the user needs deterministic results or not.

Maybe not? Since it keeps the same speed while making the result stable, there is no reason a user would want non-deterministic results without a performance benefit.
@RamonZhou how is the performance now on the advanced PyG example?
I see around 5% performance loss with this PR on my machine with the advanced PyG example and `--mode=cpu-pinned-cuda`.
@RamonZhou it might be best to measure the performance by making the feature dimension smaller, in case the feature fetch is the bottleneck on your system, e.g. `features = features[:, :32]`.
@mfbalin I tested again with the latest master branch merged. It's 7.911s (before) vs. 7.919s (after)
Commit ID: c52435d7bc8e2086211ab079d48ab736766cec56
Build ID: 5
Status: ✅ CI test succeeded.
Report path: link
Full logs path: link
I did another test; the results (features sliced with `[:, :32]`) are:
- Old version: 5.817s
- Current PR (with the `if` inside the loop): 5.802s
- Current PR (without the `if` inside the loop): 5.783s

So it seems that deleting the `if` might be faster!
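For reference, a minimal sketch of the two insert variants being compared, written with GCC's atomic builtins; the names (`kEmptyKey`, `InsertWithIf`, `InsertWithoutIf`) are hypothetical, and the real kernel lives in GraphBolt's C++ sources with its own wrappers:

```cpp
#include <cstdint>

// Hypothetical empty-slot sentinel; the actual hash map uses its own.
constexpr int64_t kEmptyKey = -1;

// Variant A ("with the if"): peek at the slot with a plain (non-atomic)
// read first, and only issue the expensive CAS when it still looks empty.
inline void InsertWithIf(int64_t* slot, int64_t key) {
  if (*slot == kEmptyKey) {
    __sync_val_compare_and_swap(slot, kEmptyKey, key);
  }
}

// Variant B ("without the if"): issue the CAS unconditionally and let it
// fail when another thread has already filled the slot.
inline void InsertWithoutIf(int64_t* slot, int64_t key) {
  __sync_val_compare_and_swap(slot, kEmptyKey, key);
}
```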
The fact that all the numbers are so close is suspicious to me. Could we test with no feature fetch and no forward pass?
Without feature fetching and the forward pass: 4.319s (before) vs. 4.707s (after), about 9% slower.
Do you want to look into implementing an atomic load and see if it helps before the CAS loop? Or do you want to merge this first and continue to optimize later?
@mfbalin I tried to implement it, but I think the atomic load can just be a `CompareAndSwap`. For example, `old_val = CompareAndSwap(&(hash_map_data[val_pos]), empty_key, value);` is a way to load it atomically. But in the end it's the same as the current code, since we are doing the same thing in the loop.
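To illustrate the point (a hedged sketch, not the GraphBolt code): a CAS returns the value the slot held before the operation, so an insertion attempt doubles as an atomic load.

```cpp
#include <cstdint>
#include <cstdio>

int main() {
  const int64_t empty_key = -1;  // hypothetical empty-slot sentinel
  int64_t slot = -1;
  // __sync_val_compare_and_swap returns the slot's previous contents,
  // so even a failed insert tells us what was stored there.
  int64_t old_val = __sync_val_compare_and_swap(&slot, empty_key, 42);
  std::printf("old=%lld, now=%lld\n", (long long)old_val, (long long)slot);
  return 0;
}
```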
https://gcc.gnu.org/onlinedocs/gcc-4.1.0/gcc/Atomic-Builtins.html You can use `__sync_fetch_and_or` with the 2nd argument as 0 to perform an atomic load operation. `CompareAndSwap` is one of the most expensive atomic operations.
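A minimal sketch of that suggestion, assuming a 64-bit slot (names like `AtomicLoad` and `TryInsert` are hypothetical, not the PR's actual helpers):

```cpp
#include <cstdint>

// ORing in zero leaves the value unchanged and returns the previous
// contents: an atomic load built from a cheaper read-modify-write.
inline int64_t AtomicLoad(int64_t* ptr) {
  return __sync_fetch_and_or(ptr, 0);
}

// Possible use before the CAS loop: skip the expensive CAS entirely when
// the slot is already occupied (sketch only; empty_key is a sentinel).
inline bool TryInsert(int64_t* slot, int64_t empty_key, int64_t key) {
  if (AtomicLoad(slot) != empty_key) return false;  // occupied: no CAS needed
  return __sync_val_compare_and_swap(slot, empty_key, key) == empty_key;
}
```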
I see. I tried `__sync_fetch_and_or` and tested. The performance is about the same (4.717s).
Commit ID: 4104c82bf59ea48afc3883e5abe1affd0a22328a
Build ID: 6
Status: ✅ CI test succeeded.
Report path: link
Full logs path: link
Let's merge this first and optimize in the future.