heat
Same numbers for random arrays on different split axes
Description
Currently the random array returned by ht.random.rand is different depending on the split of the tensor. For example:
>>> ht.random.seed(1)
>>> a = ht.random.randn(4, 4, split=0)
>>> ht.random.seed(1)
>>> b = ht.random.randn(16, split=0)
>>> self.assertTrue(np.array_equal(a.numpy().flatten(), b.numpy()))
>>> ht.random.seed(1)
>>> a = ht.random.randn(4, 4, split=1)
>>> self.assertTrue(np.array_equal(a.numpy().flatten(), b.numpy()))
AssertionError([..])
Expected behavior
The counter_sequence function needs to be adapted to take the actual split axis into account in the calculation.
The problem is the following:
The random numbers are generated by creating tuples in an increasing sequence, where the second value is increased until a threshold is reached. After that the first value is increased by one and the second value starts over ((0, 0), (0, 1), (0, 2), ... (0, MAX), (1, 0), (1, 1), ...). These tuples are transformed into "random" numbers by the threefry algorithm. Each tuple produces 2 new numbers that need to be placed next to each other in order to create the random distribution.
This can be illustrated by placing the initial tuples in the shape the final result will have. For a 3x5 matrix this would look like the following:
|(0, 0) (0, 1) (0,| 2)
(0,| 2) (0, 3) (0, 4)|
|(0, 5) (0, 6) (0,| 7)
(The parentheses are just for highlighting the tuples and the vertical lines mark the bounds of the resulting matrix.)
Because of the odd number of elements in each row, some of the tuples need to be reused (the first time a tuple is used, the first of its two generated values is taken; the second time, the second value is taken and the first one is ignored).
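This mapping can be made explicit with a small sketch (plain Python, just for illustration, not heat's internal code): element i of the flattened result comes from counter tuple (0, i // 2) and uses the first of the tuple's two threefry outputs when i is even and the second when i is odd.

rows, cols = 3, 5  # the 3x5 example matrix from above

# For every position show which counter tuple it comes from and which of
# the tuple's two generated values ("lane" 0 or 1) it takes.
for r in range(rows):
    row = [((0, (r * cols + c) // 2), (r * cols + c) % 2) for c in range(cols)]
    print(row)

The first printed row is [((0, 0), 0), ((0, 0), 1), ((0, 1), 0), ((0, 1), 1), ((0, 2), 0)], i.e. tuple (0, 2) is only half used and its second value spills over into the next row, exactly as in the picture above.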
Now these tuples and the resulting random numbers should be created distributed across all available processes. Therefore each process gets an equal share of the final shape and fills it with the random values. In our example, a split along the 0 axis on two processes would result in the following:
Proc 1:
|(0, 0) (0, 1) (0,| 2)
(0,| 2) (0, 3) (0, 4)|
-----------------------------
Proc 2:
|(0, 5) (0, 6) (0,| 7)
In this case each process only needs to know at which offset the tuple sequence should start and how many values are needed.
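That bookkeeping fits in a few lines (a sketch with assumed chunk sizes, not the actual counter_sequence code): because the local part is a contiguous block of rows, it covers a contiguous range of global flat indices, and both the tuple offset and the number of values follow directly from that range.

shape = (3, 5)
rows_per_proc = [2, 1]  # assumed chunking of axis 0 over two processes

start_row = 0
for rank, local_rows in enumerate(rows_per_proc):
    start = start_row * shape[1]                # first global flat index
    stop = (start_row + local_rows) * shape[1]  # one past the last index
    first_tuple = start // 2                    # offset into the tuple sequence
    last_tuple = (stop - 1) // 2                # last tuple this process needs
    print(f"rank {rank}: tuples (0, {first_tuple}) .. (0, {last_tuple}), "
          f"{stop - start} values")
    start_row += local_rows

This prints tuples (0, 0) .. (0, 4) with 10 values for the first process and (0, 5) .. (0, 7) with 5 values for the second, matching the picture above.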
The difficult cases start with split != 0, e.g. a split along the 1 axis on two processes:
Proc 1: // Proc 2:
|(0, 0) (0,| 1) // (0,| 1) (0,| 2)
(0,| 2) (0, 3)| // |(0, 4)|
|(0, 5) (0,| 6) // (0,| 6) (0,| 7)
This is the point where I am stuck. I cannot think of an efficient algorithm which creates the correct tuples for a split axis != 0.
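Just to make explicit what the correct tuples would be, here is a brute-force illustration (not an efficient solution and not what heat currently does): for every locally owned position, compute its global flat index and derive the counter tuple and lane from it, as above.

rows, cols = 3, 5
cols_per_proc = {0: range(0, 3), 1: range(3, 5)}  # assumed chunking of axis 1

for rank, local_cols in cols_per_proc.items():
    print(f"Proc {rank + 1}:")
    for r in range(rows):
        row = [((0, (r * cols + c) // 2), (r * cols + c) % 2) for c in local_cols]
        print(row)

This reproduces the tuples from the split=1 picture, but the tuple indices a process needs are no longer one contiguous range (process 1 needs 0, 1, 2, 3, 5, 6) and some tuples are needed by both processes, which is exactly what makes an efficient vectorised version hard.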
Currently I am treating these cases as if they were the same as a split along the 0 axis. This creates a random distribution but results in different positions of the numbers depending on the initial split axis. In code:
>>> ht.random.seed(1)
>>> a = ht.random.rand(4, 4, split=0)
>>> a = a.numpy().flatten() # Unsplit tensor and flatten values to 1D shape
>>> ht.random.seed(1)
>>> b = ht.random.rand(16, split=None).numpy() # Same number of elements, same alignment
>>> np.array_equal(a, b)
True
>>> ht.random.seed(1)
>>> c = ht.random.rand(4, 4, split=1) # Same number of elements, different alignment
>>> c = c.numpy().flatten()
>>> np.array_equal(a, c)
False
>>> a, b, c = np.sort(a), np.sort(b), np.sort(c) # Reorder elements
>>> np.array_equal(a, b)
True
>>> np.array_equal(a, c)
True
@TheSlimvReal Is this behaviour supposed to occur exclusively in the distributed case? I am unable to reproduce it locally.
@lenablind Yes the problem only occurs if the array is split across multiple processes.
@TheSlimvReal Alright, that makes sense. Thank you for the feedback.
Good luck with the issue, I am curious if you can come up with a solution. I haven't been able to.
Thank you, so am I! I'll keep you up to date about the progress.
This issue is still open. @TheSlimvReal thanks again for the detailed explanation of the problem!
Reviewed within #1109