
DistributedDL missing idxs

Open · kevinbird15 opened this issue · 3 comments

Please confirm you have the latest versions of fastai, fastcore, and nbdev prior to reporting a bug: YES

Describe the bug I believe DistributedDL is currently not behaving properly when the dataset length is not evenly divisible across processes.

To Reproduce Steps to reproduce the behavior:

  1. Run the tests in 20a_distributed
  2. Look at the output from the following test
#hide
dl = TfmdDL(list(range(50)), bs=12, num_workers=2, drop_last=True)
for i in range(4):
    dl1 = DistributedDL(dl, i, 4)
    test_eq(list(dl1), [torch.arange(i*13, i*13+12)%50])

Expected behavior Given a dataloader over items 0-49 with a batch size of 12, I would expect it to be distributed into 4 dataloaders as follows:

dl0 = items 0-11
dl1 = items 12-23
dl2 = items 24-35
dl3 = items 36-47
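For reference, the expected split can be computed directly: with drop_last=True, a 50-item dataset and bs=12 leave 50 // 12 = 4 full batches (48 usable items), i.e. exactly one contiguous 12-item batch per rank. A minimal sketch of that arithmetic (not fastai code, just the expected index math):

```python
import torch

n_items, bs, world_size = 50, 12, 4

# drop_last=True keeps only full batches: 50 // 12 = 4 batches, 48 items.
n_batches = n_items // bs          # 4
usable = n_batches * bs            # 48

# Distribute the full batches contiguously, one per rank.
expected = [torch.arange(rank * bs, rank * bs + bs) for rank in range(world_size)]
for rank, idxs in enumerate(expected):
    print(f"rank {rank}: {idxs.tolist()}")
```

With 4 batches and 4 ranks this lines up exactly; item 48 and 49 are dropped, as they would be on a single process with drop_last=True.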

# current check, which fails with the new code update
for i in range(4):
    print([torch.arange(i*13, i*13+12)%50])

Current Output

[tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])]
[tensor([13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24])]
[tensor([26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37])]
[tensor([39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49,  0])]
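The stride of 13 in the current output appears to come from padding the dataset to a length divisible by world_size: ceil(50/4) = 13 items per rank, after which drop_last trims each rank to a single 12-item batch. A sketch reproducing the observed indices (this is inferred from the output above, not from DistributedDL's actual implementation):

```python
import math
import torch

n_items, bs, world_size = 50, 12, 4

# Pad so every rank sees the same number of items: ceil(50 / 4) = 13.
per_rank = math.ceil(n_items / world_size)   # 13

# Each rank takes a contiguous 13-item slice (wrapping with % 50), and
# drop_last keeps only the first 12 -- hence the stride of 13 between ranks.
current = [torch.arange(rank * per_rank, rank * per_rank + bs) % n_items
           for rank in range(world_size)]
print(current[3].tolist())  # rank 3 wraps past the end: [..., 48, 49, 0]
```

This is why items 12, 25, and 38 are skipped and rank 3 wraps around to item 0.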
# proposed check
for i in range(4):
    print([torch.arange(i*12, i*12+12)%50])

Proposed Output:

[tensor([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11])]
[tensor([12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23])]
[tensor([24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35])]
[tensor([36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47])]

Additional context This came up while fixing an issue with Learner.get_preds: dl.get_idxs was still returning items from beyond the last batch even when drop_last=True, which resulted in an index error. After fixing that, this 20a_distributed test started failing, and after digging in a bit I believe the new behavior should be correct.

kevinbird15 avatar Feb 12 '22 22:02 kevinbird15

If the other fix is right, this test would pass with test_eq(list(dl1), [torch.arange(i*12, i*12+12)%50])

tyoc213 avatar Feb 13 '22 06:02 tyoc213

@kevinbird15 is this fixed yet? I don't think so, since the test is the same and not updated... What needs to be fixed here?

tmabraham avatar May 07 '22 02:05 tmabraham

I think the question is whether there is anything that should be fixed at all. I am still unsure what the expected behavior is supposed to be, since I am not using DistributedDL myself. So the first thing to answer is what fastai's intended behavior is in the scenario outlined above. @marii-moe's opinion was that the current behavior is expected, so there isn't really a bug. At minimum, the two behaviors would need to be compared to see whether changing this actually improves performance, or whether there are cases where one behavior is desired over the other.

kevinbird15 avatar May 11 '22 15:05 kevinbird15