Make itertomap loading more lazy
Fixes #454
Changes
- Only load from the datapipe until the requested element has been found (see the sketch below)
- Add a test for this behavior
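
A minimal sketch of the idea; the class and attribute names here are illustrative, not the actual `IterToMapConverter` implementation:

```python
# Illustrative sketch of lazy loading: __getitem__ pulls (key, value)
# pairs from the source datapipe only until the requested key appears,
# instead of draining the whole datapipe up front.
class LazyIterToMap:
    def __init__(self, source_datapipe):
        self._map = {}
        self._itr = iter(source_datapipe)  # stays open between lookups

    def __getitem__(self, key):
        if key in self._map:
            return self._map[key]
        for k, v in self._itr:  # resume where the last lookup stopped
            self._map[k] = v
            if k == key:
                return v
        raise KeyError(key)
```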
Some tests in torchvision fail. Apparently a stream isn't being closed, and my implementation of `itertomap`'s `__getitem__()` causes this.
I think the stream in question is being created here: https://github.com/pytorch/vision/blob/01ef0a68b6ec00452391251fc16c38e58b92bf07/test/builtin_dataset_mocks.py#L1356
This causes this test to fail: https://github.com/pytorch/vision/blob/01ef0a68b6ec00452391251fc16c38e58b92bf07/test/test_prototype_datasets_builtin.py#L122-L140. There is also a small bug in this test: closing the stream inside the loop changes the dictionary size during iteration, which causes subsequent tests to fail as well because the dict isn't fully empty for the next test (see the sketch below). Fixing that, however, wouldn't solve the issue.
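
A self-contained illustration of that test bug; the class and registry below are made-up stand-ins for the open-stream tracking in the real test:

```python
class TrackedStream:
    registry = {}  # maps id -> stream, like an open-stream tracker

    def __init__(self):
        TrackedStream.registry[id(self)] = self

    def close(self):
        TrackedStream.registry.pop(id(self), None)  # mutates the registry

streams = [TrackedStream() for _ in range(3)]

# Buggy pattern: closing while iterating the dict directly raises
# "RuntimeError: dictionary changed size during iteration", leaving
# the registry non-empty for the next test.
#   for s in TrackedStream.registry.values():
#       s.close()

# Fix: snapshot first, then close; the registry ends up empty.
for s in list(TrackedStream.registry.values()):
    s.close()
assert not TrackedStream.registry
```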
The `cub200` code involved: https://github.com/pytorch/vision/blob/01ef0a68b6ec00452391251fc16c38e58b92bf07/torchvision/prototype/datasets/_builtin/cub200.py#L194-L208
As I don't know how to fix this, I'd be happy if someone else could chime in.
@ejguan I guess you have the most context on the errors, given you fixed pytorch/vision#6997. We just merged pytorch/vision#7403 to make diagnosing this easier; that patch will hit with tomorrow's nightly. Could you also have a look here?
@pmeier Thanks, I think pytorch/vision#7403 does the right job. I will take a look at this PR to see why such an issue happens with it.
I think so as well.
@pmeier Thank you for linking https://github.com/pytorch/vision/pull/6997. I think I know what's happening now!
Forcefully depleting the dp (e.g. by adding a `Prefetcher` with a large enough buffer in `cub200`) solves the issues with the test, but it defeats the purpose of this PR, which is saving space.

Probably the test doesn't retrieve every value from the map, so the map is never fully loaded and the `iter` of the previous dp never "finishes".
How would we address this if we want lazy loading?
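
For concreteness, the workaround would look roughly like this (a sketch with toy data, not the actual `cub200` change):

```python
from torchdata.datapipes.iter import IterableWrapper

# A Prefetcher whose buffer is at least as large as the dataset eagerly
# drains the source datapipe into memory, so the source iterator
# finishes and its file handles get closed. The cost is holding
# everything in the buffer, which defeats the lazy-loading goal.
source = IterableWrapper([(i, str(i)) for i in range(100)])
map_dp = source.prefetch(buffer_size=1_000).to_map_datapipe()
print(map_dp[3])  # one lookup; the prefetch thread drains the rest
```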
> Forcefully depleting the dp (e.g. by adding a `Prefetcher` with a large enough buffer in `cub200`) solves the issues with the test, but it defeats the purpose of this PR, which is saving space.
I think the problem is that the iterator which opens the file handles never reaches the `finally` clause that closes them (e.g. code pointer). The reason depleting works is that it forcefully exits the iterator, which executes the `finally` clause. And that's the reason I think removing `self._itr` would help.
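
A generic Python illustration of that behavior (not torchdata code):

```python
from pathlib import Path

Path("demo.txt").write_text("a\nb\nc\n")

def read_lines(path):
    f = open(path)
    try:
        yield from f
    finally:
        f.close()  # reached only on exhaustion, close(), or finalization
        print("file closed")

it = read_lines("demo.txt")
next(it)    # consume one line; the finally clause has NOT run yet
it.close()  # raises GeneratorExit inside the generator, running finally
```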
@SvenDS9 seems like the test now displays a proper error message: https://github.com/pytorch/data/actions/runs/4384715686/jobs/7676523545#step:9:2098. So the failure is still ongoing, but at least it is now clear what is happening.
Just as a heads-up: feel free to change anything under `torchvision.prototype.datasets` if it doesn't work well with `torchdata`. This is very much work in progress and nothing is set in stone. So if there is a better way to solve the issue we had that unblocks this PR, send a PR to `torchvision` and ping me there.
> Probably the test doesn't retrieve every value from the map, so the map is never fully loaded and the `iter` of the previous dp never "finishes".
I am pretty sure that is what happens. That also explains why removing the reference to the iterator once it's depleted doesn't help: that code is never reached. As the iterator hasn't finished yet, the stream needs to remain open so that more elements can be retrieved from the dp. So in a way this is expected behavior.
I think we should make `load_map()` public so that users can choose to load the entire map in one go. Currently (in the nightly build) this is done the first time an element is requested from the map.
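
Building on the illustrative `LazyIterToMap` sketch from the PR description, a public `load_map()` could look something like this (hypothetical, names illustrative):

```python
class LazyIterToMap:  # continuing the earlier sketch
    ...

    def load_map(self):
        # Drain whatever is left of the source iterator in one go. The
        # iterator then finishes normally, so its finally-based cleanup
        # (e.g. closing file handles) runs, and the reference can be dropped.
        if self._itr is not None:
            for k, v in self._itr:
                self._map[k] = v
            self._itr = None
```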
To fix the tests in torchvision we could either:

1. Somehow make sure that the dp is depleted in the test. This seems non-trivial, as in `cub200` the map isn't used directly; instead `itertomap.__getitem__` is used as a map function for two other dps.
2. Explicitly call `load_map()` in `cub200` at https://github.com/pytorch/vision/blob/7d2acaa7d7fc600fa08fca18e9230f8651147025/torchvision/prototype/datasets/_builtin/cub200.py#L200 (see the sketch after this list).
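
Option 2 would then look roughly like this (hedged sketch with toy data; `load_map()` is the proposed public API, not an existing one):

```python
from torchdata.datapipes.iter import IterableWrapper

# Eagerly load the map right after constructing it, so the underlying
# datapipe's iterator is exhausted (and its file handles closed) before
# __getitem__ is used as a map function by the other datapipes.
split_dp = IterableWrapper([("img_001.jpg", 1), ("img_002.jpg", 0)])
split_map_dp = split_dp.to_map_datapipe()
split_map_dp.load_map()  # proposed public API; drains split_dp here
```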
WDYT @ejguan @pmeier ?
Of the two options provided, I strongly favor 2. since 1. would only fix the test, but not the behavior. I'll let @ejguan comment on whether the actual proposal is the way to go here.
That being said, we are currently not actively working on the datasets and thus I'm also ok with 1. to unblock. However, this means if we pick this up again in the future, we need a proper solution then.
So, I guess we need to figure out a way to let users indicate when they are done with a `MapDataPipe`, and then delete/deplete the iterator of the prior DataPipe (it would be better if we could make this automatic).
Another note: adding `__del__` to `itertomap` won't help, because it's only invoked periodically by `gc`. So, at the time of TorchVision's test, there is a chance that `self._itr` has not been removed yet and the file handles are still not closed.
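
A small plain-Python demonstration of why `__del__` timing is unreliable:

```python
import gc

class Holder:
    def __del__(self):
        print("finalized")  # when this runs depends on the GC

h = Holder()
h.self_ref = h  # a reference cycle defers refcount-based finalization
del h           # nothing printed; the cycle keeps the object alive
gc.collect()    # only the (periodic) cycle collector runs __del__ now
```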
Here is my third proposal, which requires changes to PyTorch, TorchData and TorchVision. In PyTorch Core, add a base callback function to all `MapDataPipe`s that is invoked at the end of an epoch. For `cub200`, instead of using `Mapper`, we can rely on `MapKeyZipper` to combine them and then use `merge_fn` to drop the value from `split_dp`. In TorchData, `MapKeyZipper` knows when the `IterDataPipe` ends and then invokes the callback function of the `MapDataPipe`.
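
A rough sketch of that flow; everything below is hypothetical API, with only the names `MapDataPipe`, `MapKeyZipper` and `IterDataPipe` taken from the proposal:

```python
# Hypothetical end-of-epoch callback protocol: the zipper-style iter
# datapipe notifies the map datapipe once its own iteration finishes,
# giving the map a hook to drop/close a partially consumed iterator.
class MapDataPipeBase:
    def _end_of_epoch(self):  # proposed base callback in PyTorch Core
        pass

def map_key_zipper_iter(iter_dp, map_dp, merge_fn):
    try:
        for key, item in iter_dp:
            yield merge_fn(item, map_dp[key])
    finally:
        map_dp._end_of_epoch()  # the zipper knows the IterDataPipe ended
```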
Any suggestions are welcome!