Make itertomap loading more lazy
Fixes #454
Changes
- Only load from the datapipe until the requested element has been found (see the sketch below)
- Add a test for this behavior
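
A minimal sketch of the idea; the class and attribute names here are illustrative, not the actual `IterToMapConverter` implementation:

```python
# Illustrative sketch of lazy loading: __getitem__ pulls (key, value)
# pairs from the source datapipe only until the requested key appears,
# instead of draining the whole datapipe up front.
class LazyIterToMap:
    def __init__(self, source_datapipe):
        self._map = {}
        self._itr = iter(source_datapipe)  # stays open between lookups

    def __getitem__(self, key):
        if key in self._map:
            return self._map[key]
        for k, v in self._itr:  # resume where the last lookup stopped
            self._map[k] = v
            if k == key:
                return v
        raise KeyError(key)
```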
Some tests in torchvision fail. Apparently a stream isn't being closed, and my implementation of `itertomap`'s `__getitem__()` causes this.
I think the stream in question is being created here: https://github.com/pytorch/vision/blob/01ef0a68b6ec00452391251fc16c38e58b92bf07/test/builtin_dataset_mocks.py#L1356
This causes this test to fail: https://github.com/pytorch/vision/blob/01ef0a68b6ec00452391251fc16c38e58b92bf07/test/test_prototype_datasets_builtin.py#L122-L140. There is also a small bug in this test: closing the stream inside the loop changes the dictionary size during iteration, which causes subsequent tests to fail as well because the dict isn't fully empty for the next test (see the sketch below). Fixing that, however, wouldn't solve the issue.
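
A self-contained illustration of that test bug; the class and registry below are made-up stand-ins for the open-stream tracking in the real test:

```python
class TrackedStream:
    registry = {}  # maps id -> stream, like an open-stream tracker

    def __init__(self):
        TrackedStream.registry[id(self)] = self

    def close(self):
        TrackedStream.registry.pop(id(self), None)  # mutates the registry

streams = [TrackedStream() for _ in range(3)]

# Buggy pattern: closing while iterating the dict directly raises
# "RuntimeError: dictionary changed size during iteration", leaving
# the registry non-empty for the next test.
#   for s in TrackedStream.registry.values():
#       s.close()

# Fix: snapshot first, then close; the registry ends up empty.
for s in list(TrackedStream.registry.values()):
    s.close()
assert not TrackedStream.registry
```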
The `cub200` code involved: https://github.com/pytorch/vision/blob/01ef0a68b6ec00452391251fc16c38e58b92bf07/torchvision/prototype/datasets/_builtin/cub200.py#L194-L208
As I don't know how to fix this, I'd be happy if someone else could chime in.
@ejguan I guess you have the most context on the errors, given you fixed pytorch/vision#6997. We just merged pytorch/vision#7403 to make diagnosing this easier; that patch will hit with tomorrow's nightly. Could you also have a look here?
@pmeier Thanks, I think pytorch/vision#7403 does the right job. I will take a look at this PR to see why such an issue happens with it.
I think so as well.
@pmeier Thank you for linking https://github.com/pytorch/vision/pull/6997. I think I know what's happening now!
Forcefully depleting the dp (e.g. by adding a `Prefetcher` with a large enough buffer in `cub200`) solves the issues with the test, but it defeats the purpose of this PR, which is saving space.

Probably the test doesn't retrieve every value from the map, so the map is never fully loaded and the `iter` of the previous dp never "finishes".
How would we address this if we want lazy loading?
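
For concreteness, the workaround would look roughly like this (a sketch with toy data, not the actual `cub200` change):

```python
from torchdata.datapipes.iter import IterableWrapper

# A Prefetcher whose buffer is at least as large as the dataset eagerly
# drains the source datapipe into memory, so the source iterator
# finishes and its file handles get closed. The cost is holding
# everything in the buffer, which defeats the lazy-loading goal.
source = IterableWrapper([(i, str(i)) for i in range(100)])
map_dp = source.prefetch(buffer_size=1_000).to_map_datapipe()
print(map_dp[3])  # one lookup; the prefetch thread drains the rest
```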
> Forcefully depleting the dp (e.g. by adding a `Prefetcher` with a large enough buffer in `cub200`) solves the issues with the test, but it defeats the purpose of this PR, which is saving space.
I think the problem is that the iterator which opens the file handles never reaches the `finally` clause that closes them (e.g. code pointer). The reason depleting works is that it forcefully exits the iterator, which executes the `finally` clause. And that's the reason I think removing `self._itr` would help.
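
A generic Python illustration of that behavior (not torchdata code):

```python
from pathlib import Path

Path("demo.txt").write_text("a\nb\nc\n")

def read_lines(path):
    f = open(path)
    try:
        yield from f
    finally:
        f.close()  # reached only on exhaustion, close(), or finalization
        print("file closed")

it = read_lines("demo.txt")
next(it)    # consume one line; the finally clause has NOT run yet
it.close()  # raises GeneratorExit inside the generator, running finally
```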
@SvenDS9 seems like the test now displays a proper error message: https://github.com/pytorch/data/actions/runs/4384715686/jobs/7676523545#step:9:2098. So the failure is still ongoing, but at least it is now clear what is happening.
Just as a heads-up: feel free to change anything under `torchvision.prototype.datasets` if it doesn't work well with `torchdata`. This is very much work in progress and nothing is set in stone. So if there is a better way to solve the issue we had that unblocks this PR, send a PR to `torchvision` and ping me there.
> Probably the test doesn't retrieve every value from the map, so the map is never fully loaded and the `iter` of the previous dp never "finishes".
I am pretty sure that is what happens. That also explains why removing the reference to the iterator once it's depleted doesn't help: that code is never reached. As the iterator hasn't finished yet, the stream needs to remain open so that more elements can be retrieved from the dp. So in a way this is expected behavior.
I think we should make `load_map()` public so that users can choose to load the entire map in one go. Currently (in the nightly build) this is done the first time an element is requested from the map.
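
Building on the illustrative `LazyIterToMap` sketch from the PR description, a public `load_map()` could look something like this (hypothetical, names illustrative):

```python
class LazyIterToMap:  # continuing the earlier sketch
    ...

    def load_map(self):
        # Drain whatever is left of the source iterator in one go. The
        # iterator then finishes normally, so its finally-based cleanup
        # (e.g. closing file handles) runs, and the reference can be dropped.
        if self._itr is not None:
            for k, v in self._itr:
                self._map[k] = v
            self._itr = None
```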
To fix the tests in torchvision we could either:

1. Somehow make sure that the dp is depleted in the test. This seems non-trivial, as in `cub200` the map isn't used directly; instead `itertomap.__getitem__` is used as a map function for two other dps.
2. Explicitly call `load_map()` in `cub200` at https://github.com/pytorch/vision/blob/7d2acaa7d7fc600fa08fca18e9230f8651147025/torchvision/prototype/datasets/_builtin/cub200.py#L200 (see the sketch after this list).
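
Option 2 would then look roughly like this (hedged sketch with toy data; `load_map()` is the proposed public API, not an existing one):

```python
from torchdata.datapipes.iter import IterableWrapper

# Eagerly load the map right after constructing it, so the underlying
# datapipe's iterator is exhausted (and its file handles closed) before
# __getitem__ is used as a map function by the other datapipes.
split_dp = IterableWrapper([("img_001.jpg", 1), ("img_002.jpg", 0)])
split_map_dp = split_dp.to_map_datapipe()
split_map_dp.load_map()  # proposed public API; drains split_dp here
```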
WDYT @ejguan @pmeier ?
Of the two options provided, I strongly favor 2. since 1. would only fix the test, but not the behavior. I'll let @ejguan comment on whether the actual proposal is the way to go here.
That being said, we are currently not actively working on the datasets and thus I'm also ok with 1. to unblock. However, this means if we pick this up again in the future, we need a proper solution then.
So, I guess we need to figure out a way to let users indicate when they are done with a `MapDataPipe`, and then delete/deplete the iterator of the prior DataPipe (it would be better if we could make this automatic).
Another note: adding `__del__` to `itertomap` won't help, because it's only invoked periodically by `gc`. So, at the time of TorchVision's test, there is a chance that `self._itr` has not been removed yet and the file handles are still not closed.
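
A small plain-Python demonstration of why `__del__` timing is unreliable:

```python
import gc

class Holder:
    def __del__(self):
        print("finalized")  # when this runs depends on the GC

h = Holder()
h.self_ref = h  # a reference cycle defers refcount-based finalization
del h           # nothing printed; the cycle keeps the object alive
gc.collect()    # only the (periodic) cycle collector runs __del__ now
```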
Here is my third proposal, which requires changes to PyTorch, TorchData and TorchVision. In PyTorch Core, add a base callback function to all `MapDataPipe`s that is invoked at the end of an epoch. For `cub200`, instead of using `Mapper`, we can rely on `MapKeyZipper` to combine them and then use `merge_fn` to drop the value from `split_dp`. In TorchData, `MapKeyZipper` knows when the `IterDataPipe` ends and then invokes the callback function of the `MapDataPipe`.
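
A rough sketch of that flow; everything below is hypothetical API, with only the names `MapDataPipe`, `MapKeyZipper` and `IterDataPipe` taken from the proposal:

```python
# Hypothetical end-of-epoch callback protocol: the zipper-style iter
# datapipe notifies the map datapipe once its own iteration finishes,
# giving the map a hook to drop/close a partially consumed iterator.
class MapDataPipeBase:
    def _end_of_epoch(self):  # proposed base callback in PyTorch Core
        pass

def map_key_zipper_iter(iter_dp, map_dp, merge_fn):
    try:
        for key, item in iter_dp:
            yield merge_fn(item, map_dp[key])
    finally:
        map_dp._end_of_epoch()  # the zipper knows the IterDataPipe ended
```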
Any suggestions are welcome!