Erjia Guan

Results 170 comments of Erjia Guan

So, I guess we need to figure out a way to let users indicate when they are done with `MapDataPipe`, and then delete/deplete the iterator of the prior DataPipe (it would...

`replicable` means the `DataPipe` can be copied multiple times for multiprocessing workers. If it's not, it will either be kept in a dispatching process when `ShardingRoundRobinDispatcher` is used or kept...
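To illustrate the dispatching behavior described above, here is a minimal pure-Python sketch (not torchdata's actual implementation): when a DataPipe cannot be replicated per worker, a single dispatching process hands items out to workers in round-robin order instead. The function name `round_robin_dispatch` is hypothetical.

```python
def round_robin_dispatch(source, num_workers):
    """Yield (worker_id, item) pairs, mimicking how a single dispatching
    process hands items to workers in turn when the source pipe cannot
    be copied into each worker."""
    for i, item in enumerate(source):
        yield i % num_workers, item

assignments = list(round_robin_dispatch(range(6), num_workers=2))
# worker 0 receives items 0, 2, 4; worker 1 receives items 1, 3, 5
```

Each item is produced exactly once in the dispatching process, which is why the non-replicable pipe must live there rather than being duplicated across workers.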

The problem with `ShardingRoundRobinDispatcher` is that it currently only supports `SHARDING_PRIORITY.MULTIPROCESSING`.

Here are a few things I have in mind to help users find this problem more easily: - First, add explicit documentation about it, and add instructions to use `weakref` to wrap...
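As a quick sketch of the `weakref` suggestion above: wrapping a reference in `weakref.ref` means the wrapped object can still be garbage-collected once no other owner holds it, which avoids keeping a DataPipe graph node alive unintentionally. The class name `DataPipeGraphNode` below is a hypothetical stand-in.

```python
import weakref

class DataPipeGraphNode:
    """Hypothetical stand-in for a node in a DataPipe graph."""
    def __init__(self, name):
        self.name = name

node = DataPipeGraphNode("source")
ref = weakref.ref(node)        # weak reference: does not keep `node` alive
assert ref() is node           # resolves while a strong reference exists
del node                       # drop the last strong reference
assert ref() is None           # object collected; weakref no longer resolves
```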

A little bit of context on the old DataLoader: it always tries to collate samples into a Tensor via `collate_fn`. Therefore, it would help reduce the overhead of transmitting samples from a worker process to...
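A minimal pure-Python sketch of the shape of work a `collate_fn` does (the real `torch.utils.data.default_collate` additionally stacks the results into Tensors, which is what makes transfer between processes cheaper): it transposes a list of per-sample pairs into batched columns. `simple_collate` is a hypothetical name for illustration.

```python
def simple_collate(batch):
    # Transpose a list of (feature, label) samples into
    # (batch of features, batch of labels).
    features, labels = zip(*batch)
    return list(features), list(labels)

samples = [([1.0, 2.0], 0), ([3.0, 4.0], 1)]
features, labels = simple_collate(samples)
# features == [[1.0, 2.0], [3.0, 4.0]], labels == [0, 1]
```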

> This resulted in a degradation of the performance to the single-threaded case which lets me believe that my main performance overhead right now is actually the `collate`. I am...

Related to https://github.com/pytorch/pytorch/issues/96975. We should allow users to provide a custom sharding DataPipe. Will send a PR shortly.
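For context, the core of a sharding DataPipe can be sketched in a few lines of plain Python (this is an illustrative generator, not the interface a custom sharding DataPipe would actually implement): each shard keeps every `num_shards`-th item, offset by its own `shard_id`.

```python
def shard(source, num_shards, shard_id):
    """Minimal sharding sketch: keep every num_shards-th item,
    starting at shard_id, so shards partition the stream."""
    for i, item in enumerate(source):
        if i % num_shards == shard_id:
            yield item

# Three shards over ten items partition the stream without overlap.
parts = [list(shard(range(10), 3, s)) for s in range(3)]
# parts == [[0, 3, 6, 9], [1, 4, 7], [2, 5, 8]]
```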

> Hence, storing tensor data in a (potentially large file) to share it between processes and to improve reading time? Correct. This is inspired by `tensordict` to help accelerate MP.
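A stdlib-only sketch of the idea, assuming plain float64 data in a temporary file (the real approach would memory-map tensor storage directly): write the data to a file once, then `mmap` it, so multiple processes can map the same file and read without pickling or copying the payload.

```python
import mmap
import os
import struct
import tempfile

# Write float64 values to a backing file once.
values = [0.5, 1.5, 2.5]
path = os.path.join(tempfile.mkdtemp(), "shared.bin")
with open(path, "wb") as f:
    f.write(struct.pack(f"{len(values)}d", *values))

# Any process can map the same file read-only; the OS page cache
# shares the underlying memory instead of each reader holding a copy.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        loaded = list(struct.unpack(f"{len(values)}d", mm[:]))
```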

@NivekT Could you please change the colab link to the new one?