[Nodes] Add a ToDevice node, or combine with pin memory
🚀 The feature
We should add a node that sends batches to the device (probably one at a time). We could either make this a separate node, add it to the pre-fetcher (i.e. always call .to(device) on the head of the queue), or maybe make it part of pin-memory.
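To make the "separate node" option concrete, here is a minimal sketch of a background-thread transfer stage. It is not an existing torchdata API: the class name `ToDevicePrefetcher`, the `to_device` callable, and the `prefetch` parameter are all hypothetical, and the sketch uses only the standard library so the transfer function can be swapped freely.

```python
import queue
import threading
from typing import Any, Callable, Iterable, Iterator


class ToDevicePrefetcher:
    """Hypothetical sketch: applies `to_device` to each batch in a background thread."""

    def __init__(
        self,
        source: Iterable[Any],
        to_device: Callable[[Any], Any],
        prefetch: int = 2,
    ) -> None:
        self.source = source
        self.to_device = to_device  # assumed: callable that moves one batch
        self.prefetch = prefetch    # max transferred batches held in the queue

    def __iter__(self) -> Iterator[Any]:
        q: queue.Queue = queue.Queue(maxsize=self.prefetch)

        def worker() -> None:
            try:
                for batch in self.source:
                    # The transfer happens here, off the consumer's thread.
                    q.put(("item", self.to_device(batch)))
                q.put(("done", None))
            except Exception as exc:
                # Forward errors so they surface in the consumer's thread.
                q.put(("error", exc))

        # Daemon thread: if the consumer stops iterating early, the worker
        # stays blocked on q.put until the process exits (not handled here).
        threading.Thread(target=worker, daemon=True).start()

        while True:
            kind, value = q.get()
            if kind == "done":
                return
            if kind == "error":
                raise value
            yield value
```

With PyTorch, `to_device` could be something like `lambda b: b.to("cuda", non_blocking=True)`; the non-blocking copy only overlaps with compute when the source tensor is in pinned memory, which is one argument for combining this with pin-memory.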
Motivation, pitch
Sending data to the device can be slow, and users often want this done in a background thread. DataLoader should do this in the background, as it consolidates state management.
Alternatives
No response
Additional context
No response
I wonder how different this is from doing the transfer within a Mapper, similar to a collate_fn calling tensor.to(device).
For cases where we have multiple threads reading data, we might be able to create multiple thread-local CUDA streams to transfer data onto the GPU. WDYT @andrewkho ?
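A rough sketch of the thread-local stream idea, written against the standard library only. The helper name `ThreadLocalStreams` and the `stream_factory` parameter are hypothetical; the factory stands in for whatever creates a stream.

```python
import threading
from typing import Any, Callable


class ThreadLocalStreams:
    """Hypothetical helper: lazily creates one stream object per calling thread."""

    def __init__(self, stream_factory: Callable[[], Any]) -> None:
        self._factory = stream_factory   # assumed: zero-argument stream constructor
        self._local = threading.local()  # separate attribute storage per thread

    def get(self) -> Any:
        stream = getattr(self._local, "stream", None)
        if stream is None:
            # First call from this thread: create and cache its own stream.
            stream = self._factory()
            self._local.stream = stream
        return stream
```

With PyTorch, the factory could be `torch.cuda.Stream`, and each reader thread would issue its `.to(device, non_blocking=True)` inside `with torch.cuda.stream(streams.get()):`. The consumer would then need to order its work after the copy, for example with `torch.cuda.current_stream().wait_stream(stream)`, before touching the tensor. Whether this actually improves throughput is an open question here.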