Using lance with PyTorch dataloaders
Hello,
I am looking at Lance for a PyTorch dataloader. I am having issues with a Lance-based loader (like this one: https://lancedb.github.io/lance/examples/llm_training.html) when using it in a distributed setting. Two questions:
1 - Was the provided example ever tested in a distributed (multi-GPU) setting?
2 - Has anyone gotten it to work in a distributed (multi-GPU) setting?
I am using torchrun to launch the training job. An almost identical loader works with an in-memory CSV. It seems to hang at the point where the Lance dataset is instantiated:

```python
ds = lance.dataset(input_filename)
```
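For context, the loader is essentially the map-style Dataset from that example; a simplified sketch (the column name here is a placeholder):

```python
import lance
import torch
from torch.utils.data import Dataset

class LanceTextDataset(Dataset):
    """Map-style dataset backed by a Lance dataset on disk."""

    def __init__(self, uri):
        # This is the call that hangs when launched via torchrun
        self.ds = lance.dataset(uri)
        self.num_rows = self.ds.count_rows()

    def __len__(self):
        return self.num_rows

    def __getitem__(self, idx):
        # take() returns a pyarrow Table with the requested rows;
        # "input_ids" is a placeholder column name
        batch = self.ds.take([idx], columns=["input_ids"])
        return torch.tensor(batch["input_ids"][0].as_py())
```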
Finally, is there some other way I should be using lance for data loading?
Thanks
@jn2clark This example is only for single-GPU training; however, we are working on multi-GPU dataloader support.
We will investigate your issue of the dataloader hanging at dataset instantiation and get back to you. Thanks a lot for reporting it!
Also, we now have a dedicated repository for Deep learning recipes using Lance: https://github.com/lancedb/lance-deeplearning-recipes
Thanks @tanaymeh, that would be great.
@jn2clark Could you set the spawn start method

```python
from multiprocessing import set_start_method

set_start_method("spawn")
```

before running the PyTorch loader?
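A minimal placement sketch (the `TensorDataset` is just a stand-in for the Lance-backed dataset): the call has to run once in the main module, under the `if __name__ == "__main__":` guard, before the DataLoader creates its workers:

```python
from multiprocessing import set_start_method

import torch
from torch.utils.data import DataLoader, TensorDataset

if __name__ == "__main__":
    # Must be called once, before any worker processes are created
    set_start_method("spawn")

    dataset = TensorDataset(torch.arange(100))  # stand-in dataset
    loader = DataLoader(dataset, batch_size=8, num_workers=2)
    for (batch,) in loader:
        pass  # training step goes here
```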
To add some color here: the PyTorch DataLoader uses multiprocessing, and Python forks processes by default. The CUDA context in PyTorch is not compatible with fork, so we need to use spawn in a multi-GPU DDP (distributed data parallel) environment; see the PyTorch multiprocessing docs. Additionally, Lance doesn't fork well either.
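If setting the start method globally is awkward (for example, it has already been set elsewhere), PyTorch also lets you scope spawn to a single loader via the standard `multiprocessing_context` argument of `DataLoader`; a sketch, again with a stand-in dataset:

```python
import torch
import torch.multiprocessing as mp
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.arange(100))  # stand-in for a Lance-backed dataset

# Only this loader's workers are spawned; the global start method is untouched.
loader = DataLoader(
    dataset,
    batch_size=8,
    num_workers=2,
    multiprocessing_context=mp.get_context("spawn"),
)
```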
Thanks for the suggestions. I found an alternative, but I would like to try to get this working, at least as a benchmark to compare against. I can try next week.
@jn2clark Any updates? I'm also looking for ways to use multi-GPU.
@jn2clark @baorepo I added an example to the Lance deep learning recipes repo about training GPT-2 using the FSDP strategy. It might be useful: https://github.com/lancedb/lance-deeplearning-recipes/tree/main/examples/fsdp-llm-pretraining