
Using lance with PyTorch dataloaders

Open jn2clark opened this issue 10 months ago • 7 comments

Hello,

I am looking at Lance for a PyTorch dataloader, but I am having issues with a Lance-based loader (like the one at https://lancedb.github.io/lance/examples/llm_training.html) when using it in a distributed setting. Two questions:

1. Was the provided example ever tested in a distributed (multi-GPU) setting?
2. Has anyone gotten it to work in a distributed (multi-GPU) setting?

I am using torchrun to launch the training job. An almost identical loader works with an in-memory CSV. This one seems to hang at the point where the Lance dataset is instantiated: `ds = lance.dataset(input_filename)`. Finally, is there some other way I should be using Lance for data loading?

Thanks

jn2clark avatar Apr 16 '24 08:04 jn2clark

@jn2clark This example is only for single-GPU training; however, we are working on multi-GPU dataloader support.

We will investigate your issue of the dataloader hanging at dataset instantiation and get back to you. Thanks a lot for reporting it!

Also, we now have a dedicated repository for Deep learning recipes using Lance: https://github.com/lancedb/lance-deeplearning-recipes

tanaymeh avatar Apr 16 '24 08:04 tanaymeh

Thanks @tanaymeh, that would be great.

jn2clark avatar Apr 16 '24 11:04 jn2clark

@jn2clark could you set the spawn start method

    from multiprocessing import set_start_method
    set_start_method("spawn")

before running the PyTorch loader?

eddyxu avatar Apr 24 '24 23:04 eddyxu
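A minimal sketch of where that call would go; the `main` entry point and the placeholder comment are illustrative, not taken from the linked example:

```python
# Set the start method once, at program entry, before any DataLoader
# worker processes are created. force=True avoids a RuntimeError if a
# start method was already set elsewhere (e.g. by a framework).
import multiprocessing as mp


def main():
    mp.set_start_method("spawn", force=True)
    # ... open the Lance dataset and build the PyTorch DataLoader here ...
    print(mp.get_start_method())  # prints "spawn"


if __name__ == "__main__":
    main()
```

The key point is ordering: `set_start_method` must run before any worker processes exist, which is why it belongs at the top of the entry point rather than inside the dataloader code.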

> @jn2clark could you set the spawn start method
>
>     from multiprocessing import set_start_method
>     set_start_method("spawn")
>
> before running the PyTorch loader?

To add some color here:

The PyTorch DataLoader uses multiprocessing, and Python forks processes by default (on Linux).

The CUDA context in PyTorch is not compatible with fork, so we need to use spawn in a multi-GPU DDP (DistributedDataParallel) environment; see the torch doc here.

Additionally, Lance doesn't fork well either.

chebbyChefNEQ avatar Apr 24 '24 23:04 chebbyChefNEQ

Thanks for the suggestions. I found an alternative, but I would still like to get this working, at least as a benchmark for comparison. I can try next week.

jn2clark avatar Jun 07 '24 09:06 jn2clark

@jn2clark Any updates? I'm also looking for ways to use multi-GPU.

baorepo avatar Jul 29 '24 09:07 baorepo

@jn2clark @baorepo I added an example to Lance Deep Learning Recipes on training GPT-2 with the FSDP strategy. It might be useful: https://github.com/lancedb/lance-deeplearning-recipes/tree/main/examples/fsdp-llm-pretraining

tanaymeh avatar Jul 29 '24 09:07 tanaymeh