Results 37 comments of chongxiaoc

@serena-ruan PTL 1.6.3 changed the data module hook behavior and broke the Horovod Lightning estimator in the validation step. I've tried a few workarounds before, but unfortunately they didn't work out. Contribution...

Is there a simple reproducer you can provide? For example, a simple toy model with dummy data?

Hi, I'm getting the same issue when using deepspeed 0.10.0 with huggingface transformers.

```
AssertionError: Not enough buffers 0 for swapping 1
    assert len(swap_in_paths)
```

+1. Would like this feature to be supported.

Same here. The model is OpenAssistant/reward-model-deberta-v3-large-v2.

I added the `deepspeed` config below, but it still failed with the same error as above.

```yaml
backend:
  type: ray
  trainer:
    use_gpu: true
    strategy:
      type: deepspeed
      zero_optimization:
        stage: 3
        offload_optimizer:
          device: cpu
          pin_memory: ...
```

Looks like `class 'ludwig.trainers.trainer_llm.NoneTrainer'` is the root cause: it doesn't initialize the distributed backend.
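
For context, DeepSpeed requires a `torch.distributed` process group before training, so a trainer class that skips that step would fail exactly this way. A minimal sketch of the guard a trainer would need before handing off to DeepSpeed (`ensure_process_group` is a hypothetical helper name, and the single-process `gloo` defaults are assumptions for illustration, not Ludwig's actual code):

```python
import os
import torch.distributed as dist


def ensure_process_group(backend: str = "gloo", rank: int = 0, world_size: int = 1) -> None:
    """Initialize torch.distributed if no process group exists yet.

    Hypothetical helper, not part of Ludwig; it only illustrates the
    kind of initialization a trainer must do before DeepSpeed can run.
    In a real launch, RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT come
    from the launcher (e.g. Ray or torchrun); here we default them
    for a single local process.
    """
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    if not dist.is_initialized():
        dist.init_process_group(backend=backend, rank=rank, world_size=world_size)
```

A trainer that calls this (or the framework's equivalent) before building the DeepSpeed engine would avoid the "distributed backend not initialized" class of failures.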