Mihir Patel

172 comments by Mihir Patel

Hm... I'm not super sure what happens if you are on one GPU -- it might be some error with torch dist initialization that is buried. I unfortunately do not...

@nik-mosaic is this still relevant?

Hm... both work for me using the LLMFoundry [integration](https://github.com/mosaicml/llm-foundry/tree/main/llmfoundry). I would start tracing back from here: https://github.com/tgale96/grouped_gemm/blob/ebeae0bb3ded459886309b2a30410deb16937af4/csrc/grouped_gemm.cu#L250-L253 It's probably also helpful to start by logging shapes, CUDA version, etc. and...

I haven't tried it, so I'm honestly not sure 🤷. I'd recommend trying it out and seeing what happens. I would guess it would be messy given the varying shapes...

Can you provide some more information on what your custom dataloader is? It looks like you are having some trouble running in a distributed setting with your dataloader. Torch dataloaders...

Can you please provide a full trace / logs?

Hm... this is a bit tricky since this would affect all metrics... Two proposed workarounds: 1. Store the metric separately in a callback, which easily gives control over frequency 2. Have code...
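
For the first workaround, a minimal sketch of what "store the metric separately in a callback" could look like. This is plain Python, not the actual trainer API: `PeriodicMetricCallback`, `batch_end`, `compute`, and `interval` are all hypothetical names chosen for illustration, and the real callback hook names would come from whichever framework is in use.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class PeriodicMetricCallback:
    """Hypothetical callback that owns one metric and its logging frequency."""

    name: str
    compute: Callable[[], float]  # user-supplied function returning the metric value
    interval: int = 10            # compute only on every `interval`-th batch
    history: List[Tuple[int, float]] = field(default_factory=list)

    def batch_end(self, batch_idx: int) -> None:
        # Skip batches that are not on the reporting interval, so this
        # metric's cost is decoupled from the other metrics.
        if (batch_idx + 1) % self.interval != 0:
            return
        self.history.append((batch_idx, self.compute()))


# Usage: the training loop (assumed here) calls batch_end after each batch.
cb = PeriodicMetricCallback(name="expensive_metric", compute=lambda: 0.5, interval=4)
for batch_idx in range(10):
    cb.batch_end(batch_idx)
```

Because the callback keeps its own `history`, changing `interval` only affects this one metric and leaves the frequency of every other metric alone.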

@callmekris if you install the same package versions on the older image (`pip freeze > requirements.txt`, copy paste it into the old image, `pip install -r requirements.txt`), can you see...
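
When comparing two images like this, it can also help to diff the two `pip freeze` outputs directly. A rough sketch, assuming both outputs have been saved as text; `parse_freeze` and `diff_freeze` are hypothetical helpers, and only `name==version` lines are handled (editable installs and `pkg @ url` lines are skipped).

```python
from typing import Dict, Optional, Tuple


def parse_freeze(text: str) -> Dict[str, str]:
    """Parse `pip freeze` output into {package: version}, skipping unpinned lines."""
    pins: Dict[str, str] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.lower()] = version
    return pins


def diff_freeze(old: str, new: str) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {package: (old_version, new_version)} for every package that differs."""
    old_pins, new_pins = parse_freeze(old), parse_freeze(new)
    return {
        name: (old_pins.get(name), new_pins.get(name))
        for name in sorted(set(old_pins) | set(new_pins))
        if old_pins.get(name) != new_pins.get(name)
    }


# Example inputs are made up for illustration, not taken from the issue.
old_image = "torch==2.0.1\nnumpy==1.24.0\n"
new_image = "torch==2.1.0\nnumpy==1.24.0\ncomposer==0.16.0\n"
changes = diff_freeze(old_image, new_image)
```

A `None` on either side of a pair means the package is missing from that image.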

Good point! Would you mind opening a PR and tagging me for review?