Mihir Patel comments

Results: 172 comments by Mihir Patel

what devices are supported?

Hm... I'm not super sure what happens if you are on one GPU -- it might be some error with torch dist initialization that is buried. I unfortunately do not...

Fix TRT-LLM Multigpu Compatibility

@nik-mosaic is this still relevant?

Grouped GEMM execution not possible with HW

Hm... both work for me using the LLMFoundry [integration](https://github.com/mosaicml/llm-foundry/tree/main/llmfoundry). I would start tracing back from here: https://github.com/tgale96/grouped_gemm/blob/ebeae0bb3ded459886309b2a30410deb16937af4/csrc/grouped_gemm.cu#L250-L253 It's probably helpful to start by also logging shapes, CUDA version, etc. and...
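For the "logging shapes, CUDA version" suggestion, something along these lines could be pasted into a bug report. This is only a sketch: `describe_environment` and `describe_tensors` are hypothetical helper names, not part of grouped_gemm or LLM Foundry, and it assumes PyTorch may or may not be installed.

```python
import platform
import sys

try:
    import torch  # optional: the sketch still runs without PyTorch
except ImportError:
    torch = None


def describe_environment():
    """Collect version and device details worth attaching to a bug report."""
    info = {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "torch": None,
        "cuda": None,
        "devices": [],
    }
    if torch is None:
        return info
    info["torch"] = torch.__version__
    info["cuda"] = torch.version.cuda  # None on builds without CUDA
    if torch.cuda.is_available():
        for index in range(torch.cuda.device_count()):
            major, minor = torch.cuda.get_device_capability(index)
            info["devices"].append(
                {
                    "index": index,
                    "name": torch.cuda.get_device_name(index),
                    "compute_capability": f"{major}.{minor}",
                }
            )
    return info


def describe_tensors(**tensors):
    """Map each argument name to the shape and dtype of the object passed in."""
    return {
        name: {
            "shape": tuple(getattr(value, "shape", ())),
            "dtype": str(getattr(value, "dtype", type(value).__name__)),
        }
        for name, value in tensors.items()
    }
```

Calling `print(describe_environment())` once at startup and `print(describe_tensors(a=a, b=b))` right before the failing call should give enough context to compare a working and a failing setup.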

Does it work with torch.compile?

I haven't tried it, so I'm honestly not sure 🤷. I'd recommend trying it out and seeing what happens. I would guess it would be messy given the varying shapes...

Training verbose logs

Can you provide some more information on what your custom dataloader is? It looks like you are having some trouble running in a distributed setting with your dataloader. Torch dataloaders...

Training verbose logs

Can you please provide a full trace / logs?

Computing train metrics at a given frequency

Hm... this is a bit tricky since this would affect all metrics... Two proposed workarounds:
1. Store the metric separately in a callback, which easily gives control over frequency
2. Have code...
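The first workaround (keeping the metric in its own object and updating it only every N batches) might look roughly like the sketch below. It is framework-agnostic plain Python: `PeriodicMetric` and `on_batch_end` are hypothetical names, not the trainer's actual callback API, and accuracy is just a stand-in metric.

```python
class PeriodicMetric:
    """Hypothetical helper: accumulate accuracy only every `interval` batches."""

    def __init__(self, interval):
        if interval < 1:
            raise ValueError("interval must be a positive integer")
        self.interval = interval
        self.correct = 0
        self.total = 0

    def on_batch_end(self, batch_index, predictions, targets):
        # Skip batches that fall outside the requested frequency.
        if batch_index % self.interval != 0:
            return False
        self.correct += sum(p == t for p, t in zip(predictions, targets))
        self.total += len(targets)
        return True

    def compute(self):
        # Accuracy over the sampled batches only.
        return self.correct / self.total if self.total else 0.0


# Illustrative data: three batches of (predictions, targets).
metric = PeriodicMetric(interval=2)
batches = [([1, 0], [1, 1]), ([0, 0], [0, 0]), ([1, 1], [1, 0])]
for index, (predictions, targets) in enumerate(batches):
    metric.on_batch_end(index, predictions, targets)
```

Because the state lives outside the trainer's own metrics, changing `interval` here would not affect any other metric.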

Unable to script model

@callmekris if you install the same package versions on the older image (`pip freeze > requirements.txt`, copy paste it into the old image, `pip install -r requirements.txt`), can you see...
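To see which package versions actually differ between the two images, the two `pip freeze` outputs can be diffed. A minimal sketch, assuming both outputs have been saved as text; `parse_freeze` and `diff_freeze` are hypothetical helpers and the versions shown are made up for illustration.

```python
def parse_freeze(text):
    """Parse `pip freeze` output into {package_name: version}."""
    packages = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, option lines such as `-e ...`, and entries
        # that are not pinned with `==` (for example direct URL installs).
        if not line or line.startswith(("#", "-")) or "==" not in line:
            continue
        name, version = line.split("==", 1)
        packages[name.strip().lower()] = version.strip()
    return packages


def diff_freeze(old_text, new_text):
    """Return {name: (old_version, new_version)} for packages that differ."""
    old, new = parse_freeze(old_text), parse_freeze(new_text)
    changes = {}
    for name in sorted(set(old) | set(new)):
        if old.get(name) != new.get(name):
            changes[name] = (old.get(name), new.get(name))
    return changes


# Illustrative inputs only, not taken from the issue.
old_image = "torch==1.13.1\nnumpy==1.24.0\n"
new_image = "torch==2.0.1\nnumpy==1.24.0\nrequests==2.31.0\n"
changes = diff_freeze(old_image, new_image)
```

A `None` on either side of a pair means the package is missing from that image.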

TypeError: Subscripted generics cannot be used with class and instance checks

Good point! Would you mind opening a PR and tagging me for review?

TypeError: Subscripted generics cannot be used with class and instance checks

It was a quick fix, so I just did it. Thanks for flagging!
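For anyone landing here from the error message: Python raises this `TypeError` when a subscripted generic such as `List[int]` is passed to `isinstance`. The sketch below reproduces it and shows one common workaround using `typing.get_origin`. It is not necessarily the fix that was merged, and `runtime_type` and `check` are hypothetical names.

```python
from typing import Dict, List, get_origin


def runtime_type(annotation):
    """Return a class that is safe to pass to isinstance()."""
    # get_origin(List[int]) is list; it returns None for plain classes.
    origin = get_origin(annotation)
    return origin if origin is not None else annotation


def check(value, annotation):
    # Checks the container type only, not the element types.
    # Special forms such as Union are not handled by this sketch.
    return isinstance(value, runtime_type(annotation))


# Reproduce the original failure.
failure = None
try:
    isinstance([1, 2], List[int])
except TypeError as error:
    failure = str(error)
```

With this, `check([1, 2], List[int])` passes, while element types would still need to be validated separately if they matter.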