Masaki Kozuki
`xentropy_cuda` is not compiled with the `--cuda_ext` option but with `--xentropy`: https://github.com/NVIDIA/apex#custom-ccuda-extensions-and-install-options
That sounds right; would you mind opening a pull request?
Personally I recommend using `--no-build-isolation`: even when `packaging` is installed, I'd guess it would be tricky to get the same torch as in your environment installed into an isolated build environment...
Hmm, I haven't run into the same situation myself. What about using the latest pip with multiple `--config-settings`?
One way (though I wouldn't recommend it) to dodge pyproject.toml dependency management could be to run `python setup.py install --cpp_ext --cuda_ext ...` directly, so that pip is never invoked.
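For reference, a sketch of the `--config-settings` route (flag spellings follow the apex README linked above; how pip forwards `--build-option` has changed across pip releases, so verify against your pip version):

```
# Sketch: build apex extensions against the environment's torch,
# forwarding the custom build flags through pip's --config-settings.
pip install -v --no-cache-dir --no-build-isolation \
  --config-settings "--build-option=--cpp_ext" \
  --config-settings "--build-option=--cuda_ext" \
  ./
```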
Tidy reproducible code is shared in the description. I confirmed that we can reproduce the error (with a slightly different `KeyError` message, with high probability).
Alternative: add a check in the test and skip accordingly.
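A minimal sketch of that skip pattern using the stdlib (the module name `xentropy_cuda` is taken from the thread above; the test name and body are hypothetical placeholders):

```python
import importlib.util
import unittest

# Probe for the optional compiled extension without importing it outright.
HAS_XENTROPY = importlib.util.find_spec("xentropy_cuda") is not None


class TestFusedXentropy(unittest.TestCase):
    @unittest.skipUnless(HAS_XENTROPY, "xentropy_cuda extension not built")
    def test_fused_loss_runs(self):
        # Only reached when the extension is actually available.
        import xentropy_cuda  # noqa: F401
```

With pytest the equivalent would be `@pytest.mark.skipif(not HAS_XENTROPY, ...)`; either way the test suite stays green on environments where the extension wasn't compiled.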
https://github.com/Lightning-AI/lightning-thunder/pull/2633 could be related, or could serve as a reference point.
> PyTorch decorates the `_init_group` method of every optimizer class with a wrapper that prevents Dynamo from tracing it. Thus `_init_group` is always executed in eager mode. I'd expect...
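A pure-Python sketch of the decorator pattern the quote describes (no torch here; `disable_tracing` and the `_tracing_disabled` marker are hypothetical stand-ins for PyTorch's actual Dynamo-disable wrapper):

```python
import functools


def disable_tracing(fn):
    """Mark `fn` so a hypothetical tracer falls back to eager execution."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        return fn(*args, **kwargs)  # always runs eagerly, never traced
    wrapper._tracing_disabled = True  # flag a tracer could inspect
    return wrapper


class Optimizer:
    @disable_tracing
    def _init_group(self, params):
        # State initialization happens outside any traced graph.
        return [{"param": p, "step": 0} for p in params]
```

Because the wrapper always calls the original function directly, anything inside `_init_group` is invisible to the tracer, which matches the "always executed in eager mode" behavior described above.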
Can you check Llama 4, DeepSeek V3.1, and Qwen3-Next? I understand these models are quite heavy, though.