fix: Use uv Python for MCore dataset compilation (#438)
Description
Changed the Makefile to use `uv run python` instead of the system `python3`, ensuring the compiled extension matches the uv Python environment.
Also added the `-undefined dynamic_lookup` linker flag on macOS to fix 'Undefined symbols' errors when building the extension.
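For reference, here is a minimal sketch of the kind of Makefile rule this change targets (rule, variable, and file names are illustrative, not the exact contents of the repo's Makefile):

```makefile
# Resolve the interpreter through uv so the compiled extension matches
# the uv-managed environment instead of whatever python3 is on PATH.
PYTHON ?= uv run python

CXXFLAGS += -O3 -Wall -shared -std=c++17 -fPIC
CPPFLAGS += $(shell $(PYTHON) -m pybind11 --includes)

# macOS links extension modules without libpython, so tell the linker to
# resolve undefined Python symbols at load time.
ifeq ($(shell uname -s),Darwin)
LDFLAGS += -undefined dynamic_lookup
endif

LIBNAME = helpers
LIBEXT  = $(shell $(PYTHON) -c "import sysconfig; print(sysconfig.get_config_var('EXT_SUFFIX'))")

default: $(LIBNAME)$(LIBEXT)

%$(LIBEXT): %.cpp
	$(CXX) $(CXXFLAGS) $(CPPFLAGS) $(LDFLAGS) $< -o $@
```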
Testing
Verified with system Python 3.11 and uv Python 3.12: the compiled `.so` file now correctly targets the uv Python version (3.12).
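For anyone who wants to reproduce the check, one quick way is to print the extension suffix that uv's interpreter expects and confirm the built `.so` filename carries the same tag (e.g. `.cpython-312-darwin.so` rather than a 3.11 tag). A hypothetical helper target, not part of the repo's Makefile:

```makefile
# Hypothetical sanity check: print the ABI/extension suffix of the interpreter
# that uv resolves; the built .so filename should end with this suffix.
check-python:
	uv run python -c "import sysconfig; print(sysconfig.get_config_var('EXT_SUFFIX'))"
```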
Fixes #438
First-time contributor here. I'm a research engineer transitioning from edge models to LLM infrastructure and algorithms. Happy to help with more tasks in the future. Thanks for reviewing!
This pull request requires additional validation before any workflows can run on NVIDIA's runners.
Thank you so much @yuhezhang-ai! We really appreciate community contributions :)
@nvidia-nemo/automation @thomasdhc can you help verify the changes?
/ok to test 8d95c2c2eb92109297813d9031d05f133612237c
/ok to test 986d48e61f9096f73c96c71e3a1319a4690cfcf6
Hi, I updated the branch to include the Makefile package-data fix from main.
About the previous CI failures:
Looking at the logs, the failures were due to:
`RuntimeError: PyTorch has CUDA Version=12.9 and torchvision has CUDA Version=13.0`
This appears to be a dependency resolution issue in the CI environment, unrelated to the Makefile changes (the compilation test itself passed).
Should I wait to see if this persists after the package-data fix, or would you like me to investigate a torchvision version constraint, such as adding a pin to pyproject.toml?
Thanks!
Hey @yuhezhang-ai, thanks for this update. The `uv run python` invocation is actually re-installing torch when it should not be, which is causing this error. The cause is part of our testing setup incorrectly mounting another copy of Automodel, so I'll need to make changes to the overall test workflow. When that PR is done I will apply those changes to this PR.
No further action needs to be taken from your side.
Thanks!
Thanks for clarifying! I appreciate you taking the time to explain the root cause.
I'm interested in contributing more to the project as I learn about LLM infrastructure. Are there other issues that might be suitable for me to work on?
My Background:
- Computer vision research engineer with algorithm experience (and actively learning LLM/VLM)
- Some Triton kernel knowledge, but limited distributed training experience
- No GPU cluster access, but can test single-GPU scenarios via Colab
I can probably help with algorithm work, code quality, bug fixes, and kernel optimization: tasks that can be developed and verified on a single GPU.
For example, I noticed https://github.com/NVIDIA-NeMo/Automodel/issues/780 (sequence classification metrics bug), which seems suitable for me: it's about correctness and can be tested on Colab, though it already has an assignee.
Happy to help with whatever you think would be suitable! 🙏
Hey @yuhezhang-ai thank you so much for your enthusiasm! It'd be great to have more hands on board :) We usually file any open issues on the GitHub Issues tab, so feel free to pick up anything that interests you. https://github.com/NVIDIA-NeMo/Automodel/issues/780 might be a good and easy one to start with.
/ok to test ecde1487656b4b8eadb15fe509ef383ecf5b860e