fix: Use uv Python for MCore dataset compilation (#438)
Description
Changed the Makefile to use `uv run python` instead of the system `python3`, ensuring the compiled extension matches the uv Python environment.
Also added the `-undefined dynamic_lookup` linker flag on macOS to fix 'Undefined symbols' errors when building the extension.
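For reference, here is a minimal sketch of the kind of Makefile rule this change targets (rule, variable, and file names are illustrative, not the exact contents of the repo's Makefile):

```makefile
# Resolve the interpreter through uv so the compiled extension matches
# the uv-managed environment instead of whatever python3 is on PATH.
PYTHON ?= uv run python

CXXFLAGS += -O3 -Wall -shared -std=c++17 -fPIC
CPPFLAGS += $(shell $(PYTHON) -m pybind11 --includes)

# macOS links extension modules without libpython, so tell the linker to
# resolve undefined Python symbols at load time.
ifeq ($(shell uname -s),Darwin)
LDFLAGS += -undefined dynamic_lookup
endif

LIBNAME = helpers
LIBEXT  = $(shell $(PYTHON) -c "import sysconfig; print(sysconfig.get_config_var('EXT_SUFFIX'))")

default: $(LIBNAME)$(LIBEXT)

%$(LIBEXT): %.cpp
	$(CXX) $(CXXFLAGS) $(CPPFLAGS) $(LDFLAGS) $< -o $@
```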
Testing
Verified with system Python 3.11 and uv Python 3.12: the compiled `.so` file now correctly targets the uv Python version (3.12).
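For anyone who wants to reproduce the check, one quick way is to print the extension suffix that uv's interpreter expects and confirm the built `.so` filename carries the same tag (e.g. `.cpython-312-darwin.so` rather than a 3.11 tag). A hypothetical helper target, not part of the repo's Makefile:

```makefile
# Hypothetical sanity check: print the ABI/extension suffix of the interpreter
# that uv resolves; the built .so filename should end with this suffix.
check-python:
	uv run python -c "import sysconfig; print(sysconfig.get_config_var('EXT_SUFFIX'))"
```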
Fixes #438
First-time contributor here. I'm a research engineer transitioning from edge models to LLM infrastructure and algorithms. Happy to help with more tasks in the future. Thanks for reviewing!
This pull request requires additional validation before any workflows can run on NVIDIA's runners.
Thank you so much @yuhezhang-ai! We really appreciate community contributions :)
@nvidia-nemo/automation @thomasdhc can you help verify the changes?
/ok to test 8d95c2c2eb92109297813d9031d05f133612237c
/ok to test 986d48e61f9096f73c96c71e3a1319a4690cfcf6
Hi, I updated the branch to include the Makefile package-data fix from main.
About the previous CI failures:
Looking at the logs, the failures were due to:
`RuntimeError: PyTorch has CUDA Version=12.9 and torchvision has CUDA Version=13.0`
This appears to be a dependency resolution issue in the CI environment, unrelated to the Makefile changes (the compilation test itself passed).
Should I wait to see if this persists after the package-data fix, or would you like me to investigate a torchvision version constraint, such as adding a pin to pyproject.toml?
Thanks!
Hey @yuhezhang-ai, thanks for this update. The `uv run python` invocation is actually re-installing torch when it should not be, which is causing this error. The cause is part of our testing setup incorrectly mounting another copy of Automodel, so I'll need to make changes to the overall test workflow. When that PR is done I will apply those changes to this PR.
No further action needs to be taken from your side.
Thanks!
Thanks for clarifying! I appreciate you taking the time to explain the root cause.
I'm interested in contributing more to the project as I learn about LLM infrastructure. Are there other issues that might be suitable for me to work on?
My Background:
- Computer vision research engineer with algorithm experience (and actively learning LLM/VLM)
- Some Triton kernel knowledge, but limited distributed training experience
- No GPU cluster access, but can test single-GPU scenarios via Colab
I can probably help with algorithm work, code quality, bug fixes, and kernel optimization: tasks that can be developed and verified on a single GPU.
For example, I noticed https://github.com/NVIDIA-NeMo/Automodel/issues/780 (sequence classification metrics bug), which seems suitable for me: it's about correctness and can be tested on Colab, though it already has an assignee.
Happy to help with whatever you think would be suitable! 🙏
Hey @yuhezhang-ai thank you so much for your enthusiasm! It'd be great to have more hands on board :) We usually file any open issues on the GitHub Issues tab, so feel free to pick up anything that interests you. https://github.com/NVIDIA-NeMo/Automodel/issues/780 might be a good and easy one to start with.
/ok to test ecde1487656b4b8eadb15fe509ef383ecf5b860e