Sam Stoelinga
This PR is still in draft mode. I will be able to update it once I get 405B working on Trillium.
The implementation is wrong: `hidden_dim` should be 16384 and `ffn_dim` should be 53248, right? I will update this PR once I get my `trillium-405b` branch working.
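One quick way to sanity-check the two dimensions above is to count parameters for a Llama-style decoder and see whether the total lands near 405B. This is only a rough sketch, not axlearn code: `count_params` is a hypothetical helper, and the layer count (126), head counts (128 query, 8 KV), and vocab size (128256) are assumptions taken from the published Llama 3.1 405B configuration rather than from this PR.

```python
def count_params(
    hidden_dim: int,
    ffn_dim: int,
    num_layers: int = 126,     # assumption: published Llama 3.1 405B value
    num_heads: int = 128,      # assumption: published Llama 3.1 405B value
    num_kv_heads: int = 8,     # assumption: grouped-query attention with 8 KV heads
    vocab_size: int = 128256,  # assumption: published Llama 3.1 vocab size
) -> int:
    """Rough parameter count for a Llama-style decoder with untied embeddings."""
    head_dim = hidden_dim // num_heads
    kv_dim = num_kv_heads * head_dim
    # Attention: Q and O projections are hidden x hidden; K and V are hidden x kv_dim.
    attention = 2 * hidden_dim * hidden_dim + 2 * hidden_dim * kv_dim
    # Gated feed-forward: gate, up, and down projections.
    ffn = 3 * hidden_dim * ffn_dim
    # Two RMSNorm scale vectors per layer.
    norms = 2 * hidden_dim
    per_layer = attention + ffn + norms
    # Input embedding, separate output head, and the final norm.
    embeddings = 2 * vocab_size * hidden_dim + hidden_dim
    return num_layers * per_layer + embeddings


total = count_params(hidden_dim=16384, ffn_dim=53248)
print(f"{total / 1e9:.2f}B parameters")
```

If the other assumptions hold, a total far from 405B would suggest that `hidden_dim` or `ffn_dim` is still wrong in the config.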
Unable to run golden_config_test:
```
_______________________ ERROR collecting axlearn/experiments/golden_config_test.py ________________________
ImportError while importing test module '/Users/stoelinga/workspace/axlearn/axlearn/experiments/golden_config_test.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
../../miniforge3/envs/axlearn-10/lib/python3.10/importlib/__init__.py:126: in import_module...
```
@kelvin-zou could you give it another review? I added the TransformerLayer input checkpointing offload to host. This is required in order to run 405B. I did something similar in my...
This needs to be rebased, since `shared_lm_head` is new in the latest main.
I think the issue is related to creating a pod without GPU but still being able to access the GPU. I am having this issue when I only run 1...
I just tried out minikube with GPU support and it's working flawlessly. I don't see the same issue there. So it does seem related to kind + GPU support when...
@ashvinnihalani are you still working on this? This would also be helpful for loading large models in environments where disk space isn't enough. The issue with...
Definitely something we're interested in. Would love perspectives from others on why we should or shouldn't do it. @kaiehrhardt since you upvoted this issue and have contributed. Also if you...
We want to go ahead with this. @kiwansky would appreciate any help with moving it to CNCF. We're also open to adding more maintainers and giving up control of the...