llm-foundry
                        Fine-tuning error in conda environment without docker image
Environment
Python 3.11.9, CUDA 11.8, torch 2.4.0+cu118
PyTorch information
PyTorch version: 2.4.0+cu118
Is debug build: False
CUDA used to build PyTorch: 11.8
ROCM used to build PyTorch: N/A

OS: Ubuntu 20.04.6 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0
Clang version: Could not collect
CMake version: version 3.30.3
Libc version: glibc-2.31

Python version: 3.11.9 (main, Apr 19 2024, 16:48:06) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.4.0-192-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 11.8.89
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A100 80GB PCIe
Nvidia driver version: 550.54.14
cuDNN version: Probably one of the following:
  /usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.7
  /usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.7
  /usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.7
  /usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.7
  /usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.7
  /usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.7
  /usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.7
  /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn.so.8.9.4
  /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.9.4
  /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.9.4
  /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.9.4
  /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.9.4
  /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.9.4
  /usr/local/cuda-11.8/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.9.4
  /usr/local/cuda-12.1/targets/x86_64-linux/lib/libcudnn.so.8.9.4
  /usr/local/cuda-12.1/targets/x86_64-linux/lib/libcudnn_adv_infer.so.8.9.4
  /usr/local/cuda-12.1/targets/x86_64-linux/lib/libcudnn_adv_train.so.8.9.4
  /usr/local/cuda-12.1/targets/x86_64-linux/lib/libcudnn_cnn_infer.so.8.9.4
  /usr/local/cuda-12.1/targets/x86_64-linux/lib/libcudnn_cnn_train.so.8.9.4
  /usr/local/cuda-12.1/targets/x86_64-linux/lib/libcudnn_ops_infer.so.8.9.4
  /usr/local/cuda-12.1/targets/x86_64-linux/lib/libcudnn_ops_train.so.8.9.4
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
  Architecture: x86_64
  CPU op-mode(s): 32-bit, 64-bit
  Byte Order: Little Endian
  Address sizes: 43 bits physical, 48 bits virtual
  CPU(s): 64
  On-line CPU(s) list: 0-63
  Thread(s) per core: 2
  Core(s) per socket: 32
  Socket(s): 1
  NUMA node(s): 1

Versions of relevant libraries:
  [pip3] numpy==1.26.3
  [pip3] onnx==1.16.2
  [pip3] onnxruntime==1.19.0
  [pip3] pytorch-ranger==0.1.1
  [pip3] torch==2.4.0+cu118
  [pip3] torch-optimizer==0.3.0
  [pip3] torchaudio==2.4.0+cu118
  [pip3] torchmetrics==1.4.0.post0
  [pip3] torchvision==0.19.0+cu118
  [pip3] triton==3.0.0
  [conda] numpy 1.26.3 pypi_0 pypi
  [conda] pytorch-ranger 0.1.1 pypi_0 pypi
  [conda] torch 2.4.0+cu118 pypi_0 pypi
  [conda] torch-optimizer 0.3.0 pypi_0 pypi
  [conda] torchaudio 2.4.0+cu118 pypi_0 pypi
  [conda] torchmetrics 1.4.0.post0 pypi_0 pypi
  [conda] torchvision 0.19.0+cu118 pypi_0 pypi
  [conda] triton 3.0.0 pypi_0 pypi
Composer information
Composer Version: 0.24.1
Composer Commit Hash: None
CPU Model: AMD EPYC 7542 32-Core Processor
CPU Count: 32
Number of Nodes: 1
GPU Model: NVIDIA A100 80GB PCIe
GPUs per Node: 1
GPU Count: 1
CUDA Device Count: 1
To reproduce
Steps to reproduce the behavior:
1. pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu118 --force-reinstall
2. pip install -e .
3. cd scripts/train
4. composer train.py finetune_example/gpt2-arc-easy--cpu.yaml
   On CPU this fails with: omegaconf.errors.InterpolationKeyError: Interpolation key 'global_seed' not found
5. composer train.py finetune_example/mpt-7b-arc-easy--gpu.yaml
   On GPU this fails with: ValueError: Unused parameters ['global_seed'] found in cfg. Please check your yaml to ensure these parameters are necessary. Please place any variables under the variables key.
Expected behavior
Fine-tuning should complete without errors on both CPU and GPU.
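Both errors point at the same root cause: the GPU-side ValueError explicitly asks for variables to be placed under the `variables` key. A hedged sketch of what the YAML change could look like, assuming llm-foundry's convention of referencing nested variables as `${variables.<name>}` (the seed value here is illustrative, not taken from the repo's example files):

```yaml
# Before (triggers both errors): interpolation variable at top level
# global_seed: 17
# seed: ${global_seed}

# After: variable nested under the `variables` key
variables:
  global_seed: 17
seed: ${variables.global_seed}
```

If the shipped finetune_example YAMLs still use the top-level form, updating them this way (or pinning an llm-foundry version matching the YAMLs) may resolve the issue.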