mistral-finetune
[BUG]: validate_data.py ModuleNotFoundError (finetune & tensorflow)
Python Version
Python 3.10.12
Pip Freeze
absl-py==2.1.0
annotated-types==0.7.0
astunparse==1.6.3
attrs==24.2.0
beautifulsoup4==4.12.3
blis==0.7.11
bs4==0.0.2
catalogue==2.0.10
certifi==2024.8.30
charset-normalizer==3.3.2
click==8.1.7
cloudpathlib==0.19.0
confection==0.1.5
cramjam==2.8.3
cymem==2.0.8
docstring_parser==0.16
fastparquet==2024.5.0
filelock==3.15.4
finetune==0.10.0
fire==0.6.0
flatbuffers==24.3.25
fsspec==2024.9.0
ftfy==6.2.3
gast==0.6.0
google-pasta==0.2.0
grpcio==1.66.1
h5py==3.11.0
huggingface-hub==0.24.6
idna==3.8
Jinja2==3.1.4
joblib==1.4.2
jsonschema==4.23.0
jsonschema-specifications==2023.12.1
keras==3.5.0
langcodes==3.4.0
language_data==1.2.0
libclang==18.1.1
lxml==5.3.0
marisa-trie==1.2.0
Markdown==3.7
markdown-it-py==3.0.0
MarkupSafe==2.1.5
mdurl==0.1.2
mistral_common==1.3.4
ml-dtypes==0.4.0
mpmath==1.3.0
murmurhash==1.0.10
namex==0.0.8
networkx==3.3
nltk==3.9.1
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.19.3
nvidia-nvjitlink-cu12==12.6.68
nvidia-nvtx-cu12==12.1.105
opt-einsum==3.3.0
optree==0.12.1
packaging==24.1
pandas==2.2.2
preshed==3.0.9
protobuf==4.25.4
psutil==5.7.0
pyarrow==17.0.0
pydantic==2.9.0
pydantic_core==2.23.2
Pygments==2.18.0
python-dateutil==2.9.0.post0
pytz==2024.1
PyYAML==6.0.2
referencing==0.35.1
regex==2024.7.24
requests==2.32.3
rich==13.8.0
rpds-py==0.20.0
safetensors==0.4.5
scikit-learn==1.5.1
scipy==1.14.1
sentencepiece==0.2.0
shellingham==1.5.4
simple-parsing==0.1.6
six==1.16.0
smart-open==7.0.4
soupsieve==2.6
spacy==3.7.6
spacy-legacy==3.0.12
spacy-loggers==1.0.5
srsly==2.4.8
sympy==1.13.2
tabulate==0.8.10
tensorboard==2.17.1
tensorboard-data-server==0.7.2
tensorflow==2.17.0
tensorflow-addons==0.16.1
tensorflow-estimator==2.11.0
tensorflow-io-gcs-filesystem==0.37.1
termcolor==2.4.0
thinc==8.2.5
threadpoolctl==3.5.0
tiktoken==0.7.0
tokenizers==0.13.3
torch==2.2.0
tqdl==0.0.4
tqdm==4.66.5
transformers==4.25.1
triton==2.2.0
typeguard==4.3.0
typer==0.12.5
typing_extensions==4.12.2
tzdata==2024.1
urllib3==2.2.2
wasabi==1.1.3
wcwidth==0.2.13
weasel==0.4.1
Werkzeug==3.0.4
wrapt==1.16.0
xformers==0.0.24
Reproduction Steps
- Follow instructions from the README
- Run the validation script in a Python virtual environment:
python ./mistral-finetune/utils/validate_data.py --train_yaml ./mistral-finetune/example/7B.yaml
Expected Behavior
According to the README, it should return "a summary of the data input and training parameters" such as:
Train States
--------------------
{
    "expected": {
        "eta": "00:52:44",
        "data_tokens": 25169147,
        "train_tokens": 131072000,
        "epochs": "5.21",
        "max_steps": 500,
        "data_tokens_per_dataset": {
            "/Users/johndoe/data/ultrachat_chunk_train.jsonl": "25169147.0"
        },
        "train_tokens_per_dataset": {
            "/Users/johndoe/data/ultrachat_chunk_train.jsonl": "131072000.0"
        },
        "epochs_per_dataset": {
            "/Users/johndoe/data/ultrachat_chunk_train.jsonl": "5.2"
        }
    },
}
Additional Context
The script returns the following error:
Traceback (most recent call last):
File "/cluster/flash/wichtco/ai-fine-tuning/./mistral-finetune/utils/validate_data.py", line 16, in <module>
from finetune.args import TrainArgs
ModuleNotFoundError: No module named 'finetune'
After installing the latest 'finetune-0.10.0' release, the script fails with a second error, also caused by a missing package:
Traceback (most recent call last):
File "/cluster/flash/wichtco/ai-fine-tuning/./mistral-finetune/utils/validate_data.py", line 16, in <module>
from finetune.args import TrainArgs
File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.10/site-packages/finetune/__init__.py", line 12, in <module>
import tensorflow as tf
ModuleNotFoundError: No module named 'tensorflow'
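One way to narrow this down is to check where Python actually resolves `finetune` from. The sketch below uses only the standard library. It rests on an assumption that is not confirmed in this report: that the repository ships its own top-level `finetune/` directory and that the PyPI package of the same name is an unrelated project, which would explain the unexpected `import tensorflow`. The helper name `describe_module` is made up for illustration.

```python
import importlib.util
import sys


def describe_module(name):
    """Return a one-line description of where `name` would be imported from."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        spec = None
    if spec is None:
        return f"{name}: not importable from the current sys.path"
    # Namespace packages have no origin, so fall back to their search locations.
    origin = spec.origin or str(list(spec.submodule_search_locations or []))
    where = "inside site-packages" if "site-packages" in origin else "outside site-packages"
    return f"{name}: {origin} ({where})"


if __name__ == "__main__":
    # When a file is run as `python path/to/script.py`, sys.path[0] is that
    # script's own directory, not the directory the command was typed in.
    print("sys.path[0] =", sys.path[0])
    # Assumption: a repo-local finetune/ should win over any pip-installed one.
    print(describe_module("finetune"))
```

If the output points inside site-packages, the import is picking up the pip-installed package rather than anything in the cloned repository. If it reports "not importable", the repository root is probably not on `sys.path` for the way the script was launched.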
Suggested Solutions
Installing the second missing package, 'tensorflow-2.17.0', should fix the problem, but pip reports a dependency conflict:
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
finetune 0.10.0 requires numpy<1.24.0,>=1.18.4, but you have numpy 1.26.4 which is incompatible.
Since finetune 0.10.0 requires numpy <1.24.0 while the tensorflow-2.17.0 install pulled in numpy 1.26.4, I really don't see how I could make your script work.
Any idea?
Best,
C.
Follow up:
Command torchrun --nproc-per-node 8 --master_port $RANDOM -m train example/7B.yaml
seems to fail as well due to a missing package:
[2024-09-06 16:02:17,003] torch.distributed.run: [WARNING]
[2024-09-06 16:02:17,003] torch.distributed.run: [WARNING] *****************************************
[2024-09-06 16:02:17,003] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-09-06 16:02:17,003] torch.distributed.run: [WARNING] *****************************************
/cluster/flash/wichtco/ai-fine-tuning/.venv/bin/python: No module named train
/cluster/flash/wichtco/ai-fine-tuning/.venv/bin/python: No module named train
/cluster/flash/wichtco/ai-fine-tuning/.venv/bin/python: No module named train
/cluster/flash/wichtco/ai-fine-tuning/.venv/bin/python: No module named train
/cluster/flash/wichtco/ai-fine-tuning/.venv/bin/python: No module named train
/cluster/flash/wichtco/ai-fine-tuning/.venv/bin/python: No module named train
/cluster/flash/wichtco/ai-fine-tuning/.venv/bin/python: No module named train
/cluster/flash/wichtco/ai-fine-tuning/.venv/bin/python: No module named train
[2024-09-06 16:02:22,013] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 422322) of binary: /cluster/flash/wichtco/ai-fine-tuning/.venv/bin/python
Traceback (most recent call last):
File "/cluster/flash/wichtco/ai-fine-tuning/.venv/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.10/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.10/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/cluster/flash/wichtco/ai-fine-tuning/.venv/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
train FAILED
------------------------------------------------------------
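The `-m train` part of the command asks Python to find a module called `train` on `sys.path`, which for `-m` starts with the current working directory. The following sketch reproduces that behaviour in isolation. It assumes, without having verified it here, that the repository has a top-level `train.py` and that the command is meant to be run from the repository root. The stub file and the helper names `run_module` and `demo` are hypothetical.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_module(module, cwd):
    """Run `python -m <module>` with `cwd` as working directory; return the exit code."""
    result = subprocess.run(
        [sys.executable, "-m", module],
        cwd=cwd,
        capture_output=True,
        text=True,
    )
    return result.returncode


def demo(module="train"):
    """Return exit codes for running the module from its own directory and from elsewhere."""
    with tempfile.TemporaryDirectory() as repo, tempfile.TemporaryDirectory() as elsewhere:
        # Hypothetical stand-in for a top-level <module>.py in the repository.
        Path(repo, f"{module}.py").write_text("print('stub')\n")
        inside = run_module(module, repo)        # found via the working directory
        outside = run_module(module, elsewhere)  # not found unless installed separately
        return inside, outside


if __name__ == "__main__":
    inside, outside = demo()
    print("exit code from module directory:", inside)
    print("exit code from another directory:", outside)
```

If the same pattern holds for the real repository, running the torchrun command from inside the cloned directory would be the first thing to try, before installing any package named `train`.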
And when trying to install 'train-0.0.5', I got another pip dependency conflict involving the same packages as above:
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
thinc 8.2.5 requires numpy<2.0.0,>=1.19.0; python_version >= "3.9", but you have numpy 2.1.1 which is incompatible.
tensorflow 2.17.0 requires numpy<2.0.0,>=1.23.5; python_version <= "3.11", but you have numpy 2.1.1 which is incompatible.
finetune 0.10.0 requires numpy<1.24.0,>=1.18.4, but you have numpy 2.1.1 which is incompatible.
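To see which numpy versions the three constraints above would accept, here is a small standard-library sketch. The constraint strings are copied from the pip messages. The parser is deliberately minimal: it only handles plain `X.Y.Z` versions with `>=` and `<`, and it ignores the `python_version` markers. The helper names are made up for illustration.

```python
import operator

# numpy constraints as reported by pip in the messages above.
CONSTRAINTS = {
    "thinc 8.2.5": ">=1.19.0,<2.0.0",
    "tensorflow 2.17.0": ">=1.23.5,<2.0.0",
    "finetune 0.10.0": ">=1.18.4,<1.24.0",
}

OPS = {">=": operator.ge, "<": operator.lt}


def parse(version):
    """Turn '1.26.4' into (1, 26, 4) so that tuples compare numerically."""
    return tuple(int(part) for part in version.split("."))


def satisfies(version, spec):
    """Check a plain X.Y.Z version against a comma-separated '>=' / '<' spec."""
    for clause in spec.split(","):
        symbol = ">=" if clause.startswith(">=") else "<"
        bound = clause[len(symbol):]
        if not OPS[symbol](parse(version), parse(bound)):
            return False
    return True


def conflicts(version):
    """Return the packages whose numpy constraint rejects `version`."""
    return [pkg for pkg, spec in CONSTRAINTS.items() if not satisfies(version, spec)]


if __name__ == "__main__":
    for candidate in ("1.23.5", "1.26.4", "2.1.1"):
        print(candidate, "->", conflicts(candidate) or "no conflict")
```

Under these three constraints alone, the ranges overlap only between 1.23.5 and anything below 1.24.0. That is a statement about the version ranges, not a claim that the packages work together at such a version; other installed packages may impose constraints that do not appear in the messages above.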