[BUG] Nanotron batch detection doesn't work
Describe the bug
Running lighteval's nanotron backend with `batch_size = 0` causes a crash during automatic batch size detection.
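For context, setting `batch_size` to 0 makes lighteval probe for the largest batch size that fits in memory (the "Detecting largest batch size" / "Testing batch size 512" lines in the log below). A minimal sketch of that kind of probing decorator, with names and details assumed rather than copied from `src/lighteval/utils/parallelism.py`:

```python
import functools
import torch

def find_executable_batch_size(function=None, starting_batch_size: int = 512):
    # Sketch of the detection loop: run a trial forward pass at
    # `starting_batch_size` and halve on CUDA OOM until one succeeds
    # (assumed behavior, not the verbatim lighteval implementation).
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            batch_size = starting_batch_size
            while batch_size >= 1:
                try:
                    return fn(batch_size, *args, **kwargs)
                except torch.cuda.OutOfMemoryError:
                    torch.cuda.empty_cache()
                    batch_size //= 2
            raise RuntimeError("No executable batch size found.")
        return wrapper
    return decorator if function is None else decorator(function)
```

In the failing run below, the trial forward pass raises a `TypeError` rather than a CUDA OOM, so a retry loop like this cannot catch it and the exception propagates up through `_get_batch_size`.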
(lighteval-main) hynek_kydlicek@ip-26-0-162-233:/fsx/hynek_kydlicek/projects/lighteval-main-branch$ torchrun --standalone --nnodes=1 --nproc-per-node=1 src/lighteval/__main__.py nanotron --checkpoint_config_path ./nanotron/checkpoints/0/config.yaml --lighteval_config_path examples/nanotron/lighteval_config_override_template.yaml
WARNING:lighteval.logging.hierarchical_logger:main: (0, './nanotron/checkpoints/0/config.yaml'), (1, 'examples/nanotron/lighteval_config_override_template.yaml'), (2, '/fsx/hynek_kydlicek/.cache/huggingface'), {
WARNING:lighteval.logging.hierarchical_logger: Load nanotron config {
skip_unused_config_keys set
Skip_null_keys set
WARNING:lighteval.logging.hierarchical_logger: } [0:00:00.013603]
WARNING:lighteval.logging.hierarchical_logger: WARNING: --max_samples WAS SET. THESE NUMBERS ARE ONLY PARTIAL AND SHOULD NOT BE USED FOR COMPARISON UNLESS YOU KNOW WHAT YOU ARE DOING.
WARNING:lighteval.logging.hierarchical_logger: Test all gather {
WARNING:lighteval.logging.hierarchical_logger: Test gather tensor
WARNING:lighteval.logging.hierarchical_logger:[TEST] Running NCCL sync for ranks [0]
WARNING:lighteval.logging.hierarchical_logger:[TEST] NCCL sync for ranks [0]
WARNING:lighteval.logging.hierarchical_logger: } [0:00:00.661526]
WARNING:lighteval.logging.hierarchical_logger: Model loading {
/fsx/hynek_kydlicek/miniconda3/envs/lighteval-main/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1601: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be depracted in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
WARNING:lighteval.models.nanotron_model:Building model
WARNING:lighteval.models.nanotron_model:Sanity checks on model
WARNING:lighteval.models.nanotron_model:Loading checkpoint from ./nanotron/checkpoints/0:
Loading weights: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [00:00<00:00, 1288.92it/s]
WARNING:lighteval.logging.hierarchical_logger: } [0:00:00.361026]
WARNING:lighteval.logging.hierarchical_logger: Tasks loading {
WARNING:lighteval.logging.hierarchical_logger: If you want to use extended_tasks, make sure you installed their dependencies using `pip install -e .[extended_tasks]`.
WARNING:lighteval.logging.hierarchical_logger: gsm8k main
WARNING:lighteval.logging.hierarchical_logger: Loading documents, and requests
Token indices sequence length is longer than the specified maximum sequence length for this model (985 > 256). Running this sequence through the model will result in indexing errors
WARNING:lighteval.logging.hierarchical_logger: } [0:00:01.286350]
WARNING:lighteval.logging.hierarchical_logger: Setting seeds and waiting for all processes {
WARNING:lighteval.logging.hierarchical_logger: setting seed to 1234 for random and numpy
WARNING:lighteval.logging.hierarchical_logger: } [0:00:00.000133]
WARNING:lighteval.logging.hierarchical_logger: Evaluation {
WARNING:lighteval.logging.hierarchical_logger: Evaluate on 1 tasks.
WARNING:lighteval.logging.hierarchical_logger: Running RequestType.GREEDY_UNTIL requests
WARNING:lighteval.logging.hierarchical_logger: You cannot select the number of dataset splits for a generative evaluation at the moment. Automatically inferring.
greedy -- Node 0: 0%| | 0/1 [00:00<?, ?it/s]WARNING:lighteval.models.nanotron_model:Detecting largest batch size
WARNING:lighteval.models.nanotron_model:Testing batch size 512
greedy -- Node 0: 0%| | 0/1 [00:00<?, ?it/s]
WARNING:lighteval.logging.hierarchical_logger: } [0:00:00.164193]
WARNING:lighteval.logging.hierarchical_logger:} [0:00:02.496358]
[rank0]: Traceback (most recent call last):
[rank0]: File "/fsx/hynek_kydlicek/projects/lighteval-main-branch/src/lighteval/__main__.py", line 93, in <module>
[rank0]: cli_evaluate()
[rank0]: File "/fsx/hynek_kydlicek/projects/lighteval-main-branch/src/lighteval/__main__.py", line 63, in cli_evaluate
[rank0]: main_nanotron(args.checkpoint_config_path, args.lighteval_config_path, args.cache_dir)
[rank0]: File "/fsx/hynek_kydlicek/projects/lighteval-main-branch/src/lighteval/logging/hierarchical_logger.py", line 175, in wrapper
[rank0]: return fn(*args, **kwargs)
[rank0]: File "/fsx/hynek_kydlicek/projects/lighteval-main-branch/src/lighteval/main_nanotron.py", line 97, in main
[rank0]: pipeline.evaluate()
[rank0]: File "/fsx/hynek_kydlicek/projects/lighteval-main-branch/src/lighteval/pipeline.py", line 235, in evaluate
[rank0]: sample_id_to_responses = self._run_model()
[rank0]: File "/fsx/hynek_kydlicek/projects/lighteval-main-branch/src/lighteval/pipeline.py", line 264, in _run_model
[rank0]: responses = run_model(requests, override_bs=self.pipeline_parameters.override_batch_size)
[rank0]: File "/fsx/hynek_kydlicek/miniconda3/envs/lighteval-main/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
[rank0]: return func(*args, **kwargs)
[rank0]: File "/fsx/hynek_kydlicek/projects/lighteval-main-branch/src/lighteval/models/nanotron_model.py", line 1149, in greedy_until
[rank0]: batch_size = self._get_batch_size(
[rank0]: File "/fsx/hynek_kydlicek/projects/lighteval-main-branch/src/lighteval/models/nanotron_model.py", line 320, in _get_batch_size
[rank0]: batch_size = forward_batch()
[rank0]: File "/fsx/hynek_kydlicek/projects/lighteval-main-branch/src/lighteval/utils/parallelism.py", line 104, in decorator
[rank0]: return function(batch_size, *args, **kwargs)
[rank0]: File "/fsx/hynek_kydlicek/projects/lighteval-main-branch/src/lighteval/models/nanotron_model.py", line 317, in forward_batch
[rank0]: F.log_softmax(self._model_call(test_batch).float(), dim=-1).cpu()
[rank0]: File "/fsx/hynek_kydlicek/projects/lighteval-main-branch/src/lighteval/models/nanotron_model.py", line 342, in _model_call
[rank0]: return self.model(inputs)
[rank0]: File "/fsx/hynek_kydlicek/miniconda3/envs/lighteval-main/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
[rank0]: return self._call_impl(*args, **kwargs)
[rank0]: File "/fsx/hynek_kydlicek/miniconda3/envs/lighteval-main/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
[rank0]: return forward_call(*args, **kwargs)
[rank0]: TypeError: LlamaModel.forward() missing 1 required positional argument: 'input_mask'
E0903 13:22:41.743000 140200056006464 torch/distributed/elastic/multiprocessing/api.py:833] failed (exitcode: 1) local_rank: 0 (pid: 1010958) of binary: /fsx/hynek_kydlicek/miniconda3/envs/lighteval-main/bin/python
Traceback (most recent call last):
File "/fsx/hynek_kydlicek/miniconda3/envs/lighteval-main/bin/torchrun", line 8, in <module>
sys.exit(main())
File "/fsx/hynek_kydlicek/miniconda3/envs/lighteval-main/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
return f(*args, **kwargs)
File "/fsx/hynek_kydlicek/miniconda3/envs/lighteval-main/lib/python3.10/site-packages/torch/distributed/run.py", line 901, in main
run(args)
File "/fsx/hynek_kydlicek/miniconda3/envs/lighteval-main/lib/python3.10/site-packages/torch/distributed/run.py", line 892, in run
elastic_launch(
File "/fsx/hynek_kydlicek/miniconda3/envs/lighteval-main/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 133, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/fsx/hynek_kydlicek/miniconda3/envs/lighteval-main/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 264, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
src/lighteval/__main__.py FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2024-09-03_13:22:41
host : ip-26-0-162-233.ec2.internal
rank : 0 (local_rank: 0)
exitcode : 1 (pid: 1010958)
error_file: <N/A>
traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
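The last frames point at the root cause: `_model_call` (`src/lighteval/models/nanotron_model.py:342`) calls `self.model(inputs)` with a single positional argument, while nanotron's `LlamaModel.forward()` also requires an `input_mask`. A minimal sketch of the kind of fix this suggests, with the keyword name taken from the error message and the all-ones mask an assumption (the batch-detection inputs are synthetic and unpadded):

```python
import torch

def _model_call(self, inputs: torch.Tensor) -> torch.Tensor:
    # Sketch of a possible fix: supply the `input_mask` that
    # LlamaModel.forward() requires. An all-True mask is assumed to be
    # acceptable here because the detection batches carry no padding.
    input_mask = torch.ones_like(inputs, dtype=torch.bool)
    return self.model(inputs, input_mask=input_mask)
```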
To Reproduce
torchrun --standalone --nnodes=1 --nproc-per-node=1 src/lighteval/__main__.py nanotron --checkpoint_config_path ./nanotron/checkpoints/0/config.yaml --lighteval_config_path examples/nanotron/lighteval_config_override_template.yaml
with `batch_size` set to 0 in the lighteval config override (examples/nanotron/lighteval_config_override_template.yaml).
Expected behavior
The batch size is detected correctly and the run finishes.
Version info
git+ssh://[email protected]/huggingface/lighteval.git@80b460f496e729077850f379d40da88298489a8f#egg=lighteval