[BUG] EOF error in pickle when reading arrow file
Bug report checklist
- [X] I provided code that demonstrates a minimal reproducible example.
- [X] I confirmed bug exists on the latest mainline of Chronos via source install.
Describe the bug An error occurs when executing the training script on a dataset generated using the process mentioned here. The data files used can be downloaded from here. The issue is similar to https://github.com/amazon-science/chronos-forecasting/issues/149. The error message is shown below.
```
D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\gluonts\json.py:102: UserWarning: Using `json`-module for json-handling. Consider installing one of `orjson`, `ujson` to speed up serialization and deserialization.
warnings.warn(
2024-07-22 23:38:15,404 - D:\Chronos-Finetune\train.py - INFO - Using SEED: 3565056063
2024-07-22 23:38:15,404 - D:\Chronos-Finetune\train.py - INFO - Logging dir: output\run-7
2024-07-22 23:38:15,404 - D:\Chronos-Finetune\train.py - INFO - Loading and filtering 2 datasets for training: ['D://Chronos-Finetune//noise-data.arrow', 'D://Chronos-Finetune//kernelsynth-data.arrow']
2024-07-22 23:38:15,404 - D:\Chronos-Finetune\train.py - INFO - Mixing probabilities: [0.9, 0.1]
2024-07-22 23:38:15,418 - D:\Chronos-Finetune\train.py - INFO - Initializing model
2024-07-22 23:38:15,418 - D:\Chronos-Finetune\train.py - INFO - Using random initialization
max_steps is given, it will override any value given in num_train_epochs
2024-07-22 23:38:16,324 - D:\Chronos-Finetune\train.py - INFO - Training
0%| | 0/200000 [00:00<?, ?it/s]Traceback (most recent call last):
File "D:\Chronos-Finetune\train.py", line 692, in <module>
app()
File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\typer\main.py", line 326, in __call__
raise e
File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\typer\main.py", line 309, in __call__
return get_command(self)(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\click\core.py", line 1157, in __call__
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\typer\core.py", line 661, in main
return _main(
^^^^^^
File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\typer\core.py", line 193, in _main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\click\core.py", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\click\core.py", line 783, in invoke
return __callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\typer\main.py", line 692, in wrapper
return callback(**use_params)
^^^^^^^^^^^^^^^^^^^^^^
File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\typer_config\decorators.py", line 92, in wrapped
return cmd(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^
File "D:\Chronos-Finetune\train.py", line 679, in main
trainer.train()
File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\transformers\trainer.py", line 1932, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\transformers\trainer.py", line 2230, in _inner_training_loop
for step, inputs in enumerate(epoch_iterator):
File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\accelerate\data_loader.py", line 671, in __iter__
main_iterator = super().__iter__()
^^^^^^^^^^^^^^^^^^
File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\torch\utils\data\dataloader.py", line 439, in __iter__
return self._get_iterator()
^^^^^^^^^^^^^^^^^^^^
File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\torch\utils\data\dataloader.py", line 387, in _get_iterator
return _MultiProcessingDataLoaderIter(self)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\torch\utils\data\dataloader.py", line 1040, in __init__
w.start()
File "C:\Users\avpaul\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\process.py", line 121, in start
self._popen = self._Popen(self)
^^^^^^^^^^^^^^^^^
File "C:\Users\avpaul\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\context.py", line 224, in _Popen
return _default_context.get_context().Process._Popen(process_obj)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\avpaul\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\context.py", line 336, in _Popen
return Popen(process_obj)
^^^^^^^^^^^^^^^^^^
File "C:\Users\avpaul\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\popen_spawn_win32.py", line 95, in __init__
reduction.dump(process_obj, to_child)
File "C:\Users\avpaul\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\reduction.py", line 60, in dump
ForkingPickler(file, protocol).dump(obj)
File "<stringsource>", line 2, in pyarrow.lib._RecordBatchFileReader.__reduce_cython__
TypeError: no default __reduce__ due to non-trivial __cinit__
0%| | 0/200000 [00:00<?, ?it/s]
(Chronos_venv) D:\Chronos-Finetune>D:\Chronos-Finetune\Chronos_venv\Lib\site-packages\gluonts\json.py:102: UserWarning: Using `json`-module for json-handling. Consider installing one of `orjson`, `ujson` to speed up serialization and deserialization.
warnings.warn(
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Users\avpaul\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\spawn.py", line 122, in spawn_main
exitcode = _main(fd, parent_sentinel)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\avpaul\AppData\Local\Programs\Python\Python311\Lib\multiprocessing\spawn.py", line 132, in _main
self = reduction.pickle.load(from_parent)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
EOFError: Ran out of input
```
Expected behavior Training/fine-tuning should proceed smoothly.
To reproduce
- Download the data files from here
- Change the data file paths in the config script
- Run the training script using `python train.py --config chronos-t5-small.yaml`
Environment description
- Operating system: Windows 11
- CUDA version: 12.4
- NVCC version: cuda_12.3.r12.3/compiler.33567101_0
- PyTorch version: 2.3.1+cu121
- HuggingFace transformers version: 4.42.4
- HuggingFace accelerate version: 0.32.1
This seems relevant; see also the first answer here.
TL;DR: we probably need to add `freeze_support()` after `if __name__ == "__main__":` in the training script.
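For reference, a minimal sketch of what that would look like at the bottom of train.py (assuming the Typer entry point `app()` seen in the traceback above):

```python
# Sketch only: guard the entry point so that worker processes spawned by the
# DataLoader on Windows do not re-execute the training script on import.
if __name__ == "__main__":
    from multiprocessing import freeze_support

    freeze_support()  # no-op unless the program is frozen, but safe to call
    app()  # Typer app invoked by the training script
```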
@AvisP could you check if the fix proposed in #156 makes it work for you?
@lostella I added `freeze_support()` after this line, but it is not working. In the example from the Python website that you shared, there is a call to `Process`, which I don't see happening in the training code; maybe it needs to be inserted before that?
I tried to get WSL on Windows and make it work from there, but unfortunately it is not working properly.
@AvisP can you share the exact config.yaml that you're using?
Sure, here it is. I also tried with two datasets, setting the probabilities to 0.9 and 0.1.
```yaml
training_data_paths:
  # - "D://Chronos-Finetune//noise-data.arrow"
  - "D://Chronos-Finetune//kernelsynth-data.arrow"
probability:
  - 1.0
  # - 0.1
context_length: 512
prediction_length: 64
min_past: 60
max_steps: 200_000
save_steps: 100_000
log_steps: 500
per_device_train_batch_size: 32
learning_rate: 0.001
optim: adamw_torch_fused
num_samples: 20
shuffle_buffer_length: 100_000
gradient_accumulation_steps: 1
model_id: google/t5-efficient-small
model_type: seq2seq
random_init: true
tie_embeddings: true
output_dir: ./output/
tf32: true
torch_compile: true
tokenizer_class: "MeanScaleUniformBins"
tokenizer_kwargs:
  low_limit: -15.0
  high_limit: 15.0
n_tokens: 4096
lr_scheduler_type: linear
warmup_ratio: 0.0
dataloader_num_workers: 1
max_missing_prop: 0.9
use_eos_token: true
```
Config looks okay to me. Could you try the following and try again?
- Set `torch_compile: false`.
- Set `dataloader_num_workers: 0`.

Let's use just the one KernelSynth dataset, like you have.
It is running now after making these two changes. Does setting `dataloader_num_workers` to 0 cause any slowdown of the data loading process? I will try out the evaluation script next. Thanks for your time!
@AvisP This looks like a multiprocessing-on-Windows issue. Setting `dataloader_num_workers=0` may lead to some loss in training speed.
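For context, the failure can be reproduced outside the trainer: with `num_workers > 0`, PyTorch spawns worker processes on Windows/macOS and pickles the dataset object, but the underlying pyarrow file reader is not picklable. A minimal sketch (the file path here is just an example):

```python
# Minimal sketch: pyarrow's RecordBatchFileReader cannot be pickled,
# which is what the spawned DataLoader worker trips over.
import pickle

import pyarrow as pa

with pa.memory_map("kernelsynth-data.arrow", "r") as source:  # example path
    reader = pa.ipc.open_file(source)
    try:
        pickle.dumps(reader)
    except TypeError as err:
        # prints: no default __reduce__ due to non-trivial __cinit__
        print(err)
```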
I'm having this issue on macOS; setting `dataloader_num_workers=0` does "fix" it.
The only difference is that it crashes at:
`TypeError: no default __reduce__ due to non-trivial __cinit__`
@RemiKalbe did you convert the dataset into the GluonTS-style arrow format correctly, as described in the README?
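For reference, the conversion is roughly along the lines of the helper shown in the README; the sketch below is illustrative (the `convert_to_arrow` name, default start date, and example call are assumptions, so double-check against the exact snippet in the repo):

```python
# Rough sketch of converting raw series into the GluonTS-style arrow format
# expected by the training script; names and defaults here are illustrative.
from pathlib import Path
from typing import List, Union

import numpy as np
from gluonts.dataset.arrow import ArrowWriter


def convert_to_arrow(
    path: Union[str, Path],
    time_series: Union[List[np.ndarray], np.ndarray],
    start: np.datetime64 = np.datetime64("2000-01-01 00:00", "s"),
    compression: str = "lz4",
):
    # Each entry needs at least a "start" timestamp and a "target" array.
    dataset = [{"start": start, "target": np.asarray(ts)} for ts in time_series]
    ArrowWriter(compression=compression).write_to_file(dataset, path=path)


# e.g. convert_to_arrow("kernelsynth-data.arrow", time_series=list_of_1d_arrays)
```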
Closing due to inactivity. Please feel free to re-open if you have further questions.