RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory

Open Jimmy-Yang1217 opened this issue 1 year ago • 2 comments

🐛 Describe the bug

I am new to OLMo and want to continue training (i.e., fine-tune) several of the checkpoints listed in the CSV under checkpoints/official. I followed the instructions in the README and downloaded a checkpoint via its link, but loading it always fails with 'RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory'.

According to answers to similar questions on Stack Overflow, this error can be caused by a corrupted checkpoint file or an incompatible torch version. I tried several different checkpoints and torch versions from 2.0.0 to 2.3.0, but the error persists. The checkpoint download also appeared to finish, reaching 100%, so I assumed the ckpt files were not corrupted.

Here is my terminal command: torchrun --nproc_per_node=1 scripts/train.py configs/official/OLMo-1B.yaml --load_path=https://olmo-checkpoints.org/ai2-llm/olmo-small/46zc5fly/step369000-unsharded --save_folder=/opt/data/private/OLMo/olmo/step369000 --wandb=null

And the error: RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory
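
For readers hitting the same error: checkpoints written by recent versions of torch.save are zip archives, and this message usually means the zip reader could not find the archive's central directory at the end of the file, which is what a truncated download looks like. A minimal sketch of a quick local check, assuming the checkpoint is an ordinary file on disk; the function name `looks_like_complete_checkpoint` is hypothetical and not part of OLMo or PyTorch:

```python
import zipfile
from pathlib import Path


def looks_like_complete_checkpoint(path):
    """Return True if `path` is a file that ends with a readable zip directory record.

    Hypothetical helper for illustration only. It uses the standard-library
    zipfile module and does not load any tensors.
    """
    p = Path(path)
    # zipfile.is_zipfile searches the tail of the file for the
    # end-of-central-directory record, which a truncated download lacks.
    return p.is_file() and zipfile.is_zipfile(str(p))
```

A result of False for a large .pt file suggests the download was cut short, even if the progress bar reached 100%. A result of True only shows that the archive directory is present, not that every tensor inside is intact.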

Versions

Python 3.8.10 accelerate==0.25.0 ai2-olmo==0.3.1 aiofiles==23.2.1 aiohttp==3.8.6 aiosignal==1.3.1 albumentations==1.3.1 altair==5.1.2 annotated-types==0.6.0 antlr4-python3-runtime==4.9.3 anyio==4.0.0 apache-beam==2.55.1 async-timeout==4.0.3 attrs==23.1.0 backports.zoneinfo==0.2.1 beautifulsoup4==4.12.3 bitsandbytes==0.41.3.post2 boto3==1.34.93 botocore==1.34.93 braceexpand==0.1.7 cached-path==1.6.2 cachetools==5.3.3 certifi==2019.11.28 chardet==3.0.4 charset-normalizer==3.3.1 click==8.1.7 clip==0.2.0 clip-benchmark==1.5.0 cloudpickle==2.2.1 cmake==3.27.7 contourpy==1.1.1 crcmod==1.7 cycler==0.12.1 dataclasses==0.6 datasets==2.14.5 dbus-python==1.2.16 dill==0.3.1.1 dnspython==2.6.1 docker-pycreds==0.4.0 docopt==0.6.2 exceptiongroup==1.1.3 ExifRead-nocycle==3.0.1 fastapi==0.104.1 fastavro==1.9.4 fasteners==0.19 ffmpy==0.3.1 filelock==3.12.4 fire==0.4.0 fonttools==4.44.0 frozenlist==1.4.0 fsspec==2023.9.2 ftfy==6.1.1 gdown==5.1.0 gitdb==4.0.11 GitPython==3.1.40 google-api-core==2.18.0 google-auth==2.29.0 google-cloud-core==2.4.1 google-cloud-storage==2.16.0 google-crc32c==1.5.0 google-resumable-media==2.7.0 googleapis-common-protos==1.63.0 gradio==3.39.0 gradio-client==0.7.0 grpcio==1.62.2 h11==0.14.0 hdfs==2.7.3 httpcore==1.0.2 httplib2==0.22.0 httpx==0.25.1 huggingface-hub==0.22.2 idna==2.8 imageio==2.31.6 img2dataset==1.42.0 importlib-resources==6.1.1 Jinja2==3.1.2 jmespath==1.0.1 joblib==1.3.2 Js2Py==0.74 jsonpickle==3.0.4 jsonschema==4.20.0 jsonschema-specifications==2023.11.1 kiwisolver==1.4.5 lazy-loader==0.3 lightning-utilities==0.11.2 linkify-it-py==2.0.2 lit==17.0.3 loguru==0.7.2 loralib==0.1.2 markdown-it-py==3.0.0 MarkupSafe==2.1.3 matplotlib==3.7.3 mdit-py-plugins==0.3.3 mdurl==0.1.2 mpmath==1.3.0 multidict==6.0.4 multiprocess==0.70.15 networkx==3.1 numpy==1.21.0 nvidia-cublas-cu11==11.10.3.66 nvidia-cublas-cu12==12.1.3.1 nvidia-cuda-cupti-cu11==11.7.101 nvidia-cuda-cupti-cu12==12.1.105 nvidia-cuda-nvrtc-cu11==11.7.99 nvidia-cuda-nvrtc-cu12==12.1.105 
nvidia-cuda-runtime-cu11==11.7.99 nvidia-cuda-runtime-cu12==12.1.105 nvidia-cudnn-cu11==8.5.0.96 nvidia-cudnn-cu12==8.9.2.26 nvidia-cufft-cu11==10.9.0.58 nvidia-cufft-cu12==11.0.2.54 nvidia-curand-cu11==10.2.10.91 nvidia-curand-cu12==10.3.2.106 nvidia-cusolver-cu11==11.4.0.1 nvidia-cusolver-cu12==11.4.5.107 nvidia-cusparse-cu11==11.7.4.91 nvidia-cusparse-cu12==12.1.0.106 nvidia-nccl-cu11==2.14.3 nvidia-nccl-cu12==2.19.3 nvidia-nvjitlink-cu12==12.4.127 nvidia-nvtx-cu11==11.7.91 nvidia-nvtx-cu12==12.1.105 objsize==0.7.0 omegaconf==2.3.0 open-clip-torch==2.23.0 openai-clip==1.0.1 opencv-python-headless==4.8.1.78 orjson==3.9.10 packaging==23.2 pandas==1.5.3 pathtools==0.1.2 peft==0.7.1 Pillow==9.1.1 pkgutil-resolve-name==1.3.10 promise==2.3 proto-plus==1.23.0 protobuf==3.20.3 psutil==5.9.6 pyarrow==10.0.1 pyarrow-hotfix==0.6 pyasn1==0.6.0 pyasn1-modules==0.4.0 pycocoevalcap==1.2 pycocotools==2.0.7 pydantic==2.5.1 pydantic-core==2.14.3 pydot==1.4.2 pydub==0.25.1 pygments==2.17.2 PyGObject==3.36.0 pyjsparser==2.7.1 pymongo==4.7.0 pyparsing==3.1.1 PySocks==1.7.1 python-apt==2.0.0+ubuntu0.20.4.7 python-dateutil==2.8.2 python-multipart==0.0.6 pytz==2023.3.post1 PyWavelets==1.4.1 PyYAML==6.0.1 quant-cuda==0.0.0 qudida==0.0.4 referencing==0.31.0 regex==2023.10.3 requests==2.31.0 requests-unixsocket==0.2.0 rich==13.7.1 rpds-py==0.13.0 rsa==4.9 s3transfer==0.10.1 safetensors==0.4.3 scikit-image==0.21.0 scikit-learn==1.3.2 scipy==1.10.1 semantic-version==2.10.0 sentencepiece==0.1.99 sentry-sdk==1.33.1 setproctitle==1.3.3 shortuuid==1.0.11 six==1.14.0 smmap==5.0.1 sniffio==1.3.0 soupsieve==2.5 starlette==0.27.0 sympy==1.12 termcolor==2.3.0 threadpoolctl==3.2.0 tifffile==2023.7.10 timm==0.9.10 tokenizers==0.19.1 toolz==0.12.0 torch==2.0.0 torch-summary==1.4.5 torchaudio==0.9.0 torchmetrics==1.3.2 torchvision==0.15.2+cu117 tqdm==4.66.1 transformers==4.40.1 triton==2.0.0 typing-extensions==4.11.0 tzdata==2023.3 tzlocal==5.2 uc-micro-py==1.0.2 urllib3==1.26.18 uvicorn==0.24.0.post1 
wandb==0.12.21 wcwidth==0.2.8 webdataset==0.2.72 websockets==11.0.3 xxhash==3.4.1 yarl==1.9.2 zipp==3.17.0 zstandard==0.22.0

Jimmy-Yang1217, Apr 27 '24 16:04

@Jimmy-Yang1217 - could you please include the log before the error occurs? I'm curious when exactly the error is thrown. Thank you!

dumitrac, Apr 30 '24 22:04

Hi, I have solved this problem: some of the downloaded files were broken, and the fix was to go to the cache and make sure the specific downloaded ckpt file is complete. I have another problem I would like your help with. Because the checkpoint models differ from the released hf-OLMo, I cannot evaluate a checkpoint directly with lm-evaluation-harness: lm_eval has no model type that an OLMo checkpoint fits into. How do you evaluate OLMo checkpoints on downstream tasks? Is there a fast way? Thanks for your time!
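
The cache check described in this comment could be sketched roughly as follows. This is an illustration under assumptions, not the method the poster actually used: the helper name `find_suspect_files` and the `min_bytes` threshold are hypothetical, and the cache location must be supplied by the reader (for the cached-path package it is commonly `~/.cache/cached_path`, but verify this on your own machine).

```python
import zipfile
from pathlib import Path


def find_suspect_files(cache_dir, min_bytes=1_000_000):
    """List files under `cache_dir` of at least `min_bytes` that are not readable zips.

    Hypothetical helper for illustration only. Returns (path, size) tuples.
    """
    suspects = []
    for path in sorted(Path(cache_dir).rglob("*")):
        if not path.is_file():
            continue
        size = path.stat().st_size
        # Files below min_bytes (metadata, configs, lock files) are skipped.
        # Pass min_bytes=0 to check every file instead.
        if size >= min_bytes and not zipfile.is_zipfile(str(path)):
            suspects.append((str(path), size))
    return suspects
```

Any file reported here is a candidate for an interrupted download. Removing it and rerunning the training command should trigger a fresh download, assuming the loader fetches missing files again. Note that a download cut off very early can be smaller than `min_bytes` and would be skipped by the default threshold.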

------------------ Original message ------------------ From: "Constantin @.>; Sent: Wednesday, May 1, 2024, 6:48 AM; To: @.>; Cc: @.>; @.>; Subject: Re: [allenai/OLMo] RuntimeError: PytorchStreamReader failed reading zip archive: failed finding central directory (Issue #563)

@Jimmy-Yang1217 - could you please include the log before the error occurs? I'm curious when exactly the error is thrown. Thank you!

Jimmy-Yang1217, May 31 '24 04:05