Fails to process S3 files with a comma in the name

Open Dennitz opened this issue 5 days ago • 0 comments

🐛 Describe the bug

When I run

python -m olmocr.pipeline s3://my-bucket/workspace --pdfs s3://my-bucket/inputs/*.pdf

any PDF files containing a comma in their name won't get processed.

For example, if a file is named "test, ab.pdf", this warning appears in the logs:

WARNING:olmocr.s3_utils:Attempt 8 failed to get_s3_bytes for  ab.pdf: s3_path must start with s3://, gs://, or weka://.
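
The path in the warning is just " ab.pdf" with a leading space, which would be consistent with the full S3 path being split on commas somewhere along the way. I have not confirmed this in the olmocr source, so treat it as a guess. Below is a minimal sketch of that failure mode and one possible way around it using Python's standard csv module. The helper names encode_row and decode_row are hypothetical and not part of olmocr.

```python
import csv
import io

# Example paths; the first one contains a comma, as in the report above.
paths = [
    "s3://my-bucket/inputs/test, ab.pdf",
    "s3://my-bucket/inputs/other.pdf",
]

# Naive round trip: join with commas, then split on commas.
# The comma inside the file name is treated as a separator, so the
# first path is cut into two pieces and the second piece (" ab.pdf")
# no longer starts with s3://.
naive = ",".join(paths).split(",")


def encode_row(items):
    """Hypothetical helper: serialize a list of paths as one CSV line,
    quoting any field that contains a comma."""
    buf = io.StringIO()
    csv.writer(buf).writerow(items)
    return buf.getvalue().strip()


def decode_row(line):
    """Hypothetical helper: parse one CSV line back into a list of paths."""
    return next(csv.reader([line]))


# Quoted round trip: the comma inside the file name survives.
safe = decode_row(encode_row(paths))

print(naive)
print(safe)
```

With the naive version the second fragment is exactly the " ab.pdf" seen in the warning, while the csv-based round trip returns the original paths unchanged.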

Versions

Python 3.11.11 aiohappyeyeballs==2.4.6 aiohttp==3.11.13 aiosignal==1.3.2 annotated-types==0.7.0 anthropic==0.47.2 anyio==4.8.0 asttokens==3.0.0 attrs==25.1.0 beaker-py==1.34.1 bleach==6.2.0 boto3==1.37.1 botocore==1.37.1 cached_path==1.6.7 cachetools==5.5.2 certifi==2025.1.31 cffi==1.17.1 charset-normalizer==3.4.1 click==8.1.8 cloudpickle==3.1.1 compressed-tensors==0.8.0 cryptography==44.0.1 cuda-bindings==12.8.0 cuda-python==12.8.0 datasets==3.3.2 decorator==5.2.1 decord==0.6.0 dill==0.3.8 diskcache==5.6.3 distro==1.9.0 docker==7.1.0 einops==0.8.1 executing==2.2.0 fastapi==0.115.8 filelock==3.17.0 flashinfer==0.1.6+cu124torch2.4 frozenlist==1.5.0 fsspec==2024.12.0 ftfy==6.3.1 fuzzysearch==0.7.3 gguf==0.10.0 google-api-core==2.24.1 google-auth==2.38.0 google-cloud-core==2.4.2 google-cloud-storage==2.19.0 google-crc32c==1.6.0 google-resumable-media==2.7.2 googleapis-common-protos==1.68.0 h11==0.14.0 hf_transfer==0.1.9 httpcore==1.0.7 httptools==0.6.4 httpx==0.28.1 huggingface-hub==0.27.1 idna==3.10 importlib_metadata==8.6.1 iniconfig==2.0.0 interegular==0.3.3 ipython==8.32.0 jedi==0.19.2 Jinja2==3.1.5 jiter==0.8.2 jmespath==1.0.1 jsonschema==4.23.0 jsonschema-specifications==2024.10.1 lark==1.2.2 lingua-language-detector==2.0.2 litellm==1.61.16 llvmlite==0.44.0 lm-format-enforcer==0.10.10 markdown-it-py==3.0.0 markdown2==2.5.3 MarkupSafe==3.0.2 matplotlib-inline==0.1.7 mdurl==0.1.2 mistral_common==1.5.3 modelscope==1.23.1 mpmath==1.3.0 msgpack==1.1.0 msgspec==0.19.0 multidict==6.1.0 multiprocess==0.70.16 nest-asyncio==1.6.0 networkx==3.4.2 numba==0.61.0 numpy==1.26.4 nvidia-cublas-cu12==12.4.5.8 nvidia-cuda-cupti-cu12==12.4.127 nvidia-cuda-nvrtc-cu12==12.4.127 nvidia-cuda-runtime-cu12==12.4.127 nvidia-cudnn-cu12==9.1.0.70 nvidia-cufft-cu12==11.2.1.3 nvidia-curand-cu12==10.3.5.147 nvidia-cusolver-cu12==11.6.1.9 nvidia-cusparse-cu12==12.3.1.170 nvidia-cusparselt-cu12==0.6.2 nvidia-ml-py==12.570.86 nvidia-nccl-cu12==2.21.5 nvidia-nvjitlink-cu12==12.4.127 
nvidia-nvtx-cu12==12.4.127 -e git+https://github.com/allenai/olmocr.git@d4b902cea235bb64a252d1e3f53cad41e22eb6ea#egg=olmocr openai==1.64.0 opencv-python-headless==4.11.0.86 orjson==3.10.15 outlines==0.0.46 packaging==24.2 pandas==2.2.3 parso==0.8.4 partial-json-parser==0.2.1.1.post5 pexpect==4.9.0 pillow==11.1.0 pluggy==1.5.0 prometheus-fastapi-instrumentator==7.0.2 prometheus_client==0.21.1 prompt_toolkit==3.0.50 propcache==0.3.0 proto-plus==1.26.0 protobuf==5.29.3 psutil==7.0.0 ptyprocess==0.7.0 pure_eval==0.2.3 py-cpuinfo==9.0.0 pyairports==2.1.1 pyarrow==19.0.1 pyasn1==0.6.1 pyasn1_modules==0.4.1 pybind11==2.13.6 pycountry==24.6.1 pycparser==2.22 pydantic==2.10.6 pydantic_core==2.27.2 Pygments==2.19.1 pypdf==5.3.0 pypdfium2==4.30.1 pytest==8.3.4 python-dateutil==2.9.0.post0 python-dotenv==1.0.1 python-multipart==0.0.20 pytz==2025.1 PyYAML==6.0.2 pyzmq==26.2.1 RapidFuzz==3.12.1 ray==2.42.1 referencing==0.36.2 regex==2024.11.6 requests==2.32.3 rich==13.9.4 rpds-py==0.23.1 rsa==4.9 s3transfer==0.11.2 safetensors==0.5.3 sentencepiece==0.2.0 setproctitle==1.3.5 sgl-kernel==0.0.3.post1 sglang==0.4.2 six==1.17.0 smart-open==7.1.0 sniffio==1.3.1 stack-data==0.6.3 starlette==0.45.3 sympy==1.13.1 tiktoken==0.9.0 tokenizers==0.21.0 torch==2.5.1 torchao==0.8.0 torchvision==0.20.1 tqdm==4.67.1 traitlets==5.14.3 transformers==4.49.0 triton==3.1.0 typing_extensions==4.12.2 tzdata==2025.1 urllib3==2.3.0 uvicorn==0.34.0 uvloop==0.21.0 vllm==0.6.4.post1 watchfiles==1.0.4 wcwidth==0.2.13 webencodings==0.5.1 websockets==15.0 wrapt==1.17.2 xformers==0.0.28.post3 xgrammar==0.1.13 xxhash==3.5.0 yarl==1.18.3 zipp==3.21.0 zstandard==0.23.0

Dennitz, Feb 27 '25 09:02