
Multiple PDFs Error, [Errno 110] Connect call failed

Open xcvil opened this issue 9 months ago • 12 comments

🐛 Describe the bug

I keep encountering errors when converting multiple PDFs:

WARNING - Client error on attempt 0 for /path-to-pdf/xxx.pdf-1782: <class 'TimeoutError'> [Errno 110] Connect call failed ('127.0.0.1', 31824)

As a result, only around 15% of the PDFs are converted successfully.
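One way to narrow this down is to confirm the inference server's port is actually accepting connections before workers start submitting pages. A minimal stdlib-only sketch (the helper name `wait_for_port` and the retry parameters are made up for illustration; this is not part of olmocr):

```python
import socket
import time

def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 1.0) -> bool:
    """Poll until a TCP connection to (host, port) succeeds, or give up after `timeout` seconds."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # Succeeds only once the server's listen socket is up and reachable.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

If this returns False for the port the pipeline is trying to use, the failure is in server startup or network reachability rather than in the PDF processing itself.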

Versions

Python 3.11.11 aiohappyeyeballs==2.6.1 aiohttp==3.11.14 aiosignal==1.3.2 annotated-types==0.7.0 anthropic==0.49.0 anyio==4.9.0 asttokens==3.0.0 attrs==25.3.0 beaker-py==1.34.1 bleach==6.2.0 boto3==1.37.14 botocore==1.37.14 cached_path==1.7.1 cachetools==5.5.2 certifi==2025.1.31 cffi==1.17.1 charset-normalizer==3.4.1 click==8.1.8 cloudpickle==3.1.1 compressed-tensors==0.8.0 cryptography==44.0.2 cuda-bindings==12.8.0 cuda-python==12.8.0 datasets==3.4.1 decorator==5.2.1 decord==0.6.0 dill==0.3.8 diskcache==5.6.3 distro==1.9.0 docker==7.1.0 einops==0.8.1 executing==2.2.0 fastapi==0.115.11 filelock==3.18.0 flashinfer==0.1.6+cu124torch2.4 frozenlist==1.5.0 fsspec==2024.12.0 ftfy==6.3.1 gguf==0.10.0 google-api-core==2.24.2 google-auth==2.38.0 google-cloud-core==2.4.3 google-cloud-storage==2.19.0 google-crc32c==1.7.0 google-resumable-media==2.7.2 googleapis-common-protos==1.69.2 h11==0.14.0 hf_transfer==0.1.9 httpcore==1.0.7 httptools==0.6.4 httpx==0.28.1 huggingface-hub==0.27.1 idna==3.10 importlib_metadata==8.6.1 interegular==0.3.3 ipython==9.0.2 ipython_pygments_lexers==1.1.1 jedi==0.19.2 Jinja2==3.1.6 jiter==0.9.0 jmespath==1.0.1 jsonschema==4.23.0 jsonschema-specifications==2024.10.1 lark==1.2.2 lingua-language-detector==2.0.2 litellm==1.63.11 llvmlite==0.44.0 lm-format-enforcer==0.10.11 markdown-it-py==3.0.0 markdown2==2.5.3 MarkupSafe==3.0.2 matplotlib-inline==0.1.7 mdurl==0.1.2 mistral_common==1.5.4 modelscope==1.23.2 mpmath==1.3.0 msgpack==1.1.0 msgspec==0.19.0 multidict==6.2.0 multiprocess==0.70.16 nest-asyncio==1.6.0 networkx==3.4.2 ninja==1.11.1.3 numba==0.61.0 numpy==1.26.4 nvidia-cublas-cu12==12.4.5.8 nvidia-cuda-cupti-cu12==12.4.127 nvidia-cuda-nvrtc-cu12==12.4.127 nvidia-cuda-runtime-cu12==12.4.127 nvidia-cudnn-cu12==9.1.0.70 nvidia-cufft-cu12==11.2.1.3 nvidia-curand-cu12==10.3.5.147 nvidia-cusolver-cu12==11.6.1.9 nvidia-cusparse-cu12==12.3.1.170 nvidia-ml-py==12.570.86 nvidia-nccl-cu12==2.21.5 nvidia-nvjitlink-cu12==12.4.127 nvidia-nvtx-cu12==12.4.127 -e 
git+https://github.com/allenai/olmocr.git@3c22cf3430467a4cd3683dfab2652089f0e7a4ce#egg=olmocr openai==1.66.3 opencv-python-headless==4.11.0.86 orjson==3.10.15 outlines==0.0.46 packaging==24.2 pandas==2.2.3 parso==0.8.4 partial-json-parser==0.2.1.1.post5 pexpect==4.9.0 pillow==11.1.0 prometheus-fastapi-instrumentator==7.0.2 prometheus_client==0.21.1 prompt_toolkit==3.0.50 propcache==0.3.0 proto-plus==1.26.1 protobuf==6.30.1 psutil==7.0.0 ptyprocess==0.7.0 pure_eval==0.2.3 py-cpuinfo==9.0.0 pyairports==2.1.1 pyarrow==19.0.1 pyasn1==0.6.1 pyasn1_modules==0.4.1 pycountry==24.6.1 pycparser==2.22 pydantic==2.10.6 pydantic_core==2.27.2 Pygments==2.19.1 pypdf==5.4.0 pypdfium2==4.30.1 python-dateutil==2.9.0.post0 python-dotenv==1.0.1 python-multipart==0.0.20 pytz==2025.1 PyYAML==6.0.2 pyzmq==26.3.0 ray==2.43.0 referencing==0.36.2 regex==2024.11.6 requests==2.32.3 rich==13.9.4 rpds-py==0.23.1 rsa==4.9 s3transfer==0.11.4 safetensors==0.5.3 sentencepiece==0.2.0 setproctitle==1.3.5 sgl-kernel==0.0.3.post1 sglang==0.4.2 six==1.17.0 smart-open==7.1.0 sniffio==1.3.1 stack-data==0.6.3 starlette==0.46.1 sympy==1.13.1 tiktoken==0.9.0 tokenizers==0.20.3 torch==2.5.1 torchao==0.9.0 torchvision==0.20.1 tqdm==4.67.1 traitlets==5.14.3 transformers==4.46.2 triton==3.1.0 typing_extensions==4.12.2 tzdata==2025.1 urllib3==2.3.0 uvicorn==0.34.0 uvloop==0.21.0 vllm==0.6.4.post1 watchfiles==1.0.4 wcwidth==0.2.13 webencodings==0.5.1 websockets==15.0.1 wrapt==1.17.2 xformers==0.0.28.post3 xgrammar==0.1.16 xxhash==3.5.0 yarl==1.18.3 zipp==3.21.0 zstandard==0.23.0

xcvil avatar Mar 19 '25 11:03 xcvil

If no one is working on this, I would love to contribute and fix this

SkaarFacee avatar Mar 19 '25 23:03 SkaarFacee

> If no one is working on this, I would love to contribute and fix this

Please go ahead and let me know if you want to discuss.

xcvil avatar Mar 20 '25 14:03 xcvil

@jakep-allenai could you check whether this is a bug or expected behavior?

xcvil avatar Mar 21 '25 12:03 xcvil

@SkaarFacee If you have an idea, then please let me know.

Interesting, it's definitely not normal. What sort of host are you running this on? Is it within Docker? What is your networking setup?

It should be starting SGLang for you on port 30024, so 31824 is weird.

jakep-allenai avatar Mar 21 '25 15:03 jakep-allenai

@jakep-allenai I am running several olmocr pipelines on one node, so port 30024 is occupied; that is why I used a different one. Does the port number matter? Note that I also have the same issue with port 30024!

xcvil avatar Mar 21 '25 16:03 xcvil
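When several pipelines share one node, a way to avoid hand-picking ports is to ask the OS for a free one before launching each server. A sketch (`find_free_port` is a hypothetical helper, not part of olmocr; it assumes you can pass the chosen port through to the pipeline):

```python
import socket

def find_free_port() -> int:
    """Ask the OS for an unused TCP port by binding to port 0."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]
```

Note there is a small race window between closing this socket and the server binding the port, so this is a convenience, not a guarantee.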

I am using all default settings on multi-GPU Ubuntu nodes under SLURM/LSF. I tried both SLURM and LSF.

xcvil avatar Mar 21 '25 16:03 xcvil

Hmm, interesting. When we run on multiple GPUs, we run olmocr as, for example, 8 separate Docker containers on one host. Does the error go away if you run just 1 GPU at a time?

jakep-allenai avatar Mar 21 '25 18:03 jakep-allenai

Wow! Good idea. When I run on a single GPU, the error still exists...

xcvil avatar Mar 21 '25 18:03 xcvil

https://github.com/allenai/olmocr/blob/3edae0ac7110efb735d39a7cc699847e76d92114/olmocr/pipeline.py#L514-L535

Hmm, what if you change this code to add "--host", "0.0.0.0" to the array, so the server binds to all interfaces?

jakep-allenai avatar Mar 21 '25 20:03 jakep-allenai

Okay, I can take a look and see what is going on. Please gimme a day or so

SkaarFacee avatar Mar 23 '25 10:03 SkaarFacee

Ahh sorry. I got a bit busy the last few weeks. @xcvil Can you help me replicate this? I am not sure how to go about with that

SkaarFacee avatar Apr 04 '25 05:04 SkaarFacee

> Ahh sorry. I got a bit busy the last few weeks. @xcvil Can you help me replicate this? I am not sure how to go about with that

He meant changing the code as follows, so that the server listens on all network interfaces, making it accessible from external machines (if firewall and network settings allow it).

In this file: olmocr/olmocr/pipeline.py

```python
cmd = [
    "python3",
    "-m",
    "sglang.launch_server",
    "--model-path",
    model_name_or_path,
    "--chat-template",
    args.model_chat_template,
    # "--context-length", str(args.model_max_context),  # Commented out due to crashes
    "--port",
    str(SGLANG_SERVER_PORT),
    "--log-level-http",
    "warning",
    "--host",  # New line
    "0.0.0.0",  # New line
]
cmd.extend(mem_fraction_arg)

proc = await asyncio.create_subprocess_exec(
    *cmd,
    stdout=asyncio.subprocess.PIPE,
    stderr=asyncio.subprocess.PIPE,
)
```

likith1908 avatar Apr 25 '25 08:04 likith1908
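The difference the `--host` flag makes can be seen with a small stdlib experiment: a socket bound to 127.0.0.1 is only reachable over loopback, while one bound to 0.0.0.0 accepts connections on every interface. This is just an illustration; `listening_scope` is a made-up helper, not olmocr or SGLang code:

```python
import socket

def listening_scope(bind_addr: str) -> str:
    """Bind a throwaway listener and report which clients could reach it."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.bind((bind_addr, 0))         # port 0: let the OS pick a free port
    s.listen(1)
    bound_ip = s.getsockname()[0]  # the address the kernel actually bound
    s.close()
    return "loopback only" if bound_ip == "127.0.0.1" else "all interfaces"
```

With the default loopback bind, workers in other containers (or resolving the host via a non-loopback address) cannot reach the server, which matches the connect failures above.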

Closing this issue for now, feel free to reopen if you want to discuss more on this.

aman-17 avatar Jul 03 '25 21:07 aman-17