While converting a URI/URL, requests needs a timeout when the server doesn't close the stream

Open ThePatelCode opened this issue 9 months ago • 4 comments

        elif uri.startswith("http:") or uri.startswith("https:"):
            response = self._requests_session.get(uri, stream=True)
            response.raise_for_status()
            return self.convert_response(
                response,
                stream_info=stream_info,
                file_extension=file_extension,
                url=mock_url,
                **kwargs,
            )
        else:
            raise ValueError(
                f"Unsupported URI scheme: {uri.split(':')[0]}. Supported schemes are: file:, data:, http:, https:"
            )

I have encountered hangs when the server doesn't close the stream: the call blocks forever while trying to buffer the response.

How to reproduce:

from markitdown import MarkItDown

md = MarkItDown()
# This call never returns, because the server does not close the stream.
result = md.convert("https://alletting.dot.state.al.us/")
markdown = result.text_content

I propose adding a timeout value. I have tested this locally, and if you give me some direction (i.e., how we should pass the timeout value to this function), I can make the change myself.

One option is to modify the stream_info blob to carry the timeout.
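A minimal sketch of what applying a timeout to the `get` call could look like. The helper name `get_with_timeout` and the `DEFAULT_TIMEOUT` value are hypothetical and not part of markitdown; the `session` argument is assumed to be a `requests.Session` (or anything with a compatible `get` method). In requests, `timeout` accepts either a single number or a `(connect, read)` tuple, and the read timeout bounds the wait between bytes rather than the total download time.

```python
from typing import Any, Optional, Tuple, Union

# Hypothetical default, in seconds: (connect timeout, read timeout).
DEFAULT_TIMEOUT: Tuple[float, float] = (5.0, 30.0)

TimeoutType = Union[float, Tuple[float, float]]


def get_with_timeout(
    session: Any,
    uri: str,
    timeout: Optional[TimeoutType] = None,
    **kwargs: Any,
) -> Any:
    """Issue a streaming GET with a timeout, falling back to a default.

    `session` is assumed to behave like requests.Session. With a read
    timeout set, a server that goes silent without closing the stream
    causes an exception instead of an indefinite hang.
    """
    if timeout is None:
        timeout = DEFAULT_TIMEOUT
    response = session.get(uri, stream=True, timeout=timeout, **kwargs)
    response.raise_for_status()
    return response
```

With a real session, usage would look like `get_with_timeout(requests.Session(), uri)`; a caller who needs a different limit can pass `timeout=` explicitly.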

Apr 05 '25 19:04 ThePatelCode

Yes, this is a good point and an easy fix. Let me pencil this in for the next update.

Apr 07 '25 16:04 afourney

@afourney, if you give me a sense of the direction you are thinking of, I am happy to contribute as well.

Apr 07 '25 16:04 ThePatelCode

> @afourney, if you give me a sense of the direction you are thinking of, I am happy to contribute as well.

I'm thinking of something in this area: https://github.com/microsoft/markitdown/blob/3fcd48cdfc651cbf508071c8d2fb7d82aeb075de/packages/markitdown/src/markitdown/_markitdown.py#L439C1-L465C38

The timeout could apply either when calling get or when reading from the stream. Let's start with a good/sane default, then we can parameterize it.
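For the "reading from the stream" option, one possible shape is an overall deadline enforced while buffering. This is only a sketch under stated assumptions: `buffer_with_deadline` and `StreamDeadlineExceeded` are hypothetical names, not markitdown code, and `chunks` stands in for something like `response.iter_content(chunk_size=...)` from requests. The deadline is checked only when a chunk arrives, so it complements a per-read timeout on the `get` call rather than replacing it.

```python
import io
import time
from typing import Callable, Iterable


class StreamDeadlineExceeded(Exception):
    """Raised when buffering a stream takes longer than the allowed time."""


def buffer_with_deadline(
    chunks: Iterable[bytes],
    deadline_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> io.BytesIO:
    """Copy chunks into memory, giving up once the deadline has passed.

    The elapsed time is checked each time a chunk arrives. A server that
    sends nothing at all is not caught here; that case needs a read
    timeout on the underlying request.
    """
    start = clock()
    buffer = io.BytesIO()
    for chunk in chunks:
        if clock() - start > deadline_seconds:
            raise StreamDeadlineExceeded(
                f"stream not finished after {deadline_seconds} seconds"
            )
        buffer.write(chunk)
    buffer.seek(0)  # Rewind so callers can read from the beginning.
    return buffer
```

The injectable `clock` parameter is a design choice for testability; in normal use the default `time.monotonic` is enough.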

Apr 07 '25 19:04 afourney

I was about to report the same issue. This seems to be an easy fix. Looking forward to the next build. Thanks to the team.

Apr 08 '25 14:04 sonnylaskar

It is a very easy fix. I just fixed it in a PR and hope it can be merged soon.

Sep 09 '25 11:09 sammydeprez