When converting a URI/URL, requests needs a timeout for servers that don't close the stream
elif uri.startswith("http:") or uri.startswith("https:"):
    response = self._requests_session.get(uri, stream=True)
    response.raise_for_status()
    return self.convert_response(
        response,
        stream_info=stream_info,
        file_extension=file_extension,
        url=mock_url,
        **kwargs,
    )
else:
    raise ValueError(
        f"Unsupported URI scheme: {uri.split(':')[0]}. Supported schemes are: file:, data:, http:, https:"
    )
I have encountered cases where the server doesn't close the stream, and the call hangs forever trying to buffer it.
How to reproduce:
from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("https://alletting.dot.state.al.us/")
markdown = result.text_content
I propose adding a timeout value. I have tested this locally, and if you give me some direction, i.e. how we should pass the timeout value to this function, I can make the change myself.
One option is to extend the stream_info blob with a timeout field.
Yes, this is a good point and an easy fix. Let me pencil this in for the next update.
@afourney, if you give me a sense of the direction you are thinking of, I am happy to contribute as well.
I'm thinking of something in this area: https://github.com/microsoft/markitdown/blob/3fcd48cdfc651cbf508071c8d2fb7d82aeb075de/packages/markitdown/src/markitdown/_markitdown.py#L439C1-L465C38
Either when calling get or reading from the stream. Let's start with a good/sane default, then we can parameterize it.
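For illustration, here is a minimal sketch of the "sane default" idea, not the actual markitdown change. The helper name get_with_timeout and the DEFAULT_TIMEOUT values are hypothetical; the only real API relied on is the timeout argument of requests' Session.get, which accepts a (connect, read) tuple in seconds.

```python
from typing import Any, Tuple

# Hypothetical default: (connect timeout, read timeout) in seconds.
# These values are an assumption for this sketch, not markitdown's choice.
DEFAULT_TIMEOUT: Tuple[float, float] = (5.0, 30.0)


def get_with_timeout(
    session: Any,
    uri: str,
    timeout: Tuple[float, float] = DEFAULT_TIMEOUT,
) -> Any:
    """Fetch `uri` with a streaming GET that cannot block indefinitely.

    `session` is expected to behave like a requests.Session. The timeout
    is forwarded to session.get, so an idle socket raises an exception
    instead of hanging.
    """
    response = session.get(uri, stream=True, timeout=timeout)
    response.raise_for_status()
    return response
```

With a real requests.Session, a server that accepts the connection and then goes silent should cause a requests.exceptions.RequestException subclass to be raised, either from the get call or later while the body is being read. Note that the read timeout in requests bounds the gap between received bytes, not the total download time, so a server that keeps trickling data would still need a separate overall deadline.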
I was about to report the same issue. This seems to be an easy fix. Looking forward to the next build. Thanks to the team.
It is a very easy fix. I just addressed it in a PR and hope it can be merged soon.