Could not convert stream / pdf to markdown

Open TorgeStahl opened this issue 9 months ago • 3 comments

Hey there,

I wanted to generate Markdown from a really long PDF document (roughly 100 pages). Simply printing the text works, but as soon as it should be converted to Markdown, it gives the error below. Is there a known limitation on the length of a document?

Traceback (most recent call last):
  File "/Users/user/Desktop/Repositories/markitdown/script/markdown.py", line 73, in <module>
    main()
    ~~~~^^
  File "/Users/user/Desktop/Repositories/markitdown/script/markdown.py", line 34, in main
    text = process_file(file_path)
  File "/Users/user/Desktop/Repositories/markitdown/script/markdown.py", line 19, in process_file
    result = md.convert(file_path)
  File "/Users/user/Desktop/Repositories/markitdown/packages/markitdown/src/markitdown/_markitdown.py", line 259, in convert
    return self.convert_local(source, stream_info=stream_info, **kwargs)
           ~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/Desktop/Repositories/markitdown/packages/markitdown/src/markitdown/_markitdown.py", line 310, in convert_local
    return self._convert(file_stream=fh, stream_info_guesses=guesses, **kwargs)
           ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/user/Desktop/Repositories/markitdown/packages/markitdown/src/markitdown/_markitdown.py", line 541, in _convert
    raise UnsupportedFormatException(
        f"Could not convert stream to Markdown. No converter attempted a conversion, suggesting that the filetype is simply not supported."
    )
markitdown._exceptions.UnsupportedFormatException: Could not convert stream to Markdown. No converter attempted a conversion, suggesting that the filetype is simply not supported

TorgeStahl avatar Mar 16 '25 15:03 TorgeStahl

Thanks for the report. Let's get to the bottom of this.

What version of the library are you using? Did you install it with [all] or at least [pdf]? Is this a problem with all (e.g., smaller) PDFs? Or just this one? Are you using the python library or the command line?

On my plate is to add a debug option and more python logging, to better support debugging these types of scenarios.
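In case it helps anyone answering the version and extras questions above, here is a minimal sketch that uses only the standard library. The helper names `installed_version` and `module_available` are made up for illustration, and the check for a `pdfminer` module is an assumption about what the [pdf] extra installs (pdfminer.six), not something confirmed in this thread.

```python
from importlib import metadata, util
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def module_available(module: str) -> bool:
    """Return True if a top-level module can be located without importing it."""
    return util.find_spec(module) is not None


if __name__ == "__main__":
    print("markitdown version:", installed_version("markitdown"))
    # Assumption: the [pdf] extra pulls in pdfminer.six, imported as "pdfminer".
    print("pdfminer importable:", module_available("pdfminer"))
```

If the version prints but the module check is False, the base package is probably installed without the optional PDF dependencies.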

afourney avatar Mar 16 '25 16:03 afourney

Seeing the same. Installed markitdown version 0.1.1.

using: "pip install -e packages/markitdown[all]" returns: "zsh: no matches found: packages/markitdown[all]" and similarly for [pdf] and other options.

The only install command that didn't fail was this (below), but it leads to something like OP's reported error above when used:

pip install -e packages/markitdown
Obtaining file:///users/name/localpath/somedir/markitdown/packages/markitdown
Installing build dependencies ... done
Checking if build backend supports build_editable ... done
Getting requirements to build editable ... done
Installing backend dependencies ... done
Preparing editable metadata (pyproject.toml) ... done

==================== Traceback:

/opt/anaconda3/lib/python3.12/site-packages/executing/executing.py:713: DeprecationWarning: ast.Str is deprecated and will be removed in Python 3.14; use ast.Constant instead
  right=ast.Str(s=sentinel),
/opt/anaconda3/lib/python3.12/site-packages/executing/executing.py:713: DeprecationWarning: ast.Str is deprecated and will be removed in Python 3.14; use ast.Constant instead
  right=ast.Str(s=sentinel),
/opt/anaconda3/lib/python3.12/ast.py:587: DeprecationWarning: Attribute s is deprecated and will be removed in Python 3.14; use value instead
  return Constant(*args, **kwargs)
/opt/anaconda3/lib/python3.12/site-packages/executing/executing.py:713: DeprecationWarning: ast.Str is deprecated and will be removed in Python 3.14; use ast.Constant instead
  right=ast.Str(s=sentinel),
/opt/anaconda3/lib/python3.12/ast.py:587: DeprecationWarning: Attribute s is deprecated and will be removed in Python 3.14; use value instead
  return Constant(*args, **kwargs)
/opt/anaconda3/lib/python3.12/site-packages/executing/executing.py:713: DeprecationWarning: ast.Str is deprecated and will be removed in Python 3.14; use ast.Constant instead
  right=ast.Str(s=sentinel),
/opt/anaconda3/lib/python3.12/ast.py:587: DeprecationWarning: Attribute s is deprecated and will be removed in Python 3.14; use value instead
  return Constant(*args, **kwargs)
/opt/anaconda3/lib/python3.12/site-packages/executing/executing.py:713: DeprecationWarning: ast.Str is deprecated and will be removed in Python 3.14; use ast.Constant instead
  right=ast.Str(s=sentinel),
/opt/anaconda3/lib/python3.12/ast.py:587: DeprecationWarning: Attribute s is deprecated and will be removed in Python 3.14; use value instead
  return Constant(*args, **kwargs)
---------------------------------------------------------------------------
FileConversionException                   Traceback (most recent call last)
Cell In[2], line 2
      1 md = MarkItDown()
----> 2 result = md.convert('../test_report.pdf')

File ~/some-path-to-here/markitdown/packages/markitdown/src/markitdown/_markitdown.py:273, in MarkItDown.convert(self, source, stream_info, **kwargs)
    271         return self.convert_uri(source, stream_info=stream_info, **_kwargs)
    272     else:
--> 273         return self.convert_local(source, stream_info=stream_info, **kwargs)
    274 # Path object
    275 elif isinstance(source, Path):

File ~/some-path-to-here/markitdown/packages/markitdown/src/markitdown/_markitdown.py:327, in MarkItDown.convert_local(self, path, stream_info, file_extension, url, **kwargs)
    323 with open(path, "rb") as fh:
    324     guesses = self._get_stream_info_guesses(
    325         file_stream=fh, base_guess=base_guess
    326     )
--> 327     return self._convert(file_stream=fh, stream_info_guesses=guesses, **kwargs)

File ~/some-path-to-here/markitdown/packages/markitdown/src/markitdown/_markitdown.py:613, in MarkItDown._convert(self, file_stream, stream_info_guesses, **kwargs)
    611 # If we got this far without success, report any exceptions
    612 if len(failed_attempts) > 0:
--> 613     raise FileConversionException(attempts=failed_attempts)
    615 # Nothing can handle it!
    616 raise UnsupportedFormatException(
    617     f"Could not convert stream to Markdown. No converter attempted a conversion, suggesting that the filetype is simply not supported."
    618 )

FileConversionException: File conversion failed after 1 attempts:
 - PdfConverter threw MissingDependencyException with message: PdfConverter recognized the input as a potential .pdf file, but the dependencies needed to read .pdf files have not been installed. To resolve this error, include the optional dependency [pdf] or [all] when installing MarkItDown. For example:

* pip install markitdown[pdf]
* pip install markitdown[all]
* pip install markitdown[pdf, ...]
* etc.

bdnguyen-ds avatar Apr 15 '25 09:04 bdnguyen-ds

Ignore my previous comment, it was a "me" issue. Referencing it here in case anyone runs into the same thing: zsh treats the unquoted square brackets as a glob pattern, so adding quotation marks around the target ( 'markitdown[all]' ) allowed a proper install.
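For anyone scripting the install, one way to sidestep shell globbing entirely is to pass the requirement to pip as a single list argument from Python, since no shell is involved. This is only a sketch: `pip_install_command` is a hypothetical helper, and the `shlex.quote` call is there just to show what a shell-safe spelling of the same target looks like.

```python
import shlex
import sys
from typing import List


def pip_install_command(requirement: str, editable: bool = False) -> List[str]:
    """Build a pip command as an argument list, so no shell expands the brackets."""
    cmd = [sys.executable, "-m", "pip", "install"]
    if editable:
        cmd.append("-e")
    cmd.append(requirement)
    return cmd


if __name__ == "__main__":
    cmd = pip_install_command("packages/markitdown[all]", editable=True)
    # Shows the quoted form to type in a shell such as zsh.
    print(shlex.quote(cmd[-1]))
    # To actually run the install (not executed here):
    #   import subprocess; subprocess.run(cmd, check=True)
```

The printed value is the target wrapped in single quotes, which matches the fix described above.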

bdnguyen-ds avatar Apr 15 '25 10:04 bdnguyen-ds