thepipe
Extract clean markdown from PDFs, URLs, Word docs, slides, videos, and more, ready for any LLM. ⚡
If Tesseract OCR is not installed correctly, image extraction with text_only=True will yield `tesseract is not installed or it's not in your PATH. See README file for more information.`. This...
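A quick way to rule out the PATH problem before running an extraction is to check for the binary yourself. This is a minimal sketch using only the Python standard library; `tesseract_available` is a hypothetical helper, not part of thepipe, and it only tests whether an executable named `tesseract` can be found on PATH, not whether the install actually works.

```python
import shutil


def tesseract_available(binary: str = "tesseract") -> bool:
    """Return True if an executable with this name is found on PATH.

    Hypothetical helper for illustration; it does not run Tesseract,
    so a broken install that is still on PATH will also return True.
    """
    return shutil.which(binary) is not None


if __name__ == "__main__":
    if tesseract_available():
        print("tesseract found at:", shutil.which("tesseract"))
    else:
        print("tesseract is not on PATH; install it or fix PATH before using OCR.")
```

If the check fails, the error quoted above is the expected outcome of any OCR-dependent step.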
I'm running thepipe locally to extract some page URLs for processing with GPT4o, and it seems that the image generated for each page only captures the content above the fold...
I was wondering whether it is possible to extract all images from a document and reference them at their position in the generated markdown? As I understand the documentation it...
The result part of the app seems to be having a KeyError issue. I've been processing the same files and just started to have this issue today.
When processing a .PPT file through the pipe (I'm using a local installation), if the .PPT file has a transparent image, the following error gets thrown: `error":"/usr/local/lib/python3.11/site-packages/PIL/Image.py:1056: UserWarning: Palette images...
This builds on the globbing branch PR I submitted earlier at https://github.com/emcf/thepipe/pull/26. It replaces pytube and broadly expands the number of websites supported for automatically scraping videos. I also attach...
This version adds glob-based file filtering within the directories you scrape. These changes are already included in the forthcoming yt-dlp pull request as well....
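For readers unfamiliar with glob-based filtering, the general idea can be sketched with the standard library. This is not the code from the pull request; `filter_paths` and its `include`/`exclude` parameters are hypothetical names chosen for illustration, and the sketch assumes paths are given as forward-slash strings. Note that with `fnmatch`, `*` also matches `/`, so `*.py` matches files in any subdirectory.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List


def filter_paths(paths: Iterable[str],
                 include: Iterable[str],
                 exclude: Iterable[str] = ()) -> List[str]:
    """Keep paths matching any include pattern and no exclude pattern.

    Hypothetical helper: patterns use shell-style globs (*, ?, [seq])
    and matching is case-sensitive.
    """
    include = list(include)
    exclude = list(exclude)
    kept = []
    for path in paths:
        if not any(fnmatchcase(path, pat) for pat in include):
            continue  # not selected by any include pattern
        if any(fnmatchcase(path, pat) for pat in exclude):
            continue  # explicitly excluded
        kept.append(path)
    return kept


if __name__ == "__main__":
    files = ["docs/a.md", "src/main.py", "src/test_main.py", "README.md"]
    print(filter_paths(files, include=["*.py"], exclude=["*test_*"]))
```

Applying the exclude patterns after the include patterns means an exclusion always wins, which is the safer default when scraping a directory.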
Running in the Python interpreter, I don't see any output. Is the converted file saved somewhere? Why does the README not mention it?
I am trying to perform extraction on a PDF file. I am able to scrape the file using the tool, but when trying to extract the information I am getting...