thepipe issues

Pytesseract error when text_only is True within GitHub Action

If Tesseract OCR is not installed correctly, image extraction with text_only=True will yield `tesseract is not installed or it's not in your PATH. See README file for more information.`. This...

emcf

bug
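The error quoted above says the `tesseract` executable cannot be found on PATH. Here is a minimal sketch of how to check that before running an OCR step, for example at the start of a CI job. It uses only the Python standard library; the helper names `find_tesseract` and `describe_tesseract` are hypothetical and are not part of thepipe or pytesseract.

```python
import shutil


def find_tesseract(binary_name: str = "tesseract"):
    """Return the full path of the executable if it is on PATH, else None."""
    return shutil.which(binary_name)


def describe_tesseract(binary_name: str = "tesseract") -> str:
    """Build a short human-readable status line for logs."""
    path = find_tesseract(binary_name)
    if path is None:
        return f"{binary_name} not found on PATH"
    return f"{binary_name} found at {path}"


if __name__ == "__main__":
    # Print the status so a CI log shows whether the binary is visible.
    print(describe_tesseract())
```

If the check reports that the binary is missing, the likely fix is to install the Tesseract system package on the runner. Installing the Python wrapper alone does not provide the executable.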

Full-page screenshot when extracting page URL

4

I'm running thepipe locally to extract some page URLs for processing with GPT-4o, and it seems that the image generated for each page only captures the content above the fold...

michael-supreme

Extracting and referencing images

I was wondering whether it is possible to extract all images from a document and reference them at their position in the generated markdown? As I understand the documentation it...

RichardSieg

Issue with Extract API

The result part of the app seems to be having a KeyError issue. I've been processing the same files and just started to have this issue today.

TheRoundEyes

Issues with transparent PNGs when processing .PPT files

1

When processing a .PPT file through the pipe (I'm using a local installation), if the .PPT file has a transparent image, the following error gets thrown: `error":"/usr/local/lib/python3.11/site-packages/PIL/Image.py:1056: UserWarning: Palette images...

michael-supreme

Switched video backend to use yt-dlp

This builds on the globbing branch PR I submitted earlier at https://github.com/emcf/thepipe/pull/26. It replaces pytube and broadly expands the number of websites supported for automatically scraping videos. I also attach...

skyler14

Globbing based scraping

1

This version adds globbing-based file filtering within the directories you scrape. These changes are already included in the forthcoming yt-dlp pull request as well....

skyler14
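As a rough illustration of what glob-based file filtering can look like, here is a minimal sketch using the standard-library `fnmatch` module. It is not the code from the pull request; the function `filter_paths` and its `include` and `exclude` parameters are hypothetical names chosen for this example.

```python
from fnmatch import fnmatchcase


def filter_paths(paths, include="*", exclude=None):
    """Keep paths that match `include` and do not match `exclude`.

    Patterns use shell-style wildcards. With fnmatch, `*` also matches
    path separators, so "*.py" matches "src/main.py".
    """
    kept = []
    for path in paths:
        if not fnmatchcase(path, include):
            continue  # does not match the inclusion pattern
        if exclude is not None and fnmatchcase(path, exclude):
            continue  # explicitly excluded
        kept.append(path)
    return kept


if __name__ == "__main__":
    files = ["docs/a.md", "src/main.py", "src/test_main.py"]
    print(filter_paths(files, include="*.py", exclude="*test_*"))
```

`fnmatchcase` is used rather than `fnmatch` so that matching stays case-sensitive on every platform.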

output file/folder?

1

Running in the Python interpreter, I don't see any output. Is the converted file saved somewhere? Why does the README not mention it?

gilbrotheraway

TypeError: scrape_file() got an unexpected keyword argument 'ai_extraction'

2

gilbrotheraway

Unterminated string in JSON at position 16384 (line 1 column 16385)

2

I am trying to perform extraction on a pdf file. I am able to scrape the file using the tool but when trying to extract the information I am getting...

BC-Naman
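An "unterminated string" parse error typically appears when a JSON document is cut off partway through a string value. Position 16384 equals 16 × 1024, which may hint at truncation at a size limit, though that is a guess and the issue does not confirm it. The sketch below reproduces the same class of error in Python with the standard `json` module. The helper `try_parse` is a hypothetical name, and the exact error wording differs from the message in the issue title, which has the format of a JavaScript parser message.

```python
import json


def try_parse(text):
    """Return (value, None) on success or (None, error) on a JSON syntax error."""
    try:
        return json.loads(text), None
    except json.JSONDecodeError as exc:
        return None, exc


# A complete payload parses normally.
payload = json.dumps({"text": "x" * 100})

# Cutting the payload in the middle of the string value makes it invalid.
truncated = payload[:50]

if __name__ == "__main__":
    value, error = try_parse(truncated)
    print("parsed:", value)
    print("error:", error)
```

Comparing the length of the received text with the expected length is a quick way to tell truncation apart from genuinely malformed output.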
