
`ai_extraction=True` not working locally

Open sisyga opened this issue 2 months ago • 2 comments

Hi! I'm not sure if this is a bug or a feature, but I'd love to use the `ai_extraction` option to improve the handling of PDF documents. However, enabling this option overrides the `local=True` option.

MWE:

```python
from thepipe.thepipe_api import thepipe

source = 'example.pdf'
messages = thepipe.extract(source, local=True, verbose=True, ai_extraction=True)
```

This throws the error: `Failed to extract from example.pdf: No valid API key given. Visit https://thepi.pe/docs to learn more.`
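Judging from the error, `ai_extraction=True` seems to always route the request to the hosted API, regardless of `local=True`. The only way I found past the error is to supply a key before calling `extract`. This is a hypothetical sketch: I am assuming the key is picked up from a `THEPIPE_API_KEY` environment variable (based on the constant of the same name in the extractor code below), which I have not confirmed in the docs.

```python
import os

# Assumption: the library reads the key from the THEPIPE_API_KEY
# environment variable. Set it before importing/calling thepipe,
# then call thepipe.extract(...) as in the MWE above.
os.environ["THEPIPE_API_KEY"] = "<your-api-key>"
```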

It works without `ai_extraction` enabled, but then every page is added as an image to the messages, which massively increases the token count for longer PDFs. As a workaround, I adapted the `extract_pdf` function to extract PDF pages as images only if the page contains an image. It would be great to have this as an option. (I know this approach is not optimal: it misses tables and figures composed only of SVG objects; maybe a better option is possible based only on the fitz library, but I am no expert in this package.)

```python
import json
from typing import List, Optional

import requests
from PIL import Image
# API_URL, THEPIPE_API_KEY, Chunk, SourceTypes, and
# create_chunks_from_messages come from the thepipe module itself.

def extract_pdf(file_path: str, ai_extraction: bool = False, text_only: bool = False, verbose: bool = False, limit: Optional[int] = None) -> List[Chunk]:
    chunks = []
    if ai_extraction:
        # Delegate extraction to the hosted API
        with open(file_path, "rb") as f:
            response = requests.post(
                url=API_URL,
                files={'file': (file_path, f)},
                data={'api_key': THEPIPE_API_KEY, 'ai_extraction': ai_extraction, 'text_only': text_only, 'limit': limit}
            )
        try:
            response_json = response.json()
        except json.JSONDecodeError:
            raise ValueError("Our backend likely couldn't handle this request. This can happen with large content such as videos, streams, or very large files/websites.")
        if 'error' in response_json:
            raise ValueError(f"{response_json['error']}")
        messages = response_json['messages']
        chunks = create_chunks_from_messages(messages)
    else:
        import fitz
        # Extract the text of each page; render a page to an image
        # only if it actually embeds one
        doc = fitz.open(file_path)
        for page in doc:
            text = page.get_text()
            image_list = page.get_images(full=True)
            if not text_only and image_list:
                # The page contains at least one raster image, so keep a rendering of it
                pix = page.get_pixmap()
                img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
                chunks.append(Chunk(path=file_path, text=text, image=img, source_type=SourceTypes.PDF))
            else:
                chunks.append(Chunk(path=file_path, text=text, image=None, source_type=SourceTypes.PDF))
        doc.close()
    return chunks
```
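For completeness, here is a rough sketch of what a better page-level check might look like using only fitz, also counting vector drawings and detected tables. Both `page.get_drawings()` and `page.find_tables()` exist in recent PyMuPDF versions (`find_tables` since 1.23, I believe), but this heuristic is untested on real documents and `get_drawings()` may over-trigger on decorative rules such as underlines:

```python
def page_needs_rasterizing(page) -> bool:
    """Heuristic: return True if `page` (assumed to be a fitz.Page)
    carries visual content that plain text extraction would lose:
    embedded raster images, vector drawings, or detected tables."""
    if page.get_images(full=True):  # embedded raster images
        return True
    if page.get_drawings():  # vector graphics (covers SVG-only figures)
        return True
    # find_tables() was added in PyMuPDF 1.23; guard for older versions
    finder = getattr(page, "find_tables", None)
    if finder is not None and finder().tables:
        return True
    return False
```

In `extract_pdf`, the `elif image_list:` condition would then become `elif page_needs_rasterizing(page):`, so a page with only a vector figure or a table would still be rendered.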

sisyga · Apr 26 '24 13:04