`ai_extraction=True` not working locally
Hi! Not sure if this is a bug or a feature, but I'd love to use the `ai_extraction` option to improve the handling of PDF documents. However, enabling this option seems to override `local=True` and route the request through the hosted API.
MWE:

```python
from thepipe.thepipe_api import thepipe

source = 'example.pdf'
messages = thepipe.extract(source, local=True, verbose=True, ai_extraction=True)
```
This throws the error:

```
Failed to extract from example.pdf: No valid API key given. Visit https://thepi.pe/docs to learn more.
```
Extraction works without `ai_extraction`, but then every page is added as an image to the messages, which massively increases the token count for longer PDFs.

As a workaround, I adapted the `extract_pdf` function so that it only extracts a PDF page as an image if the page actually contains an image. It would be great to have this as an option. (I know this approach is not optimal: it misses tables and pages whose visuals consist only of SVG objects. Maybe a better heuristic is possible based purely on the `fitz` library, but I am no expert in this package.)
```python
def extract_pdf(file_path: str, ai_extraction: bool = False, text_only: bool = False, verbose: bool = False, limit: int = None) -> List[Chunk]:
    chunks = []
    if ai_extraction:
        with open(file_path, "rb") as f:
            response = requests.post(
                url=API_URL,
                files={'file': (file_path, f)},
                data={'api_key': THEPIPE_API_KEY, 'ai_extraction': ai_extraction, 'text_only': text_only, 'limit': limit}
            )
        try:
            response_json = response.json()
        except json.JSONDecodeError:
            raise ValueError(f"Our backend likely couldn't handle this request. This can happen with large content such as videos, streams, or very large files/websites. Re")
        if 'error' in response_json:
            raise ValueError(f"{response_json['error']}")
        messages = response_json['messages']
        chunks = create_chunks_from_messages(messages)
    else:
        import fitz
        # extract text and images of each page from the PDF
        with open(file_path, 'rb') as file:
            doc = fitz.open(file_path)
            for page in doc:
                text = page.get_text()
                image_list = page.get_images(full=True)
                if text_only:
                    chunks.append(Chunk(path=file_path, text=text, image=None, source_type=SourceTypes.PDF))
                elif image_list:
                    # only rasterize the page if it embeds at least one image
                    pix = page.get_pixmap()
                    img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
                    chunks.append(Chunk(path=file_path, text=text, image=img, source_type=SourceTypes.PDF))
                else:
                    chunks.append(Chunk(path=file_path, text=text, image=None, source_type=SourceTypes.PDF))
            doc.close()
    return chunks
```