haystack
Add Flexible Conversion Parameters to PDF Converters
Feature Request
Currently, our library's PDF converters only support static, predefined conversion parameters. This limitation makes it difficult to adapt to varied use cases where dynamic parameters (like specific page numbers to convert) are necessary. I propose we add a feature that allows users to pass dynamic conversion parameters as a dictionary to our PDF converters.
Proposed Solution
Introduce a system for flexible conversion parameters for PDF converters. This can be achieved by accepting a `conversion_params` dictionary in the `PyPDFToDocument` constructor and passing these parameters to the converter via `**kwargs`.
Benefits
- Increased Flexibility: Allows users to specify dynamic conversion parameters, such as `start_page`, `end_page`, and other converter-specific options.
- Ease of Extension: Adding new conversion parameters becomes trivial, without needing modifications in method or class signatures.
- Compatibility: Maintains compatibility with existing converters that do not require additional parameters.
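To illustrate the forwarding idea and the compatibility point, here is a minimal sketch that uses plain Python stand-ins rather than the real Haystack classes. `KwargsConverter`, `StrictConverter`, and `run_conversion` are hypothetical names, and pages are modeled as a list of strings. It suggests that existing converters stay compatible as long as `conversion_params` is empty or their `convert` method accepts `**kwargs`.

```python
from typing import Any, Dict, List, Optional


class KwargsConverter:
    """Hypothetical converter that reads optional page bounds from **kwargs."""

    def convert(self, pages: List[str], **kwargs: Any) -> str:
        start_page = kwargs.get("start_page", 0)
        end_page = kwargs.get("end_page", len(pages) - 1)  # inclusive bound
        return "\n".join(pages[start_page : end_page + 1])


class StrictConverter:
    """Hypothetical pre-existing converter that declares no extra parameters."""

    def convert(self, pages: List[str]) -> str:
        return "\n".join(pages)


def run_conversion(converter: Any, pages: List[str], conversion_params: Optional[Dict[str, Any]] = None) -> str:
    # Forward the dictionary as keyword arguments; an empty dict adds nothing.
    return converter.convert(pages, **(conversion_params or {}))


pages = ["page 0", "page 1", "page 2", "page 3"]

selected = run_conversion(KwargsConverter(), pages, {"start_page": 1, "end_page": 2})
unchanged = run_conversion(StrictConverter(), pages)

# A converter without **kwargs rejects parameters it does not declare.
try:
    run_conversion(StrictConverter(), pages, {"start_page": 1})
    strict_rejects_params = False
except TypeError:
    strict_rejects_params = True
```

In this sketch, passing parameters to a converter whose signature lacks `**kwargs` raises a `TypeError`, so the compatibility benefit appears to hold only when no parameters are supplied to such converters.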
Example Usage
```python
conversion_params = {"start_page": 1, "end_page": 10}
converter = PyPDFToDocument(converter_name="custom", conversion_params=conversion_params)
```
```python
from haystack import Document
from pypdf import PdfReader


class CustomConverter:
    """
    Custom converter that extracts the text from the pages of a PdfReader object and returns a Document,
    taking into account additional parameters such as start_page and end_page.
    """

    def convert(self, reader: "PdfReader", **kwargs) -> Document:
        # Read start_page and end_page from kwargs, falling back to defaults
        start_page = kwargs.get("start_page", 0)
        end_page = kwargs.get("end_page", len(reader.pages) - 1)
        text_with_pages = ""
        page_starts = []
        current_length = 0

        # If end_page is -1 or past the last page, process through the end of the document
        if end_page == -1 or end_page >= len(reader.pages):
            end_page = len(reader.pages) - 1

        for page in reader.pages[start_page : end_page + 1]:
            page_text = page.extract_text()
            if page_text:
                # Record a page-start offset unless this is the first page with text
                if current_length > 0:
                    page_starts.append(current_length)
                text_with_pages += f"{page_text}\n"
                current_length = len(text_with_pages)

        # Add the title and subject from the PDF metadata, if available
        title = self.add_title(reader)
        subject = self.add_subject(reader)
        return Document(content=text_with_pages, meta={"page_starts": page_starts, "title": title, "subject": subject})

    def add_title(self, reader: "PdfReader"):
        # Extract the title from the PDF metadata (reader.metadata can be None)
        metadata = reader.metadata or {}
        return metadata.get("/Title", "")

    def add_subject(self, reader: "PdfReader"):
        # Extract the subject from the PDF metadata (reader.metadata can be None)
        metadata = reader.metadata or {}
        return metadata.get("/Subject", "")
```
````python
# This registry is used to store converter names and instances.
# It can be used to register custom converters.
CONVERTERS_REGISTRY: Dict[str, PyPDFConverter] = {"default": DefaultConverter(), "custom": CustomConverter()}


@component
class PyPDFToDocument:
    """
    Converts PDF files to Document objects.

    It uses a converter that follows the PyPDFConverter protocol to perform the conversion.
    A default text extraction converter is used if no custom converter is provided.

    Usage example:
    ```python
    from datetime import datetime

    from haystack.components.converters.pypdf import PyPDFToDocument

    converter = PyPDFToDocument()
    results = converter.run(sources=["sample.pdf"], meta={"date_added": datetime.now().isoformat()})
    documents = results["documents"]
    print(documents[0].content)
    # 'This is a text from the PDF file.'
    ```
    """

    def __init__(self, converter_name: str = "default", conversion_params: Optional[Dict[str, Any]] = None):
        """
        Initializes the PyPDFToDocument component with an optional custom converter.

        :param converter_name: A converter name that is registered in the CONVERTERS_REGISTRY.
            Defaults to 'default'.
        :param conversion_params: Optional keyword arguments forwarded to the converter's `convert` method.
            Defaults to `None`, which is treated as an empty dictionary.
        """
        pypdf_import.check()

        try:
            converter = CONVERTERS_REGISTRY[converter_name]
        except KeyError as e:
            msg = (
                f"Invalid converter_name: {converter_name}.\n Available converters: {list(CONVERTERS_REGISTRY.keys())}"
            )
            raise ValueError(msg) from e
        self.converter_name = converter_name
        self._converter: PyPDFConverter = converter
        self.conversion_params = conversion_params or {}

    def to_dict(self):
        # do not serialize the _converter instance; keep conversion_params so the component can be rebuilt
        return default_to_dict(self, converter_name=self.converter_name, conversion_params=self.conversion_params)

    @component.output_types(documents=List[Document])
    def run(self, sources: List[Union[str, Path, ByteStream]], meta: Optional[List[Dict[str, Any]]] = None):
        """
        Converts a list of PDF sources into Document objects using the configured converter.

        :param sources: A list of PDF data sources, which can be file paths or ByteStream objects.
        :param meta: Optional metadata to attach to the Documents.
            This value can be either a list of dictionaries or a single dictionary.
            If it's a single dictionary, its content is added to the metadata of all produced Documents.
            If it's a list, the length of the list must match the number of sources, because the two lists will be zipped.
            Defaults to `None`.
        :return: A dictionary containing a list of Document objects under the 'documents' key.
        """
        documents = []
        meta_list = normalize_metadata(meta, sources_count=len(sources))

        for source, metadata in zip(sources, meta_list):
            try:
                bytestream = get_bytestream_from_source(source)
            except Exception as e:
                logger.warning("Could not read %s. Skipping it. Error: %s", source, e)
                continue
            try:
                pdf_reader = PdfReader(io.BytesIO(bytestream.data))
                # Forward the user-supplied parameters to the converter
                document = self._converter.convert(pdf_reader, **self.conversion_params)
            except Exception as e:
                logger.warning("Could not read %s and convert it to Document, skipping. %s", source, e)
                continue

            merged_metadata = {**bytestream.meta, **metadata, **document.meta}  # War
            document.meta = merged_metadata
            documents.append(document)

        return {"documents": documents}
````
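Since `conversion_params` is part of the constructor state, a serialized component presumably needs to carry it as well in order to be rebuilt with the same configuration. The following is only a sketch of that round trip with a plain Python stand-in: `SketchComponent` and its dictionary layout are assumptions for illustration and do not use Haystack's serialization helpers.

```python
from typing import Any, Dict, Optional


class SketchComponent:
    """Hypothetical stand-in for a component configured by a name plus parameters."""

    def __init__(self, converter_name: str = "default", conversion_params: Optional[Dict[str, Any]] = None):
        self.converter_name = converter_name
        self.conversion_params = conversion_params or {}

    def to_dict(self) -> Dict[str, Any]:
        # Include every constructor argument needed to rebuild the same configuration.
        return {
            "init_parameters": {
                "converter_name": self.converter_name,
                "conversion_params": dict(self.conversion_params),  # copy, not a shared reference
            }
        }

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "SketchComponent":
        return cls(**data["init_parameters"])


original = SketchComponent("custom", {"start_page": 1, "end_page": 10})
restored = SketchComponent.from_dict(original.to_dict())
```

If the parameters were left out of the serialized dictionary, the restored object would fall back to an empty `conversion_params` and convert every page.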
Thanks @warichet for the detailed feature request!
I think we should definitely implement this one, but as we're finalising 2.0.0 I'm not sure we can prioritise this feature at this very moment. I'm adding the "contributions wanted" label, in case someone wants to give it a try without waiting for us.
Thanks! If you need help, don't hesitate to ask.
@masci I'd like to work on this. I can create a draft PR soon for early feedback.
After #7361 and #7362, a custom converter that allows specifying a start page and an end page can be defined as follows:
```python
from typing import Optional

from pypdf import PdfReader

from haystack import Document, default_from_dict, default_to_dict
from haystack.components.converters.pypdf import PyPDFToDocument


class ConverterWithPages:
    def __init__(self, start_page: Optional[int] = None, end_page: Optional[int] = None):
        self.start_page = start_page or 0
        self.end_page = end_page
        # end_page is inclusive; use None (not -1) so the last page is kept when no end_page is given
        self._upper_bound = end_page + 1 if end_page is not None else None

    def convert(self, reader: "PdfReader") -> Document:
        text_pages = []
        for page in reader.pages[self.start_page : self._upper_bound]:
            text_pages.append(page.extract_text())
        # Join pages with a form feed so page boundaries stay recoverable
        text = "\f".join(text_pages)
        return Document(content=text)

    def to_dict(self):
        """Serialize the converter to a dictionary."""
        return default_to_dict(self, start_page=self.start_page, end_page=self.end_page)

    @classmethod
    def from_dict(cls, data):
        """Deserialize the converter from a dictionary."""
        return default_from_dict(cls, data)


pypdf_converter = PyPDFToDocument(converter=ConverterWithPages(start_page=0, end_page=2))

res = pypdf_converter.run(sources=["/home/anakin87/apps/haystack/test/test_files/pdf/react_paper.pdf"])
print(res)
print(res["documents"][0].content)
```
The solution is not entirely straightforward, but not too complex either. I am closing this issue for now. If more requests in this direction come in the future, we can improve or modify the component.
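For anyone adapting the page-range converter above, the mapping from an inclusive `end_page` to a slice upper bound is worth checking in isolation. This is a small sketch on a plain list, with `select_pages` as a hypothetical helper name. It assumes that a missing `end_page` should mean "through the last page", which a slice upper bound of `None` gives and `-1` does not.

```python
from typing import List, Optional


def select_pages(pages: List[str], start_page: Optional[int] = None, end_page: Optional[int] = None) -> List[str]:
    """Return pages from start_page to end_page (inclusive); None means no bound."""
    lower = start_page or 0
    upper = end_page + 1 if end_page is not None else None
    return pages[lower:upper]


pages = ["p0", "p1", "p2", "p3"]

first_three = select_pages(pages, start_page=0, end_page=2)
all_pages = select_pages(pages)
from_second = select_pages(pages, start_page=1)

# For comparison: an upper bound of -1 stops before the last element.
without_last = pages[0:-1]
```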