haystack Add Flexible Conversion Parameters to PDF Converters

Feature Request

Currently, our library's PDF converters only support static, predefined conversion parameters. This limitation makes it difficult to adapt to varied use cases where dynamic parameters (like specific page numbers to convert) are necessary. I propose we add a feature that allows users to pass dynamic conversion parameters as a dictionary to our PDF converters.

Proposed Solution

Introduce a system for flexible conversion parameters for PDF converters. This can be achieved by accepting a conversion_params dictionary in the PyPDFToDocument constructor and passing these parameters to the converter via **kwargs.
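To make the forwarding mechanism concrete, here is a minimal sketch that uses only the standard library. `PageRangeConverter` and `PdfComponent` are hypothetical stand-ins (not Haystack classes), and a list of strings plays the role of the PDF pages; the only point is to show a stored dictionary being unpacked into `convert` via `**kwargs`.

```python
from typing import Any, Dict, List, Optional


class PageRangeConverter:
    """Hypothetical converter: joins the text of pages start_page..end_page (inclusive)."""

    def convert(self, pages: List[str], **kwargs: Any) -> str:
        # Fall back to the whole document when a bound is not supplied.
        start_page = kwargs.get("start_page", 0)
        end_page = kwargs.get("end_page", len(pages) - 1)
        return "\n".join(pages[start_page : end_page + 1])


class PdfComponent:
    """Hypothetical stand-in for the component: stores the params and forwards them."""

    def __init__(self, converter: PageRangeConverter, conversion_params: Optional[Dict[str, Any]] = None):
        self._converter = converter
        # Copy so that later changes to the caller's dict do not leak in.
        self.conversion_params = dict(conversion_params or {})

    def run(self, pages: List[str]) -> str:
        # The stored dictionary becomes keyword arguments of convert().
        return self._converter.convert(pages, **self.conversion_params)


pages = ["page 0", "page 1", "page 2", "page 3"]
component = PdfComponent(PageRangeConverter(), conversion_params={"start_page": 1, "end_page": 2})
result = component.run(pages)
print(result)
```

With this shape, a new option only needs to be read inside the converter; the component's signature stays unchanged.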

Benefits

Increased Flexibility: Allows users to specify dynamic conversion parameters, such as start_page, end_page, and other converter-specific options.
Ease of Extension: Adding new conversion parameters becomes trivial, without needing modifications in method or class signatures.
Compatibility: Maintains compatibility with existing converters that do not require additional parameters.
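The compatibility point can be checked with plain Python semantics. In the sketch below, `LegacyConverter` is a hypothetical converter whose `convert` takes no extra arguments: unpacking an empty dictionary adds nothing to the call, so it keeps working, but it will raise `TypeError` as soon as real parameters are forwarded to it.

```python
class LegacyConverter:
    """Hypothetical converter written before conversion parameters existed."""

    def convert(self, reader):
        return f"converted {reader}"


legacy = LegacyConverter()

# Unpacking an empty dict passes no keyword arguments, so the call succeeds.
no_params = {}
legacy_result = legacy.convert("doc.pdf", **no_params)

# Forwarding real parameters to a converter that does not accept them fails.
try:
    legacy.convert("doc.pdf", **{"start_page": 1})
    rejected = False
except TypeError:
    rejected = True
```

So the default (no parameters) is safe for every converter, while non-empty parameters should only be paired with converters that accept `**kwargs` or the matching named arguments.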

Example Usage

conversion_params = {"start_page": 1, "end_page": 10}
converter = PyPDFToDocument(converter_name="custom", conversion_params=conversion_params)

`class CustomConverter:
    """
    Custom converter that extracts the text from the pages of a PdfReader object and returns a Document,
    taking into account additional parameters such as start_page and end_page.
    """
    def convert(self, reader: "PdfReader", **kwargs) -> Document:
        # Read start_page and end_page from kwargs, falling back to defaults
        start_page = kwargs.get('start_page', 0)
        end_page = kwargs.get('end_page', len(reader.pages) - 1)

        text_with_pages = ""
        page_starts = []
        current_length = 0

        # If end_page is -1 or lies past the last page, process through the end of the document
        if end_page == -1 or end_page >= len(reader.pages):
            end_page = len(reader.pages) - 1

        for page_num, page in enumerate(reader.pages[start_page:end_page + 1], start=start_page):
            page_text = page.extract_text()
            if page_text:
                # Record a page-start offset unless this is the first page with text
                if current_length > 0:
                    page_starts.append(current_length)
                text_with_pages += f"{page_text}\n"
                current_length = len(text_with_pages)

        # Add the title and subject from the PDF metadata, if available
        title = self.add_title(reader)
        subject = self.add_subject(reader)

        return Document(content=text_with_pages, meta={"page_starts": page_starts, 'title': title, 'subject': subject})

    def add_title(self, reader: "PdfReader"):
        # Extract the title from the PDF metadata (reader.metadata may be None)
        metadata = reader.metadata or {}
        title = metadata.get('/Title', '')
        return title

    def add_subject(self, reader: "PdfReader"):
        # Extract the subject from the PDF metadata (reader.metadata may be None)
        metadata = reader.metadata or {}
        subject = metadata.get('/Subject', '')
        return subject`

`# This registry is used to store converters names and instances.
# It can be used to register custom converters.
CONVERTERS_REGISTRY: Dict[str, PyPDFConverter] = {"default": DefaultConverter(), "custom": CustomConverter()}


@component
class PyPDFToDocument:
    """
    Converts PDF files to Document objects.
    It uses a converter that follows the PyPDFConverter protocol to perform the conversion.
    A default text extraction converter is used if no custom converter is provided.

    Usage example:
    ```python
    from haystack.components.converters.pypdf import PyPDFToDocument

    converter = PyPDFToDocument()
    results = converter.run(sources=["sample.pdf"], meta={"date_added": datetime.now().isoformat()})
    documents = results["documents"]
    print(documents[0].content)
    # 'This is a text from the PDF file.'
    ```
    """

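    def __init__(self, converter_name: str = "default", conversion_params: Optional[Dict[str, Any]] = None):
        """
        Initializes the PyPDFToDocument component with an optional custom converter.
        :param converter_name: A converter name that is registered in the CONVERTERS_REGISTRY.
            Defaults to 'default'.
        :param conversion_params: Optional keyword arguments forwarded to the converter's `convert` method,
            for example `{"start_page": 1, "end_page": 10}`. Defaults to `None` (no extra arguments).
        """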
        pypdf_import.check()

        try:
            converter = CONVERTERS_REGISTRY[converter_name]
        except KeyError as e:
            msg = (
                f"Invalid converter_name: {converter_name}.\n Available converters: {list(CONVERTERS_REGISTRY.keys())}"
            )
            raise ValueError(msg) from e
        self.converter_name = converter_name
        self._converter: PyPDFConverter = converter
        self.conversion_params = conversion_params or {}

    def to_dict(self):
        # do not serialize the _converter instance; keep conversion_params so they survive a round trip
        return default_to_dict(self, converter_name=self.converter_name, conversion_params=self.conversion_params)

    @component.output_types(documents=List[Document])
    def run(self, sources: List[Union[str, Path, ByteStream]], meta: Optional[List[Dict[str, Any]]] = None):
        """
        Converts a list of PDF sources into Document objects using the configured converter.

        :param sources: A list of PDF data sources, which can be file paths or ByteStream objects.
        :param meta: Optional metadata to attach to the Documents.
          This value can be either a list of dictionaries or a single dictionary.
          If it's a single dictionary, its content is added to the metadata of all produced Documents.
          If it's a list, the length of the list must match the number of sources, because the two lists will be zipped.
          Defaults to `None`.
        :return: A dictionary containing a list of Document objects under the 'documents' key.
        """
        documents = []
        meta_list = normalize_metadata(meta, sources_count=len(sources))

        for source, metadata in zip(sources, meta_list):
            try:
                bytestream = get_bytestream_from_source(source)
            except Exception as e:
                logger.warning("Could not read %s. Skipping it. Error: %s", source, e)
                continue
            try:
                pdf_reader = PdfReader(io.BytesIO(bytestream.data))
                document = self._converter.convert(pdf_reader, **self.conversion_params)

            except Exception as e:
                logger.warning("Could not read %s and convert it to Document, skipping. %s", source, e)
                continue

            merged_metadata = {**bytestream.meta, **metadata, **document.meta}
            document.meta = merged_metadata
            documents.append(document)
        return {"documents": documents}
`

Feb 02 '24 14:02 warichet

Thanks @warichet for the detailed feature request!

I think we should definitely implement this one, but as we're finalising 2.0.0 I'm not sure we can prioritise this feature at this very moment. I'm adding the "contributions wanted" label, in case someone wants to give it a try without waiting for us.

Feb 05 '24 08:02 masci

Thanks! If you need help, don't hesitate to ask.

Feb 05 '24 09:02 warichet

@masci I'd like to work on this. I can create a draft PR soon for early feedback.

Feb 27 '24 12:02 mohitlal31

After #7361 and #7362, a custom converter that lets you specify a start page and an end page can be defined as follows:

from typing import Optional

from pypdf import PdfReader
from haystack import Document, default_from_dict, default_to_dict
from haystack.components.converters.pypdf import PyPDFToDocument

class ConverterWithPages:
    def __init__(self, start_page: Optional[int] = None, end_page: Optional[int] = None):
        self.start_page = start_page or 0
        self.end_page = end_page

        # None as the upper slice bound means "through the last page"; -1 would drop the last page
        self._upper_bound = end_page + 1 if end_page is not None else None

    def convert(self, reader: "PdfReader") -> Document:
        text_pages = []
        for page in reader.pages[self.start_page : self._upper_bound]:
            text_pages.append(page.extract_text())
        text = "\f".join(text_pages)
        return Document(content=text)

    def to_dict(self):
        """Serialize the converter to a dictionary."""
        return default_to_dict(self, start_page=self.start_page, end_page=self.end_page)

    @classmethod
    def from_dict(cls, data):
        """Deserialize the converter from a dictionary."""
        return default_from_dict(cls, data)


pypdf_converter = PyPDFToDocument(converter=ConverterWithPages(start_page=0, end_page=2))
res = pypdf_converter.run(sources=["/home/anakin87/apps/haystack/test/test_files/pdf/react_paper.pdf"])

print(res)
print(res["documents"][0].content)

The solution is not entirely straightforward, but not too complex either. I am closing this issue for now. If more requests in this direction come in the future, we can improve or modify the component.
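A note on converting an inclusive `end_page` into a slice bound: in Python, an upper bound of `None` runs through the last element, whereas `-1` stops just before it. The sketch below illustrates this with a plain list standing in for the PDF pages; `upper_bound` is a hypothetical helper, not part of Haystack or pypdf.

```python
from typing import Optional

pages = ["p0", "p1", "p2", "p3"]


def upper_bound(end_page: Optional[int]) -> Optional[int]:
    """Turn an inclusive end page into an exclusive slice bound (None = to the end)."""
    return end_page + 1 if end_page is not None else None


all_pages = pages[0 : upper_bound(None)]   # no end page given: every page
first_three = pages[0 : upper_bound(2)]    # pages 0, 1 and 2
dropped_last = pages[0:-1]                 # -1 as a bound silently omits the last page
```

Using `None` for the open-ended case avoids losing the final page when no `end_page` is supplied.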

May 08 '24 11:05 anakin87
