
add langchain document support

Open priamai opened this issue 1 year ago • 3 comments

Description

Love the project! We need to add a LangChain Document interface, which I am more than happy to do, but I have a few questions:

  • each node will become a document
  • the content will become the text field
  • bbox and node_id can be added as metadata

What is the embedding field for? Will it eventually be filled with an OpenAI embedding vector? What are tokens, and how are they calculated (based on what model)? Are you using tiktoken? Within each node you have something called Lines; is that basically the text split into detected lines?

Cheers.

priamai avatar Jul 16 '24 10:07 priamai

Great!

embedding is used for semantic processing (combining chunks by similarity). Yes, it's a vector from OpenAI (long term it may become provider-agnostic).

Tiktoken is our current method for calculating tokens, since (unfortunately) semantic processing is OpenAI-centric at the moment.
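
If it helps to see what that counting looks like in practice, here is a minimal sketch using tiktoken directly (cl100k_base is shown purely for illustration; the exact encoding openparse configures may differ):

import tiktoken

# Assumed encoding for illustration; openparse may configure a different one.
encoding = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    """Return the number of tokens tiktoken produces for the given text."""
    return len(encoding.encode(text))

print(count_tokens("Open Parse splits documents into semantically related nodes."))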

I wouldn't worry about lines; they're used internally to assemble nodes. Once a node is created, they're no longer needed.

Feel free to ask anything else!

Filimoa avatar Jul 16 '24 14:07 Filimoa

@Filimoa enjoy this simple class that's compatible with the LangChain loader interface.

from typing import Iterator

from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document

import openparse


class OpenParseDocumentLoader(BaseLoader):
    """A LangChain document loader backed by an openparse DocumentParser."""

    def __init__(self, file_path: str) -> None:
        """Initialize the loader with a file path.

        Args:
            file_path: The path to the file to load.
        """
        self.file_path = file_path

    def lazy_load(self) -> Iterator[Document]:  # <-- does not take any arguments
        """Lazily parse the file and yield one Document per openparse node.

        Lazy load methods should use a generator to yield documents one by one.
        """
        parser = openparse.DocumentParser()
        parsed_basic_doc = parser.parse(self.file_path)

        for node in parsed_basic_doc.nodes:
            yield Document(
                page_content=node.text,
                metadata={
                    "tokens": node.tokens,
                    "num_pages": node.num_pages,
                    "node_id": node.node_id,
                    "start_page": node.start_page,
                    "end_page": node.end_page,
                    "source": self.file_path,
                },
            )
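
One thing this loader leaves out from the original proposal is bbox. Here's a hedged sketch of how it could be carried in the metadata, assuming each entry in node.bbox is a pydantic v2 model (so model_dump() is available):

# Sketch only: assumes node.bbox is a sequence of pydantic v2 models exposing model_dump().
def bbox_metadata(node) -> list[dict]:
    """Serialize a node's bounding boxes into plain dicts of page and coordinate fields."""
    return [box.model_dump() for box in node.bbox]

# Inside lazy_load(), the metadata dict could then also include:
#     "bbox": bbox_metadata(node),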

Usage:

from OpenTextLoader import OpenParseDocumentLoader  # i.e. the class above saved as OpenTextLoader.py

loader = OpenParseDocumentLoader("./sample_docs/companies-list.pdf")

## Test out the lazy load interface
for doc in loader.lazy_load():
    print()
    print(type(doc))
    print(doc)

Feel free to add it to the code base.

priamai avatar Jul 19 '24 22:07 priamai

How do I extract tables and images from a PDF?

ITHealer avatar Jul 29 '24 03:07 ITHealer