open-parse
add langchain document support
Description
Love the project. We need to add a LangChain Document interface, which I am more than happy to do, but first a few assumptions and questions:
- each node will become a document
- the content will become the text field
- the metadata can be added as bbox and node_id
What is the embedding field for? Will it eventually be filled with an OpenAI embedding vector? What are tokens, and how are they calculated (based on which model)? Are you using tiktoken? Within each node you have something called Lines; is that basically the text split into detected lines?
Cheers.
Great!
The embedding is used for semantic processing (combining chunks by similarity). Yes, it's a vector from OpenAI (long term, maybe model-agnostic).
Tiktoken is our current method for calculating tokens since (unfortunately) semantic processing is OpenAI-centric at the moment.
I wouldn't worry about lines - they're used internally to assemble nodes. Once the node is created they're no longer needed.
Feel free to ask anything else!
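For anyone wondering what "combining chunks by similarity" could look like, here is a minimal sketch. It is not open-parse's actual implementation: the function names (cosine_similarity, merge_similar), the greedy adjacent-merge strategy, and the 0.8 threshold are all assumptions made for illustration, and the tiny 2-D vectors stand in for real embedding vectors.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def merge_similar(
    texts: List[str],
    embeddings: List[Sequence[float]],
    threshold: float = 0.8,  # hypothetical cutoff, chosen for illustration
) -> List[str]:
    """Greedily join each chunk to the previous one when their embeddings are similar."""
    if len(texts) != len(embeddings):
        raise ValueError("texts and embeddings must have the same length")

    merged: List[str] = []
    prev_embedding = None
    for text, embedding in zip(texts, embeddings):
        if merged and cosine_similarity(prev_embedding, embedding) >= threshold:
            # Similar to the preceding chunk: append to the current group.
            merged[-1] = merged[-1] + "\n" + text
        else:
            # Dissimilar (or first chunk): start a new group.
            merged.append(text)
        prev_embedding = embedding
    return merged


# Toy vectors standing in for real embeddings.
chunks = ["Revenue grew in Q1.", "Revenue also grew in Q2.", "Our office has a new cafe."]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
result = merge_similar(chunks, vectors)
```

With these toy vectors the first two chunks end up in one group and the third stays separate. A real pipeline would compare actual embedding vectors and would probably also cap the merged chunk size in tokens.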
@Filimoa here is a simple loader class that is compatible with LangChain:
from typing import Iterator

from langchain_core.document_loaders import BaseLoader
from langchain_core.documents import Document

import openparse


class OpenParseDocumentLoader(BaseLoader):
    """A document loader that parses a file with open-parse, one Document per node."""

    def __init__(self, file_path: str) -> None:
        """Initialize the loader with a file path.

        Args:
            file_path: The path to the file to load.
        """
        self.file_path = file_path

    def lazy_load(self) -> Iterator[Document]:  # <-- Does not take any arguments
        """Lazily yield one Document per parsed node.

        When you're implementing lazy load methods, you should use a generator
        to yield documents one by one.
        """
        parser = openparse.DocumentParser()
        parsed_basic_doc = parser.parse(self.file_path)
        for node in parsed_basic_doc.nodes:
            # The node text becomes the page content; the rest goes in metadata.
            yield Document(
                page_content=node.text,
                metadata={
                    "tokens": node.tokens,
                    "num_pages": node.num_pages,
                    "node_id": node.node_id,
                    "start_page": node.start_page,
                    "end_page": node.end_page,
                    "source": self.file_path,
                },
            )
Usage:
from OpenTextLoader import OpenParseDocumentLoader
loader = OpenParseDocumentLoader("./sample_docs/companies-list.pdf")
# Test out the lazy load interface
for doc in loader.lazy_load():
    print()
    print(type(doc))
    print(doc)
Feel free to add to the code base.
How do I extract tables and images from a PDF?