Add new types of document transformers
- Description: Adds two new document transformers that translate documents into different languages and convert documents into Q&A format to improve vector search results. Uses OpenAI function calling via the doctran library.
- Issue: N/A
- Dependencies: `doctran = "^0.0.5"`
- Tag maintainer: @rlancemartin @eyurtsev @hwchase17
- Twitter handle: @psychicapi or @jfan001
Notes
- Adheres to the `DocumentTransformer` abstraction set by @dev2049 in #3182
- Refactored `EmbeddingsRedundantFilter` to put it in a file under a new `document_transformers` module
- Added basic docs for `DocumentInterrogator` and `DocumentTransformer`, as well as the existing `EmbeddingsRedundantFilter`
The latest updates on your projects. Learn more about Vercel for Git ↗︎
| Name | Status | Preview | Comments | Updated (UTC) |
|---|---|---|---|---|
| langchain | ✅ Ready (Inspect) | Visit Preview | 💬 Add feedback | Jul 13, 2023 3:51am |
Added the notebook examples - let me know if it's good to merge @rlancemartin and I can add the blob stuff in a followup PR, or if you want me to include it into this one!
Thanks!
Great! Just added comments.
IMO, easier to make it a parser. Minimal code changes. I can help if needed.
Will simplify UX and be consistent w/ other recent transformers.
Also, functionality here is: 1/ QA 2/ Translate
Looks like Doctran can do more in terms of metadata extraction.
Plans to add that in a follow-up? We can discuss, too.
QA generation is cool, but impact is somewhat mild b/c we already have QAGeneration chain.
Translation is neat.
Adding context to chunks (in metadata) is the high impact addition that would be great to get in.
@rlancemartin ack, will add this in as a parser tomorrow.
Re: naming, I avoided including QA because there are already a lot of modules in LangChain that have some variation of "QA" in their names, so I thought it might just increase confusion - but happy to call it whatever you think makes sense!
Re: additional functionality, yeah I can add those too in a follow-up; we can discuss on Monday to see which ones are most useful and the best way to include them. I'd also be curious to understand what the vision is for the document transformer module - I assumed it would be for these kinds of transformations.
Added translate and QA as parsers, as well as metadata extraction since you mentioned that was valuable. Will clean up the old files and update the notebooks early this week.
I ran into a couple of issues though:
Passing arguments to parsers
The BaseBlobParser class doesn't support additional arguments when passed into loaders. It seems that other parsers don't need to accept additional arguments, but it's necessary for translation (need to specify the language) and metadata extraction (need to pass in a JSON schema to describe the metadata to extract).
I can extend BaseBlobParser and BaseLoader to accept **kwargs if we want to keep these modules as parsers, but that might make the scope of these changes further reaching than anticipated.
```python
from typing import Iterator

from doctran import Doctran
from langchain.document_loaders.base import BaseBlobParser
from langchain.document_loaders.blob_loaders import Blob
from langchain.schema import Document
from langchain.utils import get_from_dict_or_env


class DoctranTranslateParser(BaseBlobParser):
    """Translates text documents into other languages using doctran."""

    def lazy_parse(self, blob: Blob, language: str, **kwargs) -> Iterator[Document]:
        """Lazily parse the blob."""
        openai_api_key = get_from_dict_or_env(kwargs, "openai_api_key", "OPENAI_API_KEY")
        doctran = Doctran(openai_api_key=openai_api_key)
        doctran_doc = doctran.parse(content=blob.as_string()).translate(language=language).execute()
        # yield a single Document (not a list) to satisfy Iterator[Document]
        yield Document(
            page_content=doctran_doc.properties_as_yaml()
            if doctran_doc.extracted_properties
            else doctran_doc.transformed_content
        )
```
Splitting
When splitting documents, the metadata for the parent document is copied to all child documents. This means any extracted properties (including QA pairs) will be copied over to all chunks, rather than mapped to only the relevant chunks that contain those properties. Therefore vector retrieval using these metadata properties won't be able to retrieve the correct documents if the document gets split.
I can:
- Keep the functionality as is, but that would make metadata extraction useless if the user intends to split the document after parsing.
- Keep the metadata extraction/QA generation step as a document transformation so it can happen after splitting.
- Add functionality to handle document splitting while mapping properties to the correct chunks. Not sure if there's a great way to do this since I'm not super familiar with the LangChain codebase yet.
Let me know what you think! Can also discuss more on our call but it's a lot to cover in 30 min haha.
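To make the splitting issue concrete, here's a minimal standalone sketch (not LangChain code; the function name is hypothetical) of how a splitter that copies parent metadata wholesale ends up attaching the same extracted properties to every chunk:

```python
def split_with_metadata(text: str, metadata: dict, chunk_size: int) -> list[dict]:
    """Toy splitter mimicking the behavior described above: every child
    chunk inherits a full copy of the parent document's metadata."""
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    return [{"page_content": c, "metadata": dict(metadata)} for c in chunks]


# Extracted Q&A pairs live in the parent metadata...
doc_meta = {"questions_and_answers": [{"q": "Who wrote it?", "a": "Jane"}]}
chunks = split_with_metadata("x" * 50, doc_meta, chunk_size=20)

# ...and get duplicated onto all three chunks, including chunks the
# answer didn't come from, which is what breaks metadata-based retrieval.
assert len(chunks) == 3
assert all(c["metadata"] == doc_meta for c in chunks)
```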
Thanks! Hmm, you should be able to pass args to the parser. I'll have a close look later today! See here for an example of the Grobid parser accepting args and applying metadata to chunks.
```python
loader = GenericLoader.from_filesystem(
    "../Papers/",
    glob="*",
    suffixes=[".pdf"],
    parser=GrobidParser(segment_sentences=False),
)
docs = loader.load()
```
re: arguments, ah yeah nevermind - didn't notice that the loader accepts an instance of a parser rather than the class, so I can just set them in the constructor 👍
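For what it's worth, the instance-based pattern looks roughly like this (a toy sketch, not the real `BaseBlobParser` API): options are set once in the constructor, so the parse method needs no extra arguments.

```python
class ConfigurableParser:
    """Toy parser illustrating constructor-based configuration."""

    def __init__(self, language: str):
        # per-parser options live on the instance, not in lazy_parse(**kwargs)
        self.language = language

    def lazy_parse(self, text: str):
        yield {"page_content": text, "language": self.language}


parser = ConfigurableParser(language="spanish")
docs = list(parser.lazy_parse("hola"))
assert docs[0]["language"] == "spanish"
```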
Nice! Ya. And I think chunk-wise metadata should work (let's have a quick skim at Grobid as an example). I'll look when back on my computer later.
@rlancemartin I might be missing something here, but is there an easy way to load strings into Blobs? Right now I have to define a new BlobLoader class in order to load strings:
```python
class StringBlobLoader(BlobLoader):
    def yield_blobs(self) -> Iterable[Blob]:
        yield Blob(data=sample_text)


loader = GenericLoader(
    blob_loader=StringBlobLoader(),
    blob_parser=DoctranTranslateParser(language="spanish"),
)
docs = loader.load()
```
Blobs are files, so you can just put an example txt file w/ your string in ../example data.
@rlancemartin Good to know! It wasn't a large amount of data so I just kept it in the notebook :)
All the changes are in!
- Fixed the bug that was causing the issue in Python <3.10; turns out it was due to the new union notation using `|`.
- Removed doctran as a dependency, but these new parsers will raise an `ImportError` if they are used without the package being installed.

I think it should be good to merge in now, but let me know if I missed anything!
Nice! I'll have a look tmrw and plan to get it in!
Also, see tests:

```
poetry run mypy .
tests/integration_tests/test_document_transformers.py:2: error: Module "langchain.document_transformers" has no attribute "EmbeddingsClusteringFilter" [attr-defined]
```
Fixed the test failure as well! It was a result of moving `document_transformers.py` into a directory with an `__init__.py`. Thanks for catching these @rlancemartin!
hmm, pulled your latest but still see issues:
```
File ~/anaconda3/envs/lcn2/lib/python3.9/site-packages/doctran/doctran.py:165, in Doctran.__init__(self, openai_api_key, openai_model, openai_token_limit)
    164 def __init__(self, openai_api_key: str = None, openai_model: str = "gpt-4", openai_token_limit: int = 8000):
--> 165     self.config = DoctranConfig(
    166         openai_api_key=openai_api_key,
    167         openai_model=openai_model,
    168         openai_completions_url="https://api.openai.com/v1/completions",
    169         openai=openai,
    170         openai_token_limit=openai_token_limit
    171     )
    172     if openai_api_key:
    173         self.config.openai.api_key = openai_api_key

File ~/anaconda3/envs/lcn2/lib/python3.9/site-packages/pydantic/main.py:341, in pydantic.main.BaseModel.__init__()

ValidationError: 1 validation error for DoctranConfig
openai_api_key
  none is not an allowed value (type=type_error.none.not_allowed)
```
@rlancemartin sorry about that! I was running the notebooks with the API key passed in as a param, so I wasn't running into that bug. I tested it with the key set as an environment variable and fixed the issue.
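The lookup order behind that fix is essentially what `get_from_dict_or_env` does; here's a minimal sketch (the helper name here is hypothetical): prefer an explicit kwarg, then fall back to the environment variable, and error only if neither is set.

```python
import os


def get_key(kwargs: dict, key: str, env_key: str) -> str:
    """Prefer an explicit kwarg, fall back to the environment variable."""
    if kwargs.get(key):
        return kwargs[key]
    if os.environ.get(env_key):
        return os.environ[env_key]
    raise ValueError(f"Did not find {key}; pass it explicitly or set {env_key}.")


os.environ["OPENAI_API_KEY"] = "sk-test"
assert get_key({}, "openai_api_key", "OPENAI_API_KEY") == "sk-test"
assert get_key({"openai_api_key": "sk-param"}, "openai_api_key", "OPENAI_API_KEY") == "sk-param"
```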
Cool, I'll give it one more check later.
In the meantime, see failing test:
```
langchain/document_transformers/__init__.py:1: error: Module "langchain.document_transformers.embeddings_redundant_filter" has no attribute "_filter_cluster_embeddingsgit"; maybe "_filter_cluster_embeddings" or "_filter_similar_embeddings"? [attr-defined]
Found 1 error in 1 file (checked 1332 source files)
make: *** [Makefile:34: lint] Error 1
```
@jasonwcfan i chatted w/ @baskaryan. this is better as a document transformer since it clearly ingests a document (rather than loading from a novel source) and transforms it. i moved it back to being a transformer. please have a quick look, and then we can merge tomorrow. please add a bit more detail to Extract Properties notebook header and also make sure that part is working as expected. it looks a bit off to me.
Note: I'm using `poetry lock --no-update` to re-generate `poetry.lock` after checking out master's `pyproject.toml`. Good for future reference when you see conflicts w/ `poetry.lock`.
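For readers following along, the documents-in, documents-out shape that makes this a transformer rather than a loader can be sketched with a toy class (not the real `BaseDocumentTransformer`):

```python
from typing import Sequence


class UppercaseTransformer:
    """Toy transformer: takes existing documents, returns transformed documents,
    mirroring the transform_documents interface described above."""

    def transform_documents(self, documents: Sequence[str]) -> Sequence[str]:
        return [d.upper() for d in documents]


assert UppercaseTransformer().transform_documents(["hola", "mundo"]) == ["HOLA", "MUNDO"]
```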
Oof that's odd, I can't think of any way it would have leaked through Doctran. It's not in the repo and we don't send openai keys to any API. I'll comb through doctran tomorrow and double check.
In the meantime I've pushed some updates to the notebooks + transformers, I'll check if they pass the tests once it's done deploying. Thanks for bearing with me @rlancemartin!
Sounds good! Ya it's really weird.
Error due to lib flipped back to async -

```
╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in <module>:1                                                                                    │
│                                                                                                  │
│ ❱ 1 transformed_document = await qa_transformer.atransform_documents(documents)                  │
│   2 print(json.dumps(transformed_document[0].metadata, indent=2))                                │
│   3                                                                                              │
│                                                                                                  │
│ /Users/rlm/Desktop/Code/langchain/langchain/document_transformers/text_qa.py:36 in               │
│ atransform_documents                                                                             │
│                                                                                                  │
│   33 │ │ │ │ "Install doctran to use this parser. (pip install doctran)"                         │
│   34 │ │ │ )                                                                                     │
│   35 │ │ for d in documents:                                                                     │
│ ❱ 36 │ │ │ doctran_doc = await doctran.parse(content=d.page_content).interrogate().exec          │
│   37 │ │ │ questions_and_answers = doctran_doc.extracted_properties.get(                         │
│   38 │ │ │ │ "questions_and_answers"                                                             │
│   39 │ │ │ )                                                                                     │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
TypeError: object Document can't be used in 'await' expression
```
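That `TypeError` is the generic failure mode of `await`ing a plain return value instead of a coroutine; a minimal reproduction (toy classes, not doctran's API):

```python
import asyncio


class SyncDoc:
    def execute(self):  # synchronous: returns a plain value
        return "doc"


class AsyncDoc:
    async def execute(self):  # async: calling it returns an awaitable
        return "doc"


async def main():
    ok = await AsyncDoc().execute()  # fine: execute() is a coroutine function
    try:
        await SyncDoc().execute()  # TypeError: str can't be used in 'await'
        raised = False
    except TypeError:
        raised = True
    return ok, raised


result, raised = asyncio.run(main())
assert result == "doc" and raised
```

When a library flips a method between sync and async across versions, every call site has to add or drop the `await` accordingly, which is why a test pinning the expected behavior helps.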
@jasonwcfan will add a test to Doctran to protect against this.
Adding @baskaryan re: creation of the `document_transformers` subdir and moving `document_transformers.py` to `langchain/document_transformers/embeddings_redundant_filter.py`.
This is good to go in once @jasonwcfan adds init methods and does a final sweep of Notebooks w/ init.
@baskaryan minor point that Vercel deploy test is failing (non blocking).
can fix vercel and lint issues and land from here @jasonwcfan!
Thanks @baskaryan! Great. Should we wait for the Vercel fix?