
Add new types of document transformers

Open jasonwcfan opened this issue 2 years ago • 13 comments

  • Description: Add two new document transformers that translate documents into different languages and convert documents into Q&A format to improve vector search results. Uses OpenAI function calling via the doctran library.
  • Issue: N/A
  • Dependencies: doctran = "^0.0.5"
  • Tag maintainer: @rlancemartin @eyurtsev @hwchase17
  • Twitter handle: @psychicapi or @jfan001

Notes

  • Adheres to the DocumentTransformer abstraction set by @dev2049 in #3182
  • Refactored EmbeddingsRedundantFilter to put it in a file under a new document_transformers module
  • Added basic docs for DocumentInterrogator, DocumentTransformer as well as the existing EmbeddingsRedundantFilter

jasonwcfan avatar Jul 07 '23 22:07 jasonwcfan
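The DocumentTransformer abstraction the PR adheres to can be sketched roughly as follows. This is a simplified stand-in, not the actual langchain classes: `Document` here is a minimal dataclass, and `UppercaseTransformer` is a toy transformer standing in for the translation/Q&A transformers the PR adds.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Any, Sequence


@dataclass
class Document:
    """Simplified stand-in for langchain's Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class BaseDocumentTransformer(ABC):
    """Simplified sketch of the abstraction: a transformer maps a
    sequence of Documents to a new sequence of Documents."""

    @abstractmethod
    def transform_documents(
        self, documents: Sequence[Document], **kwargs: Any
    ) -> Sequence[Document]:
        ...


class UppercaseTransformer(BaseDocumentTransformer):
    """Toy transformer standing in for translation / Q&A generation."""

    def transform_documents(self, documents, **kwargs):
        # Return new Documents rather than mutating the inputs.
        return [
            Document(page_content=d.page_content.upper(), metadata=dict(d.metadata))
            for d in documents
        ]


docs = [Document(page_content="hello world")]
out = UppercaseTransformer().transform_documents(docs)
```

The key property is that transformers compose: the output of one (e.g. a splitter) is valid input to the next (e.g. a translator).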

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name: langchain | Status: ✅ Ready | Updated (UTC): Jul 13, 2023 3:51am

vercel[bot] avatar Jul 07 '23 22:07 vercel[bot]

Added the notebook examples - let me know if it's good to merge @rlancemartin and I can add the blob stuff in a followup PR, or if you want me to include it into this one!

jasonwcfan avatar Jul 08 '23 01:07 jasonwcfan

> Added the notebook examples - let me know if it's good to merge @rlancemartin and I can add the blob stuff in a followup PR, or if you want me to include it into this one!

Thanks!

> Added the notebook examples - let me know if it's good to merge @rlancemartin and I can add the blob stuff in a followup PR, or if you want me to include it into this one!

Great! Just added comments.

IMO, easier to make it a parser. Minimal code changes. I can help if needed.

Will simplify UX and be consistent w/ other recent transformers.

rlancemartin avatar Jul 08 '23 19:07 rlancemartin

Also, functionality here is: 1/ QA 2/ Translate

Looks like Doctran can do more in terms of metadata extraction.

Plans to add that in a follow-up? We can discuss, too.

QA generation is cool, but impact is somewhat mild b/c we already have QAGeneration chain.

Translation is neat.

Adding context to chunks (in metadata) is the high impact addition that would be great to get in.

rlancemartin avatar Jul 08 '23 19:07 rlancemartin

@rlancemartin ack, will add this in as a parser tomorrow.

Re: naming, I avoided including "QA" because there are already a lot of modules in LangChain with some variation of "QA" in the name, so I thought it might just increase confusion - but happy to call it whatever you think makes sense!

Re: additional functionality, yeah, I can add those too in a follow-up; we can discuss on Monday to see which ones are most useful and the best way to include them. Would also be curious to understand what the vision is for the document transformer module - I assumed it would be for these kinds of transformations.

jasonwcfan avatar Jul 09 '23 03:07 jasonwcfan

Added translate and QA as parsers, as well as metadata extraction since you mentioned that was valuable. Will clean up the old files and update the notebooks early this week.

I ran into a couple of issues though:

Passing arguments to parsers

The BaseBlobParser class doesn't support additional arguments when passed into loaders. It seems that other parsers don't need to accept additional arguments, but it's necessary for translation (need to specify the language) and metadata extraction (need to pass in a JSON schema to describe the metadata to extract).

I can extend BaseBlobParser and BaseLoader to accept **kwargs if we want to keep these modules as parsers, but that might make the scope of these changes further reaching than anticipated.

class DoctranTranslateParser(BaseBlobParser):
    """Translates text documents into other languages using doctran."""

    def lazy_parse(self, blob: Blob, language: str, **kwargs) -> Iterator[Document]:
        """Lazily parse the blob."""
        openai_api_key = get_from_dict_or_env(kwargs, "openai_api_key", "OPENAI_API_KEY")
        doctran = Doctran(openai_api_key=openai_api_key)
        doctran_doc = doctran.parse(content=blob.as_string()).translate(language=language).execute()
        # Yield a single Document, not a list: the signature is Iterator[Document]
        yield Document(
            page_content=doctran_doc.properties_as_yaml()
            if doctran_doc.extracted_properties
            else doctran_doc.transformed_content
        )

Splitting

When splitting documents, the metadata for the parent document is copied to all child documents. This means any extracted properties (including QA pairs) will be copied over to all chunks, rather than mapped to only the relevant chunks that contain those properties. Therefore vector retrieval using these metadata properties won't be able to retrieve the correct documents if the document gets split.

I can:

  1. Keep the functionality as is, but that would make metadata extraction useless if the user intends to split the document after parsing.
  2. Keep the metadata extraction/QA generation step as a document transformation so it can happen after splitting
  3. Add functionality to handle document splitting while mapping properties to the correct chunks. Not sure if there's a great way to do this since I'm not super familiar with the LangChain codebase yet.

Let me know what you think! Can also discuss more on our call but it's a lot to cover in 30 min haha.

jasonwcfan avatar Jul 09 '23 19:07 jasonwcfan
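Option 2 above (extract metadata after splitting, so properties attach to the chunk they came from) can be sketched with stdlib-only stand-ins. The splitter, the toy `extract_metadata` step (a word count standing in for an LLM extraction call), and the `Document` dataclass are all simplified assumptions, not langchain code.

```python
from dataclasses import dataclass, field


@dataclass
class Document:
    """Simplified stand-in for langchain's Document."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def split(doc: Document, chunk_size: int) -> list[Document]:
    """Naive fixed-size splitter; like the real splitters, it copies
    the parent metadata onto every chunk."""
    text = doc.page_content
    return [
        Document(page_content=text[i:i + chunk_size], metadata=dict(doc.metadata))
        for i in range(0, len(text), chunk_size)
    ]


def extract_metadata(doc: Document) -> Document:
    """Toy per-chunk 'extraction' standing in for an LLM call.
    Because it runs after splitting, the property belongs to this chunk."""
    doc.metadata["word_count"] = len(doc.page_content.split())
    return doc


parent = Document(page_content="alpha beta gamma delta", metadata={"source": "s"})
# Split first, then extract: each chunk gets its own metadata,
# while parent-level metadata (source) is still inherited.
chunks = [extract_metadata(c) for c in split(parent, chunk_size=11)]
```

Running extraction per chunk is more LLM calls, but it is what makes metadata-filtered retrieval return the right chunk rather than every chunk of the parent.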

> Added translate and QA as parsers, as well as metadata extraction since you mentioned that was valuable. Will clean up the old files and update the notebooks early this week.
>
> I ran into a couple of issues though: [...]
>
> Let me know what you think! Can also discuss more on our call but it's a lot to cover in 30 min haha.

Thanks! Hmm, you should be able to pass args to the parser. I'll have a close look later today! See here for an example of the Grobid parser accepting args and applying metadata to chunks:

loader = GenericLoader.from_filesystem(
    "../Papers/",
    glob="*",
    suffixes=[".pdf"],
    parser=GrobidParser(segment_sentences=False),
)
docs = loader.load()

rlancemartin avatar Jul 09 '23 22:07 rlancemartin

Re: arguments, ah yeah, nevermind - didn't notice that the loader accepts an instance of a parser rather than the class, so I can just set them in the constructor 👍

jasonwcfan avatar Jul 09 '23 22:07 jasonwcfan
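The constructor-based pattern settled on above might look like this. The `Blob`, `Document`, and `TranslateParser` classes here are simplified stand-ins (the real parser would call doctran inside `lazy_parse`); the point is that configuration lives on the instance, so `lazy_parse` keeps the fixed `(self, blob)` signature loaders expect.

```python
from dataclasses import dataclass
from typing import Iterator


@dataclass
class Blob:
    """Simplified stand-in for langchain's Blob."""
    data: str

    def as_string(self) -> str:
        return self.data


@dataclass
class Document:
    """Simplified stand-in for langchain's Document."""
    page_content: str


class TranslateParser:
    """Parser configuration goes in __init__, not in lazy_parse,
    so the loader can call lazy_parse(blob) with no extra args."""

    def __init__(self, language: str):
        self.language = language

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        # A real implementation would call doctran here; we just tag the text.
        yield Document(page_content=f"[{self.language}] {blob.as_string()}")


parser = TranslateParser(language="spanish")
docs = list(parser.lazy_parse(Blob(data="hello")))
```

This is why no `**kwargs` plumbing through BaseBlobParser/BaseLoader is needed: the loader receives an already-configured parser instance.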

> Re: arguments, ah yeah, nevermind - didn't notice that the loader accepts an instance of a parser rather than the class, so I can just set them in the constructor 👍

Nice! Ya. And I think chunk-wise metadata should work (let's have a quick skim at Grobid as an example). I'll look when back on my computer later.

rlancemartin avatar Jul 09 '23 22:07 rlancemartin

@rlancemartin I might be missing something here, but is there an easy way to load strings into Blobs? Right now I have to define a new BlobLoader class in order to load strings:

class StringBlobLoader(BlobLoader):
    def yield_blobs(self) -> Iterable[Blob]:
        yield Blob(data=sample_text)

loader = GenericLoader(
    blob_loader=StringBlobLoader(),
    blob_parser=DoctranTranslateParser(language="spanish")
)
docs = loader.load()

jasonwcfan avatar Jul 10 '23 23:07 jasonwcfan

> @rlancemartin I might be missing something here, but is there an easy way to load strings into Blobs? Right now I have to define a new BlobLoader class in order to load strings: [...]

Blobs are files, so you can just put an example txt file w/ your string in ../example data.

rlancemartin avatar Jul 10 '23 23:07 rlancemartin
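The suggestion above (write the string to a text file and let a file-system loader pick it up, instead of defining a custom BlobLoader) reduces to ordinary file I/O. A minimal sketch, with the sample text and file name being placeholder assumptions:

```python
import tempfile
from pathlib import Path

# Placeholder for the string previously wrapped in a custom BlobLoader.
sample_text = "Some sample text to translate."

# Write it to a .txt file; a file-system loader (e.g. GenericLoader.from_filesystem
# pointed at this directory) can then treat it like any other file.
tmp_dir = Path(tempfile.mkdtemp())
path = tmp_dir / "sample.txt"
path.write_text(sample_text, encoding="utf-8")

loaded = path.read_text(encoding="utf-8")
```

For small strings used only in a notebook, keeping them inline (as the author did) is just as reasonable.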

@rlancemartin Good to know! It wasn't a large amount of data so I just kept it in the notebook :)

All the changes are in!

  • Fixed the bug that was causing the issue in Python <3.10; turns out it was due to the new | union notation.
  • Removed doctran as a dependency, but these new parsers will raise an ImportError if used without the package installed

I think it should be good to merge in now, but let me know if I missed anything!

jasonwcfan avatar Jul 11 '23 05:07 jasonwcfan
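The optional-dependency guard described in the second bullet is the standard lazy-import pattern. A sketch under assumptions: the helper name `_require` and the error wording are illustrative, not the PR's actual code (the module name used to trigger the failure is deliberately one that does not exist).

```python
import importlib


def _require(module_name: str, pip_name: str):
    """Import an optional dependency lazily, raising a helpful
    ImportError if it isn't installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError as e:
        raise ImportError(
            f"Install {pip_name} to use this parser. (pip install {pip_name})"
        ) from e


# Demonstrate the guard with a module that is definitely not installed.
raised = False
msg = ""
try:
    _require("definitely_not_installed_xyz_123", "doctran")
except ImportError as e:
    raised = True
    msg = str(e)
```

Calling `_require("doctran", "doctran")` at the top of the parser method (rather than at module import time) is what lets langchain drop doctran from its hard dependencies.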

> @rlancemartin Good to know! It wasn't a large amount of data so I just kept it in the notebook :)
>
> All the changes are in! [...]
>
> I think it should be good to merge in now, but let me know if I missed anything!

Nice! I'll have a look tmrw and plan to get it in!

rlancemartin avatar Jul 11 '23 05:07 rlancemartin

Also, see tests:

poetry run mypy .
tests/integration_tests/test_document_transformers.py:2: error: Module "langchain.document_transformers" has no attribute "EmbeddingsClusteringFilter"  [attr-defined]

rlancemartin avatar Jul 11 '23 18:07 rlancemartin

Fixed the test failure as well! It was a result of moving document_transformers.py into a directory with an __init__.py. Thanks for catching these @rlancemartin

jasonwcfan avatar Jul 11 '23 21:07 jasonwcfan
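The class of failure here (`Module "langchain.document_transformers" has no attribute "EmbeddingsClusteringFilter"`) comes from turning a module into a package without re-exporting its symbols from `__init__.py`. A self-contained sketch, using a throwaway package name (`demo_transformers`) and class purely for illustration:

```python
import sys
import tempfile
import textwrap
from pathlib import Path

# Build a throwaway package whose __init__.py re-exports a symbol from a
# submodule, mirroring how moving document_transformers.py into a directory
# requires re-exports in document_transformers/__init__.py.
pkg_root = Path(tempfile.mkdtemp())
pkg = pkg_root / "demo_transformers"
pkg.mkdir()
(pkg / "filters.py").write_text(
    "class EmbeddingsClusteringFilter:\n    pass\n"
)
(pkg / "__init__.py").write_text(textwrap.dedent("""
    from demo_transformers.filters import EmbeddingsClusteringFilter

    __all__ = ["EmbeddingsClusteringFilter"]
"""))

sys.path.insert(0, str(pkg_root))
import demo_transformers  # noqa: E402

# Without the re-export line in __init__.py, this attribute access
# would fail exactly like the mypy error above.
filter_cls = demo_transformers.EmbeddingsClusteringFilter
```

Keeping `__all__` in sync with the re-exports is also what keeps mypy's attr-defined check happy.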

> Fixed the test failure as well! It was a result of moving document_transformers.py into a directory with an __init__.py. Thanks for catching these @rlancemartin

hmm, pulled your latest but still see issues:

File ~/anaconda3/envs/lcn2/lib/python3.9/site-packages/doctran/doctran.py:165, in Doctran.__init__(self, openai_api_key, openai_model, openai_token_limit)
    164 def __init__(self, openai_api_key: str = None, openai_model: str = "gpt-4", openai_token_limit: int = 8000):
--> 165     self.config = DoctranConfig(
    166         openai_api_key=openai_api_key,
    167         openai_model=openai_model,
    168         openai_completions_url="https://api.openai.com/v1/completions",
    169         openai=openai,
    170         openai_token_limit=openai_token_limit
    171     )
    172     if openai_api_key:
    173         self.config.openai.api_key = openai_api_key

File ~/anaconda3/envs/lcn2/lib/python3.9/site-packages/pydantic/main.py:341, in pydantic.main.BaseModel.__init__()

ValidationError: 1 validation error for DoctranConfig
openai_api_key
  none is not an allowed value (type=type_error.none.not_allowed)

rlancemartin avatar Jul 11 '23 22:07 rlancemartin

@rlancemartin sorry about that - I was running the notebooks with the API key passed in as a param so I wasn't running into that bug, but I tested it with the key set as an environment variable and fixed the issue.

jasonwcfan avatar Jul 11 '23 23:07 jasonwcfan
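The param-vs-environment gap that caused the ValidationError is exactly what the `get_from_dict_or_env` helper (used in the parser code earlier) is for. A minimal stdlib stand-in, not langchain's actual implementation, showing both resolution paths:

```python
import os


def get_from_dict_or_env(values: dict, key: str, env_key: str) -> str:
    """Minimal stand-in for the helper: prefer an explicit value,
    fall back to the environment, and fail loudly otherwise."""
    if values.get(key):
        return values[key]
    if os.environ.get(env_key):
        return os.environ[env_key]
    raise ValueError(f"Did not find {key}; pass it or set {env_key}.")


# Path 1: key only in the environment (the case that was broken).
os.environ["OPENAI_API_KEY"] = "sk-from-env"
from_env = get_from_dict_or_env({}, "openai_api_key", "OPENAI_API_KEY")

# Path 2: key passed explicitly as a param (the case the notebooks tested).
explicit = get_from_dict_or_env(
    {"openai_api_key": "sk-param"}, "openai_api_key", "OPENAI_API_KEY"
)
```

Testing both paths is the cheap way to avoid shipping code that only works with one of them.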

> @rlancemartin sorry about that - I was running the notebooks with the API key passed in as a param so I wasn't running into that bug, but I tested it with the key set as an environment variable and fixed the issue.

Cool, I'll give it one more check later.

In the meantime, see failing test:

langchain/document_transformers/__init__.py:1: error: Module "langchain.document_transformers.embeddings_redundant_filter" has no attribute "_filter_cluster_embeddingsgit"; maybe "_filter_cluster_embeddings" or "_filter_similar_embeddings"?  [attr-defined]
Found 1 error in 1 file (checked 1332 source files)
make: *** [Makefile:34: lint] Error 1

rlancemartin avatar Jul 12 '23 00:07 rlancemartin

@jasonwcfan I chatted w/ @baskaryan. This is better as a document transformer since it clearly ingests a document (rather than loading from a novel source) and transforms it. I moved it back to being a transformer. Please have a quick look, and then we can merge tomorrow. Please add a bit more detail to the Extract Properties notebook header and also make sure that part is working as expected - it looks a bit off to me.

rlancemartin avatar Jul 12 '23 04:07 rlancemartin

Note: I'm using poetry lock --no-update to re-generate poetry.lock after checking out master pyproject.toml. Good for future reference when you see conflicts w/ poetry.lock.

rlancemartin avatar Jul 12 '23 04:07 rlancemartin

Oof that's odd, I can't think of any way it would have leaked through Doctran. It's not in the repo and we don't send openai keys to any API. I'll comb through doctran tomorrow and double check.

In the meantime I've pushed some updates to the notebooks + transformers, I'll check if they pass the tests once it's done deploying. Thanks for bearing with me @rlancemartin!

jasonwcfan avatar Jul 12 '23 06:07 jasonwcfan

> Oof that's odd, I can't think of any way it would have leaked through Doctran. It's not in the repo and we don't send openai keys to any API. I'll comb through doctran tomorrow and double check.
>
> In the meantime I've pushed some updates to the notebooks + transformers, I'll check if they pass the tests once it's done deploying. Thanks for bearing with me @rlancemartin!

Sounds good! Ya it's really weird.

rlancemartin avatar Jul 12 '23 12:07 rlancemartin

Error due to the lib flipping back to async:

╭─────────────────────────────── Traceback (most recent call last) ────────────────────────────────╮
│ in <module>:1                                                                                    │
│                                                                                                  │
│ ❱ 1 transformed_document = await qa_transformer.atransform_documents(documents)                  │
│   2 print(json.dumps(transformed_document[0].metadata, indent=2))                                │
│   3                                                                                              │
│                                                                                                  │
│ /Users/rlm/Desktop/Code/langchain/langchain/document_transformers/text_qa.py:36 in               │
│ atransform_documents                                                                             │
│                                                                                                  │
│   33 │   │   │   │   "Install doctran to use this parser. (pip install doctran)"                 │
│   34 │   │   │   )                                                                               │
│   35 │   │   for d in documents:                                                                 │
│ ❱ 36 │   │   │   doctran_doc = await doctran.parse(content=d.page_content).interrogate().exec    │
│   37 │   │   │   questions_and_answers = doctran_doc.extracted_properties.get(                   │
│   38 │   │   │   │   "questions_and_answers"                                                     │
│   39 │   │   │   )                                                                               │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
TypeError: object Document can't be used in 'await' expression

@jasonwcfan will add a test to Doctran to protect against this.

rlancemartin avatar Jul 12 '23 18:07 rlancemartin
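The TypeError above (`object Document can't be used in 'await' expression`) is what happens when a dependency flips between sync and async APIs under a caller that unconditionally awaits. One defensive option, sketched here with toy stand-ins for the doctran call (the `maybe_await` helper is an illustrative assumption, not something the PR adds):

```python
import asyncio
import inspect


def parse_sync():
    """Stand-in for a dependency version that returns a plain value."""
    return "doc"


async def parse_async():
    """Stand-in for a dependency version that returns a coroutine."""
    return "doc"


async def maybe_await(value):
    """Await the value only if the library actually returned an awaitable,
    so a sync/async flip in the dependency doesn't raise TypeError."""
    if inspect.isawaitable(value):
        return await value
    return value


sync_result = asyncio.run(maybe_await(parse_sync()))
async_result = asyncio.run(maybe_await(parse_async()))
```

Pinning the dependency version plus an integration test (as suggested above) is the more robust fix; this guard just makes the failure mode graceful.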

Adding @baskaryan re: creation of document_transformers subdir and moving document_transformers.py to langchain/document_transformers/embeddings_redundant_filter.py

rlancemartin avatar Jul 12 '23 20:07 rlancemartin

This is good to go in once @jasonwcfan adds init methods and does a final sweep of Notebooks w/ init.

@baskaryan minor point that Vercel deploy test is failing (non blocking).

rlancemartin avatar Jul 12 '23 23:07 rlancemartin

can fix vercel and lint issues and land from here @jasonwcfan!

baskaryan avatar Jul 13 '23 02:07 baskaryan

Thanks @baskaryan! Great. Should we wait for the Vercel fix?

rlancemartin avatar Jul 13 '23 03:07 rlancemartin