Adds OpenAI functions powered document metadata tagger

Open jacoblee93 opened this issue 2 years ago • 2 comments

Adds a new document transformer that automatically extracts metadata for a document based on an input schema. I also moved document_transformers.py to document_transformers/__init__.py to group it with this new transformer - it didn't seem to cause issues in the notebook, but let me know if I've done something wrong there.
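To make the idea of schema-driven metadata extraction concrete, here is a minimal, self-contained sketch. It does not use the PR's `create_metadata_tagger` or `MetadataTagger`; the `Document` dataclass, `tag_documents`, and `keyword_extractor` are hypothetical names introduced only for illustration, and the toy extractor stands in for the OpenAI functions call, which is an assumption about the overall shape rather than the actual implementation.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Sequence


@dataclass
class Document:
    """Hypothetical stand-in for a document: text plus a metadata dict."""

    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


# Hypothetical extractor signature: (text, schema) -> extracted fields.
# In the PR this role is played by an OpenAI functions call.
Extractor = Callable[[str, Dict[str, Any]], Dict[str, Any]]


def tag_documents(
    documents: Sequence[Document],
    schema: Dict[str, Any],
    extract: Extractor,
) -> List[Document]:
    """Return copies of ``documents`` with schema-conformant metadata merged in."""
    allowed = set(schema.get("properties", {}))
    required = set(schema.get("required", []))
    tagged: List[Document] = []
    for doc in documents:
        extracted = extract(doc.page_content, schema)
        # Keep only the keys that the schema declares.
        fields = {k: v for k, v in extracted.items() if k in allowed}
        missing = required - set(fields)
        if missing:
            raise ValueError(f"extractor omitted required fields: {sorted(missing)}")
        # Build a new Document so the input is left unmodified.
        tagged.append(
            Document(
                page_content=doc.page_content,
                metadata={**doc.metadata, **fields},
            )
        )
    return tagged


def keyword_extractor(text: str, schema: Dict[str, Any]) -> Dict[str, Any]:
    """Toy extractor (assumption): a keyword check instead of a model call."""
    tone = "positive" if "great" in text.lower() else "negative"
    # "unrelated" is not in the schema and will be filtered out.
    return {"tone": tone, "unrelated": True}


schema = {"properties": {"tone": {"type": "string"}}, "required": ["tone"]}
docs = [Document("A great movie", {"source": "review-1"})]
tagged_docs = tag_documents(docs, schema, keyword_extractor)
```

The extractor is injected as a callable so that the schema filtering and metadata merging can be exercised without any network access.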

Also had a linter issue I couldn't figure out:

MacBook-Pro:langchain jacoblee$ make lint
poetry run mypy .
docs/dist/conf.py: error: Duplicate module named "conf" (also at "./docs/api_reference/conf.py")
docs/dist/conf.py: note: See https://mypy.readthedocs.io/en/stable/running_mypy.html#mapping-file-paths-to-modules for more info
docs/dist/conf.py: note: Common resolutions include: a) using `--exclude` to avoid checking one of them, b) adding `__init__.py` somewhere, c) using `--explicit-package-bases` or adjusting MYPYPATH
Found 1 error in 1 file (errors prevented further checking)
make: *** [lint] Error 2

@rlancemartin @baskaryan

jacoblee93, Jul 11 '23 07:07

The latest updates on your projects. Learn more about Vercel for Git ↗

1 Ignored Deployment
Name | Status | Preview | Comments | Updated (UTC)
langchain | ⬜ Ignored (Inspect) | | | Jul 13, 2023 4:50am

vercel[bot], Jul 11 '23 07:07

PR Analysis

  • 🎯 Main theme: Adding a new document transformer that automatically extracts metadata for a document based on an input schema
  • 🔍 Description and title: Yes
  • 📌 Type of PR: Enhancement
  • 🧪 Relevant tests added: No
  • ✨ Minimal and focused: Yes, the PR is focused on adding a new feature of document metadata tagging and does not include unrelated changes.
  • 🔒 Security concerns: No, the PR does not introduce possible security concerns or issues. The changes are related to data processing and do not involve any security-sensitive operations.

PR Feedback

  • 💡 General PR suggestions: The PR is well-structured and the code changes are clear. However, it lacks tests for the new functionality. It's important to add tests to ensure the new feature works as expected and to prevent regressions in the future. Additionally, the linter issue mentioned in the PR description should be resolved.

  • 🤖 Code suggestions:

    • relevant file: langchain/document_transformers/__init__.py suggestion content: Consider adding type hints to the functions _filter_similar_embeddings and _filter_cluster_embeddings for better code readability and maintainability. [important]

    • relevant file: langchain/document_transformers/openai_functions.py suggestion content: In the MetadataTagger class, the atransform_documents method raises a NotImplementedError. If this method is not intended to be used, consider removing it or adding a docstring to explain why it's not implemented. [medium]

    • relevant file: langchain/document_transformers/openai_functions.py suggestion content: In the create_metadata_tagger function, consider adding a docstring to explain the purpose of the function, its arguments, and its return value. This will improve code readability and maintainability. [medium]

    • relevant file: langchain/chains/openai_functions/tagging.py suggestion content: In the create_tagging_chain and create_tagging_chain_pydantic functions, consider adding a docstring to explain the purpose of the functions, their arguments, and their return values. This will improve code readability and maintainability. [medium]
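The suggestions about `atransform_documents` and docstrings can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: `UppercaseTagger` and `ThreadedUppercaseTagger` are hypothetical classes, and the uppercase logic is a placeholder for the real transformation. The first class documents why the async method is unsupported; the second shows one possible alternative, delegating to the synchronous method through `loop.run_in_executor`.

```python
import asyncio
from functools import partial
from typing import Any, List, Sequence


class UppercaseTagger:
    """Hypothetical transformer used only to illustrate the two options."""

    def transform_documents(self, documents: Sequence[str], **kwargs: Any) -> List[str]:
        """Transform a batch of documents.

        Args:
            documents: The input texts.
            **kwargs: Unused; accepted for interface compatibility.

        Returns:
            A new list with each text uppercased (placeholder logic).
        """
        return [doc.upper() for doc in documents]

    async def atransform_documents(self, documents: Sequence[str], **kwargs: Any) -> List[str]:
        """Async variant, intentionally unsupported in this class.

        Raises:
            NotImplementedError: Always; call ``transform_documents`` instead.
        """
        raise NotImplementedError("Use transform_documents; no async path exists.")


class ThreadedUppercaseTagger(UppercaseTagger):
    """Alternative (assumption): reuse the sync method from async code."""

    async def atransform_documents(self, documents: Sequence[str], **kwargs: Any) -> List[str]:
        """Run ``transform_documents`` in the event loop's default executor."""
        loop = asyncio.get_running_loop()
        call = partial(self.transform_documents, documents, **kwargs)
        return await loop.run_in_executor(None, call)


result = asyncio.run(ThreadedUppercaseTagger().atransform_documents(["a", "b"]))
```

Running the synchronous method in an executor keeps the event loop responsive, but it does not make the underlying work itself asynchronous.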

How to use

Tag me in a comment '@CodiumAI-Agent' to ask for a new review after you update the PR. You can also tag me and ask any question, for example '@CodiumAI-Agent is the PR ready for merge?'

CodiumAI-Agent, Jul 11 '23 07:07

Nice. Yes. This is good. Consolidating document_transformers in a new dir will make additions easier going forward.

Like @baskaryan said, https://github.com/hwchase17/langchain/pull/7379 is going in soon.

rlancemartin, Jul 12 '23 23:07