langchain
langchain copied to clipboard
feature/4493 Improve Evernote Document Loader
Improve Evernote Document Loader
When exporting from Evernote you may export more than one note. Currently the Evernote loader concatenates the content of all notes in the export into a single document and only attaches the name of the export file as metadata on the document.
This change ensures that each note is loaded as an independent document and all available metadata on the note e.g. author, title, created, updated are added as metadata on each document.
It also uses an existing optional dependency of html2text instead of pypandoc to remove the need to download the pandoc application via download_pandoc() to be able to use the pypandoc python bindings.
Fixes #4493
Before submitting
Who can review?
@eyurtsev / @dev2049
Community members can review the PR once tests pass. Tag maintainers/contributors who might be interested:
@MikeMcGarry Thank you for the PR. This is a major improvement to how ever note document loader works!
@hwchase17 thoughts on backwards compatibility?
Hi @eyurtsev thanks for your detailed review, your guidance on setting up the dependencies appropriately was really helpful and you are on the money about those methods which should be static methods rather than class methods. I've addressed all your feedback including preserving the existing behaviour and making the new behaviour available by passing load_single_document=False into the initialisation of the EverNoteLoader. I've also added some additional unit tests to improve our coverage.
Thank you! as a heads up, will commandeer tomorrow to help resolve merge conflicts around poetry.lock files, there's a bunch of PRs that are bumping into merge conflicts, so will be doing the same in all to merge the code faster.
Okay sounds good! I've run poetry run black . and committed the updated files as I see one of the pipeline tests failed as the files were not formatted appropriately.
I've also rebased onto master.
seems like one of the test files is 11k lines, any chance we could use a smaller one 😅
It's a base64 encoded image. I wanted to make sure images were removed from the note and didn't end up in the page context.
Do we have some size constraints we want to consider with test data files? These aren't part of the package contents when published to PyPi right?
It's a base64 encoded image. I wanted to make sure images were removed from the note and didn't end up in the page context.
Do we have some size constraints we want to consider with test data files? These aren't part of the package contents when published to PyPi right?
they're not part of the package, it's still nice to keep the repo small though. could we just use a small image (shouldn't change quality of the test right?)
It's a base64 encoded image. I wanted to make sure images were removed from the note and didn't end up in the page context. Do we have some size constraints we want to consider with test data files? These aren't part of the package contents when published to PyPi right?
they're not part of the package, it's still nice to keep the repo small though. could we just use a small image (shouldn't change quality of the test right?)
Okay sure thing, I've scaled down the image it's now 18KB instead of 2.1MB.
@dev2049 / @eyurtsev I've accommodated all the feedback and rebased onto master. Are there any outstanding points we need to address?
Code looks ready for merging. Waiting for tests to pass and then can merge in. Thank you @MikeMcGarry