Improve Evernote Document Loader

When exporting from Evernote you may export more than one note. Currently the Evernote loader concatenates the content of all notes in the export into a single document and only attaches the name of the export file as metadata on the document.

This change ensures that each note is loaded as an independent document and all available metadata on the note e.g. author, title, created, updated are added as metadata on each document.

It also uses an existing optional dependency of html2text instead of pypandoc to remove the need to download the pandoc application via download_pandoc() to be able to use the pypandoc python bindings.

Fixes #4493

Before submitting

Who can review?

@eyurtsev / @dev2049

Community members can review the PR once tests pass. Tag maintainers/contributors who might be interested:

May 12 '23 12:05 MikeMcGarry

@MikeMcGarry Thank you for the PR. This is a major improvement to how ever note document loader works!

@hwchase17 thoughts on backwards compatibility?

May 12 '23 15:05 eyurtsev

Hi @eyurtsev thanks for your detailed review, your guidance on setting up the dependencies appropriately was really helpful and you are on the money about those methods which should be static methods rather than class methods. I've addressed all your feedback including preserving the existing behaviour and making the new behaviour available by passing load_single_document=False into the initialisation of the EverNoteLoader. I've also added some additional unit tests to improve our coverage.

May 13 '23 11:05 MikeMcGarry

Thank you! as a heads up, will commandeer tomorrow to help resolve merge conflicts around poetry.lock files, there's a bunch of PRs that are bumping into merge conflicts, so will be doing the same in all to merge the code faster.

May 15 '23 02:05 eyurtsev

Okay sounds good! I've run poetry run black . and committed the updated files as I see one of the pipeline tests failed as the files were not formatted appropriately.

May 15 '23 05:05 MikeMcGarry

I've also rebased onto master.

May 15 '23 05:05 MikeMcGarry

seems like one of the test files is 11k lines, any chance we could use a smaller one 😅

May 17 '23 00:05 dev2049

It's a base64 encoded image. I wanted to make sure images were removed from the note and didn't end up in the page context.

Do we have some size constraints we want to consider with test data files? These aren't part of the package contents when published to PyPi right?

May 17 '23 14:05 MikeMcGarry

It's a base64 encoded image. I wanted to make sure images were removed from the note and didn't end up in the page context.

Do we have some size constraints we want to consider with test data files? These aren't part of the package contents when published to PyPi right?

they're not part of the package, it's still nice to keep the repo small though. could we just use a small image (shouldn't change quality of the test right?)

May 17 '23 17:05 dev2049

It's a base64 encoded image. I wanted to make sure images were removed from the note and didn't end up in the page context. Do we have some size constraints we want to consider with test data files? These aren't part of the package contents when published to PyPi right?

they're not part of the package, it's still nice to keep the repo small though. could we just use a small image (shouldn't change quality of the test right?)

Okay sure thing, I've scaled down the image it's now 18KB instead of 2.1MB.

May 19 '23 02:05 MikeMcGarry

@dev2049 / @eyurtsev I've accommodated all the feedback and rebased onto master. Are there any outstanding points we need to address?

May 19 '23 02:05 MikeMcGarry

Code looks ready for merging. Waiting for tests to pass and then can merge in. Thank you @MikeMcGarry

May 19 '23 14:05 eyurtsev

langchain
langchain copied to clipboard

feature/4493 Improve Evernote Document Loader

Improve Evernote Document Loader

Before submitting

Who can review?

langchain langchain copied to clipboard

feature/4493 Improve Evernote Document Loader

Improve Evernote Document Loader

Before submitting

Who can review?

langchain
langchain copied to clipboard