
Minor Change For New Wikipedia Loader

Open T1b4lt opened this issue 2 years ago • 4 comments

It's important for documents to have a metadata["source"] field, for example for index.query_with_sources().
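
To illustrate why the key matters, here is a minimal sketch that does not use LangChain itself. `Doc` and `collect_sources` are hypothetical stand-ins (not LangChain APIs) for a document type and for source-aware query code that reads metadata["source"]. The assumption is only that such code looks up that one key, so a document without it cannot be cited.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a loader's document type."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def collect_sources(docs):
    """Return the distinct metadata["source"] values, in first-seen order.

    Documents without a "source" key contribute nothing, which is how a
    missing key turns into a missing citation.
    """
    seen = []
    for doc in docs:
        source = doc.metadata.get("source")
        if source is not None and source not in seen:
            seen.append(source)
    return seen


docs = [
    Doc("Text A", {"source": "https://en.wikipedia.org/wiki/Python"}),
    # Only "page_url" is set here, so this document yields no source.
    Doc("Text B", {"page_url": "https://en.wikipedia.org/wiki/Rust"}),
]
print(collect_sources(docs))
```

The second document is silently skipped even though it carries a usable URL under another key.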

@eyurtsev

T1b4lt avatar May 09 '23 08:05 T1b4lt

I like it. Any objections, @eyurtsev @leo-gan?

dev2049 avatar May 09 '23 17:05 dev2049

It's a breaking change, so if we're OK proceeding, let's relabel the commit title to "Breaking Change" so we remember to include it in the release notes as such.

We'll need to resolve the merge conflict first, and I'm looking for input from @leo-gan.

eyurtsev avatar May 09 '23 19:05 eyurtsev

> It's a breaking change, so if we're OK proceeding, let's relabel commit title to Breaking Change so we remember to include it in the release notes as such.
>
> Will need to resolve merge conflict first, and looking for input from @leo-gan

We could just keep "page_url" in metadata as well.
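
A minimal sketch of that non-breaking option, assuming the loader's metadata is a plain dict. `add_source_alias` is a hypothetical helper, and the "title" key in the example is an assumption; only "page_url" and "source" come from this thread.

```python
def add_source_alias(metadata, legacy_key="page_url"):
    """Return a copy of metadata with "source" set, keeping the legacy key.

    An existing "source" value is left untouched, and the input dict is
    not modified.
    """
    updated = dict(metadata)
    if "source" not in updated and legacy_key in updated:
        updated["source"] = updated[legacy_key]
    return updated


wiki_metadata = {
    "title": "Python",  # assumed key, for illustration only
    "page_url": "https://en.wikipedia.org/wiki/Python",
}
updated = add_source_alias(wiki_metadata)
print(updated["source"] == updated["page_url"])
```

Code that still reads "page_url" keeps working, while source-aware code finds the same URL under "source".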

dev2049 avatar May 09 '23 19:05 dev2049

I'm in favor of having a generic provenance field that captures the protocol / storage. As long as the provenance field is completely specified, it makes it easy to treat all content on an equal footing, regardless of whether it came from S3, a website, or a row in a Postgres database.

With that said, I doubt that our sources are specified correctly at the moment, but would be in favor of moving in that direction.
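
One way to picture a completely specified provenance field is a URI whose scheme names the protocol or storage. The sketch below is only an illustration under that assumption: `group_by_scheme` is a hypothetical helper, and the bucket, host, and table names are made up. It uses `urllib.parse.urlsplit` from the Python standard library.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def group_by_scheme(sources):
    """Group provenance URIs by scheme (protocol / storage).

    The same code path handles every backend, because the scheme is
    part of each source string.
    """
    groups = defaultdict(list)
    for source in sources:
        groups[urlsplit(source).scheme].append(source)
    return dict(groups)


sources = [
    "s3://example-bucket/docs/report.txt",       # made-up bucket and key
    "https://en.wikipedia.org/wiki/Python",
    "postgresql://db.example/wiki/articles/42",  # made-up host, db, table, row
]
print(group_by_scheme(sources))
```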

eyurtsev avatar May 12 '23 02:05 eyurtsev

> I'm in favor of having a generic provenance field that captures the protocol / storage. As long as the provenance field is completely specified it makes it easy to treat all content on an equal footing regardless of whether it came from s3, a website or a row in a postgres database.
>
> With that said, I doubt that our sources are specified correctly at the moment, but would be in favor of moving in that direction.

Any suggestions for this PR specifically? (For now, that is; I agree we should come up with a more thoughtful approach in the medium term.)

dev2049 avatar May 15 '23 20:05 dev2049