added SitemapLoader

Open aaishikdutta opened this issue 2 years ago • 4 comments

Relates to https://github.com/embedchain/embedchain/issues/46. Added a SitemapLoader class that iterates over the list of URLs in an XML sitemap and internally calls the WebPageLoader to load each URL. It is paired with the WebPageChunker to produce embeddings for an entire website in one go.
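For readers unfamiliar with the sitemap format, here is a minimal sketch of the URL-extraction step described above. It is not the code from this PR: the function name `extract_sitemap_urls` is hypothetical, it takes the sitemap XML as a string instead of fetching it, and it uses only the standard library. Handing each returned URL to the WebPageLoader is left out.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace defined by the sitemaps.org protocol for <urlset>, <url> and <loc>.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def extract_sitemap_urls(xml_text: str) -> List[str]:
    """Return the page URLs listed in <loc> elements of a sitemap (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    urls = []
    # Only <loc> elements in the sitemap namespace are matched, so entries from
    # extension namespaces (for example <image:loc>) are skipped.
    for loc in root.iter(f"{SITEMAP_NS}loc"):
        if loc.text and loc.text.strip():
            urls.append(loc.text.strip())
    return urls
```

Each URL returned this way could then be passed to a page loader one at a time.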

aaishikdutta avatar Jun 25 '23 20:06 aaishikdutta

@aaishikdutta looks cool. I tried to run it but got an error:

AttributeError: 'SitemapLoader' object has no attribute 'create_chunks'

It might be missing embedchain/chunkers/site_map.py

gasolin avatar Jun 27 '23 07:06 gasolin

@gasolin, I have mapped WebPageChunker to site_map as well. It works on my local machine. Can you please paste your code so that I can reproduce it?

aaishikdutta avatar Jun 27 '23 08:06 aaishikdutta

@aaishikdutta sorry false alert, I copied chunkers to SitemapLoader instead of WebPageChunker 😅

gasolin avatar Jun 27 '23 08:06 gasolin

all good @gasolin. Let me know if you find any other issues in this.

aaishikdutta avatar Jun 27 '23 08:06 aaishikdutta

Can you write an exception for paths that point to images? Some sitemaps contain them. Thanks.

cachho avatar Jul 06 '23 20:07 cachho

Hey @cachho, are you referring to this syntax? Sorry, I haven't worked with these sitemaps previously:

Screenshot_2023-07-07-08-31-06-41_40deb401b9ffe8e1df2f1cc5ba480b12.jpg

aaishikdutta avatar Jul 07 '23 03:07 aaishikdutta

I haven't seen that one, I guess you're just taking the <loc> so that images would be excluded anyways, right?

No, I'm talking about URLs like this: https://en.wikipedia.org/wiki/File:Streathamladyjayne.jpg, where File: is in the URL. These are included in MediaWiki sitemaps. I would like them to be excluded, because the embedding model can't handle binary files anyways, as far as I know.

But I see how that would be too specific to individual sites. I might have to filter the sitemap myself.
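A rough sketch of what such a filter could look like, applied to the URLs before they are loaded. This is an illustration only, not part of the PR: `filter_page_urls` and `IMAGE_EXTENSIONS` are hypothetical names, and the rules (drop MediaWiki `File:` pages and paths ending in a common image extension) are assumptions about what "points to an image" means.

```python
from typing import Iterable, List
from urllib.parse import unquote, urlparse

# Assumed list of image extensions; extend it as needed for a given site.
IMAGE_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".svg", ".webp")


def filter_page_urls(urls: Iterable[str]) -> List[str]:
    """Drop URLs that look like image files or MediaWiki File: pages (hypothetical helper)."""
    kept = []
    for url in urls:
        # Decode percent-escapes so that "File%3A..." is treated like "File:...".
        path = unquote(urlparse(url).path)
        last_segment = path.rsplit("/", 1)[-1]
        if last_segment.startswith("File:"):
            continue  # MediaWiki file description page
        if path.lower().endswith(IMAGE_EXTENSIONS):
            continue  # direct link to an image file
        kept.append(url)
    return kept
```

Because the rules differ from site to site, a filter like this probably belongs in user code rather than in the loader itself.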

cachho avatar Jul 10 '23 10:07 cachho

forget what I said, this is fine.

cachho avatar Jul 10 '23 20:07 cachho

resolved the merge conflicts for you

cachho avatar Jul 11 '23 14:07 cachho

Looks good. Can you please resolve the merge conflicts and run make format on the code, per the repository guidelines? Thanks!

updated @deshraj

Please add config and run make format lint

updated @cachho

aaishikdutta avatar Jul 11 '23 15:07 aaishikdutta