added SitemapLoader
Relates to https://github.com/embedchain/embedchain/issues/46. Added a SitemapLoader class that iterates over the URLs listed in an XML sitemap and internally calls WebPageLoader to load each one. It is paired with WebPageChunker to produce embeddings for an entire website in one go.
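For readers unfamiliar with the idea, here is a minimal sketch of the approach described above, not the code in this PR. It assumes a standard sitemaps.org document and uses only the standard library; `extract_sitemap_urls`, `load_sitemap`, and the `load_page` callback (a stand-in for whatever the web page loader does) are hypothetical names chosen for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List

# Namespace defined by the sitemaps.org protocol for <urlset>, <url>, <loc>.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def extract_sitemap_urls(xml_text: str) -> List[str]:
    """Return the text of every <url><loc> entry in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


def load_sitemap(xml_text: str, load_page: Callable[[str], str]) -> List[dict]:
    """Call load_page (hypothetical page loader) once per sitemap URL."""
    return [
        {"url": url, "content": load_page(url)}
        for url in extract_sitemap_urls(xml_text)
    ]


# Small in-memory sitemap so the sketch runs without network access.
SAMPLE = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/docs</loc></url>
</urlset>"""

if __name__ == "__main__":
    for doc in load_sitemap(SAMPLE, lambda url: f"body of {url}"):
        print(doc["url"], "->", doc["content"])
```

Passing the page loader in as a callback keeps the sitemap parsing testable without fetching anything; in the real class the loader call would presumably be made internally, as the description says.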
@aaishikdutta looks cool. I tried to run it but got an error:
AttributeError: 'SitemapLoader' object has no attribute 'create_chunks'
embedchain/chunkers/site_map.py might be missing.
@gasolin, I have mapped WebPageChunker to site_map as well. It works for me locally. Can you please paste your code so that I can reproduce it?
@aaishikdutta sorry, false alarm. I copied chunkers to SitemapLoader instead of WebPageChunker 😅
all good @gasolin. Let me know if you find any other issues in this.
Can you write an exception for paths that point to images? Some sitemaps contain them. Thanks.
Hey @cachho, are you referring to this syntax? Sorry, I haven't worked with these sitemaps before:
I haven't seen that one, I guess you're just taking the <loc> so that images would be excluded anyways, right?
No, I'm talking about stuff like this: https://en.wikipedia.org/wiki/File:Streathamladyjayne.jpg, where File: is in the URL. These are included in MediaWiki sitemaps. I would like them to be excluded, because the embedding model can't handle binary files anyways, as far as I know.
But I see how that would be too individual. I might have to filter the sitemap myself.
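For anyone who wants to do that filtering on their own side, a rough sketch could look like the following. It is an assumption-heavy illustration, not part of this PR: `is_media_url`, `filter_page_urls`, and the suffix list are hypothetical, and the `File:` check only covers the MediaWiki-style URLs mentioned above.

```python
from typing import Iterable, List
from urllib.parse import unquote, urlparse

# Hypothetical list of suffixes treated as binary content; adjust as needed.
MEDIA_SUFFIXES = (".jpg", ".jpeg", ".png", ".gif", ".svg", ".webp", ".pdf")


def is_media_url(url: str) -> bool:
    """Guess whether a URL points at a media file rather than a text page."""
    path = unquote(urlparse(url).path)
    last_segment = path.rsplit("/", 1)[-1]
    # MediaWiki file pages look like /wiki/File:Name.jpg
    if last_segment.startswith("File:"):
        return True
    return path.lower().endswith(MEDIA_SUFFIXES)


def filter_page_urls(urls: Iterable[str]) -> List[str]:
    """Keep only the URLs that do not look like media files."""
    return [url for url in urls if not is_media_url(url)]


if __name__ == "__main__":
    urls = [
        "https://en.wikipedia.org/wiki/Streatham",
        "https://en.wikipedia.org/wiki/File:Streathamladyjayne.jpg",
    ]
    print(filter_page_urls(urls))
```

The filtered list can then be loaded page by page, so the rule stays outside the loader and each site can use its own.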
forget what I said, this is fine.
resolved the merge conflicts for you
Looks good. Can you please resolve the merge conflicts and run make format on the code, per the repository guidelines? Thanks!
updated @deshraj
Please add config and run make format lint.
updated @cachho
