content-chatbot

Site map problem

Open bibhas2 opened this issue 2 years ago • 4 comments

Looking at create_embeddings.py, it appears that the code does not crawl what it finds in the sitemap. It seems to use only the URLs listed directly in the sitemap. If the sitemap links to other sitemaps, the script does not work. See the example attached here: sitemap_index.xml.txt

This should be made clear in the README.

bibhas2 avatar Apr 02 '23 16:04 bibhas2

Hi @bibhas2

do I understand it correctly that you are referring to not having the ability to parse sitemaps which link other sitemaps in turn?

mpaepper avatar Apr 09 '23 10:04 mpaepper

I'm coming across the same challenge, mostly with a sitemap generated by Yoast SEO in WordPress. That creates a "sitemap_index.xml" which links to individual sitemaps for each post type, and those "sub" sitemaps contain the actual web pages that need to be vectorized.
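
To illustrate the structure described above, here is a minimal sketch of how a sitemap index could be followed recursively. It is not the code from create_embeddings.py or from any fix in this thread. The function names `fetch_xml` and `collect_page_urls` are hypothetical, and it assumes the files follow the standard sitemaps.org schema, where a `<sitemapindex>` root lists `<sitemap><loc>` entries and a `<urlset>` root lists `<url><loc>` entries.

```python
import xml.etree.ElementTree as ET
from typing import Callable, List, Optional, Set
from urllib.request import urlopen

# Namespace defined by the sitemaps.org protocol.
NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def fetch_xml(url: str) -> bytes:
    """Download a sitemap and return its raw bytes (hypothetical helper)."""
    with urlopen(url, timeout=30) as resp:
        return resp.read()


def collect_page_urls(
    sitemap_url: str,
    fetch: Callable[[str], bytes] = fetch_xml,
    seen: Optional[Set[str]] = None,
) -> List[str]:
    """Return page URLs from a sitemap, following nested sitemap indexes."""
    seen = set() if seen is None else seen
    if sitemap_url in seen:
        # Guard against sitemaps that reference each other in a loop.
        return []
    seen.add(sitemap_url)

    root = ET.fromstring(fetch(sitemap_url))

    if root.tag == f"{NS}sitemapindex":
        # An index: each <sitemap><loc> points to another sitemap file.
        pages: List[str] = []
        for loc in root.findall(f"{NS}sitemap/{NS}loc"):
            if loc.text:
                pages.extend(collect_page_urls(loc.text.strip(), fetch, seen))
        return pages

    # A regular sitemap: each <url><loc> is an actual page.
    return [
        loc.text.strip()
        for loc in root.findall(f"{NS}url/{NS}loc")
        if loc.text
    ]
```

Called as `collect_page_urls("https://example.com/sitemap_index.xml")` (a placeholder URL), this would return the page URLs from all "sub" sitemaps in the order they are listed. The `fetch` parameter exists only so the download step can be swapped out, for example in tests.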

jan-koch avatar Jul 06 '23 07:07 jan-koch

> Hi @bibhas2
>
> do I understand it correctly that you are referring to not having the ability to parse sitemaps which link other sitemaps in turn?

Yes, you are correct. Sorry, I should have explained it better. I am modifying the bug report to add more details.

bibhas2 avatar Jul 06 '23 13:07 bibhas2

@mpaepper I'm not sure how to do a PR, so I'm sharing the updated file here.

Thanks to GPT (I'm not a Python developer), I was able to fix the challenge of scraping Yoast SEO sitemaps. Here's the gist with the updated create_embeddings.py:

https://gist.github.com/jan-koch/d563d3d5e182aaee83c0ba68f5c5520a

It works on my end, but as I said, I wrote most of the updated code with ChatGPT, and while I can read and somewhat understand it, I'm not sure whether it has any major flaws.

jan-koch avatar Jul 10 '23 09:07 jan-koch