Change methodology for scraping
It seems that the ReadTheDocsLoader parses and cleans HTML content from specific tags in the HTML file. If the HTML file doesn't contain one of those exact tags, the page_content will be empty.
The loader looks for a "main" tag with the id "main-content" and, if it doesn't find one, for a "div" tag with the role "main". If neither is found, it returns an empty string.
One way to fix this issue is to adjust the tags to those present in the HTML files to be scraped.
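To decide which tags to adjust to, it can help to check which container elements a given page actually has. The following is a minimal sketch using only Python's standard-library `html.parser`; it is not part of the loader, and the names `ContainerFinder` and `find_containers` are hypothetical, as is the choice of `main`, `article`, and `role="main"` as candidates.

```python
from html.parser import HTMLParser


class ContainerFinder(HTMLParser):
    """Record elements that commonly wrap the main content of a docs page."""

    def __init__(self):
        super().__init__()
        self.candidates = []  # list of (tag, id, role) tuples

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        # Assumption: <main>, <article>, or anything with role="main"
        # is a plausible content container.
        if tag in ("main", "article") or attrs.get("role") == "main":
            self.candidates.append((tag, attrs.get("id"), attrs.get("role")))


def find_containers(html):
    """Return the candidate containers found in an HTML string."""
    finder = ContainerFinder()
    finder.feed(html)
    finder.close()
    return finder.candidates
```

Running `find_containers` on a sample page from the docs site shows whether the tags the loader expects are present or whether a different tag would need to be targeted.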
@eyurtsev
this is a pretty big change to the logic right?
I wrote this loader, and IIRC I did this because the body by itself contained a lot of irrelevant info
This is true, but every HTML file has a body tag, while not all of them have the tags the current version looks for. Many docs sites use inconsistent formatting. Perhaps a version could be implemented to identify all content types...
Not to mention the AI is pretty good at determining relevant information. In conclusion, it's better to have irrelevant content scraped than no content at all, as long as you can discern relevancy efficiently.
Would it be better to use the tags if they exist and, if not, fall back to a different strategy of using the body tag?
Yes.
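A rough sketch of what such a fallback could look like, written with only the standard-library `html.parser` so it runs without dependencies. This is not the loader's actual code: `ElementText`, `text_of`, `PREFERRED`, and `extract_content` are hypothetical names, and the whitespace handling is a simplification.

```python
from html.parser import HTMLParser


class ElementText(HTMLParser):
    """Collect the text inside the first element matching a tag and attributes."""

    def __init__(self, tag, attrs=None):
        super().__init__()
        self.tag = tag
        self.attrs = attrs or {}
        self.depth = 0      # nesting level of same-named tags inside the match
        self.found = False  # True once a matching element has been seen
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == self.tag:
                self.depth += 1
        elif (not self.found and tag == self.tag
              and self.attrs.items() <= dict(attrs).items()):
            self.found = True
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def text_of(html, tag, attrs=None):
    """Return the whitespace-normalized text of the first match, or None."""
    parser = ElementText(tag, attrs)
    parser.feed(html)
    parser.close()
    if not parser.found:
        return None
    return " ".join(" ".join(parser.parts).split())


# The two lookups described earlier in this thread, tried in order.
PREFERRED = (("main", {"id": "main-content"}), ("div", {"role": "main"}))


def extract_content(html):
    """Use the preferred tags if present, otherwise fall back to <body>."""
    for tag, attrs in PREFERRED:
        text = text_of(html, tag, attrs)
        if text is not None:
            return text
    # Fallback: the whole body, which may include navigation and other noise.
    return text_of(html, "body") or ""
```

With this ordering, pages that have the expected tags keep producing the same focused text, and only pages that would otherwise come back empty get the noisier body text.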
@Haste171 let us know if you're interested in implementing the fallback strategy
Yes, I'll try to get something up and working ASAP.
Feel free to re-open if you end up doing any work on this! Thanks @Haste171