langchain Change methodology for scraping

It seems that the ReadTheDocsLoader is trying to parse and clean HTML contents from specific tags in the HTML file. If the HTML file doesn't contain the exact tag, the page_content will be empty.

The loader looks for a "main" tag with the id "main-content"; if it doesn't find one, it looks for a "div" tag with the role "main". If neither is found, it returns an empty string.
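For illustration, the lookup described above can be sketched roughly like this. This is not the loader's actual source, just a minimal reproduction that assumes BeautifulSoup (bs4) is installed; the function name and the sample pages are made up for the example.

```python
from bs4 import BeautifulSoup


def extract_main_text(html: str) -> str:
    """Hypothetical helper mirroring the lookup order described above."""
    soup = BeautifulSoup(html, "html.parser")
    # First choice: <main id="main-content">
    node = soup.find("main", {"id": "main-content"})
    if node is None:
        # Second choice: <div role="main">
        node = soup.find("div", {"role": "main"})
    if node is None:
        # Neither container exists, so nothing is extracted
        return ""
    return node.get_text(separator="\n", strip=True)


# Made-up sample pages: one with an expected container, one without
page_with_expected_tags = '<html><body><div role="main"><p>Docs text</p></div></body></html>'
page_without_expected_tags = "<html><body><article><p>Docs text</p></article></body></html>"

print(repr(extract_main_text(page_with_expected_tags)))
print(repr(extract_main_text(page_without_expected_tags)))
```

The second page has real content, but since it uses neither container the result is an empty string, which is the behavior this change is trying to address.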

One way to fix this issue is to adjust the tags to those present in the HTML files to be scraped.

@eyurtsev

May 31 '23 16:05 Haste171

This is a pretty big change to the logic, right?

I wrote this loader, and IIRC I did this because body by itself contained a lot of irrelevant info.

This is true, but every HTML file has a body tag, while not all of them have the tags used in the current version. Different docs sites are formatted inconsistently. Perhaps a version could be implemented to identify all content types...

May 31 '23 22:05 Haste171

Not to mention the AI is pretty good at determining relevant information. In conclusion, it's better to have irrelevant content scraped than no content at all, as long as you can discern relevancy efficiently.

May 31 '23 22:05 Haste171

Would it be better to use the tags if they exist and, if not, fall back to a different strategy of using the body tag?
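A rough sketch of what that fallback could look like, again assuming BeautifulSoup (bs4) is available. The names PREFERRED_SELECTORS and extract_text_with_fallback are hypothetical and not part of the loader; this only illustrates the idea, it is not a proposed patch.

```python
from bs4 import BeautifulSoup

# Hypothetical list of preferred containers, tried in order
PREFERRED_SELECTORS = [
    ("main", {"id": "main-content"}),
    ("div", {"role": "main"}),
]


def extract_text_with_fallback(html: str) -> str:
    """Use the specific tags when present, otherwise fall back to <body>."""
    soup = BeautifulSoup(html, "html.parser")
    for name, attrs in PREFERRED_SELECTORS:
        node = soup.find(name, attrs)
        if node is not None:
            return node.get_text(separator="\n", strip=True)
    # Fallback: whole body, which may include navigation and other noise
    body = soup.body
    if body is None:
        return ""
    return body.get_text(separator="\n", strip=True)
```

With this ordering, pages that have the specific tags keep the cleaner output, and pages that lack them return the full body text (menus and footers included) instead of an empty string.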

Jun 02 '23 02:06 eyurtsev

Would it be better to use the tags if they exist and, if not, fall back to a different strategy of using the body tag?

Yes.

Jun 02 '23 02:06 Haste171

@Haste171 let us know if you're interested in implementing the fallback strategy

Jun 11 '23 20:06 eyurtsev

Yes, I'll try to get something up and working ASAP.

Jun 12 '23 12:06 Haste171

Feel free to re-open if you end up doing any work on this! Thanks @Haste171

Jul 24 '23 14:07 eyurtsev
