localGPT
Added HTML file reading functionality
Hello, this should solve issue #47 pertaining to HTML file reading. Excellent project, hoping to follow and help where I can! Don't hesitate to reach out if you need help with anything or have questions about what I implemented.
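For readers following the thread, here is a minimal sketch of what "HTML file reading" can mean in practice: turning markup into plain text before it is split and embedded. It uses only the Python standard library and is an illustration, not the code in this PR; the names `_TextExtractor` and `html_to_text` are hypothetical.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping the contents of <script> and <style>."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # > 0 while inside a skipped tag
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped tags.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML string, one text node per line."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

A loader built this way would read the file, call `html_to_text` on its contents, and hand the result to the existing splitter.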
@Ben-Nachmanson this looks great to me. Thanks for adding this. I have just one request: can you remove the example HTML file? The repo currently contains a single PDF file, and I don't want to confuse users if there are multiple files and the model starts answering from both sources. This is a great addition.
Wouldn't this require some changes to the text splitter as well? Doing so would probably mean better parsing of the document to pull data from. Text splitter documentation: https://python.langchain.com/en/latest/reference/modules/text_splitter.html?highlight=html
Could you elaborate? Based on my understanding, this would not require changes: the text splitter already has the ability to read HTML files.
I'm not very experienced in this domain; I just stumbled upon language-specific text splitters and wanted to bring them to your attention. It may or may not be a requirement. I'll take your word that the text splitter can read HTML files.
I'm with @Arham4: it looks like TextSplitter can use a different set of separators specific to programming and markup languages, including HTML. https://python.langchain.com/en/latest/modules/indexes/text_splitters/examples/code_splitter.html#html
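To make the idea concrete, here is a rough standalone sketch of recursive splitting with HTML-specific separators. It is not LangChain's implementation: the separator list and the names `HTML_SEPARATORS`, `_split_keep_separator`, and `split_html` are assumptions for illustration, and unlike a real splitter it does not merge small pieces back together or distinguish `<p` from `<pre`.

```python
# Illustrative tag prefixes, tried in order from coarse to fine (an assumption,
# not the list any particular library uses).
HTML_SEPARATORS = ["<body", "<div", "<p", "<br", "<li", "<h1", "<h2", "<h3"]


def _split_keep_separator(text, sep):
    """Split text before each occurrence of sep, keeping sep at the piece start."""
    pieces = []
    start = 0
    idx = text.find(sep, 1)  # start at 1 so a leading sep yields no empty piece
    while idx != -1:
        pieces.append(text[start:idx])
        start = idx
        idx = text.find(sep, start + 1)
    pieces.append(text[start:])
    return pieces


def split_html(text, chunk_size, separators=HTML_SEPARATORS):
    """Recursively split text at tag boundaries until pieces fit chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if len(text) <= chunk_size:
        return [text] if text else []
    for i, sep in enumerate(separators):
        pieces = _split_keep_separator(text, sep)
        if len(pieces) > 1:
            chunks = []
            for piece in pieces:
                # Oversized pieces are retried with the finer separators only.
                chunks.extend(split_html(piece, chunk_size, separators[i + 1:]))
            return chunks
    # No separator helped: fall back to fixed-width slices.
    return [text[j:j + chunk_size] for j in range(0, len(text), chunk_size)]
```

The point of the separator list is that chunks tend to break at element boundaries rather than mid-sentence, which is the "better parsing" benefit raised above. Whether that matters here depends on whether the loader passes raw markup or already-extracted text to the splitter.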
@Ben-Nachmanson @PromtEngineer any updates there?