added html file reading functionality

Open Ben-Nachmanson opened this issue 1 year ago • 5 comments

Hello, this should resolve issue #47 regarding HTML file reading. Excellent project; I'm hoping to follow along and help where I can! Don't hesitate to reach out if you need help with anything or have questions about what I implemented.

Ben-Nachmanson avatar Jun 03 '23 02:06 Ben-Nachmanson

@Ben-Nachmanson this looks great to me. Thanks for adding this. I have just one request: can you remove the example HTML file? The repo already contains one PDF file, and I don't want to confuse users if there are multiple files and the model starts answering from both sources. This is a great addition.

PromtEngineer avatar Jun 04 '23 06:06 PromtEngineer

Wouldn't this require some changes to the text splitter as well? Doing so would probably mean better parsing of the document to pull data from. Text splitter documentation: https://python.langchain.com/en/latest/reference/modules/text_splitter.html?highlight=html

Arham4 avatar Jun 04 '23 23:06 Arham4

> Wouldn't this require some changes to the text splitter as well? Doing so would probably mean better parsing of the document to pull data from. Text splitter documentation: https://python.langchain.com/en/latest/reference/modules/text_splitter.html?highlight=html

Elaborate. Based on my understanding, this would not require changes; the text splitter already has the ability to read HTML files.

Ben-Nachmanson avatar Jun 08 '23 18:06 Ben-Nachmanson

> > Wouldn't this require some changes to the text splitter as well? Doing so would probably mean better parsing of the document to pull data from. Text splitter documentation: https://python.langchain.com/en/latest/reference/modules/text_splitter.html?highlight=html
>
> Elaborate. Based on my understanding this would not require changes, the text splitter already has the ability to read html files.

I'm not very experienced in this domain; I just stumbled upon the language-specific text parsers and wanted to bring them to your attention. It may or may not be a requirement, though. I'll take your word that the text splitter can read HTML files.

Arham4 avatar Jun 08 '23 18:06 Arham4

I'm with @Arham4. It looks like TextSplitter can use a different set of separators specific to each programming or markup language, including HTML: https://python.langchain.com/en/latest/modules/indexes/text_splitters/examples/code_splitter.html#html

imjwang avatar Jun 14 '23 23:06 imjwang
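
To illustrate the point about language-specific separators, here is a minimal standalone sketch of recursive splitting. It is not LangChain's implementation: the function `recursive_split` and both separator lists are hypothetical, chosen only for this example, and small pieces are not merged back together as a real splitter would do. In LangChain itself this behaviour appears to be exposed through `RecursiveCharacterTextSplitter.from_language`; check the linked docs for the exact API.

```python
from typing import List

# Hypothetical separator lists for illustration only; a real library's lists differ.
GENERIC_SEPARATORS = ["\n\n", "\n", " "]
HTML_SEPARATORS = ["<body", "<div", "<h1", "<h2", "<p", "<li", "\n", " "]


def recursive_split(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Split text into chunks of at most chunk_size characters.

    Separators are tried from coarsest to finest. Each separator stays
    attached to the start of the piece that follows it, so an opening tag
    is kept together with its content.
    """
    if len(text) <= chunk_size:
        return [text] if text else []

    for index, sep in enumerate(separators):
        if sep not in text:
            continue
        parts = text.split(sep)
        # Re-attach the separator to every piece except the first, drop empties.
        pieces = [parts[0]] + [sep + part for part in parts[1:]]
        pieces = [piece for piece in pieces if piece]
        if len(pieces) > 1:
            chunks: List[str] = []
            for piece in pieces:
                # Pieces that are still too long are split with finer separators.
                chunks.extend(recursive_split(piece, separators[index + 1:], chunk_size))
            return chunks

    # No separator helped: fall back to fixed-width cuts.
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


html = "<h1>Title</h1><p>First paragraph of text.</p><p>Second paragraph of text.</p>"
html_chunks = recursive_split(html, HTML_SEPARATORS, chunk_size=40)
generic_chunks = recursive_split(html, GENERIC_SEPARATORS, chunk_size=40)

for chunk in html_chunks:
    print(repr(chunk))
```

With the HTML-aware list, the sample document is cut at element boundaries, whereas the generic list falls back to splitting on spaces and so cuts in the middle of elements. That difference is the "better parsing" raised earlier in the thread; whether it matters for retrieval quality in this project is a separate question the sketch does not answer.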

@Ben-Nachmanson @PromtEngineer any updates here?

devaskim avatar Aug 01 '23 18:08 devaskim