localGPT
Programming Language Support for Documents
Originally posted by @sime2408 in https://github.com/PromtEngineer/localGPT/issues/151#issuecomment-1597633918
This ticket is to support different methods of Document splitting, specifically for different programming languages.
Currently, Documents are loaded and then split with the vanilla RecursiveCharacterTextSplitter. As noted in the langchain docs, this splitting works well for generic text because it keeps paragraphs together.
Different programming languages have different separators that should split source-code documents better. They can be selected with RecursiveCharacterTextSplitter.from_language.
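To illustrate why language-specific separators help, here is a minimal pure-Python sketch of recursive splitting. It is not langchain's implementation: the `SEPARATORS` table, the `split_text` helper, and the separator lists are hypothetical and only loosely modeled on the idea of trying coarse, language-aware boundaries first. Unlike a real splitter, it does not merge small pieces or add chunk overlap.

```python
# Hypothetical separator lists, ordered from coarsest to finest boundary.
SEPARATORS = {
    "python": ["\nclass ", "\ndef ", "\n\n", "\n", " "],
    "generic": ["\n\n", "\n", " "],
}


def split_text(text, separators, chunk_size):
    """Recursively split `text`, trying the coarsest separator first.

    A piece that is still longer than `chunk_size` is split again with the
    remaining, finer separators. If no separators are left, the piece is
    returned as-is even when it exceeds `chunk_size`.
    """
    if len(text) <= chunk_size or not separators:
        return [text]
    sep, finer = separators[0], separators[1:]
    pieces = text.split(sep)
    # Re-attach the separator to every piece after the first so no text is lost.
    pieces = [pieces[0]] + [sep + piece for piece in pieces[1:]]
    chunks = []
    for piece in pieces:
        if piece:
            chunks.extend(split_text(piece, finer, chunk_size))
    return chunks


source = "import os\n\ndef a():\n    return 1\n\ndef b():\n    return 2\n"
chunks = split_text(source, SEPARATORS["python"], chunk_size=30)
for chunk in chunks:
    print(repr(chunk))
```

With the Python separators, each function definition ends up in its own chunk instead of being cut at an arbitrary paragraph or line break. In the project itself, the same effect would presumably come from calling RecursiveCharacterTextSplitter.from_language rather than from a hand-written helper like this one.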
It would be more efficient if we could load documents into a Dict (see #147 and the linked conversation).
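One possible reading of the Dict idea, sketched under assumptions: group file paths by language up front, so each group can later be split with the matching separators. The `EXTENSION_TO_LANGUAGE` mapping and the `group_by_language` helper are hypothetical names for illustration and are not taken from #147 or from localGPT.

```python
import os
from collections import defaultdict

# Hypothetical mapping from file extension to a language key.
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".js": "js",
    ".java": "java",
    ".md": "markdown",
}


def group_by_language(paths):
    """Return a dict mapping a language key to the list of paths for it.

    Files with an unknown extension fall back to the "generic" key.
    """
    groups = defaultdict(list)
    for path in paths:
        extension = os.path.splitext(path)[1].lower()
        language = EXTENSION_TO_LANGUAGE.get(extension, "generic")
        groups[language].append(path)
    return dict(groups)


grouped = group_by_language(["a.py", "b.JS", "notes.txt", "c.py"])
print(grouped)
```

Grouping first means a splitter only needs to be constructed once per language rather than once per file.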
@imjwang this is really helpful. I will be merging a major code change over the next couple of days. Can you please look into this afterwards? Thanks
Certainly