localGPT
Programming Language Support for Documents
Originally posted by @sime2408 in https://github.com/PromtEngineer/localGPT/issues/151#issuecomment-1597633918
This ticket is to support different methods of Document splitting, specifically for different programming languages.
Currently, Documents are loaded and then split with the vanilla RecursiveCharacterTextSplitter. As noted in the langchain docs, this splitting is good for generic text because it keeps paragraphs together.
Different programming languages have different separators that should split source-code documents better. They can be selected with RecursiveCharacterTextSplitter.from_language.
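A minimal sketch of how this could look, assuming the langchain import path and `Language` enum values as of mid-2023 (`langchain.text_splitter`). The extension mapping, the helper names `language_for_path` and `make_splitter`, and the chunk sizes are hypothetical and not taken from localGPT:

```python
from pathlib import Path
from typing import Optional

# Hypothetical mapping from file extension to the string value of
# langchain's Language enum (e.g. Language.PYTHON has the value "python").
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".js": "js",
    ".java": "java",
    ".go": "go",
    ".cpp": "cpp",
    ".rs": "rust",
    ".md": "markdown",
}


def language_for_path(path: str) -> Optional[str]:
    """Return the language name for a file, or None for generic text."""
    return EXTENSION_TO_LANGUAGE.get(Path(path).suffix.lower())


def make_splitter(path: str, chunk_size: int = 1000, chunk_overlap: int = 200):
    """Build a language-aware splitter, falling back to the generic one."""
    # Imported lazily so the mapping above can be used without langchain.
    from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

    lang = language_for_path(path)
    if lang is None:
        # Unknown extension: keep the current generic behaviour.
        return RecursiveCharacterTextSplitter(
            chunk_size=chunk_size, chunk_overlap=chunk_overlap
        )
    return RecursiveCharacterTextSplitter.from_language(
        language=Language(lang),
        chunk_size=chunk_size,
        chunk_overlap=chunk_overlap,
    )
```

With this, `make_splitter("ingest.py").split_documents(docs)` would split on Python-specific separators such as class and function definitions, while a `.txt` file would still go through the generic splitter.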
It'll be efficient if we can load documents into a Dict (see #147 and linked conversation)
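One possible reading of the Dict idea is sketched below. This is an assumption, since #147 is not reproduced here: files are bucketed by extension so that each bucket can be split once with a matching splitter. The helper name `group_by_extension` is hypothetical.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List


def group_by_extension(paths: Iterable[str]) -> Dict[str, List[str]]:
    """Bucket file paths by lowercased extension ("" when there is none)."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for path in paths:
        groups[Path(path).suffix.lower()].append(path)
    return dict(groups)
```

Each key can then be used to pick a splitter once per group instead of once per file.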
@imjwang this is really helpful. I will be merging a major code change over the next couple of days. Can you please look into this afterwards? Thanks.
Certainly