Programming Language Support for Documents

Open imjwang opened this issue 1 year ago • 2 comments

Originally posted by @sime2408 in https://github.com/PromtEngineer/localGPT/issues/151#issuecomment-1597633918

This ticket proposes supporting different methods of Document splitting, specifically for different programming languages.

Currently, Documents are loaded and then split with the vanilla RecursiveCharacterTextSplitter. As noted in the LangChain docs, this splitter works well for generic text because it tries to keep paragraphs together.

Different programming languages have different separators that should split source-code documents better. They can be selected with RecursiveCharacterTextSplitter.from_language
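
To illustrate why language-specific separators matter, here is a minimal pure-Python sketch of recursive separator-based splitting. It is not LangChain's implementation: the `SEPARATORS` lists are abbreviated and hypothetical, `recursive_split` is a made-up helper, and the merging of small pieces and chunk overlap that a real splitter performs are left out.

```python
from typing import List

# Hypothetical, abbreviated separator lists, ordered from coarse to fine.
# LangChain ships its own per-language lists; these are for illustration only.
SEPARATORS = {
    "text": ["\n\n", "\n", " "],
    "python": ["\nclass ", "\ndef ", "\n\n", "\n", " "],
}


def recursive_split(text: str, separators: List[str], chunk_size: int) -> List[str]:
    """Split on the coarsest separator, recursing only into oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to fixed-width slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, rest, chunk_size)
    parts = text.split(sep)
    # Re-attach the separator to the front of each later piece so that
    # keywords such as "def" stay with the code they introduce.
    pieces = [parts[0]] + [sep + part for part in parts[1:]]
    chunks: List[str] = []
    for piece in pieces:
        chunks.extend(recursive_split(piece, rest, chunk_size))
    return chunks


sample = (
    "def first():\n"
    "    a = 1\n"
    "\n"
    "    return a\n"
    "\n"
    "def second():\n"
    "    return 2\n"
)

# Python-aware separators cut at function boundaries.
python_chunks = recursive_split(sample, SEPARATORS["python"], chunk_size=45)
# Generic text separators cut at blank lines, even inside a function body.
text_chunks = recursive_split(sample, SEPARATORS["text"], chunk_size=45)
```

With this sample, the Python separators keep each function in its own chunk, while the generic separators also cut `first()` at the blank line inside its body. In localGPT itself the equivalent would presumably be a call to `RecursiveCharacterTextSplitter.from_language` with the matching `Language` value rather than a hand-rolled splitter.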

Code Splitter from Langchain

It'll be efficient if we can load documents into a Dict (see #147 and linked conversation)
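
One possible reading of the Dict idea, sketched under assumptions: group file paths by language so that each group can be handed to the matching splitter in a single pass. The `EXTENSION_TO_LANGUAGE` mapping and the `group_by_language` helper are hypothetical and are not taken from localGPT or from #147.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, Iterable, List

# Hypothetical mapping from file extension to a language label.
EXTENSION_TO_LANGUAGE = {
    ".py": "python",
    ".js": "js",
    ".java": "java",
}


def group_by_language(paths: Iterable[str]) -> Dict[str, List[str]]:
    """Bucket paths by language; unknown extensions fall back to 'text'."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for path in paths:
        extension = Path(path).suffix.lower()
        language = EXTENSION_TO_LANGUAGE.get(extension, "text")
        groups[language].append(path)
    return dict(groups)


groups = group_by_language(["ingest.py", "README.md", "app.JS", "notes.txt"])
```

Each key could then select a splitter once per group, instead of deciding per document.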

imjwang avatar Jun 20 '23 17:06 imjwang

@imjwang this is really helpful. I will be merging a major code change over the next couple of days. Can you please look into this afterwards? Thanks

PromtEngineer avatar Jun 22 '23 22:06 PromtEngineer

Certainly

imjwang avatar Jun 23 '23 00:06 imjwang