semantic-kernel
Specialized TextChunkers .Net
I'd like to be able to use specialized chunkers, selected by file type, when ingesting data. A good example use case is the GitHub Q&A sample app, which parses only markdown files to summarize a repository. That assumes the repository is properly documented. If it could also parse .cs, .py, or other source file extensions, the GithubSkill would be a lot more powerful, because it would no longer rely on robust documentation.
I can provide a starting point for chunkers for both .cs and .py files; it's just not clear where in the SDK they belong.
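To illustrate what a language-aware chunker might look like, here is a minimal sketch for .py files. It is an assumption about the general approach, not code from the SDK: the function name `chunk_python` is hypothetical, and it simply uses Python's standard `ast` module to emit one chunk per top-level statement (import, function, class) instead of splitting on line or token counts.

```python
import ast
from typing import List


def chunk_python(source: str) -> List[str]:
    """Split Python source into one chunk per top-level statement.

    Hypothetical helper for illustration; not part of Semantic Kernel.
    Requires Python 3.8+ for ast's end_lineno attribute.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        # Decorators sit above the def/class line, so start at the
        # earliest line that belongs to this statement.
        decorators = getattr(node, "decorator_list", [])
        first_line = min([node.lineno] + [d.lineno for d in decorators])
        # lineno is 1-based and end_lineno is inclusive.
        chunks.append("\n".join(lines[first_line - 1:node.end_lineno]))
    return chunks


if __name__ == "__main__":
    sample = "import os\n\ndef f():\n    return 1\n\nclass A:\n    x = 2\n"
    for chunk in chunk_python(sample):
        print(repr(chunk))
```

A .cs chunker would need a C# parser rather than `ast`, but the shape would be similar: keep each type or member together so a chunk carries a complete unit of meaning.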
We'd love your contributions! In Python we just landed this: https://github.com/microsoft/semantic-kernel/pull/450
And in C#, the existing chunker is here: https://github.com/microsoft/semantic-kernel/blob/main/dotnet/src/SemanticKernel/Text/TextChunker.cs
I'd say implementing more chunkers there would be a good starting point!
@lemillermicrosoft @dluc thoughts?
@shawncal can you reply with which folder would be the best place to put chunkers?
Just FYI, this was meant for .Net.
bump
Also wondering whether extending DocumentSkill would be a better place to implement chunkers.
@dluc Can we look into this for the backlog?
I like the idea of making chunkers plugins/functions
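To make the plugin/function idea a bit more concrete, here is a rough sketch of chunkers registered as plain functions keyed by file extension. Everything in it is an assumption for illustration: `register_chunker`, `chunk_for_file`, and the individual chunkers are hypothetical names, not Semantic Kernel APIs, and it is written in Python only for brevity even though this issue targets .Net. The line-window chunker is a placeholder where a syntax-aware .cs or .py chunker would go.

```python
from pathlib import Path
from typing import Callable, Dict, List

# Hypothetical contract: a chunker maps file text to a list of chunks.
Chunker = Callable[[str], List[str]]

_REGISTRY: Dict[str, Chunker] = {}


def register_chunker(*extensions: str):
    """Decorator that registers a chunker for one or more file extensions."""
    def decorator(func: Chunker) -> Chunker:
        for ext in extensions:
            _REGISTRY[ext.lower()] = func
        return func
    return decorator


def chunk_paragraphs(text: str) -> List[str]:
    """Fallback chunker: split on blank lines and drop empty pieces."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


@register_chunker(".md", ".txt")
def chunk_prose(text: str) -> List[str]:
    return chunk_paragraphs(text)


@register_chunker(".cs", ".py")
def chunk_code_lines(text: str, max_lines: int = 40) -> List[str]:
    """Placeholder code chunker: fixed windows of max_lines lines."""
    lines = text.splitlines()
    return ["\n".join(lines[i:i + max_lines])
            for i in range(0, len(lines), max_lines)]


def chunk_for_file(filename: str, text: str) -> List[str]:
    """Pick a chunker by extension, falling back to paragraph splitting."""
    chunker = _REGISTRY.get(Path(filename).suffix.lower(), chunk_paragraphs)
    return chunker(text)
```

With this shape, supporting a new file type only means registering one more function, and the ingestion code that calls `chunk_for_file` does not change.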
All .Net issues prior to 1-Dec-2023 are being closed. Please re-open, if this issue is still relevant to the .Net Semantic Kernel 1.x release. In the future all issues that are inactive for more than 90 days will be labelled as 'stale' and closed 14 days later.