
Specialized TextChunkers .Net

Open Kevdome3000 opened this issue 1 year ago • 5 comments

I'd like the ability to use specialized chunkers, selected by file type, when ingesting data. A good example use case is the GitHub Q&A sample app, which only parses markdown files to summarize a repository. That assumes the repository has proper documentation. If it could also parse, say, .cs, .py, or other source file extensions, the GithubSkill would be a lot more powerful because it wouldn't rely on robust documentation.

I can provide a starting point for chunkers for both .cs and .py files; it's just not clear where in the SDK they belong.
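To make the request concrete, here is a rough sketch of what a file-type-specific chunker could look like. It is written in Python for brevity and is not part of the SDK: `chunk_python_source`, `chunk_plain_text`, `chunk_file`, and the `CHUNKERS` table are all hypothetical names invented for this illustration. It assumes a chunker is simply a function from text to a list of strings, picked by file extension, with a line-based fallback for unknown types.

```python
import ast
import os
from typing import Callable, Dict, List


def chunk_python_source(source: str) -> List[str]:
    """Return one chunk per top-level function or class (hypothetical helper).

    Other top-level statements (imports, constants) are skipped in this sketch.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            # Start at the first decorator, if any, so it stays with its definition.
            start = min([node.lineno] + [d.lineno for d in node.decorator_list])
            chunks.append("\n".join(lines[start - 1:node.end_lineno]))
    return chunks


def chunk_plain_text(text: str, max_lines: int = 40) -> List[str]:
    """Fallback chunker: fixed-size groups of lines (an assumed default)."""
    lines = text.splitlines()
    return ["\n".join(lines[i:i + max_lines]) for i in range(0, len(lines), max_lines)]


# Hypothetical registry mapping file extensions to specialized chunkers.
CHUNKERS: Dict[str, Callable[[str], List[str]]] = {
    ".py": chunk_python_source,
}


def chunk_file(filename: str, text: str) -> List[str]:
    """Pick a chunker by file extension, falling back to plain-text chunking."""
    ext = os.path.splitext(filename)[1].lower()
    chunker = CHUNKERS.get(ext, chunk_plain_text)
    return chunker(text)
```

A .cs chunker would slot into the same table under its own extension; splitting on declarations rather than on a fixed line count keeps each function or class intact within a single chunk.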

Kevdome3000 avatar Apr 27 '23 04:04 Kevdome3000

We'd love your contributions! In Python we just landed this: https://github.com/microsoft/semantic-kernel/pull/450

And in C# https://github.com/microsoft/semantic-kernel/blob/main/dotnet/src/SemanticKernel/Text/TextChunker.cs

I'd say implementing more chunkers there would be a good starting point!

@lemillermicrosoft @dluc thoughts?

alexchaomander avatar Apr 27 '23 13:04 alexchaomander

@shawncal can you reply with which folder would be the best place to put chunkers?

evchaki avatar Apr 27 '23 20:04 evchaki

Just FYI, this was meant for .Net.

Kevdome3000 avatar Apr 28 '23 00:04 Kevdome3000

bump

Also wondering if extending DocumentSkill would be a better place to implement chunkers

Kevdome3000 avatar Apr 29 '23 05:04 Kevdome3000

@dluc Can we look into this for the backlog?

microsoftShannon avatar May 02 '23 21:05 microsoftShannon

> Also wondering if extending DocumentSkill would be a better place to implement chunkers

I like the idea of making chunkers plugins/functions

matthewbolanos avatar Nov 28 '23 01:11 matthewbolanos

All .Net issues prior to 1-Dec-2023 are being closed. Please re-open if this issue is still relevant to the .Net Semantic Kernel 1.x release. In the future, all issues that are inactive for more than 90 days will be labelled as 'stale' and closed 14 days later.

markwallace-microsoft avatar Mar 12 '24 16:03 markwallace-microsoft