
LLM_CHUNK implementation

Open charleschile opened this issue 1 year ago • 4 comments

What type of PR is this?

  • [ ] API-change
  • [ ] BUG
  • [ ] Improvement
  • [ ] Documentation
  • [x] Feature
  • [ ] Test and CI
  • [ ] Code Refactoring

Which issue(s) this PR fixes:

issue #18664

What this PR does / why we need it:

As part of our document LLM support, we are introducing the LLM_CHUNK function. It splits the content referenced by a datalink into chunks using one of four strategies: fixed_width, sentence, paragraph, or document.

Usage:

 select llm_chunk("<input datalink>", "fixed_width; <width number>");

or

 select llm_chunk("<input datalink>", "<sentence or paragraph or document>");

Return Value: a JSON-like string representing an array of chunks, where each entry holds the chunk's offset, its size, and its text: [[offset0, size0, "chunk"], [offset1, size1, "chunk"],...]
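For callers that consume this value outside SQL, the following is a minimal Python sketch of reading the result. It assumes the returned string happens to be valid JSON, which holds for the examples in this PR but is not guaranteed by the "JSON-like" wording (for instance, chunk text containing quotes may not be escaped). The names `parse_chunks` and `is_contiguous` are hypothetical helpers, not part of MatrixOne.

```python
import json

def parse_chunks(result):
    """Parse an llm_chunk result string into (offset, size, text) tuples.

    Assumption: the string is valid JSON, as in the PR's examples.
    """
    return [(int(offset), int(size), text)
            for offset, size, text in json.loads(result)]

def is_contiguous(chunks):
    """Check that each chunk starts where the previous one ended."""
    expected = 0
    for offset, size, _ in chunks:
        if offset != expected:
            return False
        expected = offset + size
    return True

# Sample value copied from the "sentence" example in this PR.
SAMPLE = ('[[0, 27, "hello world this is a test?"], [27, 13, " hello world!"], '
          '[40, 16, " this is a test."], [56, 13, " hello world."]]')

chunks = parse_chunks(SAMPLE)
# Joining the chunk texts in order should give back the original content.
original = "".join(text for _, _, text in chunks)
```

Because the offsets and sizes tile the input without gaps in both examples, concatenating the chunk texts in order should reconstruct the source content.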

Example SQL for fixed width:

 select llm_chunk(cast('file:///Users/charles/Desktop/codes/testData/example.txt' as datalink), "fixed_width; 11");

Example return:

[[0, 11, "hello world"], [11, 11, " this is a "], [22, 11, "test? hello"], [33, 11, " world! thi"], [44, 11, "s is a test"], [55, 11, ". hello wor"], [66, 3, "ld."]] 
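The fixed_width result above can be reproduced with a short Python sketch. This is an illustration of the behavior visible in the example, not the code in this PR: it assumes offsets and sizes are counted in characters (the example is ASCII, so bytes and characters coincide), and `chunk_fixed_width` is a hypothetical name.

```python
def chunk_fixed_width(text, width):
    """Split text into consecutive pieces of at most `width` characters.

    Returns [offset, size, chunk] entries; the last chunk may be shorter.
    Assumption: offsets are character offsets, not byte offsets.
    """
    if width <= 0:
        raise ValueError("width must be a positive integer")
    result = []
    for offset in range(0, len(text), width):
        piece = text[offset:offset + width]
        result.append([offset, len(piece), piece])
    return result

# Content inferred from the example return values in this PR.
EXAMPLE_TEXT = "hello world this is a test? hello world! this is a test. hello world."

fixed_chunks = chunk_fixed_width(EXAMPLE_TEXT, 11)
```

The input is 69 characters long, so a width of 11 yields six full chunks and a final chunk of 3 characters, matching the last entry `[66, 3, "ld."]`.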

Example SQL for sentence:

 select llm_chunk(cast('file:///Users/charles/Desktop/codes/testData/example.txt' as datalink), "sentence");

Example return:

[[0, 27, "hello world this is a test?"], [27, 13, " hello world!"], [40, 16, " this is a test."], [56, 13, " hello world."]]
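The sentence result above is consistent with splitting immediately after each run of `.`, `!`, or `?`, with leading whitespace kept at the start of the following chunk. The Python sketch below reproduces that output under this assumption; the actual sentence-boundary rules in the PR may differ (for example around abbreviations or other punctuation), and `chunk_sentences` is a hypothetical name.

```python
import re

# A sentence is any run of text ending in one or more terminators,
# or a trailing run of text with no terminator at all.
_SENTENCE = re.compile(r"[^.!?]*[.!?]+|[^.!?]+$")

def chunk_sentences(text):
    """Split text after '.', '!' or '?' and return [offset, size, chunk] entries.

    Assumption: whitespace after a terminator belongs to the next chunk,
    as in the example output in this PR.
    """
    return [[m.start(), len(m.group()), m.group()]
            for m in _SENTENCE.finditer(text)]

# Content inferred from the example return values in this PR.
EXAMPLE_TEXT = "hello world this is a test? hello world! this is a test. hello world."

sentence_chunks = chunk_sentences(EXAMPLE_TEXT)
```

Keeping the whitespace attached to the following chunk means no characters are dropped, so the offsets and sizes still cover the whole input.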

charleschile, Sep 01 '24 16:09