matrixone
LLM_CHUNK implementation
What type of PR is this?
- [ ] API-change
- [ ] BUG
- [ ] Improvement
- [ ] Documentation
- [x] Feature
- [ ] Test and CI
- [ ] Code Refactoring
Which issue(s) this PR fixes:
issue #18664
What this PR does / why we need it:
As part of our document LLM support, we are introducing the LLM_CHUNK function. It splits the content referenced by a datalink into chunks, using one of four chunking strategies: fixed_width, sentence, paragraph, or document.
Usage: select llm_chunk("<input datalink>", "fixed_width; <width number>"); or select llm_chunk("<input datalink>", "<sentence or paragraph or document>");
Return value: a JSON-like string representing an array of chunks, where each entry holds the chunk's offset, its size, and its text: [[offset0, size0, "chunk"], [offset1, size1, "chunk"],...]
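For anyone consuming the result on the client side, here is a minimal Python sketch of turning the returned string into structured rows. It assumes the string happens to be valid JSON, which holds for the examples in this description but is not guaranteed by the "JSON-like" wording (for instance, chunks containing quotes may not be escaped). The helper name `parse_chunks` is hypothetical and not part of this PR.

```python
import json


def parse_chunks(result: str) -> list[tuple[int, int, str]]:
    """Parse an llm_chunk result string into (offset, size, text) tuples.

    Assumption: the result is valid JSON. The PR only promises a
    "JSON-like" string, so this may fail on unusual chunk contents.
    """
    rows = json.loads(result)
    return [(int(offset), int(size), str(text)) for offset, size, text in rows]


# A shortened result in the documented [[offset, size, "chunk"], ...] shape.
result = '[[0, 11, "hello world"], [11, 11, " this is a "], [66, 3, "ld."]]'

for offset, size, text in parse_chunks(result):
    print(offset, size, repr(text))
```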
Example SQL for fixed_width:
select llm_chunk(cast('file:///Users/charles/Desktop/codes/testData/example.txt' as datalink), "fixed_width; 11");
Example return:
[[0, 11, "hello world"], [11, 11, " this is a "], [22, 11, "test? hello"], [33, 11, " world! thi"], [44, 11, "s is a test"], [55, 11, ". hello wor"], [66, 3, "ld."]]
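The fixed_width behaviour shown above can be modelled with a short Python sketch. This is only an illustration of the strategy, not the code in this PR. The sample text is reconstructed by concatenating the chunks in the example return, and the sketch counts characters, whereas the real function may count bytes (the two agree for ASCII input). The name `fixed_width_chunks` is hypothetical.

```python
def fixed_width_chunks(text: str, width: int) -> list[list]:
    """Split text into consecutive pieces of at most `width` characters.

    Each entry is [offset, size, chunk]; the last chunk may be shorter.
    Assumption: offsets and sizes are counted in characters.
    """
    if width <= 0:
        raise ValueError("width must be a positive integer")
    chunks = []
    for start in range(0, len(text), width):
        piece = text[start:start + width]
        chunks.append([start, len(piece), piece])
    return chunks


# Sample text reconstructed from the example return in this description.
sample = "hello world this is a test? hello world! this is a test. hello world."

for chunk in fixed_width_chunks(sample, 11):
    print(chunk)
```

With a width of 11, the 69-character sample yields six full chunks and a final 3-character chunk, matching the example return.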
Example SQL for sentence:
select llm_chunk(cast('file:///Users/charles/Desktop/codes/testData/example.txt' as datalink), "sentence");
Example return:
[[0, 27, "hello world this is a test?"], [27, 13, " hello world!"], [40, 16, " this is a test."], [56, 13, " hello world."]]
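The sentence strategy in the example appears to end a chunk at each `.`, `?`, or `!` and to keep the leading space with the following sentence. The Python sketch below reproduces the example under that assumption. The actual splitting rules in this PR may differ (for example around abbreviations or other punctuation), and the name `sentence_chunks` is hypothetical.

```python
import re

# A sentence is any run of non-terminator characters followed by one
# terminator, or a trailing run with no terminator at the end of the text.
_SENTENCE = re.compile(r"[^.?!]*[.?!]|[^.?!]+$")


def sentence_chunks(text: str) -> list[list]:
    """Split text after each '.', '?' or '!' into [offset, size, chunk].

    Assumption: only these three characters end a sentence, and sizes
    are counted in characters.
    """
    return [
        [m.start(), m.end() - m.start(), m.group()]
        for m in _SENTENCE.finditer(text)
    ]


# Sample text reconstructed from the example return in this description.
sample = "hello world this is a test? hello world! this is a test. hello world."

for chunk in sentence_chunks(sample):
    print(chunk)
```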