matrixone
matrixone copied to clipboard
LLM_EXTRACT_TEXT implementation
What type of PR is this?
- [ ] API-change
- [ ] BUG
- [ ] Improvement
- [ ] Documentation
- [x] Feature
- [ ] Test and CI
- [ ] Code Refactoring
Which issue(s) this PR fixes:
issue #18664
What this PR does / why we need it:
As part of our document LLM support, we are introducing the LLM_EXTRACT_TEXT function. This function extracts text from PDF files and writes the extracted text to a specified text file, extractor type can be specified by the third argument.
Usage: llm_extract_text(<input PDF datalink>, <output text file datalink>, <extractor type string>);
Return Value: A boolean indicating whether the extraction and writing process was successful.
Note:
- Both the input and output paths must be absolute paths.
Example SQL:
select llm_extract_text(cast('file:///Users/charles/Desktop/codes/matrixone/matrixone/test/distributed/resources/llm_test/extract_text/example.pdf?offset=0&size=4' as datalink), cast('file:///Users/charles/Desktop/codes/matrixone/matrixone/test/distributed/resources/llm_test/extract_text/example.txt' as datalink), "pdf");
Example return:
mysql> select llm_extract_text(cast('file:///Users/charles/Desktop/codes/matrixone/matrixone/test/distributed/resources/llm_test/extract_text/example.pdf?offset=0&size=4' as datalink), cast('file:///Users/charles/Desktop/codes/matrixone/matrixone/test/distributed/resources/llm_test/extract_text/example.txt' as datalink), "pdf");
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| llm_extract_text(cast(file:///Users/charles/Desktop/codes/matrixone/matrixone/test/distributed/resources/llm_test/extract_text/example.pdf?offset=0&size=4 as datalink), cast(file:///Users/charles/Desktop/codes/matrixone/matrixone/test/distributed/resources/llm_test/extract_text/example.txt as datalink), pdf) |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| true |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.10 sec)