
LLM_EXTRACT_TEXT implementation

Open charleschile opened this issue 1 year ago • 4 comments

What type of PR is this?

  • [ ] API-change
  • [ ] BUG
  • [ ] Improvement
  • [ ] Documentation
  • [x] Feature
  • [ ] Test and CI
  • [ ] Code Refactoring

Which issue(s) this PR fixes:

issue #18664

What this PR does / why we need it:

As part of our document LLM support, we are introducing the LLM_EXTRACT_TEXT function. It extracts text from a PDF file and writes the extracted text to a specified text file. The extractor type is specified by the third argument.

Usage: llm_extract_text(<input PDF datalink>, <output text file datalink>, <extractor type string>);

Return Value: A boolean indicating whether the extraction and writing process was successful.

Note:

  • Both the input and output paths must be absolute paths.
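
To illustrate the absolute-path requirement, here is a minimal client-side sketch in Python that builds the two datalink arguments and the resulting statement. The helper names `file_datalink` and `extract_text_sql` are hypothetical and not part of MatrixOne. The `offset` and `size` query parameters are copied from the example SQL in this PR, and their exact semantics are not described here. The sketch assumes POSIX-style paths and rejects single quotes instead of escaping them.

```python
from pathlib import PurePosixPath
from typing import Optional
from urllib.parse import urlencode


def file_datalink(path: str, offset: Optional[int] = None,
                  size: Optional[int] = None) -> str:
    """Build a file:// datalink URL from an absolute POSIX path.

    Hypothetical helper: it only mirrors the URL shape used in this PR's example.
    """
    if not PurePosixPath(path).is_absolute():
        raise ValueError(f"datalink path must be absolute: {path!r}")
    if "'" in path:
        # The sketch does no SQL escaping, so refuse quotes outright.
        raise ValueError(f"single quotes are not supported: {path!r}")
    params = {}
    if offset is not None:
        params["offset"] = offset
    if size is not None:
        params["size"] = size
    query = f"?{urlencode(params)}" if params else ""
    return f"file://{path}{query}"


def extract_text_sql(pdf_path: str, txt_path: str, extractor: str = "pdf",
                     offset: Optional[int] = None,
                     size: Optional[int] = None) -> str:
    """Compose an llm_extract_text call in the shape shown in this PR."""
    if "'" in extractor:
        raise ValueError("single quotes are not supported in the extractor type")
    src = file_datalink(pdf_path, offset, size)
    dst = file_datalink(txt_path)
    return (
        f"select llm_extract_text(cast('{src}' as datalink), "
        f"cast('{dst}' as datalink), '{extractor}');"
    )
```

Calling `extract_text_sql("/tmp/example.pdf", "/tmp/example.txt", offset=0, size=4)` yields a statement of the same form as the example SQL in this PR, while a relative path such as `"example.pdf"` raises `ValueError` before any query is sent.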

Example SQL:

select llm_extract_text(cast('file:///Users/charles/Desktop/codes/matrixone/matrixone/test/distributed/resources/llm_test/extract_text/example.pdf?offset=0&size=4' as datalink), cast('file:///Users/charles/Desktop/codes/matrixone/matrixone/test/distributed/resources/llm_test/extract_text/example.txt' as datalink), "pdf");

Example return:

mysql> select llm_extract_text(cast('file:///Users/charles/Desktop/codes/matrixone/matrixone/test/distributed/resources/llm_test/extract_text/example.pdf?offset=0&size=4' as datalink), cast('file:///Users/charles/Desktop/codes/matrixone/matrixone/test/distributed/resources/llm_test/extract_text/example.txt' as datalink), "pdf");
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| llm_extract_text(cast(file:///Users/charles/Desktop/codes/matrixone/matrixone/test/distributed/resources/llm_test/extract_text/example.pdf?offset=0&size=4 as datalink), cast(file:///Users/charles/Desktop/codes/matrixone/matrixone/test/distributed/resources/llm_test/extract_text/example.txt as datalink), pdf)   |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| true                                                                                                                                                                                                                                                                                                                    |
+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.10 sec)

charleschile, Aug 30 '24 13:08