chunjun
[Feature][s3] Add support for reading all document types supported by Apache Tika
Search before asking
- [X] I had searched in the issues and found no similar feature requirement.
Description
Add support for reading all document types supported by Apache Tika.
Use case
CREATE TABLE source
(
content String,
metadata String
) WITH (
'connector' = 's3-x',
'accessKey' = 'xxx',
'secretKey' = 'xxx',
'bucket' = 'di-test',
'objects' = '["/pdf-source/20240528/.*"]',
'endpoint' = 'http://10.x.x.x',
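-- Whether to enable chunking; default: false
'tika-use-extract' = 'true'
-- Chunk size; default: -1 (no chunking, the entire content is extracted)
,'tika-chunk-size' = '40'
-- Content overlap ratio between chunks, 0-100
,'tika-overlap-ratio' = '0'
-- Disable injecting the bucket name as a prefix of the endpoint; default: false. Set to true when the endpoint is a domain name
,'disableBucketNameInEndpoint' = 'true'
-- Regular expression used to match objects
,'objectsRegex' = '.*\.doc'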
);
CREATE TABLE sink
(
content String,
metadata String
) WITH (
'connector' = 'stream-x',
'print' = 'true'
);
INSERT INTO sink SELECT * FROM source;
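To make the intended interaction of 'tika-chunk-size' and 'tika-overlap-ratio' concrete, here is a minimal sketch of one possible interpretation. It is not the proposed implementation: the class `ChunkSketch` and the method `chunk` are hypothetical names, and the sketch assumes that the chunk size is counted in characters and that the overlap ratio is the percentage of a chunk repeated from the previous chunk. Neither assumption is stated in this issue.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of ChunJun or Apache Tika. */
class ChunkSketch {

    /**
     * Splits extracted text into chunks of at most chunkSize characters.
     *
     * Assumptions (not confirmed by the issue):
     * - chunkSize is a character count; a value <= 0 (such as -1) means "no chunking".
     * - overlapRatio (0-100) is the percentage of chunkSize shared with the previous chunk.
     */
    static List<String> chunk(String text, int chunkSize, int overlapRatio) {
        if (overlapRatio < 0 || overlapRatio > 100) {
            throw new IllegalArgumentException("overlapRatio must be between 0 and 100");
        }
        List<String> chunks = new ArrayList<>();
        // No chunking requested, or the text already fits in one chunk.
        if (chunkSize <= 0 || text.length() <= chunkSize) {
            chunks.add(text);
            return chunks;
        }
        int overlap = chunkSize * overlapRatio / 100;
        // Advance by at least one character so a ratio of 100 still terminates.
        int step = Math.max(1, chunkSize - overlap);
        for (int start = 0; start < text.length(); start += step) {
            int end = Math.min(text.length(), start + chunkSize);
            chunks.add(text.substring(start, end));
            if (end == text.length()) {
                break; // the last chunk reached the end of the text
            }
        }
        return chunks;
    }
}
```

Under these assumptions, the settings in the use case above ('tika-chunk-size' = '40', 'tika-overlap-ratio' = '0') would correspond to `ChunkSketch.chunk(content, 40, 0)`: consecutive chunks of up to 40 characters with no shared content. Whether the real option counts characters, tokens, or something else would need to be settled in the PR.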
Related issues
No response
Are you willing to submit a PR?
- [X] Yes I am willing to submit a PR!
Code of Conduct
- [X] I agree to follow this project's Code of Conduct