[Feature][s3] Support reading all document types supported by Apache Tika

Open libailin opened this issue 1 year ago • 0 comments

Search before asking

[X] I had searched in the issues and found no similar feature requirement.

Description

Add support in the S3 connector for reading all document types supported by Apache Tika.

Use case

CREATE TABLE source
(
    content String,
    metadata String
) WITH (
    'connector' = 's3-x',
    'accessKey' = 'xxx',
    'secretKey' = 'xxx',
    'bucket' = 'di-test',
    'objects' = '["/pdf-source/20240528/.*"]',
    'endpoint' = 'http://10.x.x.x',
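    -- Whether to enable chunking (default: false)
    'tika-use-extract' = 'true'
    -- Chunk size (default: -1, meaning no chunking; the full content is extracted)
    ,'tika-chunk-size' = '40'
    -- Overlap ratio between chunks, 0-100
    ,'tika-overlap-ratio' = '0'
    -- Disable injecting the bucket name as a prefix of the endpoint (default: false); set to true when the endpoint is a domain name
    ,'disableBucketNameInEndpoint' = 'true'
    -- Regular expression used to match objects
    ,'objectsRegex' = '.*\.doc'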
);


CREATE TABLE sink
(
    content String,
    metadata String
) WITH (
    'connector' = 'stream-x',
    'print' = 'true'
);

INSERT INTO sink SELECT * FROM source;
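
To make the intended interaction between 'tika-chunk-size' and 'tika-overlap-ratio' concrete, here is a minimal Python sketch of one possible reading of those options. It is not the connector's implementation. It assumes the chunk size is counted in characters and that the overlap ratio is a percentage of the chunk size; the function name `chunk_text` and its parameters are hypothetical.

```python
def chunk_text(text: str, chunk_size: int = -1, overlap_ratio: int = 0) -> list[str]:
    """Split extracted text into chunks that may overlap.

    Assumptions (not confirmed by the issue): chunk_size is a character
    count, and overlap_ratio is a percentage (0-100) of chunk_size.
    """
    if not 0 <= overlap_ratio <= 100:
        raise ValueError("overlap_ratio must be between 0 and 100")

    # A non-positive chunk size (the default -1) means "do not chunk".
    if chunk_size <= 0 or len(text) <= chunk_size:
        return [text]

    # Number of characters shared by two consecutive chunks.
    overlap = chunk_size * overlap_ratio // 100
    # Always advance by at least one character so the loop terminates,
    # even when overlap_ratio is 100.
    step = max(1, chunk_size - overlap)

    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

Under these assumptions, `chunk_text(content, 40, 0)` mirrors the settings in the source table above: consecutive 40-character chunks with no overlap. A ratio of 50 would make each chunk repeat the second half of the previous one.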

Related issues

No response

Are you willing to submit a PR?

[X] Yes I am willing to submit a PR!

Code of Conduct

[X] I agree to follow this project's Code of Conduct

Aug 27 '24 10:08 libailin