[BitSail][Connector] Support local file source connector
Is your feature request related to a problem? Please describe
We need to read local files as a source. The CSV file format would be a good start.
Describe the solution you'd like
Use the V1 source connector interface to read a CSV file as the source, and be able to write it to PrintSink.
Additional context
Hi there! I'm interested in adding this feature to BitSail. Could you assign this issue to me?
@Jake-00 assigned to you!
Hi @garyli1019, I have two questions:
- Would it be better for the local CSV reader to implement the SimpleSourceBase interface rather than Source? I ask because I found that fake_connector implements SimpleSourceBase.
- The CSV format is schema-less. Do I need to load an additional JSON file to define the schema?
Hello @Jake-00
- SimpleSourceBase is a single-parallelism source; to run with multiple parallelism, we need to implement the complete Source interface. Multi-parallelism would be needed when reading multiple CSV files under a path, or when splitting one big CSV file (not sure whether that is possible for CSV). If you would like to support single parallelism first and extend to multi-parallelism in a future task, that's totally OK as well. It's up to you :)
- The way we handle the schema can be flexible. In some cases we may use the first line as the schema; sometimes the file has no schema line, or the schema is out of order. We could implement the standard way first (like other connectors, include the schema in the configuration file) and support other ways later.
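To illustrate the schema cases mentioned above (a header line that may be out of order, or no header at all), here is a minimal sketch in plain Java. The class `CsvColumnResolver` and the method `resolveIndexes` are hypothetical names made up for this example and are not part of BitSail. The sketch assumes unquoted fields and that, when no header exists, the configured column order matches the physical order in the file.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not a BitSail class.
final class CsvColumnResolver {

  /**
   * For each configured column, returns its position in a CSV record.
   * A null headerLine means the file has no header; the configured order
   * is then assumed to be the physical column order.
   */
  static int[] resolveIndexes(List<String> configuredColumns, String headerLine, String delimiter) {
    int[] indexes = new int[configuredColumns.size()];
    if (headerLine == null) {
      for (int i = 0; i < indexes.length; i++) {
        indexes[i] = i;
      }
      return indexes;
    }
    // Split the header literally on the delimiter (no quote handling in this sketch).
    List<String> header = new ArrayList<>();
    for (String name : headerLine.split(Pattern.quote(delimiter), -1)) {
      header.add(name.trim());
    }
    for (int i = 0; i < indexes.length; i++) {
      String wanted = configuredColumns.get(i);
      int pos = header.indexOf(wanted);
      if (pos < 0) {
        throw new IllegalArgumentException("Column not found in header: " + wanted);
      }
      indexes[i] = pos;
    }
    return indexes;
  }

  public static void main(String[] args) {
    // Header order differs from the configured order.
    int[] mapping = resolveIndexes(Arrays.asList("id", "name"), "name,id", ",");
    System.out.println(Arrays.toString(mapping));
  }
}
```

With a mapping like this, the "standard way" (schema in the configuration file) and the "first line as schema" case could share the same reading path; only the way the index array is built would differ.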
Hi @garyli1019, following your analysis, I have decided to implement a single-parallelism local CSV source in the standard way first. Two things need discussion:
- Referring to the CSV format in Flink's docs, here is a demo configuration. How about this design?
{
  "job": {
    "common": {
      "job_id": -255,
      "instance_id": -2035,
      "job_name": "bitsail_local_csv_to_print_test",
      "user_name": "test"
    },
    "reader": {
      "class": "com.bytedance.bitsail.connector.local.csv.source.LocalCsvSource",
      "file_path": "src/test/resources/student.csv",
      "field_delimiter": ",",
      "disable_quote_character": true,
      "allow_comments": false,
      "ignore_parse_errors": false,
      "null_literal": "",
      "columns": [
        {
          "name": "id",
          "type": "long"
        },
        {
          "name": "name",
          "type": "string"
        }
      ]
    },
    "writer": {
      "class": "com.bytedance.bitsail.connector.legacy.print.sink.PrintSink",
      "writer_parallelism_num": 2
    }
  }
}
- Parsing a CSV file into POJO objects with Jackson is convenient, but it may be difficult to parse each row of a CSV file into a Row object (as defined in BitSail). So I am trying to import univocity-parsers to handle row parsing. I am not sure whether univocity-parsers is the right choice for efficient parsing.
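To make the row-parsing question concrete, here is a rough sketch of turning one CSV line into typed values using the `field_delimiter`, `null_literal`, and `columns` settings from the demo configuration. It uses only the JDK. The class `CsvRowSketch` and the method `parseLine` are hypothetical, a plain `Object[]` stands in for BitSail's Row, and quoting is not handled (matching `"disable_quote_character": true`). It is not how univocity-parsers or any BitSail component works.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical sketch; Object[] stands in for BitSail's Row type.
final class CsvRowSketch {

  /**
   * Splits a line on the delimiter and converts each field according to
   * the configured column type ("long" or "string" in this sketch).
   * A field equal to nullLiteral becomes null.
   */
  static Object[] parseLine(String line, String delimiter, String nullLiteral, String[] types) {
    // limit -1 keeps trailing empty fields, e.g. "2," yields two fields.
    String[] fields = line.split(Pattern.quote(delimiter), -1);
    if (fields.length != types.length) {
      throw new IllegalArgumentException(
          "Expected " + types.length + " fields but got " + fields.length);
    }
    Object[] row = new Object[types.length];
    for (int i = 0; i < types.length; i++) {
      String raw = fields[i];
      if (raw.equals(nullLiteral)) {
        row[i] = null;
        continue;
      }
      switch (types[i]) {
        case "long":
          row[i] = Long.parseLong(raw.trim());
          break;
        case "string":
          row[i] = raw;
          break;
        default:
          throw new IllegalArgumentException("Unsupported type: " + types[i]);
      }
    }
    return row;
  }

  public static void main(String[] args) {
    String[] types = {"long", "string"}; // mirrors the "columns" entries above
    System.out.println(Arrays.toString(parseLine("1,Alice", ",", "", types)));
  }
}
```

A real implementation would also need to honor `allow_comments` and `ignore_parse_errors`, which is part of why a dedicated CSV library is attractive here.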
Hi @Jake-00, sorry for the late reply. Regarding the CSV and JSON formats, we already have them in bitsail-components/bitsail-component-format, so we could reuse the existing ones.
The existing CSV format tool really helps, and I have created a PR :)