support `skip_rows` for `CsvFormat`
Is your feature request related to a problem or challenge?
Some tools generate CSVs that contain extra data at the top or bottom, such as bank statement exports. For example:
"Account Name : REDACTED"
"Account Number : REDACTED"
"Date Range : 01/01/2023-12/31/2023"
Transaction Date,Posted Date,Category,Debit
2023-01-01,2023-01-01,Payment/Credit,100.0
I'd like a way to skip the first n rows.
Describe the solution you'd like
Add a new option to CsvFormat for skipping rows, e.g.:
CsvFormat::default().with_skip_rows(2)
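To make the request concrete, here is a rough sketch of how such an option might be used, assuming a hypothetical with_skip_rows builder method on CsvFormat (the table name and file path are made up; the rest mirrors the existing ListingOptions API):

use std::sync::Arc;
use datafusion::datasource::file_format::csv::CsvFormat;
use datafusion::datasource::listing::ListingOptions;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Hypothetical: skip the three metadata lines that precede the header row.
    let format = CsvFormat::default()
        .with_has_header(true)
        .with_skip_rows(3); // proposed option, does not exist in DataFusion today

    let options = ListingOptions::new(Arc::new(format)).with_file_extension(".csv");
    ctx.register_listing_table("bank", "data/statement.csv", options, None, None)
        .await?;

    ctx.sql("SELECT * FROM bank").await?.show().await?;
    Ok(())
}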
Describe alternatives you've considered
none that I can think of.
Additional context
https://github.com/GlareDB/glaredb/issues/2035
It might also make sense to add this upstream to arrow-rs (though we can add it first to DataFusion and then upstream it)
There is likely already support for this at the file level via SeekFrom::Start:
let decoder = if is_whole_file_scanned {
    // Don't seek if no range as breaks FIFO files
    file_compression_type.convert_read(file)?
} else {
    file.seek(SeekFrom::Start(result.range.start as _))?;
    file_compression_type.convert_read(
        file.take((result.range.end - result.range.start) as u64),
    )?
};
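If the option were applied only to the reader that starts at byte offset 0 (so later byte-range partitions are unaffected), a rough sketch of the skipping itself could be a small helper like the hypothetical skip_lines below; with compressed files it would have to run on the decompressed stream, i.e. after convert_read:

use std::io::{BufRead, BufReader, Read};

// Hypothetical helper: consume `n` newline-terminated lines from the start of
// `reader` and hand back a reader positioned at the first remaining byte.
fn skip_lines<R: Read>(reader: R, n: usize) -> std::io::Result<impl Read> {
    let mut buf = BufReader::new(reader);
    let mut line = Vec::new();
    for _ in 0..n {
        line.clear();
        // `read_until` returns Ok(0) at EOF, so skipping past the end is a no-op
        if buf.read_until(b'\n', &mut line)? == 0 {
            break;
        }
    }
    Ok(buf)
}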
But imagine there is a parallel read of a source containing multiple CSV files, how should skip_rows work in that case? 🤔
I am interested in it.
If no one is working on it, you can assign it to me @alamb
@alamb Do not forget it ~~~~
Thanks @Tangruilin. I think we need to figure out how to deal with parallel reads and multiple files prior to development.
I don't know of anyone working on it @Tangruilin -- I have assigned it to you; As @comphead says, this is probably a task that needs some design prior to code
I think the idea of skipping N rows at the file level doesn't make much sense. What we can probably do is skip N rows at the DataFrame level, but then there is no guarantee exactly which N rows will be skipped because of ordering, shuffling, etc. IMHO it looks more like a user task than a DataFusion task, as the user has more context when executing the query.
I checked Spark but I haven't found this functionality built in, probably because of the concerns above.
@universalmind303 what is your vision as the ticket owner?
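For reference, a minimal sketch of what DataFrame-level skipping could look like today, using DataFrame::limit(skip, fetch) (the file path is made up); as the next comment points out, this only helps once the file can be parsed as CSV at all:

use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    // Parse the whole file first, then drop the first two rows of the result.
    // Without an ORDER BY there is no guarantee *which* two rows are dropped,
    // which is exactly the concern raised above.
    let df = ctx
        .read_csv("data/statement.csv", CsvReadOptions::new())
        .await?
        .limit(2, None)?;
    df.show().await?;
    Ok(())
}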
If I understand the issue correctly, skipping rows at the DataFrame level would not work, as the file wouldn't even be parseable into a DataFrame in the first place, the initial rows not being CSV rows.
Would it be possible to specify particular file lines to skip, instead of the first N rows of a file? Though I'm not sure that would be any simpler than skipping the first N rows.
I think you can skip lines, as in the code I posted above: https://github.com/apache/arrow-datafusion/issues/8824#issuecomment-1907167127
But my concern is about reading multiple files: is it expected to skip lines in all of the files? And what if the formats are mixed, so the user silently loses some data?
We could think about skipping unparseable rows instead, but that solution has tons of nuances.
> I don't know of anyone working on it @Tangruilin -- I have assigned it to you; As @comphead says, this is probably a task that needs some design prior to code
Maybe I can write a draft which does not consider parallel reads, and then we can discuss it.
This task still has some unclear parts; I would be happy if you describe your proposal in the comments.
I came across this issue since I was looking for a way to deal with CSV files containing comments (in my case: lines starting with #). Please let me know if I should open a new issue for that.
I was originally looking for a way to replace the reader or hook in after the builtin one, but from reading the above comments this probably does not fit datafusion's input handling.
Skipping the first n rows wouldn't help in my case, but a way to hook into the CSV reader by providing a line-based filter function might (naively: a CsvOptions field like filter: FnOnce(&str) -> bool). Maybe something like this could even be generalized to a line-based transformer function a la filter_map (it would probably need to return some Option<Cow> to not penalize use cases which only filter, but do not transform). If OP's lines to skip can be clearly identified, this might address their use case as well.
> I came across this issue since I was looking for a way to deal with CSV files containing comments (in my case: lines starting with #). Please let me know if I should open a new issue for that.
I think a new issue would be good as this one describes something different
> I was originally looking for a way to replace the reader or hook in after the builtin one, but from reading the above comments this probably does not fit datafusion's input handling.
I think you would have to provide your own TableProvider / format for the ListingTable provider if you wanted to do this today
> Skipping the first n rows wouldn't help in my case, but a way to hook into the CSV reader by providing a line-based filter function might (naively: a CsvOptions field like filter: FnOnce(&str) -> bool). Maybe something like this could even be generalized to a line-based transformer function a la filter_map (it would probably need to return some Option<Cow> to not penalize use cases which only filter, but do not transform). If OP's lines to skip can be clearly identified, this might address their use case as well.
Since the readers are based on streams of data, I do wonder if we could implement some sort of filtering that happened prior to parsing that would let you transform the input however you wanted 🤔
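As a rough illustration of that idea (not an existing DataFusion or arrow-rs API): an adapter that applies a line-based predicate before the bytes ever reach a CSV parser, here dropping lines that start with #. A production version would stream rather than buffer the whole input:

use std::io::{BufRead, BufReader, Cursor, Read};

// Hypothetical adapter: read `input` line by line, keep only the lines for
// which `keep` returns true, and hand the result back as something a CSV
// parser could consume via `Read`.
fn filter_lines<R, F>(input: R, keep: F) -> std::io::Result<impl Read>
where
    R: Read,
    F: Fn(&str) -> bool,
{
    let mut out = String::new();
    for line in BufReader::new(input).lines() {
        let line = line?;
        if keep(&line) {
            out.push_str(&line);
            out.push('\n');
        }
    }
    Ok(Cursor::new(out))
}

fn main() -> std::io::Result<()> {
    let raw = "# generated by some tool\na,b\n# a comment\n1,2\n";
    let mut filtered = String::new();
    filter_lines(raw.as_bytes(), |l| !l.starts_with('#'))?.read_to_string(&mut filtered)?;
    assert_eq!(filtered, "a,b\n1,2\n");
    Ok(())
}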