support `skip_rows` for `CsvFormat`
Is your feature request related to a problem or challenge?
Some tools generate CSVs that contain extra data at the top or bottom, such as bank statement exports. For example:
"Account Name : REDACTED"
"Account Number : REDACTED"
"Date Range : 01/01/2023-12/31/2023"
Transaction Date,Posted Date,Category,Debit
2023-01-01,2023-01-01,Payment/Credit,100.0
I'd like a way to skip the first n rows.
Describe the solution you'd like
Add a new option to CsvFormat for skipping rows, e.g.:
CsvFormat::default().with_skip_rows(2)
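To make the request concrete, here is a rough sketch of how such an option might be used, assuming a hypothetical with_skip_rows builder method on CsvFormat (the table name and file path are made up; the rest mirrors the existing ListingOptions API):

use std::sync::Arc;
use datafusion::datasource::file_format::csv::CsvFormat;
use datafusion::datasource::listing::ListingOptions;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();

    // Hypothetical: skip the three metadata lines that precede the header row.
    let format = CsvFormat::default()
        .with_has_header(true)
        .with_skip_rows(3); // proposed option, does not exist in DataFusion today

    let options = ListingOptions::new(Arc::new(format)).with_file_extension(".csv");
    ctx.register_listing_table("bank", "data/statement.csv", options, None, None)
        .await?;

    ctx.sql("SELECT * FROM bank").await?.show().await?;
    Ok(())
}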
Describe alternatives you've considered
none that I can think of.
Additional context
https://github.com/GlareDB/glaredb/issues/2035
It might also make sense to add this upstream to arrow-rs (though we can add it first to DataFusion and then upstream it)
There is likely already support for this at the file level via SeekFrom::Start:
let decoder = if is_whole_file_scanned {
    // Don't seek if no range as breaks FIFO files
    file_compression_type.convert_read(file)?
} else {
    file.seek(SeekFrom::Start(result.range.start as _))?;
    file_compression_type.convert_read(
        file.take((result.range.end - result.range.start) as u64),
    )?
};
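If the option were applied only to the reader that starts at byte offset 0 (so later byte-range partitions are unaffected), a rough sketch of the skipping itself could be a small helper like the hypothetical skip_lines below; with compressed files it would have to run on the decompressed stream, i.e. after convert_read:

use std::io::{BufRead, BufReader, Read};

// Hypothetical helper: consume `n` newline-terminated lines from the start of
// `reader` and hand back a reader positioned at the first remaining byte.
fn skip_lines<R: Read>(reader: R, n: usize) -> std::io::Result<impl Read> {
    let mut buf = BufReader::new(reader);
    let mut line = Vec::new();
    for _ in 0..n {
        line.clear();
        // `read_until` returns Ok(0) at EOF, so skipping past the end is a no-op
        if buf.read_until(b'\n', &mut line)? == 0 {
            break;
        }
    }
    Ok(buf)
}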
But imagine there is a parallel read of a source containing multiple CSV files, how should skip_rows work in that case? 🤔
I am interested in it.
If no one is working on it, you can assign it to me @alamb
@alamb Do not forget it ~~~~
Thanks @Tangruilin. I think we need to figure out how to deal with parallel reads and multiple files prior to development.
I don't know of anyone working on it @Tangruilin -- I have assigned it to you; As @comphead says, this is probably a task that needs some design prior to code
I think the idea of skipping N rows at the file level doesn't make much sense. What we can probably do is skip N rows at the DataFrame level, but then there is no guarantee exactly which N rows will be skipped because of ordering, shuffling, etc. IMHO it looks more like a user task than a DataFusion task, as the user has more context when executing the query.
I checked Spark but I haven't found this functionality built in, probably because of the concerns above.
@universalmind303 what is your vision as the ticket owner?
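For reference, a minimal sketch of what DataFrame-level skipping could look like today, using DataFrame::limit(skip, fetch) (the file path is made up); as the next comment points out, this only helps once the file can be parsed as CSV at all:

use datafusion::prelude::*;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    // Parse the whole file first, then drop the first two rows of the result.
    // Without an ORDER BY there is no guarantee *which* two rows are dropped,
    // which is exactly the concern raised above.
    let df = ctx
        .read_csv("data/statement.csv", CsvReadOptions::new())
        .await?
        .limit(2, None)?;
    df.show().await?;
    Ok(())
}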
If I understand the issue correctly, skipping rows at the DataFrame level would not work, as the file wouldn't even be parseable into a DataFrame in the first place, the initial rows not being CSV rows.
Would it be possible to specify particular file lines to skip, instead of the first N rows of a file? Though I'm not sure that would be any simpler than skipping the first N rows.
I think you can skip lines, as in the code I posted above: https://github.com/apache/arrow-datafusion/issues/8824#issuecomment-1907167127
But my concern is about reading multiple files: is it expected to skip lines in all of the files? And what if the formats are mixed, so the user silently loses some data?
We could think about skipping unparseable rows instead, but that solution has tons of nuances.
> I don't know of anyone working on it @Tangruilin -- I have assigned it to you; As @comphead says, this is probably a task that needs some design prior to code
Maybe I can write a draft which does not consider parallel reads, and then we can discuss it.
This task still has some unclear parts; I would be happy if you describe your proposal in the comments.
I came across this issue since I was looking for a way to deal with CSV files containing comments (in my case: lines starting with #). Please let me know if I should open a new issue for that.
I was originally looking for a way to replace the reader or hook in after the builtin one, but from reading the above comments this probably does not fit datafusion's input handling.
Skipping the first n rows wouldn't help in my case, but a way to hook into the CSV reader by providing a line-based filter function might (naively: a CsvOptions field like filter: FnOnce(&str) -> bool). Maybe something like this could even be generalized to a line-based transformer function a la filter_map (it would probably need to return some Option<Cow> to not penalize use cases which only filter, but do not transform). If OP's lines to skip can be clearly identified, this might address their use case as well.
> I came across this issue since I was looking for a way to deal with CSV files containing comments (in my case: lines starting with #). Please let me know if I should open a new issue for that.
I think a new issue would be good as this one describes something different
> I was originally looking for a way to replace the reader or hook in after the builtin one, but from reading the above comments this probably does not fit datafusion's input handling.
I think you would have to provide your own TableProvider / format for the ListingTable provider if you wanted to do this today
> Skipping the first n rows wouldn't help in my case, but a way to hook into the CSV reader by providing a line-based filter function might (naively: a CsvOptions field like filter: FnOnce(&str) -> bool). Maybe something like this could even be generalized to a line-based transformer function a la filter_map (it would probably need to return some Option<Cow> to not penalize use cases which only filter, but do not transform). If OP's lines to skip can be clearly identified, this might address their use case as well.
Since the readers are based on streams of data, I do wonder if we could implement some sort of filtering that happened prior to parsing that would let you transform the input however you wanted 🤔
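As a rough illustration of that idea (not an existing DataFusion or arrow-rs API): an adapter that applies a line-based predicate before the bytes ever reach a CSV parser, here dropping lines that start with #. A production version would stream rather than buffer the whole input:

use std::io::{BufRead, BufReader, Cursor, Read};

// Hypothetical adapter: read `input` line by line, keep only the lines for
// which `keep` returns true, and hand the result back as something a CSV
// parser could consume via `Read`.
fn filter_lines<R, F>(input: R, keep: F) -> std::io::Result<impl Read>
where
    R: Read,
    F: Fn(&str) -> bool,
{
    let mut out = String::new();
    for line in BufReader::new(input).lines() {
        let line = line?;
        if keep(&line) {
            out.push_str(&line);
            out.push('\n');
        }
    }
    Ok(Cursor::new(out))
}

fn main() -> std::io::Result<()> {
    let raw = "# generated by some tool\na,b\n# a comment\n1,2\n";
    let mut filtered = String::new();
    filter_lines(raw.as_bytes(), |l| !l.starts_with('#'))?.read_to_string(&mut filtered)?;
    assert_eq!(filtered, "a,b\n1,2\n");
    Ok(())
}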