calamine
calamine copied to clipboard
feat: add option to keep first empty rows for xlsx
We have an issue https://github.com/ToucanToco/fastexcel/issues/209 that has been getting some popularity People want to be able to set the "header row" knowing where it starts and the fact that calamine skips by default empty rows at the beginning is not always wanted.
I suggest we add an option to be able to change the default behavior directly in calamine. It will be a bit less performant for the ones who enable this option since the whole vec of cells will be shifted but has no impact on people who keep the default behavior.
Note: I decided to tackle only xlsx files for now because
- I don't know if this will be accepted
- I don't know if the behavior is the same for other files
- I don't know if the "XlsxOptions" is something we want to generalize as an associated type
type Optionin theReadertrait
related to https://github.com/tafia/calamine/issues/147#issuecomment-492716541
I don't know if the behavior is the same for other files
Yes, the same. See first four tests https://github.com/dimastbk/python-calamine/blob/master/tests/test_base.py#L11
EDIT: SORRY for the late answer!!
I don't know if this will be accepted
I don't see anything wrong here. I'd like the comments to reflect where/how it applies if possible, in particular if there is no option for other files.
I don't know if the behavior is the same for other files
While I'd love all files to be supported equally, I understand people focusing on their own issues first. If you don't mind trying to check other files i'd be great, else it is ok. Thanks anyway for your time!
I don't know if the "XlsxOptions" is something we want to generalize as an associated type type Option in the Reader trait
I don't really know too. I suppose it could be something useful in more contexts but I haven't felt a strong need either.
I'll try to tackle other file formats by the end of the week
@tafia I added the support for ODS. I reckon having the option at workbook level is not appropriate. We could want to open a sheet with one header row and another sheet with another header row. For me we should have
/// Options for reading data
pub struct ReadDataOptions {
/// header row
pub header_row: Option<u32>,
...
}
pub trait Reader<RS>: Sized
where
RS: Read + Seek,
{
...
fn worksheet_range(&mut self, name: &str, options: ReadDataOptions) -> Result<Range<Data>, Self::Error>;
...
fn worksheet_range_at(&mut self, n: usize, options: ReadDataOptions) -> Option<Result<Range<Data>, Self::Error>> {
let name = self.sheet_names().get(n)?.to_string();
Some(self.worksheet_range(&name, options))
}
...
}
pub trait ReaderRef<RS>: Reader<RS>
where
RS: Read + Seek,
{
fn worksheet_range_ref<'a>(&'a mut self, name: &str, options: ReadDataOptions)
-> Result<Range<DataRef<'a>>, Self::Error>;
fn worksheet_range_at_ref(&mut self, n: usize, options: ReadDataOptions) -> Option<Result<Range<DataRef>, Self::Error>> {
let name = self.sheet_names().get(n)?.to_string();
Some(self.worksheet_range_ref(&name, options))
}
}
The issue is that it would break the API
I could add new method worksheet_range_with_option for example but it sounds quite heavy.
WDYT?
@PrettyWood would your solution work for this use case?
@ACronje yup!
Even though my Rust knowledge is pretty limited I would be willing to help get this over the line @PrettyWood @tafia .
We really need this!
The rust part is not really a problem and I can finish the job easily, it's more a problem of API ;) I need @tafia opinion on https://github.com/tafia/calamine/pull/453#issuecomment-2358655100
@tafia I added the support for ODS. I reckon having the option at workbook level is not appropriate. We could want to open a sheet with one header row and another sheet with another header row. For me we should have
/// Options for reading data pub struct ReadDataOptions { /// header row pub header_row: Option<u32>, ... } pub trait Reader<RS>: Sized where RS: Read + Seek, { ... fn worksheet_range(&mut self, name: &str, options: ReadDataOptions) -> Result<Range<Data>, Self::Error>; ... fn worksheet_range_at(&mut self, n: usize, options: ReadDataOptions) -> Option<Result<Range<Data>, Self::Error>> { let name = self.sheet_names().get(n)?.to_string(); Some(self.worksheet_range(&name, options)) } ... } pub trait ReaderRef<RS>: Reader<RS> where RS: Read + Seek, { fn worksheet_range_ref<'a>(&'a mut self, name: &str, options: ReadDataOptions) -> Result<Range<DataRef<'a>>, Self::Error>; fn worksheet_range_at_ref(&mut self, n: usize, options: ReadDataOptions) -> Option<Result<Range<DataRef>, Self::Error>> { let name = self.sheet_names().get(n)?.to_string(); Some(self.worksheet_range_ref(&name, options)) } }The issue is that it would break the API I could add new method
worksheet_range_with_optionfor example but it sounds quite heavy. WDYT?
@PrettyWood since adding options like you've suggested above would be a breaking change, and since adding a new method like worksheet_range_with_option would inevitably require you to add worksheet_formula_with_options etc. how about rather passing the sheet-specific parsing options to with_options?
I think this could be useful for the other file types that do not support lazy iteration (they are loading the sheets into memory at instantiation time I believe? see: ods) and therefore might decide to use this sheet-specific config to, for example, be more efficient with loading sheet data (e.g. only keeping in memory the sheets defined in the config)
I first went with the with_options approach (it's still the case in this PR) but I reckon it's wrong to have sheet options (like header_row) that are meant for a single sheet to be set at workbook level.
I understand what you're saying and I saw your implementation. I explained why I think it's acceptable for the sheet options to be at the workbook level.
Thanks for sharing your opinion. I'll finish implementing all the extensions and let the with_options for now.
If it's good enough it will be mergeable directly. If not it won't cost too much to update
EDIT: it is now implemented for all extensions. I went for an associated type if we want to have that at reader level. I still reckon we should have options at workbook level and options at sheet level. And header_row should be at sheet level IMO.
Huge thanks to @PrettyWood for this improvement! Our team depends on this PR to fully adopt calamine.
@tafia Even though you've already approved the PR, I believe @PrettyWood is still waiting for you to take another look at the post-approval changes before merging. Any chance you could take a look in the coming days? Or can he go ahead and merge? Thanks a lot!
Thanks a lot for the extra effort! LGTM, just one tiny refactor I believe makes the intent clearer.
@PrettyWood let me know once you think you're done. I'm happy to merge it as it is already. Thanks again!
Ok @tafia sorry for those last changes but I prefer switching to an enum. It's more explicit and I plan on adding skip_rows and n_rows to options after. So I guess it will make sense to also have None for the header row to say "I don't want any header at all". Right now there is no difference between "the first row of the range" and "the header row" but it will make sense after if some kind of pagination is added
If that's ok for you this PR is ready to merge on my end I'll open new PRs soon
Awesome thanks!