tablesaw
tablesaw copied to clipboard
Excel reader stops parsing table on an empty row
If a table in excel has a row that hasn't any data, the current reader implementation interprets that is the end of the table and doesn't read next rows.
I understand that this can be the desired behavior in some cases where after the blank row it follows another information that is not part of the table. But I don't think that this is the expected behavior.
It will be nice to have at least a parameter in reader options like 'endTableParsingOnFirstBlankRow'
no more parameters :) an excel table can be full of all kinds of junk. We can't take ownership of a user's unbounded sloppiness, so I don't think we should try too hard to guess what they meant with their poor quality table. Excel is a very good environment for cleaning data issues like this.
You're right @lwhite1 , the user can fill the data badly in an infinite number of different ways and it can be a nightmare of parameters to cover all cases. But it's also true that the current table identification logic is quite poor. For example today I've encountered that in an excel like this it doesn't detect any table
| col2 | col3 | |
|---|---|---|
| 1 | 2 | 3 |
| 1 | 2 | 3 |
I've thought a little about the best solution and I think that will be to expose an option to be able to pass the table area coordinates as options. That way it will be only a parameter and will allow custom logic or preprocess to identify the tables giving a more intelligent approach and don't pollute with lots of options.
I like the proposal. Being able to optionally specify the coordinates would satisfy most use cases without complex reader logic.
@lwhite1 Any possibility to add just add a parameter to cover table coordinates? I think that will be very useful and will allow major usecases without entering a parameters and logic nightmare as you said.
But it's also true that the current table identification logic is quite poor.
For example today I've encountered that in an excel like this it doesn't detect any table
col2 col3 1 2 3 1 2 3
I assume this doesn't work because the first value in the table header is empty, but that doesn't seem to be the problem you're addressing here.
@lwhite1 Any possibility to add just add a parameter to cover table coordinates? I think that will be very useful and will allow major usecases without entering a parameters and logic nightmare as you said.
I guess I would like to understand what you're doing that requires automated handling of very messy spreadsheets, and also what your proposal is to handle empty rows and columns within the specified coordinates. Every time we add a feature to handle poor quality data we make Tablesaw more complex and harder to maintain.
I'm running into similar issues. I think calling a spreadsheet with an empty row, or a blank cell in the upper left corner "very messy" or "poor quality" is extreme language. But I do understand the maintainability problem.