tablesaw icon indicating copy to clipboard operation
tablesaw copied to clipboard

Excel reader stops parsing table on an empty row

Open lujop opened this issue 4 years ago • 7 comments

If a table in excel has a row that hasn't any data, the current reader implementation interprets that is the end of the table and doesn't read next rows.

I understand that this can be the desired behavior in some cases where after the blank row it follows another information that is not part of the table. But I don't think that this is the expected behavior.

It will be nice to have at least a parameter in reader options like 'endTableParsingOnFirstBlankRow'

lujop avatar May 07 '21 10:05 lujop

no more parameters :) an excel table can be full of all kinds of junk. We can't take ownership of a user's unbounded sloppiness, so I don't think we should try too hard to guess what they meant with their poor quality table. Excel is a very good environment for cleaning data issues like this.

lwhite1 avatar May 08 '21 16:05 lwhite1

You're right @lwhite1 , the user can fill the data badly in an infinite number of different ways and it can be a nightmare of parameters to cover all cases. But it's also true that the current table identification logic is quite poor. For example today I've encountered that in an excel like this it doesn't detect any table

col2 col3
1 2 3
1 2 3

I've thought a little about the best solution and I think that will be to expose an option to be able to pass the table area coordinates as options. That way it will be only a parameter and will allow custom logic or preprocess to identify the tables giving a more intelligent approach and don't pollute with lots of options.

lujop avatar May 11 '21 16:05 lujop

I like the proposal. Being able to optionally specify the coordinates would satisfy most use cases without complex reader logic.

miho avatar Jun 16 '21 05:06 miho

@lwhite1 Any possibility to add just add a parameter to cover table coordinates? I think that will be very useful and will allow major usecases without entering a parameters and logic nightmare as you said.

lujop avatar Aug 22 '21 18:08 lujop

But it's also true that the current table identification logic is quite poor.
For example today I've encountered that in an excel like this it doesn't detect any table

  col2 col3
1 2 3
1 2 3

I assume this doesn't work because the first value in the table header is empty, but that doesn't seem to be the problem you're addressing here.

lwhite1 avatar Aug 23 '21 21:08 lwhite1

@lwhite1 Any possibility to add just add a parameter to cover table coordinates? I think that will be very useful and will allow major usecases without entering a parameters and logic nightmare as you said.

I guess I would like to understand what you're doing that requires automated handling of very messy spreadsheets, and also what your proposal is to handle empty rows and columns within the specified coordinates. Every time we add a feature to handle poor quality data we make Tablesaw more complex and harder to maintain.

lwhite1 avatar Aug 23 '21 21:08 lwhite1

I'm running into similar issues. I think calling a spreadsheet with an empty row, or a blank cell in the upper left corner "very messy" or "poor quality" is extreme language. But I do understand the maintainability problem.

wcneill avatar Feb 09 '22 12:02 wcneill