tablesaw icon indicating copy to clipboard operation
tablesaw copied to clipboard

XlsxReader casting integers cells to blanks

Open Afollet opened this issue 4 years ago • 6 comments

Hello devs,

Thanks for making/maintaining this library. I am currently in a microservice that validates the data in excel spreadsheets.

I'm reading an excel workbook with multiple sheets in tablesaw-excel.

public static List<Table> readXlsx(InputStream inputStream) throws IOException {
    Source source = new Source(inputStream);
    XlsxReadOptions options = XlsxReadOptions.builder(source).build();
    XlsxReader reader = new XlsxReader();
    return reader.readMultiple(options);
  }

I have cell types in Excel of both numeric and text in a column. When the table is read in, the numeric cell types become blank cells. Therefore, when I try to pull up missing values I get the rows where the numeric type cells are at

Table missing = table.where(table.column("myColumn").isMissing());

Is there some way I can change the behavior of this using XlsxReadOptions? Or is there another method I can use to avoid this behavior?

Thanks!

Afollet avatar Feb 05 '21 12:02 Afollet

So, I made some changes to the XlsxReader to change the behavior as I needed ( I also made some changes so it's easier to use the code base in our project): https://github.com/PDXFinder/pdx-validator/blob/dev/src/main/java/org/pdxfinder/validator/ValidatorXlsxReader.java

I made two changes:

  1. Change the method "appendValue" to return a string representation of a numeric cell if the column type is String.
  2. Change "isBlank" to not count numeric cells with a value of 0 as blank

I added a basic test for this behavior: https://github.com/PDXFinder/pdx-validator/blob/dev/src/test/java/org/pdxfinder/ValidatorXlsxReaderTest.java

I would be willing to make a PR for these changes on this repo. I was not sure if the maintainers are interested in these changes, so let me know.

Afollet avatar Apr 07 '21 11:04 Afollet

I think that it will be solved with #909 if we also do an improvement in type detection like commented on #751

With #909 the reader it's able to read number values in String columns And if #751 it's solved will be a more consistent type detection that doesn't depend on the first cell type

lujop avatar May 07 '21 13:05 lujop

@Afollet If you have time to check and can test with the current master version I think that this issue must be solved

lujop avatar May 11 '21 19:05 lujop

@lujop Can I pull down the snapshot of this version of master? I think that would be easiest for me to test it, but I'm was not able to pull it on https://oss.sonatype.org/

Could you advise me on how to use maven to test these changes?

Afollet avatar May 12 '21 13:05 Afollet

@Afollet you can just clone the repo and do a mvn install and it will build and install the tablesaw jar in your local maven repository. Then use the dependency with the new version as normal and while in the same machine maven will take the dependency from your local repository.

lujop avatar May 12 '21 15:05 lujop

@lujop Yes, I have tested your changes and it has resolved this issue.

Thanks for the great work @lujop . Dev's like you make the world go round.

Afollet avatar May 13 '21 12:05 Afollet