
[BUG] Spark Excel not reading whole columns and is only reading specific data address ranges

Open bitbythecron opened this issue 10 months ago • 6 comments

Am I using the newest version of the library?

  • [x] I have made sure that I'm using the latest version of the library.

Is there an existing issue for this?

  • [x] I have searched the existing issues

Current Behavior

Java app here using the Spark Excel library to read an Excel file into a Dataset<Row>. When I use the following configuration:

String filePath = "file:///Users/myuser/example-data.xlsx";
Dataset<Row> dataset = spark.read()
        .format("com.crealytics.spark.excel")
        .option("header", "true")
        .option("inferSchema", "true")
        .option("dataAddress", "'ExampleData'!A2:D7")
        .load(filePath);
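
For context, `spark` above is just a plain SparkSession, created along these lines (the app name and local master are specific to my test setup):

    import org.apache.spark.sql.SparkSession;

    SparkSession spark = SparkSession.builder()
            .appName("excel-read")   // name is arbitrary
            .master("local[*]")      // local test setup
            .getOrCreate();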

This works beautifully and my Dataset<Row> is instantiated without any issues whatsoever. But the minute I tell it to read the entire columns A through D, it comes back as an empty Dataset<Row>:

// dataset will be empty
.option("dataAddress", "'ExampleData'!A:D")

This also happens if I set the sheetName and dataAddress separately:

// dataset will be empty
.option("sheetName", "ExampleData")
.option("dataAddress", "A:D")

And it also happens when, instead of providing the sheetName, I provide a sheetIndex:

// dataset will be empty; and I have experimented by setting it to 0 as well
// in case it is a 0-based index
.option("sheetIndex", 1)
.option("dataAddress", "A:D")

My question: is this expected behavior of the Spark Excel library, or is it a bug I have discovered, or am I not using the Options API correctly here?

Expected Behavior

As explained above, I would have expected all three option configurations to work, but only the first one does.

Steps To Reproduce

Code is provided above. I am pulling in the following Gradle libraries:

    implementation("org.apache.spark:spark-core_2.12:3.5.3")
    implementation("org.apache.spark:spark-sql_2.12:3.5.3")
    implementation("com.crealytics:spark-excel_2.12:3.5.1_0.20.4")
    implementation("com.databricks:spark-xml_2.12:0.18.0")

I am using a Java application (not Scala).

Environment

- Spark version: `3.5.3` (Scala 2.12)
- Spark-Excel version: `3.5.1_0.20.4` (Scala 2.12)
- OS: macOS Sequoia 15.3
- Cluster environment

Anything else?

No response

bitbythecron avatar Feb 06 '25 16:02 bitbythecron

Hi @bitbythecron, as you are familiar with Java: can you try what happens with the same file if you read it with Apache POI, manually create an AreaReference from your dataAddress, and query it for the min and max row and column numbers?
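
Something along these lines, as a minimal untested sketch (assuming POI 5.x on the classpath and the `ExampleData` sheet name from your snippets):

    import org.apache.poi.ss.SpreadsheetVersion;
    import org.apache.poi.ss.util.AreaReference;
    import org.apache.poi.ss.util.CellReference;

    public class AreaReferenceCheck {
        public static void main(String[] args) {
            // parse the same address string you pass to spark-excel
            AreaReference area =
                    new AreaReference("'ExampleData'!A:D", SpreadsheetVersion.EXCEL2007);
            CellReference first = area.getFirstCell();
            CellReference last = area.getLastCell();
            // if the row bounds printed here look off, the problem is in how
            // POI parses the address rather than in spark-excel itself
            System.out.println("rows: " + first.getRow() + " .. " + last.getRow());
            System.out.println("cols: " + first.getCol() + " .. " + last.getCol());
        }
    }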

nightscape avatar Feb 07 '25 18:02 nightscape

Hi, I will try this suggestion. Please give me until tomorrow and I will post the results.

bitbythecron avatar Feb 08 '25 11:02 bitbythecron

Hi @nightscape, apologies that it took a few more days than expected to get this to you.

Please see this gist. You would need to modify the filePath and dataAddress to replicate it on your end, but I can tell you: this seems to be a limitation with Apache POI, not spark-excel.

When I use this line:

String dataAddress = "'Price Sheet'!A2:D7";

The program reads the file and correctly reads all the data between A2 and D7.

But when I change that to just read all data in the columns:

String dataAddress = "'Price Sheet'!A:D";

It doesn't throw an error, but it comes back empty.

With Apache POI being as mature as it is, I'm leaning towards this being expected behavior, albeit a disappointing one.

Can you think of any way around this limitation? In production, my app will be given Excel files without knowing the exact addresses/areas/bounds of the data to read.

This feels like a pretty huge limitation of POI to be honest! I mean, how many times are you going to know (ahead of time) the exact rows in the Excel file you are processing?!

Thanks for any input/advice here!

bitbythecron avatar Feb 11 '25 14:02 bitbythecron

Have you tried using a very high end address? I'm not 100% sure, but it might be that this just works.
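
For example, something like this (1,048,576 being the maximum row count of an .xlsx sheet, so the range covers everything):

    .option("dataAddress", "'ExampleData'!A2:D1048576")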

nightscape avatar Feb 11 '25 14:02 nightscape

Well, I guess that would work for most use cases, but I'm leaning more towards something like this:

  1. Use POI directly to read the row count of the specific columns I'm interested in
  2. Convert this row count into an actual data address range. For instance, if the row count is 10 and I know I want to read columns A through D and ignore the header, that would translate to "A2:D11"
  3. Then use spark-excel with reader options that are known to work (rough sketch below)
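
Roughly, steps 1 and 2 as an untested sketch (class and method names are placeholders; assumes POI is on the classpath):

    import java.io.File;
    import java.io.IOException;
    import org.apache.poi.ss.usermodel.Sheet;
    import org.apache.poi.ss.usermodel.Workbook;
    import org.apache.poi.ss.usermodel.WorkbookFactory;

    public class DataAddressHelper {
        // e.g. returns "'ExampleData'!A2:D11" for a header row plus 10 data rows
        public static String resolveDataAddress(String path, String sheetName) throws IOException {
            try (Workbook workbook = WorkbookFactory.create(new File(path), null, true)) { // read-only
                Sheet sheet = workbook.getSheet(sheetName); // assumes the sheet exists
                // getLastRowNum() is 0-based, so +1 gives the 1-based Excel row number
                int lastExcelRow = sheet.getLastRowNum() + 1;
                // skip the header (row 1) and read columns A through D
                return "'" + sheetName + "'!A2:D" + lastExcelRow;
            }
        }
    }

The resulting string would then go into spark-excel via .option("dataAddress", ...), as in the first snippet.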

I hate reading the same file twice, but I think that's what I may be stuck with if you can't think of anything else. No worries either way, and thanks for the quick responses so far!

bitbythecron avatar Feb 11 '25 20:02 bitbythecron

If you want, you could create a PR and add a method to read the data address range from the WorkbookReader, as is already done for getting the sheetNames. That wouldn't help performance-wise, but you wouldn't have to use POI directly. It might also be that the data address is stored as metadata, so hopefully POI doesn't have to read the entire sheet to find the maximum address.

nightscape avatar Feb 11 '25 22:02 nightscape