spark-excel
[BUG] Data is not being read using streaming approach.
Is there an existing issue for this?
- [X] I have searched the existing issues
Current Behavior
When I read an Excel file using the streaming Excel reader (with the maxRowsInMemory option set), no data is read from the file. This happens for Excel files whose dimension section contains an open-ended data address, for example A1.
It looks like the problem is in this method:
    private def rowIndices(sheet: Sheet): Range =
      (math.max(dataAddress.getFirstCell.getRow, sheet.getFirstRowNum) to
        math.min(dataAddress.getLastCell.getRow, sheet.getLastRowNum))
When the dimension doesn't have a bottom-right end, sheet.getLastRowNum returns its default value of 0. The math.min call then caps the upper bound of the range at 0, which cuts off all the rows in the sheet.
I would suggest fixing it like this:
    private def rowIndices(sheet: Sheet): Range =
      (math.max(dataAddress.getFirstCell.getRow, sheet.getFirstRowNum) to
        dataAddress.getLastCell.getRow)
because it is not always possible to figure out the last row num using the dimension field.
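To make the difference concrete, here is a minimal standalone sketch of the two bound calculations. It does not use Spark Excel or POI: the object and function names are hypothetical, plain integers stand in for the values returned by dataAddress and sheet, and the input values are assumptions chosen to mirror the case described above (a data address spanning every row of an .xlsx sheet, and a sheet reporting 0 as its last row).

```scala
object RowIndicesSketch {
  // Mirrors the current logic: the upper bound is clamped to the
  // last row number reported by the sheet.
  def currentRowIndices(addrFirstRow: Int, addrLastRow: Int,
                        sheetFirstRow: Int, sheetLastRow: Int): Range =
    math.max(addrFirstRow, sheetFirstRow) to math.min(addrLastRow, sheetLastRow)

  // Mirrors the suggested logic: the upper bound comes from the
  // data address only, so the sheet's reported last row is ignored.
  def proposedRowIndices(addrFirstRow: Int, addrLastRow: Int,
                         sheetFirstRow: Int): Range =
    math.max(addrFirstRow, sheetFirstRow) to addrLastRow

  def main(args: Array[String]): Unit = {
    // Assumed inputs: 0-based row indices, with 1048575 being the
    // last row index of an .xlsx sheet (1,048,576 rows in total).
    val addrFirstRow = 0
    val addrLastRow = 1048575
    val sheetFirstRow = 0
    val sheetLastRow = 0 // what the streaming sheet reports in this scenario

    val current = currentRowIndices(addrFirstRow, addrLastRow, sheetFirstRow, sheetLastRow)
    val proposed = proposedRowIndices(addrFirstRow, addrLastRow, sheetFirstRow)

    println(s"current:  ${current.start} to ${current.end} (${current.size} row index)")
    println(s"proposed: ${proposed.start} to ${proposed.end} (${proposed.size} row indices)")
  }
}
```

With these assumed inputs the current calculation yields a range containing only row index 0, while the proposed one keeps the full span of the data address. The trade-off is that the reader would then have to tolerate an upper bound beyond the last row that actually exists in the file.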
Expected Behavior
All data is supposed to be read.
Steps To Reproduce
Just a simple read from an Excel file generated using the streaming-excel-reader library, with random data.
The file it generates contains an A1 value in the dimension record.
Environment
- Spark version: 3.1.3
- Spark-Excel version: 0.18.5
- OS: macOS
Anything else?
No response
Hi, can you provide a small Excel file that demonstrates this behavior?
Sure. This one should work. The problem is reproducible with any file generated using the Streaming Excel Workbook library when it is read with the maxRowsInMemory parameter (i.e. using the same library). Spark Excel simply doesn't account for getLastRowNum being set to zero in that case.