spark-excel
[BUG] Data is not being read using streaming approach.
Is there an existing issue for this?
- [X] I have searched the existing issues
Current Behavior
When I read an Excel file using the streaming Excel reader (with the maxRowsInMemory option set), no data is read from the file. This happens for Excel files whose dimension section contains an open-ended data address, for example A1.
It looks like the problem is in this method:
    private def rowIndices(sheet: Sheet): Range =
      (math.max(dataAddress.getFirstCell.getRow, sheet.getFirstRowNum) to
        math.min(dataAddress.getLastCell.getRow, sheet.getLastRowNum))
When the dimension record doesn't have a bottom-right end, sheet.getLastRowNum returns its default value of 0, which cuts off all the rows in the sheet.
I would suggest fixing it like this:
    private def rowIndices(sheet: Sheet): Range =
      (math.max(dataAddress.getFirstCell.getRow, sheet.getFirstRowNum) to
        dataAddress.getLastCell.getRow)
because it is not always possible to determine the last row number from the dimension field.
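To make the difference concrete, here is a minimal, dependency-free sketch of the two range computations. It is not the spark-excel source: the object and function names are hypothetical, plain Ints stand in for the values normally taken from dataAddress and the POI sheet, and it assumes the open-ended data address resolves to the last possible XLSX row (index 1048575) while the streaming sheet reports 0 as its last row.

```scala
// Hypothetical stand-in for the row-range logic; plain Ints replace the POI objects.
object RowIndicesSketch {
  // Mirrors the current logic: the upper bound is clamped to the sheet's reported last row.
  def currentRowIndices(addrFirst: Int, addrLast: Int, sheetFirst: Int, sheetLast: Int): Range =
    math.max(addrFirst, sheetFirst) to math.min(addrLast, sheetLast)

  // Mirrors the proposed logic: the upper bound comes from the data address only.
  def proposedRowIndices(addrFirst: Int, addrLast: Int, sheetFirst: Int): Range =
    math.max(addrFirst, sheetFirst) to addrLast

  // Assumption: an open-ended address such as A1 extends to the last XLSX row (zero-based).
  val maxXlsxRow = 1048575

  // Assumption: the streaming sheet reports 0 for both its first and last row.
  val current: Range = currentRowIndices(0, maxXlsxRow, 0, 0)
  val proposed: Range = proposedRowIndices(0, maxXlsxRow, 0)
}
```

Under these assumed values, RowIndicesSketch.current covers only row 0, so every row after it is filtered out, whereas RowIndicesSketch.proposed still contains all later row indices.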
Expected Behavior
All data is supposed to be read.
Steps To Reproduce
A simple read of an Excel file generated with the streaming-excel-reader library, filled with random data, is enough. The file it generates contains an A1 value in its dimension record.
Environment
- Spark version: 3.1.3
- Spark-Excel version: 0.18.5
- OS: MacOS
Anything else?
No response
Hi, can you provide a small excel file that demonstrates this behavior?
Sure. This one should work. The issue is reproducible with any file generated using the Streaming Excel Workbook library when it is read with the maxRowsInMemory parameter (i.e. using the same library). Spark Excel simply doesn't account for getLastRowNum being set to zero in that case.