neo4j-apoc-procedures Incoherent behaviours when some header cells are empty when using `apoc.load.xls`

Incoherent behaviours when some header cells are empty when using `apoc.load.xls`

Open pjeannet opened this issue 2 years ago • 1 comments

Expected Behavior (Mandatory)

When using apoc.load.xls to load an excel file, we want all columns to be loaded even if one of the header cell is empty. In case of empty header, the missing value could be replaced by an optionnal prefix concatenated with the header index.

Actual Behavior (Mandatory)

If the first header cell of the file (A1) is null/empty, then the first column is ignored and not loaded. If several consecutive headers are empty then all these columns are skipped. If A1 is not empty but other header cell is empty then following error is thrown.

Failed to invoke procedure `apoc.load.xls`: Caused by: java.lang.IllegalStateException: Header at position 1 doesn't have a value

If A1 is not empty but other header cell is empty and there is some styling applied (I couldn't pinpoint which one exactly as some work and others don't) then the columns are loaded.

How to Reproduce the Problem

Try to load the different tabs from the provided file. Each 4 tabs reproduce one of the above in same order.

Simple Dataset (where it's possibile)

CALL apoc.load.xls("test.xlsx", "test1")
YIELD list as row
RETURN row limit 2

test.xlsx

Steps (Mandatory)

1.CALL apoc.load.xls("test.xlsx", "test1") YIELD list as row RETURN row limit 2 1.CALL apoc.load.xls("test.xlsx", "test2") YIELD list as row RETURN row limit 2 1.CALL apoc.load.xls("test.xlsx", "test3") YIELD list as row RETURN row limit 2 1.CALL apoc.load.xls("test.xlsx", "test4") YIELD list as row RETURN row limit 2

Screenshots (where it's possibile)

Specifications (Mandatory)

This was tested on the following versions :

Versions

OS: Windows & Linux
Neo4j: 4.3.x, 4.4.x
Neo4j-Apoc: 4.3.0.0, 4.4.0.1

Propositions

After looking at the code I have the following propositions :

this line should be changed as the first row could contain empty values. From what I understood getting the index of the first column containing data is difficult due to the way excel store data, maybe we could just add an option to the procedure allowing to force first and last column to import.
We could also edit the getHeader function. A cell is only null if it does not contain data nor is formated. We should either also consider an empty formated cell as null (and document that apoc won't work with empty headers) or create a placeholder value for these empty headers.

Dec 21 '21 12:12 pjeannet

We run into another issue yesterday while importing an excel file with a single column containing empty cells. We noticed that the load would stop at the first empty cell. We beleive it may be the same bug as the one I discribed previously. You can test this by creating an excel file containing following data.

test
row1
row2

row4

row6

Calling CALL apoc.load.xls("filename","tabname") will then only load row 1 and row 2 (default ignore headers)

EDIT : I took a look at the procedure code and this part is causing the issue. When the row is empty the function tryAdvance will return false and stops the stream. Maybe we should make use of the limit property that is computed using sheet.getLastRowNum.
This is untested code, I usually don't do java

public boolean tryAdvance(Consumer<? super XLSResult> action) {
            try {
                Row row = sheet.getRow((int)lineNo);
                if (row != null && lineNo <= limit) {
                    Object[] list = extract(row, selection);
                    action.accept(new XLSResult(header, list, lineNo-skip, ignore,mapping, nullValues));
                    lineNo++;
                    return true;
                }
                if((int)lineNo != limit) {
                  return true;
                }
                return false;
            } catch (Exception e) {
                throw new RuntimeException("Error reading XLS from URL " + cleanUrl(url) + " at " + lineNo, e);
            }
        }

Jul 07 '22 09:07 pjeannet

Thanks for the detailed description. It will be resolved in release 5.17.1 where it will be possible to pass an additional parameter skipNulls: true e.g. call apoc.load.xls('file','',{skipNulls: true})

Feb 26 '24 09:02 RobertoSannino

Hello, thanks for the update. I'll do some testing once this is available. If I understood correctly, null cells won't be skipped anymore by default and the new skipNulls parameter will be there for those that want the old behavior?

EDIT : I went and read the associated commit, I see that you also created a firstCellNum and lastCellNum which will also help covering most use cases.

Feb 26 '24 09:02 pjeannet

neo4j-apoc-procedures neo4j-apoc-procedures copied to clipboard

Incoherent behaviours when some header cells are empty when using `apoc.load.xls`

Expected Behavior (Mandatory)

Actual Behavior (Mandatory)

How to Reproduce the Problem

Simple Dataset (where it's possibile)

Steps (Mandatory)

Screenshots (where it's possibile)

Specifications (Mandatory)

Versions

Propositions

neo4j-apoc-procedures
neo4j-apoc-procedures copied to clipboard