spark-excel Use takeWhile method from Range

The symptoms: I have a file with ~1 million rows, 125 columns. It takes ~12 seconds to count lines with spark-excel's API V1 and ~2 minutes with API V2.

The issue:

Range does not contain own optimized method filter, that's why it uses method from TraversableLike which iterates over each number in range.
r.getLastCellNum evaluated for each number in range.

Here are some rough benchmarks with another file: filter => 50 seconds val lastCellNum => 38 seconds withFilter => 20 seconds takeWhile => 12 seconds API V1 => 12 seconds (File taken from here and manually converted to "xlsx")

PS. API V2 seems great! :)

Mar 12 '23 01:03 EnverOsmanov

Hi @EnverOsmanov, thanks for the PR! I'm slightly worried that .takeWhile slightly modifies the semantics to stop after having found the first non-matching index. At the moment this doesn't matter because we're using a Range where the colInds are sorted, but should someone change this to an unsorted Seq[Int] it would break silently. Of course the real performance benefit outweighs the maybe-in-some-not-too-likely-future-scenario breakage, but if you find a fast version with identical semantics it would be nicer. Could you maybe try the following?

val lastCellNum = r.getLastCellNum
colInd
  .iterator
  .filter(_ < lastCellNum)

Mar 13 '23 08:03 nightscape

If coldInd would be unsorted Seq[Int] and some columns would be missing, it should break test cases.

Benchmarks: .iterator.filter => 31 seconds .view.filter => 2 minutes

Here is the code how I read the data.

Btw, I just checked the content of colInd for the file and I see that it is "1 to 16383" while lastCellNum is 23. The reason of big range is that I specified in "dataAddress" only the starting cell.

Mar 13 '23 09:03 EnverOsmanov

The alternative approach to avoid iteration over full colInd is in API V1:

.map(_.cellIterator().asScala.filter(c => colInd.contains(c.getColumnIndex)).toVector)

But I'm not exactly sure what was the idea behind the change in V2.

Mar 13 '23 10:03 EnverOsmanov

Hmm, maybe it is to be able to do the

r.getCell(_, MissingCellPolicy.CREATE_NULL_AS_BLANK)

@quanghgx could you chime in here?

Mar 13 '23 16:03 nightscape

If colInd would be unsorted Seq[Int] we should sort it once. Otherwise we would be filtering on full collection for each row.

Mar 16 '23 06:03 EnverOsmanov