Use takeWhile method from Range
The symptoms: I have a file with ~1 million rows, 125 columns. It takes ~12 seconds to count lines with spark-excel's API V1 and ~2 minutes with API V2.
The issue:
-
Rangedoes not contain own optimized methodfilter, that's why it uses method fromTraversableLikewhich iterates over each number in range. -
r.getLastCellNumevaluated for each number in range.
Here are some rough benchmarks with another file: filter => 50 seconds val lastCellNum => 38 seconds withFilter => 20 seconds takeWhile => 12 seconds API V1 => 12 seconds (File taken from here and manually converted to "xlsx")
PS. API V2 seems great! :)
Hi @EnverOsmanov, thanks for the PR!
I'm slightly worried that .takeWhile slightly modifies the semantics to stop after having found the first non-matching index.
At the moment this doesn't matter because we're using a Range where the colInds are sorted, but should someone change this to an unsorted Seq[Int] it would break silently.
Of course the real performance benefit outweighs the maybe-in-some-not-too-likely-future-scenario breakage, but if you find a fast version with identical semantics it would be nicer.
Could you maybe try the following?
val lastCellNum = r.getLastCellNum
colInd
.iterator
.filter(_ < lastCellNum)
If coldInd would be unsorted Seq[Int] and some columns would be missing, it should break test cases.
Benchmarks: .iterator.filter => 31 seconds .view.filter => 2 minutes
Here is the code how I read the data.
Btw, I just checked the content of colInd for the file and I see that it is "1 to 16383" while lastCellNum is 23. The reason of big range is that I specified in "dataAddress" only the starting cell.
The alternative approach to avoid iteration over full colInd is in API V1:
.map(_.cellIterator().asScala.filter(c => colInd.contains(c.getColumnIndex)).toVector)
But I'm not exactly sure what was the idea behind the change in V2.
Hmm, maybe it is to be able to do the
r.getCell(_, MissingCellPolicy.CREATE_NULL_AS_BLANK)
@quanghgx could you chime in here?
If colInd would be unsorted Seq[Int] we should sort it once. Otherwise we would be filtering on full collection for each row.