spout icon indicating copy to clipboard operation
spout copied to clipboard

Investigate speed optimizations

Open adrilo opened this issue 6 years ago • 3 comments

When reading an XLSX file, we can probably optimize a few things:

  • don't read the styles XML file if we don't need styles (i.e. if date formatting is off)
  • parse/split the shared strings XML files on demand, as opposed to do it all at once

adrilo avatar Apr 11 '18 14:04 adrilo

parse/split the shared strings XML files on demand, as opposed to do it all at once

This is a current case for me. We use Spout for importing files. Only the first sheet is imported. What the users do a lot is, that they have 2-10 sheets in a workbook and just swap them for imports. Since Spout selects the caching strategy based on the number of shared strings and the memory available - it seems that the slow filesystem strategy (in my current case 8sec memory vs 90sec fs read time) is picked - even though only a small subset of the workbook is actually read.

So... I am motivated to do some work regarding the shared strings. But - this is somewhat the dark magic layer of Spout. Some hints and guidance would be great. I will however try to dig into it.

madflow avatar Aug 12 '19 14:08 madflow

For the shared strings, here is how it works: Read the "sharedStrings.xml" file. Get how many strings there are. Based on this number and the available amount of memory, pick the in-memory or on-disk strategy: https://github.com/box/spout/blob/master/src/Spout/Reader/XLSX/Manager/SharedStringsManager.php#L114

The algorithm is pretty straightforward and described here: https://github.com/box/spout/blob/master/src/Spout/Reader/XLSX/Manager/SharedStringsCaching/CachingStrategyFactory.php#L13

The in-memory strategy is easy to understand: it's a big hash map (index => string value). For the on-disk strategy, the values are split into big chunks and written in temp folder (each chunk contains 10,000 strings: https://github.com/box/spout/blob/master/src/Spout/Reader/XLSX/Manager/SharedStringsCaching/CachingStrategyFactory.php#L51). This is so that a chunk can be fully loaded in memory. When reading a shared string (the shared string index is referenced in the sheetN.xml), we compute a modulo to guess which chunk the shared string belongs to. The chunk is then loaded in memory: https://github.com/box/spout/blob/master/src/Spout/Reader/XLSX/Manager/SharedStringsCaching/FileBasedStrategy.php#L128.

The setup works best when same shared strings are used close to each other, so that they can be in the same chunk. Otherwise that causes the current chunk to be unloaded and a new chunk to be read from disk and loaded, which takes time.

adrilo avatar Aug 20 '19 08:08 adrilo

I wanted to speedup the reading of some big ODS files, so I profiled Spout, and benchmarked a few simple optimizations. Overall, I obtained a 16 % speedup (PHP 7.4, Debian, dev computer). YMMV, but for medium or big files, I expect similar gains in any environment.

You can see the changes I made on this branch: https://github.com/box/spout/compare/master...mytskine:performance I can submit a PR if you're interested, or you can just pick up whatever interests you.

I didn't change the algorithms in any way. On the critical path, some method calls grows grow linearly with the number of cells. Since PHP function calls are expensive, I mostly inlined a few shorts methods (1 or 2 lines long). The drawback is a small decrease in code readability in a few methods. I used PHP-SPX to profile (great tool to find the critical path) and PHPBench to benchmark, both on command line.

Results for a 2400 cells ODS file (times normalized with PHPBench):

  • before: 86.432 ms
  • after: 72.798 (15.8 % better)

For the file ods_with_one_million_rows.ods from tests/resources/ods/:

  • before: 118,404 ms
  • after: 99,234 ms (16.2 % better)

mytskine avatar Jan 07 '22 15:01 mytskine