Output each char individually, avoid cost of appending to static string
This loop in LanguageModel.lua
```lua
for i = 1, encoded:size(1) do
  local idx = encoded[i]
  local token = self.idx_to_token[idx]
  s = s .. token
end
```
is a killer for performance with strings longer than about 100,000 characters. As in a lot of languages, strings in Lua are immutable, so appending to one requires allocating a new chunk of memory, copying the entire previous string into it, and adding the new character at the end. Of course implementations try to mitigate this by doubling the size of the backing array on each reallocation, but while that performs well in theory, in practice it still kills the run time. Here are some perf numbers for various sample sizes running the original code (model_type: lstm, rnn_size: 256, layers: 3, temp: 0.9):
| Sample size | real | user | sys |
|------------:|-----:|-----:|----:|
| 100 | 0m2.415s | 0m1.932s | 0m0.446s |
| 1,000 | 0m3.282s | 0m2.797s | 0m0.454s |
| 10,000 | 0m12.441s | 0m11.869s | 0m0.571s |
| 100,000 | 1m48.001s | 1m46.454s | 0m1.514s |
| 200,000 | 3m46.629s | 3m41.396s | 0m5.102s |
| 400,000 | 8m24.205s | 7m55.585s | 0m28.352s |
| 600,000 | 13m58.980s | 12m53.891s | 1m4.605s |
| 800,000 | 20m23.200s | 18m18.093s | 2m4.343s |
| 1,000,000 | 28m8.949s | 24m44.892s | 3m23.126s |

```
/home/username/torch/install/bin/luajit: not enough memory
```
For some reason it runs out of memory at 1,000,000 characters, so the 15.6x slowdown compared to 100,000 characters probably understates what it would be if the run actually finished. Here are the perf numbers for this pull request, which writes each character out as soon as it is generated and so avoids repeated appends to an ever-growing string:
| Sample size | real | user | sys |
|------------:|-----:|-----:|----:|
| 100 | 0m2.439s | 0m1.923s | 0m0.505s |
| 1,000 | 0m3.308s | 0m2.844s | 0m0.460s |
| 10,000 | 0m12.161s | 0m11.608s | 0m0.501s |
| 100,000 | 1m43.001s | 1m41.889s | 0m1.010s |
| 200,000 | 3m22.384s | 3m20.563s | 0m1.675s |
| 400,000 | 6m42.487s | 6m39.472s | 0m2.853s |
| 600,000 | 11m0.262s | 10m53.453s | 0m6.417s |
| 800,000 | 13m12.612s | 13m5.283s | 0m6.897s |
| 1,000,000 | 18m1.585s | 17m50.051s | 0m11.045s |
With smaller samples the start-up time dominates and there's not much difference. At around 100,000 characters the new code starts to show real improvements. At 800,000 characters the old code is 11.3x as slow as it is at 100,000 characters, and it only gets worse as the numbers grow. The new code is 7.7x as slow, slightly better than the 8x you would expect from linear scaling alone.
There are ways to keep the string in memory and only output it at the end efficiently, but writing the characters out immediately is helpful when first learning torch-rnn/torch, as well as when tuning the temperature parameter on a newly trained RNN.
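For anyone reading along, here's a rough sketch of the two approaches (not the actual patch; `decode_to_string` and `decode_to_stdout` are made-up names, and `encoded`/`idx_to_token` stand in for the fields used in LanguageModel.lua):

```lua
-- Sketch only, not the code from this PR.

-- Keep the sample in memory: append tokens to a table and join once at the
-- end. table.concat builds the final string in one pass instead of
-- reallocating and copying on every append.
local function decode_to_string(encoded, idx_to_token)
  local pieces = {}
  for i = 1, encoded:size(1) do
    pieces[#pieces + 1] = idx_to_token[encoded[i]]
  end
  return table.concat(pieces)
end

-- Write each character as soon as it is decoded, so no large string is ever
-- built. This is roughly what this PR does.
local function decode_to_stdout(encoded, idx_to_token)
  for i = 1, encoded:size(1) do
    io.write(idx_to_token[encoded[i]])
  end
  io.flush()
end
```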
One thing I see your commit does not address is printing the start text to the output. I would love to merge this into my fork if you fix that.
Updated to write out the start text before anything else (other than the verbose stuff).
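For reference, the change amounts to something like this (a sketch, not the exact commit; it assumes the sampling script exposes the priming text as `opt.start_text`):

```lua
-- Sketch: echo the priming text before any generated characters.
if opt.start_text and #opt.start_text > 0 then
  io.write(opt.start_text)
end
-- ...then io.write() each sampled character as it is generated.
```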
Thanks. I'm going to work on smoothly integrating it with a couple of other output-related PRs and then merge it.
Cool. I'm also working on more performance improvements, but I'm finding Lua tooling to be somewhat non-existent.