Output each char individually, avoid cost of appending to static string
This loop in LanguageModel.lua
```lua
for i = 1, encoded:size(1) do
  local idx = encoded[i]
  local token = self.idx_to_token[idx]
  s = s .. token
end
```
is a killer for performance with strings longer than about 100,000 characters. As in a lot of languages, strings in Lua are immutable, so appending to one requires allocating a new chunk of memory, copying the entire previous string into it, and adding the new character at the end. Of course implementations try to mitigate this by doubling the size of the backing array on each reallocation, but while that performs well in theory, in practice it still kills the run time. Here are some perf numbers for various sample sizes running the original code (model_type: lstm, rnn_size: 256, layers: 3, temp: 0.9):
| Sample size | real | user | sys |
|------------:|-----:|-----:|----:|
| 100 | 0m2.415s | 0m1.932s | 0m0.446s |
| 1,000 | 0m3.282s | 0m2.797s | 0m0.454s |
| 10,000 | 0m12.441s | 0m11.869s | 0m0.571s |
| 100,000 | 1m48.001s | 1m46.454s | 0m1.514s |
| 200,000 | 3m46.629s | 3m41.396s | 0m5.102s |
| 400,000 | 8m24.205s | 7m55.585s | 0m28.352s |
| 600,000 | 13m58.980s | 12m53.891s | 1m4.605s |
| 800,000 | 20m23.200s | 18m18.093s | 2m4.343s |
| 1,000,000 | 28m8.949s | 24m44.892s | 3m23.126s |

```
/home/username/torch/install/bin/luajit: not enough memory
```
For some reason it runs out of memory at 1,000,000 characters, so the 15.6x slowdown compared to 100,000 characters probably understates what it would be if the run actually finished. Here are the perf numbers for this pull request, which writes each character out as soon as it is generated and so avoids repeated appends to an ever-growing string:
| Sample size | real | user | sys |
|------------:|-----:|-----:|----:|
| 100 | 0m2.439s | 0m1.923s | 0m0.505s |
| 1,000 | 0m3.308s | 0m2.844s | 0m0.460s |
| 10,000 | 0m12.161s | 0m11.608s | 0m0.501s |
| 100,000 | 1m43.001s | 1m41.889s | 0m1.010s |
| 200,000 | 3m22.384s | 3m20.563s | 0m1.675s |
| 400,000 | 6m42.487s | 6m39.472s | 0m2.853s |
| 600,000 | 11m0.262s | 10m53.453s | 0m6.417s |
| 800,000 | 13m12.612s | 13m5.283s | 0m6.897s |
| 1,000,000 | 18m1.585s | 17m50.051s | 0m11.045s |
With smaller samples the start-up time dominates and there's not much difference. At around 100,000 characters the new code starts to show real improvements. At 800,000 characters the old code is 11.3x as slow as it is at 100,000 characters, and it only gets worse as the numbers grow. The new code is 7.7x as slow, slightly better than the 8x you would expect from linear scaling alone.
There are ways to keep the string in memory and only output it at the end efficiently, but writing the characters out immediately is helpful when first learning torch-rnn/torch, as well as when tuning the temperature parameter on a newly trained RNN.
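For anyone reading along, here's a rough sketch of the two approaches (not the actual patch; `decode_to_string` and `decode_to_stdout` are made-up names, and `encoded`/`idx_to_token` stand in for the fields used in LanguageModel.lua):

```lua
-- Sketch only, not the code from this PR.

-- Keep the sample in memory: append tokens to a table and join once at the
-- end. table.concat builds the final string in one pass instead of
-- reallocating and copying on every append.
local function decode_to_string(encoded, idx_to_token)
  local pieces = {}
  for i = 1, encoded:size(1) do
    pieces[#pieces + 1] = idx_to_token[encoded[i]]
  end
  return table.concat(pieces)
end

-- Write each character as soon as it is decoded, so no large string is ever
-- built. This is roughly what this PR does.
local function decode_to_stdout(encoded, idx_to_token)
  for i = 1, encoded:size(1) do
    io.write(idx_to_token[encoded[i]])
  end
  io.flush()
end
```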
One thing I see your commit does not address is printing the start text to the output. I would love to merge this into my fork if you fix that.
Updated to write out the start text before anything else (other than the verbose stuff).
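For reference, the change amounts to something like this (a sketch, not the exact commit; it assumes the sampling script exposes the priming text as `opt.start_text`):

```lua
-- Sketch: echo the priming text before any generated characters.
if opt.start_text and #opt.start_text > 0 then
  io.write(opt.start_text)
end
-- ...then io.write() each sampled character as it is generated.
```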
Thanks. I'm going to work on smoothly integrating it with a couple of other output-related PRs and then merge it.
Cool. I'm also working on more performance improvements, but I'm finding Lua tooling to be somewhat non-existent.