
WIP: Memory Usage Improvements (to fix #65)


This is still WIP. Hoping for only minimal API changes.

kputnam avatar Jun 16 '19 22:06 kputnam

Reading an example 4.4 MB file on master has a peak memory usage of 921.5 MB. With the current work in progress, that's down to 782.23 MB. Most of the remaining problem is in the tokenizer, not the parser, which is a relief.

kputnam avatar Jun 16 '19 22:06 kputnam

The tokenization code was the biggest culprit for memory allocation. There weren't many retained objects that would suggest a leak, but the tokenizer allocated so much short-term garbage that the GC struggled to keep up. It also consumed a lot of stack frames, because it was written recursively rather than as an imperative loop.
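For illustration, here's a minimal sketch of the recursion-to-loop change (not stupidedi's actual code; read_segment is a hypothetical helper that returns the next token and the remaining input):

```ruby
# Hypothetical helper: returns [token, rest_of_input].
# def read_segment(input) ... end

# Recursive style: one stack frame per segment, and each frame pins
# intermediate objects until the whole file is consumed.
def each_segment_recursive(input, &block)
  return if input.empty?
  token, rest = read_segment(input)
  block.call(token)
  each_segment_recursive(rest, &block)
end

# Imperative style: constant stack depth, and short-lived locals the GC
# can reclaim promptly.
def each_segment(input)
  until input.empty?
    token, input = read_segment(input)
    yield token
  end
end
```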

The build for this branch is currently broken. I've made some drastic (but incomplete) changes that should reduce String allocations, along with other changes to further reduce the memory footprint of various classes. Once I have the parts working separately and unit tests written, I'll go about fitting the new tokenizer into the rest of the library. That might result in user-facing API changes, but I'm aiming to minimize that and provide shims where possible.
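As a sketch of the kind of String-allocation reduction involved (illustrative names only, not the library's actual classes): instead of slicing a new substring at each step, keep a cursor into the frozen input and allocate a String only when a token's text is actually needed.

```ruby
# Illustrative cursor over a frozen input string; the class and method
# names here are assumptions, not stupidedi's API.
class Cursor
  attr_reader :offset

  def initialize(input, offset = 0)
    @input  = input.freeze # shared, never mutated
    @offset = offset
  end

  # Advance past the next occurrence of `char` without allocating
  # a substring.
  def after(char)
    index = @input.index(char, @offset)
    index && Cursor.new(@input, index + 1)
  end

  # Allocate a String only when the token text is needed.
  def take(length)
    @input[@offset, length]
  end
end
```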

kputnam avatar Jun 22 '19 09:06 kputnam

Here are some preliminary results. The test script reads a file, hands it to the tokenizer, and iterates through each SegmentTok; no higher-level parsing is done.

In the output below, "Finish" is ultimately the metric that matters to the user, but the tokenizer's contribution to total memory use is closer to "Finish" - "Post-init". The tokenizer may actually consume even less, since definitions and other classes are lazily loaded while the tokenizer runs.

Some of the post-init memory could be consumed by reading the input file into memory. That could explain why the larger file has a larger post-init size compared to the smaller file.
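The script is roughly shaped like the sketch below; the tokenize entry point is a stand-in for the real tokenizer API, and the RSS helper is just one way to produce the checkpoints shown:

```ruby
#!/usr/bin/env ruby

# Resident set size of this process in MiB, via ps (Linux/macOS).
def rss_mib
  `ps -o rss= -p #{Process.pid}`.to_i / 1024
end

puts "Pre-init: #{rss_mib} MiB"
require "stupidedi"
puts "Post-init: #{rss_mib} MiB"

input   = File.read(ARGV[0])
started = Time.now

# `tokenize` is a stand-in for the real tokenizer entry point: iterate
# each SegmentTok without building a parse tree.
tokenize(input) do |segment_tok|
  # no-op: we only want to measure the tokenizer's own costs
end

puts "Finish: #{rss_mib} MiB"
puts "%.3f seconds" % (Time.now - started)
```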

master: 450KB input

» gtime -v ./memprof-tk.rb --fast file-s.x12 >/dev/null
Pre-init: 17 MiB
Post-init: 18 MiB
Finish: 44 MiB      <-- 26MB from tokenizer
1.992 seconds       <--
	Command being timed: "./memprof-tk.rb --fast file-s.x12"
	...
	Maximum resident set size (kbytes): 44516

wip: 450KB input

» gtime -v prof/memprof-tk.rb --fast prof/file-s.x12 >/dev/null
Pre-init: 17 MiB
Post-init: 18 MiB
Finish: 29 MiB      <-- 11MB from tokenizer
0.884 seconds       <--
	Command being timed: "prof/memprof-tk.rb --fast prof/file-s.x12"
	...
	Maximum resident set size (kbytes): 29284

The difference is more dramatic on larger files.

master: 4.4MB input

» gtime -v ./memprof-tk.rb --fast file-l.x12 >/dev/null
Pre-init: 17 MiB
Post-init: 22 MiB
Finish: 198 MiB      <-- 176MB from tokenizer
19.207 seconds       <--
	Command being timed: "./memprof-tk.rb --fast file-l.x12"
	...
	Maximum resident set size (kbytes): 199016

wip: 4.4MB input

» gtime -v prof/memprof-tk.rb --fast prof/file-l.x12 >/dev/null
Pre-init: 17 MiB
Post-init: 22 MiB
Finish: 33 MiB      <-- 11MB from tokenizer
7.932 seconds       <--
	Command being timed: "prof/memprof-tk.rb --fast prof/file-l.x12"
	User time (seconds): 7.79
	System time (seconds): 0.21
	Percent of CPU this job got: 99%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 0:08.07
	Average shared text size (kbytes): 0
	Average unshared data size (kbytes): 0
	Average stack size (kbytes): 0
	Average total size (kbytes): 0
	Maximum resident set size (kbytes): 33700

When I have time, I'd like to plot runtime and memory usage as file size increases. One important caveat: these gains may be diminished once the parser is changed to accommodate the new tokenizer.

There are further savings from not adding the FunctionalGroupDef to the config, which avoids loading all the SegmentDefs and ElementDefs and saves a few MB. With that change, the wip branch runs in 19 MB on the small file and 23 MB on the large one; subtracting the post-init size suggests the tokenizer itself uses 1 MB or less on both.

kputnam avatar Jun 23 '19 00:06 kputnam

OK, all the wires are plugged back in and the tests pass. This needs some cleanup work, and then it should sit for a little bit before merging. There are also specs, documentation, and examples that need to be written/updated. Hopefully I will also have something in place to track performance metrics so things won't backslide after these improvements.

I optimized as much as possible in Ruby, but eventually decided to write a small native extension to optimize some frequently used operations. The extension isn't required, but I'm not sure yet how to ignore build failures (e.g., on machines that don't have a compiler). Maybe there's a way to pass a flag to gem install to tell us not to build the extension?
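One common pattern for this (a sketch of the general technique, not necessarily what this gem will ship) is to have extconf.rb fall back to a no-op Makefile when compilation isn't possible, so gem install still succeeds and the pure-Ruby paths are used instead:

```ruby
# ext/stupidedi/extconf.rb -- sketch of an optional native extension
require "mkmf"

begin
  raise "no C compiler found" unless find_executable("cc") ||
                                     find_executable("gcc") ||
                                     find_executable("clang")
  create_makefile("stupidedi/native")
rescue => e
  warn "skipping native extension (#{e.message}); using pure-Ruby fallback"
  # A Makefile whose targets do nothing keeps `gem install` happy.
  File.write("Makefile", "all:\n\ttrue\ninstall:\n\ttrue\n")
end
```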

Previous benchmark results didn't account for tracking the position of each token (line, column, offset, etc.), which adds memory allocations and some CPU time. Overall, it didn't add much, and these improvements should still be noticeable to users.

The benchmark results at c128060 (positions aren't tracked) showed 985 MB of allocations (23,360,511 objects). Currently at 2a618646, with NoPosition used, there is 879 MB of allocations (21,367,064 objects). The current benchmark results (d06579ee) are:

wip: 450KB input

» gtime -v ./_/prof/memprof-tk.rb --0-pos --fast ./_/prof/file-s.x12 >/dev/null
Pre-init: 18 MiB
Post-init: 18 MiB
Finish: 29 MiB     <-- roughly no change from previous
0.933 seconds      <-- slight increase
	Command being timed: "./_/prof/memprof-tk.rb --0-pos --fast ./_/prof/file-s.x12"
	...
	Maximum resident set size (kbytes): 29308

wip: 4.4MB input

» gtime -v ./_/prof/memprof-tk.rb --0-pos --fast ./_/prof/file-l.x12 >/dev/null
Pre-init: 22 MiB
Post-init: 22 MiB
Finish: 33 MiB    <-- roughly no change
8.224 seconds     <-- slight increase
	Command being timed: "./_/prof/memprof-tk.rb --0-pos --fast ./_/prof/file-l.x12"
	...
	Maximum resident set size (kbytes): 32748

Here, --0-pos uses NoPosition as the position class, which tracks no information at all. That makes it directly comparable, feature-wise, to c128060b (which had no support for tracking token positions). Overall: no memory increase, and a slight increase in CPU time.

The position class is user-definable, so as much or as little information can be tracked. In the benchmark code, there are five options (a sketch of such a position class appears below the table):

| Option  | Tracked info               | Time  | Max RSS  |
|---------|----------------------------|-------|----------|
| --0-pos | nothing                    | 8.1s  | 32436 KB |
| --1-pos | offset                     | 8.2s  | 32768 KB |
| --2-pos | line, column               | 11.4s | 32868 KB |
| --3-pos | name, line, column         | 11.4s | 33136 KB |
| --4-pos | name, line, column, offset | 11.4s | 33176 KB |

Memory use and run time vary slightly from one run to another, so small differences between options aren't significant.
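To make "user-definable position class" concrete, here's a sketch of two such classes; the advance method name is an assumption, not necessarily the library's interface:

```ruby
# Tracks nothing: one shared instance, so position bookkeeping is
# effectively free (the --0-pos row above).
class NoPosition
  def advance(char)
    self
  end

  INSTANCE = new
end

# Tracks only the byte offset: O(1) per character, no newline branching
# (roughly the --1-pos row).
class OffsetPosition
  attr_reader :offset

  def initialize(offset = 0)
    @offset = offset
  end

  def advance(char)
    OffsetPosition.new(@offset + char.bytesize)
  end
end
```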

kputnam avatar Jul 21 '19 16:07 kputnam

Further improvements came from eliminating the Array allocation caused by def foo(*args) when using def_delegators. This required quite a lot more code, in order to generate delegator methods with the same parameter declaration as the method they forward to.

With this change in place, there may be more low-hanging fruit elsewhere in the code base, by adding a type: ... or instance: ... argument to each def_delegators call. The tokenizer benchmark now runs almost 1s faster than c128060, which is an unexpected improvement. (A sketch of the technique follows the table below.)

| Option  | Tracked info               | Time  | Max RSS  |
|---------|----------------------------|-------|----------|
| --0-pos | nothing                    | 7.0s  | 32320 KB |
| --1-pos | offset                     | 7.0s  | 32684 KB |
| --2-pos | line, column               | 10.3s | 32604 KB |
| --3-pos | name, line, column         | 10.4s | 32548 KB |
| --4-pos | name, line, column, offset | 10.5s | 32764 KB |
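To illustrate the delegation change, here's a sketch (ExactDelegation and def_exact_delegator are illustrative names, not stupidedi's implementation): read the target method's parameter list and generate a delegator with a matching signature, instead of the generic *args forwarder.

```ruby
# Sketch: generate delegators whose parameter lists match the target
# method, avoiding the Array that `def foo(*args)` allocates per call.
# Positional parameters only; the real change needs much more code to
# handle blocks, keywords, and default values correctly.
module ExactDelegation
  def def_exact_delegator(accessor, name, target_class)
    sig = target_class.instance_method(name).parameters
            .map.with_index do |(kind, _), i|
              case kind
              when :req  then "x#{i}"
              when :opt  then "x#{i} = nil" # caveat: shadows the target's own default
              when :rest then "*x#{i}"
              end
            end.compact.join(", ")

    class_eval(<<~RUBY, __FILE__, __LINE__ + 1)
      def #{name}(#{sig})
        #{accessor}.#{name}(#{sig.gsub(" = nil", "")})
      end
    RUBY
  end
end
```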

kputnam avatar Jul 22 '19 05:07 kputnam