Move UUIDText, dsc, and TimeSync to HashMaps
This PR continues the work done by jrouaix and dgmcdona to reduce this library’s memory usage.
Prior to this PR, the library would read and cache all UUIDText and dsc files on a system or in a logarchive before parsing the actual logs.
The UUIDText and dsc files contain part of the log message.
This PR changes that behavior. Instead, the library parses the log file first and then reads the corresponding UUIDText and/or dsc file. It then caches that file before moving on to the next log entry.
The library keeps a small cache of UUIDText and dsc files while reading the log entries. This is all handled via the FileProvider trait that dgmcdona added.
Currently it caches 30 UUIDText files and 2 dsc files. A system typically has ~250 UUIDText files, with a total size of ~90MB to ~140MB.
Dsc files are larger but fewer in number: typically ~6 dsc files, with a total size of ~600MB.
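For illustration, the caching idea is roughly a small map with a hard size limit. Below is a minimal sketch with hypothetical names; the real PR does this through the FileProvider trait, and `BoundedCache` / `get_or_load` are not part of the library's API:

```rust
use std::collections::HashMap;

/// Minimal sketch of a size-limited cache keyed by UUID. The Vec<u8> value
/// stands in for whatever parsed UUIDText/dsc data needs to be kept around.
struct BoundedCache {
    max_entries: usize,
    entries: HashMap<String, Vec<u8>>,
}

impl BoundedCache {
    fn new(max_entries: usize) -> Self {
        Self { max_entries, entries: HashMap::new() }
    }

    /// Return the cached data for `uuid`, loading and inserting it on a miss.
    fn get_or_load<F: FnOnce() -> Vec<u8>>(&mut self, uuid: &str, load: F) -> &Vec<u8> {
        // Crude eviction: drop everything once the limit is reached. A real
        // implementation could evict a single entry instead.
        if !self.entries.contains_key(uuid) && self.entries.len() >= self.max_entries {
            self.entries.clear();
        }
        self.entries.entry(uuid.to_string()).or_insert_with(load)
    }
}

fn main() {
    // The PR caps the cache at 30 UUIDText files and 2 dsc files; a tiny
    // limit is used here just to exercise the eviction path.
    let mut uuidtext_cache = BoundedCache::new(2);
    let data = uuidtext_cache.get_or_load("1A2B3C4D", || b"parsed uuidtext bytes".to_vec());
    assert_eq!(data, &b"parsed uuidtext bytes".to_vec());
}
```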
By using a small cache, the library's memory usage drops considerably.
Memory usage prior to this PR: ~300MB-1GB, depending on how many UUIDText and dsc files the system has.
Now: ~100MB-500MB.
(Though this is still higher than Apple's built-in log command, which uses ~150MB - ~200MB.)
Switching to HashMap also makes the library a bit faster.
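The speedup presumably comes from the lookup pattern: finding a UUID in a Vec is a linear scan, while a HashMap keyed by UUID is a single hashed lookup. A toy comparison (not the library's actual types):

```rust
use std::collections::HashMap;

fn main() {
    // Old layout: every lookup scans the Vec until a matching UUID is found (O(n)).
    let vec_store: Vec<(String, String)> =
        vec![("1A2B3C4D".to_string(), "format string".to_string())];
    let _slow = vec_store.iter().find(|(uuid, _)| uuid == "1A2B3C4D");

    // New layout: the UUID is the key, so a lookup is one hashed probe (O(1) on average).
    let mut map_store: HashMap<String, String> = HashMap::new();
    map_store.insert("1A2B3C4D".to_string(), "format string".to_string());
    let _fast = map_store.get("1A2B3C4D");
}
```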
Basic hyperfine test on a few log files:
Benchmark 1: ./iteratorV2 -m log-archive -i /Users/android/Downloads/system_logs.logarchive -o out.csv -f csv
Time (mean ± σ): 6.324 s ± 0.051 s [User: 5.469 s, System: 0.638 s]
Range (min … max): 6.264 s … 6.405 s 10 runs
Benchmark 2: ./iteratorV1 -m log-archive -i /Users/android/Downloads/system_logs.logarchive -o out2.csv -f csv
Time (mean ± σ): 7.391 s ± 0.119 s [User: 6.641 s, System: 0.595 s]
Range (min … max): 7.278 s … 7.604 s 10 runs
Summary
./iteratorV2 -m log-archive -i /Users/android/Downloads/system_logs.logarchive -o out.csv -f csv ran
1.17 ± 0.02 times faster than ./iteratorV1 -m log-archive -i /Users/android/Downloads/system_logs.logarchive -o out2.csv -f csv
Where iteratorV2 is the HashMap version
Finally this PR includes a fix to better handle <private> number entries.
Also, since we switched from Vec to HashMap, this is probably a breaking change. The example file has been updated.
@mrguiman @jrouaix, both of you have discussed attempts to lower memory usage.
If either of you has a chance to try out this branch (hashmaps) and compare memory usage to the current implementation, that would be great; if not, no worries.
I'm hopeful memory usage should be a bit lower.
Hi @puffyCid !
I finally got some time to try out this incoming version.
So far a "dumb" refacto implementation of our iterator didn't show better results. Highest memory consumption seems to be the same, while performances are overall a little worst. (for 10 executions of the same scan)
# before
real 11m41.818s
user 13m0.079s
sys 1m7.636s
# after
real 12m12.739s
user 12m16.940s
sys 2m17.766s
Also, the results are shown in a different order than before; I think it's just because the files are not picked in the same order.
Could you make an implementation where you don't entirely trust WalkDir, but also sort the found SourceFiles so that the upper iterator outputs log data in calendar order?
Here is our new parse_all_archives_logs implementation (old code in comments):
That said, I don't pretend the order I picked for reading ARCHIVES was a good one; it is just deterministic, so our snapshot tests don't flicker on every run.
pub fn parse_all_archives_logs(&mut self) {
    let tracev3 = self.provider.tracev3_files();
    for tracev3 in tracev3 {
        self.parse_file(tracev3);
    }
    // const ARCHIVES: &[&str] = &["Persist", "Special", "Signpost", "HighVolume"];
    // const SINGLE_FILES: &[&str] = &["logdata.LiveData.tracev3"];
    // for archive in ARCHIVES {
    //     self.parse_archive_logs(archive);
    // }
    // for archive in SINGLE_FILES {
    //     let archive_path = self.root_path.join(archive);
    //     self.parse_file(&archive_path.display().to_string());
    // }
}
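For illustration, a deterministic stand-in could simply collect and sort the file paths before parsing (true calendar order would instead need the timestamps inside the entries). A rough sketch where `parse_file` is a placeholder, not the crate's real API:

```rust
use std::path::{Path, PathBuf};

// Gather the tracev3 paths first, sort them so the order no longer depends on
// directory traversal, then parse in that fixed order.
fn parse_in_deterministic_order(mut paths: Vec<PathBuf>) {
    // Lexicographic path order is arbitrary but stable across runs, which is
    // enough to keep snapshot tests from flickering.
    paths.sort();
    for path in paths {
        parse_file(&path);
    }
}

fn parse_file(path: &Path) {
    println!("parsing {}", path.display());
}

fn main() {
    parse_in_deterministic_order(vec![
        PathBuf::from("Persist/0000000000000001.tracev3"),
        PathBuf::from("Special/0000000000000002.tracev3"),
    ]);
}
```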
Then the build_log function now takes a provider, but no longer string_results & shared_string_results? So it'll ask the provider on demand? Will it hurt performance? (Or maybe I didn't really understand that part; I'll have to dig more into it.)
The LogarchiveProvider with the SourceFile / Read abstraction could allow us to read the unifiedlogs without having to unzip the sysdiagnose to disk. Good to see all this!
> Then the build_log function now takes a provider, but no longer string_results & shared_string_results? So it'll ask the provider on demand? Will it hurt performance? (Or maybe I didn't really understand that part; I'll have to dig more into it.)
Yes, instead of reading all of the log strings and shared strings at the start, the library now delays reading the file until it encounters a log entry that requires it. It then reads the string file (UUIDText file) and caches it for future log entries.
The library keeps a small cache and evicts data if the limit is reached. Performance should be about the same, as shown in the hyperfine results above.
I also did a simple comparison with my own application that uses this library and got about the same performance.
This PR:
real 0m21.282s
user 0m19.202s
sys 0m1.960s
Before:
real 0m23.064s
user 0m21.414s
sys 0m1.408s
I also saw a small memory decrease.