
Move UUIDText, dsc, and TimeSync to HashMaps

Open · puffyCid opened this pull request 9 months ago · 2 comments

This PR continues the work done by jrouaix and dgmcdona to reduce this library’s memory usage.

Prior to this PR, the library would read and cache all UUIDText and dsc files on a system or in a logarchive before parsing the actual logs.

The UUIDText and dsc files contain parts of the log messages.

This PR changes that behavior: the library now parses the log file first and then reads the corresponding UUIDText and/or dsc file. It caches that file before moving on to the next log entry.

The library keeps a small cache of UUIDText and dsc files while reading the log entries. This is all handled via the FileProvider trait that dgmcdona added.

Currently it will cache 30 UUIDText files and 2 dsc files. There are typically ~250 UUIDText files on a system, with a total size of ~90MB to ~140MB.

Dsc files are larger but fewer in number: typically ~6 dsc files, with a total size of ~600MB.
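For illustration, here is a minimal sketch of the kind of bounded, on-demand cache described above. The type and function names (StringCache, get_or_load) and the eviction policy are hypothetical, not the library's actual API; the point is just that a capped HashMap replaces caching every file up front.

use std::collections::HashMap;
use std::fs;
use std::io;

// Hypothetical bounded cache: file contents keyed by UUID string.
struct StringCache {
    entries: HashMap<String, Vec<u8>>,
    capacity: usize, // e.g. 30 for UUIDText, 2 for dsc
}

impl StringCache {
    fn new(capacity: usize) -> Self {
        Self { entries: HashMap::new(), capacity }
    }

    // Return the cached file contents, reading from disk on a miss.
    fn get_or_load(&mut self, uuid: &str, path: &str) -> io::Result<&Vec<u8>> {
        if !self.entries.contains_key(uuid) {
            // Crude eviction: drop an arbitrary entry once the cap is hit.
            if self.entries.len() >= self.capacity {
                let evict = self.entries.keys().next().cloned();
                if let Some(key) = evict {
                    self.entries.remove(&key);
                }
            }
            let data = fs::read(path)?;
            self.entries.insert(uuid.to_string(), data);
        }
        Ok(self.entries.get(uuid).expect("entry was just inserted"))
    }
}

With capacities on the order of 30 and 2, the working set stays small even when a logarchive contains hundreds of UUIDText files.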

By using a smaller cache, the library's memory usage drops considerably.

Memory usage prior to this PR: ~300MB-1GB, depending on how many UUIDText and dsc files are present. Now: ~100MB-500MB. (This is still higher than Apple's built-in log command, which uses ~150MB-200MB.)

Switching to HashMap also makes the library a bit faster.

A basic hyperfine test on a few log files:

Benchmark 1: ./iteratorV2 -m log-archive -i /Users/android/Downloads/system_logs.logarchive -o out.csv -f csv
  Time (mean ± σ):      6.324 s ±  0.051 s    [User: 5.469 s, System: 0.638 s]
  Range (min … max):    6.264 s …  6.405 s    10 runs

Benchmark 2: ./iteratorV1 -m log-archive -i /Users/android/Downloads/system_logs.logarchive -o out2.csv -f csv
  Time (mean ± σ):      7.391 s ±  0.119 s    [User: 6.641 s, System: 0.595 s]
  Range (min … max):    7.278 s …  7.604 s    10 runs

Summary
  ./iteratorV2 -m log-archive -i /Users/android/Downloads/system_logs.logarchive -o out.csv -f csv ran
    1.17 ± 0.02 times faster than ./iteratorV1 -m log-archive -i /Users/android/Downloads/system_logs.logarchive -o out2.csv -f csv

Here iteratorV2 is the HashMap version and iteratorV1 is the previous implementation.

Finally, this PR includes a fix to better handle <private> number entries.

Also, since we switched from Vec to HashMap, this is probably a breaking change. The example file has been updated.
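Purely as an illustration of what the Vec-to-HashMap switch means for callers (the struct and field names below are made up, not the crate's real types): lookups by UUID go from a linear scan to a keyed get, which is also where the small speedup comes from.

use std::collections::HashMap;

// Hypothetical entry type, only to show the shape of the change.
struct UuidTextEntry {
    uuid: String,
    strings: Vec<String>,
}

// Before: a flat Vec that had to be scanned for the matching UUID.
fn find_in_vec<'a>(entries: &'a [UuidTextEntry], uuid: &str) -> Option<&'a UuidTextEntry> {
    entries.iter().find(|e| e.uuid == uuid)
}

// After: an O(1) lookup keyed by UUID.
fn find_in_map<'a>(entries: &'a HashMap<String, UuidTextEntry>, uuid: &str) -> Option<&'a UuidTextEntry> {
    entries.get(uuid)
}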

puffyCid · Feb 27, 2025

@mrguiman @jrouaix, both of you have discussed attempts to lower memory usage. If either of you has a chance to try out this branch (hashmaps) and compare its memory usage to the current implementation, that would be great; if not, no worries.

I'm hopeful it should lower memory usage a bit more.

puffyCid · Feb 27, 2025

Hi @puffyCid !

I finally got time to try a little with this incoming version.

So far a "dumb" refacto implementation of our iterator didn't show better results. Highest memory consumption seems to be the same, while performances are overall a little worst. (for 10 executions of the same scan)

# before
real    11m41.818s
user    13m0.079s
sys     1m7.636s
# after
real    12m12.739s
user    12m16.940s
sys     2m17.766s

Also, the results are output in a different order than before; I think it's just because the files are not picked in the same order. Could you make an implementation where you don't entirely trust WalkDir, but also sort the found SourceFiles so that the upper iterator outputs log data in calendar order?

Here is our new parse_all_archives_logs implementation (old code in comments):

That said, I don't claim the order I picked for reading ARCHIVES was a good one; it's just deterministic, so our snapshot tests don't flicker on every run.

  pub fn parse_all_archives_logs(&mut self) {
    let tracev3 = self.provider.tracev3_files();
    for tracev3 in tracev3 {
      self.parse_file(tracev3);
    }
  
    // const ARCHIVES: &[&str] = &["Persist", "Special", "Signpost", "HighVolume"];
    // const SINGLE_FILES: &[&str] = &["logdata.LiveData.tracev3"];

    // for archive in ARCHIVES {
    //   self.parse_archive_logs(archive);
    // }

    // for archive in SINGLE_FILES {
    //   let archive_path = self.root_path.join(archive);
    //   self.parse_file(&archive_path.display().to_string());
    // }
  }
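Something like the sketch below is what I have in mind for a sorted variant (the names and signatures are placeholders, not the crate's real ones): collect the tracev3 paths, sort them, then parse in that fixed order.

use std::path::PathBuf;

// Placeholder signatures: sort the collected paths so WalkDir's traversal
// order no longer matters, then parse in that fixed order.
fn parse_in_deterministic_order(
    mut tracev3_paths: Vec<PathBuf>,
    mut parse_file: impl FnMut(&PathBuf),
) {
    // Lexicographic path order is at least deterministic; for true calendar
    // order, sort by a timestamp extracted from each file instead.
    tracev3_paths.sort();
    for path in &tracev3_paths {
        parse_file(path);
    }
}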

So the build_log function now takes a provider, but no longer string_results & shared_string_results? It will ask the provider on demand? Will that hurt performance? (Or maybe I didn't really understand that part; I'll have to dig into it more.)


The LogarchiveProvider with the SourceFile / Read abstraction could allow us to read the unified logs without having to unzip the sysdiagnose to disk. Good to see all this!
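To sketch what I mean (the trait and method names here are illustrative, not the crate's actual FileProvider signatures): as long as the provider hands back Read implementations, the sources could come from disk or straight from an in-memory sysdiagnose archive.

use std::fs::File;
use std::io::Read;
use std::path::PathBuf;

// Illustrative Read-based provider abstraction.
trait LogSourceProvider {
    // Hand back readers for the tracev3 files, wherever they live.
    fn tracev3_readers(&self) -> Vec<Box<dyn Read>>;
}

// Filesystem-backed implementation: open files from an on-disk logarchive.
struct DiskProvider {
    tracev3_paths: Vec<PathBuf>,
}

impl LogSourceProvider for DiskProvider {
    fn tracev3_readers(&self) -> Vec<Box<dyn Read>> {
        self.tracev3_paths
            .iter()
            .filter_map(|path| File::open(path).ok())
            .map(|file| Box::new(file) as Box<dyn Read>)
            .collect()
    }
}

// A zip-backed implementation could instead return readers over entries of
// an in-memory sysdiagnose, avoiding the unzip-to-disk step entirely.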

jrouaix · Mar 4, 2025

So the build_log function now takes a provider, but no longer string_results & shared_string_results? It will ask the provider on demand? Will that hurt performance? (Or maybe I didn't really understand that part; I'll have to dig into it more.)

Yes, instead of reading all of the log strings and shared strings at the start, the library now delays reading the file until it encounters a log entry that requires it. It then reads the string file (UUIDText files) and caches it for future log entries.

The library keeps a small cache and evicts data once the limit is reached. Performance should be about the same, as shown in the hyperfine results above.

I also did a simple comparison with my own application that uses this library and got about the same performance.

This PR:
real	0m21.282s
user	0m19.202s
sys	0m1.960s

Before:
real	0m23.064s
user	0m21.414s
sys	0m1.408s

I saw a small memory decrease as well.

puffyCid · Mar 30, 2025