Profile memory usage from processing HTML contents
What problem is this PR intended to solve?
This pull request adds a GitHub Actions workflow that profiles memory usage when Nokogiri parses and serializes HTML content.
The core of the workflow is split into two steps:

- Parsing an entire HTML page with `Nokogiri::HTML::Document` and dumping it back into an HTML string.
- Parsing just the `<body>` tag contents with `Nokogiri::HTML::DocumentFragment` and dumping it back into an HTML string.

(The profiling parses and serializes the same input 1,000 times.)
The source material for both steps is a static HTML file (sourced from the current state of https://nokogiri.org/). Having a static fixture ensures a consistent sample across multiple runs of the workflow.
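For illustration, the core of each step looks something like the following (a minimal sketch, not the PR's actual script; the fixture path is an assumption):

```ruby
require "nokogiri"
require "memory_profiler"

# Hypothetical fixture path; the PR stores a snapshot of nokogiri.org.
html = File.read("test/files/nokogiri_org.html")

report = MemoryProfiler.report do
  1000.times do
    # Step 1: parse the whole page and serialize it back to HTML.
    Nokogiri::HTML::Document.parse(html).to_html
  end
end

report.pretty_print
```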
Code Climate has analyzed commit 2213ba15 and detected 0 issues on this pull request.
The test coverage on the diff in this pull request is 100.0% (80% is the threshold).
This pull request will bring the total coverage in the repository to 94.3% (0.0% change).
View more on Code Climate.
:white_check_mark: Build nokogiri 1.0.685 completed (commit https://github.com/sparklemotion/nokogiri/commit/794609a419 by @ashmaroli)
Please ignore the failed CI status -- that was due to my config mistake related to #2089. This PR was green and should be considered green.
:white_check_mark: Build nokogiri 1.0.711 completed (commit https://github.com/sparklemotion/nokogiri/commit/cb122b5a75 by @ashmaroli)
@ashmaroli How to evaluate memory profiling results? I mean, you wouldn't go checking CI logs on every commit and manually compare memory usage between commits.
> you wouldn't go checking CI logs on every commit and manually compare memory usage between commits.
@ilyazub Honestly, I was planning on doing just that — comparing manually between commits of interest, say the current `master` stats vs. the stats for a proposed memory optimization.
Theoretically, if there were a way to get the log for an action that ran on `master`, one could execute a script to extract the necessary numbers from that log, compare them with the numbers from an action run in a pull request of interest, and post the result as a comment on said pull request (much like AppVeyorBot did on this PR).
@ashmaroli Thanks for taking the time to put this together, and sorry for not commenting before now.
I think there's value in being able to look at data like this, but (as @ilyazub is implying) I worry that flagging memory leaks requires more rigor than just emitting a log file.
I'd ideally like to embed this approach in a CI test that emits pass/fail status. Can you think of a way to integrate this data better in the testing cycle?
I'm also curious about combining this approach with the long-defunct memory test suite (see https://github.com/sparklemotion/nokogiri/issues/1603) and incorporating testing with compaction and either Valgrind or ASAN. Any thoughts about that?
For the CI integration:

- The simplest thing is to check memory usage limits: fail if the script allocated more than allowed, and pass otherwise.
- A more advanced option is to store historical memory usage data in a file (e.g., JSON or YAML) and fail the CI job if the script used more than `prev_run + N%` of memory (see the sketch after this list). To store that file we could use build artifacts in Concourse CI (not sure), a separate repository with a single file (the deno.land approach), a GitHub gist, etc.
- Same as the second option, but using a server to write, read, and compare historical memory usage data (kind of like the Lighthouse CI Server approach). Seems like overkill.
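A rough sketch of that second option (the file name, artifact handling, and the 5% tolerance are all assumptions, not anything from this PR):

```ruby
require "json"

STATS_FILE = "memory_stats.json" # hypothetical artifact carried across builds
TOLERANCE  = 1.05                # fail on more than 5% growth

# `report` is the MemoryProfiler report from the profiling step.
current = report.total_allocated_memsize

if File.exist?(STATS_FILE)
  previous = JSON.parse(File.read(STATS_FILE))["total_allocated_memsize"]
  if current > previous * TOLERANCE
    warn "Memory regression: #{current} bytes allocated vs. #{previous} previously"
    exit 1
  end
end

File.write(STATS_FILE, JSON.generate("total_allocated_memsize" => current))
```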
Regarding #1603, we could store historical data for each test in `test/test_memory_leak.rb` (the second approach). The output of each test case could be stored in an array (as is done for `test_leaking_namespace_node_strings`). But I don't know what to do with the endless loops (e.g., `test_leak_on_node_replace`); one possibility is sketched below.
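For the endless loops, one option could be to bound the iterations and assert on the trend instead; a loose sketch, assuming a Minitest context and treating the helper as hypothetical:

```ruby
# Hypothetical helper, not existing suite code: sample the live heap
# after each bounded iteration and fail if the series keeps growing.
def assert_no_memory_growth(iterations: 100, &block)
  samples = []
  iterations.times do
    block.call
    GC.start
    samples << GC.stat(:heap_live_slots)
  end
  quarter = iterations / 4
  early = samples.first(quarter).sum / quarter
  late  = samples.last(quarter).sum / quarter
  assert late <= early * 1.1, "live slots grew from #{early} to #{late}"
end
```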
@flavorjones @ashmaroli What do you think?
Thanks for the feedback on this idea, @flavorjones and @ilyazub. So, the first step is to handle passing data across CI builds. I'm not familiar with Concourse CI, but if suggestion #2 from @ilyazub is implementable, that would be nice to have.
Regarding increasing the rigor of profiling memory, we may need to look into other options.
To my knowledge, `memory_profiler` only measures Ruby objects; it won't detect allocations from native extensions.
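Catching native (libxml2) allocations would likely need a process-level metric instead; for example, something along these lines (Linux-only, and purely illustrative):

```ruby
# Read the process's resident set size from /proc (Linux-specific).
def rss_kb
  File.read("/proc/self/status")[/^VmRSS:\s+(\d+)/, 1].to_i
end

# `html` is the fixture string from the earlier sketch.
before = rss_kb
1000.times { Nokogiri::HTML::Document.parse(html).to_html }
after = rss_kb
puts "RSS grew by #{after - before} KB (includes native allocations)"
```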
Once we have a satisfactory way to pass data across builds, we could modify the profiling script to parse the artifact data, compare the results, and exit with a status accordingly.