
Profile memory usage from processing HTML contents

Open ashmaroli opened this issue 3 years ago • 9 comments

What problem is this PR intended to solve?

This pull request adds a GitHub Actions workflow that profiles memory usage when using Nokogiri to parse and serialize HTML content.

The core of the workflow is split into two steps:

  • Parsing an entire HTML page with Nokogiri::HTML::Document and serializing it back to an HTML string.
  • Parsing just the <body> tag contents with Nokogiri::HTML::DocumentFragment and serializing it back to an HTML string.

(The profiling involves parsing and serializing the same input 1,000 times.)

The source material for both steps comes from a static HTML file (in turn sourced from the current state of https://nokogiri.org/). Having a static fixture ensures a consistent sample across multiple runs of the workflow.

ashmaroli avatar Sep 30 '20 07:09 ashmaroli

Code Climate has analyzed commit 2213ba15 and detected 0 issues on this pull request.

The test coverage on the diff in this pull request is 100.0% (80% is the threshold).

This pull request will bring the total coverage in the repository to 94.3% (0.0% change).

View more on Code Climate.

codeclimate[bot] avatar Sep 30 '20 07:09 codeclimate[bot]

:white_check_mark: Build nokogiri 1.0.685 completed (commit https://github.com/sparklemotion/nokogiri/commit/794609a419 by @ashmaroli)

AppVeyorBot avatar Sep 30 '20 07:09 AppVeyorBot

Please ignore the failed CI status -- that was due to my config mistake related to #2089. This PR was green and should be considered green.

flavorjones avatar Oct 01 '20 17:10 flavorjones

:white_check_mark: Build nokogiri 1.0.711 completed (commit https://github.com/sparklemotion/nokogiri/commit/cb122b5a75 by @ashmaroli)

AppVeyorBot avatar Oct 15 '20 14:10 AppVeyorBot

@ashmaroli How do we evaluate memory profiling results? I mean, you wouldn't go checking CI logs on every commit and manually compare memory usage between commits.

ilyazub avatar Nov 27 '20 16:11 ilyazub

you wouldn't go checking CI logs on every commit and manually compare memory usage between commits.

@ilyazub Honestly, I was planning on doing just that — compare manually between commits of interest — say, the current master stats vs stats for a proposed memory optimization.

Theoretically, if there were a way to get the log for an action that ran on master, one could execute a script to extract the necessary numbers from that log, compare them with the numbers from an action run on a pull request of interest, and post the result as a comment on that pull request (much like AppVeyorBot did on this PR).
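Such a comparison script could look something like this sketch. The log format and the `total_allocated_bytes` helper are assumptions here (memory_profiler's default report includes a "Total allocated" line, but the exact wording should be verified against real output):

```ruby
# Hypothetical helper: pull the "Total allocated" byte count out of a
# memory_profiler log, assuming a line like "Total allocated: N bytes (...)".
def total_allocated_bytes(log_text)
  bytes = log_text[/Total allocated:\s+(\d+)\s+bytes/, 1]
  bytes && Integer(bytes)
end

# Stand-in log excerpts; in practice these would be fetched from CI runs.
master_log = "Total allocated: 1200000 bytes (9000 objects)"
pr_log     = "Total allocated: 1350000 bytes (9400 objects)"

delta = total_allocated_bytes(pr_log) - total_allocated_bytes(master_log)
puts format("Memory delta vs master: %+d bytes", delta)
# => Memory delta vs master: +150000 bytes
```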

ashmaroli avatar Nov 27 '20 17:11 ashmaroli

@ashmaroli Thanks for taking the time to put this together, and sorry for not commenting before now.

I think there's value in being able to look at data like this, but (as @ilyazub is implying) I worry that flagging memory leaks requires more rigor than just emitting a log file.

I'd ideally like to embed this approach in a CI test that emits pass/fail status. Can you think of a way to integrate this data better in the testing cycle?

I'm also curious about combining this approach with the long-defunct memory test suite (see https://github.com/sparklemotion/nokogiri/issues/1603) and about incorporating testing with compaction and either valgrind or ASAN. Any thoughts about that?

flavorjones avatar Nov 29 '20 20:11 flavorjones

For the CI integration:

  1. The simplest option is to check against a memory-usage limit: fail if the script allocated more than allowed, and pass otherwise.
  2. A more advanced option is to store historical memory-usage data in a file (e.g., JSON or YAML) and fail the CI job if the script used more than prev_run + N% of memory. To store that file we could use build artifacts in Concourse CI (not sure), a separate repository with a single file (the deno.land approach), a GitHub gist, etc.
  3. Same as the second option, but use a server to write, read, and compare the historical memory-usage data (kind of like the Lighthouse CI Server approach). That seems like overkill, though.
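The second option could be sketched as below. The file name, the 5% allowance, and the hard-coded byte count are all placeholders; in a real job the number would come from the profiling step and the file would be persisted as a build artifact:

```ruby
require "json"

HISTORY_FILE = "memory_history.json"  # hypothetical artifact file name
GROWTH_ALLOWANCE = 1.05               # fail if usage grows more than 5% over the last run

# In the real workflow this would come from the profiling step;
# hard-coded here for illustration.
current_bytes = 1_250_000

history = File.exist?(HISTORY_FILE) ? JSON.parse(File.read(HISTORY_FILE)) : []
previous_bytes = history.last

if previous_bytes && current_bytes > previous_bytes * GROWTH_ALLOWANCE
  warn "Memory regression: #{current_bytes} bytes exceeds allowed " \
       "#{(previous_bytes * GROWTH_ALLOWANCE).round} bytes"
  exit 1
end

# Record this run so the next build can compare against it.
File.write(HISTORY_FILE, JSON.generate(history << current_bytes))
```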

Regarding #1603, we could store historical data for each test in test/test_memory_leak.rb (the second approach). The output of each test case could be stored in an array (as is done for test_leaking_namespace_node_strings). But I don't know what to do with the endless loops (e.g., test_leak_on_node_replace).

@flavorjones @ashmaroli What do you think?

ilyazub avatar Nov 30 '20 10:11 ilyazub

Thanks for the feedback on this idea, @flavorjones and @ilyazub. So, the first step is to handle passing data across CI builds. I'm not familiar with Concourse CI, but if suggestion #2 from @ilyazub is implementable, that would be nice to have.

Regarding increasing the rigor of the memory profiling, we may need to look into other options: to my knowledge, memory_profiler only measures Ruby objects, so it won't detect allocations from native extensions. Once we have a satisfactory way to pass data across builds, we could modify the profiling script to parse the artifact data, compare the numbers, and exit with an appropriate status.

ashmaroli avatar Nov 30 '20 13:11 ashmaroli