
Shared memory for darwintraced processes

Open MihirLuthra opened this issue 5 years ago • 19 comments

These changes were made as part of the Speed up trace mode project. The code has worked stably with all ports I have tested so far. This PR adds a shared-memory cache for the processes into which the darwintrace library is injected. I have also uploaded a separate repository on GitHub that contains tests for the newly added library and a README that attempts to explain the complete code. The library can be used standalone (by defining STANDALONE_DTSM in the Makefile) if anyone wants to test how it works.

Any suggestions or feedback would be really helpful.

I am entering speed-comparison data in a spreadsheet. I will keep updating the table with each comparison test I make.

MihirLuthra avatar Jun 08 '19 18:06 MihirLuthra

Can you add something to that spreadsheet to show the percentage speedup/slowdown?

pmetzger avatar Jun 09 '19 11:06 pmetzger

Can you add something to that spreadsheet to show the percentage speedup/slowdown?

I have added a new column indicating the % speedup/slowdown. I also think there is still much room for improvement in the newly added library. Aside from that, I am uncertain about some possible improvements, such as whether I can keep the same shared memory across all the installation phases of a port. Currently I delete the shared-memory data after each phase. I think it should be fine to keep the data until the end of the installation, but it would be better if someone could confirm that.

MihirLuthra avatar Jun 09 '19 16:06 MihirLuthra

Hrm. Looks like all this is a lot of work for only a few percent change on average...

pmetzger avatar Jun 09 '19 20:06 pmetzger

Hrm. Looks like all this is a lot of work for only a few percent change on average...

So far, I guess yes. I am hoping to see bigger changes with the edits I am testing right now, and I will comment when I am done. Could you take a look at columns P, Q & R (the results with the updated code) in the spreadsheet? Those results so far seem to be better than before.

MihirLuthra avatar Jun 09 '19 21:06 MihirLuthra

@pmetzger I have made some changes to the % calculation formula. Previously it was calculated as ((actual_time_by_old_trace_mode - actual_time_by_modified_trace_mode)/actual_time_by_old_trace_mode) * 100, which actually isn't correct. I changed it to ((time_lag_caused_by_old_trace_mode - time_lag_caused_by_new_trace_mode)/time_lag_caused_by_old_trace_mode) * 100. This makes more sense because the modifications to trace mode can't make it faster than a build without trace mode at all. For example, with perl5.28 the time taken by the old trace mode is 8m29.826s and the time taken by the new trace mode is 7m15.217s, while the time taken without trace mode is 5m58.364s. So this actually means a 51% improvement. Am I right on that?

MihirLuthra avatar Jun 10 '19 03:06 MihirLuthra

8.5 minutes vs 7.25 minutes is not a 51% improvement. A 51% improvement would be 4.25 minutes.

pmetzger avatar Jun 10 '19 19:06 pmetzger

8.5 minutes vs 7.25 minutes is not a 51% improvement. A 51% improvement would be 4.25 minutes.

I understand; I will try to improve the results further. Maybe I rushed into a PR too early, having only completed the core idea.

MihirLuthra avatar Jun 10 '19 20:06 MihirLuthra

8.5 minutes vs 7.25 minutes is not a 51% improvement. A 51% improvement would be 4.25 minutes.

... only if the build without trace mode would take 0 minutes.

Since @MihirLuthra is not optimizing the compiler where your calculation would make sense, his numbers seem to be correct. If the build without trace mode is taking 6 minutes and the time in trace mode dropped from 8.5 to 7.25 minutes, this means that the overhead dropped from 2.5 minutes to 1.25 minutes. He's optimising the overhead, not the compilation time, so the assessment sounds correct to me.

I'm not saying that there's no further room for improvement, in particular for the cases where the percentages are negative, just that the numbers sound correct.

mojca avatar Jun 21 '19 08:06 mojca

8.5 minutes vs 7.25 minutes is not a 51% inprovement. A 51% improvement would be 4.25 minutes.

... only if the build without trace mode would take 0 minutes.

Since @MihirLuthra is not optimizing the compiler where your calculation would make sense, his numbers seem to be correct. If the build without trace mode is taking 6 minutes and the time in trace mode dropped from 8.5 to 7.25 minutes, this means that the overhead dropped from 2.5 minutes to 1.25 minutes. He's optimising the overhead, not the compilation time, so the assessment sounds correct to me.

I'm not saying that there's no further room for improvement, in particular for the cases where the percentages are negative, just that the numbers sound correct.

@mojca, thanks for pointing that out. Also, with the new changes none of the ports gives negative results, but I surely need to optimise it further. I will post the latest tests ASAP.

MihirLuthra avatar Jun 21 '19 13:06 MihirLuthra

@raimue Kindly review the latest code. I think I have fixed most of the issues you pointed out last time. Also, here are the latest tests, though they are still incomplete.

MihirLuthra avatar Jun 22 '19 16:06 MihirLuthra

@neverpanic, @jmroot: could you please provide feedback to the revisited code?

@MihirLuthra: sorry for the delayed answer. One possible way to run performance tests without overheating your laptop could be to set up builds on Azure CI, which allows 6 hours for builds to finish. I'm not saying it's a small amount of work, but one could start from the existing ports setup: first build base with and without your patches (perhaps as different parameters in the build matrix?), install base, then install dependencies from binary packages, and finally time the build of a given port. The port could potentially be cleaned and rebuilt to increase the statistics. This is not a prerequisite for getting this code merged, just a random idea that might help with performance testing later.

mojca avatar Aug 20 '19 21:08 mojca

Just in the interest of documenting the measurements that we just did, here are some numbers.

curl-modified,untraced,75.36,83.06,49.01
curl-modified,untraced,75.48,82.96,48.68
curl-modified,untraced,74.83,80.88,48.21
curl-modified,baseline,84.61,95.44,58.41
curl-modified,baseline,84.19,93.21,57.67
curl-modified,baseline,83.99,93.83,57.80
curl-modified,optimized,81.11,93.37,54.97
curl-modified,optimized,81.86,93.38,55.47
curl-modified,optimized,81.85,93.27,55.23
curl-unmodified,untraced,78.28,85.64,48.21
curl-unmodified,untraced,79.82,88.55,50.73
curl-unmodified,untraced,79.53,86.71,50.39
curl-unmodified,baseline,90.14,102.47,61.22
curl-unmodified,baseline,88.33,100.38,59.54
curl-unmodified,baseline,89.24,99.23,60.02
curl-unmodified,optimized,85.66,98.39,56.66
curl-unmodified,optimized,85.62,99.43,57.08
curl-unmodified,optimized,85.23,97.31,56.25

This was tested on a MacBook Pro (Retina, 15-inch, Mid 2015) with sudo port <-t> destroot curl with a hot cache and no cleanup to do at the beginning of the build.

untraced is without trace mode enabled, baseline is with the current master of trace mode, and optimized is with this PR.

We see an improvement of between 3.5 and 4.7 percentage points in wallclock time (between 3.1 and 4.2% of the trace-mode baseline). The overhead of trace mode falls from ~12.5% to ~8%.

neverpanic avatar Oct 12 '19 17:10 neverpanic

4% isn't a lot given the complexity. Is that worth it?

pmetzger avatar Oct 13 '19 20:10 pmetzger

Probably not in its current state, but we've found some things in the Ctrie that aren't as optimal as they should be, so I'm pretty confident this can be improved, and also simplified, perhaps by removing some of the socket-communication functionality and replacing it with direct shared-memory access.

neverpanic avatar Oct 14 '19 07:10 neverpanic

Progress is being made: https://github.com/neverpanic/cctrie. This implementation of a CTrie is significantly faster than what's currently in this PR.

neverpanic avatar Oct 14 '19 15:10 neverpanic

I thought I should post an update here: the shared-memory allocator to be used alongside the above ctrie repository is ready: https://github.com/MihirLuthra/shm_alloc Thanks to Clemens, the combination is significantly faster than before. What remains now is writing the Tcl wrappers.

MihirLuthra avatar Dec 28 '19 13:12 MihirLuthra

Old changes have been archived: old branch

Additions:

  1. A shared-memory allocation library. This is loaded into the memory of each darwintraced process with __attribute__((constructor)).
  2. A Ctrie data structure working over shared memory, used for caching paths that were queried from the registry.
  3. A shared-memory trie used for the trace sandbox. Previously, darwintrace asked tracelib.c for a filemap over the socket, which it then iterated over. Now, in porttrace.tcl, instead of supplying sandbox bounds to tracelib.c, the prefixes are inserted into this trie.

The above three are combined into a static library used by both pextlib and darwintrace. A shared library wasn't a good candidate for this, as darwintrace would intercept calls in our own library as well.

MihirLuthra avatar Feb 08 '20 02:02 MihirLuthra

Love the work; it would be great to see a noticeable reduction in trace mode overhead!

mascguy avatar Oct 28 '21 00:10 mascguy

As someone who's come to depend on trace mode more and more - it's absolutely critical when validating port dependencies and such - I'd love to see this work merged!

The first step is to resolve the merge conflicts, and I'll take care of that.

mascguy avatar Aug 19 '22 14:08 mascguy