fbctf icon indicating copy to clipboard operation
fbctf copied to clipboard

Performance Issue(s)

Open punkdis opened this issue 8 years ago • 38 comments

6d04149d9939687faf1bc6b7be1764213ad06d3b

AWS EC2, Ubuntu 14.04.5 LTS on a m4.xlarge

When we have over 150 participants over 20 mins the CPU slowly goes to 80%. Once at 80% performance goes to the crapper and the UI is slow and participants are seeing the impact in game play.

image

punkdis avatar Feb 16 '17 02:02 punkdis

Thanks for filling the issue @rfrillman - we'll investigate asap.

juliannagler avatar Feb 16 '17 18:02 juliannagler

Do you happen to know if you were running Memcached? Was the database server on the same system as the webserver (a single system)?

justinwray avatar Feb 17 '17 04:02 justinwray

Yes Memcached was running.

memcache 1359 0.0 0.2 327440 2580 ? Sl 04:11 0:00 /usr/bin/memcached -m 64 -p 11211 -u memcache -l 127.0.0.1 ubuntu 1886 0.0 0.0 11744 932 pts/0 S+ 04:19 0:00 grep --color=auto memcached

Yes the database server was on the same system as the webserver.

punkdis avatar Feb 17 '17 04:02 punkdis

Interesting, nearly all of the data is cached now (even sessions). So you shouldn't see a high load in relation to the database. In the past we've seen this issue when Memcached wasn't running/working.

Did you have the binlog enabled by chance? Or the general query log? (Note: You probably shouldn't for performance reasons - but it would be useful for some investigative information in this case).

justinwray avatar Feb 17 '17 04:02 justinwray

binlog is/was disabled. I can see if there are any query logs

punkdis avatar Feb 17 '17 04:02 punkdis

After reviewing the logs there is nothing that stands out. Maybe the resources for m4.xlarge is not enough.

image

I have another event in a month or so and I am going to move to the C4.4xlarge

image

punkdis avatar Feb 18 '17 02:02 punkdis

We would like to expose our experience using the Facebook CTF platform. First of all, the platform was deployed on a Virtual Private Server (OVH SSD3 VPS Server), whose specifications are:

  • 8 GB RAM.
  • 2v cores.
  • 2.4 GHz.
  • SSD 40 GB.
  • KVM OpenStack.
  • Local RAID 10.

After deploying the platform for our CTF, was tested with 15 players and we did not experience any problem of flooding or lagging in the platform.

Once the competition was opened, 340 participants were registered and the platform remained open from 02/16/2017 to 02/19/2017. At this point the platform began to flood, being almost impossible to interact with the interface, even when there were only 30 players ingame.

The memory system (RAM) was not a problem, however, the CPU load increased considerably. To correct these problems, we made the following modifications:

  • Modifying Mysql stack threads from 10 to 25.
  • Restart HHVM service every 5 minutes.
  • Deletion of active sessions of the BBDD every hour, to remove the inactive users (cleaning memcache included).
  • Refresh callbacks changed from 5000 to 30000 and 10000 to 120000

The main problem is given by HHVM, and although the modifications managed to fix the lagging problems at times, we did not manage to correct the problems completely and all users started complaining about the flooding of the system until we delete the user sessions and restart HHVM service again.

shargon avatar Feb 22 '17 00:02 shargon

Any other thoughts?

punkdis avatar Feb 24 '17 04:02 punkdis

I had this same issue awhile back, and still do at times for large events. HHVM is tying up the CPU waiting for the database queries. I just ran an event with 30 teams last week 50-60 sessions, and used a c4.4xl (16 cores) but moved the DB to AWS Aurora db.r3.2xl. (working on the code so that I can do a pull request for it) Late in in the afternoon when the game log grew due to all the guesses/answers, the CPU was sitting around 60%. The biggest help from my testing was having the DB separate. so the DB and HHVM were not both beating up on the same CPU.

wisco24 avatar Feb 25 '17 04:02 wisco24

From the queries I looked at, the issue is with how the leader board scores are done. They are recalculated every time a team scores by querying/adding the full game logs table. A much more efficient way would be to store the score in a separate DB table so it doesn't have to go through the all the guesses each time.

wisco24 avatar Feb 25 '17 04:02 wisco24

Just looked at pull request #459 and that might be a mitigation method as well as it will create separate DB's, etc. and then possibly put a Load Balancer in front of it with sticky sessions so a team stays stuck to one system.

wisco24 avatar Feb 25 '17 04:02 wisco24

Has anyone used the Live Sync API?

punkdis avatar Apr 04 '17 03:04 punkdis

image

I ran another event and the mysqld was off the chart and caused the game to become unresponsive Any suggestions? Or any immediate commands I could issue to free up the CPU?

git rev-parse HEAD 51e06a7094df9d9e66249b40cae387d2150a52fd

punkdis avatar Aug 04 '17 02:08 punkdis

Exactly what I was seeing.... Best way I found was the offload the DB to a different server. In my case, I move the DB to AWS Aurora RDS, and then have the rest on a rather large EC2 instance....

I tried to restart mysqld and hhvm when I had this issue before, but it cause the scoring system to get messed up and award double points for the same question.

wisco24 avatar Aug 04 '17 02:08 wisco24

@wisco24 do you have any documentation on how you setup the DB on a different server? I have another event scheduled for September

punkdis avatar Aug 04 '17 02:08 punkdis

@punkdis The multi server changes were just merged in the dev branch. Depending on how many servers you want to separate, here are the general instructions for production. Note that the additional options you see with IP addresses are instructing as to the location of hhvm, mysql, and cache servers. So generally you would provision in this order:

Update: See Justin's comment below for correct commands

stevcoll avatar Aug 04 '17 02:08 stevcoll

@stevcoll Do you know when the changes will be merged into production?

punkdis avatar Aug 04 '17 02:08 punkdis

The last I heard master and dev were in conflict so it could be months, but I'll let @justinwray answer that one.

stevcoll avatar Aug 04 '17 03:08 stevcoll

@punkdis / @stevcoll dev and master are currently out of sync.

However, given the number of improvements, including performance updates, in dev it might be a good time to merge that work into master.

Merging dev into master would also bring the new server/container separation to master as well.

In the meantime, you can certainly pull down the dev branch, to do so you need to do the following:

  • git clone https://github.com/facebook/fbctf
  • cd fbctf
  • git checkout dev

From that point forward all other documentation applies and you will now be using the dev branch.

Additionally, documentation for the new installation options can be found in the commit log here: https://github.com/facebook/fbctf/commit/b487fc119f3cef02b289e13d29b80272c4d94c2a

You can use the individual provision.sh executions as @stevcoll provided, or the quick_install option for multiple direct installations:

  • Database Server (MySQL)

    • source ./extra/lib.sh
    • quick_setup install_multi_mysql <dev/prod>
  • Cache Server (Memcached)

    • source ./extra/lib.sh
    • quick_setup install_multi_nginx <dev/prod>
  • HHVM Server (HHVM)

    • source ./extra/lib.sh
    • quick_setup install_multi_hhvm <dev/prod> <IP of MySQL Server> <IP of Memcached Server>
  • Web Server (Nginx)

    • source ./extra/lib.sh
    • quick_setup install_multi_nginx <dev/prod> <IP of HHVM Server>

justinwray avatar Aug 04 '17 15:08 justinwray

Restart HHVM service every 5 minutes.

Test this very carefully - as HHVM is a JIT with a high warmup cost, this will usually make things much worse.

In general, when looking at HHVM's CPU usage, the best tool to use is 'perf' from the linux perf tools; as HHVM generates a perf PID map, this will show both the native C++ and JITed PHP functions in the same stack traces. If you get lots of memory addresses with out names in the stack traces, the PID map option is probably off or there's a permission issue reading it; if you get no PHP functions or raw memory addresses, it's probably not warmed up yet and there's no JITed code.

You can also use perf to profile the entire system, not just HHVM - so if, for example, there's a problem with MySQL, this will show you the PHP call to MySQL, the C++ HHVM builtin talking to MySQL, the C functions in MySQL, and ultimately the kernel calls that MySQL is spending its' time in.

fredemmott avatar Aug 11 '17 15:08 fredemmott

HEAD: 1f236bba7e5e7a3e7169ff61037502a6928a94ba on an EC2 m4.x16large with a separate AWS Aurora SQL Database.

The platform (coupled with our naiveté) caused us a lot of misery, and possibly the reputation of our yearly CTF event CodeFest '17 CTF

utilisation

As soon as the contest began, the website slowed to a crawl. Error 504s everywhere. I shifted the MySQL database to AWS Aurora and upgraded m4.xlarge to m4.x16large. Still virtually no difference.

ubuntu@ip-172-31-20-69:/var/log/hhvm$ wc -l error.log 6853801 error.log

Mid-way into the contest, the game board stopped opening at all as GET-ing /data/teams.php would always fail. It's still not working, even after stopping the game and dropping all entries from sessions.

ubuntu@ip-172-31-20-69:/var/log/hhvm$ cat error.log | grep -v mysql | grep -v translation | tail [Sun Sep 24 14:52:45 2017] [hphp] [15327:7f9ac5bff700:735:000001] [] \nFatal error: Uncaught exception 'IndexRedirectException' with message '' in /var/www/fbctf/src/SessionUtils.php:113\nStack trace:\n#0 /var/www/fbctf/src/data/country-data.php(7): SessionUtils::enforceLogin()\n#1 {main} [Sun Sep 24 14:52:45 2017] [hphp] [15327:7f9ac13ff700:1561:000001] [] \nFatal error: Uncaught exception 'IndexRedirectException' with message '' in /var/www/fbctf/src/SessionUtils.php:113\nStack trace:\n#0 /var/www/fbctf/src/data/teams.php(7): SessionUtils::enforceLogin()\n#1 {main} [Sun Sep 24 14:52:47 2017] [hphp] [15327:7f9ac5bff700:736:000001] [] \nFatal error: Uncaught exception 'IndexRedirectException' with message '' in /var/www/fbctf/src/SessionUtils.php:113\nStack trace:\n#0 /var/www/fbctf/src/data/map-data.php(7): SessionUtils::enforceLogin()\n#1 {main} [Sun Sep 24 14:52:47 2017] [hphp] [15327:7f9ac13ff700:1562:000001] [] \nFatal error: Uncaught exception 'IndexRedirectException' with message '' in /var/www/fbctf/src/SessionUtils.php:113\nStack trace:\n#0 /var/www/fbctf/src/data/country-data.php(7): SessionUtils::enforceLogin()\n#1 {main} [Sun Sep 24 14:52:47 2017] [hphp] [15327:7f9ac5bff700:737:000001] [] \nFatal error: Uncaught exception 'IndexRedirectException' with message '' in /var/www/fbctf/src/SessionUtils.php:113\nStack trace:\n#0 /var/www/fbctf/src/data/teams.php(7): SessionUtils::enforceLogin()\n#1 {main} [Sun Sep 24 14:52:47 2017] [hphp] [15327:7f9ac13ff700:1563:000001] [] \nFatal error: Uncaught exception 'IndexRedirectException' with message '' in /var/www/fbctf/src/SessionUtils.php:113\nStack trace:\n#0 /var/www/fbctf/src/data/map-data.php(7): SessionUtils::enforceLogin()\n#1 {main} [Sun Sep 24 14:52:47 2017] [hphp] [15327:7f9ac13ff700:1564:000001] [] \nFatal error: Uncaught exception 'IndexRedirectException' with message '' in /var/www/fbctf/src/SessionUtils.php:113\nStack trace:\n#0 /var/www/fbctf/src/data/teams.php(7): SessionUtils::enforceLogin()\n#1 {main} [Sun Sep 24 14:52:47 2017] [hphp] [15327:7f9ac13ff700:1565:000001] [] \nFatal error: Uncaught exception 'IndexRedirectException' with message '' in /var/www/fbctf/src/SessionUtils.php:113\nStack trace:\n#0 /var/www/fbctf/src/data/command-line.php(7): SessionUtils::enforceLogin()\n#1 {main} [Sun Sep 24 14:52:47 2017] [hphp] [15327:7f9ac5bff700:738:000001] [] \nFatal error: Uncaught exception 'IndexRedirectException' with message '' in /var/www/fbctf/src/SessionUtils.php:113\nStack trace:\n#0 /var/www/fbctf/src/data/configuration.php(7): SessionUtils::enforceLogin()\n#1 {main} [Sun Sep 24 14:52:47 2017] [hphp] [15327:7f9ac13ff700:1566:000001] [] \nFatal error: Uncaught exception 'IndexRedirectException' with message '' in /var/www/fbctf/src/SessionUtils.php:113\nStack trace:\n#0 /var/www/fbctf/src/data/country-data.php(7): SessionUtils::enforceLogin()\n#1 {main}

There were around ~500 team registrations and around 276 teams managed to make a submission. There were as many as 160 sessions open simultaneously.

5hubh4m avatar Sep 25 '17 05:09 5hubh4m

@5hubh4m:

Thank you for submitting your report (and using the existing issue thread)! Could you please provide the quick_setup or provision details/options you used?

It sounds like your game has already concluded, but if you do still have the system running, and are experiencing performance issues, please take a look at https://github.com/facebook/hhvm/wiki/Profiling#linux-perf-tools. The reports from pref, sorted by both Self and Children during performance degradation will be beneficial for further troubleshooting.

Even if you are unable to obtain the pref reports, did you happen to capture the output from top or ps during the performance degradation? The output from such tools may help track down which process was using an excessive amount of CPU.

The output from your error_log are entries showing requests that require a valid session but were sent unauthenticated. In other words, these "errors" are simply requests where someone was not logged in. Given that sessions do timeout, this frequently happens when people leave their browser windows open on the scoreboard, but otherwise inactive.

@5hubh4m / All:

We have been dedicating much focus on the overall performance of the platform. One area of significant improvement has been in the caching of database requests. Nearly all requests are now cached, and the results are stored for subsequent usage unless the cached data is invalidated on specific actions. As a result, the total number of database queries is very low.

Another major area of performance improvement has been the ability, through the quick_setup or provision process, to separate the various components of the platform onto different servers/instances/containers. In total, you can install the platform components across four servers; one for Nginx, one for HHVM, one for MySQL, and one for Memcached. Furthermore, on platforms like Amazon Web Services (or equivalent cloud solutions) you can use services such as RDS for MySQL and ElastiCache for Memcached, to provide more purpose-tailored performance.

While we continue to do load and stress testing of our own, we, unfortunately, have been unable to reproduce the issue(s) detailed in this thread. Our testing, with hundreds of users, on an even smaller resource footprint (2 GHz of CPU and 2 GBs of RAM) shows negligible latency for the end-user. One thing that has become apparent through our testing is that MySQL, given the caching, is not experiencing a significant amount of load. Instead, it becomes more and more likely that HHVM is the cause of excessive CPU utilization, system load, and subsequent latency.

We are finalizing testing that will migrate the platform from Ubuntu 14.04 and HHVM 3.18 to Ubuntu 16.04 and HHVM 3.21. While this will not ensure a performance boost, it does include performance improvements to HHVM itself.

One commonality we see in the performance reports is the usage of Amazon Web Services. While the platform is certainly capable of running on AWS (and some of our testing takes place on AWS as well), there is one important aspect of AWS T2 instances to consider: CPU Credits. When using a t2 instance on AWS, you are given a set number of CPU credits and a baseline CPU level. For example, on a t2.medium you are given 24 CPU credits with a CPU baseline of 40%. A single CPU credit is worth 1 minute of 100% CPU usage. As credits are earned and stored (for 24 hours), you could burn through your entire balance of credits sequentially. For a t2.medium you can store a total of 576 credits (earned over 24 hours). In other words, with a t2.medium you are given 40% of the CPU resources across the board, with the ability to boost (on average) to 100% for 24 minutes per hour. Learn More about CPU Credits here: http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/t2-instances.html#t2-instances-cpu-credits

We will continue to dedicate effort to improving the performance of the platform and continue to test and troubleshoot in attempt to recreate the aforementioned issue(s).

When experiencing a performance issue, please include the following in your reports:

  • Platform Version (HEAD commit used)
  • Installation Mode (Production or Development)
  • Installation Method (Quick Setup or Provision with Options)
  • Linux Distribution
  • Linux Kernel
  • Nginx Version
  • HHVM Version
  • MySQL Version
  • Memcached Version
  • Top Output (during performance degradation - On each server when using multiple servers)
  • PS Output (during performance degradation - ps aux - On each server when using multiple servers)
  • Relevant HHVM Errors (/var/log/hhvm/error.log)
  • Profiling Reports (during performance degradation - https://github.com/facebook/hhvm/wiki/Profiling#linux-perf-tools - Sorted by Self and Children)
  • Number of Teams
  • Average Number of Sessions (during performance degradation)

justinwray avatar Sep 25 '17 20:09 justinwray

@justinwray

That is easy. hhvm was constantly on top with 5000-6000 % CPU consumption. That snippet of error.log I posted didn't contain the Resource Not Available errors because I wasn't able to locate them in the 1G text file. The ones I did mention are CERTAINLY not due to unauthenticated access, as those were done after the game was over and the only session active was mine. The perf command when used during the game to see syscall errors was riddled with EAGAINs.

I installed the game using provision.sh using the letsencrypt options and single server setup. The Db was then dumped using mysqldump, moved to Aurora and I edited settings.ini to reflect the changes made.

Interestingly, both disk and RDS were not the bottleneck. The RDS witnessed 400 connections at the peak but the CPU usage remained 20-25%. The disk usage (which was a provisioned IO disk on AWS) also didn't was out with it's provisioned IOPS values.

ubuntu@ip-172-31-20-69:~$ uname -a Linux ip-172-31-20-69 3.13.0-129-generic #178-Ubuntu SMP Fri Aug 11 12:48:20 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

ubuntu@ip-172-31-20-69:~$ memcached -help memcached 1.4.14

ubuntu@ip-172-31-20-69:~$ hhvm --version HipHop VM 3.18.5 (rel) Compiler: tags/HHVM-3.18.5-0-g61f6a1f9a199c929980408aff866f36a7b4a1515 Repo schema: 514949365dd9d370d84ea5a6db4a3dd3b619e484

ubuntu@ip-172-31-20-69:~$ nginx -v nginx version: nginx/1.4.6 (Ubuntu)

I don't have the top or ps outputs unfortunately.

Also, a comment was made on CTFTime about fbctf's performance.

illustris – Sept. 22, 2017, 5:45 p.m. You should have tested the platform well before using it. We considered using it for BITSCTF last year, but a simple load test with 50 people made it crawl on a fairly powerful VPS

5hubh4m avatar Sep 26 '17 01:09 5hubh4m

@5hubh4m Regarding those errors, they are generated when somebody does not have a session, but for instance leaves the scoreboard open. Until a hard refresh is performed, they are not redirected to the login page, and the requests are coming in without a session. These look to be the typical requests performed during an AJAX refresh, which happens every ~5 seconds on each platform window.

For instance, even if you removed all users, or disabled login, or invalidated all sessions in the platform. Any user that was previously authenticated, but still resides on the gameboard page, will generate these errors. They also come up upon a hard refresh that does redirect to the login page (before the redirect invalid requests are coming in).

As @justinwray pointed out, we are heavily focused on reproducing these issues and finding a resolution. Unfortunately the biggest problem has been duplicating these live environments with hundreds or thousands of users, and ensuring our test provision and configuration matches exactly what the game admins are utilizing.

This morning I setup a stress test where I provisioned on a VM with only 2GB of memory and 2 cores. After setting up various flags and quizzes, with different configurations, I then hammered the platform with constant logins and subsequent endpoint requests of all types for over 6 hours, with hundreds of sessions, and an estimated typical user load of a few hundred users.

I saw the memory for HHVM go up about 8% during the event, and HHVM CPU was very consistent at 100% (ranging between 50% and 150% or about 1.5 cores worth of utilization). NGINX and Memcached showed CPU at around 10-30% on spikes, and MySQL easily handled the requests without even coming up on top. The bottleneck seemed to be HHVM, and yet even after 6 hours I was able to capture levels, bring up countries, and perform typical platform actions in multiple browsers without any noticeable latency.

The perf command showed that requests to Memcached were using up the most CPU in the HHVM process, which probably makes sense considering that a majority of MySQL queries were cached, and very few queries are actually sent to MySQL. Once a user captures a level or other actions take place, Memcached is invalidated, MySQL gets the queries once, then they are again cached into Memcached.

The next step on our end is now to try to reproduce the problem on AWS itself on instance types such as the ones you used, and see if that environment is involved in the issue.

stevcoll avatar Sep 26 '17 03:09 stevcoll

@mpomarole : do you remember why this switched from the mcrouter extension to memcached in https://github.com/facebook/fbctf/commit/d5846c54eb418fae8ebf1eba7a926c53bba7faa3 ?

fredemmott avatar Sep 26 '17 03:09 fredemmott

not that I think that we should need an async mysql extension, just that the mcrouter extension is extremely thoroughly tested for perf problems, whereas the memcached extension basically exists for PHP compatibility.

fredemmott avatar Sep 26 '17 03:09 fredemmott

Steve, The problem grow when a team gets ones answer right amount all the other wrong answer. This causes there score to be recalculated by each team. I did a custom load testing script that worked well and was able to generate a fair amount of load. It's far from perfect, but worked for what I needed to generate load on the system that caused the issues. The problem I found with the stress testing is its about the refresh on the scoreboard that causes the most load as HHVM reserves CPU while waiting for the mysql command result which when you have a high number of failed guesses, takes alot of time.

https://github.com/wisco24/ctfloadtesting I never fully automated it, but got it working well enough for what i needed with some manual steps...

-Eric

On Tue, Sep 26, 2017 at 11:58 AM, Fred Emmott [email protected] wrote:

(not that I think that we should need an async mysql extension, just that the mcrouter extension is extremely thoroughly tested for perf problems)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/facebook/fbctf/issues/456#issuecomment-332079777, or mute the thread https://github.com/notifications/unsubscribe-auth/AUM9APf7uKdSK2BhTbjySVyR8a8gnwvXks5smHZ1gaJpZM4MChD2 .

-- Eric Shulze (920) 540-4735 [email protected]

wisco24 avatar Sep 26 '17 11:09 wisco24

The other issue is after the user interaction stops, HHVM never drops its resources, it still stays at the 100% for some reason. Also, for your cacheing example, its not as great as that.. Think of a one day event with 30 teams, and 30 flags. That's 900 times the memchache is invalidated.

That over 100 times an hour, or about once a minute... the caching doesn't help as much then...

On Tue, Sep 26, 2017 at 7:32 PM, [email protected] wrote:

Steve, The problem grow when a team gets ones answer right amount all the other wrong answer. This causes there score to be recalculated by each team. I did a custom load testing script that worked well and was able to generate a fair amount of load. It's far from perfect, but worked for what I needed to generate load on the system that caused the issues. The problem I found with the stress testing is its about the refresh on the scoreboard that causes the most load as HHVM reserves CPU while waiting for the mysql command result which when you have a high number of failed guesses, takes alot of time.

https://github.com/wisco24/ctfloadtesting I never fully automated it, but got it working well enough for what i needed with some manual steps...

-Eric

On Tue, Sep 26, 2017 at 11:58 AM, Fred Emmott [email protected] wrote:

(not that I think that we should need an async mysql extension, just that the mcrouter extension is extremely thoroughly tested for perf problems)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/facebook/fbctf/issues/456#issuecomment-332079777, or mute the thread https://github.com/notifications/unsubscribe-auth/AUM9APf7uKdSK2BhTbjySVyR8a8gnwvXks5smHZ1gaJpZM4MChD2 .

-- Eric Shulze (920) 540-4735 [email protected]

-- Eric Shulze (920) 540-4735 [email protected]

wisco24 avatar Sep 26 '17 11:09 wisco24

@5hubh4m @fredemmott @justinwray @wisco24

Last night I attempted to duplicate the @5hubh4m scenario with some load testing scripts. The general setup was starting with an AWS EC2 m4.large (2 Cores, 8GB memory). Standard production provision with all services running on the one server. Added 10 quizzes and 10 flags, description text heavy, some with attachments, hints, different point values, etc. Turned on registration and started the event.

Each Perl script "client" was run from one of two separate m4.large instances and these scripts registered on the platform, logged in, then went through an endless cycle of loading the gameboard page then performing AJAX requests every 5 seconds (3 times). The total cycle was 20 seconds. After each cycle, the script had a 2% randomized chance of submitting a flag. When that happened, the script then had a 20% chance of getting the flag answer correct.

Started off with 50 script sessions running. This brought HHVM to about 125% utilization out of a total 200%. The platform was usable, but I did notice some throttling of the HHVM access log every few seconds, where it would "jam up" before continuing. Essentially the CPU on HHVM ranged from 50% to 150% every few seconds. Memory on HHVM and the entire platform was a non-issue.

I increased to 150 script sessions, and that brought the CPU average up to around 170%. Platform still usable, but now seeing cases where Modals were taking 10 seconds to pop up.

At 300 script sessions, the CPUs were maxed out on HHVM at 200%. The platform at this point became completely unusable as the HHVM access log crawled and froze up constantly. The script sessions started dying from timeouts. I ran perf for the first time with these results:

perf01

perf02

I then stopped the instance and upgraded to a m4.16xlarge (64 Cores, 256GB memory). Resumed the event starting with 350 sessions. HHVM immediately ate up a significant amount of CPU on all 64 cores. The platform was usable but again Modals were taking seconds to popup.

I started seeing this error all over the place (which I had not seen even once on the m4.large:

Error executing AsyncMysql operation: Failed (mysql error: 1040: Too many connections)

This caused many of the scripts to timeout and/or die. Increased the number of connections to a million in the MySQL configuration and restarted that. This seemed to solve the issue for a few minutes, before I instead got this error:

Error executing AsyncMysql operation: Failed (mysql error: 2004: Can't create TCP/IP socket

Finally at around 400 users I started getting this error message on the scripts which was very strange:

Error GETing /data/teams.php: hphp_invoke
Error GETing /data/country-data.php: hphp_invoke

The admin sessions page would now not even load, with the following error message:

[Tue Sep 26 07:36:35 2017] [hphp] [1817:7f8f25bff700:1194:000001] [] \nFatal error: Uncaught exception 'HH\\InvariantException' with message 'all_teams should of type Team and not null' in /var/www/fbctf/src/models/MultiTeam.php:72\nStack trace:\n#0 /var/www/fbctf/src/models/MultiTeam.php(72): HH\\invariant_violation()\n#1 /var/www/fbctf/src/controllers/AdminController.php(3749): MultiTeam::genTeam()\n#2 /var/www/fbctf/src/controllers/AdminController.php(4091): AdminController->genRenderSessionsContent()\n#3 /var/www/fbctf/src/controllers/AdminController.php(4104): AdminController->genRenderPage()\n#4 /var/www/fbctf/src/controllers/Controller.php(37): AdminController->genRenderBody()\n#5 /var/www/fbctf/src/Router.php(73): Controller->genRender()\n#6 /var/www/fbctf/src/Router.php(20): Router::genRouteNormal()\n#7 /var/www/fbctf/src/index.php(7): Router::genRoute()\n#8 (): genInit()\n#9

Somewhere between 400 and 500 users the platform was again unusable. The HHVM access log again crawling and unable to even remotely keep up with requests. Could not load the platform page. I performed another perf and got the resource graphs from AWS:

image

image

image

The timeline on the graph is that the event started at 6:30GMT, and the instance was upgraded to the m4.16xlarge at 7:15GMT. The reason CPU isn't pegged some of the time is really my inability to keep the session scripts running while MySQL had too many connections or dying from timeouts.

Overall Memcached does seem to be a primary bottleneck inside of HHVM as far as I can read perf. I don't understand how HHVM could utilize so many cores on the m4.16xlarge. Neither MySQL or NGINX had any issues whatsoever in terms of resource usage. Memcached ended up utilizing 1.5 cores towards the end of my test with ~450 sessions.

stevcoll avatar Sep 26 '17 14:09 stevcoll

HHVM defaults to 2x number of cores, and you probably actually want higher: each concurrent request needs a thread.

A few things:

  • please use the perf map - this might require running perf report as root. This will replace the addresses with Hack function names
  • given how unserialize is showing up, we should look at exactly what’s been stored in memcached and if we can store something cheaper to serialize/unserialize instead

fredemmott avatar Sep 26 '17 14:09 fredemmott