quickstart-bitnami-wordpress
Bad performance of EFS might be fixed with cachefilesd
EFS performs very badly with WordPress due to the high amount of file access and the latency involved, and this is widely reported. It might be fixed with cachefilesd and the fsc option in fstab, which I have been testing. How can this be set up/configured when provisioning new instances? I reckon it should be part of the image.
Is it safe to use cloud-init to run commands as a part of provisioning?
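For reference, this is roughly what I have been testing by hand; a minimal user-data sketch only (assumptions: a Debian/Ubuntu based image, a placeholder EFS ID and region, and /mnt/efs as the mount point):
#!/bin/bash
# Hypothetical provisioning sketch, not part of the Bitnami image.
# Placeholders: fs-XXXXXXXX (EFS ID), us-east-1 (region), /mnt/efs (mount point).

# Install the FS-Cache userspace daemon and the NFS client tools.
apt-get update -y
apt-get install -y cachefilesd nfs-common

# On Debian/Ubuntu the daemon refuses to start until RUN=yes is set.
echo "RUN=yes" >> /etc/default/cachefilesd
systemctl enable --now cachefilesd

# Mount EFS over NFSv4.1 with the fsc option so reads go through FS-Cache.
mkdir -p /mnt/efs
echo "fs-XXXXXXXX.efs.us-east-1.amazonaws.com:/ /mnt/efs nfs4 nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,noatime,fsc 0 0" >> /etc/fstab
mount /mnt/efs
In this sketch cachefilesd is started before the fsc-flagged mount; whether running this from cloud-init during provisioning is safe is exactly the question above.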
Hi @adionditsak, thanks for sharing this. I am interested in your findings; could you share your testing and your configuration? It is true that NFS can be slow when accessing files, but once PHP has cached the compiled file this should not be an issue. I would like to know whether this is specific to your use case, or whether it will hit every use case.
Answering your question: it should be safe. According to the cloud-init documentation, the script only runs at boot time, so it should not interfere with the rest of the services.
My concern is whether cachefilesd needs the NFS filesystem to be mounted already, and whether the mounted filesystems are available at that stage.
Thanks in advance, Rafael Rios Saavedra
Hi @adionditsak, We have enabled opcache by default, and I have checked that it is working.
You can check whether opcache is enabled by creating a file info.php with the following code:
<?php
phpinfo();
?>
and then accessing the URL http://your_ip_or_domain/info.php. Look for a section called Zend OPcache. This is an example of what I got:
Opcode Caching | Up and Running
-- | --
Optimization | Enabled
SHM Cache | Enabled
File Cache | Disabled
Startup | OK
Shared memory model | mmap
Cache hits | 198464
Cache misses | 274
Used memory | 49994192
Free memory | 150485480
Wasted memory | 846920
Interned Strings Used memory | 1850792
Interned Strings Free memory | 14926424
Cached scripts | 266
Cached keys | 281
Max keys | 7963
OOM restarts | 0
Hash keys restarts | 0
Manual restarts | 0
As you can see, I got almost 200k cache hits after running a benchmark test with ab:
ab -t 60 -c 10 http://your_ip_or_domain/
So, PHP should not be accessing the file system for each of your requests.
Could you check that you have opcache enabled and whether you see similar behavior?
Best regards, Rafael Rios Saavedra
Hi @rafariossaa,
Thanks for your responses.
opcache is working as expected, so this is not the problem. I suspect plugins or something similar. I even experimented with a CDN, memcached and cachefilesd. I will investigate further in a few days and get back to you.
What is the expected average response time with your stack? After the migration, the application's TTFB went from 800ms to 4-6 seconds. I will see what I can do, and if necessary you should eventually expand the documentation to include optional cache layers.
Hi @adionditsak, I would suggest you have a look at the load on the database. Sometimes many queries are needed to render one page. There are many factors that affect the average response time: the load on the servers, the load on AWS, the load on the database, the size of the servers and the database, whether a backup is running, the theme you are using, the plugins, etc. I am looking forward to your results.
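If it helps, a quick way to see whether the database is part of the problem could be to enable the slow query log while loading a few pages (just a sketch; the credentials, log path and the 0.5s threshold are placeholders, and on RDS/Aurora this would be done through a parameter group instead):
# Hypothetical sketch: turn on the MySQL/MariaDB slow query log at runtime.
mysql -u root -p -e "
  SET GLOBAL slow_query_log = 'ON';
  SET GLOBAL long_query_time = 0.5;
  SET GLOBAL slow_query_log_file = '/tmp/mysql-slow.log';"

# Load some pages, then check which queries are slow and how many there are.
sudo tail -f /tmp/mysql-slow.log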
Thanks in advance, Rafael Rios Saavedra
Hi @adionditsak, I am going to perform a study on the performance, and I would appreciate it if you could give me some details about the WordPress you are running.
I would like to know:
- specs of the deployment you have done (number and type of nodes, region, whether you are using memcached, etc ...)
- whether the WP you tested is the default one, or whether you migrated a WP from somewhere else. In that case, I would also like to know the size of the WordPress site, the size of the database, and the plugins in use.
Thanks in advance, Rafael Rios Saavedra
@adionditsak @rafariossaa - We're also interested in data you can provide around EFS performance.
Hi @adionditsak and @agamzn, I did some performance analysis on the impact of EFS in this solution and found that it doesn't have an impact. Measuring this is complex because there are many variables involved, like the size of the site, the number of requests per second, the amount of persisted files (media files), the plugins used, the JS used, etc. So in order to have a baseline I used a WordPress with the default theme and generated some random content. I did not use any plugins, nor anything that caches requests (e.g. memcached, W3 Total Cache) or helps with serving (e.g. a CDN). This is to measure raw power; the only things helping are the Linux filesystem cache and opcache (in PHP).
In this scenario, I simulated 100 concurrent users loading the page. I tested using Max I/O and provisioned throughput, and it made no difference in how the page was served. In our solution, the core of WP is on a local filesystem, and only the persisted data (media, installed plugins, etc.) is retrieved from EFS. PHP files, once compiled by php-fpm, are served very quickly, and I got cache hits in the order of 99.36%, so very little is done by filesystem access. What I found is that performance is very sensitive to what happens on the client side, and most of that is theme related.
I hope this sheds some light. Anyway, I am still interested in what is happening in your case. Please, could you give me more information about your scenario?
Best regards, Rafael Rios Saavedra.
Hi everyone, I have been testing a similar environment without luck; EFS is definitely the problem.
Here is the thing: you probably don't want HA with auto scaling for small WordPress sites, but for big ones. That means lots of plugins, lots of assets and probably logged-in users (which makes caching a nightmare).
So in that scenario each request would access tons of files and hit EFS hard. EFS's latency will make the site unusable. So, we can't have WP residing on EFS...
I will try cachefilesd (I will check https://blog.lawrencemcdaniel.com/tuning-aws-efs-for-wordpress/)... I hope to share some info next week!
Best regards, Mauricio
@rusowyler - I'm circling back on this. Did cachefilesd work?
@andrew-glenn I have tested this, but it really has no radical impact. I am still interested in hearing from others, as it might be due to the configuration.
Hi everyone, I have been testing in ap-northeast-2. Very slow performance compared to us-east-2.
Of course I used opcache.
It's the EFS problem.
I have just switched from multi-node EC2 on NFS to a single node on EBS and my website is running 4x faster. Don't waste time with the Bitnami HA solution.
I am having the same problem here and it is a challenge. I also switched to EBS and it works perfectly; I would say more than 10x the speed. However, I don't have a choice, I have to run HA WordPress. Is there a way we can extract the media directory, and any other directory that requires sharing, into a separate location? That would at least give us the possibility of mounting it in a shared location with RWX access while still allowing the site to run on EBS. With this, only access to those files would have to hit the shared location. It is a terrible experience to just sit and wait for a site to load.
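Something like the following is what I had in mind (just a sketch; the paths and the EFS mount point are assumptions, and code/plugin files would still need their own sync mechanism):
# Hypothetical sketch: keep WordPress core/plugins/themes on local EBS and
# share only the media uploads over EFS.
# Assumes EFS is already mounted at /mnt/efs and WP lives in /var/www/html.
sudo mkdir -p /mnt/efs/wp-uploads

# One-time copy of the existing media to the shared location.
sudo rsync -a /var/www/html/wp-content/uploads/ /mnt/efs/wp-uploads/

# Bind-mount the shared directory over the local uploads path
# (a symlink also works, but some plugins dislike symlinked uploads).
sudo mv /var/www/html/wp-content/uploads /var/www/html/wp-content/uploads.local
sudo mkdir /var/www/html/wp-content/uploads
echo "/mnt/efs/wp-uploads /var/www/html/wp-content/uploads none bind 0 0" | sudo tee -a /etc/fstab
sudo mount /var/www/html/wp-content/uploads
With this, only media reads and writes hit EFS, while the PHP files stay on fast local storage on every node.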
Now that it is possible to provision higher throughput with EFS without uploading 100GB of dummy files, has anyone tried that?
Is EFS for wordpress helping or not?
From everything I've read, the main issue is speed, even with file caching, etc. At re:Invent 2019, Amazon announced that it would increase the base performance by 5x and solve the burst credit issues that everyone is complaining about. We'll have to see how the new setup performs in the real world, but it could be good enough for hosting WordPress on EFS.
Can someone try benchmarking it again? It might be "fixed" now: https://aws.amazon.com/de/about-aws/whats-new/2020/04/amazon-elastic-file-system-announces-increase-in-read-operations-for-general-purpose-file-systems/
FYI - multi-attach EBS is now available for Provisioned IOPS EBS volumes. This might be a decent solution to replace EFS. https://aws.amazon.com/about-aws/whats-new/2020/02/ebs-multi-attach-available-provisioned-iops-ssd-volumes/
Has anyone benchmarked the new EFS performance or tried multi-attach EBS with WordPress?
I am also interested in hosting WordPress on AWS, but I have read many posts online about the terrible filesystem performance.
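If anyone wants a raw number outside of WordPress itself, a small random-read micro-benchmark against the EFS mount versus local storage could be enough to see the latency gap (sketch only; assumes fio is installed, EFS is mounted at /mnt/efs, and /var/tmp sits on EBS):
# Hypothetical sketch: 4k random reads as a rough proxy for the many small
# file reads and stat() calls WordPress issues per request.
sudo apt-get install -y fio

fio --name=efs-randread --directory=/mnt/efs --rw=randread --bs=4k \
    --size=256M --numjobs=4 --iodepth=16 --ioengine=libaio --direct=1 \
    --runtime=60 --time_based --group_reporting

fio --name=ebs-randread --directory=/var/tmp --rw=randread --bs=4k \
    --size=256M --numjobs=4 --iodepth=16 --ioengine=libaio --direct=1 \
    --runtime=60 --time_based --group_reporting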
@collimarco , I ran tests with EFS burst and provisioned throughput. Here are my conclusions (for my use case, a large WP site):
- PHP page generation goes from ~300ms to ~1200ms just by going from EBS to EFS, changing nothing else (yes, NFS caching and opcode caching were ON).
- Burst throughput can go high, but not for very long
- Provisioning higher than ~20MBps EFS didn't yield higher performance, probably because WP is not an app that can create a lot of concurrency needed for EFS to scale.
- Copying data using standard commands like cp or rsync is excruciatingly slow. Again, it scales if you can create hundreds of threads to copy stuff.
So, "it can work", but I didn't find it to be convenient, and I'm not against the wall where I have to scale horizontally.
A friend of mine mentioned that it is possible to cache NFS files using Memcache, but I haven't looked at it.
multi-attach EBS doesn't work with EXT4, and only runs on the Nitro instances, so not 100% sure how practical it is, but I'd love to know!
@hubertnguyen Shouldn't opcode caching alleviate any bottleneck from the filesystem in use? Warming up the opcode cache can be dead slow, but once things are cached, it does very little I/O. At least that's the idea behind all these approaches, i.e. cache the PHP files in some way or another on the server itself.
Can you confirm that, with opcode caching, you turned off the stat functionality that checks each file for modification on every request before simply serving it from the cache?
What do you think is causing the EFS to be a bottleneck here?
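For reference, these are the directives I mean (sketch only; the ini path and the php-fpm service name depend on the stack and are assumptions here):
# Hypothetical sketch: stop opcache from stat()ing every cached script on
# each request, so cached code never touches NFS again until a reset/reload.
cat <<'EOF' | sudo tee /etc/php/7.2/fpm/conf.d/90-opcache-no-stat.ini
opcache.enable=1
; 0 = never check file timestamps; code changes need an explicit reset/reload
opcache.validate_timestamps=0
; ignored when validate_timestamps=0
opcache.revalidate_freq=0
EOF
sudo systemctl reload php7.2-fpm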
@ashfame if you turn off file checks completely I guess that you break core / plugin updates... is there any solution?
@ashfame , on paper it seems like "it should", and once you run traffic, most of the needed files should be warmed up after page view #1 (WP core, your template, etc.). I can confirm opcode caching and NFS file caching were "on" -- I don't remember specifically disabling the per-request checking though, so if you find out, let me know.
@collimarco , the default TTL is ~60sec for opcode caching. As you point out, it's probably not a great idea to significantly extend it. I was worried about the same thing.
I'm not sure what the ~900ms gap is. Unfortunately, during my research I have not found an example where people ran with it without some kind of significant trade-off.
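On the point above about breaking core/plugin updates: one possible workaround (untested by me; the file name and token are made up) is to keep validate_timestamps=0 and reset the cache explicitly after every update. Running opcache_reset() from the CLI only clears the CLI's own cache, so the reset has to happen inside php-fpm, e.g. via a tiny protected endpoint:
# Hypothetical sketch: drop a flush endpoint into the web root and call it
# over HTTP after core/plugin updates.
cat <<'EOF' | sudo tee /var/www/html/flush-opcache.php
<?php
// Require a shared secret so the endpoint cannot be hit by just anyone.
if (($_GET['token'] ?? '') !== 'change-me') {
    http_response_code(403);
    exit;
}
// Runs inside php-fpm, so this clears the cache that web traffic uses.
var_dump(opcache_reset());
EOF

# After running an update:
curl -s "https://example.com/flush-opcache.php?token=change-me"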
Update: Checking the Opcache statistics showed that 16MB of memory wasn't enough to hold all the scripts. Bumping it to 32MB shows much better results, which are at the bottom.
Cache hits | 466415
Cache misses | 306
Used memory | 25762104
Free memory | 7792328
I'm running a very small WordPress installation (four plugins, 1.2GB of content, a few hundred megs of DB). I decided to break down each performance improvement I had in place.
The TL;DR for the below: it seems a t3.micro is a bad idea for a reasonably busy website? Performance goes down as more performance features are enabled. The usual response times I get after each upgrade are around 300-400ms; I've added them at the bottom. It's the very first ab run before all this experimenting, which may accidentally have had Opcache completely disabled. 😆
Test machine:
- t3.nano in ca-central-1
- With Opcache enabled 232MB of RAM active
- 1GB of swap, ~100MB active
- Ubuntu 18.04.4 LTS
- Many many retests happened, T3 credits dipped from 144 (max) to 124
I didn't reboot the machine, but I did rebuild my docker-compose setup between each of the runs below. Every test had an identical ab call between them. Also, I didn't remove the fsc argument from fstab; instead I chose to disable the cachefilesd daemon.
Starting with everything off:
me@web-01:~/web-service$ cat docker-compose/php-fpm/opcache.ini
; From https://laravel-news.com/php-opcache-docker
;
[opcache]
opcache.enable=0
; 0 means it will check on every request
; 0 is irrelevant if opcache.validate_timestamps=0 which is desirable in production
opcache.revalidate_freq=600
opcache.validate_timestamps=1
opcache.max_accelerated_files=983
opcache.memory_consumption=16
opcache.max_wasted_percentage=10
opcache.interned_strings_buffer=8
me@web-01:~/web-service$ ab -n 1000 -c 20 https://example.com/index.php
This is ApacheBench, Version 2.3 <$Revision: 1807734 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking example.com (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests
Server Software: nginx
Server Hostname: example.com
Server Port: 443
SSL/TLS Protocol: TLSv1.2,ECDHE-RSA-AES256-GCM-SHA384,2048,256
TLS Server Name: example.com
Document Path: /index.php
Document Length: 0 bytes
Concurrency Level: 20
Time taken for tests: 234.205 seconds
Complete requests: 1000
Failed requests: 0
Non-2xx responses: 1000
Total transferred: 217000 bytes
HTML transferred: 0 bytes
Requests per second: 4.27 [#/sec] (mean)
Time per request: 4684.098 [ms] (mean)
Time per request: 234.205 [ms] (mean, across all concurrent requests)
Transfer rate: 0.90 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 2 3 9.0 2 199
Processing: 963 4648 5207.1 3752 48704
Waiting: 963 4648 5207.1 3751 48704
Total: 985 4651 5210.2 3754 48708
Percentage of the requests served within a certain time (ms)
50% 3754
66% 3779
75% 3797
80% 3807
90% 3840
95% 3873
98% 27081
99% 38119
100% 48708 (longest request)
Now with Opcache enabled but revalidate_freq set to zero:
; From https://laravel-news.com/php-opcache-docker
;
[opcache]
opcache.enable=1
; 0 means it will check on every request
; 0 is irrelevant if opcache.validate_timestamps=0 which is desirable in production
opcache.revalidate_freq=0
opcache.validate_timestamps=1
opcache.max_accelerated_files=983
opcache.memory_consumption=16
opcache.max_wasted_percentage=10
opcache.interned_strings_buffer=8
me@web-01:~/web-service$ ab -n 1000 -c 20 https://example.com/index.php
... removing the repeated info ...
Concurrency Level: 20
Time taken for tests: 94.065 seconds
Complete requests: 1000
Failed requests: 0
Non-2xx responses: 1000
Total transferred: 217000 bytes
HTML transferred: 0 bytes
Requests per second: 10.63 [#/sec] (mean)
Time per request: 1881.296 [ms] (mean)
Time per request: 94.065 [ms] (mean, across all concurrent requests)
Transfer rate: 2.25 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 2 3 3.7 3 30
Processing: 489 1859 120.5 1859 2536
Waiting: 489 1859 120.5 1859 2536
Total: 515 1862 119.5 1862 2565
Percentage of the requests served within a certain time (ms)
50% 1862
66% 1882
75% 1894
80% 1903
90% 1927
95% 1966
98% 2084
99% 2187
100% 2565 (longest request)
Now with revalidate_freq set to ten minutes:
me@web-01:~/web-service$ cat docker-compose/php-fpm/opcache.ini
; From https://laravel-news.com/php-opcache-docker
;
[opcache]
opcache.enable=1
; 0 means it will check on every request
; 0 is irrelevant if opcache.validate_timestamps=0 which is desirable in production
opcache.revalidate_freq=600
opcache.validate_timestamps=1
opcache.max_accelerated_files=983
opcache.memory_consumption=16
opcache.max_wasted_percentage=10
opcache.interned_strings_buffer=8
me@web-01:~/web-service$ ab -n 1000 -c 20 https://example.com/index.php
... removing the repeated info ...
Concurrency Level: 20
Time taken for tests: 92.347 seconds
Complete requests: 1000
Failed requests: 0
Non-2xx responses: 1000
Total transferred: 217000 bytes
HTML transferred: 0 bytes
Requests per second: 10.83 [#/sec] (mean)
Time per request: 1846.940 [ms] (mean)
Time per request: 92.347 [ms] (mean, across all concurrent requests)
Transfer rate: 2.29 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 2 3 3.6 3 31
Processing: 464 1829 114.7 1832 2312
Waiting: 464 1829 114.7 1832 2311
Total: 478 1832 113.6 1835 2343
Percentage of the requests served within a certain time (ms)
50% 1835
66% 1854
75% 1867
80% 1876
90% 1902
95% 1929
98% 1939
99% 1981
100% 2343 (longest request)
Now turning the cachefilesd daemon back on:
me@web-01:~/web-service$ sudo service cachefilesd start
me@web-01:~/web-service$ ab -n 1000 -c 20 https://example.com/index.php
... removing the repeated info ...
Concurrency Level: 20
Time taken for tests: 94.687 seconds
Complete requests: 1000
Failed requests: 0
Non-2xx responses: 1000
Total transferred: 217000 bytes
HTML transferred: 0 bytes
Requests per second: 10.56 [#/sec] (mean)
Time per request: 1893.749 [ms] (mean)
Time per request: 94.687 [ms] (mean, across all concurrent requests)
Transfer rate: 2.24 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 2 3 2.7 3 28
Processing: 505 1871 111.8 1874 2368
Waiting: 505 1871 111.8 1874 2368
Total: 521 1874 110.9 1876 2392
Percentage of the requests served within a certain time (ms)
50% 1876
66% 1895
75% 1906
80% 1912
90% 1929
95% 1950
98% 1975
99% 2019
100% 2392 (longest request)
Now turning off access time in fstab:
me@web-01:~/web-service$ cat /etc/fstab
LABEL=cloudimg-rootfs / ext4 defaults,discard 0 0
/var/swap swap swap defaults 0 0
fs-abcd1234.efs.ca-central-1.amazonaws.com:/ /mnt/efs nfs4 nfsvers=4.1,rsize=1048576,wsize=1048576,hard,timeo=600,retrans=2,async,noatime,fsc 0 0
me@web-01:~/web-service$ sudo umount /mnt/efs && sudo mount /mnt/efs
me@web-01:~/web-service$ ab -n 1000 -c 20 https://example.com/index.php
This is ApacheBench, Version 2.3 <$Revision: 1807734 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking example.com (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests
Server Software: nginx
Server Hostname: example.com
Server Port: 443
SSL/TLS Protocol: TLSv1.2,ECDHE-RSA-AES256-GCM-SHA384,2048,256
TLS Server Name: example.com
Document Path: /index.php
Document Length: 0 bytes
Concurrency Level: 20
Time taken for tests: 92.908 seconds
Complete requests: 1000
Failed requests: 0
Non-2xx responses: 1000
Total transferred: 217000 bytes
HTML transferred: 0 bytes
Requests per second: 10.76 [#/sec] (mean)
Time per request: 1858.155 [ms] (mean)
Time per request: 92.908 [ms] (mean, across all concurrent requests)
Transfer rate: 2.28 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 2 3 3.5 3 30
Processing: 604 1837 110.2 1839 2416
Waiting: 604 1837 110.3 1838 2416
Total: 615 1841 109.7 1841 2446
Percentage of the requests served within a certain time (ms)
50% 1841
66% 1863
75% 1876
80% 1884
90% 1904
95% 1927
98% 1974
99% 2165
100% 2446 (longest request)
What I saw at the beginning of these tests:
me@web-01:~/web-service$ ab -n 1000 -c 20 https://example.com/index.php
This is ApacheBench, Version 2.3 <$Revision: 1807734 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/
Benchmarking example.com (be patient)
Completed 100 requests
Completed 200 requests
Completed 300 requests
Completed 400 requests
Completed 500 requests
Completed 600 requests
Completed 700 requests
Completed 800 requests
Completed 900 requests
Completed 1000 requests
Finished 1000 requests
Server Software: nginx
Server Hostname: example.com
Server Port: 443
SSL/TLS Protocol: TLSv1.2,ECDHE-RSA-AES256-GCM-SHA384,2048,256
TLS Server Name: example.com
Document Path: /index.php
Document Length: 0 bytes
Concurrency Level: 20
Time taken for tests: 11.618 seconds
Complete requests: 1000
Failed requests: 0
Non-2xx responses: 1000
Total transferred: 217000 bytes
HTML transferred: 0 bytes
Requests per second: 86.07 [#/sec] (mean)
Time per request: 232.365 [ms] (mean)
Time per request: 11.618 [ms] (mean, across all concurrent requests)
Transfer rate: 18.24 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 2 3 3.5 3 29
Processing: 32 227 22.8 227 515
Waiting: 32 227 22.8 227 515
Total: 58 231 21.7 230 541
Percentage of the requests served within a certain time (ms)
50% 230
66% 236
75% 240
80% 242
90% 248
95% 256
98% 273
99% 281
100% 541 (longest request)
For completeness, here are my graphs for EFS and t3 credits:
Results after increasing Opcache buffers:
Concurrency Level: 20
Time taken for tests: 11.507 seconds
Complete requests: 1000
Failed requests: 0
Non-2xx responses: 1000
Total transferred: 217000 bytes
HTML transferred: 0 bytes
Requests per second: 86.91 [#/sec] (mean)
Time per request: 230.136 [ms] (mean)
Time per request: 11.507 [ms] (mean, across all concurrent requests)
Transfer rate: 18.42 [Kbytes/sec] received
Connection Times (ms)
min mean[+/-sd] median max
Connect: 2 4 4.4 3 36
Processing: 38 225 20.8 226 264
Waiting: 38 225 20.8 226 264
Total: 50 228 18.9 229 285
Percentage of the requests served within a certain time (ms)
50% 229
66% 235
75% 238
80% 240
90% 247
95% 253
98% 259
99% 265
100% 285 (longest request)
We are also trying to run our PHP-based application (not WordPress) over EFS, and we found there is a significant difference between EC2 instance types due to their network bandwidth. We benchmarked t3.large (5Gb/s), m5.large (10Gb/s) and m5n.large (25Gb/s). These all have the same number of vCPUs and the same amount of memory.
- t3.large -> m5.large: +10% performance
- m5.large -> m5n.large: +10% performance
- t3.large -> m5n.large: +20% performance
Be aware that the smaller t3 instances have 1Gb/s, which means network latency will be considerably higher.
AWS has an HA setup using LightSail with media offload. No reason for EFS. Has anyone tried the LightSail approach?
@minumula , how do they keep the code in sync between multiple nodes?
Lightsail only guarantees 20% or 30% of the CPU, the rest can be stolen by other tenants. I did my tests with Lightsail, but don't want to use it in production.
@hubertnguyen There is an article and a video published on AWS Online Tech Talks. There is a plugin that offloads media to S3 including the uploads folder. Other than that, there is no file sync between the instances.
I am reading the AWS PDF whitepaper and referencing other architectures; WordPress HA is such a challenge. Do we just mount wp-content on EFS and put the rest of the files on individual EC2 servers?
Got it. If you have the link, could you share it? If not, I'll do a search later. Not having the code in sync is kind of a big problem, unless your deployment method is to have a master which is then cloned, and you rebuild the cluster every time.
Even then, the DB and code may be out of sync and cause problems, so you'd have to clone the DB too.
With EFS or some kind of file sync, your WP runs like one system. You can mount only /uploads/ on EFS if you want, but you still need to deal with the code sync. If you mount EFS on the web root, then EFS does the code sync for you (no extra sync needed).
EFS is exposed as an NFS filesystem, so you can find more literature on WP+NFS.