[leo_storage] Unresponsive when Disk IO takes a long time
Description
`leo_storage` gets overloaded and becomes unresponsive under continuous workload. As a result, requests from `leo_gateway` are not handled and client requests are blocked.
Details
Test Setup
- LeoFS Version: 1.3.1-dev
- OS: Ubuntu 14.04
- 5x Storage, SSD-backed
- 1x Gateway, No Cache
Test Methodology
Continuously write images into LeoFS at a rate of 300MB/s.
Test Report
Observation
- After 1.5 minutes, throughput dropped to <50MB/s
- High IOWait at node08
Logs
Info Log at node08, slow operations
[I] [email protected] 2016-10-24 11:25:17.121436 +0900 1477275917 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{mehtod,put},{key,<<"test/3342136">>},{processing_time,1050}]
[I] [email protected] 2016-10-24 11:25:20.141601 +0900 1477275920 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{mehtod,put},{key,<<"test/2904877">>},{processing_time,1260}]
[I] [email protected] 2016-10-24 11:25:21.605988 +0900 1477275921 leo_object_storage_event:handle_event/2 54 [{cause,"slow operation"},{mehtod,put},{key,<<"test/1404813">>},{processing_time,1460}]
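For context, these entries come from `leo_object_storage_event:handle_event/2`. A minimal sketch of how such a gen_event handler could emit them is shown below; the module name, event shape, and the 1000 ms threshold are assumptions for illustration, not the actual LeoFS code:

```erlang
%% Minimal sketch of a slow-operation gen_event handler (NOT the actual
%% leo_object_storage_event module); event shape and threshold are assumed.
-module(slow_op_event_sketch).
-behaviour(gen_event).
-export([init/1, handle_event/2, handle_call/2, handle_info/2,
         terminate/2, code_change/3]).

-define(SLOW_OP_THRESHOLD_MS, 1000). %% assumed threshold

init(_Args) ->
    {ok, []}.

%% Log only when the processing time exceeds the threshold
handle_event({Method, Key, ProcessingTime}, State)
  when ProcessingTime > ?SLOW_OP_THRESHOLD_MS ->
    error_logger:info_msg("~p~n", [[{cause, "slow operation"},
                                    {method, Method},
                                    {key, Key},
                                    {processing_time, ProcessingTime}]]),
    {ok, State};
handle_event(_Event, State) ->
    {ok, State}.

handle_call(_Request, State) -> {ok, ok, State}.
handle_info(_Info, State)    -> {ok, State}.
terminate(_Reason, _State)   -> ok.
code_change(_OldVsn, State, _Extra) -> {ok, State}.
```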
Further Analysis
Object Timestamp
(Statistics gathered with `leofs-adm whereis` over the entire key domain)
| Time | Number of Entries |
|---|---|
| 11:25:16 | 3957 |
| 11:25:17 | 18120 |
| 11:25:18 | 12144 |
| 11:25:19 | 0 |
| 11:25:20 | 0 |
| 11:25:21 | 2916 |
| 11:25:22 | 25464 |
Normally, there should be around 24,000 entries per second (8,000 ops * 3 replicas). The number of processed requests dropped around the same time as the slow operations.
Further Analysis (Test with Processing Time Logging)
Preparation
Processing time is logged at `leo_gateway` and `leo_storage` to check the request flow (a minimal measurement sketch follows this list):
- `leo_gateway`
  - Duration spent on the put object request to `leo_storage` at `leo_gateway_rpc_handler:put`
- `leo_storage`
  - Duration spent on processing the request from `leo_gateway` and on replication from `leo_storage` (self/remote)
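A minimal sketch of how the gateway-side measurement can be taken, assuming the RPC call is simply wrapped with `timer:tc/3`; the function name and log format are illustrative, not the actual patch:

```erlang
%% Sketch only: wrap a put call with timer:tc/3 and log its duration.
%% Mod/Key/Object and the log format are placeholders, not the real patch.
put_with_latency(Mod, Key, Object) ->
    {ElapsedUs, Ret} = timer:tc(Mod, put, [Key, Object]),
    error_logger:info_msg("put ~p took ~p ms~n", [Key, ElapsedUs div 1000]),
    Ret.
```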
Latency Log
storage_info_trimmed.txt
gateway_info_trimmed.txt
Monitor
Entop
(While entop refreshes at a 2-second interval, the sample at 17:09:59 is also delayed)
Summary
- Write operations were delayed
- Replication requests were queued
- New requests were delayed
- Long time (3s) spent in the GW process
- Very short time (1ms) spent in the Storage process
- RPC seemed to be stuck
- entop showed a similar delay (10ms -> 3s)
Investigation
To find out the root cause of the problem
Methodology
The process of storing an object in `leo_object_storage` involves two types of disk IO (see the sketch after this list):
- Metadata related (`leo_backend_db:put/get`)
- Object content (`file:pwrite` in `leo_object_storage_haystack:put`)
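A simplified sketch of that write path with the two IO calls side by side; the function shape and the `backend_db_put` helper are placeholders for illustration, not the actual LeoFS code:

```erlang
%% Simplified sketch of storing one object (NOT the actual LeoFS code).
put_object(IoDev, Offset, Key, Meta, Body) ->
    %% 1) Object content: leo_object_storage_haystack:put ends in file:pwrite
    ok = file:pwrite(IoDev, Offset, Body),
    %% 2) Metadata: stored via the leo_backend_db put path
    %%    (backend_db_put/2 is a placeholder for that call)
    ok = backend_db_put(Key, term_to_binary(Meta)).

%% Placeholder standing in for leo_backend_db:put
backend_db_put(_Key, _MetaBin) ->
    ok.
```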
We tested the behavior with each type of IO commented out.
Result
- Normal
- No Disk IO (neither metadata nor content): flat at 750 MB/s
- No Content Write: flat at 750 MB/s
- No Metadata: flat at 750 MB/s (one big drop)
- No Metadata + No Seek: flat at 750 MB/s (small drops)
Investigation
Check if the backend db is a bottleneck
Methodology
Relocate the backend DB to memory
$ mkdir /dev/shm/metadata
$ ln -s /dev/shm/metadata /ssd/avs/metadata
Result
Stable bandwidth at 800MB/s at first, then around 600MB/s.
In the later part, the client received "server overloaded" messages and therefore throttled itself:
...
Server Overloaded, Sleep 500 ms and Retry
...
After loading the 4-million-image dataset, each storage node holds 2.5 million objects and the metadata disk usage is around 200MB (roughly 80 bytes per object).
$ du -sh /dev/shm/metadata
200M /dev/shm/metadata
@windkit In my experience, leveldb's compaction is likely to cause this kind of stop-the-world (STW) stall. You can check leveldb's compaction status using https://github.com/basho/eleveldb/blob/2.0.33/src/eleveldb.erl#L258: pass a DB handle as the 1st argument and "leveldb.stats" as the 2nd, according to https://github.com/basho/leveldb/blob/2.0.33/db/db_impl.cc#L2113-L2135.
https://github.com/leo-project/leo_backend_db/blob/develop/src/leo_backend_db_server.erl#L306-L309
Currently, leo_backend_db supports status only for ets, so also supporting leveldb through the stats above would be the first step to dig further into this.
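For reference, a sketch of that check from an Erlang shell; the DB path and open options below are placeholders (inside LeoFS the already-open backend DB handle would be used instead):

```erlang
%% Sketch: ask leveldb for its internal stats (includes compaction activity).
{ok, DB} = eleveldb:open("/ssd/avs/metadata", [{create_if_missing, true}]),
{ok, Stats} = eleveldb:status(DB, <<"leveldb.stats">>),
io:format("~s~n", [Stats]).
```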
@mocchira Thanks for the comment, I will test it today
LeoFS's LevelDB Settings seem to be too conservative
Description
To begin with, I checked the `eleveldb` and `leveldb` projects for clues.
At basho/leveldb, the team suggests its own settings, as they have modified the compaction process compared to the upstream.
Therefore, I decided to compare the results of different parameters.
Methodology
`leo_backend_db_eleveldb` sets several parameters at `eleveldb:open` (https://github.com/leo-project/leo_backend_db/blob/develop/src/leo_backend_db_eleveldb.erl#L58):
- Write Buffer Size: 256KB ~ 3MB [random(?!)]
- Max Open Files: 32
- Cache Size: 64MB (such an option was not found)
- Block Size: 10KB (deprecated; sst_block_size should be used instead)

Defaults for basho/leveldb (https://github.com/basho/leveldb/blob/develop/util/options.cc#L28; see the sketch after this list):
- Write Buffer Size: 60MB
- Max Open Files: 1000
- SST Block Size: 4KB
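A hedged sketch of what opening the metadata DB with values close to the basho defaults could look like; the path is a placeholder and this is not the current `leo_backend_db_eleveldb` code:

```erlang
%% Sketch only: eleveldb open options aligned with the basho/leveldb
%% defaults listed above (sizes in bytes).
Options = [{create_if_missing, true},
           {write_buffer_size, 62914560}, %% 60MB
           {max_open_files,    1000},
           {sst_block_size,    4096}],    %% 4KB
{ok, DB} = eleveldb:open("/ssd/avs/metadata", Options).
```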
Testing
Image set (4,000,000 files, ~32KB average size)
Results
- `leo_backend_db_eleveldb` defaults
- `eleveldb` (`leveldb`) defaults
@windkit Thanks for your report. I've recognized that we need to make LevelDB's configuration configurable because, in the current implementation, those settings are fixed (hard-coded). I'd like to ask you to implement that.
**Max Open Files: 1000, Write Buffer Size: 60MB**
**Max Open Files: 10000, Write Buffer Size: 600MB**
The results are similar despite the 10x difference in parameters; there could be other bottlenecks, or we need to dig into more details.
Monitoring how GC in leveldb actually works using eleveldb:status may help to find out where the bottleneck is, and tuning OS-level I/O settings as described at http://docs.basho.com/riak/kv/2.2.0/setup/planning/backend/leveldb/#configuring-eleveldb may give a further performance improvement if you need more performance. (IMHO it has already reached the minimum requirement.)
Setting `{fadvise_willneed, true}` actually helps to overcome the IO spikes by using the fadvise hint. Yet the throttling still occurs from time to time, and the client is observed to get error 503.
In `leo_storage` there are many queued put requests.
I will follow up on the issue after the 1.3.1 release.
PR for fadvise_willneed in eleveldb https://github.com/leo-project/leo_backend_db/pull/16
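For completeness, a sketch of the option in context; the path is a placeholder and all other open options are omitted for brevity:

```erlang
%% Sketch: the fadvise hint enabled at open time (see the PR above).
{ok, DB} = eleveldb:open("/ssd/avs/metadata",
                         [{create_if_missing, true},
                          {fadvise_willneed,  true}]).
```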