
[leo_storage] Unresponsive when Disk IO takes a long time

windkit opened this issue • 13 comments

Description

leo_storage becomes overloaded and unresponsive under a continuous workload. As a result, requests from leo_gateway are not handled and client requests are blocked.

Details

Test Setup

  • LeoFS Version: 1.3.1-dev
  • OS: Ubuntu 14.04
  • 5x Storage, SSD-backed
  • 1x Gateway, No Cache

Test Methodology

Continuously write images into LeoFS at a rate of 300 MB/s (see Test Report).

Observation


  • After 1.5 minutes, throughput dropped to <50MB/s
  • High IOWait at node08

Logs

Info log at node08 showing slow operations:

[I]     [email protected]       2016-10-24 11:25:17.121436 +0900        1477275917      leo_object_storage_event:handle_event/2 54      [{cause,"slow operation"},{mehtod,put},{key,<<"test/3342136">>},{processing_time,1050}]
[I]     [email protected]       2016-10-24 11:25:20.141601 +0900        1477275920      leo_object_storage_event:handle_event/2 54      [{cause,"slow operation"},{mehtod,put},{key,<<"test/2904877">>},{processing_time,1260}]
[I]     [email protected]       2016-10-24 11:25:21.605988 +0900        1477275921      leo_object_storage_event:handle_event/2 54      [{cause,"slow operation"},{mehtod,put},{key,<<"test/1404813">>},{processing_time,1460}]
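
For reference, a simplified sketch of how this kind of "slow operation" event could be emitted. This is an illustration only, not the actual leo_object_storage_event code; the roughly 1,000 ms threshold is inferred from the processing_time values above and is an assumption.

%% slow_op_sketch.erl -- hedged illustration; names and threshold are assumptions.
-module(slow_op_sketch).
-export([maybe_log_slow_op/3]).

-define(SLOW_OP_THRESHOLD_MS, 1000).

%% Log an info message when a storage operation exceeds the threshold.
maybe_log_slow_op(Method, Key, ProcessingTimeMs)
  when ProcessingTimeMs > ?SLOW_OP_THRESHOLD_MS ->
    error_logger:info_msg("~p~n", [[{cause, "slow operation"},
                                    {method, Method},
                                    {key, Key},
                                    {processing_time, ProcessingTimeMs}]]);
maybe_log_slow_op(_Method, _Key, _ProcessingTimeMs) ->
    ok.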

windkit · Nov 02 '16 05:11

Further Analysis

Object Timestamp

(Statistics gathered with leofs-adm whereis over the entire key domain)

Time      Number of Entries
11:25:16  3957
11:25:17  18120
11:25:18  12144
11:25:19  0
11:25:20  0
11:25:21  2916
11:25:22  25464
Normally there should be around 24,000 entries per second (8,000 ops * 3 replicas). The number of requests processed dropped at about the same time as the slow operations.

windkit · Nov 02 '16 05:11

Further Analysis (Test with Processing Time Logging)

Preparation

Processing time is logged at both leo_gateway and leo_storage to trace the request flow (a rough sketch of the instrumentation follows the list):

  1. leo_gateway - time spent on the put request to leo_storage, measured at leo_gateway_rpc_handler:put
  2. leo_storage - time spent processing the request from leo_gateway and the replication between leo_storage nodes (self/remote)
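
A minimal sketch of this kind of instrumentation, assuming a hypothetical wrapper around the RPC call; the actual logging hooks added to leo_gateway/leo_storage may look different.

%% timing_sketch.erl -- hedged illustration of measuring processing time.
-module(timing_sketch).
-export([timed_put/1]).

%% timer:tc/1 returns {ElapsedMicroseconds, Result} for the given fun.
timed_put(Fun) ->
    {ElapsedUs, Result} = timer:tc(Fun),
    error_logger:info_msg("put processing_time: ~p ms~n", [ElapsedUs div 1000]),
    Result.

%% Usage (the arity and arguments of leo_gateway_rpc_handler:put here are hypothetical):
%%   timing_sketch:timed_put(fun() ->
%%       leo_gateway_rpc_handler:put(Key, Bin, byte_size(Bin))
%%   end).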

Latency Log

Attachments: storage_info_trimmed.txt, gateway_info_trimmed.txt

Monitor

(monitoring graph)

Entop

(While entop refreshes at a 2-second interval, the sample at 17:09:59 is also delayed.)

Summary

  • Write operations got delayed
  • Replication requests were queued
  • New requests were delayed
    • Long time (3s) in GW process
    • Very short time (1ms) in Storage process
  • RPC seemed to be stuck
    • entop showed similar delay (10ms -> 3s)

windkit · Nov 02 '16 06:11

Investigation

The goal is to find the root cause of the problem.

Methodology

The process of storing an object in leo_object_storage involves two types of disk IO:

  1. Metadata Related (leo_backend_db:put/get)
  2. Object Content (file:pwrite in leo_object_storage_haystack:put)

We test the behavior when each type of IO is commented out; a rough sketch of this toggle follows.
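
For illustration, the toggle could look roughly like the following. This is a sketch only; function names, arity, and option handling are assumptions, not the actual leo_object_storage_haystack code.

%% io_toggle_sketch.erl -- hedged illustration of skipping each IO point.
-module(io_toggle_sketch).
-export([store_object/4]).

%% Opts may contain skip_metadata and/or skip_content to disable one IO path.
store_object(WriteHandler, Offset, Bin, Opts) ->
    %% 1. Metadata-related IO (normally a leo_backend_db put/get)
    case lists:member(skip_metadata, Opts) of
        true  -> ok;
        false -> metadata_write_stub()
    end,
    %% 2. Object content IO (normally file:pwrite in leo_object_storage_haystack:put)
    case lists:member(skip_content, Opts) of
        true  -> ok;
        false -> ok = file:pwrite(WriteHandler, Offset, Bin)
    end,
    ok.

%% Placeholder standing in for the real backend-db write.
metadata_write_stub() ->
    ok.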

Result

Normal


No Disk IO (Neither Metadata nor Content) - Flat at 750 MB/s


No Content Write - Flat at 750 MB/s


No Metadata - Flat at 750 MB/s (One Big Drop)


No Metadata + No Seek - Flat at 750 MB/s (Small Drops)


windkit · Nov 04 '16 07:11

Investigation

Check if the backend db is a bottleneck

Methodology

Relocate the backend DB to memory

$ mkdir /dev/shm/metadata
$ ln -s /dev/shm/metadata /ssd/avs/metadata

Result

Stable bandwidth at 800 MB/s, later dropping to around 600 MB/s.

In the later part, the client received "server overloaded" messages and therefore throttled itself:

...
Server Overloaded, Sleep 500 ms and Retry
...

windkit · Nov 11 '16 06:11

After loading the 4-million-image dataset, each storage node holds about 2.5 million objects, and the metadata disk usage is around 200 MB (roughly 80 bytes of metadata per object).

$ du -sh /dev/shm/metadata
200M    /dev/shm/metadata

windkit · Nov 11 '16 06:11

@windkit In my experience, leveldb's compaction is likely to cause this kind of stop-the-world (STW) stall. You can check leveldb's compaction status using https://github.com/basho/eleveldb/blob/2.0.33/src/eleveldb.erl#L258: pass a db handle as the 1st argument and "leveldb.stats" as the 2nd, according to https://github.com/basho/leveldb/blob/2.0.33/db/db_impl.cc#L2113-L2135.

https://github.com/leo-project/leo_backend_db/blob/develop/src/leo_backend_db_server.erl#L306-L309 Currently leo_backend_db supports status only for ets, so supporting leveldb as well via the stats above would be the first step toward digging further into this.
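
A minimal sketch of such a check. It assumes an Erlang shell on the storage node with the db handle Ref already in scope (LevelDB allows only one process to hold the lock, so the handle must come from the running leo_storage VM).

%% status/2 is expected to return {ok, StatsBinary} for a known property key.
{ok, Stats} = eleveldb:status(Ref, <<"leveldb.stats">>).
io:format("~s~n", [Stats]).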

mocchira · Dec 08 '16 04:12

@mocchira Thanks for the comment, I will test it today

windkit · Dec 09 '16 00:12

LeoFS's LevelDB Settings seem to be too conservative

Description

To begin with, I checked the eleveldb and leveldb projects for clues.

At basho/leveldb, the team suggests its own settings, as they have modified the compaction process relative to upstream.

Therefore I decided to compare the results with different parameters.

Methodology

leo_backend_db_eleveldb sets several parameters at eleveldb:open (https://github.com/leo-project/leo_backend_db/blob/develop/src/leo_backend_db_eleveldb.erl#L58)

  1. Write Buffer Size: 256 KB ~ 3 MB (chosen at random(?!))
  2. Max Open Files: 32
  3. Cache Size: 64 MB (no such option found)
  4. Block Size: 10 KB (deprecated; sst_block_size should be used instead)

Defaults for basho/leveldb (https://github.com/basho/leveldb/blob/develop/util/options.cc#L28). A sketch of passing these values follows the list below.

  1. Write Buffer Size: 60MB
  2. Max Open Files: 1000
  3. SST Block Size: 4KB
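
A hedged sketch of opening a metadata DB with the basho defaults listed above. The DB path is hypothetical; the option names are the ones accepted by eleveldb:open/2.

%% Hedged sketch -- the DB path below is hypothetical.
Options = [{create_if_missing, true},
           {write_buffer_size, 60 * 1024 * 1024},  %% 60 MB
           {max_open_files,    1000},
           {sst_block_size,    4096}],             %% 4 KB
{ok, Ref} = eleveldb:open("/ssd/avs/metadata/0", Options).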

Testing

Image set (4,000,000 files, ~32 KB average size)

Results

leo_backend_db_eleveldb defaults

(Grafana throughput graph)

eleveldb (leveldb) defaults

(Grafana throughput graph)

windkit · Dec 16 '16 07:12

@windkit Thanks for your report. I've recognized that we need to make LevelDB's configuration configurable, because in the current implementation those settings are fixed (hard-coded). I'd like to ask you to implement that.

yosukehara · Dec 19 '16 01:12

Max Open Files: 1000, Write Buffer Size: 60 MB

Max Open Files: 10000, Write Buffer Size: 600 MB

The results are similar despite the 10x difference in parameters; there could be other bottlenecks, or we need to dig into more detail.

windkit · Dec 27 '16 06:12

Monitoring how GC (compaction) in leveldb actually works using eleveldb:status may help find out where the bottleneck is. Also, tuning I/O-related OS settings as described at http://docs.basho.com/riak/kv/2.2.0/setup/planning/backend/leveldb/#configuring-eleveldb may give a further performance improvement if you want more performance (IMHO the minimum requirement has already been reached).

mocchira · Dec 27 '16 07:12

Setting {fadvise_willneed, true} actually helps to overcome the IO spikes by using the fadvise hint. Yet the throttling still occurs from time to time: the client gets error 503, and in leo_storage there are many queued put requests.
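
For reference, a sketch of enabling the hint as an extra open option. The DB path is hypothetical; fadvise_willneed is the eleveldb option discussed above.

{ok, Ref} = eleveldb:open("/ssd/avs/metadata/0",
                          [{create_if_missing, true},
                           {fadvise_willneed,  true}]).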

I will follow up on this issue after the 1.3.1 release.

windkit · Dec 27 '16 07:12

PR for fadvise_willneed in eleveldb https://github.com/leo-project/leo_backend_db/pull/16

windkit · Dec 06 '17 00:12