
[leo_storage] Unresponsive when Disk IO takes a long time

windkit opened this issue • 13 comments

Description

leo_storage becomes overloaded and unresponsive under a continuous workload. As a result, requests from leo_gateway are not handled and client requests are blocked.

Details

Test Setup

  • LeoFS Version: 1.3.1-dev
  • OS: Ubuntu 14.04
  • 5x Storage, SSD-backed
  • 1x Gateway, No Cache

Test Methodology

Continuously write images into LeoFS at a rate of 300 MB/s (see Test Report).

Observation


  • After 1.5 minutes, throughput dropped to <50MB/s
  • High IOWait at node08

Logs

Info log at node08 showing slow operations:

[I]     [email protected]       2016-10-24 11:25:17.121436 +0900        1477275917      leo_object_storage_event:handle_event/2 54      [{cause,"slow operation"},{mehtod,put},{key,<<"test/3342136">>},{processing_time,1050}]
[I]     [email protected]       2016-10-24 11:25:20.141601 +0900        1477275920      leo_object_storage_event:handle_event/2 54      [{cause,"slow operation"},{mehtod,put},{key,<<"test/2904877">>},{processing_time,1260}]
[I]     [email protected]       2016-10-24 11:25:21.605988 +0900        1477275921      leo_object_storage_event:handle_event/2 54      [{cause,"slow operation"},{mehtod,put},{key,<<"test/1404813">>},{processing_time,1460}]
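
For reference, a simplified sketch of how this kind of "slow operation" event could be emitted. This is an illustration only, not the actual leo_object_storage_event code; the roughly 1,000 ms threshold is inferred from the processing_time values above and is an assumption.

%% slow_op_sketch.erl -- hedged illustration; names and threshold are assumptions.
-module(slow_op_sketch).
-export([maybe_log_slow_op/3]).

-define(SLOW_OP_THRESHOLD_MS, 1000).

%% Log an info message when a storage operation exceeds the threshold.
maybe_log_slow_op(Method, Key, ProcessingTimeMs)
  when ProcessingTimeMs > ?SLOW_OP_THRESHOLD_MS ->
    error_logger:info_msg("~p~n", [[{cause, "slow operation"},
                                    {method, Method},
                                    {key, Key},
                                    {processing_time, ProcessingTimeMs}]]);
maybe_log_slow_op(_Method, _Key, _ProcessingTimeMs) ->
    ok.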

windkit · Nov 02 '16 05:11

Further Analysis

Object Timestamp

(Statistics gathered with leofs-adm whereis over the entire key domain)

Time      Number of Entries
11:25:16  3957
11:25:17  18120
11:25:18  12144
11:25:19  0
11:25:20  0
11:25:21  2916
11:25:22  25464
Normally there should be around 24,000 entries per second (8,000 ops * 3 replicas). The number of requests processed dropped at about the same time as the slow operations.

windkit · Nov 02 '16 05:11

Further Analysis (Test with Processing Time Logging)

Preparation

Processing time is logged at both leo_gateway and leo_storage to trace the request flow (a rough sketch of the instrumentation follows the list):

  1. leo_gateway - time spent on the put request to leo_storage, measured at leo_gateway_rpc_handler:put
  2. leo_storage - time spent processing the request from leo_gateway and the replication between leo_storage nodes (self/remote)
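
A minimal sketch of this kind of instrumentation, assuming a hypothetical wrapper around the RPC call; the actual logging hooks added to leo_gateway/leo_storage may look different.

%% timing_sketch.erl -- hedged illustration of measuring processing time.
-module(timing_sketch).
-export([timed_put/1]).

%% timer:tc/1 returns {ElapsedMicroseconds, Result} for the given fun.
timed_put(Fun) ->
    {ElapsedUs, Result} = timer:tc(Fun),
    error_logger:info_msg("put processing_time: ~p ms~n", [ElapsedUs div 1000]),
    Result.

%% Usage (the arity and arguments of leo_gateway_rpc_handler:put here are hypothetical):
%%   timing_sketch:timed_put(fun() ->
%%       leo_gateway_rpc_handler:put(Key, Bin, byte_size(Bin))
%%   end).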

Latency Log

Attachments: storage_info_trimmed.txt, gateway_info_trimmed.txt

Monitor

(monitoring graph)

Entop

(While entop refreshes at a 2-second interval, the sample at 17:09:59 is also delayed.)

Summary

  • Write operations got delayed
  • Replication requests were queued
  • New requests were delayed
    • Long time (3s) in GW process
    • Very short time (1ms) in Storage process
  • RPC seemed to be stuck
    • entop showed similar delay (10ms -> 3s)

windkit · Nov 02 '16 06:11

Investigation

The goal is to find the root cause of the problem.

Methodology

The process of storing an object in leo_object_storage involves two types of disk IO:

  1. Metadata Related (leo_backend_db:put/get)
  2. Object Content (file:pwrite in leo_object_storage_haystack:put)

We test the behavior when each type of IO is commented out; a rough sketch of this toggle follows.
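
For illustration, the toggle could look roughly like the following. This is a sketch only; function names, arity, and option handling are assumptions, not the actual leo_object_storage_haystack code.

%% io_toggle_sketch.erl -- hedged illustration of skipping each IO point.
-module(io_toggle_sketch).
-export([store_object/4]).

%% Opts may contain skip_metadata and/or skip_content to disable one IO path.
store_object(WriteHandler, Offset, Bin, Opts) ->
    %% 1. Metadata-related IO (normally a leo_backend_db put/get)
    case lists:member(skip_metadata, Opts) of
        true  -> ok;
        false -> metadata_write_stub()
    end,
    %% 2. Object content IO (normally file:pwrite in leo_object_storage_haystack:put)
    case lists:member(skip_content, Opts) of
        true  -> ok;
        false -> ok = file:pwrite(WriteHandler, Offset, Bin)
    end,
    ok.

%% Placeholder standing in for the real backend-db write.
metadata_write_stub() ->
    ok.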

Result

Normal


No Disk IO (Neither Metadata nor Content) - Flat at 750 MB/s


No Content Write - Flat at 750 MB/s


No Metadata - Flat at 750 MB/s (One Big Drop)


No Metadata + No Seek - Flat at 750 MB/s (Small Drops)


windkit · Nov 04 '16 07:11

Investigation

Check if the backend db is a bottleneck

Methodology

Relocate the backend DB to memory

$ mkdir /dev/shm/metadata
$ ln -s /dev/shm/metadata /ssd/avs/metadata

Result

Stable bandwidth at 800 MB/s, later dropping to around 600 MB/s.

In the later part, the client received "server overloaded" messages and therefore throttled itself:

...
Server Overloaded, Sleep 500 ms and Retry
...

windkit · Nov 11 '16 06:11

After loading the 4-million-image dataset, each storage node holds about 2.5 million objects, and the metadata disk usage is around 200 MB (roughly 80 bytes of metadata per object).

$ du -sh /dev/shm/metadata
200M    /dev/shm/metadata

windkit · Nov 11 '16 06:11

@windkit In my experience, leveldb's compaction is likely to cause this kind of stop-the-world (STW) stall. You can check leveldb's compaction status using https://github.com/basho/eleveldb/blob/2.0.33/src/eleveldb.erl#L258: pass a db handle as the 1st argument and "leveldb.stats" as the 2nd, according to https://github.com/basho/leveldb/blob/2.0.33/db/db_impl.cc#L2113-L2135.

https://github.com/leo-project/leo_backend_db/blob/develop/src/leo_backend_db_server.erl#L306-L309 Currently leo_backend_db supports status only for ets, so supporting leveldb as well via the stats above would be the first step toward digging further into this.
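
A minimal sketch of such a check. It assumes an Erlang shell on the storage node with the db handle Ref already in scope (LevelDB allows only one process to hold the lock, so the handle must come from the running leo_storage VM).

%% status/2 is expected to return {ok, StatsBinary} for a known property key.
{ok, Stats} = eleveldb:status(Ref, <<"leveldb.stats">>).
io:format("~s~n", [Stats]).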

mocchira · Dec 08 '16 04:12

@mocchira Thanks for the comment, I will test it today

windkit · Dec 09 '16 00:12

LeoFS's LevelDB Settings seem to be too conservative

Description

To begin with, I checked the eleveldb and leveldb projects for clues.

At basho/leveldb, the team suggests its own settings, as they have modified the compaction process relative to upstream.

Therefore I decided to compare the results with different parameters.

Methodology

leo_backend_db_eleveldb sets several parameters at eleveldb:open (https://github.com/leo-project/leo_backend_db/blob/develop/src/leo_backend_db_eleveldb.erl#L58)

  1. Write Buffer Size: 256 KB ~ 3 MB (chosen at random(?!))
  2. Max Open Files: 32
  3. Cache Size: 64 MB (no such option found)
  4. Block Size: 10 KB (deprecated; sst_block_size should be used instead)

Defaults for basho/leveldb (https://github.com/basho/leveldb/blob/develop/util/options.cc#L28). A sketch of passing these values follows the list below.

  1. Write Buffer Size: 60MB
  2. Max Open Files: 1000
  3. SST Block Size: 4KB
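
A hedged sketch of opening a metadata DB with the basho defaults listed above. The DB path is hypothetical; the option names are the ones accepted by eleveldb:open/2.

%% Hedged sketch -- the DB path below is hypothetical.
Options = [{create_if_missing, true},
           {write_buffer_size, 60 * 1024 * 1024},  %% 60 MB
           {max_open_files,    1000},
           {sst_block_size,    4096}],             %% 4 KB
{ok, Ref} = eleveldb:open("/ssd/avs/metadata/0", Options).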

Testing

Image set (4,000,000 files, ~32 KB average size)

Results

leo_backend_db_eleveldb defaults

(Grafana throughput graph)

eleveldb (leveldb) defaults

(Grafana throughput graph)

windkit · Dec 16 '16 07:12

@windkit Thanks for your report. I've recognized that we need to make LevelDB's configuration configurable, because in the current implementation those settings are fixed (hard-coded). I'd like to ask you to implement that.

yosukehara · Dec 19 '16 01:12

Max Open Files: 1000, Write Buffer Size: 60 MB

Max Open Files: 10000, Write Buffer Size: 600 MB

The results are similar despite the 10x difference in parameters; there could be other bottlenecks, or we need to dig into more detail.

windkit · Dec 27 '16 06:12

Monitoring how GC (compaction) in leveldb actually works using eleveldb:status may help find out where the bottleneck is. Also, tuning I/O-related OS settings as described at http://docs.basho.com/riak/kv/2.2.0/setup/planning/backend/leveldb/#configuring-eleveldb may give a further performance improvement if you want more performance (IMHO the minimum requirement has already been reached).

mocchira · Dec 27 '16 07:12

Setting {fadvise_willneed, true} actually helps to overcome the IO spikes by using the fadvise hint. Yet the throttling still occurs from time to time: the client gets error 503, and in leo_storage there are many queued put requests.
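
For reference, a sketch of enabling the hint as an extra open option. The DB path is hypothetical; fadvise_willneed is the eleveldb option discussed above.

{ok, Ref} = eleveldb:open("/ssd/avs/metadata/0",
                          [{create_if_missing, true},
                           {fadvise_willneed,  true}]).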

I will follow up on this issue after the 1.3.1 release.

windkit · Dec 27 '16 07:12

PR for fadvise_willneed in eleveldb https://github.com/leo-project/leo_backend_db/pull/16

windkit · Dec 06 '17 00:12