ARDB crashes, no error message in log...

Open rjonesxmoz opened this issue 6 years ago • 22 comments

Seems like it might be related to using the redis pipeline feature, but I can't figure out where to look to determine why it crashed.
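(For context, the lookup pattern is roughly the following; a minimal sketch using redis-py, with placeholder key and field names:)

import redis

# ardb speaks the Redis protocol, so a stock Redis client works.
r = redis.Redis(host="localhost", port=6379)

# Batch many hGet lookups into a single round trip with a pipeline.
pipe = r.pipeline(transaction=False)
for key in ("hash:1", "hash:2", "hash:3"):  # placeholder keys
    pipe.hget(key, "field1")                # placeholder field
values = pipe.execute()                     # one result per queued hget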

rjonesxmoz avatar Aug 17 '18 11:08 rjonesxmoz

Is there a specific error log where I should be able to see why it went down? It just goes down w/o any notice. It is now going down after every 300K requests (hGet) or so.

Drive space is good:

300G  106G  195G  36%  /data

Server stats look fine, with ARDB at about 50% CPU and 30% RAM:

20   0  3945348 3.451g   2360 S  47.9 29.4   1:07.12 ardb-server

rjonesxmoz avatar Aug 17 '18 13:08 rjonesxmoz

anything in syslog?

yinqiwen avatar Aug 17 '18 13:08 yinqiwen

Aug 17 12:40:24 dalwalrusrocksdb01 kernel: [762268.331828] ardb-server invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
Aug 17 12:40:24 dalwalrusrocksdb01 kernel: [762268.331834] ardb-server cpuset=/ mems_allowed=0
Aug 17 12:40:24 dalwalrusrocksdb01 kernel: [762268.331837] CPU: 0 PID: 6832 Comm: ardb-server Not tainted 3.13.0-98-generic #145-Ubuntu
Aug 17 12:40:24 dalwalrusrocksdb01 kernel: [762268.332126] [ 6808]     0  6808  4228321  2941740    8120  1001920     0 ardb-server
Aug 17 12:40:24 dalwalrusrocksdb01 kernel: [762268.332130] Out of memory: Kill process 6808 (ardb-server) score 929 or sacrifice child
Aug 17 12:40:24 dalwalrusrocksdb01 kernel: [762268.333614] Killed process 6808 (ardb-server) total-vm:16913284kB, anon-rss:11766960kB, file-rss:0kB
Aug 17 13:06:53 dalwalrusrocksdb01 kernel: [763857.356090] [ 8980]     0  8980  4349153  2979894    8357   996540     0 ardb-server
Aug 17 13:06:53 dalwalrusrocksdb01 kernel: [763857.356105] Out of memory: Kill process 8980 (ardb-server) score 937 or sacrifice child
Aug 17 13:06:53 dalwalrusrocksdb01 kernel: [763857.358158] Killed process 8980 (ardb-server) total-vm:17396612kB, anon-rss:11919496kB, file-rss:80kB

rjonesxmoz avatar Aug 17 '18 13:08 rjonesxmoz

those are the syslog entries that mention "ardb"

rjonesxmoz avatar Aug 17 '18 13:08 rjonesxmoz

is there a max memory option or something that I need to set, or do I need to give the machine more RAM?

rjonesxmoz avatar Aug 17 '18 13:08 rjonesxmoz

I've got 12GB of RAM on this machine, so I'm really surprised it would run out.

rjonesxmoz avatar Aug 17 '18 14:08 rjonesxmoz

I removed the pipeline method and am just looking everything up one at a time while watching top. ARDB rises to 31.9% memory, seems to rest there nicely for a while, and then the OOM killer takes it down.

rjonesxmoz avatar Aug 17 '18 15:08 rjonesxmoz

If I start sending requests the second I start ardb-server, it very rapidly climbs to 100% memory and crashes. If I wait a little bit (maybe 3 minutes), it then takes maybe 15 minutes to crash, although it is intermittent: sometimes only a few minutes, sometimes it lasts 30.

rjonesxmoz avatar Aug 17 '18 15:08 rjonesxmoz

It seems like it has to be some sort of memory leak, since it occurs not during inserts but during hGet requests.

rjonesxmoz avatar Aug 17 '18 15:08 rjonesxmoz

I will happily donate $1k to this project (if that is possible) if we can fix this problem. It is pretty mission critical at this moment.

rjonesxmoz avatar Aug 17 '18 15:08 rjonesxmoz

@yinqiwen I'm not sure if this helps...

[22001] 08-17 16:00:00,453 INFO coststat_hget_all:calls=300745,costs=81819213,cost_per_call=272,percents=100.0000%
[22001] 08-17 16:00:00,453 INFO coststat_hget_range[0-1000]:calls=299616,costs=74902330,cost_per_call=249,percents=99.6246%
[22001] 08-17 16:00:00,453 INFO coststat_hget_range[1000-5000]:calls=865,costs=1608681,cost_per_call=1859,percents=0.2876%
[22001] 08-17 16:00:00,453 INFO coststat_hget_range[5000-10000]:calls=24,costs=164977,cost_per_call=6874,percents=0.0080%
[22001] 08-17 16:00:00,453 INFO coststat_hget_range[10000-20000]:calls=140,costs=2466021,cost_per_call=17614,percents=0.0466%
[22001] 08-17 16:00:00,453 INFO coststat_hget_range[20000-50000]:calls=98,costs=2555608,cost_per_call=26077,percents=0.0326%
[22001] 08-17 16:00:00,453 INFO coststat_hget_range[50000-100000]:calls=2,costs=121596,cost_per_call=60798,percents=0.0007%

rjonesxmoz avatar Aug 17 '18 16:08 rjonesxmoz

@yinqiwen updated to the latest version but there was no improvement. Still had the memory leak.

The DB has A LOT of keys/hashes. Each hash has probably 48 values in it.

Do you think it would work better if I moved away from hashes and just stored JSON blobs of the data under standard string keys?
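(For illustration, the two layouts under discussion would look roughly like this; a sketch using redis-py with hypothetical key names:)

import json
import redis

r = redis.Redis(host="localhost", port=6379)

# Current layout: one hash per record with ~48 fields; each lookup is an hGet.
r.hset("record:42", "field1", "value1")
field = r.hget("record:42", "field1")

# Alternative layout: one JSON blob per record under a plain string key,
# so a single GET returns the whole record.
r.set("record:42:json", json.dumps({"field1": "value1"}))
record = json.loads(r.get("record:42:json"))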

rjonesxmoz avatar Aug 17 '18 17:08 rjonesxmoz

@yinqiwen I tried switching to hGetAll to see if perhaps just getting all the raw data would be better and less overhead, but no luck.

rjonesxmoz avatar Aug 17 '18 20:08 rjonesxmoz

I tried setting redis-compatible-mode to yes, as I read in some other threads that this could cause memory issues. Unfortunately that did not do the trick; if anything it hastened the crash.
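(For reference, the setting as it appears in ardb.conf:)

redis-compatible-mode  yes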

rjonesxmoz avatar Aug 17 '18 20:08 rjonesxmoz

use the info databases command in redis-cli, which reports the memory usage of rocksdb. there are already some reports about rocksdb using too much memory, which can be solved by tweaking the rocksdb configuration.
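(For example:)

127.0.0.1:6379> info databases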

and is the hgetall/hget test with the same key or with random keys? if there were a memory leak in pipeline mode, then hget/hgetall with the same key, or even just ping, would also produce an OOM error.
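(A quick way to separate those two cases, as a sketch with redis-py and placeholder key names:)

import redis

r = redis.Redis(host="localhost", port=6379)

# Same key every time: exercises only the network/reply path.
for _ in range(1000000):
    r.hget("hash:same", "field1")

# Distinct keys: additionally touches rocksdb's block/index caches, so memory
# growth only in this case points at the storage engine rather than ardb itself.
for i in range(1000000):
    r.hget("hash:%d" % i, "field1")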

yinqiwen avatar Aug 18 '18 01:08 yinqiwen

i just tested ping, hget with the same key, and hgetall with the same key via redis-benchmark in pipeline mode with a total of 100000000 requests, and there is no memory leak.
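(For reference, the equivalent redis-benchmark invocations would be something like the following; exact flags can vary by version, and the key/field names are placeholders:)

redis-benchmark -p 6379 -n 100000000 -P 100 -t ping
redis-benchmark -p 6379 -n 100000000 -P 100 hget myhash field1
redis-benchmark -p 6379 -n 100000000 -P 100 hgetall myhash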

is the ardb the latest version 0.10? and what's the configuration of rocksdb?
since hget/hgetall just read & cache data from rocksdb, this issue seems related to the rocksdb cache.

yinqiwen avatar Aug 18 '18 01:08 yinqiwen

and i just added 2 configs to disable the read cache for hget/hgetall in rocksdb, in the master/0.10 branch.

u can add the below configs into ardb.conf.

rocksdb.read_fill_cache   no
rocksdb.iter_fill_cache   no

yinqiwen avatar Aug 18 '18 02:08 yinqiwen

First, thank you so much for your help!

  1. I removed the pipelining and still had the problem.
  2. The keys are random
  3. I am using the latest from just doing "git clone https://github.com/yinqiwen/ardb.git"
  4. Here are the results of the "info databases" command...

data_dir:/data/rocksdb
used_disk_space:107968311063
rocksdb_version:5.14.2
rocksdb.block_table_usage:0
rocksdb.block_table_pinned_usage:0
rocksdb_memtable_total:117442128
rocksdb_memtable_unflushed:117442128
rocksdb_table_readers_total:0
rocksdb.estimate-table-readers-mem:0
rocksdb.cur-size-all-mem-tables:58721064

** Compaction Stats [0] **
Level    Files    Size       Score  Read(GB)  Rn(GB)  Rnp1(GB)  Write(GB)  Wnew(GB)  Moved(GB)  W-Amp  Rd(MB/s)  Wr(MB/s)  Comp(sec)  Comp(cnt)  Avg(sec)  KeyIn  KeyDrop
  L0      2/0    127.32 MB    0.5     0.0      0.0      0.0       0.0        0.0       0.0       0.0     0.0       0.0        0          0        0.000      0      0
  L1      3/0    208.97 MB    0.8     0.0      0.0      0.0       0.0        0.0       0.0       0.0     0.0       0.0        0          0        0.000      0      0
  L2     44/0      2.46 GB    1.0     0.0      0.0      0.0       0.0        0.0       0.0       0.0     0.0       0.0        0          0        0.000      0      0
  L3    388/0     24.98 GB    1.0     0.0      0.0      0.0       0.0        0.0       0.0       0.0     0.0       0.0        0          0        0.000      0      0
  L4   1082/0     72.79 GB    0.3     0.0      0.0      0.0       0.0        0.0       0.0       0.0     0.0       0.0        0          0        0.000      0      0
 Sum   1519/0    100.55 GB    0.0     0.0      0.0      0.0       0.0        0.0       0.0       0.0     0.0       0.0        0          0        0.000      0      0
 Int      0/0      0.00 KB    0.0     0.0      0.0      0.0       0.0        0.0       0.0       0.0     0.0       0.0        0          0        0.000      0      0

Uptime(secs): 65.9 total, 65.9 interval
Flush(GB): cumulative 0.000, interval 0.000
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
Interval compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count

** File Read Latency Histogram By Level [0] **

** DB Stats **
Uptime(secs): 65.9 total, 65.9 interval
Cumulative writes: 0 writes, 0 keys, 0 commit groups, 0.0 writes per commit group, ingest: 0.00 GB, 0.00 MB/s
Cumulative WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 GB, 0.00 MB/s
Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent
Interval writes: 0 writes, 0 keys, 0 commit groups, 0.0 writes per commit group, ingest: 0.00 MB, 0.00 MB/s
Interval WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 MB, 0.00 MB/s
Interval stall: 00:00:0.000 H:M:S, 0.0 percent

  5. I added "rocksdb.read_fill_cache no" and "rocksdb.iter_fill_cache no" to ardb.conf and upgraded to 0.10. It still crashed when I attempted to run my code. It actually crashes faster when I use hGetAll rather than just hGet with the specific field I need, which makes me suspicious that a cache is getting filled somewhere. Just which cache, and how to prevent it, I don't know.

  6. RocksDB has the out-of-the-box configuration. I'm not even sure how/where to make changes to that configuration, but I would be happy to do so if you know how it might help this issue.

  7. I ran a test where I just requested the same key over and over again using hGetAll and it had no problems, so it is definitely filling up a cache or something of that sort as I request more keys. Is there some way to flush the cache periodically?

rjonesxmoz avatar Aug 18 '18 15:08 rjonesxmoz

  1. can u paste the full info all & client list results when the memory usage is high?
  2. to change rocksdb's config, just edit rocksdb.options in ardb.conf
  3. according to this wiki https://github.com/facebook/rocksdb/wiki/Memory-usage-in-RocksDB , we can increase the block size & optimize the bloomfilter settings to reduce memory usage.
  4. this configuration changes the block size from 4k to 32k, uses 8 bits per key for the bloomfilter, and enables the flag optimize_filters_for_hits:
rocksdb.options               write_buffer_size=512M;max_write_buffer_number=5;min_write_buffer_number_to_merge=3;compression=kSnappyCompression;\
                              bloom_locality=1;memtable_prefix_bloom_size_ratio=0.1;\
                              block_based_table_factory={block_cache=512M;filter_policy=bloomfilter:8:true};\
                              create_if_missing=true;max_open_files=10000;rate_limiter_bytes_per_sec=50M;\
                              use_direct_io_for_flush_and_compaction=true;use_adaptive_mutex=true;optimize_filters_for_hits=true
  5. disable the replication log in ardb.conf if the current instance has no slave, which could save at most 1G of memory:
repl-backlog-size           0
  6. adding reply-pool-size 100 into ardb.conf would reduce the reply pool size from 10000 per thread to 100 per thread.
  7. i just added a redis reply count report to the info command in the latest code.

yinqiwen avatar Aug 19 '18 03:08 yinqiwen

Woohoo! Your changes seem to have held memory usage steady at somewhere around 58%! Here is the info all output...

=======================

rjonesxmoz avatar Aug 19 '18 20:08 rjonesxmoz

Server

ardb_version:0.10.0
redis_version:2.8.0
engine:rocksdb
ardb_home:/install
os:Linux 3.13.0-98-generic x86_64
gcc_version:4.8.4
process_id:18207
run_id:2bc2903a5bf449dc3eab36ab947a185db400a24b
tcp_port:6379
listen:0.0.0.0:6379
uptime_in_seconds:24
uptime_in_days:0
executable:/install/ardb/src/ardb-server
config_file:/install/ardb/ardb.conf

Databases

data_dir:/data/rocksdb
used_disk_space:107968311059
living_iterator_num:0
rocksdb_version:5.14.2
rocksdb.block_table_usage:0
rocksdb.block_table_pinned_usage:0
rocksdb_memtable_total:117442128
rocksdb_memtable_unflushed:117442128
rocksdb_table_readers_total:0
rocksdb.estimate-table-readers-mem:0
rocksdb.cur-size-all-mem-tables:58721064

** Compaction Stats [0] **
Level    Files    Size       Score  Read(GB)  Rn(GB)  Rnp1(GB)  Write(GB)  Wnew(GB)  Moved(GB)  W-Amp  Rd(MB/s)  Wr(MB/s)  Comp(sec)  Comp(cnt)  Avg(sec)  KeyIn  KeyDrop
  L0      2/0    127.32 MB    0.5     0.0      0.0      0.0       0.0        0.0       0.0       0.0     0.0       0.0        0          0        0.000      0      0
  L1      3/0    208.97 MB    0.8     0.0      0.0      0.0       0.0        0.0       0.0       0.0     0.0       0.0        0          0        0.000      0      0
  L2     44/0      2.46 GB    1.0     0.0      0.0      0.0       0.0        0.0       0.0       0.0     0.0       0.0        0          0        0.000      0      0
  L3    388/0     24.98 GB    1.0     0.0      0.0      0.0       0.0        0.0       0.0       0.0     0.0       0.0        0          0        0.000      0      0
  L4   1082/0     72.79 GB    0.3     0.0      0.0      0.0       0.0        0.0       0.0       0.0     0.0       0.0        0          0        0.000      0      0
 Sum   1519/0    100.55 GB    0.0     0.0      0.0      0.0       0.0        0.0       0.0       0.0     0.0       0.0        0          0        0.000      0      0
 Int      0/0      0.00 KB    0.0     0.0      0.0      0.0       0.0        0.0       0.0       0.0     0.0       0.0        0          0        0.000      0      0

Uptime(secs): 23.4 total, 23.4 interval
Flush(GB): cumulative 0.000, interval 0.000
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
Interval compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count

** File Read Latency Histogram By Level [0] **

** DB Stats **
Uptime(secs): 23.4 total, 23.4 interval
Cumulative writes: 0 writes, 0 keys, 0 commit groups, 0.0 writes per commit group, ingest: 0.00 GB, 0.00 MB/s
Cumulative WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 GB, 0.00 MB/s
Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent
Interval writes: 0 writes, 0 keys, 0 commit groups, 0.0 writes per commit group, ingest: 0.00 MB, 0.00 MB/s
Interval WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 MB, 0.00 MB/s
Interval stall: 00:00:0.000 H:M:S, 0.0 percent

Clients

connected_clients:1
blocked_clients:0

Persistence

loading:0
rdb_last_save_time:0
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:0
rdb_bgsave_in_progress:0
rdb_current_bgsave_time_sec:0

CPU

used_cpu_sys:0.08
used_cpu_user:0.04

Replication

role: singleton
repl_backlog_size: 0

Memory

used_memory_rss:261099520

Stats

total_commands_processed:0
instantaneous_ops_per_sec:0
total_connections_received:1
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
pubsub_channels:0
pubsub_patterns:0
expire_scan_keys:0

Keyspace

db0:keys=4964608156

Commandstats

Coststats

127.0.0.1:6379> info all

Server

ardb_version:0.10.0
redis_version:2.8.0
engine:rocksdb
ardb_home:/install
os:Linux 3.13.0-98-generic x86_64
gcc_version:4.8.4
process_id:18207
run_id:2bc2903a5bf449dc3eab36ab947a185db400a24b
tcp_port:6379
listen:0.0.0.0:6379
uptime_in_seconds:159
uptime_in_days:0
executable:/install/ardb/src/ardb-server
config_file:/install/ardb/ardb.conf

Databases

data_dir:/data/rocksdb
used_disk_space:107968311059
living_iterator_num:1
rocksdb_version:5.14.2
rocksdb.block_table_usage:4096
rocksdb.block_table_pinned_usage:4096
rocksdb_memtable_total:117442128
rocksdb_memtable_unflushed:117442128
rocksdb_table_readers_total:7064739869
rocksdb.estimate-table-readers-mem:0
rocksdb.cur-size-all-mem-tables:58721064

** Compaction Stats [0] **
Level    Files    Size       Score  Read(GB)  Rn(GB)  Rnp1(GB)  Write(GB)  Wnew(GB)  Moved(GB)  W-Amp  Rd(MB/s)  Wr(MB/s)  Comp(sec)  Comp(cnt)  Avg(sec)  KeyIn  KeyDrop
  L0      2/0    127.32 MB    0.5     0.0      0.0      0.0       0.0        0.0       0.0       0.0     0.0       0.0        0          0        0.000      0      0
  L1      3/0    208.97 MB    0.8     0.0      0.0      0.0       0.0        0.0       0.0       0.0     0.0       0.0        0          0        0.000      0      0
  L2     44/0      2.46 GB    1.0     0.0      0.0      0.0       0.0        0.0       0.0       0.0     0.0       0.0        0          0        0.000      0      0
  L3    388/0     24.98 GB    1.0     0.0      0.0      0.0       0.0        0.0       0.0       0.0     0.0       0.0        0          0        0.000      0      0
  L4   1082/0     72.79 GB    0.3     0.0      0.0      0.0       0.0        0.0       0.0       0.0     0.0       0.0        0          0        0.000      0      0
 Sum   1519/0    100.55 GB    0.0     0.0      0.0      0.0       0.0        0.0       0.0       0.0     0.0       0.0        0          0        0.000      0      0
 Int      0/0      0.00 KB    0.0     0.0      0.0      0.0       0.0        0.0       0.0       0.0     0.0       0.0        0          0        0.000      0      0

Uptime(secs): 158.4 total, 135.0 interval
Flush(GB): cumulative 0.000, interval 0.000
AddFile(GB): cumulative 0.000, interval 0.000
AddFile(Total Files): cumulative 0, interval 0
AddFile(L0 Files): cumulative 0, interval 0
AddFile(Keys): cumulative 0, interval 0
Cumulative compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
Interval compaction: 0.00 GB write, 0.00 MB/s write, 0.00 GB read, 0.00 MB/s read, 0.0 seconds
Stalls(count): 0 level0_slowdown, 0 level0_slowdown_with_compaction, 0 level0_numfiles, 0 level0_numfiles_with_compaction, 0 stop for pending_compaction_bytes, 0 slowdown for pending_compaction_bytes, 0 memtable_compaction, 0 memtable_slowdown, interval 0 total count

** File Read Latency Histogram By Level [0] **

** DB Stats **
Uptime(secs): 158.4 total, 135.0 interval
Cumulative writes: 0 writes, 0 keys, 0 commit groups, 0.0 writes per commit group, ingest: 0.00 GB, 0.00 MB/s
Cumulative WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 GB, 0.00 MB/s
Cumulative stall: 00:00:0.000 H:M:S, 0.0 percent
Interval writes: 0 writes, 0 keys, 0 commit groups, 0.0 writes per commit group, ingest: 0.00 MB, 0.00 MB/s
Interval WAL: 0 writes, 0 syncs, 0.00 writes per sync, written: 0.00 MB, 0.00 MB/s
Interval stall: 00:00:0.000 H:M:S, 0.0 percent

Clients

connected_clients:2
blocked_clients:0

Persistence

loading:0
rdb_last_save_time:0
rdb_last_bgsave_status:ok
rdb_last_bgsave_time_sec:0
rdb_bgsave_in_progress:0
rdb_current_bgsave_time_sec:0

CPU

used_cpu_sys:7.59
used_cpu_user:32.23

Replication

role: singleton
repl_backlog_size: 0

Memory

used_memory_rss:7389245440

Stats

total_commands_processed:44540
instantaneous_ops_per_sec:1169
total_connections_received:2
rejected_connections:0
sync_full:0
sync_partial_ok:0
sync_partial_err:0
pubsub_channels:0
pubsub_patterns:0
expire_scan_keys:0

Keyspace

db0:keys=4964608156

Commandstats

cmdstat_hgetall:calls=44539,usec=54426420,usecpercall=1221
cmdstat_info:calls=1,usec=2553,usecpercall=2553

Coststats

coststat_info_all:calls=1,costs=2553,cost_per_call=2553,percents=100.0000%
coststat_info_range[1000-5000]:calls=1,costs=2553,cost_per_call=2553,percents=100.0000%
coststat_hgetall_all:calls=44539,costs=54426420,cost_per_call=1221,percents=100.0000%
coststat_hgetall_range[0-1000]:calls=42494,costs=24641496,cost_per_call=579,percents=95.4085%
coststat_hgetall_range[1000-5000]:calls=933,costs=1403273,cost_per_call=1504,percents=2.0948%
coststat_hgetall_range[5000-10000]:calls=61,costs=426992,cost_per_call=6999,percents=0.1370%
coststat_hgetall_range[10000-20000]:calls=553,costs=10021981,cost_per_call=18122,percents=1.2416%
coststat_hgetall_range[20000-50000]:calls=429,costs=13214553,cost_per_call=30803,percents=0.9632%
coststat_hgetall_range[50000-100000]:calls=65,costs=4209337,cost_per_call=64759,percents=0.1459%
coststat_hgetall_range[100000-200000]:calls=4,costs=508788,cost_per_call=127197,percents=0.0090%

rjonesxmoz avatar Aug 19 '18 20:08 rjonesxmoz

in the info output, u can see most memory is used by rocksdb's index/filter blocks; this wiki shows a way to alleviate the overhead: https://github.com/facebook/rocksdb/wiki/Partitioned-Index-Filters

rocksdb_table_readers_total:7064739869
....
used_memory_rss:7389245440
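(As a starting point, and untested: based on that wiki, the block_based_table_factory entry in rocksdb.options could be extended with the partitioned index/filter options. The option names come from RocksDB's block-based table options string, not from anything ardb-specific, and availability depends on the rocksdb version ardb was built against, 5.14.2 here:)

block_based_table_factory={block_cache=512M;filter_policy=bloomfilter:8:true;\
    index_type=kTwoLevelIndexSearch;partition_filters=true;\
    metadata_block_size=4096;cache_index_and_filter_blocks=true;\
    pin_l0_filter_and_index_blocks_in_cache=true}

With cache_index_and_filter_blocks enabled, index and filter partitions are charged against block_cache, so the ~7GB currently reported under rocksdb_table_readers_total would be bounded by the cache size instead.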

yinqiwen avatar Aug 20 '18 01:08 yinqiwen