carbon-clickhouse icon indicating copy to clipboard operation
carbon-clickhouse copied to clipboard

The process has stuck when reached `too many open files`

Open Felixoid opened this issue 4 years ago • 2 comments

Hello. Yesterday the process has stuck in a strange state without doing anything.

I did check the CH logs, there wasn't any suspicious.

There's a log from the stuck time:

> grep -C15 2020-02-06T20:12:33.379Z TMOP-CCH.log
[2020-02-06T20:12:27.381Z] INFO [upload] start handle {"name": "graphite_tagged", "filename": "/var/lib/carbon-clickhouse/graphite_tagged/default.1581019944197075408.lz4"}
[2020-02-06T20:12:27.403Z] INFO [upload] start handle {"name": "graphite_index", "filename": "/var/lib/carbon-clickhouse/graphite_index/default.1581019944197075408.lz4"}
[2020-02-06T20:12:27.617Z] ERROR [upload] handle failed {"name": "graphite_tagged", "filename": "/var/lib/carbon-clickhouse/graphite_tagged/default.1581019944197075408.lz4", "error": "Post http://ch-host:8123/?query=INSERT+INTO+graphite.tagged+%28Date%2C+Tag1%2C+Path%2C+Tags%2C+Version%29+FORMAT+RowBinary: EOF", "time": 0.235977166}
[2020-02-06T20:12:27.754Z] INFO [upload] handle success {"name": "graphite", "filename": "/var/lib/carbon-clickhouse/graphite/default.1581019944197075408.lz4", "time": 0.374817767}
[2020-02-06T20:12:28.129Z] INFO [upload] handle success {"name": "graphite_index", "filename": "/var/lib/carbon-clickhouse/graphite_index/default.1581019944197075408.lz4", "time": 0.727357581}
[2020-02-06T20:12:28.380Z] INFO [upload] start handle {"name": "graphite_tagged", "filename": "/var/lib/carbon-clickhouse/graphite_tagged/default.1581019944197075408.lz4"}
[2020-02-06T20:12:28.486Z] INFO [upload] handle success {"name": "graphite_tagged", "filename": "/var/lib/carbon-clickhouse/graphite_tagged/default.1581019944197075408.lz4", "time": 0.105951873}
[2020-02-06T20:12:30.379Z] INFO [upload] start handle {"name": "graphite", "filename": "/var/lib/carbon-clickhouse/graphite/default.1581019947197180216.lz4"}
[2020-02-06T20:12:30.380Z] INFO [upload] start handle {"name": "graphite_tagged", "filename": "/var/lib/carbon-clickhouse/graphite_tagged/default.1581019947197180216.lz4"}
[2020-02-06T20:12:30.380Z] INFO [upload] start handle {"name": "graphite_index", "filename": "/var/lib/carbon-clickhouse/graphite_index/default.1581019947197180216.lz4"}
[2020-02-06T20:12:30.478Z] INFO [upload] handle success {"name": "graphite_tagged", "filename": "/var/lib/carbon-clickhouse/graphite_tagged/default.1581019947197180216.lz4", "time": 0.098053346}
[2020-02-06T20:12:30.625Z] INFO [upload] handle success {"name": "graphite", "filename": "/var/lib/carbon-clickhouse/graphite/default.1581019947197180216.lz4", "time": 0.245967823}
[2020-02-06T20:12:30.875Z] ERROR [upload] handle failed {"name": "graphite_index", "filename": "/var/lib/carbon-clickhouse/graphite_index/default.1581019947197180216.lz4", "error": "Post http://ch-host:8123/?query=INSERT+INTO+graphite.index+%28Date%2C+Level%2C+Path%2C+Version%29+FORMAT+RowBinary: EOF", "time": 0.495387526}
[2020-02-06T20:12:31.380Z] INFO [upload] start handle {"name": "graphite_index", "filename": "/var/lib/carbon-clickhouse/graphite_index/default.1581019947197180216.lz4"}
[2020-02-06T20:12:31.639Z] INFO [upload] handle success {"name": "graphite_index", "filename": "/var/lib/carbon-clickhouse/graphite_index/default.1581019947197180216.lz4", "time": 0.25905138}
[2020-02-06T20:12:33.379Z] INFO [upload] start handle {"name": "graphite", "filename": "/var/lib/carbon-clickhouse/graphite/default.1581019950197305764.lz4"}
[2020-02-06T20:12:33.380Z] INFO [upload] start handle {"name": "graphite_index", "filename": "/var/lib/carbon-clickhouse/graphite_index/default.1581019950197305764.lz4"}
[2020-02-06T20:12:33.380Z] INFO [upload] start handle {"name": "graphite_tagged", "filename": "/var/lib/carbon-clickhouse/graphite_tagged/default.1581019950197305764.lz4"}
[2020-02-06T20:12:45.198Z] INFO [writer] chunk interval changed {"interval": "10s"}
[2020-02-06T20:12:49.234Z] WARN [tcp] unfinished line {"peer": "[2a00:1f78:fffd:402e::101]:35402", "line": "servers.a"}
[2020-02-06T20:12:50.012Z] WARN [tcp] unfinished line {"peer": "[2a00:1f78:fffd:4002::a9d]:56246", "line": "metric.name"}
[2020-02-06T20:12:50.709Z] WARN [tcp] failed to accept connection {"error": "accept tcp [::]:2003: accept4: too many open files"}
[2020-02-06T20:12:50.709Z] WARN [tcp] failed to accept connection {"error": "accept tcp [::]:2003: accept4: too many open files"}
[2020-02-06T20:12:50.709Z] WARN [tcp] failed to accept connection {"error": "accept tcp [::]:2003: accept4: too many open files"}
**** the line repeats next 1.3GiB****

Another strange thing - disk space was allocated in df output, but du didn't show anything: (upd: looks like logrotate has removed the file, but carbon-clickhouse hasn't release it by some reason)

> df
Filesystem     1K-blocks    Used Available Use% Mounted on
udev              980608       0    980608   0% /dev
tmpfs             198304   22024    176280  12% /run
/dev/vda         7329792 7329772        20 100% /
# cuted
> sudo du -hscx / | sort -h 
1.7G	/
1.7G	total
> sudo lsof | grep -i del
COMMAND     PID   TID             USER   FD      TYPE             DEVICE SIZE/OFF       NODE NAME
carbon-cl 19199       carbon-clickhouse   13w      REG              254,0 5603639296    4872077 /var/log/carbon-clickhouse/carbon-clickhouse.log.1 (deleted)
carbon-cl 19199  6028 carbon-clickhouse   13w      REG              254,0 5603639296    4872077 /var/log/carbon-clickhouse/carbon-clickhouse.log.1 (deleted)
carbon-cl 19199 19200 carbon-clickhouse   13w      REG              254,0 5603639296    4872077 /var/log/carbon-clickhouse/carbon-clickhouse.log.1 (deleted)
*** dozens of line with different TID ***

I'll increase nofiles limit for sure because it was the reason for the outage. But maybe you'd find a way to avoid such deadlocks. Even panic and die would be better behavior IMHO.

Felixoid avatar Feb 07 '20 10:02 Felixoid

bump! Faced the same errors in log with no upload to clickhouse:

-- Logs begin at Wed 2021-09-01 20:28:48 +05, end at Mon 2021-09-06 20:06:59 +05. --
Sep 01 20:28:48 Monitoring carbon-clickhouse[1162]: [2021-09-01T20:28:48.017+0500] ERROR [upload] ReadDir failed {"name": "graphite_tagged", "error": "open /data/carbon-clickhouse/graphite_tagged: too many open f>
Sep 01 20:28:48 Monitoring carbon-clickhouse[1162]: [2021-09-01T20:28:48.017+0500] ERROR [writer] ReadDir failed {"error": "open /data/carbon-clickhouse/: too many open files"}
Sep 01 20:28:48 Monitoring carbon-clickhouse[1162]: [2021-09-01T20:28:48.019+0500] ERROR [upload] ReadDir failed {"name": "graphite_index", "error": "open /data/carbon-clickhouse/graphite_index: too many open fil>
Sep 01 20:28:48 Monitoring carbon-clickhouse[1162]: [2021-09-01T20:28:48.043+0500] ERROR [upload] ReadDir failed {"name": "graphite", "error": "open /data/carbon-clickhouse/graphite: too many open files"}
Sep 01 20:28:48 Monitoring carbon-clickhouse[1162]: [2021-09-01T20:28:48.513+0500] ERROR [writer] create failed {"filename": "/data/carbon-clickhouse/default.1630510128513008811.lz4", "error": "open /data/carbon->
Sep 01 20:28:49 Monitoring carbon-clickhouse[1162]: [2021-09-01T20:28:49.017+0500] ERROR [upload] ReadDir failed {"name": "graphite_tagged", "error": "open /data/carbon-clickhouse/graphite_tagged: too many open f>
Sep 01 20:28:49 Monitoring carbon-clickhouse[1162]: [2021-09-01T20:28:49.017+0500] ERROR [writer] ReadDir failed {"error": "open /data/carbon-clickhouse/: too many open files"}
Sep 01 20:28:49 Monitoring carbon-clickhouse[1162]: [2021-09-01T20:28:49.018+0500] ERROR [upload] ReadDir failed {"name": "graphite_index", "error": "open /data/carbon-clickhouse/graphite_index: too many open fil>
Sep 01 20:28:49 Monitoring carbon-clickhouse[1162]: [2021-09-01T20:28:49.043+0500] ERROR [upload] ReadDir failed {"name": "graphite", "error": "open /data/carbon-clickhouse/graphite: too many open files"}
Sep 01 20:28:49 Monitoring carbon-clickhouse[1162]: [2021-09-01T20:28:49.513+0500] ERROR [writer] create failed {"filename": "/data/carbon-clickhouse/default.1630510129513915235.lz4", "error": "open /data/carbon->
Sep 01 20:28:50 Monitoring carbon-clickhouse[1162]: [2021-09-01T20:28:50.018+0500] ERROR [writer] ReadDir failed {"error": "open /data/carbon-clickhouse/: too many open files"}

Can't confirm that problem is in logrotate-carbon interaction because this 'strange' thing of @Felixoid doesn't appear to me. But from my side can admit that in times of normal work carbon was consuming about 8% of cpu. Now, at the moment of bad behaviour, it consumes 100% of cpu (one full core). /data/carbon-clickhouse/graphite(index,tagged) folders contains 540k of broken symlinks in each (5 hosts x 7 days of system and app-specific metrics) and this count increases. Date of creation of oldest symlink in folder corresponds with time when metrics has stopped being uploaded.

[root@Monitoring carbon-clickhouse]# du -sh /data/carbon-clickhouse/*
4.0K	/data/carbon-clickhouse/default.1617361073904867900.lz4
16K	/data/carbon-clickhouse/default.1630397796503127702.lz4
32M	/data/carbon-clickhouse/graphite
32M	/data/carbon-clickhouse/graphite_index
32M	/data/carbon-clickhouse/graphite_tagged

Folders cleaning and restart helped. carbon-clickhouse -version 0.11.1

Sooo, any updates with this issue, диарс? :) BTW my humble opinion is same to @Felixoid that it it not good that carbon shows green state in systemctl status when in fact not working properly.

phuntik avatar Sep 06 '21 15:09 phuntik

The best thing I can see here is putting all ports down and writing an ERROR to log. Does it sound like a good solution?

Felixoid avatar Jan 10 '22 11:01 Felixoid