vminsert upgraded to v1.101.0-cluster, CPU usage increased from 35% to 100%

Open aluode99 opened this issue 1 year ago • 13 comments

Describe the bug

After upgrading vminsert to v1.101.0-cluster, CPU usage increased from 35% to 100% under the same conditions.

Before the upgrade, CPU usage was as follows: image

After upgrading to v1.101.0-cluster, CPU usage is as follows: image

Analysis with pprof suggests that insert_ctx_pool is the cause. After rolling insert_ctx_pool back to its v1.100.0-cluster state and recompiling vminsert (v1.101.0-cluster), CPU usage returns to normal. image image

In repeated tests, CPU usage always increased after upgrading to v1.101.0, and only rolling insert_ctx_pool back to v1.100.0 solved the problem. Should insert_ctx_pool be rolled back to v1.100.0?

To Reproduce

Upgrade vminsert to v1.101.0-cluster and observe the change in CPU usage.

Version

v1.101.0-cluster

Logs

No response

Screenshots

No response

Used command-line flags

No response

Additional information

No response

aluode99 avatar Jul 30 '24 14:07 aluode99

On 1.102.0 we have the same issue

Sinketsu avatar Jul 30 '24 15:07 Sinketsu

I'm linking the related commit as it's not mentioned in CHANGELOG.

https://github.com/VictoriaMetrics/VictoriaMetrics/commit/498fe1cfa523be5bfecaa372293c3cded85e75ab

@aluode99 could you provide the pprof result?

jiekun avatar Jul 31 '24 03:07 jiekun

@aluode99 it would also be great if you could provide the resource requests/limits for your vminsert components.

hagen1778 avatar Jul 31 '24 11:07 hagen1778

@aluode99 would be also great if you could provide resource requests/limits for your vminsert components.

@hagen1778 vminsert resources are as follows:

  • replicas: 6
  • resource requests: 7c6G
  • resource limits: 7c6G
  • Datapoints ingestion rate: 3.5 Mil

image image

aluode99 avatar Jul 31 '24 13:07 aluode99

I'm linking the related commit as it's not mentioned in CHANGELOG.

498fe1c

@aluode99 could you provide the pprof result?

@jiekun Sorry, I didn't save the pprof output. I can provide monitoring data if needed.

aluode99 avatar Jul 31 '24 13:07 aluode99

I'm trying to reproduce it locally with a very simple setup. Here are my profiles for v1.100.0, v1.100.0-without-ch, and v1.101.0: profile.zip

I did not observe a significant difference in CPU usage (I did not scrape precise metrics), but the profiles differ in Total Samples, which may indicate a difference in CPU usage:

  • v1.100.0: ~100%
  • v1.101.0 / v1.100.0-without-ch: ~200%

I may need to set up a test environment and re-test it.

jiekun avatar Aug 01 '24 04:08 jiekun

Hello! Could you tell me the current status of this issue? It blocks us from updating to the latest version. Is there anything I can do to help? Thanks!

Sinketsu avatar Aug 07 '24 10:08 Sinketsu

@Sinketsu @aluode99 Hi. I'm running the related version of vminsert in our internal cluster to reproduce the issue.

It would be helpful if you could share the vminsert panels from the monitoring dashboard (including: Requests rate, Concurrent inserts, CPU usage, Memory usage, Storage connection saturation, Storage reachability, Network usage: clients, Network usage: vmstorage, Row per insert) while running v1.101.0 (or 1.102.0).

Also, please try to capture a CPU profile.
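
For reference, a CPU profile from a Go service is usually fetched from its `/debug/pprof/profile` HTTP endpoint. The sketch below is only an illustration: the helper names `profileURL` and `saveProfile`, the output file name, and the 30-second duration are hypothetical, and the `host:port` of the vminsert HTTP listener has to be supplied by the caller.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

// profileURL builds the CPU profile URL for a service that exposes the
// standard net/http/pprof handlers. addr is host:port (an assumption:
// the caller knows the HTTP listen address of vminsert).
func profileURL(addr string, seconds int) string {
	return fmt.Sprintf("http://%s/debug/pprof/profile?seconds=%d", addr, seconds)
}

// saveProfile downloads a CPU profile and writes it to path.
func saveProfile(addr string, seconds int, path string) error {
	// The endpoint blocks for the profiling duration, so allow extra time.
	client := &http.Client{Timeout: time.Duration(seconds+30) * time.Second}
	resp, err := client.Get(profileURL(addr, seconds))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %s", resp.Status)
	}
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = io.Copy(f, resp.Body)
	return err
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: getprofile <host:port>")
		return
	}
	if err := saveProfile(os.Args[1], 30, "cpu.pprof"); err != nil {
		fmt.Fprintln(os.Stderr, "error:", err)
		os.Exit(1)
	}
	fmt.Println("wrote cpu.pprof")
}
```

The resulting file can then be inspected with `go tool pprof cpu.pprof` or attached to the issue.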

jiekun avatar Aug 07 '24 12:08 jiekun

Hello! We have independent vminsert clusters in different AZs, and each AZ receives exactly the same load. So I deployed 1.97.3 to one AZ and 1.102.0 to another to compare them in real time.

Graphs for one instance of each cluster (all other instances are similar):

image image image image image

pprofs: pprof.zip

Sinketsu avatar Aug 09 '24 09:08 Sinketsu

@Sinketsu Thank you for the support and feedback. I've also observed similar issues during testing. We are discussing internally and will continue to update progress on this issue.

jiekun avatar Aug 09 '24 10:08 jiekun

Hello, could you please try setting the GOGC=100 environment variable for vminsert?

By default, VictoriaMetrics uses GOGC=30, and it seems that this could make sync.Pool inefficient in some cases.

f41gh7 avatar Aug 09 '24 13:08 f41gh7

I set GOGC=100 on version 1.102.0. Here are the CPU/memory metrics: image

CPU usage has improved (I think it is similar to 1.97.3), but memory usage has now increased.

Sinketsu avatar Aug 09 '24 14:08 Sinketsu

The change was reverted to the state before v1.101.0 in this PR https://github.com/VictoriaMetrics/VictoriaMetrics/pull/6794

hagen1778 avatar Aug 13 '24 14:08 hagen1778

This bugfix was included in the v1.103.0 and v1.102.2 releases.

hagen1778 avatar Aug 29 '24 12:08 hagen1778