cluster: support IgnoreInitConfigComps
What problem does this PR solve?
If we have a large number of tikv-server instances, it takes a lot of time (several hours) to generate configs.
What is changed and how it works?
./tiup-cluster scale-in Kvstore_UAT_0 --node <IP:port> -y --ignore-config-roles tikv
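For illustration, a minimal sketch of how such a flag could be wired into a cobra-based CLI like tiup's (`newScaleInCmd` and `scaleIn` are hypothetical stand-ins here, not the actual implementation):

```go
package command

import "github.com/spf13/cobra"

// ignoreConfigRoles holds component roles (e.g. "tikv") whose config
// files should not be regenerated during the operation. Illustrative
// sketch only, not tiup's actual code.
var ignoreConfigRoles []string

// scaleIn is a stub standing in for the real manager call that would
// receive the ignored roles and skip them in the init-config step.
func scaleIn(clusterName string, ignoreRoles []string) error { return nil }

func newScaleInCmd() *cobra.Command {
	cmd := &cobra.Command{
		Use:  "scale-in <cluster-name>",
		Args: cobra.ExactArgs(1),
		RunE: func(cmd *cobra.Command, args []string) error {
			return scaleIn(args[0], ignoreConfigRoles)
		},
	}
	cmd.Flags().StringSliceVar(&ignoreConfigRoles, "ignore-config-roles", nil,
		"roles whose configs are not regenerated during this operation")
	// Hidden, as suggested in the review discussion below.
	_ = cmd.Flags().MarkHidden("ignore-config-roles")
	return cmd
}
```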
Check List
Tests
- Unit test
- Integration test
- Manual test (add detailed scripts or steps below)
- No code
Code changes
- Has exported function/method change
- Has exported variable/fields change
- Has interface methods change
- Has persistent data change
Side effects
- Possible performance regression
- Increased code complexity
- Breaking backward compatibility
Related changes
- Need to cherry-pick to the release branch
- Need to update the documentation
Release notes:
NONE
[REVIEW NOTIFICATION]
This pull request has not been approved.
To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.
The full list of commands accepted by this bot can be found here.
Reviewer can indicate their review by submitting an approval review. Reviewer can cancel approval by submitting a request changes review.
Codecov Report
Base: 56.31% // Head: 50.94% // Decreases project coverage by 5.37% :warning:
Coverage data is based on head (ad17bd3) compared to base (9e2e464). Patch coverage: 100.00% of modified lines in the pull request are covered.
:exclamation: Current head ad17bd3 differs from pull request most recent head 286e65e. Consider uploading reports for the commit 286e65e to get more accurate results.
Additional details and impacted files
@@ Coverage Diff @@
## master #1987 +/- ##
==========================================
- Coverage 56.31% 50.94% -5.37%
==========================================
Files 313 312 -1
Lines 33492 33481 -11
==========================================
- Hits 18858 17055 -1803
- Misses 12415 14212 +1797
+ Partials 2219 2214 -5
| Flag | Coverage Δ | |
|---|---|---|
| tiup | 16.17% <ø> (ø) | |
| unittest | ? | |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ | |
|---|---|---|
| pkg/cluster/operation/operation.go | 80.65% <ø> (ø) | |
| components/cluster/command/prune.go | 59.09% <100.00%> (ø) | |
| components/cluster/command/scale_in.go | 75.00% <100.00%> (ø) | |
| components/cluster/command/scale_out.go | 74.29% <100.00%> (ø) | |
| pkg/cluster/manager/builder.go | 67.20% <100.00%> (ø) | |
| components/dm/ansible/worker.go | 0.00% <0.00%> (-100.00%) | :arrow_down: |
| pkg/meta/err.go | 0.00% <0.00%> (-76.19%) | :arrow_down: |
| pkg/cluster/api/error.go | 0.00% <0.00%> (-75.00%) | :arrow_down: |
| pkg/crypto/rand/passwd.go | 0.00% <0.00%> (-75.00%) | :arrow_down: |
| pkg/telemetry/node_info.go | 0.00% <0.00%> (-70.73%) | :arrow_down: |
| ... and 53 more | | |
- Download blackbox_exporter: (linux/amd64) ... Done
- Download node_exporter: (linux/amd64) ... Done
- Download alertmanager: (linux/amd64) ... Error
Error: read manifest from mirror(https://tiup-mirrors.pingcap.com/) failed: invalid signature for file root.json: not enough signatures (2) for threshold 3 in root.json
I can't fix this unit test; can someone help me?
I think we should fix the underlying problem, that generating configs takes a lot of time (several hours), rather than provide a flag to skip it.
I don't understand why we need to regenerate configs on all nodes when scaling in. The configs seem unchanged. Could you please explain it to me?
PTAL
If any PD node is scaled in, we have to regenerate the configs for all TiKV and TiDB nodes, because the PD addresses are part of their startup scripts. And the Prometheus config is always updated when any node is added to or removed from the cluster.
I agree that we don't have to regenerate configs for all nodes in some cases, but that could be quite complex to implement; the current approach is a reasonable workaround.
Could you rename the `--ignore-components` argument to something like `--ignore-config-roles` to show that it is for configs? And I think it could be better to mark it as hidden as well.
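As a side note on implementation, here is a minimal sketch of how an ignore list could filter instances before the init-config step (the `Instance` type and the function name are assumptions for illustration, not tiup's actual types):

```go
package operation

// Instance is a stand-in for tiup's instance type; the fields here
// are assumptions for illustration only.
type Instance struct {
	Role string
	Host string
}

// filterInitConfig drops instances whose role is in ignoreRoles, so
// the init-config step only touches the remaining (usually far
// smaller) set of nodes.
func filterInitConfig(instances []Instance, ignoreRoles []string) []Instance {
	ignored := make(map[string]struct{}, len(ignoreRoles))
	for _, r := range ignoreRoles {
		ignored[r] = struct{}{}
	}
	var kept []Instance
	for _, inst := range instances {
		if _, skip := ignored[inst.Role]; !skip {
			kept = append(kept, inst)
		}
	}
	return kept
}
```

A set lookup keeps the filter linear in the number of instances, which matters for clusters with hundreds of nodes.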
Thanks for your explanation @AstroProfundis. I think maybe we can skip regenerating configs when TiDB/TiKV scale in/out or prune? I think that operation is safe.
Sorry for the delay...
I think maybe we can skip regenerating configs when TiDB/TiKV scale in/out or prune?
I agree, and I think TiFlash is also safe to be ignored, but I'm not 100% sure about that...
In our production environment (a TiKV-only cluster), I have used this code many times and found nothing unusual. How about adding this feature as an option? Our cluster has hundreds of nodes, and running init config on every node when scaling is really slow.
I agree that adding it as an optional switch for users to decide what components should be ignored when updating configs could be reasonable.
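If such a switch is added, it might also make sense to reject roles that are never safe to skip, per the reasoning above about PD and Prometheus; a hypothetical guard:

```go
package operation

import "fmt"

// validateIgnoreRoles rejects roles that are never safe to skip: PD
// changes require regenerating other nodes' configs, and Prometheus
// is always updated on topology changes. Hypothetical helper, not
// tiup's actual code.
func validateIgnoreRoles(roles []string) error {
	for _, r := range roles {
		switch r {
		case "pd", "prometheus":
			return fmt.Errorf("config regeneration cannot be skipped for role %q", r)
		}
	}
	return nil
}
```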
Could you rename the `--ignore-components` argument to something like `--ignore-config-roles` to show that it is for configs? And I think it could be better to mark it as hidden as well.
How about this?
@AstroProfundis It's been a long time. Are you still interested in this PR?