
DB size diverged but kvhash were consistent across 3 nodes

chaochn47 opened this issue 3 years ago

What happened?

[Screenshot: db size metric over time, 2022-07-20]

The DB size diff can be as high as 3 GiB. Here is one screenshot of the db size metric. The drop in the blue line is actually our automation workflow replacing a node via member removal/add.

node1

# HELP etcd_mvcc_db_total_size_in_bytes Total size of the underlying database physically allocated in bytes.
# TYPE etcd_mvcc_db_total_size_in_bytes gauge
etcd_mvcc_db_total_size_in_bytes 5.299228672e+09
# HELP etcd_mvcc_db_total_size_in_use_in_bytes Total size of the underlying database logically in use in bytes.
# TYPE etcd_mvcc_db_total_size_in_use_in_bytes gauge
etcd_mvcc_db_total_size_in_use_in_bytes 5.24292096e+09

node2

# HELP etcd_mvcc_db_total_size_in_bytes Total size of the underlying database physically allocated in bytes.
# TYPE etcd_mvcc_db_total_size_in_bytes gauge
etcd_mvcc_db_total_size_in_bytes 3.873603584e+09
# HELP etcd_mvcc_db_total_size_in_use_in_bytes Total size of the underlying database logically in use in bytes.
# TYPE etcd_mvcc_db_total_size_in_use_in_bytes gauge
etcd_mvcc_db_total_size_in_use_in_bytes 3.866206208e+09

node3

# HELP etcd_mvcc_db_total_size_in_bytes Total size of the underlying database physically allocated in bytes.
# TYPE etcd_mvcc_db_total_size_in_bytes gauge
etcd_mvcc_db_total_size_in_bytes 5.16868096e+09
# HELP etcd_mvcc_db_total_size_in_use_in_bytes Total size of the underlying database logically in use in bytes.
# TYPE etcd_mvcc_db_total_size_in_use_in_bytes gauge
etcd_mvcc_db_total_size_in_use_in_bytes 5.153357824e+09

Interestingly, multiple runs of the hashkv command indicate there is no data inconsistency.

REVISION=$(etcdctl get foo -w json | jq -r '.header.revision')
etcdctl endpoint hashkv \
--endpoints IP_1,IP_2,IP_3 \
--rev $REVISION -w table
+---------------------------+-----------+
|         ENDPOINT          |   HASH    |
+---------------------------+-----------+
| xxxxx                IP_1 | 637567879 |
| xxxxx                IP_2 | 637567879 |
| xxxxx                IP_3 | 637567879 |
+---------------------------+-----------+

ClientURLs are obfuscated.

After taking db snapshots of node1 and node2, I ran some bbolt inspection commands to gain more insight.

node1

$ sudo bbolt page /investigate-node-a/datadir/member/snap/db 0
Page ID:    0
Page Type:  meta
Total Size: 4096 bytes
Version:    2
Page Size:  4096 bytes
Flags:      00000000
Root:       <pgid=686251>
Freelist:   <pgid=18446744073709551615>
HWM:        <pgid=956035>
Txn ID:     171026
Checksum:   34db10a3202fae85

$ sudo ls -lh /investigate-node-a/datadir/member/snap/db
-rw------- 1 root root 3.7G Jul 20 05:31 /investigate-node-a/datadir/member/snap/db

$ sudo bbolt stats /investigate-node-a/datadir/member/snap/db
Aggregate statistics for 10 buckets

Page count statistics
	Number of logical branch pages: 6722
	Number of physical branch overflow pages: 0
	Number of logical leaf pages: 696527
	Number of physical leaf overflow pages: 233880
Tree statistics
	Number of keys/value pairs: 2018088
	Number of levels in B+tree: 4
Page size utilization
	Bytes allocated for physical branch pages: 27533312
	Bytes actually used for branch data: 23318078 (84%)
	Bytes allocated for physical leaf pages: 3810947072
	Bytes actually used for leaf data: 3087101146 (81%)
Bucket statistics
	Total number of buckets: 10
	Total number on inlined buckets: 8 (80%)
	Bytes used for inlined buckets: 874 (0%)

node2

$ sudo bbolt page /investigate-node-b/datadir/member/snap/db 0
Page ID:    0
Page Type:  meta
Total Size: 4096 bytes
Version:    2
Page Size:  4096 bytes
Flags:      00000000
Root:       <pgid=1277606>
Freelist:   <pgid=18446744073709551615>
HWM:        <pgid=1277607>
Txn ID:     159180
Checksum:   cf97efb8536103bf

$ sudo ls -lh /investigate-node-b/datadir/member/snap/db
-rw------- 1 root root 4.9G Jul 20 05:38 /investigate-node-b/datadir/member/snap/db

$ sudo bbolt stats /investigate-node-b/datadir/member/snap/db
Aggregate statistics for 10 buckets

Page count statistics
	Number of logical branch pages: 7970
	Number of physical branch overflow pages: 0
	Number of logical leaf pages: 834707
	Number of physical leaf overflow pages: 434922
Tree statistics
	Number of keys/value pairs: 2411556
	Number of levels in B+tree: 4
Page size utilization
	Bytes allocated for physical branch pages: 32645120
	Bytes actually used for branch data: 27944661 (85%)
	Bytes allocated for physical leaf pages: 5200400384
	Bytes actually used for leaf data: 4192003086 (80%)
Bucket statistics
	Total number of buckets: 10
	Total number on inlined buckets: 8 (80%)
	Bytes used for inlined buckets: 874 (0%)
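The gap is visible in the bbolt numbers themselves. A quick cross-check with plain shell arithmetic (figures copied from the two stats dumps above):

```shell
# Leaf-page allocation from the two bbolt stats dumps above.
node1_leaf_alloc=3810947072   # bytes allocated for physical leaf pages, node1
node2_leaf_alloc=5200400384   # bytes allocated for physical leaf pages, node2

# node2 has ~1.4 GB more leaf allocation, roughly matching the db size gap.
echo $(( node2_leaf_alloc - node1_leaf_alloc ))

# Average allocated leaf bytes per key also differ noticeably
# (key counts taken from the "Number of keys/value pairs" lines above).
echo $(( node2_leaf_alloc / 2411556 ))   # node2
echo $(( node1_leaf_alloc / 2018088 ))   # node1
```

So the extra size is not just free pages: node2 also carries more allocated leaf bytes per key, consistent with its much higher leaf overflow page count.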

Note: our etcd operator runs defrag periodically, every 6 hours with randomized jitter (+/- 10 min), one member at a time, which shrinks/reclaims unused free pages. But that does not explain the situation above.
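For reference, that schedule can be sketched roughly like this (a sketch only; the jitter mechanism and endpoint names are illustrative, not our actual operator code):

```shell
# Defrag one member at a time, every ~6h with +/- 10 min jitter.
base=$(( 6 * 3600 ))                                 # 6 hours in seconds
rand=$(od -An -N2 -tu2 /dev/urandom | tr -d ' ')     # random 0..65535
jitter=$(( rand % 1201 - 600 ))                      # -600..+600 seconds
interval=$(( base + jitter ))

# Running defrag sequentially matters: defrag blocks the member it runs on,
# so it should never hit all endpoints simultaneously.
# for ep in http://IP_1:2379 http://IP_2:2379 http://IP_3:2379; do
#   etcdctl --endpoints "$ep" defrag
#   sleep "$interval"
# done
echo "$interval"
```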

What did you expect to happen?

I expected the DB sizes not to diverge like this; the difference should not grow monotonically without ever decreasing.

How can we reproduce it (as minimally and precisely as possible)?

The issue is still ongoing.

Anything else we need to know?

N/A

Etcd version (please run commands below)

sh-4.2$ etcdctl version
etcdctl version: 3.5.4
API version: 3.5

sh-4.2$ etcd --version
etcd Version: 3.5.4
Git SHA: b4d0ba964
Go Version: go1.16.15
Go OS/Arch: linux/amd64

Etcd configuration (command line flags or environment variables)

/usr/bin/etcd --name node-1 \
--data-dir /mnt/xvdb/datadir \
--wal-dir /mnt/xvdb/datadir/member/wal --snapshot-count 10000 \
--initial-cluster-token 8e84c0ec-ef6d-4616-98fc-008b0f120f4f \
--advertise-client-urls http://IP_1:2379 --listen-client-urls http://0.0.0.0:2379 \
--listen-peer-urls http://0.0.0.0:2380 --initial-advertise-peer-urls http://IP_1:2380 \
--quota-backend-bytes 8388608000 --max-request-bytes 1572864 \
--grpc-keepalive-interval 2h0m0s --pre-vote=true --logger zap

Etcd debug information (please run commands below, feel free to obfuscate the IP address or FQDN in the output)

$ etcdctl member list -w table
# paste output here

$ etcdctl --endpoints=<member list> endpoint status -w table
# paste output here

Relevant log output

No response

chaochn47 avatar Jul 21 '22 01:07 chaochn47

I haven't looked too deeply into why this happens (I expect you don't run defrag on all members), but please remember that db_total_size is different from db_in_use_size.

serathius avatar Jul 21 '22 09:07 serathius

https://github.com/etcd-io/etcd/blob/6d27a42b7d6191da43d27930282de5b9e54ead7c/server/storage/backend/backend.go#L621-L630

The diff is the number of free pages * page size (4k). I think it might be because of compaction (need to confirm).

But in the manually scraped /metrics output, even db_in_use_size diverged by as much as 2 GB.
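The relationship in the linked code can be checked against the metrics posted earlier (a sketch; page size of 4096 bytes assumed, figures copied from node1's /metrics output):

```shell
# In backend.go: size_in_use = size - (freePages * pageSize).
total=5299228672     # etcd_mvcc_db_total_size_in_bytes, node1
in_use=5242920960    # etcd_mvcc_db_total_size_in_use_in_bytes, node1

# Number of free pages implied by the gap between the two gauges.
echo $(( (total - in_use) / 4096 ))
```

For node1 the two gauges are only ~54 MiB apart, so free pages cannot account for a multi-GB divergence between members.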

chaochn47 avatar Jul 21 '22 17:07 chaochn47

The diff is the number of free pages * page size (4k). I think it might be because of compaction (need to confirm).

The compaction window is negotiated via raft, so I would not expect compaction to cause members to diverge. The only option would be that background compaction bugged out on one of the members, but that would be strange.

But in the manually scraped /metrics output, even db_in_use_size diverged by as much as 2 GB.

This is definitely unexpected. I would love to see a boltdb dump diff. To rule out compaction as a culprit, could you scan the revision space?

serathius avatar Jul 22 '22 07:07 serathius

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Oct 22 '22 18:10 stale[bot]


Thanks, we are running into the same problem here.

root@k8s-master01 17:19:10 ~ # etcdctl --endpoints="10.10.11.99:2379,10.10.11.100:2379,10.10.11.102:2379" \
    --cacert=/etc/kubernetes/pki/etcd/etcd-ca.pem --cert=/etc/kubernetes/pki/etcd/etcd.pem \
    --key=/etc/kubernetes/pki/etcd/etcd-key.pem endpoint status --write-out=table
+-------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|     ENDPOINT      |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|  10.10.11.99:2379 | ce473dddcf114fa6 |   3.5.6 |  129 MB |      true |      false |        81 |  127683790 |          127683790 |        |
| 10.10.11.100:2379 | 8c3d1a8f0a5bc682 |   3.5.6 |  127 MB |     false |      false |        81 |  127683790 |          127683790 |        |
| 10.10.11.102:2379 | 2a25a0237b706a0d |   3.5.6 |  127 MB |     false |      false |        81 |  127683790 |          127683790 |        |
+-------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

root@k8s-master01 17:20:53 ~ # REVISION=$(etcdctl --endpoints="10.10.11.99:2379,10.10.11.100:2379,10.10.11.102:2379" \
    --cacert=/etc/kubernetes/pki/etcd/etcd-ca.pem --cert=/etc/kubernetes/pki/etcd/etcd.pem \
    --key=/etc/kubernetes/pki/etcd/etcd-key.pem get foo -w json | jq -r '.header.revision')

root@k8s-master01 17:26:20 ~ # etcdctl --endpoints="10.10.11.99:2379,10.10.11.100:2379,10.10.11.102:2379" \
    --cacert=/etc/kubernetes/pki/etcd/etcd-ca.pem --cert=/etc/kubernetes/pki/etcd/etcd.pem \
    --key=/etc/kubernetes/pki/etcd/etcd-key.pem endpoint hashkv --rev $REVISION -w table
+-------------------+------------+
|     ENDPOINT      |    HASH    |
+-------------------+------------+
|  10.10.11.99:2379 | 1979968429 |
| 10.10.11.100:2379 | 1979968429 |
| 10.10.11.102:2379 | 1979968429 |
+-------------------+------------+

root@k8s-master01 17:26:47 ~ # etcdctl --endpoints="10.10.11.99:2379,10.10.11.100:2379,10.10.11.102:2379" \
    --cacert=/etc/kubernetes/pki/etcd/etcd-ca.pem --cert=/etc/kubernetes/pki/etcd/etcd.pem \
    --key=/etc/kubernetes/pki/etcd/etcd-key.pem endpoint status --write-out=table
+-------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|     ENDPOINT      |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
+-------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
|  10.10.11.99:2379 | ce473dddcf114fa6 |   3.5.6 |  129 MB |      true |      false |        81 |  127689679 |          127689679 |        |
| 10.10.11.100:2379 | 8c3d1a8f0a5bc682 |   3.5.6 |  128 MB |     false |      false |        81 |  127689679 |          127689679 |        |
| 10.10.11.102:2379 | 2a25a0237b706a0d |   3.5.6 |  128 MB |     false |      false |        81 |  127689679 |          127689679 |        |
+-------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
root@k8s-master01 17:27:00 ~ #

gzwxh avatar Jun 01 '23 09:06 gzwxh
