nebula
Cluster (v3.2.0) becomes unstable when more data is ingested into an existing space
Please check the FAQ documentation before raising an issue
Describe the bug (required)
- Cluster becomes unstable beyond these statistics: 50 million vertices and 3.3 billion edges
- We are seeing this in our TEST cluster
I20220915 16:28:09.498340 102 NebulaSnapshotManager.cpp:67] Space 61 Part 34 start send snapshot of commitLogId 89598562 commitLogTerm 2, rate limited to 10485760, batch size is 1048576
I20220915 16:28:09.498370 102 SnapshotManager.cpp:51] [Port: 9780, Space: 61, Part: 34] Snapshot send failed, the leader changed?
I20220915 16:28:09.498400 102 Host.cpp:355] [Port: 9780, Space: 61, Part: 34] [Host: nebula-cluster-storaged-1.nebula-cluster-storaged-headless.cohort-search.svc.cluster.local:9780] Send snapshot failed!
I20220915 16:28:09.498530 53 Host.cpp:337] [Port: 9780, Space: 61, Part: 115] [Host: nebula-cluster-storaged-1.nebula-cluster-storaged-headless.cohort-search.svc.cluster.local:9780] Can't find log 1 in wal, send the snapshot, logIdToSend = 32549073, firstLogId in wal = 32538198, lastLogId in wal = 32549073
I20220915 16:28:09.498550 66 Host.cpp:337] [Port: 9780, Space: 61, Part: 124] [Host: nebula-cluster-storaged-1.nebula-cluster-storaged-headless.cohort-search.svc.cluster.local:9780] Can't find log 1 in wal, send the snapshot, logIdToSend = 97588614, firstLogId in wal = 97582907, lastLogId in wal = 97588614
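To see which partitions are affected, log lines like the ones above can be grouped by space and part. The following is a minimal sketch, not part of NebulaGraph: the function name `summarize` and the regular expressions are assumptions written against the exact message wording shown in this report, and they may need adjusting for other versions. Reading the messages at face value, a "Can't find log N in wal" line means the requested log ID is older than `firstLogId in wal`, so the leader falls back to sending a snapshot.

```python
import re
from collections import Counter

# Patterns are assumptions based on the message wording quoted in this issue.
WAL_GAP = re.compile(
    r"Space: (?P<space>\d+), Part: (?P<part>\d+)\].*"
    r"Can't find log (?P<wanted>\d+) in wal.*"
    r"firstLogId in wal = (?P<first>\d+), lastLogId in wal = (?P<last>\d+)"
)
SEND_FAILED = re.compile(
    r"Space: (?P<space>\d+), Part: (?P<part>\d+)\].*Send snapshot failed"
)

def summarize(lines):
    """Return (gaps, failures) keyed by (space, part)."""
    gaps = {}             # latest WAL gap seen for each partition
    failures = Counter()  # number of "Send snapshot failed" lines per partition
    for line in lines:
        m = WAL_GAP.search(line)
        if m:
            key = (int(m["space"]), int(m["part"]))
            gaps[key] = {
                "wanted": int(m["wanted"]),
                "first": int(m["first"]),
                "last": int(m["last"]),
            }
            continue
        m = SEND_FAILED.search(line)
        if m:
            failures[(int(m["space"]), int(m["part"]))] += 1
    return gaps, failures

# Three of the lines quoted in this issue, copied verbatim.
HOST = ("nebula-cluster-storaged-1.nebula-cluster-storaged-headless"
        ".cohort-search.svc.cluster.local:9780")
SAMPLE = [
    "I20220915 16:28:09.498400 102 Host.cpp:355] [Port: 9780, Space: 61, Part: 34] "
    f"[Host: {HOST}] Send snapshot failed!",
    "I20220915 16:28:09.498530 53 Host.cpp:337] [Port: 9780, Space: 61, Part: 115] "
    f"[Host: {HOST}] Can't find log 1 in wal, send the snapshot, "
    "logIdToSend = 32549073, firstLogId in wal = 32538198, lastLogId in wal = 32549073",
    "I20220915 16:28:09.498550 66 Host.cpp:337] [Port: 9780, Space: 61, Part: 124] "
    f"[Host: {HOST}] Can't find log 1 in wal, send the snapshot, "
    "logIdToSend = 97588614, firstLogId in wal = 97582907, lastLogId in wal = 97588614",
]

if __name__ == "__main__":
    gaps, failures = summarize(SAMPLE)
    for (space, part), gap in sorted(gaps.items()):
        print(f"space {space} part {part}: wants log {gap['wanted']}, "
              f"WAL holds {gap['first']}..{gap['last']}")
    for (space, part), count in sorted(failures.items()):
        print(f"space {space} part {part}: {count} failed snapshot send(s)")
```

Running the same function over a full storaged log (for example `summarize(open(path))`, where `path` is wherever the log is stored) would show whether the failures concentrate on a few partitions or on one peer.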
Your Environments (required)
- OS: 18.04.1-Ubuntu x86_64 GNU/Linux
- Compiler: Using docker images
- CPU (output of lscpu):
  Architecture: x86_64
  CPU op-mode(s): 32-bit, 64-bit
  Byte Order: Little Endian
  CPU(s): 16
  On-line CPU(s) list: 0-15
  Thread(s) per core: 2
  Core(s) per socket: 8
  Socket(s): 1
  NUMA node(s): 2
  Vendor ID: AuthenticAMD
  CPU family: 23
  Model: 1
  Model name: AMD EPYC 7551 32-Core Processor
  Stepping: 2
  CPU MHz: 1996.300
  BogoMIPS: 3992.60
  Hypervisor vendor: *****
  Virtualization type: full
  L1d cache: 32K
  L1i cache: 64K
  L2 cache: 512K
  L3 cache: 8192K
  NUMA node0 CPU(s): 0-7
  NUMA node1 CPU(s): 8-15
- Commit id (e.g. a3ffc7d8): Not sure how to get this
How To Reproduce (required)
Steps to reproduce the behavior:
- Create a Nebula cluster:
  - graphd VM count: 3 (16 vCPU, 128 GB, 2 x 2 TB Premium SSD NVMe disks)
  - metad VM count: 3 (16 vCPU, 128 GB, 2 x 2 TB Premium SSD NVMe disks)
  - storaged VM count: 9 (16 vCPU, 128 GB, 2 x 2 TB Premium SSD NVMe disks)
- Create a space with vid_type INT64, partition_num 200, and replica_factor 2
- Initially, ingest data using spark-connector (vertices: 50 million; edges: 3.3 billion)
- Run 'SUBMIT JOB COMPACT'; it executed successfully
- Verify that the cluster is healthy and that the physical size of data and logs on disk is not significant
- Ingest more data, observe cluster becomes unstable
Expected behavior
Cluster should be stable for at least 1 billion vertices and 50 billion edges
Additional context
- We did not run BALANCE DATA
- We created a massive cluster to rule out storage and memory issues
cc @Sophie-Xie
Is the replica factor 2? We suggest using an odd replica factor; please try again with replica factor 3 or 1.
Thanks for the reply. I will try with replica factor 3 and report back. Which vid_type is better, INT64 or STRING? Also, does partition_num 200 seem okay?
INT64 and STRING are both ok, I think. Partition_num 200 is ok.
After setting the replica factor to 3, the cluster seems to be more stable.
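For readers wondering why an odd replica factor was suggested: the explanation below is general majority-quorum arithmetic as used by Raft-style replication, not something stated in this thread. The helper names `raft_quorum` and `tolerated_failures` are made up for this sketch. With 2 replicas a majority is 2, so a partition cannot make progress if either replica is unavailable; with 3 replicas a majority is still 2, so one replica can be down.

```python
def raft_quorum(replicas: int) -> int:
    """Smallest number of replicas that forms a strict majority."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return replicas // 2 + 1

def tolerated_failures(replicas: int) -> int:
    """How many replicas can be unavailable while a majority remains."""
    return replicas - raft_quorum(replicas)

if __name__ == "__main__":
    for n in (1, 2, 3, 4, 5):
        print(f"replica_factor={n}: quorum={raft_quorum(n)}, "
              f"tolerated failures={tolerated_failures(n)}")
```

Under this arithmetic, a replica factor of 2 needs twice the storage of a replica factor of 1 but tolerates no more failures, and an even factor of 4 tolerates no more failures than 3.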