Cluster (v3.2.0) becomes unstable when more data is ingested into an existing space

Open porscheme opened this issue 2 years ago • 4 comments

Please check the FAQ documentation before raising an issue

Describe the bug (required)

  • The cluster becomes unstable beyond this data volume: 50 million vertices and 3.3 billion edges

  • We are seeing this in our TEST cluster

I20220915 16:28:09.498340   102 NebulaSnapshotManager.cpp:67] Space 61 Part 34 start send snapshot of commitLogId 89598562 commitLogTerm 2, rate limited to 10485760, batch size is 1048576
I20220915 16:28:09.498370   102 SnapshotManager.cpp:51] [Port: 9780, Space: 61, Part: 34] Snapshot send failed, the leader changed?
I20220915 16:28:09.498400   102 Host.cpp:355] [Port: 9780, Space: 61, Part: 34] [Host: nebula-cluster-storaged-1.nebula-cluster-storaged-headless.cohort-search.svc.cluster.local:9780] Send snapshot failed!
I20220915 16:28:09.498530    53 Host.cpp:337] [Port: 9780, Space: 61, Part: 115] [Host: nebula-cluster-storaged-1.nebula-cluster-storaged-headless.cohort-search.svc.cluster.local:9780] Can't find log 1 in wal, send the snapshot, logIdToSend = 32549073, firstLogId in wal = 32538198, lastLogId in wal = 32549073
I20220915 16:28:09.498550    66 Host.cpp:337] [Port: 9780, Space: 61, Part: 124] [Host: nebula-cluster-storaged-1.nebula-cluster-storaged-headless.cohort-search.svc.cluster.local:9780] Can't find log 1 in wal, send the snapshot, logIdToSend = 97588614, firstLogId in wal = 97582907, lastLogId in wal = 97588614
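
To see which partitions are affected, the "Can't find log ... in wal" lines can be summarized with a small script. The following is a minimal sketch, assuming the log format stays exactly as quoted above; the function name `wal_gaps` and the regular expression are illustrative and not part of NebulaGraph. In both quoted lines the requested log ID (1) is far below the first log ID still held in the WAL, which is why the sender falls back to a snapshot.

```python
import re

# Matches the "Can't find log N in wal" lines quoted in this issue.
# Assumption: the field order and wording are stable across storaged logs.
WAL_GAP = re.compile(
    r"Space: (?P<space>\d+), Part: (?P<part>\d+)\].*?"
    r"Can't find log (?P<wanted>\d+) in wal.*?"
    r"firstLogId in wal = (?P<first>\d+), lastLogId in wal = (?P<last>\d+)"
)

def wal_gaps(lines):
    """Yield (space, part, wanted, first, last) for each WAL-gap log line."""
    for line in lines:
        m = WAL_GAP.search(line)
        if m:
            yield tuple(
                int(m.group(k)) for k in ("space", "part", "wanted", "first", "last")
            )

sample = [
    "I20220915 16:28:09.498370   102 SnapshotManager.cpp:51] [Port: 9780, Space: 61, Part: 34] Snapshot send failed, the leader changed?",
    "I20220915 16:28:09.498530    53 Host.cpp:337] [Port: 9780, Space: 61, Part: 115] [Host: nebula-cluster-storaged-1.nebula-cluster-storaged-headless.cohort-search.svc.cluster.local:9780] Can't find log 1 in wal, send the snapshot, logIdToSend = 32549073, firstLogId in wal = 32538198, lastLogId in wal = 32549073",
    "I20220915 16:28:09.498550    66 Host.cpp:337] [Port: 9780, Space: 61, Part: 124] [Host: nebula-cluster-storaged-1.nebula-cluster-storaged-headless.cohort-search.svc.cluster.local:9780] Can't find log 1 in wal, send the snapshot, logIdToSend = 97588614, firstLogId in wal = 97582907, lastLogId in wal = 97588614",
]

for space, part, wanted, first, last in wal_gaps(sample):
    # A requested ID below firstLogId cannot be served from the WAL.
    print(f"space={space} part={part} wanted={wanted} wal=[{first}, {last}]")
```

Running this over a full storaged log would show how many partitions are repeatedly falling back to snapshots, which helps distinguish a few lagging replicas from a cluster-wide problem.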

Your Environments (required)

  • OS: 18.04.1-Ubuntu x86_64 GNU/Linux
  • Compiler: Using docker images
  • CPU (lscpu output):
      Architecture: x86_64
      CPU op-mode(s): 32-bit, 64-bit
      Byte Order: Little Endian
      CPU(s): 16
      On-line CPU(s) list: 0-15
      Thread(s) per core: 2
      Core(s) per socket: 8
      Socket(s): 1
      NUMA node(s): 2
      Vendor ID: AuthenticAMD
      CPU family: 23
      Model: 1
      Model name: AMD EPYC 7551 32-Core Processor
      Stepping: 2
      CPU MHz: 1996.300
      BogoMIPS: 3992.60
      Hypervisor vendor: *****
      Virtualization type: full
      L1d cache: 32K
      L1i cache: 64K
      L2 cache: 512K
      L3 cache: 8192K
      NUMA node0 CPU(s): 0-7
      NUMA node1 CPU(s): 8-15
  • Commit id (e.g. a3ffc7d8): not sure how to get this

How To Reproduce(required)

Steps to reproduce the behavior:

  1. Create a Nebula cluster:
       • graphd VM count: 3 (16 vCPU, 128 GB, 2 x 2 TB Premium SSD NVMe disks)
       • metad VM count: 3 (16 vCPU, 128 GB, 2 x 2 TB Premium SSD NVMe disks)
       • storaged VM count: 9 (16 vCPU, 128 GB, 2 x 2 TB Premium SSD NVMe disks)
  2. Create a space with vid_type INT64, partition_num 200, and replica_factor 2
  3. Initially, ingest data using spark-connector: 50 million vertices and 3.3 billion edges
  4. Run 'SUBMIT JOB COMPACT'; it executes successfully
  5. Verify that the cluster is healthy and that the physical size of the data and logs on disk is not significant
  6. Ingest more data and observe that the cluster becomes unstable
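
For anyone trying to reproduce steps 2 and 4, the statements can be generated as in the sketch below. It assumes the standard nGQL `CREATE SPACE` and `SUBMIT JOB COMPACT` syntax of NebulaGraph 3.x; the space name `cohort_test` and the helper `create_space_stmt` are hypothetical, since the issue does not give the real space name.

```python
def create_space_stmt(name, partition_num, replica_factor, vid_type):
    """Build an nGQL CREATE SPACE statement from the given options."""
    return (
        f"CREATE SPACE IF NOT EXISTS {name} "
        f"(partition_num = {partition_num}, "
        f"replica_factor = {replica_factor}, "
        f"vid_type = {vid_type});"
    )

# Values from step 2 of the report; "cohort_test" is a placeholder name.
print(create_space_stmt("cohort_test", 200, 2, "INT64"))

# Step 4: trigger a full compaction after the initial ingest.
print("SUBMIT JOB COMPACT;")
```

The printed statements can be pasted into the console; changing the `replica_factor` argument makes it easy to recreate the space with a different replica count.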

Expected behavior

Cluster should be stable for at least 1 billion vertices and 50 billion edges

Additional context

  • We did not run BALANCE DATA
  • We created a large cluster to rule out storage and memory constraints

porscheme, Sep 22 '22 04:09

cc @Sophie-Xie

wey-gu, Sep 22 '22 04:09

The replica factor is 2? We suggest using an odd replica factor; please try again with a replica factor of 3 or 1.

liwenhui-soul, Sep 22 '22 04:09
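
A likely reason behind the odd-number suggestion is majority voting: NebulaGraph replicates each partition with Raft, and a Raft group needs a strict majority of replicas to elect a leader and commit writes. The arithmetic below is a sketch of that general rule, not code taken from NebulaGraph; the function names are illustrative.

```python
def raft_quorum(replicas):
    """Smallest number of replicas that forms a strict majority."""
    return replicas // 2 + 1

def tolerated_failures(replicas):
    """Replicas that can be unavailable while a majority still exists."""
    return replicas - raft_quorum(replicas)

for n in (1, 2, 3):
    print(f"replicas={n} quorum={raft_quorum(n)} "
          f"tolerated_failures={tolerated_failures(n)}")
# replicas=1 quorum=1 tolerated_failures=0
# replicas=2 quorum=2 tolerated_failures=0
# replicas=3 quorum=2 tolerated_failures=1
```

With 2 replicas both must be reachable for every write, so a single slow or restarting storaged host stalls the partition; 3 replicas cost one more copy but tolerate one failure.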

Thanks for the reply. I will try with replica factor 3 and report back. Which Vid Type is better, INT64 or STRING? Also, does a partition_num of 200 seem okay?

porscheme, Sep 22 '22 06:09

INT64 and STRING are both ok, I think. Partition_num 200 is ok.

liwenhui-soul, Sep 22 '22 09:09

After setting the replica factor to 3, the cluster seems to be more stable.

porscheme, Oct 04 '22 05:10