[Bug]: [laion1b-test] mixcoord panic: runtime error: invalid memory address or nil pointer dereference

Open ThreadDao opened this issue 1 year ago • 3 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: cardinal-milvus-io-2.3-b2d3278-20240206
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): pulsar   
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus 2.3.6rc3

- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

  1. Create collection laion_stable_3 with 64 num_partitions (partition-key field). Its schema is:
{'auto_id': False, 'description': '', 'fields': [{'name': 'id', 'description': '', 'type': <DataType.INT64: 5>, 'is_primary': True, 'auto_id': False}, {'name': 'float_vector', 'description': '', 'type': <DataType.FLOAT_VECTOR: 101>, 'params': {'dim': 768}}, {'name': 'int64_pk_5b', 'description': '', 'type': <DataType.INT64: 5>, 'is_partition_key': True}, {'name': 'varchar_caption', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 8192}}, {'name': 'varchar_NSFW', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 8192}}, {'name': 'float64_similarity', 'description': '', 'type': <DataType.FLOAT: 10>}, {'name': 'int64_width', 'description': '', 'type': <DataType.INT64: 5>}, {'name': 'int64_height', 'description': '', 'type': <DataType.INT64: 5>}, {'name': 'int64_original_width', 'description': '', 'type': <DataType.INT64: 5>}, {'name': 'int64_original_height', 'description': '', 'type': <DataType.INT64: 5>}, {'name': 'varchar_md5', 'description': '', 'type': <DataType.VARCHAR: 21>, 'params': {'max_length': 8192}}], 'enable_dynamic_field': True}
  2. Build an HNSW index and load the collection
  3. Insert 50m-768d data and flush -> index again -> load again
  4. Concurrent: insert + delete + flush + search + query
 'concurrent_params': {'concurrent_number': 100,
                       'during_time': '10h',
                       'interval': 120,
                       'spawn_rate': None},
 'concurrent_tasks': [{'type': 'query',
                       'weight': 4,
                       'params': {'expr': '50000000 '
                                          '< '
                                          'id '
                                          '< '
                                          '5010000',
                                  'timeout': 1200}},
                      {'type': 'search',
                       'weight': 25,
                       'params': {'nq': 10,
                                  'top_k': 100,
                                  'random_data': True,
                                  'search_param': {'ef': 100},
                                  'timeout': 600}},
                      {'type': 'insert',
                       'weight': 10,
                       'params': {'nb': 200,
                                  'start_id': 50000000,
                                  'random_id': True,
                                  'random_vector': True,
                                  'timeout': 600}},
                      {'type': 'delete',
                       'weight': 10,
                       'params': {'delete_length': 100,
                                  'timeout': 600}},
                      {'type': 'flush',
                       'weight': 1,
                       'params': {'timeout': 600}}]},
  5. Problems:
a. mixcoord panic (mc_qqg4d_pre_panic.log):
\":\"files/insert_log/447619508238296599/447619508238296630/447619508291373964/102/447619508290724826\",\"log_size\":1584}]}]"]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x3caf1ee]

goroutine 26463 [running]:
panic({0x4b333a0, 0x74ac860})
    /usr/local/go/src/runtime/panic.go:987 +0x3bb fp=0xc021134090 sp=0xc021133fd0 pc=0x1a99cdb
runtime.panicmem(...)
    /usr/local/go/src/runtime/panic.go:260
runtime.sigpanic()
    /usr/local/go/src/runtime/signal_unix.go:841 +0x37d fp=0xc0211340f0 sp=0xc021134090 pc=0x1ab1fdd
github.com/milvus-io/milvus/internal/distributed/datanode/client.wrapGrpcCall[...]({0x56e7a78, 0xc033927f50?}, 0x0, 0xc036ab76c0)
    /go/src/github.com/milvus-io/milvus/internal/distributed/datanode/client/client.go:90 +0xae fp=0xc021134128 sp=0xc0211340f0 pc=0x3caf1ee
github.com/milvus-io/milvus/internal/distributed/datanode/client.(*Client).GetMetrics(0xc010e1c120?, {0x56e7a78?, 0xc033927f50}, 0x40b0000000000000?, {0x652?, 0x27?, 0x26?})
    /go/src/github.com/milvus-io/milvus/internal/distributed/datanode/client/client.go:168 +0x107 fp=0xc021134188 sp=0xc021134128 pc=0x3ca6b27
github.com/milvus-io/milvus/internal/datacoord.(*Server).getDataNodeMetrics(_, {_, _}, _, _)
    /go/src/github.com/milvus-io/milvus/internal/datacoord/metrics_info.go:154 +0x134 fp=0xc021134418 sp=0xc021134188 pc=0x3deadf4
github.com/milvus-io/milvus/internal/datacoord.(*Server).getSystemInfoMetrics(0xc00138f8c0, {0x56e7a78, 0xc033927f50}, 0x0?)
    /go/src/github.com/milvus-io/milvus/internal/datacoord/metrics_info.go:63 +0x1d8 fp=0xc021134ed8 sp=0xc021134418 pc=0x3dea1d8
github.com/milvus-io/milvus/internal/datacoord.(*Server).GetMetrics(0xc00138f8c0, {0x56e7a78, 0xc033927f50}, 0xc033994f80)
    /go/src/github.com/milvus-io/milvus/internal/datacoord/services.go:987 +0x1b8 fp=0xc0211353f8 sp=0xc021134ed8 pc=0x3e10278
github.com/milvus-io/milvus/internal/distributed/datacoord.(*Server).GetMetrics(0xc0327af000?, {0x56e7a78?, 0xc033927f50?}, 0xc03399a338?)
    /go/src/github.com/milvus-io/milvus/internal/distributed/datacoord/service.go:352 +0x2f fp=0xc021135428 sp=0xc0211353f8 pc=0x3e2cb2f
github.com/milvus-io/milvus/internal/proto/datapb._DataCoord_GetMetrics_Handler.func1({0x56e7a78, 0xc033927f50}, {0x4e7eb40?, 0xc033994f80})
    /go/src/github.com/milvus-io/milvus/internal/proto/datapb/data_coord.pb.go:6806 +0x7b fp=0xc021135468 sp=0xc021135428 pc=0x2947afb
b. Many flush requests hit the 120s timeout.

c. datanode laion1b-test-2-milvus-datanode-5989b844f5-pw5sg was OOMKilled, and laion1b-test-2-milvus-datanode-5989b844f5-zm6lp terminated with status Error (exit code 1).

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

laion1b-test-2-etcd-0                                             1/1     Running       1 (40d ago)      67d
laion1b-test-2-etcd-1                                             1/1     Running       0                67d
laion1b-test-2-etcd-2                                             1/1     Running       0                67d
laion1b-test-2-milvus-datanode-5989b844f5-pw5sg                   1/1     Running       121 (10h ago)    14d
laion1b-test-2-milvus-datanode-5989b844f5-zm6lp                   1/1     Running       116 (10h ago)    14d
laion1b-test-2-milvus-indexnode-7bb59785b5-46clt                  1/1     Running       0                14d
laion1b-test-2-milvus-indexnode-7bb59785b5-7hfxj                  1/1     Running       0                14d
laion1b-test-2-milvus-indexnode-7bb59785b5-gfp8l                  1/1     Running       0                14d
laion1b-test-2-milvus-indexnode-7bb59785b5-jlrnw                  1/1     Running       0                14d
laion1b-test-2-milvus-indexnode-7bb59785b5-knxwn                  1/1     Running       0                14d
laion1b-test-2-milvus-mixcoord-868c566c7c-qqg4d                   1/1     Running       22 (10h ago)     14d
laion1b-test-2-milvus-proxy-64b6d7787-zck8d                       1/1     Running       1 (14d ago)      14d
laion1b-test-2-milvus-querynode-1-6f8889c79b-2j46t                1/1     Running       0                14d
laion1b-test-2-milvus-querynode-1-6f8889c79b-jfbtf                1/1     Running       0                14d
laion1b-test-2-milvus-querynode-1-6f8889c79b-nv75z                1/1     Running       0                14d
laion1b-test-2-milvus-querynode-1-6f8889c79b-w5dg2                1/1     Running       0                14d
laion1b-test-2-pulsar-bookie-0                                    1/1     Running       0                67d
laion1b-test-2-pulsar-bookie-1                                    1/1     Running       0                44d
laion1b-test-2-pulsar-bookie-2                                    1/1     Running       0                67d
laion1b-test-2-pulsar-broker-0                                    1/1     Running       0                61d
laion1b-test-2-pulsar-proxy-0                                     1/1     Running       0                67d
laion1b-test-2-pulsar-recovery-0                                  1/1     Running       0                67d
laion1b-test-2-pulsar-zookeeper-0                                 1/1     Running       0                67d
laion1b-test-2-pulsar-zookeeper-1                                 1/1     Running       0                67d
laion1b-test-2-pulsar-zookeeper-2                                 1/1     Running       0                67d

Anything else?

No response

ThreadDao avatar Feb 21 '24 04:02 ThreadDao

/assign @xiaocai2333

xiaofan-luan avatar Feb 21 '24 04:02 xiaofan-luan

/unassign

yanliang567 avatar Feb 21 '24 09:02 yanliang567

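Bug: a nil struct pointer is used to represent a nil interface. NewClient returns a *Client, and defaultSessionCreator returns that value directly as a types.DataNodeClient. When NewClient fails and returns a nil *Client, the resulting interface value is not nil (it holds a nil pointer of type *Client), so a later nil check on the interface passes.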

func defaultSessionCreator() dataNodeCreatorFunc {
	return func(ctx context.Context, addr string, nodeID int64) (types.DataNodeClient, error) {
		return grpcdatanodeclient.NewClient(ctx, addr, nodeID) // default
	}
}

func NewClient(ctx context.Context, addr string, nodeID int64) (*Client, error) {
...
}
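
The Go behavior behind this can be reproduced in isolation. The following is a minimal sketch, not the Milvus code: `DataNodeClient`, `Client`, `NewClient`, `buggyCreator`, and `fixedCreator` here are simplified, hypothetical stand-ins. It shows that forwarding a nil `*Client` through an interface return type yields a non-nil interface, and that checking the error before converting avoids it. Whether the actual fix in Milvus takes this shape is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// DataNodeClient is a hypothetical stand-in for types.DataNodeClient.
type DataNodeClient interface {
	GetMetrics() string
}

// Client is a hypothetical stand-in for the concrete gRPC client type.
type Client struct{ addr string }

func (c *Client) GetMetrics() string { return "metrics from " + c.addr }

// NewClient mimics a constructor that returns a nil *Client on failure.
func NewClient(addr string) (*Client, error) {
	if addr == "" {
		return nil, errors.New("empty address")
	}
	return &Client{addr: addr}, nil
}

// buggyCreator forwards both results directly. On failure the nil *Client
// is converted to a non-nil interface value that wraps a nil pointer.
func buggyCreator(addr string) (DataNodeClient, error) {
	return NewClient(addr)
}

// fixedCreator checks the error first and returns an untyped nil interface.
func fixedCreator(addr string) (DataNodeClient, error) {
	c, err := NewClient(addr)
	if err != nil {
		return nil, err
	}
	return c, nil
}

func main() {
	buggy, _ := buggyCreator("")
	fixed, _ := fixedCreator("")
	fmt.Println(buggy == nil) // false: the interface holds a typed nil pointer
	fmt.Println(fixed == nil) // true: the interface itself is nil
}
```

With the buggy variant, a caller that ignores the error and only checks `client != nil` would go on to call a method on a nil `*Client`, and any field access inside that method would panic with a nil pointer dereference.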

chyezh avatar Feb 22 '24 06:02 chyezh

The issue did not appear again.

ThreadDao avatar Feb 29 '24 06:02 ThreadDao