milvus icon indicating copy to clipboard operation
milvus copied to clipboard

[Bug]: [Nightly] [Add Field]Add collection field frequently failed for timeout(120s)

Open NicoYuan1986 opened this issue 5 months ago • 4 comments

Is there an existing issue for this?

  • [x] I have searched the existing issues

Environment

- Milvus version: 9873e0e
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka):    kafka
- SDK version(e.g. pymilvus v2.0.0rc2):2.6.0rc139
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

Add collection field frequently failed for timeout(120s).

[2025-06-12 21:48:58 - DEBUG - ci_test]: (api_request)  : [MilvusClient.add_collection_field] args: ['client_index_1tHyxUwl', 'field_int', <DataType.INT32: 4>, ''], kwargs: {'nullable': True, 'timeout': 120} (api_request.py:52)
[2025-06-12 21:50:58 - ERROR - ci_test]: Traceback (most recent call last):
  File "/milvus/tests/python_client/utils/api_request.py", line 22, in inner_wrapper
    res = func(*args, **_kwargs)
  File "/milvus/tests/python_client/utils/api_request.py", line 53, in api_request
    return func(*arg, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/pymilvus/milvus_client/milvus_client.py", line 1143, in add_collection_field
    conn.add_collection_field(
  File "/usr/local/lib/python3.10/site-packages/pymilvus/decorators.py", line 155, in handler
    raise e from e
  File "/usr/local/lib/python3.10/site-packages/pymilvus/decorators.py", line 137, in handler
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/pymilvus/decorators.py", line 173, in handler
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/pymilvus/decorators.py", line 90, in handler
    raise e from e
  File "/usr/local/lib/python3.10/site-packages/pymilvus/decorators.py", line 86, in handler
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/pymilvus/client/grpc_handler.py", line 337, in add_collection_field
    status = self._stub.AddCollectionField(
  File "/usr/local/lib/python3.10/site-packages/grpc/_interceptor.py", line 277, in __call__
    response, ignored_call = self._with_call(
  File "/usr/local/lib/python3.10/site-packages/grpc/_interceptor.py", line 332, in _with_call
    return call.result(), call
  File "/usr/local/lib/python3.10/site-packages/grpc/_channel.py", line 440, in result
    raise self
  File "/usr/local/lib/python3.10/site-packages/grpc/_interceptor.py", line 315, in continuation
    response, call = self._thunk(new_method).with_call(
  File "/usr/local/lib/python3.10/site-packages/grpc/_interceptor.py", line 343, in with_call
    return self._with_call(
  File "/usr/local/lib/python3.10/site-packages/grpc/_interceptor.py", line 332, in _with_call
    return call.result(), call
  File "/usr/local/lib/python3.10/site-packages/grpc/_channel.py", line 440, in result
    raise self
  File "/usr/local/lib/python3.10/site-packages/grpc/_interceptor.py", line 315, in continuation
    response, call = self._thunk(new_method).with_call(
  File "/usr/local/lib/python3.10/site-packages/grpc/_interceptor.py", line 343, in with_call
    return self._with_call(
  File "/usr/local/lib/python3.10/site-packages/grpc/_interceptor.py", line 332, in _with_call
    return call.result(), call
  File "/usr/local/lib/python3.10/site-packages/grpc/_channel.py", line 440, in result
    raise self
  File "/usr/local/lib/python3.10/site-packages/grpc/_interceptor.py", line 315, in continuation
    response, call = self._thunk(new_method).with_call(
  File "/usr/local/lib/python3.10/site-packages/grpc/_channel.py", line 1198, in with_call
    return _end_unary_response_blocking(state, call, True, None)
  File "/usr/local/lib/python3.10/site-packages/grpc/_channel.py", line 1006, in _end_unary_response_blocking
    raise _InactiveRpcError(state)  # pytype: disable=not-instantiable
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "UNKNOWN:Error received from peer  {created_time:"2025-06-12T21:50:58.025800373+00:00", grpc_status:4, grpc_message:"Deadline Exceeded"}"
>
 (api_request.py:35)
[2025-06-12 21:50:58 - ERROR - ci_test]: (api_response) : <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "UNKNOWN:Error received from peer  {created_time:"2025-06-12T21:50:58.025800373+00:00", grpc_status:4, grpc_message:"Deadline Exceeded"}"
> (api_request.py:36)

Expected Behavior

add field successfully

Steps To Reproduce


Milvus Log

  1. link: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20Nightly%20CI(new)/detail/master/379/pipeline/109
  2. log: artifacts-milvus-distributed-kafka-mdk-master-379-py-n-379-e2e-logs.tar.gz

Anything else?

No response

NicoYuan1986 avatar Jun 13 '25 07:06 NicoYuan1986

In Nightly CI environment, a field cannot be created within 120 seconds, which is not in line with expectations. However, my local execution is normal. @congqixia

qixuan0212 avatar Jun 13 '25 07:06 qixuan0212

/assign @congqixia

qixuan0212 avatar Jun 13 '25 08:06 qixuan0212

Take test_milvus_client_index_auto_index as an example, add collection field on nightly environment always time out.

Image

Image https://grafana-4am.zilliz.cc/explore?orgId=1&panes=%7B%22MaO%22:%7B%22datasource%22:%22vhI6Vw67k%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bcluster%3D%5C%224am%5C%22,namespace%3D%5C%22milvus-ci%5C%22,pod%3D~%5C%22mdw-master-382-py-n-milvus.%2A%5C%22%7D%7C%3D%20%5C%22WARN%5C%22%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22vhI6Vw67k%22%7D,%22editorMode%22:%22code%22,%22queryType%22:%22range%22%7D%5D,%22range%22:%7B%22from%22:%221750021980000%22,%22to%22:%221750022115000%22%7D%7D%7D&schemaVersion=1

qixuan0212 avatar Jun 16 '25 03:06 qixuan0212

/unassign

qixuan0212 avatar Jun 16 '25 03:06 qixuan0212

After some digging, the add_collection_field request was blocked in broadcast procedure

Image

Discussed offline with @chyezh, this problem is caused by sensitive balancer of WAL. In nightly cases, lots of collections were created and dropped causing frequent balancing WAL, which cause broadcast message cannot be acked.

I believe that chyezh is working on this issue

congqixia avatar Jul 10 '25 08:07 congqixia

Schemachange is ddl, will acquire lock for rc. The broadcast operation will rely on the flusher of underlying wal on streamingnode. The flusher recovery will use the GetChannelRecoveryInfo, GetChannelRecoveryInfo will request DescribeCollectionInternal that acquire lock for rc.

[2025/07/09 21:43:14.693 +00:00] [INFO] [datacoord/services.go:1006] ["get channel recovery info request received"] [traceID=311fdffcd3d4c4d2c1f684aada4839fb] [vchannel=by-dev-rootcoord-dml_12_459301765154612047v0]
[2025/07/09 21:45:12.604 +00:00] [INFO] [datacoord/services.go:1042] ["datacoord get channel recovery info"] [traceID=311fdffcd3d4c4d2c1f684aada4839fb] [vchannel=by-dev-rootcoord-dml_12_459301765154612047v0] [channel=by-dev-rootcoord-dml_12_459301765154612047v0] ["# of unflushed segments"=0] ["# of flushed segments"=0] ["# of dropped segments"=0] ["# of indexed segments"=0] ["# of l0 segments"=0] ["# of segments not created by streaming"=0]

So the broadcast operation will block until the add schema operation timeout.

I'll fix it by move the ack of broadcast from flusher to recovery storage (which doesn't use any coord rpc).

/assign @chyezh

chyezh avatar Jul 10 '25 09:07 chyezh

should be fixed by #43235

chyezh avatar Jul 11 '25 02:07 chyezh

/assign @NicoYuan1986 please help to verify it

chyezh avatar Jul 14 '25 01:07 chyezh

Fixed!

NicoYuan1986 avatar Jul 14 '25 03:07 NicoYuan1986