[Bug]: [Nightly] [Add Field]Add collection field frequently failed for timeout(120s)
Is there an existing issue for this?
- [x] I have searched the existing issues
Environment
- Milvus version: 9873e0e
- Deployment mode(standalone or cluster):cluster
- MQ type(rocksmq, pulsar or kafka): kafka
- SDK version(e.g. pymilvus v2.0.0rc2):2.6.0rc139
- OS(Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:
Current Behavior
Adding a collection field frequently fails with a timeout (120 s).
```
[2025-06-12 21:48:58 - DEBUG - ci_test]: (api_request) : [MilvusClient.add_collection_field] args: ['client_index_1tHyxUwl', 'field_int', <DataType.INT32: 4>, ''], kwargs: {'nullable': True, 'timeout': 120} (api_request.py:52)
[2025-06-12 21:50:58 - ERROR - ci_test]: Traceback (most recent call last):
  File "/milvus/tests/python_client/utils/api_request.py", line 22, in inner_wrapper
    res = func(*args, **_kwargs)
  File "/milvus/tests/python_client/utils/api_request.py", line 53, in api_request
    return func(*arg, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/pymilvus/milvus_client/milvus_client.py", line 1143, in add_collection_field
    conn.add_collection_field(
  File "/usr/local/lib/python3.10/site-packages/pymilvus/decorators.py", line 155, in handler
    raise e from e
  File "/usr/local/lib/python3.10/site-packages/pymilvus/decorators.py", line 137, in handler
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/pymilvus/decorators.py", line 173, in handler
    return func(self, *args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/pymilvus/decorators.py", line 90, in handler
    raise e from e
  File "/usr/local/lib/python3.10/site-packages/pymilvus/decorators.py", line 86, in handler
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/site-packages/pymilvus/client/grpc_handler.py", line 337, in add_collection_field
    status = self._stub.AddCollectionField(
  File "/usr/local/lib/python3.10/site-packages/grpc/_interceptor.py", line 277, in __call__
    response, ignored_call = self._with_call(
  File "/usr/local/lib/python3.10/site-packages/grpc/_interceptor.py", line 332, in _with_call
    return call.result(), call
  File "/usr/local/lib/python3.10/site-packages/grpc/_channel.py", line 440, in result
    raise self
  File "/usr/local/lib/python3.10/site-packages/grpc/_interceptor.py", line 315, in continuation
    response, call = self._thunk(new_method).with_call(
  File "/usr/local/lib/python3.10/site-packages/grpc/_interceptor.py", line 343, in with_call
    return self._with_call(
  File "/usr/local/lib/python3.10/site-packages/grpc/_interceptor.py", line 332, in _with_call
    return call.result(), call
  File "/usr/local/lib/python3.10/site-packages/grpc/_channel.py", line 440, in result
    raise self
  File "/usr/local/lib/python3.10/site-packages/grpc/_interceptor.py", line 315, in continuation
    response, call = self._thunk(new_method).with_call(
  File "/usr/local/lib/python3.10/site-packages/grpc/_interceptor.py", line 343, in with_call
    return self._with_call(
  File "/usr/local/lib/python3.10/site-packages/grpc/_interceptor.py", line 332, in _with_call
    return call.result(), call
  File "/usr/local/lib/python3.10/site-packages/grpc/_channel.py", line 440, in result
    raise self
  File "/usr/local/lib/python3.10/site-packages/grpc/_interceptor.py", line 315, in continuation
    response, call = self._thunk(new_method).with_call(
  File "/usr/local/lib/python3.10/site-packages/grpc/_channel.py", line 1198, in with_call
    return _end_unary_response_blocking(state, call, True, None)
  File "/usr/local/lib/python3.10/site-packages/grpc/_channel.py", line 1006, in _end_unary_response_blocking
    raise _InactiveRpcError(state) # pytype: disable=not-instantiable
grpc._channel._InactiveRpcError: <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "UNKNOWN:Error received from peer {created_time:"2025-06-12T21:50:58.025800373+00:00", grpc_status:4, grpc_message:"Deadline Exceeded"}"
>
(api_request.py:35)
[2025-06-12 21:50:58 - ERROR - ci_test]: (api_response) : <_InactiveRpcError of RPC that terminated with:
	status = StatusCode.DEADLINE_EXCEEDED
	details = "Deadline Exceeded"
	debug_error_string = "UNKNOWN:Error received from peer {created_time:"2025-06-12T21:50:58.025800373+00:00", grpc_status:4, grpc_message:"Deadline Exceeded"}"
> (api_request.py:36)
```
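The failure mode in the trace is a server call that never returns until the client's deadline fires. The pattern can be sketched in plain Python (illustrative names only; this is not pymilvus or grpc code, and the short timeout stands in for the real 120 s deadline):

```python
import concurrent.futures
import time

def blocked_rpc():
    # Stands in for an AddCollectionField RPC that never returns because
    # the broadcast ack is stuck on the server side.
    time.sleep(1.0)

def call_with_deadline(timeout):
    # Client-side deadline enforcement, analogous to the gRPC deadline
    # that produced DEADLINE_EXCEEDED in the log above.
    with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
        fut = pool.submit(blocked_rpc)
        try:
            fut.result(timeout=timeout)
            return "ok"
        except concurrent.futures.TimeoutError:
            return "deadline exceeded"

print(call_with_deadline(timeout=0.2))  # the RPC outlives the deadline
```

Whether the server eventually finishes is irrelevant to the client: once the deadline passes, the call surfaces as StatusCode.DEADLINE_EXCEEDED.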
Expected Behavior
The field is added successfully.
Steps To Reproduce
Milvus Log
- link: https://jenkins.milvus.io:18080/blue/organizations/jenkins/Milvus%20Nightly%20CI(new)/detail/master/379/pipeline/109
- log: artifacts-milvus-distributed-kafka-mdk-master-379-py-n-379-e2e-logs.tar.gz
Anything else?
No response
In the Nightly CI environment, a field cannot be created within 120 seconds, which is not expected. However, it runs normally in my local environment. @congqixia
/assign @congqixia
Take test_milvus_client_index_auto_index as an example: add_collection_field always times out in the nightly environment.
https://grafana-4am.zilliz.cc/explore?orgId=1&panes=%7B%22MaO%22:%7B%22datasource%22:%22vhI6Vw67k%22,%22queries%22:%5B%7B%22refId%22:%22A%22,%22expr%22:%22%7Bcluster%3D%5C%224am%5C%22,namespace%3D%5C%22milvus-ci%5C%22,pod%3D~%5C%22mdw-master-382-py-n-milvus.%2A%5C%22%7D%7C%3D%20%5C%22WARN%5C%22%22,%22datasource%22:%7B%22type%22:%22loki%22,%22uid%22:%22vhI6Vw67k%22%7D,%22editorMode%22:%22code%22,%22queryType%22:%22range%22%7D%5D,%22range%22:%7B%22from%22:%221750021980000%22,%22to%22:%221750022115000%22%7D%7D%7D&schemaVersion=1
/unassign
After some digging, I found that the add_collection_field request was blocked in the broadcast procedure.
Discussed offline with @chyezh: this problem is caused by the overly sensitive WAL balancer. In the nightly runs, many collections are created and dropped, causing frequent WAL rebalancing, which prevents the broadcast message from being acked.
I believe @chyezh is working on this issue.
A schema change is a DDL operation, so it acquires the rootcoord (rc) lock.
The broadcast operation relies on the flusher of the underlying WAL on the streaming node.
The flusher recovery uses GetChannelRecoveryInfo, and GetChannelRecoveryInfo requests DescribeCollectionInternal, which also acquires the rc lock.
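The blocking chain described above can be sketched as a small simulation (all names are illustrative, not the actual Milvus Go code): the DDL holds the rc lock while waiting for the broadcast ack, but the ack path itself needs the same lock, so the ack never arrives.

```python
import threading

rc_lock = threading.Lock()            # stands in for the rootcoord lock
broadcast_acked = threading.Event()   # set once the broadcast is acked

def flusher_recovery():
    # GetChannelRecoveryInfo -> DescribeCollectionInternal: needs rc_lock.
    # Short timeout here only so the demo terminates; the real code blocks.
    if rc_lock.acquire(timeout=0.2):
        rc_lock.release()
        broadcast_acked.set()

def add_collection_field(timeout):
    with rc_lock:  # schema change (DDL) takes the rc lock first
        t = threading.Thread(target=flusher_recovery)
        t.start()
        acked = broadcast_acked.wait(timeout)  # wait for the broadcast ack
        t.join()
        return acked

# The DDL waits for an ack that requires the lock it is holding,
# so the wait fails, mirroring the 120 s DEADLINE_EXCEEDED.
print(add_collection_field(timeout=0.5))  # -> False
```

This is the classic shape of a lock-ordering deadlock; here it only resolves because the client-side deadline eventually fires.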
```
[2025/07/09 21:43:14.693 +00:00] [INFO] [datacoord/services.go:1006] ["get channel recovery info request received"] [traceID=311fdffcd3d4c4d2c1f684aada4839fb] [vchannel=by-dev-rootcoord-dml_12_459301765154612047v0]
[2025/07/09 21:45:12.604 +00:00] [INFO] [datacoord/services.go:1042] ["datacoord get channel recovery info"] [traceID=311fdffcd3d4c4d2c1f684aada4839fb] [vchannel=by-dev-rootcoord-dml_12_459301765154612047v0] [channel=by-dev-rootcoord-dml_12_459301765154612047v0] ["# of unflushed segments"=0] ["# of flushed segments"=0] ["# of dropped segments"=0] ["# of indexed segments"=0] ["# of l0 segments"=0] ["# of segments not created by streaming"=0]
```
So the broadcast operation blocks until the add-schema operation times out.
I'll fix it by moving the broadcast ack from the flusher to the recovery storage (which doesn't use any coord RPC).
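In terms of the same toy simulation (illustrative names, not the actual fix code), the fix amounts to acking from a path that never touches the rc lock, so the DDL's wait completes instead of timing out:

```python
import threading

rc_lock = threading.Lock()            # stands in for the rootcoord lock
broadcast_acked = threading.Event()   # set once the broadcast is acked

def recovery_storage_ack():
    # No DescribeCollectionInternal, no coord RPC, no rc_lock:
    # the recovery storage can ack immediately.
    broadcast_acked.set()

def add_collection_field(timeout):
    with rc_lock:  # the DDL still holds the rootcoord lock
        t = threading.Thread(target=recovery_storage_ack)
        t.start()
        acked = broadcast_acked.wait(timeout)  # ack arrives promptly
        t.join()
        return acked

print(add_collection_field(timeout=0.5))  # -> True
```

Breaking the cycle on the ack side, rather than weakening the DDL lock, keeps the schema-change semantics unchanged.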
/assign @chyezh
should be fixed by #43235
/assign @NicoYuan1986 Please help verify it.
Fixed!