[Bug]: Load failed: show collection failed: std::exception: service internal error: UnknownError
Is there an existing issue for this?
- [X] I have searched the existing issues
Environment
- Milvus version: cardinal-foxspy-fix_binlog_index_concurrency-1367e8e-20240114
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka): pulsar
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus 2.3.5rc5
- OS(Ubuntu or CentOS):
- CPU/Memory:
- GPU:
- Others:
Current Behavior
- Release-load collection sometimes fails with the following error:
[2024-01-19 16:06:05,774 - INFO - fouram]: [Base] Start load collection laion_coll_8,replica_number:1,kwargs:{} (base.py:311)
[2024-01-19 16:18:12,381 - ERROR - fouram]: RPC error: [get_loading_progress], <MilvusException: (code=65535, message=show collection failed: std::exception: service internal error: UnknownError)>, <Time:{'RPC start': '2024-01-19 16:18:12.378695', 'RPC error': '2024-01-19 16:18:12.381070'}> (decorators.py:128)
[2024-01-19 16:18:12,430 - ERROR - fouram]: RPC error: [wait_for_loading_collection], <MilvusException: (code=65535, message=show collection failed: std::exception: service internal error: UnknownError)>, <Time:{'RPC start': '2024-01-19 16:06:05.819916', 'RPC error': '2024-01-19 16:18:12.430141'}> (decorators.py:128)
[2024-01-19 16:18:12,431 - ERROR - fouram]: RPC error: [load_collection], <MilvusException: (code=65535, message=show collection failed: std::exception: service internal error: UnknownError)>, <Time:{'RPC start': '2024-01-19 16:06:05.774747', 'RPC error': '2024-01-19 16:18:12.430960'}> (decorators.py:128)
[2024-01-19 16:18:12,437 - ERROR - fouram]: (api_response) : <MilvusException: (code=65535, message=show collection failed: std::exception: service internal error: UnknownError)> (api_request.py:54)
Failed Argo workflows:
- laion1b-test-cron-1705680000
- laion1b-test-cron-1705766400

Sometimes the load succeeds:
{'segment_counts': 6, 'segment_total_vectors': 3063006}
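For reference, a minimal pymilvus sketch of the release/load cycle the test exercises (host, port, and the explicit progress check are assumptions; the actual runs go through the fouram harness):

```python
from pymilvus import connections, Collection, utility

# Host/port are placeholders for the qa-milvus cluster endpoint.
connections.connect(alias="default", host="127.0.0.1", port="19530")

coll = Collection("laion_coll_8")

# Release, then load again with one replica; this is the call that
# eventually fails with "show collection failed: ... UnknownError".
coll.release()
coll.load(replica_number=1)

# The client polls loading progress while waiting for the load to finish.
print(utility.loading_progress("laion_coll_8"))
```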
Expected Behavior
No response
Steps To Reproduce
No response
Milvus Log
grafana: https://grafana-4am.zilliz.cc/d/uLf5cJ3Ga/milvus2-0?orgId=1&var-datasource=prometheus&var-cluster=&var-namespace=qa-milvus&var-instance=laion1b-test-2&var-collection=All&var-app_name=milvus&from=1705767000000&to=1705768800000
Anything else?
No response
2024-01-22 00:17:04 | querynodequerynode4am-node13laion1b-test-2-milvus-querynode-848cf877ff-xjfkx | I20240121 16:17:04.713279 1996 Utils.cpp:735] [SEGCORE][LoadFieldDatasFromRemote][milvus] failed to load data from remote: Error in GetObjectSize[errcode:404, exception:, errmessage:No response body., params:params, bucket=bucket-laion2, object=files/insert_log/447031349293500360/447031349293500364/447031349321777298/106/447031349320567089]
The file cannot be found on the remote OSS:
files/insert_log/447031349293500360/447031349293500364/447031349321777298/106/447031349320567089
The C++ client reports a 404, but the Go client reports a context cancel.
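One way to double-check is to stat the reported binlog path directly against the bucket; a minimal sketch with the MinIO Python client (endpoint and credentials are placeholders):

```python
from minio import Minio
from minio.error import S3Error

# Endpoint and credentials are placeholders for the cluster's object storage.
client = Minio("minio:9000", access_key="<access-key>", secret_key="<secret-key>", secure=False)

path = ("files/insert_log/447031349293500360/447031349293500364/"
        "447031349321777298/106/447031349320567089")
try:
    info = client.stat_object("bucket-laion2", path)
    print("object exists,", info.size, "bytes")
except S3Error as e:
    # e.code is "NoSuchKey" if the object is really gone,
    # "AccessDenied" if the credentials cannot read it.
    print("stat failed:", e.code, e.message)
```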
2024-01-21 04:13:13 | querynodequerynodelaion1b-test-2-milvus-querynode-848cf877ff-qj7v7 | I20240121 04:13:13.683010 201749 Utils.cpp:735] [SEGCORE][LoadFieldDatasFromRemote][milvus] failed to load data from remote: Error in GetObjectSize[errcode:404, exception:, errmessage:No response body., params:params, bucket=bucket-laion2, object=files/insert_log/447031349293500360/447031349293500364/447031349321777298/106/447031349320567089]
2024-01-21 04:13:13 | datanodedatanodelaion1b-test-2-milvus-datanode-67dc4b9d99-668r5 | [2024/01/21 04:13:13.857 +00:00] [WARN] [storage/remote_chunk_manager.go:179] ["failed to read object"] [path=files/insert_log/447031349293500360/447031349293500364/447031349321777298/106/447031349320567089] [error="context deadline exceeded"]
2024-01-21 04:13:14 | querynodequerynodelaion1b-test-2-milvus-querynode-848cf877ff-qj7v7 | I20240121 04:13:14.685618 201769 Utils.cpp:735] [SEGCORE][LoadFieldDatasFromRemote][milvus] failed to load data from remote: Error in GetObjectSize[errcode:404, exception:, errmessage:No response body., params:params, bucket=bucket-laion2, object=files/insert_log/447031349293500360/447031349293500364/447031349321777298/106/447031349320567089]
2024-01-21 04:13:15 | querynodequerynodelaion1b-test-2-milvus-querynode-848cf877ff-qj7v7 | I20240121 04:13:15.706427 201765 Utils.cpp:735] [SEGCORE][LoadFieldDatasFromRemote][milvus] failed to load data from remote: Error in GetObjectSize[errcode:404, exception:, errmessage:No response body., params:params, bucket=bucket-laion2, object=files/insert_log/447031349293500360/447031349293500364/447031349321777298/106/447031349320567089]
A busy retry loop is used for GetObject on the Go side. The binlog is missing on MinIO, but the Go side still reports a context cancel instead of the real error.
Seems that the data was GCed for some reason?
/assign @bigsheeper
> Seems that the data was GCed for some reason?
It seems that the problem is not caused by GC: the file could not be read within about a minute after the meta was flushed, and no related GC log was found.
[2024/01/18 04:49:58.908 +00:00] [INFO] [datacoord/services.go:482] ["flush segment with meta"] [traceID=1716d6805ec4361dfe36d37de4d3424a] [nodeID=1167] [collectionID=447031349293500360] [segmentID=447031349321777298] [meta="[{\"fieldID\":108,\"binlogs\":[{\"entries_num\":27,\"timestamp_from\":447100580627742724,\"timestamp_to\":447100580627742724,\"log_path\":\"files/insert_log/447031349293500360/447031349293500364/447031349321777298/108/447031349320567091\",\"log_size\":216}]},{\"fieldID\":111,\"binlogs\":[{\"entries_num\":27,\"timestamp_from\":447100580627742724,\"timestamp_to\":447100580627742724,\"log_path\":\"files/insert_log/447031349293500360/447031349293500364/447031349321777298/111/447031349320567094\",\"log_size\":486}]},{\"fieldID\":1,\"binlogs\":[{\"entries_num\":27,\"timestamp_from\":447100580627742724,\"timestamp_to\":447100580627742724,\"log_path\":\"files/insert_log/447031349293500360/447031349293500364/447031349321777298/1/447031349320567096\",\"log_size\":216}]},{\"fieldID\":100,\"binlogs\":[{\"entries_num\":27,\"timestamp_from\":447100580627742724,\"timestamp_to\":447100580627742724,\"log_path\":\"files/insert_log/447031349293500360/447031349293500364/447031349321777298/100/447031349320567083\",\"log_size\":216}]},{\"fieldID\":101,\"binlogs\":[{\"entries_num\":27,\"timestamp_from\":447100580627742724,\"timestamp_to\":447100580627742724,\"log_path\":\"files/insert_log/447031349293500360/447031349293500364/447031349321777298/101/447031349320567084\",\"log_size\":82948}]},{\"fieldID\":104,\"binlogs\":[{\"entries_num\":27,\"timestamp_from\":447100580627742724,\"timestamp_to\":447100580627742724,\"log_path\":\"files/insert_log/447031349293500360/447031349293500364/447031349321777298/104/447031349320567087\",\"log_size\":648}]},{\"fieldID\":110,\"binlogs\":[{\"entries_num\":27,\"timestamp_from\":447100580627742724,\"timestamp_to\":447100580627742724,\"log_path\":\"files/insert_log/447031349293500360/447031349293500364/447031349321777298/110/447031349320567093\",\"log_size\":648}]},{\"binlogs\":[{\"entries_num\":27,\"timestamp_from\":447100580627742724,\"timestamp_to\":447100580627742724,\"log_path\":\"files/insert_log/447031349293500360/447031349293500364/447031349321777298/0/447031349320567095\",\"log_size\":216}]},{\"fieldID\":106,\"binlogs\":[{\"entries_num\":27,\"timestamp_from\":447100580627742724,\"timestamp_to\":447100580627742724,\"log_path\":\"files/insert_log/447031349293500360/447031349293500364/447031349321777298/106/447031349320567089\",\"log_size\":216}]},{\"fieldID\":107,\"binlogs\":[{\"entries_num\":27,\"timestamp_from\":447100580627742724,\"timestamp_to\":447100580627742724,\"log_path\":\"files/insert_log/447031349293500360/447031349293500364/447031349321777298/107/447031349320567090\",\"log_size\":216}]},{\"fieldID\":109,\"binlogs\":[{\"entries_num\":27,\"timestamp_from\":447100580627742724,\"timestamp_to\":447100580627742724,\"log_path\":\"files/insert_log/447031349293500360/447031349293500364/447031349321777298/109/447031349320567092\",\"log_size\":216}]},{\"fieldID\":103,\"binlogs\":[{\"entries_num\":27,\"timestamp_from\":447100580627742724,\"timestamp_to\":447100580627742724,\"log_path\":\"files/insert_log/447031349293500360/447031349293500364/447031349321777298/103/447031349320567086\",\"log_size\":648}]},{\"fieldID\":105,\"binlogs\":[{\"entries_num\":27,\"timestamp_from\":447100580627742724,\"timestamp_to\":447100580627742724,\"log_path\":\"files/insert_log/447031349293500360/447031349293500364/447031349321777298/105/447031349320567088\",\"log_size\":10
8}]},{\"fieldID\":102,\"binlogs\":[{\"entries_num\":27,\"timestamp_from\":447100580627742724,\"timestamp_to\":447100580627742724,\"log_path\":\"files/insert_log/447031349293500360/447031349293500364/447031349321777298/102/447031349320567085\",\"log_size\":216}]}]"]
I20240118 04:50:01.442193 158181 Utils.cpp:735] [SEGCORE][LoadFieldDatasFromRemote][milvus] failed to load data from remote: Error in GetObjectSize[errcode:404, exception:, errmessage:No response body., params:params, bucket=bucket-laion2, object=files/insert_log/447031349293500360/447031349293500364/447031349321777298/106/447031349320567089]
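To cross-check the flush meta above against the object storage, one could extract every log_path from the [meta=...] field and stat each one; a rough sketch (again with placeholder endpoint/credentials, and assuming the meta JSON array has been saved to meta.json):

```python
import json
from minio import Minio
from minio.error import S3Error

# Endpoint/credentials are placeholders; meta.json is assumed to hold the JSON
# array from the [meta=...] field of the "flush segment with meta" log above.
client = Minio("minio:9000", access_key="<access-key>", secret_key="<secret-key>", secure=False)

with open("meta.json") as f:
    field_binlogs = json.load(f)

missing = []
for field in field_binlogs:
    for binlog in field.get("binlogs", []):
        path = binlog["log_path"]
        try:
            client.stat_object("bucket-laion2", path)
        except S3Error as e:
            missing.append((path, e.code))

# Any entry here was registered in meta but is not readable from the bucket.
for path, code in missing:
    print(code, path)
```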
> failed to load data from remote: Error in GetObjectSize[errcode:404, exception:, errmessage:No response body., params:params, bucket=bucket-laion2

404 doesn't mean the file doesn't exist, right?
> 404 doesn't mean the file doesn't exist, right?
Yes, I have tested it with the Go client, as the documentation illustrates.
Is this issue confirmed as a bug? Is there any plan to fix it? We met a similar case: the indexnode kept rebooting because of a GetObjectSize 404 error, and the file does not exist in the storage.
> Is this issue confirmed as a bug? Is there any plan to fix it? We met a similar case: the indexnode kept rebooting because of a GetObjectSize 404 error, and the file does not exist in the storage.
Nope, this is not a bug; it is due to a lack of some access permissions to S3.
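Given that conclusion, a hedged way to tell a genuinely missing binlog from an access/credentials problem is to list the segment prefix with the same credentials the nodes use; endpoint, credentials, and the interpretation of the error codes here are assumptions:

```python
from minio import Minio
from minio.error import S3Error

# Use the same endpoint/credentials the query/index nodes are configured with.
client = Minio("minio:9000", access_key="<access-key>", secret_key="<secret-key>", secure=False)
bucket = "bucket-laion2"
segment_prefix = "files/insert_log/447031349293500360/447031349293500364/447031349321777298/"

try:
    names = [obj.object_name for obj in
             client.list_objects(bucket, prefix=segment_prefix, recursive=True)]
    # Listing works: if the specific binlog is absent from this list, the key
    # itself is missing; if it is present, the node's 404 points more toward a
    # credentials/endpoint mismatch on the node side.
    print(f"{len(names)} objects visible under {segment_prefix}")
except S3Error as e:
    # AccessDenied here suggests a permissions problem rather than deleted data.
    print("list failed:", e.code)
```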