
[Bug]: Load failed: show collection failed: std::exception: service internal error: UnknownError

Open ThreadDao opened this issue 1 year ago • 7 comments

Is there an existing issue for this?

  • [X] I have searched the existing issues

Environment

- Milvus version: cardinal-foxspy-fix_binlog_index_concurrency-1367e8e-20240114
- Deployment mode(standalone or cluster): cluster
- MQ type(rocksmq, pulsar or kafka):  pulsar  
- SDK version(e.g. pymilvus v2.0.0rc2): pymilvus 2.3.5rc5
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

  1. Release-load of a collection sometimes fails with the following error (a minimal Go sketch of the equivalent cycle follows the stats below):
[2024-01-19 16:06:05,774 -  INFO - fouram]: [Base] Start load collection laion_coll_8,replica_number:1,kwargs:{} (base.py:311)
[2024-01-19 16:18:12,381 - ERROR - fouram]: RPC error: [get_loading_progress], <MilvusException: (code=65535, message=show collection failed: std::exception: service internal error: UnknownError)>, <Time:{'RPC start': '2024-01-19 16:18:12.378695', 'RPC error': '2024-01-19 16:18:12.381070'}> (decorators.py:128)
[2024-01-19 16:18:12,430 - ERROR - fouram]: RPC error: [wait_for_loading_collection], <MilvusException: (code=65535, message=show collection failed: std::exception: service internal error: UnknownError)>, <Time:{'RPC start': '2024-01-19 16:06:05.819916', 'RPC error': '2024-01-19 16:18:12.430141'}> (decorators.py:128)
[2024-01-19 16:18:12,431 - ERROR - fouram]: RPC error: [load_collection], <MilvusException: (code=65535, message=show collection failed: std::exception: service internal error: UnknownError)>, <Time:{'RPC start': '2024-01-19 16:06:05.774747', 'RPC error': '2024-01-19 16:18:12.430960'}> (decorators.py:128)
[2024-01-19 16:18:12,437 - ERROR - fouram]: (api_response) : <MilvusException: (code=65535, message=show collection failed: std::exception: service internal error: UnknownError)> (api_request.py:54)

failed argo:

{'segment_counts': 6,
 'segment_total_vectors': 3063006}
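
For reference, the operation being exercised is simply a release followed by a synchronous load of the collection; the original run used pymilvus 2.3.5rc5 through the fouram framework. Below is a minimal sketch of the equivalent cycle with the Go SDK (milvus-sdk-go v2); the address is a placeholder and error handling is reduced to logging.

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/milvus-io/milvus-sdk-go/v2/client"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 30*time.Minute)
	defer cancel()

	// Placeholder address; the real run targets the laion1b-test-2 cluster.
	c, err := client.NewGrpcClient(ctx, "localhost:19530")
	if err != nil {
		log.Fatalf("connect failed: %v", err)
	}
	defer c.Close()

	coll := "laion_coll_8"

	// Release, then load synchronously (async=false blocks until the load
	// finishes or fails, which is where the UnknownError above surfaced).
	if err := c.ReleaseCollection(ctx, coll); err != nil {
		log.Fatalf("release failed: %v", err)
	}
	if err := c.LoadCollection(ctx, coll, false); err != nil {
		log.Fatalf("load failed: %v", err)
	}
	log.Println("collection loaded")
}
```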

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

grafana: https://grafana-4am.zilliz.cc/d/uLf5cJ3Ga/milvus2-0?orgId=1&var-datasource=prometheus&var-cluster=&var-namespace=qa-milvus&var-instance=laion1b-test-2&var-collection=All&var-app_name=milvus&from=1705767000000&to=1705768800000

Anything else?

No response

ThreadDao · Jan 22 '24 08:01

2024-01-22 00:17:04 | querynode | 4am-node13 | laion1b-test-2-milvus-querynode-848cf877ff-xjfkx | I20240121 16:17:04.713279  1996 Utils.cpp:735] [SEGCORE][LoadFieldDatasFromRemote][milvus] failed to load data from remote: Error in GetObjectSize[errcode:404, exception:, errmessage:No response body., params:params, bucket=bucket-laion2, object=files/insert_log/447031349293500360/447031349293500364/447031349321777298/106/447031349320567089]

The file cannot be found on the remote OSS: files/insert_log/447031349293500360/447031349293500364/447031349321777298/106/447031349320567089
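
One quick way to check this from outside Milvus is to list what actually exists under the segment's insert_log prefix on the object storage and compare it with the binlog paths recorded in meta. A rough sketch with the MinIO Go SDK, using the bucket and path from the log above; the endpoint and credentials are placeholders:

```go
package main

import (
	"context"
	"fmt"
	"log"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

func main() {
	// Placeholder endpoint and credentials for the cluster's object storage.
	mc, err := minio.New("minio.example.com:9000", &minio.Options{
		Creds:  credentials.NewStaticV4("ACCESS_KEY", "SECRET_KEY", ""),
		Secure: false,
	})
	if err != nil {
		log.Fatal(err)
	}

	// Segment-level prefix taken from the failing log path; field 106's
	// binlog (447031349320567089) should appear here if it was ever uploaded.
	const bucket = "bucket-laion2"
	const prefix = "files/insert_log/447031349293500360/447031349293500364/447031349321777298/"

	for obj := range mc.ListObjects(context.Background(), bucket,
		minio.ListObjectsOptions{Prefix: prefix, Recursive: true}) {
		if obj.Err != nil {
			log.Fatal(obj.Err)
		}
		fmt.Println(obj.Key, obj.Size)
	}
}
```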

chyezh · Jan 22 '24 09:01

The C++ client reports a 404, but the Go client reports a context cancel.

2024-01-21 04:13:13 | querynode | laion1b-test-2-milvus-querynode-848cf877ff-qj7v7 | I20240121 04:13:13.683010 201749 Utils.cpp:735] [SEGCORE][LoadFieldDatasFromRemote][milvus] failed to load data from remote: Error in GetObjectSize[errcode:404, exception:, errmessage:No response body., params:params, bucket=bucket-laion2, object=files/insert_log/447031349293500360/447031349293500364/447031349321777298/106/447031349320567089]
2024-01-21 04:13:13 | datanode | laion1b-test-2-milvus-datanode-67dc4b9d99-668r5 | [2024/01/21 04:13:13.857 +00:00] [WARN] [storage/remote_chunk_manager.go:179] ["failed to read object"] [path=files/insert_log/447031349293500360/447031349293500364/447031349321777298/106/447031349320567089] [error="context deadline exceeded"]
2024-01-21 04:13:14 | querynode | laion1b-test-2-milvus-querynode-848cf877ff-qj7v7 | I20240121 04:13:14.685618 201769 Utils.cpp:735] [SEGCORE][LoadFieldDatasFromRemote][milvus] failed to load data from remote: Error in GetObjectSize[errcode:404, exception:, errmessage:No response body., params:params, bucket=bucket-laion2, object=files/insert_log/447031349293500360/447031349293500364/447031349321777298/106/447031349320567089]
2024-01-21 04:13:15 | querynode | laion1b-test-2-milvus-querynode-848cf877ff-qj7v7 | I20240121 04:13:15.706427 201765 Utils.cpp:735] [SEGCORE][LoadFieldDatasFromRemote][milvus] failed to load data from remote: Error in GetObjectSize[errcode:404, exception:, errmessage:No response body., params:params, bucket=bucket-laion2, object=files/insert_log/447031349293500360/447031349293500364/447031349321777298/106/447031349320567089]

On the Go side, GetObject is retried in a busy loop. The binlog is lost on MinIO, but the Go side still reports a context cancel instead of the real error.
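
To make that failure mode concrete, here is a hedged sketch (not the actual remote_chunk_manager code) of a deadline-bounded retry loop: if it returns only ctx.Err() when the deadline fires, the caller sees "context deadline exceeded" and the 404 from the last attempt is lost; keeping and wrapping the last real error preserves the cause.

```go
package storage

import (
	"context"
	"fmt"
	"time"
)

// retryRead is a sketch of a deadline-bounded retry loop around an object
// read. Returning only ctx.Err() on timeout (the commented-out line) is what
// hides the 404; wrapping the last attempt's error keeps the real cause.
func retryRead(ctx context.Context, read func(context.Context) ([]byte, error)) ([]byte, error) {
	var lastErr error
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		data, err := read(ctx)
		if err == nil {
			return data, nil
		}
		lastErr = err
		select {
		case <-ctx.Done():
			// return nil, ctx.Err() // what the caller effectively sees today
			return nil, fmt.Errorf("read object failed: %w (context: %v)", lastErr, ctx.Err())
		case <-ticker.C:
			// retry on the next tick
		}
	}
}
```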


log backup.txt

chyezh · Jan 22 '24 09:01

Seems that the data was GCed for some reason?

xiaofan-luan · Jan 26 '24 13:01

/assign @bigsheeper

xiaofan-luan · Jan 26 '24 13:01

> Seems that the data was GCed for some reason?

It seems that the problem is not caused by GC: the file could not be read about a minute after the segment meta was flushed, and no related GC log was found.

[2024/01/18 04:49:58.908 +00:00] [INFO] [datacoord/services.go:482] ["flush segment with meta"] [traceID=1716d6805ec4361dfe36d37de4d3424a] [nodeID=1167] [collectionID=447031349293500360] [segmentID=447031349321777298] [meta="[{\"fieldID\":108,\"binlogs\":[{\"entries_num\":27,\"timestamp_from\":447100580627742724,\"timestamp_to\":447100580627742724,\"log_path\":\"files/insert_log/447031349293500360/447031349293500364/447031349321777298/108/447031349320567091\",\"log_size\":216}]},{\"fieldID\":111,\"binlogs\":[{\"entries_num\":27,\"timestamp_from\":447100580627742724,\"timestamp_to\":447100580627742724,\"log_path\":\"files/insert_log/447031349293500360/447031349293500364/447031349321777298/111/447031349320567094\",\"log_size\":486}]},{\"fieldID\":1,\"binlogs\":[{\"entries_num\":27,\"timestamp_from\":447100580627742724,\"timestamp_to\":447100580627742724,\"log_path\":\"files/insert_log/447031349293500360/447031349293500364/447031349321777298/1/447031349320567096\",\"log_size\":216}]},{\"fieldID\":100,\"binlogs\":[{\"entries_num\":27,\"timestamp_from\":447100580627742724,\"timestamp_to\":447100580627742724,\"log_path\":\"files/insert_log/447031349293500360/447031349293500364/447031349321777298/100/447031349320567083\",\"log_size\":216}]},{\"fieldID\":101,\"binlogs\":[{\"entries_num\":27,\"timestamp_from\":447100580627742724,\"timestamp_to\":447100580627742724,\"log_path\":\"files/insert_log/447031349293500360/447031349293500364/447031349321777298/101/447031349320567084\",\"log_size\":82948}]},{\"fieldID\":104,\"binlogs\":[{\"entries_num\":27,\"timestamp_from\":447100580627742724,\"timestamp_to\":447100580627742724,\"log_path\":\"files/insert_log/447031349293500360/447031349293500364/447031349321777298/104/447031349320567087\",\"log_size\":648}]},{\"fieldID\":110,\"binlogs\":[{\"entries_num\":27,\"timestamp_from\":447100580627742724,\"timestamp_to\":447100580627742724,\"log_path\":\"files/insert_log/447031349293500360/447031349293500364/447031349321777298/110/447031349320567093\",\"log_size\":648}]},{\"binlogs\":[{\"entries_num\":27,\"timestamp_from\":447100580627742724,\"timestamp_to\":447100580627742724,\"log_path\":\"files/insert_log/447031349293500360/447031349293500364/447031349321777298/0/447031349320567095\",\"log_size\":216}]},{\"fieldID\":106,\"binlogs\":[{\"entries_num\":27,\"timestamp_from\":447100580627742724,\"timestamp_to\":447100580627742724,\"log_path\":\"files/insert_log/447031349293500360/447031349293500364/447031349321777298/106/447031349320567089\",\"log_size\":216}]},{\"fieldID\":107,\"binlogs\":[{\"entries_num\":27,\"timestamp_from\":447100580627742724,\"timestamp_to\":447100580627742724,\"log_path\":\"files/insert_log/447031349293500360/447031349293500364/447031349321777298/107/447031349320567090\",\"log_size\":216}]},{\"fieldID\":109,\"binlogs\":[{\"entries_num\":27,\"timestamp_from\":447100580627742724,\"timestamp_to\":447100580627742724,\"log_path\":\"files/insert_log/447031349293500360/447031349293500364/447031349321777298/109/447031349320567092\",\"log_size\":216}]},{\"fieldID\":103,\"binlogs\":[{\"entries_num\":27,\"timestamp_from\":447100580627742724,\"timestamp_to\":447100580627742724,\"log_path\":\"files/insert_log/447031349293500360/447031349293500364/447031349321777298/103/447031349320567086\",\"log_size\":648}]},{\"fieldID\":105,\"binlogs\":[{\"entries_num\":27,\"timestamp_from\":447100580627742724,\"timestamp_to\":447100580627742724,\"log_path\":\"files/insert_log/447031349293500360/447031349293500364/447031349321777298/105/447031349320567088\",\"log_size\":108}]},{\"fieldID\":102,\"binlogs\":[{\"entries_num\":27,\"timestamp_from\":447100580627742724,\"timestamp_to\":447100580627742724,\"log_path\":\"files/insert_log/447031349293500360/447031349293500364/447031349321777298/102/447031349320567085\",\"log_size\":216}]}]"]

I20240118 04:50:01.442193 158181 Utils.cpp:735] [SEGCORE][LoadFieldDatasFromRemote][milvus] failed to load data from remote: Error in GetObjectSize[errcode:404, exception:, errmessage:No response body., params:params, bucket=bucket-laion2, object=files/insert_log/447031349293500360/447031349293500364/447031349321777298/106/447031349320567089]

chyezh · Jan 30 '24 06:01

> failed to load data from remote: Error in GetObjectSize[errcode:404, exception:, errmessage:No response body., params:params, bucket=bucket-laion2

A 404 doesn't necessarily mean the file doesn't exist, right?

xiaofan-luan · Jan 30 '24 13:01

> failed to load data from remote: Error in GetObjectSize[errcode:404, exception:, errmessage:No response body., params:params, bucket=bucket-laion2

> A 404 doesn't necessarily mean the file doesn't exist, right?

Yes, I have tested it with the Go client.

As the documentation illustrates:

(screenshots of the referenced documentation)
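
For completeness, a small sketch of how the Go side can classify such a failure instead of surfacing a bare 404: the S3 error code distinguishes a genuinely missing key from a permission problem (on AWS S3, a missing key returns 404 NoSuchKey only if the caller has s3:ListBucket on the bucket; otherwise it returns 403 AccessDenied). The helper below is hypothetical, using the MinIO Go SDK.

```go
package storage

import (
	"context"
	"fmt"

	"github.com/minio/minio-go/v7"
)

// classifyStatErr distinguishes "the key is really gone" from "we are not
// allowed to see it" when StatObject fails against S3-compatible storage.
// (Hypothetical helper for illustration, not Milvus code.)
func classifyStatErr(ctx context.Context, mc *minio.Client, bucket, object string) error {
	_, err := mc.StatObject(ctx, bucket, object, minio.StatObjectOptions{})
	if err == nil {
		return nil
	}
	resp := minio.ToErrorResponse(err)
	switch resp.Code {
	case "NoSuchKey":
		// 404: the object is genuinely missing.
		return fmt.Errorf("binlog %s/%s not found (404 NoSuchKey)", bucket, object)
	case "AccessDenied":
		// 403: credentials or bucket policy problem, not a missing file.
		return fmt.Errorf("access denied on %s/%s (403)", bucket, object)
	default:
		return fmt.Errorf("stat %s/%s failed: status=%d code=%s: %w",
			bucket, object, resp.StatusCode, resp.Code, err)
	}
}
```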

chyezh · Feb 01 '24 10:02

Is this issue confirmed as a bug? Is there any plan to fix it? We hit a similar case: the indexnode reboots continuously because of the GetObjectSize 404 error, and the file does not exist in the storage.

wwli05 · Feb 25 '24 07:02

> Is this issue confirmed as a bug? Is there any plan to fix it? We hit a similar case: the indexnode reboots continuously because of the GetObjectSize 404 error, and the file does not exist in the storage.

Nope, this is not a bug; it's due to missing access permissions on S3.

xiaofan-luan · Feb 25 '24 10:02