snet-daemon
Channel is not unlocked after processing payment
This issue started happening frequently on the sound-spleeter service instance; it occurs about twice per week.
In the client logs one can see that the client first receives a "cannot get mutex for channel: {ID: 14}" error, and after this the client receives "another transaction on channel: {ID: 14} is in progress".
In the server logs one can see "rpc error: code = Internal desc = cannot get mutex for channel: {ID: 14}" before the payment is completed.
Daemon version:
# snetd version
INFO[0000] Cobra initialized
INFO[0000] Configuration file is not set, using default configuration
version tag: v.5.0.1
built on: 2021-05-16_19:16:27
sha1 revision: b0217231faf2620ab0e6510dcfb122e8e7f453b1
Client and server logs are shown below.
Client log:
2021-10-01 20:03:16,983 [grpc-default-executor-27500] ERROR i.s.soundspleeter.Proxy - request[117402] completed with error
io.grpc.StatusRuntimeException: INTERNAL: cannot get mutex for channel: {ID: 14}
at io.grpc.Status.asRuntimeException(Status.java:533)
at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:449)
at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:426)
at io.grpc.internal.ClientCallImpl.access$500(ClientCallImpl.java:66)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:689)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$900(ClientCallImpl.java:577)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:751)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:740)
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
...
2021-10-01 20:04:46,616 [grpc-default-executor-27500] ERROR i.s.soundspleeter.Proxy - request[117406] completed with error
io.grpc.StatusRuntimeException: FAILED_PRECONDITION: another transaction on channel: {ID: 14} is in progress
at io.grpc.Status.asRuntimeException(Status.java:533)
at io.grpc.stub.ClientCalls$StreamObserverToCallListenerAdapter.onClose(ClientCalls.java:449)
at io.grpc.internal.ClientCallImpl.closeObserver(ClientCallImpl.java:426)
at io.grpc.internal.ClientCallImpl.access$500(ClientCallImpl.java:66)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.close(ClientCallImpl.java:689)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl.access$900(ClientCallImpl.java:577)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInternal(ClientCallImpl.java:751)
at io.grpc.internal.ClientCallImpl$ClientStreamListenerImpl$1StreamClosed.runInContext(ClientCallImpl.java:740)
at io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
at io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Server log:
time="2021-10-01T20:03:13.847393841Z" level=debug msg="New gRPC call received" context="{MD: map [:authority:[sound-spleeter.singularitynet.io:7068] content-type:[application/grpc] grpc-accept-encoding:[gzip] snet-payment-channel-amount:[5805900] snet-payment-channel-id:[14] snet-payment-channel-nonce:[0] snet-payment-channel-signature-bin:[...] snet-payment-type:[escrow] user-agent:[grpc-java-netty/1.28.0]], Info: {/SoundSpleeter/spleeter true true}"
time="2021-10-01T20:03:13.84754259Z" level=debug msg="Return payment handler by type" paymentType=escrow
time="2021-10-01T20:03:16.98287053Z" level=debug msg="etcd Transaction took 3.000481805s"
time="2021-10-01T20:03:16.983087638Z" level=debug msg=payload payload="{\"type\":\"response\",\"registry_address_key\":\"0x247DEbEBB766E4fA99667265A158060018D5f4F8\",\"ethereum_json_rpc_endpoint\":\"https://mainnet.infura.io/v3/09027f4a13e841d48dbfefc67e7685d5\",\"request_id\":\"c5bmj0bbg458pmbl6s50\",\"organization_id\":\"snet\",\"service_id\":\"sound-spleeter\",\"group_id\":\"nZdFbyUlpWfOuTn0WpJCpKtQATrU6gxz6Wn9zAC2mno=\",\"service_method\":\"/SoundSpleeter/spleeter\",\"response_sent_time\":\"2021-10-01 20:03:16.982996817\",\"request_received_time\":\"2021-10-01 20:03:13.847349713\",\"response_time\":\"3.1356\",\"response_code\":\"Internal\",\"error_message\":\"rpc error: code = Internal desc = cannot get mutex for channel: {ID: 14}\",\"version\":\"v.5.0.1\",\"client_type\":\"\",\"user_details\":\"\",\"user_agent\":\"grpc-java-netty/1.28.0\",\"channel_id\":\"14\",\"username\":\"\",\"operation\":\"read\",\"usage_type\":\"apicall\",\"status\":\"failed\",\"start_time\":\"2021-10-01 20:03:13.847349713\",\"end_time\":\"2021-10-01 20:03:16.982996817\",\"usage_value\":1,\"time_zone\":\"\",\"payment_mode\":\"escrow\",\"user_address\":\"\"}"
time="2021-10-01T20:03:17.533955042Z" level=warning msg="Service call failed with status code : 500 "
time="2021-10-01T20:03:17.534002203Z" level=warning msg="Unable to publish metrics" payload="{\"type\":\"response\",\"registry_address_key\":\"0x247DEbEBB766E4fA99667265A158060018D5f4F8\",\"ethereum_json_rpc_endpoint\":\"https://mainnet.infura.io/v3/09027f4a13e841d48dbfefc67e7685d5\",\"request_id\":\"c5bmj0bbg458pmbl6s50\",\"organization_id\":\"snet\",\"service_id\":\"sound-spleeter\",\"group_id\":\"nZdFbyUlpWfOuTn0WpJCpKtQATrU6gxz6Wn9zAC2mno=\",\"service_method\":\"/SoundSpleeter/spleeter\",\"response_sent_time\":\"2021-10-01 20:03:16.982996817\",\"request_received_time\":\"2021-10-01 20:03:13.847349713\",\"response_time\":\"3.1356\",\"response_code\":\"Internal\",\"error_message\":\"rpc error: code = Internal desc = cannot get mutex for channel: {ID: 14}\",\"version\":\"v.5.0.1\",\"client_type\":\"\",\"user_details\":\"\",\"user_agent\":\"grpc-java-netty/1.28.0\",\"channel_id\":\"14\",\"username\":\"\",\"operation\":\"read\",\"usage_type\":\"apicall\",\"status\":\"failed\",\"start_time\":\"2021-10-01 20:03:13.847349713\",\"end_time\":\"2021-10-01 20:03:16.982996817\",\"usage_value\":1,\"time_zone\":\"\",\"payment_mode\":\"escrow\",\"user_address\":\"\"}" url="https://marketplace-mt-v2.singularitynet.io/metering/usage"
time="2021-10-01T20:03:24.188823457Z" level=debug msg="Payment completed"
time="2021-10-01T20:03:24.428001054Z" level=debug msg="etcd Transaction took 175.435209ms"
time="2021-10-01T20:03:24.428045194Z" level=debug msg="Channel unlocked"
@vsbogd
- Are you trying to get a new Token?
- Are you manually unlocking the channel?
- Any rough idea on when you first saw this?
Are you trying to get a new Token?
No.
Are you manually unlocking the channel?
Yes, that is what I need to do each time to make the application work again.
Any rough idea on when you first saw this?
The first time I saw this issue in the log was on Sept 27 at 04:48:49,077 GMT. Since then it has happened three times AFAIK.
#507, we need to get to this soon, it has been pending a while now. We haven't done a Daemon release in a very long time. I see the Java SDK still uses pay-per-use and not the concurrency/token feature; either way the daemon should handle channel locks gracefully. I will try to reproduce this scenario.
@vsbogd, can you also point me to the client code? It might give more leads on possibly simulating the error.
The client code is here: https://github.com/vsbogd/sound-spleeter-proxy (README.md explains how to run it), but the error happens about once per 8000 client requests, so I don't think it is easily reproducible.
I think it is simpler to find the root cause by code analysis and then write a unit test to reproduce the behavior. I see the error is raised in the following line:
https://github.com/singnet/snet-daemon/blob/fa1eab0f1204e1d0cbdcfeec949fb82f4f04762c/escrow/escrow.go#L98
The first issue is that the error is neither logged nor included in the returned error, so we don't have enough information to understand what happened. I would fix that first.
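For illustration, a minimal sketch of how the underlying error could be logged and wrapped instead of being dropped; the `Locker` interface and `acquireChannelLock` helper below are assumptions made for the sketch, not the actual snet-daemon types:

```go
package escrow

import (
	"fmt"

	log "github.com/sirupsen/logrus"
)

// Lock and Locker are hypothetical stand-ins for the daemon's channel locker
// abstraction; the real snet-daemon interfaces differ.
type Lock interface {
	Unlock() error
}

type Locker interface {
	Lock(name string) (lock Lock, ok bool, err error)
}

// acquireChannelLock shows how the Lock() error could be both logged and
// imprinted into the returned error instead of being swallowed.
func acquireChannelLock(locker Locker, channelKey fmt.Stringer) (Lock, error) {
	lock, ok, err := locker.Lock(channelKey.String())
	if err != nil {
		// Keep the root cause: log it and include it in the returned error.
		log.WithError(err).WithField("channelKey", channelKey).
			Error("cannot get mutex for channel")
		return nil, fmt.Errorf("cannot get mutex for channel: %v, underlying error: %v",
			channelKey, err)
	}
	if !ok {
		return nil, fmt.Errorf("another transaction on channel: %v is in progress", channelKey)
	}
	return lock, nil
}
```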
There are two places inside `Lock` which can lead to the error:
https://github.com/singnet/snet-daemon/blob/fa1eab0f1204e1d0cbdcfeec949fb82f4f04762c/escrow/lock.go#L36-L58
Taking into account that the lock actually happens, the most suspicious is the `CompareAndSwap` call, which is essentially an `ExecuteTransaction` call:
https://github.com/singnet/snet-daemon/blob/fa1eab0f1204e1d0cbdcfeec949fb82f4f04762c/etcddb/etcddb_client.go#L296-L325
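For reference, a compare-and-swap on etcd is a single transactional request. Here is a minimal sketch of how such a CAS could look using the etcd clientv3 API directly; it illustrates the mechanism only and is not the daemon's actual wrapper code:

```go
package etcddb

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// compareAndSwap is a simplified sketch of a CAS implemented as an etcd
// transaction, roughly what an ExecuteTransaction-style wrapper does under
// the hood.
func compareAndSwap(client *clientv3.Client, key, oldValue, newValue string,
	timeout time.Duration) (bool, error) {

	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	// If the current value equals oldValue, atomically replace it with newValue.
	resp, err := client.Txn(ctx).
		If(clientv3.Compare(clientv3.Value(key), "=", oldValue)).
		Then(clientv3.OpPut(key, newValue)).
		Commit()
	if err != nil {
		// On a timeout we do not know whether etcd applied the transaction or not.
		return false, err
	}
	return resp.Succeeded, nil
}
```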
Judging by the "Transaction took" message in the log we can say that `CompleteTransaction` was actually called. Btw, the transaction took 3 seconds, which looks like a timeout value:
https://github.com/singnet/snet-daemon/blob/fa1eab0f1204e1d0cbdcfeec949fb82f4f04762c/etcddb/etcddb_client.go#L329-L385
As the transaction itself looks safe, I would suppose that the transaction was sent to the etcd server and received by it, but the response was not received because of a timeout. It looks like etcd finished the transaction, but the client received a timeout error, and this led to the inconsistency.
What do you think? Do the etcd logs for this moment in time contain anything interesting?
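If that hypothesis is right, one possible mitigation would be to re-read the key after a client-side timeout to find out whether the write was actually applied. A sketch only, assuming the caller knows the lock value it tried to write (hypothetical helper, not the daemon's code):

```go
package etcddb

import (
	"context"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

// resolveTimedOutCAS re-reads the key after a timed-out compare-and-swap.
// If the key now holds the value we tried to write, the transaction was
// committed by etcd even though the response never reached the client.
func resolveTimedOutCAS(client *clientv3.Client, key, expectedValue string,
	timeout time.Duration) (applied bool, err error) {

	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	resp, err := client.Get(ctx, key)
	if err != nil {
		return false, err
	}
	return len(resp.Kvs) == 1 && string(resp.Kvs[0].Value) == expectedValue, nil
}
```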
Btw, I am not sure about the priority of this. Are we going to completely decommission escrow payment transactions, and are they already deprecated?
If so, maybe we can simply migrate the code to the tokens approach and finish the decommission. I cannot say when it will be possible though, because we don't have the resources for doing this work.
To me this is a production issue, and I don't want anybody to have to do a manual workaround. Let's do this for now:
- fix the logging
- increase the timeout to 5 sec on the blockchain side
We can try simulating with a test case as suggested and deep dive; I will post anything seen on etcd.
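For the test-case idea, here is a self-contained sketch of the suspected sequence using a hypothetical in-memory test double rather than the daemon's real storage interface (names and behaviour are assumptions for illustration only):

```go
package escrow

import (
	"errors"
	"testing"
)

// errTimeout stands in for the client-side timeout returned by the storage layer.
var errTimeout = errors.New("context deadline exceeded")

// fakeStorage is a hypothetical test double. It simulates the suspected failure
// mode: the compare-and-swap is applied on the server side, but the first call
// still reports a timeout to the caller.
type fakeStorage struct {
	values   map[string]string
	failOnce bool
}

func (s *fakeStorage) CompareAndSwap(key, oldValue, newValue string) (bool, error) {
	swapped := false
	if s.values[key] == oldValue {
		s.values[key] = newValue
		swapped = true
	}
	if s.failOnce {
		s.failOnce = false
		return false, errTimeout // the write landed, but the response is "lost"
	}
	return swapped, nil
}

func TestTimedOutCompareAndSwapLeavesChannelLocked(t *testing.T) {
	storage := &fakeStorage{
		values:   map[string]string{"/lock/14": "unlocked"},
		failOnce: true,
	}

	// First attempt: the CAS is applied remotely but reported as a timeout,
	// which would surface as "cannot get mutex for channel".
	if _, err := storage.CompareAndSwap("/lock/14", "unlocked", "locked"); err == nil {
		t.Fatal("expected a simulated timeout on the first attempt")
	}

	// Subsequent attempts see the channel as already locked (ok == false), which
	// would surface as "another transaction on channel ... is in progress" until
	// someone unlocks the channel manually.
	ok, err := storage.CompareAndSwap("/lock/14", "unlocked", "locked")
	if err != nil {
		t.Fatal(err)
	}
	if ok {
		t.Fatal("expected the channel lock to appear stuck after the timed-out CAS")
	}
}
```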