fabric
fabric copied to clipboard
Chaincode container crashes when querying two tables.
Description
Chaincode container crashes when querying tables twice, and the first table has many rows (more than 100 rows).
Describe How to Reproduce
- Create a table with more than 100 rows
- GetRows() on the table, and read the content into an array.
- Then call GetRow() on the table. Sample chaincode is attached. race_sample.zip
Error message on the chaincode container:
17:32:14.691 [shim] DEBU : [ef450172]Received message ERROR from shim
17:32:14.691 [shim] DEBU : [ef450172]Handling ChaincodeMessage of type: ERROR(state:ready)
Error starting the chaincode: Error handling message: [ef450172-622d-4dc8-9db3-ac6506bea1aa]Chaincode handler FSM cannot handle message (ERROR) with payload size (30) while in state: ready
Error message on the corresponding validating peer
17:32:14.509 [devops] invokeOrQuery -> INFO 0e7 Transaction ID: %v ef450172-622d-4dc8-9db3-ac6506bea1aa
17:32:14.509 [crypto] invokeOrQuery -> INFO 0e8 Initializing client [test_user1]...
17:32:14.623 [crypto] invokeOrQuery -> INFO 0e9 Initializing client [test_user1]...done!
17:32:14.691 [chaincode] processStream -> ERRO 0ea [ef450172]Error handling message, ending stream: [ef450172-622d-4dc8-9db3-ac6506bea1aa]Chaincode handler validator FSM cannot handle message (ERROR) with payload size (134) while in state: ready
17:32:44.628 [crypto] CloseClient -> INFO 0eb Closing client [test_user1]...
17:32:44.636 [rest] processChaincodeInvokeOrQuery -> ERRO 0ec Error when querying chaincode: Error:Failed to execute transaction or query(Timeout expired while executing transaction)
17:32:44.636 [rest] ProcessChaincode -> INFO 0ed REST successfully query chaincode: {"jsonrpc":"2.0","error":{"code":-32003,"message":"Query failure","data":"Error when querying chaincode: Error:Failed to execute transaction or query(Timeout expired while executing transaction)"},"id":5}
Sorry, more precisely, the first GetRows() crashes when the table size > 100. PR https://github.com/hyperledger/fabric/pull/2013 tries to fix this issue but insufficient.
I'll try this out and update by end of day.
@sachikoy could you checkout the above PR please. And thanks for the test case. I've added it to the unit test suite.
@muralisrini Thank you, it worked well. I confirmed it fixed the problem.
@sachikoy Just curious, did you try both methods GetRows and GetRows2 ? My recommendation would be to use the later.
@muralisrini I tested again... ok, actually, the PR for GetRows() did not completely fix the problem. I double checked, and it returned only 136 rows when there are 200 rows in the table. GetRows2() seems to work well.
@sachikoy I'll look into it and update
@sachikoy fixed and modified the unit test to specifically look for 250 rows.
I don't have a better answer, but there are several issues this fix introduces that I think we should try to resolve before merging
- The introduction of a
GetRows2
. We need a way to deprecate API in the shim. If the shim was its own project, we could potentially just say use shim version X instead of continuing to add 2, 3, 4, etc. functions that I think will be very messy. The other solution is to break existing chaincodes and change the signature ofGetRows
which I think will have to be done in the next architecture. - The fact that
GetRows
now loads all rows into memory. Adding this will come back to bite us, but it may be our best bandaid patch for the 0.5 branch. - Allowing chaincode to leak DB handles by not closing the iterator. This is the most dangerous to me. A chaincode may crash, but we should not allow a chaincode to impact the peer.
I think much of the above can be solved more cleanly with the next architecture DB. We likely should not bandaid the master branch and instead implement the correct fix at a DB level.
A bandaid would be a possible solution for the 0.5 branch, but the biggest pain point is that we're introducing a GetRows2
that may not be around in the future.
While I don't like it, would just pulling all rows into memory be an acceptable band-aid for 0.5? We could keep the same function name and signature in that case.
@srderson GetRows would "fix" the problem but would be inefficient. For now we cannot get rid of GetRows and hence the band-aid solution to pull in all the data. This is mentioned in the comment.
I struggled with the GetRows2's naming and "deprecation" too but it does solve the problem cleanly but does place the onus on the chaincode developer to call the closure.
Regarding the DB handle leak if Chaincode developer didn't call the closure..... I thought we closed the DB handles at the end of transaction regardless ? Don't we have to make sure DB handles on the iterator is freed at the end of the transaction (even in the current implementation, if the chaincode crashes in in the middle of iteration, we could leak?) I assumed the same mechanism would kick in case chaincode did not class the closure.
@muralisrini If we make any changes to the 0.5 branch, I think we should only submit the inefficient GetRows
fix. Then we don't introduce a new API that has to potentially be supported.
I agree the your GetRows2
function solves the problem and we can figure out how to introduce in master (maybe just have it be default GetRows
and change API) and confirm that DB closes handles automatically if it does not already.
@sachikoy @bren3582 are you ok with just the GetRows fix for 0.5 branch as @srderson suggests ? We could fix it right in master.
@muralisrini That sounds like a good solution to me if GetRows() in master is replaced by the GetRows2() implementation as we need it for our application as opposed to loading all rows in memory.
@bren3582 I am concerned with the backward-compatibility problem to replace GetRows() in master with GetRows2(). Especially there is a risk that people use GetRows2() without knowing that the chaincode needs to close the connection. I would rather distinguish those two by giving different names, so that the different behaviour is obvious. As for @muralisrini 's question, our code uses GetRows2() right now, and I have not tested GetRows() after it's fixed at July 5 https://github.com/hyperledger/fabric/issues/2088#issuecomment-230369813 . I need to wait until the weekend to test new GetRows().. sorry but due to bandwidth problem.
Hi @muralisrini any update on this?
I got the same error. How do I fix it?