fabric icon indicating copy to clipboard operation
fabric copied to clipboard

Chaincode container crashes when querying two tables.

Open sachikoy opened this issue 8 years ago • 16 comments

Description

Chaincode container crashes when querying tables twice, and the first table has many rows (more than 100 rows).

Describe How to Reproduce

  1. Create a table with more than 100 rows
  2. GetRows() on the table, and read the content into an array.
  3. Then call GetRow() on the table. Sample chaincode is attached. race_sample.zip

Error message on the chaincode container:

17:32:14.691 [shim] DEBU : [ef450172]Received message ERROR from shim
17:32:14.691 [shim] DEBU : [ef450172]Handling ChaincodeMessage of type: ERROR(state:ready)
Error starting the chaincode: Error handling message: [ef450172-622d-4dc8-9db3-ac6506bea1aa]Chaincode handler FSM cannot handle message (ERROR) with payload size (30) while in state: ready

Error message on the corresponding validating peer

17:32:14.509 [devops] invokeOrQuery -> INFO 0e7 Transaction ID: %v ef450172-622d-4dc8-9db3-ac6506bea1aa
17:32:14.509 [crypto] invokeOrQuery -> INFO 0e8 Initializing client [test_user1]...
17:32:14.623 [crypto] invokeOrQuery -> INFO 0e9 Initializing client [test_user1]...done!
17:32:14.691 [chaincode] processStream -> ERRO 0ea [ef450172]Error handling message, ending stream: [ef450172-622d-4dc8-9db3-ac6506bea1aa]Chaincode handler validator FSM cannot handle message (ERROR) with payload size (134) while in state: ready
17:32:44.628 [crypto] CloseClient -> INFO 0eb Closing client [test_user1]...
17:32:44.636 [rest] processChaincodeInvokeOrQuery -> ERRO 0ec Error when querying chaincode: Error:Failed to execute transaction or query(Timeout expired while executing transaction)
17:32:44.636 [rest] ProcessChaincode -> INFO 0ed REST successfully query chaincode: {"jsonrpc":"2.0","error":{"code":-32003,"message":"Query failure","data":"Error when querying chaincode: Error:Failed to execute transaction or query(Timeout expired while executing transaction)"},"id":5}

sachikoy avatar Jul 01 '16 12:07 sachikoy

Sorry, more precisely, the first GetRows() crashes when the table size > 100. PR https://github.com/hyperledger/fabric/pull/2013 tries to fix this issue but insufficient.

sachikoy avatar Jul 02 '16 05:07 sachikoy

I'll try this out and update by end of day.

muralisrini avatar Jul 02 '16 15:07 muralisrini

@sachikoy could you checkout the above PR please. And thanks for the test case. I've added it to the unit test suite.

muralisrini avatar Jul 03 '16 00:07 muralisrini

@muralisrini Thank you, it worked well. I confirmed it fixed the problem.

sachikoy avatar Jul 03 '16 23:07 sachikoy

@sachikoy Just curious, did you try both methods GetRows and GetRows2 ? My recommendation would be to use the later.

muralisrini avatar Jul 03 '16 23:07 muralisrini

@muralisrini I tested again... ok, actually, the PR for GetRows() did not completely fix the problem. I double checked, and it returned only 136 rows when there are 200 rows in the table. GetRows2() seems to work well.

sachikoy avatar Jul 05 '16 00:07 sachikoy

@sachikoy I'll look into it and update

muralisrini avatar Jul 05 '16 00:07 muralisrini

@sachikoy fixed and modified the unit test to specifically look for 250 rows.

muralisrini avatar Jul 05 '16 01:07 muralisrini

I don't have a better answer, but there are several issues this fix introduces that I think we should try to resolve before merging

  1. The introduction of a GetRows2. We need a way to deprecate API in the shim. If the shim was its own project, we could potentially just say use shim version X instead of continuing to add 2, 3, 4, etc. functions that I think will be very messy. The other solution is to break existing chaincodes and change the signature of GetRows which I think will have to be done in the next architecture.
  2. The fact that GetRows now loads all rows into memory. Adding this will come back to bite us, but it may be our best bandaid patch for the 0.5 branch.
  3. Allowing chaincode to leak DB handles by not closing the iterator. This is the most dangerous to me. A chaincode may crash, but we should not allow a chaincode to impact the peer.

I think much of the above can be solved more cleanly with the next architecture DB. We likely should not bandaid the master branch and instead implement the correct fix at a DB level.

A bandaid would be a possible solution for the 0.5 branch, but the biggest pain point is that we're introducing a GetRows2 that may not be around in the future.

While I don't like it, would just pulling all rows into memory be an acceptable band-aid for 0.5? We could keep the same function name and signature in that case.

srderson avatar Jul 05 '16 21:07 srderson

@srderson GetRows would "fix" the problem but would be inefficient. For now we cannot get rid of GetRows and hence the band-aid solution to pull in all the data. This is mentioned in the comment.

I struggled with the GetRows2's naming and "deprecation" too but it does solve the problem cleanly but does place the onus on the chaincode developer to call the closure.

Regarding the DB handle leak if Chaincode developer didn't call the closure..... I thought we closed the DB handles at the end of transaction regardless ? Don't we have to make sure DB handles on the iterator is freed at the end of the transaction (even in the current implementation, if the chaincode crashes in in the middle of iteration, we could leak?) I assumed the same mechanism would kick in case chaincode did not class the closure.

muralisrini avatar Jul 06 '16 13:07 muralisrini

@muralisrini If we make any changes to the 0.5 branch, I think we should only submit the inefficient GetRows fix. Then we don't introduce a new API that has to potentially be supported.

I agree the your GetRows2 function solves the problem and we can figure out how to introduce in master (maybe just have it be default GetRows and change API) and confirm that DB closes handles automatically if it does not already.

srderson avatar Jul 11 '16 21:07 srderson

@sachikoy @bren3582 are you ok with just the GetRows fix for 0.5 branch as @srderson suggests ? We could fix it right in master.

muralisrini avatar Jul 13 '16 20:07 muralisrini

@muralisrini That sounds like a good solution to me if GetRows() in master is replaced by the GetRows2() implementation as we need it for our application as opposed to loading all rows in memory.

bren3582 avatar Jul 13 '16 22:07 bren3582

@bren3582 I am concerned with the backward-compatibility problem to replace GetRows() in master with GetRows2(). Especially there is a risk that people use GetRows2() without knowing that the chaincode needs to close the connection. I would rather distinguish those two by giving different names, so that the different behaviour is obvious. As for @muralisrini 's question, our code uses GetRows2() right now, and I have not tested GetRows() after it's fixed at July 5 https://github.com/hyperledger/fabric/issues/2088#issuecomment-230369813 . I need to wait until the weekend to test new GetRows().. sorry but due to bandwidth problem.

sachikoy avatar Jul 14 '16 12:07 sachikoy

Hi @muralisrini any update on this?

malik-altaf avatar Nov 03 '16 09:11 malik-altaf

I got the same error. How do I fix it?

suffiasameera avatar May 30 '17 07:05 suffiasameera