fabric Block Header Metadata Modification, Changing the Transaction Flow

Hello,

Apologies if this issue seems out of place, but I am seeking guidance from anyone actively involved in maintaining the Hyperledger Fabric project or well-versed in its codebase. I have previously addressed several questions on Discord and received responses to few of them.

I am using the Hyperledger Fabric v2.2 branch.

I am currently looking to implement an additional structure within each block's header. This would facilitate efficient querying of the transaction history by checking each block's header before executing the getHistoryForKey() function. This process would verify if the key being searched for exists within the specified block, potentially minimizing the time required for queries.

Thus far, I have modified common.pb.go to include a new field added to the Block's Header. Following that, I have updated orderer/common/multichannel/blockwriter.go to incorporate the logic necessary to create a bloom-filter like structure with each new block, through modifications to the func (bw *BlockWriter) CreateNextBlock(messages []*cb.Envelope) *cb.Block {} function.

However, I'm at an impasse when it comes to modifying how queries are processed. My initial plan was to design a new method based on .getHistoryForKey and incorporate a check within this method. Unfortunately, I have been unable to locate where .getHistoryForKey() is defined.

Could you provide guidance on the most efficient approach to implementing what I have outlined above? Your expertise and input would be greatly appreciated.

Thank you,

May 15 '23 15:05 r4sk01

GetHistoryForKey() is implemented at: https://github.com/hyperledger/fabric/blob/main/core/chaincode/handler.go#L893

Note that the query does not scan all blocks. It uses an internal LevelDB index based on the key to find the blocks and transaction numbers that wrote to the key. Then only these transactions are queried and returned. So the query is already optimized by using the index. You can see the index that gets created for these queries at: https://github.com/hyperledger/fabric/blob/main/core/ledger/kvledger/history/db.go#L79

Or maybe I am not understanding what you intend to do... how exactly would you optimize the query?

Is this something you are intending to contribute back to the project?

May 15 '23 15:05 denyeart

Hey @denyeart, thank you much for a reply!

I am working on a research and the aim is to allow more efficient historical data retrieval. If result is good enough to contribute to project, it will be great.

Motivation is the following: A blockchain system implements a ledger for recording transactions that modify some global states. The system captures the entire evolution history of the states. There might be situations when I need to get a state version for specific key or specific range of keys. As example, I want to get the 3rd state version for key "JohnDebit".

Currently, in Fabric there is only one API to work with transaction history implemented in levelDB that is .getHistoryForKey(), right? I thought that that method linearly scan all the blocks, but you have already explained that it is not how it works. How is the transaction history organized, what does it store? Is it something like {key, blockNumber, readWriteSet}?

Idea for historical data retrieval improvement was to introduce bloom filter like structure into the the block's metadata that would allow to first check if the key exist in the block's transactions, and only then query if needed.

Does it make sense? Is there a room for such kind of improvement?

May 16 '23 20:05 r4sk01

The history index is based on {key, blockNumber, transactionNumber}. So for every key it is efficient to find the set of historical {blockNumber,transactionNumber} transactions that wrote to it. Then there is a separate index to find the local storage location for each transaction. So finding and returning the N historical writes along with the transactions for a given key is very efficient already, in fact, more efficient than checking each block's metadata would be.

May 17 '23 02:05 denyeart

Thank you for sharing @denyeart , these insights are really helpful!

I have already tested Range Queries (from implementation point of view, that query is a multiple call of .getHistoryForKey() for all the keys in a given range, for example 1-108) in Hyperledger Fabric v2.2. Results show that it’s not very efficient for the range queries, especially with increasing number of data inserted into ledger.

What if we are reading only N/4 blocks or even less? As example we could use combined bloom filter for every 4 blocks, would that make sense? Is there a possibility that proposed approach might then perform better than already existing levelDB based indexes?

Is there somewhere an example for logical flow of levelDB indexing? Might I ask you for such example? What happens when a new state for the key is inserted and how it is retrieved during the history query?

Thank you,

May 17 '23 07:05 r4sk01

Please clarify what you mean by a range query... there is no range query for history. Are you simply iterating over getHistoryForKey() for each of the 108 keys in chaincode? That will require a call between chaincode shim and peer for each of the 108 keys, and will indeed be slow.

If there is a typical use case where you want to query history of many keys at once, it will be more efficient to have a new chaincode API for this so that you don't have to roundtrip between chaincode and peer for every key. Within that implementation, I still think checking the existing indexes would be more efficient than visiting each blocks metadata.

Also, what chaincode language are you using? We've seen best results with Go chaincode.

I'd suggest to place some timers in your chaincode and also consult the peer debug log to identify where the most time is spent. I expect the delays are not coming from the index lookups, but would like to see a summary of the delay breakdown before jumping to conclusions.

May 17 '23 13:05 denyeart

Hey @denyeart , by the range query I mean: give me all the historical states for given range of keys. I am simply iterating over getHistoryForKey() for each of the 108 keys in chaincode. I did it because I did not find build in ways to work with transaction history levelDB except the getHistoryForKey(). I am using the Node.js chaincode language.

When you said that it will be more efficient to have a new chaincode API, what did you particularly mean? Do you mean that a new API should be implemented near 'getHistoryForKey()' in https://github.com/hyperledger/fabric/blob/main/core/chaincode/handler.go#L893 ? Or do you mean that there are already all the tools and I can just write chaincode in go/node?

If it is the first case, I assume there should be some APIs that allow to interact with levelDB transaction history. Can you point to them? May be they will also help me to understand the flow of levelDB and its indexes.

Thank you,

May 17 '23 19:05 r4sk01

Ok, yeah it will be slow to iterate in the chaincode. First of all you are making 108 separate grpc calls from chaincode to the peer runtime. Second of all node.js is known to be slow when iterating in chaincode.

I meant the former, to do bulk history queries like this efficiently a new chaincode API would need to be implemented in Fabric to accept an array of input keys. Starting with the protos e.g. https://github.com/hyperledger/fabric-protos/blob/main/peer/chaincode_shim.proto#L134-L138 and then implement the handler in shim side and peer side. On peer side I still think the existing indexes are sufficient.

But taking a step back, I don't think we'd want to make these additions in the project. User chaincode has always been an awkward way to query historical data from a peer. If you really want to implement something I'd suggest extending Query System Chaincode. From there you have direct access to the peer's internal ledger queries and won't be bound by existing chaincode APIs. Take a look at the other examples in https://github.com/hyperledger/fabric/blob/main/core/scc/qscc/query.go.

You could do something similar to https://github.com/hyperledger/fabric/blob/main/core/chaincode/handler.go#L893 in QSCC (or your own system chaincode) but have one option for a single key and another option for an array of input keys.

If you really want to get ambitious, a new set of grpc APIs would be preferable to QSCC. This has been on the backlog for a long time but never prioritized, see https://jira.hyperledger.org/browse/FAB-15427.

May 17 '23 20:05 denyeart

Hi @denyeart ,

Thank you very much for your help! I'll certainly review the links you provided.

I have a follow-up question related to Hyperledger Fabric's behavior.

Let's consider a scenario where I have a million entries of data in a key-value format. The entries aren't unique; for example, there could be multiple entries like {key: "1", value: "100"}, {key: "1", value: "120"}, and so on. I created two datasets - the first dataset contains a million key-value pairs in ascending order (key 1, key 2, and so on), and the second dataset contains the same million key-value pairs, but in random order (for instance, key 90, key 5, key 120).

I inserted these datasets separately and timed the execution of .getHistoryForKey() on that data for a specific key. Additionally, I ran the same range queries (the ones where I simply iterate over .getHistoryForKey() for each of the 108 keys in the ledger). My observation was that it took less time to execute queries for the sorted data.

Intuitively, this makes sense, but I'm struggling to understand why this happens from a code base perspective. I've read online that LevelDB employs LSM trees, which theoretically should make the order of data insertion relatively insignificant, right?

Thank you for your time,

May 19 '23 03:05 r4sk01

I agree, in theory the order of insertion should not be very significant when using these LevelDB key based 'indexes'. That's why I was hoping you could further break down the delays that you are seeing, so that we can do some more targeted investigation.

May 19 '23 13:05 denyeart

Hi @denyeart ,

Returning to the .getHistoryForKey() and historical index it is using, is there an information about how much will be the space overhead for such index? How a user can verify that the result from .getHistoryForKey() is truth and valid ? Does user check that?

To clarify for myself, I have implemented the following function to run range query. This is what you mean by placing some timers in my chaincode, right? Add timers to this function and find out where it takes the most time, right?

async queryOrderHistoryByRange(ctx, startKey, endKey) {
        const results = [];
        for await (const {key, value} of ctx.stub.getStateByRange(startKey, endKey)) {
            const iterator = await ctx.stub.getHistoryForKey(key);
            while (true) {
                const result = await iterator.next();
                if (result.done) {
                    break;
                }
                const assetValue = result.value.value.toString('utf8');
                let transactionId = result.value.txId;
                let asset = {
                value: assetValue,
                timestamp: result.value.timestamp,
                txId: transactionId
                };
                results.push(asset);
            }
            await iterator.close();
        }
        return JSON.stringify(results);
    }

Thank you for your time,

May 19 '23 20:05 r4sk01

You can monitor the size of the history index at this peer location: CORE_PEER_FILESYSTEMPATH/ledgersData/historyLeveldb

You can verify the history by querying the returned txId and looking at the transaction's writesets. Some users like to query two different peers to verify the results against each other.

For timings, I would suggest to log timestamps in the chaincode as you iterate through the results, and also enable peer debug ( FABRIC_LOGGING_SPEC=debug ) and then identify the largest time gaps in the chaincode and peer logs.

May 23 '23 17:05 denyeart

Hello @denyeart,

As you previously highlighted in this thread, the GetHistoryForKey() function is implemented at the following location: https://github.com/hyperledger/fabric/blob/main/core/chaincode/handler.go#L893

However, I'm having some difficulty understanding its connection with the Node.js SDK. Could you please explain how Node.js packages such as fabric-shim, fabric-contract-api, and fabric-shim-api are able to invoke the GetHistoryForKey() function?

Thank you for your time,

Jun 01 '23 19:06 r4sk01

fabric-shim provides functions to invoke on a chaincode. When you call one of the chaincode functions like GetHistoryForKey() the fabric-shim will make a grpc call to the peer. If you iterate through results there will be a series of these grpc calls between the node chaincode and the peer.

To see the actual fabric-shim handler, look at: https://github.com/hyperledger/fabric-chaincode-node/blob/main/libraries/fabric-shim/lib/stub.js#L694-L714

I'm not an expert in node chaincode, for deeper discussion I'd recommend Discord channel fabric-contract-apis.

Jun 05 '23 19:06 denyeart

To add a new chaincode function you would first extend the protos: https://github.com/hyperledger/fabric-protos/blob/main/peer/chaincode_shim.proto

Then generate the bindings, e.g. make nodebindings: https://github.com/hyperledger/fabric-protos/blob/main/Makefile#L246

Then update files in your preferred language chaincode shim. For Go that would be fabric-chaincode-go/shim interfaces.go, stub.go, handler.go. For node that would be fabric-chaincode-node although I haven't looked at the node implementation files myself (I usually recommend Go chaincode since it performs best and is maintained by the core Fabric maintainers).

Then update the peer to handle the chaincode calls in: https://github.com/hyperledger/fabric/blob/main/core/chaincode/handler.go

Jun 16 '23 16:06 denyeart