bc-java
RFC 4998 implementation error?
It seems that the current implementation of RFC 4998 (Evidence Record Syntax) is incorrect. I may be wrong in the way I use the ERSArchiveTimeStampGenerator class, but from what I see the reduced hash tree contains the hashes of all data elements that are provided. The RFC, however, states that only selected hash values should be included (cf. section 4.2). Similarly, I can only obtain a single ERSEvidenceRecord for a given input of n data elements, whereas it should be an n-to-n mapping, with all evidence records sharing the same TimestampToken but containing different reduced hash trees. Any thoughts on this?
Hmmm. I think we need to address these separately. On the first one, can you provide an example of what you're putting in and what you'd expect to come back?
On the first one, if I provide the following data objects (each of which constitutes a data group), I would expect to get 4 evidence records in return: "hello", "world", "alice", "bob". Considering the attached diagram:
- The first ER, generated from the perspective of "alice", would have a reduced hash tree containing h2 and h34;
- The second ER, generated from the perspective of "hello", would have a RHT containing h1 and h34;
- The third ER, generated from the perspective of "world", would have a RHT containing h12 and h4;
- The fourth ER, generated from the perspective of "bob", would have a RHT containing h12 and h3.
All ERs would however share the same timestamp token, generated over h1234.
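This expectation can be checked mechanically. Below is a hedged sketch (Python with SHA-256 as a stand-in hash; the assignment of documents to leaf positions is my reading of the diagram, with alice/hello as the left pair and world/bob as the right; inner nodes combine children by binary-sorting the hashes ascending, concatenating, and hashing, per RFC 4998 section 4.2): all four per-document reduced hash trees recompute the same root h1234, which is why a single timestamp over h1234 can be shared.

```python
import hashlib

def H(b):
    return hashlib.sha256(b).digest()

def node(hs):
    # RFC 4998-style inner node: binary-sort ascending, concatenate, hash
    return H(b"".join(sorted(hs)))

# Leaf assignment assumed from the diagram: alice=1, hello=2, world=3, bob=4
h1, h2, h3, h4 = (H(d) for d in [b"alice", b"hello", b"world", b"bob"])
h12, h34 = node([h1, h2]), node([h3, h4])
h1234 = node([h12, h34])          # the value the shared timestamp would cover

def root_from_rht(leaf, rht):
    # Recompute the root: hash the first partial hash tree,
    # then fold the result into each following one.
    assert leaf in rht[0]         # section 4.3 step 2: the leaf hash must appear
    acc = node(rht[0])
    for pht in rht[1:]:
        acc = node([acc] + pht)
    return acc

# One reduced hash tree per document, as described above
rhts = {
    b"alice": [[h1, h2], [h34]],  # RHT contains h2 and h34
    b"hello": [[h2, h1], [h34]],  # RHT contains h1 and h34
    b"world": [[h3, h4], [h12]],  # RHT contains h4 and h12
    b"bob":   [[h4, h3], [h12]],  # RHT contains h3 and h12
}
assert all(root_from_rht(H(d), rht) == h1234 for d, rht in rhts.items())
```

Because the combination rule sorts the hashes before concatenating, the order inside each partial hash tree does not matter; only the level structure does.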
As a side note, the implementation I provided a few years ago (see https://github.com/bcgit/bc-java/pull/343) relied on a TreeMap for the RHT generation, whereas the current implementation relies on a pure binary tree approach.
@veebee, I’m not sure I understand you correctly, but why do you expect 4 ERs for your 4 data objects “hello”, “world” …? The purpose of Evidence Records is to combine these data (in a hash tree or a list of reduced hash trees) and create a single ER, which initially has one timestamp that refers to the root hash value only. The root hash value covers the hash values of the 4 data objects. Later, one can do hash tree renewal or timestamp renewal for the ER.
I'm not sure I understand. This may be confusion about the API, but following the ASN.1 the above would produce a single ER. The timeStamp itself appears inside ArchiveTimeStamp which is a field of the ER. A single data group would produce a partial hash tree with a single node in it representing the entire group due to the calculation mechanism in Section 4.2 part 3.
I think the issue is the RFC uses "group" in two contexts: both in terms of the archive time stamp construction, and in terms of the data to be included in a hash tree being either the hashes of single documents or a single hash computed from a group of documents which are, usually, somehow related. The type ERSDataGroup, defined in the BC API, refers specifically to the latter - a group of documents to be stored under a single hash.
The other form of grouping, the collection of hashes into a partial hash tree, will follow naturally from the use of ERSData, in adding the objects. This will still produce a single ArchiveTimeStamp with a reduced hash tree covered by a single time stamp.
Does that make sense?
@raubv0gel, my understanding is that the RHT is data-group dependent (cf. Figures 2 and 3 under section 4.2). As there is only a single RHT under an ArchiveTimeStamp, an ER should therefore be associated with a single data group.
@dghgit, I might indeed be confused by these two different uses of "group". To give a better example, I tried to mimic what is provided in section 4.2 (the attached txt file contains the ASN.1 structure encoded in base64). I gave the following input:
- An ERSDataGroup (d1) containing a single ERSByteData ("hello");
- An ERSDataGroup (d2) containing 3 ERSByteData ("bouncy", "castle" and "rocks");
- An ERSDataGroup (d3) containing a single ERSByteData ("world").
The ER I obtain has a single RHT containing 3 partial hashtrees, totalling 5 hashes:
- pht1 contains h1 = H(d1);
- pht2 contains h3 = H(d3);
- pht3 contains h2b, h2a, h2c (respectively H("castle"), H("bouncy") and H("rocks")).
What I would expect instead is a set of 3 ERs:
- The first one covering d1, with a RHT containing 2 partial hashtrees: (h1, h2) and (h3), where h2 = H(h2b + h2a + h2c);
- The second one covering d2, with a RHT containing 3 partial hashtrees: (h2b, h2a, h2c), (h1) and (h3);
- The third one covering d3, with a RHT containing 2 partial hashtrees (h3) and (h12).
Does that make it a bit clearer?
The test: https://github.com/bcgit/bc-java/blob/master/pkix/src/test/java/org/bouncycastle/tsp/test/ERSTest.java emulates section 4.2.
Section 4.2 is only about reduced hash trees though - it provides examples of what could go in a single ER.
You can certainly do what you want, but you would need a separate ERSArchiveTimeStampGenerator for each data group, which would allow you to create a separate ERSEvidenceRecord for each one.
It seems like the RFC does not define the algorithm for reduced hash tree generation quite correctly (section 4.2). It starts with
- Generate hash value h of the data object, using hash algorithm H of the hash tree.
Actually, one has to generate a reduced hash tree for each data object. Afterwards, all reduced hash trees will be stored in the reducedHashtree attribute of a single evidence record.
Furthermore, in section 4.3 the RFC says that verification is
- Calculate hash value h of the data object with hash algorithm H given in field digestAlgorithm of the Archive Timestamp.
- Search for hash value h in the first list (partialHashtree) of reducedHashtree. If not present, terminate verification process with negative result.
- …
- …
This does not seem to be sufficient: to fail verification, one has to check that h is not contained in any partialHashtree, not just in the first list. In the BC implementation one can actually see that org.bouncycastle.tsp.ers.ERSArchiveTimeStamp#checkContainsHashValue iterates over all reducedHashtrees.
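The two readings can be contrasted in a small illustrative sketch (Python; the function names contains_strict and contains_lenient are mine, not the BC API): a strict section 4.3 check that searches the first partial hash tree only, versus a lenient check that searches all of them, which is what checkContainsHashValue effectively does.

```python
import hashlib

def H(b):
    return hashlib.sha256(b).digest()

def node(hs):
    return hs[0] if len(hs) == 1 else H(b"".join(sorted(hs)))

def contains_strict(h, rht):
    # RFC 4998 section 4.3 step 2, read literally: first list only
    return h in rht[0]

def contains_lenient(h, rht):
    # What an iterate-over-all-lists check accepts: any partial hash tree
    return any(h in pht for pht in rht)

h1, h2, h3, h4 = (H(d) for d in [b"a", b"b", b"c", b"d"])
h34 = node([h3, h4])
rht_for_h1 = [[h1, h2], [h34]]   # reduced hash tree from h1's perspective

assert contains_strict(h1, rht_for_h1)       # the leaf itself: found
assert not contains_strict(h34, rht_for_h1)  # inner node: rejected strictly ...
assert contains_lenient(h34, rht_for_h1)     # ... but accepted leniently
```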
Note: It seems that the implementation of timestamp renewal and hash-tree renewal is missing in the BC API. See #1173.
Actually, one has to generate a reduced hash tree for each data object.
That's my point actually, and the reason why my understanding of the RFC is that one ER should be associated with a single data object. I'm trying to dig up a conversation I had with T. Gondrom a few years back which confirmed this, as well as sample ERs from a German government institution (using a proprietary implementation).
Afterwards, all reduced hash trees will be stored in the reducedHashtree attribute of a single evidence record.
No, this would defeat the purpose of using reduced hash trees. You would be better off storing all the hashes then.
@raubv0gel, did you have a look at this: https://github.com/de-bund-bsi-tr-esor/ERVerifyTool ?
Okay, I think I see the issue - a reduced hash tree is basically nodes out of a Merkle tree which provide enough information to recreate the root. This last bit is the thing which is not fully defined in the RFC (we use a left-first traversal, as I'd guess most people would, but there are clearly other ways of doing it, which is a shame, as it is going to lead to compatibility issues down the line). The root of the Merkle tree is what is signed by the timestamp server and forms the basis of the ER. The idea of using the Merkle tree is to avoid having to produce an ER for each document, so a group of documents can be hashed as nodes of a Merkle tree, with the hashes of documents that are longer lived than others remaining valid even while some documents expire or are superseded.
That said there is nothing to stop you from producing an ER per document, but you wouldn't use a reduced hash tree for this as it makes no sense - there's only a single document, so the root of the Merkle tree would be the document hash. It would be the simplest way to implement the standard though.
I haven't had a look at the ERVerifyTool (yet). I will, eventually.
That said there is nothing to stop you from producing an ER per document, but you wouldn't use a reduced hash tree for this as it makes no sense - there's only a single document, so the root of the Merkle tree would be the document hash. It would be the simplest way to implement the standard though.
I would say that it enables cost optimization: if you need to produce single ERs for millions of documents, the cost of timestamp tokens may be prohibitive. However, if you group them in order to share a single timestamp token, it becomes very interesting.
@veebee
I would say that it enables cost optimization
That’s the point! 🙂 E.g., a qualified time stamp from the D-Trust TSA is expensive – about 0.5 €. Imagine one has to time stamp thousands of documents per day …
While trying to understand reduced hash trees I stepped through org.bouncycastle.tsp.ers.ERSArchiveTimeStampGenerator#getPartialHashtrees with RFC 4998, section 4.2 in mind. If I am not wrong, getPartialHashtrees() doesn't do any reduction. The result always seems to look like this:
Each partial hash tree only contains a single hash value (the example is about 6 data objects, therefore 6 hash values). The example in RFC 4998, section 4.2. clearly shows that a partial hash tree contains more than a single hash value in general.
As said, there seems to be no hash tree reduction at all – it’s actually only a hash list.
Help, please?! 🙃
The full tree is calculated in BinaryTreeRootCalculator. What's stored in the ASN.1 is enough to rebuild that. We do reproduce the example in section 4.2, but again it's not really enough to determine exactly how the Merkle tree should be either decomposed or rebuilt.
It’s computationally expensive to rebuild the whole hash tree (or “just” the root hash value) from the data object hash values only. One has to calculate each inner node (hash value).
The purpose of reduced hash trees is that one can computationally inexpensively calculate the root hash value while only having one data object hash value h (and therefore prove that h is contained in the ArchiveTimeStamp). Much fewer inner nodes (hash values) have to be calculated, because they are already contained in the reduced hash tree.
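To put rough numbers on this (a sketch that simply counts calls to the hash function, under the sorted-concatenation node rule): rebuilding the root of a complete binary tree over n leaves costs n-1 inner-node hashes, while recomputing it from a reduced hash tree costs only about log2(n).

```python
import hashlib

CALLS = 0
def H(b):
    global CALLS
    CALLS += 1
    return hashlib.sha256(b).digest()

def node(hs):
    # Inner node: binary-sort ascending, concatenate, hash
    return H(b"".join(sorted(hs)))

def build_root(level):
    # Pair adjacent nodes level by level until one root remains
    while len(level) > 1:
        level = [node(level[i:i + 2]) for i in range(0, len(level), 2)]
    return level[0]

leaves = [hashlib.sha256(bytes([i])).digest() for i in range(8)]

# The reduced hash tree for leaf 0 stores its sibling plus one node per
# upper level, so only the path to the root must be re-hashed later.
h34 = node([leaves[2], leaves[3]])
h5678 = build_root(leaves[4:8])
rht0 = [[leaves[0], leaves[1]], [h34], [h5678]]

CALLS = 0
root_full = build_root(list(leaves))
full_cost = CALLS                 # 4 + 2 + 1 = 7 hashes for 8 leaves

CALLS = 0
acc = node(rht0[0])
for pht in rht0[1:]:
    acc = node([acc] + pht)
rht_cost = CALLS                  # one hash per level: 3

assert acc == root_full and (full_cost, rht_cost) == (7, 3)
```

For thousands of data objects the gap grows accordingly: roughly n hash computations versus log2(n).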
ArchiveTimeStamp from unit test org.bouncycastle.tsp.test.ERSTest#testBasicBuildEvidenceRecord:
Again, only a hash list is stored (there are 3 hash values in partial hash tree 0, because h3Docs is an ERSDataGroup that contains 3 data objects). This is not the reduced hash tree illustrated in RFC 4998, section 4.2.
Am I wrong?
Rereading the RFC - if you look at it from the point of view of optimization, 3 reduced hash trees would make sense in this case. You would then be describing 3 paths for calculating the root, one for each data group. I'll admit that when I was reading this I thought it was more about reducing space and that the examples were a succession of decompositions.
Reduced hash tree currently being stored by ERSTest.testBasicBuild() is:
DER Sequence
    DER Sequence
        DER Octet String[32]
        DER Octet String[32]
        DER Octet String[32]
    DER Sequence
        DER Octet String[32]
    DER Sequence
        DER Octet String[32]
Which is the same as the structure given at the end of section 4.2.
I think we can add support for the expanded method without breaking anything that has been done already.
Update: actually, no, I think this is still not quite right; an ArchiveTimeStamp can only have one reduced hash tree. If this was for optimization then there would be a separate ArchiveTimeStamp for each data group, or single object, with the same ArchiveTimeStampChain containing the different ArchiveTimeStamp structures for each data group. I think this is what confused me with the initial report: it's not the EvidenceRecord that's repeated, it's the ArchiveTimeStamp. This is still a backwards-compatible change - it's an optimisation. I need to look at the BSI tests first.
Update 2: Section 5 actually says a single ArchiveTimeStamp - the structure seems to be an ArchiveTimeStampChain for timestamp renewal, with a new ArchiveTimeStampChain added to the sequence for hash renewal. It specifically talks about a single ArchiveTimeStamp in each case; it does not say there are multiples. I've now also checked the BSI test data as well, and it seems consistent with what we are producing.
This seems to be getting more and more confusing. ¯\_(ツ)_/¯
This RFC seems to have some mistakes and weird descriptions.
I certainly missed the fact that one can only store a single reduced hash tree in an ArchiveTimeStamp …
Section 4.2 shows the example hash tree:
- For data object 1 (d1, h1 = h(d1)) it is reduced to the reduced hash tree rht1.
- For data object 2 (d2, h2abc = h(h(d2a) ∙ h(d2b) ∙ h(d2c))) it is reduced to the reduced hash tree rht2.
- For data object 3 (d3, h3 = h(d3)) the reduced hash tree is not given (why?). The reduced hash tree rht3 should be rht3 = SEQ(SEQ(h12, h3)).
We get 3 reduced hash trees: rht1 = SEQ(SEQ(h2abc, h1), SEQ(h3)), rht2 = SEQ(SEQ(h2b, h2c, h2a), SEQ(h1), SEQ(h3)), and rht3 = SEQ(SEQ(h12, h3)).
@dghgit, you are right with your Update (1): the ASN.1 syntax clearly shows that only a single reduced hash tree can be stored in an ArchiveTimeStamp. Currently this makes no sense to me: why should we reduce the hash tree for each data object if we can only store a single reduced hash tree … Conversely, why should we generate only a single reduced hash tree, for just one of the data objects … For me it seems weird and complex to generate multiple copies of ArchiveTimeStamps, differing only in their reduced hash trees (for the same hash tree), just to store multiple reduced hash trees.
Questions:
- Why just don’t store the complete hash tree (without reduced hash trees)?
- What is the purpose of storing reduced hash trees at all? Can it make sense to store only a single reduced hash tree?
@dghgit
Update 2: Section 5 actually says a single ArchiveTimeStamp - the structure seems to be an ArchiveTimeStampChain for timestamp renewal, with a new ArchiveTimeStampChain added to the sequence for hash renewal. It specifically talks about a single ArchiveTimeStamp in each case; it does not say there are multiples. I've now also checked the BSI test data as well, and it seems consistent with what we are producing.
Because of #1173 I have implemented time stamp renewal and hash tree renewal (rehashing) in the BC API (forked) – the generation part (5.2) is finished, but the validation part (5.3) isn't yet.
Right, the RFC (5.2) says
In the case of Timestamp Renewal, the content of the timeStamp field of the old Archive Timestamp has to be hashed and timestamped by a new Archive Timestamp. The new Archive Timestamp MAY not contain a reducedHashtree field, if the timestamp only simply covers the previous timestamp.
If we had multiple copies of ArchiveTimeStamps differing only in their reduced hash trees, we would have to generate new ArchiveTimeStamps for each of them in the case of time stamp renewal. The complexity in the case of hash tree renewal would be even more “fun”.
Big misunderstanding
I have reread the RFC 4998 and conclude that I have misunderstood a fundamental thing. To clarify I give some quotes from the RFC.
From the RFC:
Archived data object: A data unit that is archived.
Archived data object group: A set of two or more data objects, which for some reason belong together.
Archive Timestamp: A timestamp and typically lists of hash values, which allow the verification of the existence of several data objects at a certain time.
Actually, an Archive Timestamp contains a reduced hash tree for a single data object (or data object group) only. The corresponding ASN.1 clearly says it does not contain multiple reduced hash trees – only one or no reduced hash tree is allowed. I guess what is meant by “which allow the verification of the existence of several data objects” is that the time stamp in the Archive Timestamp can cover multiple data objects (or data object groups).
Evidence record: Collection of evidence compiled for one or more given archived data objects over time.
This seems to be unclear, too. The RFC later writes “An Evidence Record is a unit of data, which can be used to prove the existence of an archived data object or an archived data object group at a certain time.” An Evidence record can contain multiple Archive Timestamps, but it refers to a single data object (or data object group) only. The 2nd Archive Timestamp, 3rd Archive Timestamp, … refer to the 1st Archive Timestamp only (this covers the case of time stamp renewal)!
Conclusion
If I’m not wrong (again):
- An Archive Timestamp covers a single data object (or data object group) only.
- Therefore an Evidence record covers a single data object (or data object group) only.
- Thus – I guess the most important conclusion – an Evidence record is just a proof for a single data object (or data object group); it's not a data structure to store multiple data objects (or data object groups). One could also put it this way: an Evidence record is not a data structure to store evidence (in the sense of an evidence archive), it's a data structure to exchange/give evidence (in the sense of using it in a lawsuit).
Hence, I guess @veebee was/is absolutely right!
Therefore, I guess we have to rethink the BC implementation! At least the ERSArchiveTimeStampGenerator seems not to be correct.
Hence, I guess @veebee was/is absolutely right!
Of course, I'm always right :-)
Anyway, if needed I'd be happy to rework the implementation I made a while ago.
Section 4.2 Generation
- For each data group containing more than one document, its respective document hashes are binary sorted in ascending order, concatenated, and hashed. The hash values are the complete output from the hash algorithm, i.e., leading zeros are not removed, with the most significant bit first.
It's clearly saying multiple data groups are allowed.
The BSI XML data is also described as holding 3 data objects, one of which is a group, although the odd thing about this is that the group is stored as a hash only and is actually described by the first partial hash tree - so (strictly speaking) it's not necessary to store the group hash, as the sequence already connects the two hashes in the group.
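The quoted generation step can be sketched in a few lines (Python; SHA-256 as a stand-in, sorting on the raw digest bytes): because member hashes are binary-sorted before concatenation, the group hash is independent of the order in which the documents are supplied.

```python
import hashlib

def H(b):
    return hashlib.sha256(b).digest()

def group_hash(docs):
    # Section 4.2: hash each document, binary-sort the hashes ascending,
    # concatenate them, and hash the concatenation
    return H(b"".join(sorted(H(d) for d in docs)))

g1 = group_hash([b"bouncy", b"castle", b"rocks"])
g2 = group_hash([b"rocks", b"bouncy", b"castle"])
assert g1 == g2   # order of supply does not change the group hash
```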
At this point I actually expect that the authors meant:
"We get 3 reduced hash trees rht1 = SEQ(SEQ(h2abc, h1), SEQ(h3)), rht2 = SEQ(SEQ(h2b, h2c, h2a), SEQ(h1), SEQ(h3)), rht3 = SEQ(SEQ(h12, h3))."
to read
"We get 3 partial hash trees rht1 = SEQ(SEQ(h2abc, h1), SEQ(h3)), rht2 = SEQ(SEQ(h2b, h2c, h2a), SEQ(h1), SEQ(h3)), rht3 = SEQ(SEQ(h12, h3))."
As this would provide partial hash trees for validating each object/group. Of course, if you just store rht2 you can still validate everything. rht1 and rht3 are actually redundant - in this sense "reduced hash tree" is kind of correct (so any of them would do), although by their own definition rht3 would be wrong, as it assumes that (h2b, h2c, h2a) and h1 are a single group (which is why I suspect the sentence is talking about partial hash trees instead).
For the root of rht2 to equal the root of rht1 a left first traversal is required, which is why we do things the way we do. We don't store the redundant information as these things are designed to deal with potentially large data sets and adding the redundant information could lead to an explosion in the number of partial hash trees required as opposed to a steady linear increase in size in the reduced hash tree that is stored.
In section 4.2, the RFC does also say the reduced hash tree can just be a list of hashes - in our case you would construct that by putting all your data into a single group.
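That degenerate case can be sketched as follows (Python; SHA-256, sorted-concatenation group hash as above): with every document in a single group, the "reduced hash tree" collapses to one partial hash tree, i.e. a flat hash list, and the root is simply the group hash.

```python
import hashlib

def H(b):
    return hashlib.sha256(b).digest()

def node(hs):
    return hs[0] if len(hs) == 1 else H(b"".join(sorted(hs)))

docs = [b"hello", b"world", b"alice", b"bob"]
hashes = [H(d) for d in docs]

rht = [hashes]            # one partial hash tree holding every document hash
root = node(rht[0])       # the value the timestamp would cover
assert root == H(b"".join(sorted(hashes)))
```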
I think, because the RFC is too informal, there is so much confusion …
Section 4.2 Generation, 1. – 5. of part 1, describes how to generate the hash tree. It does not tell us to store the hash tree. It's a preparation for the following steps, 1. – 4. of part 2, which describe how to generate a reduced hash tree for a single selected data object. As said above, an ArchiveTimeStamp, by its ASN.1 structure, can store a single reduced hash tree only.
Note there are no restrictions on the quantity or length of hash-value lists. Also note that it is profitable but not required to build hash trees and reduce them. An Archive Timestamp may consist only of one list of hash-values and a timestamp or only a timestamp with no hash value lists.
This is, of course, weird. Why does the RFC call it reducedHashtree in
ArchiveTimeStamp ::= SEQUENCE {
    digestAlgorithm [0] AlgorithmIdentifier OPTIONAL,
    attributes      [1] Attributes OPTIONAL,
    reducedHashtree [2] SEQUENCE OF PartialHashtree OPTIONAL,
    timeStamp       ContentInfo }

PartialHashtree ::= SEQUENCE OF OCTET STRING
if one can store “only one list of hash-values” instead of a reducedHashtree? Furthermore, it would formally have to be a list of lists of hash values. Moreover, how does a validator know whether he has to interpret reducedHashtree as a reduced hash tree or as a simple list of hash values?
Take a look at the following tree, with its reduced hash trees (rht) (without loss of generality according to the binary order):
rht(h1) = SEQ( SEQ(h1,h2), SEQ(h34), SEQ(h5678) )
rht(h2) = rht(h1)
rht(h3) = SEQ( SEQ(h3,h4), SEQ(h12), SEQ(h5678) )
rht(h4) = rht(h3)
rht(h5) = SEQ( SEQ(h5,h6), SEQ(h78), SEQ(h1234) )
rht(h6) = rht(h5)
rht(h7) = SEQ( SEQ(h7,h8), SEQ(h56), SEQ(h1234) )
rht(h8) = rht(h7)
Obviously, there is no reduced hash tree that contains all hi (i = 1, …, 8). The case rht1 = SEQ(SEQ(h2abc, h1), SEQ(h3)) above only contains all hash values because it's a “special” case. Thus, in general there is no reduced hash tree containing all data hash values.
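The eight reduced hash trees listed above can be checked mechanically (a sketch; SHA-256, inner nodes built by the sorted-concatenation rule, leaf contents are arbitrary stand-in bytes): each one reproduces the same root, and none of them contains all eight leaf hashes.

```python
import hashlib

def H(b):
    return hashlib.sha256(b).digest()

def node(hs):
    return H(b"".join(sorted(hs)))

h = [H(bytes([i])) for i in range(1, 9)]           # h1 .. h8
h12, h34 = node(h[0:2]), node(h[2:4])
h56, h78 = node(h[4:6]), node(h[6:8])
h1234, h5678 = node([h12, h34]), node([h56, h78])
root = node([h1234, h5678])

# rht(h1) = SEQ( SEQ(h1,h2), SEQ(h34), SEQ(h5678) ), etc.
rhts = [
    [[h[0], h[1]], [h34], [h5678]],   # rht(h1) = rht(h2)
    [[h[2], h[3]], [h12], [h5678]],   # rht(h3) = rht(h4)
    [[h[4], h[5]], [h78], [h1234]],   # rht(h5) = rht(h6)
    [[h[6], h[7]], [h56], [h1234]],   # rht(h7) = rht(h8)
]

def root_from_rht(rht):
    acc = node(rht[0])
    for pht in rht[1:]:
        acc = node([acc] + pht)
    return acc

assert all(root_from_rht(r) == root for r in rhts)
# No reduced hash tree carries all eight leaf hashes (only two each):
assert all(len([x for pht in r for x in pht if x in h]) < 8 for r in rhts)
```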
Furthermore, section 4.3 Verification, states
An Archive Timestamp shall prove that a data object existed at a certain time, given by timestamp. This can be verified as follows:
- Calculate hash value h of the data object with hash algorithm H given in field digestAlgorithm of the Archive Timestamp.
- Search for hash value h in the first list (partialHashtree) of reducedHashtree. If not present, terminate verification process with negative result.
As one can see, for example, in the first list of rht1 = SEQ(SEQ(h2abc, h1), SEQ(h3)), i.e. SEQ(h2abc, h1), there is no h3. Thus if we test for h3, we get a negative result. This is a contradiction if one assumes that an ArchiveTimeStamp contains all data object hash values! Btw. BC’s void checkContainsHashValue(final byte[] hash, final DigestCalculator digCalc) iterates over all partialHashtrees, whereas 2. clearly states to look at the first list (partialHashtree) only.
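The contradiction can be shown directly (sketch; same conventions as the RFC example, with d1, d2a, d2b, d2c, d3 as stand-in byte strings and h2abc as the group hash of the three members): h3 fails the literal first-list test against rht1, even though h3 is present elsewhere in rht1.

```python
import hashlib

def H(b):
    return hashlib.sha256(b).digest()

def node(hs):
    return hs[0] if len(hs) == 1 else H(b"".join(sorted(hs)))

h1, h3 = H(b"d1"), H(b"d3")
h2abc = node([H(b"d2a"), H(b"d2b"), H(b"d2c")])
root = node([node([h2abc, h1]), h3])

rht1 = [[h2abc, h1], [h3]]
# Step 2 of section 4.3, read literally: look in the first list only.
assert h1 in rht1[0]          # d1 verifies
assert h3 not in rht1[0]      # d3 would be rejected ...
assert h3 in rht1[1]          # ... although its hash is present in rht1
```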
Again, I conclude as above: an ArchiveTimeStamp is for a single data object only. Hence an Evidence Record is for a single data object only. Thus an Evidence Record is for exchange only (not for storage).
@veebee, did you find the conversation with T. Gondrom? Does it clarify if an ArchiveTimeStamp is used for a single data object or for all data? Maybe I can try to contact T. Gondrom and talk to him in German …
@veebee, did you find the conversation with T. Gondrom? Does it clarify if an ArchiveTimeStamp is used for a single data or for all data? Maybe, I can try to contact T. Gondrom and talk to him in German …
I'm afraid not. Seems I contacted him using my previous work email... I will try to get some input from people who worked on the eIDAS long-term preservation scheme and the associated ETSI standards.
I think we need to reset this.
@veebee you are trying to say that the scenario described in Section 4.2 should produce 3 evidence records. Is that correct? Actually looking back that's clearly the case.
The following:
- Search for hash value h in the first list (partialHashtree) of reducedHashtree. If not present, terminate verification process with negative result.
is quite compelling.
I've made some changes to correctly accommodate 4.3 section 2, which as expected led to multiple evidence records. ERSTest has been updated accordingly. The updates should be on github shortly and include timestamp and hash renewal.
Fortunately it's easy to split the old style evidence record if required to create correct new ones, so while the change to the API is not the best news, at least people won't have to get access to the original data if they need to update.
Let me know your thoughts.
Found an issue with sorting on larger data sets. Fix for that now pushed as well.
I think we need to reset this.
@veebee you are trying to say that the scenario described in Section 4.2 should produce 3 evidence records. Is that correct? Actually looking back that's clearly the case.
Yep, absolutely.
I'll have a look later today.
Thanks - while you're at it, would you check BinaryTreeRootCalculator.java? I found an error in it which I believe I have fixed as well. As I mentioned to @raubv0gel, I'd like to get my hands on any tests/test data I can. It's become quite obvious that "stressing" this particular code is quite important.
If it's any help, there is now a new beta 172b04 on https://downloads.bouncycastle.org/betas/ which has updated jars including the latest changes.
A quick test with 8 data objects d_i (i = 1, …, 8) yields the following: for each i, the ER(d_i) contains a hash list of all {d_j | j = 1, …, 8}, not the reduced hash tree of d_i. We really should implement proper reduced hash trees – the root hash calculation is much faster in the case of reduced hash trees compared to building up the complete hash tree from the hash list of all data objects. We should think of thousands of data objects …
Do you want to let me have a go at implementing proper reduced hash tree generation?
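For what it's worth, here is a minimal sketch of per-leaf reduced hash tree generation (Python; assumes a power-of-two leaf count and inner nodes built by sorted concatenation; none of this is the BC API): build the tree level by level, then collect, for each leaf, its sibling plus one sibling node per upper level.

```python
import hashlib

def H(b):
    return hashlib.sha256(b).digest()

def node(hs):
    return H(b"".join(sorted(hs)))

def build_levels(leaves):
    # levels[0] = leaves, levels[-1] = [root]
    levels = [list(leaves)]
    while len(levels[-1]) > 1:
        cur = levels[-1]
        levels.append([node(cur[i:i + 2]) for i in range(0, len(cur), 2)])
    return levels

def reduced_hash_tree(levels, idx):
    # First partial hash tree: the leaf and its sibling.
    rht = [[levels[0][idx], levels[0][idx ^ 1]]]
    # Then one sibling node per inner level, up to (not including) the root.
    i = idx // 2
    for lvl in levels[1:-1]:
        rht.append([lvl[i ^ 1]])
        i //= 2
    return rht

def root_from_rht(rht):
    acc = node(rht[0])
    for pht in rht[1:]:
        acc = node([acc] + pht)
    return acc

leaves = [H(bytes([i])) for i in range(8)]
levels = build_levels(leaves)
root = levels[-1][0]
# Every leaf's reduced hash tree reconstructs the same root
assert all(root_from_rht(reduced_hash_tree(levels, i)) == root for i in range(8))
```

Generating all n reduced hash trees this way costs nothing beyond the one tree build; each stored tree then needs only log2(n) + 1 hash values.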