SPDX Compare Time Out
Hello Team
We are working on comparing SPDX files using Java-Tool by generating one from zephyr and other one from previous release
Looks like compare never finishes and time out
Any one facing similar issue or following any other methods which succeeded the compare ?
There have been some performance issues in the past when deleting SPDX elements, but that shouldn't impact the compare.
Compare is very compute intensive, but it shouldn't be timing out. Since dependencies can have cycles, I wonder if there isn't an infinite loop that not being handled properly.
The unit tests run through a number of compare scenarios, so this issue may be related to the size of the SPDX files or something unique about the SPDX files.
If you can point me to two SPDX files that reproduce the problem, I'll take a look.
I did some analysis and debugging. There are 44,595 elements which are in the documentDescribes list.
The document describes should only point to a few elements - the "root of the tree", typically less than 10 elements, so this is unusually large.
To make sure these match, the algorithm iterates through each element and makes sure it finds an equivalent element in the comparison documentDescribes list. On my machine, it is taking about 1 second for each element search, or 0.02 ms for each individual element comparison. The equivalent comparison can be quite involved, so the individual comparison cost seems reasonable to me. The issue is more the size of the lists being compared and possibly the algorithm for the list comparison.
We may be able to improve the algorithm, but I would not expect any of the collections to be larger than 5,000 or so in practice.
I did some more analysis and experimentation to see if I could dramatically improve the performance of comparing very large SPDX element collections.
TL;DR - no success yet, but some interesting findings.
Following are some of the constraints when doing the compares:
- The collections do not have any guaranteed order
- When comparing elements, we use the
equivalentmethods to state the 2 elements are the same only if all their properties are the same - Access to the element properties are stateless - if another process or thread changes a property value, the next time that property is accessed, the new value will show up - no stale values
Comparing all the property values can take some time, especially if a property value is also a collection of elements. We can get a recursive stack of comparing very large number of elements.
One thing I noticed in my experimentation is we tend to compare the same elements over and over (e.g. the collections being compared may be a property of another element which is being compared).
If we relax the stateless access (3 above) for this use case, we can keep track of the comparisons in a map.
I created prototype using this approach (reference this draft pull request).
Unfortunately, we now have a memory issue - I ran out of heap space running the comparison above.
I'm not quite sure yet what is consuming the memory. The maximum size of the map will be total number of elements to the factor of the number of documents. Since it is a Boolean map, I wouldn't expect it to be large enough to run us out of heap space. More analysis needs to be done.
If anyone is interested in looking into this, let me know.
If we can guarantee the collection (of differences) is sorted, will it help the comparison?
The sorting order is a desired enhancement from this https://github.com/spdx/Spdx-Java-Library/issues/232 too. So looks like a good investment.
If we can guarantee the collection (of differences) is sorted, will it help the comparison?
The sorting order is a desired enhancement from this spdx/Spdx-Java-Library#232 too. So looks like a good investment.
Depending on the store implementation, sorting may not be possible. For example, RDF does not guarantee order.