ensure that conflicting statements are reconciled only once
Scenario
A statement s_1 and a statement s_2, from sources S_1 and S_2 respectively, are conflicting. With some policy (or via manual intervention) they are reconciled and the result is statement s_3.
The next day, kaybee merge is run again. How should the same conflict be handled? Should we keep a pointer to the last commit that was considered from each source, so that reconciliation is not repeated on the exact same set of conflicting statements?
Proposed solution (thanks @henrikplate !):
Keep track of the time when the statements about a given vulnerability were last reconciled; then compare the timestamp of each candidate statement with that last-resolution timestamp. If nothing has changed since then, do not attempt reconciliation again.
This is not just an optimization that avoids a computation that would produce the same result anyway: it is needed above all to avoid asking for human intervention again in cases where a manual reconciliation was already applied.
Of course, it should be possible to force the reconciliation (=reset the timestamp), so that the user has a chance to overrule the previous decisions.
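To make the proposal concrete, here is a minimal sketch of the per-vulnerability timestamp check, including the forced reset. It is only an illustration: ReconciliationLog and its method names are hypothetical and not part of kaybee, and timestamps are assumed to be plain numbers (e.g. Unix epoch seconds).

```python
from dataclasses import dataclass, field


@dataclass
class ReconciliationLog:
    """Hypothetical store: vulnerability ID -> time of the last reconciliation."""

    last_reconciled: dict = field(default_factory=dict)

    def needs_reconciliation(self, vuln_id, candidate_timestamps):
        last = self.last_reconciled.get(vuln_id)
        if last is None:
            # Never reconciled, or the timestamp was reset: reconcile.
            return True
        # Reconcile again only if some candidate changed after the last resolution.
        return any(ts > last for ts in candidate_timestamps)

    def mark_reconciled(self, vuln_id, now):
        self.last_reconciled[vuln_id] = now

    def force(self, vuln_id):
        # Dropping the timestamp makes the next run reconcile this vulnerability again.
        self.last_reconciled.pop(vuln_id, None)
```

With this shape, forcing one vulnerability does not touch the records of the others.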
Wouldn't it be possible to just keep the timestamp of the last run of kaybee (instead of the last reconciliation of a given vulnerability)?
Yes, and that's easier to implement, but the finer-grained approach might be useful in practice:
If we have one last_reconciled timestamp per vulnerability, we can set that timestamp to 0 and reconcile just that vulnerability again. If we do the same with the last_kaybee_run timestamp, all reconciliations will be considered obsolete and will be re-run. Do you agree?
I thought you could force the merge for a single vulnerability with something like kaybee merge CVE-0123-4567 --force, thereby ignoring the last_kaybee_run timestamp.
Anyway, together with the timestamp(s) you should also remember the sources for which a past merge was done. Otherwise, the addition or removal of sources may not be reflected.
I need to think about it, but you're probably right: this simpler approach might be all we need.
As for the second part: during and after merging, the statement is annotated with some details on where it came from (basically it becomes statement + source).
Regarding sources: does this mean you need the result of the last merge in order to know the source(s) it originated from? If so, you could take the timestamp of that file as last_reconciled.
For each "reconciled statement" I keep the full list of candidate statements that were considered, and for each of them I keep the information about the source it came from. Right now this info is only saved to a file for inspection purposes, but I can put it in a more long-lived data store (an embedded sqlite db could be a good solution).
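As a rough sketch of what such an embedded store could look like, the snippet below uses Python's standard sqlite3 module. The table layout and the function names (open_store, record_merge, sources_for) are assumptions made for illustration; statements are represented only by their hashes.

```python
import sqlite3

# Hypothetical schema: one row per reconciled statement, plus one row per
# candidate statement that was considered, annotated with its source.
SCHEMA = """
CREATE TABLE IF NOT EXISTS reconciled (
    vuln_id     TEXT PRIMARY KEY,
    result_hash TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS candidate (
    vuln_id        TEXT NOT NULL,
    statement_hash TEXT NOT NULL,
    source         TEXT NOT NULL,
    PRIMARY KEY (vuln_id, statement_hash, source)
);
"""


def open_store(path=":memory:"):
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def record_merge(conn, vuln_id, result_hash, candidates):
    """Store the outcome of one merge; candidates is a list of (statement_hash, source)."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO reconciled (vuln_id, result_hash) VALUES (?, ?)",
            (vuln_id, result_hash),
        )
        # Replace the candidate list recorded by any previous merge.
        conn.execute("DELETE FROM candidate WHERE vuln_id = ?", (vuln_id,))
        conn.executemany(
            "INSERT INTO candidate (vuln_id, statement_hash, source) VALUES (?, ?, ?)",
            [(vuln_id, h, s) for h, s in candidates],
        )


def sources_for(conn, vuln_id):
    """Return the sorted list of sources that contributed to the last merge."""
    rows = conn.execute(
        "SELECT DISTINCT source FROM candidate WHERE vuln_id = ? ORDER BY source",
        (vuln_id,),
    )
    return [row[0] for row in rows]
```

Keeping the sources next to the candidates means an added or removed source shows up as a difference on the next run.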
Why a single last_run timestamp is not enough, and why we need finer, per-vulnerability timestamps:
Imagine we run kaybee merge -p <policyA> (for example, policyA could be strict): the result is that some statements are merged and some are skipped (because the merge policy cannot reconcile them). In this case, I expect that the user can afterwards run kaybee merge -p <policyB> (for example, policyB could be manual), in which case the statements that were previously reconciled are left alone and only the remaining, non-reconciled ones are considered in the new merge operation.
Makes sense?
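A small sketch of that two-pass behaviour, assuming reconciliation state is tracked per vulnerability. The policies here are hypothetical stand-ins (strict_policy succeeds only when all candidates agree, pick_last imitates a manual choice) and do not reflect how kaybee's policies are actually implemented.

```python
def strict_policy(candidates):
    # Stand-in for a strict policy: succeed only if all candidates agree.
    return candidates[0] if len(set(candidates)) == 1 else None


def pick_last(candidates):
    # Stand-in for a manual decision: here the "user" simply picks the last one.
    return candidates[-1]


def merge(statements, policy, reconciled):
    """Apply policy to every vulnerability not yet in reconciled.

    statements maps a vulnerability ID to its list of candidate statements.
    reconciled is updated in place; the returned dict holds what was skipped.
    """
    skipped = {}
    for vuln_id, candidates in statements.items():
        if vuln_id in reconciled:
            continue  # reconciled by an earlier run: leave it alone
        result = policy(candidates)
        if result is None:
            skipped[vuln_id] = candidates
        else:
            reconciled[vuln_id] = result
    return skipped
```

Running merge first with strict_policy and then with pick_last on the same reconciled dict only asks the second policy about the statements the first one skipped.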
A simpler solution, with no timestamps involved:
When reconciling statements, we can simply check whether we have already considered that set of candidates in a previous (successful) reconciliation. If the set is different, we need to reconcile; otherwise we can skip. Note: the set also counts as different when a candidate disappears, in which case we need to decide what the right thing to do is.
Checking whether a statement has been seen before requires saving all Statement hashes (we have a hash() method) in a searchable mergelog. No timestamps are needed with this solution (also, how would they help?).
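A minimal sketch of the digest-based check, assuming each candidate statement is already represented by the string returned by its hash() method. MergeLog and candidate_set_digest are hypothetical names, and a real mergelog would be persisted rather than kept in memory.

```python
import hashlib


def candidate_set_digest(statement_hashes):
    # Sort and de-duplicate so the digest depends on the set, not on the order.
    joined = "\n".join(sorted(set(statement_hashes)))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


class MergeLog:
    """Remembers, per vulnerability, the last successfully reconciled candidate set."""

    def __init__(self):
        self._seen = {}  # vulnerability ID -> digest of the candidate set

    def already_reconciled(self, vuln_id, statement_hashes):
        return self._seen.get(vuln_id) == candidate_set_digest(statement_hashes)

    def record(self, vuln_id, statement_hashes):
        self._seen[vuln_id] = candidate_set_digest(statement_hashes)
```

Any added, removed, or modified candidate changes the digest, so the vulnerability is picked up for reconciliation again; forcing a re-run would amount to deleting its entry.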
@henrikplate could you please comment?
AFAIK, you already produce a merge log. Considering sets of previously considered statements also, intuitively, covers the case of added/removed sources. All in all, I think it makes sense to proceed that way. And, as you say, once you work with digests, we would not need any timestamps any more.