[feature] Add new field to store document reference for persistent storage
Is your feature request related to a problem? Please describe.
With the addition of the blob store, GUAC can now store the documents that were collected and ingested into it. This allows users to find the original document (or re-ingest it in the case of failure) if they need to. In order to accurately determine the location of a document in the blob store, a document reference (the blob store key) needs to be stored in the DB.
Currently, replacing the origin with this information, as we did in https://github.com/guacsec/guac/pull/1811, would result in the loss of the original location from which the document originated. We should add a new field to preserve the existing functionality.
This would require an update to the GraphQL schema to add a new field called documentRef that would be empty if the persistent blob store is not being used. We would also add a corresponding DocumentRef field to the SourceInformation struct in Document:
type Document struct {
    Blob              []byte
    Type              DocumentType
    Format            FormatType
    Encoding          EncodingType
    SourceInformation SourceInformation
}

// SourceInformation provides additional information about where the document comes from
type SourceInformation struct {
    // Collector describes the name of the collector providing this information
    Collector string
    // Source describes the source from which the collector got this information
    Source string
    // DocumentRef describes the location of the document in the blob store
    DocumentRef string
}
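For illustration, here is a minimal sketch of how an ingestion point could populate the new field when persisting a document. The persistDocument name, the put callback, and the sha256-based key scheme are assumptions for this example, not GUAC's actual blob store API:

// A minimal sketch, using the Document and SourceInformation types above.
// The put callback and the sha256-based key scheme are illustrative only.
// Assumes imports: "context", "crypto/sha256", "fmt".
func persistDocument(ctx context.Context, doc *Document, put func(ctx context.Context, key string, data []byte) error) error {
    // Derive a content-addressed key so identical documents map to the same blob.
    key := fmt.Sprintf("sha256:%x", sha256.Sum256(doc.Blob))
    if err := put(ctx, key, doc.Blob); err != nil {
        return err
    }
    // Record the blob store key so the original document can be located later.
    // When no persistent blob store is configured, DocumentRef stays empty.
    doc.SourceInformation.DocumentRef = key
    return nil
}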
We could either do this for all "verbs" or only for non-ephemeral documents like hasSBOM, hasSLSA, and CertifyVEX.
In the future, though, we may want to move to a more in-toto attestation-specific ingestion for verbs like certifyBad, certifyGood, hasSourceAt, and pkgEquals, to name a few, so that we have evidence that something was attested by someone at a given time. We would again store this in the blob store for record keeping.
Describe the solution you'd like
Update the GraphQL schema to add a new field called documentRef that would be empty if the persistent blob store is not being used.
A decision needs to be made on whether this should be done for all "verbs" or only a select few.
Describe alternatives you've considered
We could concatenate the strings together with a separator value (maybe something like https://github.com/guacsec/guac/blob/main/pkg/assembler/helpers/package.go#L27 so that it does not interfere with anything in the future)?
For example:
mcr.microsoft.com/oss/kubernetes/kubectl@sha256:8035089a59a6f8577255f494c1ced250e1206667d8462869fc0deeca98d79427guac-empty-@@sha256:8534561615616161894984126517
That way we could split the string apart in the future and get back both the original source and the new blob store key.
However, following this method, we would lose the ability to query by origin, which is a loss of functionality in GUAC.
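To make the trade-off concrete, here is a rough sketch of what this concatenation alternative would look like (the sourceSeparator value and helper names are hypothetical, not actual GUAC code):

// Illustrative only; the separator value mirrors the "guac-empty" convention
// linked above but is not an actual GUAC constant.
const sourceSeparator = "guac-empty-@@"

// joinSourceRef packs the original source and the blob store key into one string.
func joinSourceRef(source, documentRef string) string {
    return source + sourceSeparator + documentRef
}

// splitSourceRef recovers both values. It misbehaves if the separator ever
// appears inside the source itself, which is the escaping hazard raised below.
// Assumes import: "strings".
func splitSourceRef(combined string) (source string, documentRef string, found bool) {
    return strings.Cut(combined, sourceSeparator)
}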
cc @nchelluri
I say we do it for all verbs. It's a bit of extra work, but it helps not just from an evidence perspective but also from a debuggability and maintainability perspective for GUAC.
We can use the blobs, whether they are API responses, documents, or something else, to help debug when something goes wrong, since we have the content saved. It also helps if we parse a document, API request/response, etc. differently in the future and need to re-parse the data.
+1 on doing it for all verbs.
We should not combine fields with separators. I think doing so would open us up to issues when we need to escape the separator further down the line.
+1, adding this to SourceInformation looks good.
"This will allow users to find the original document (or re-ingest it in the case of failure) if they need to."
I would say that this will not quite directly meet the use case of "re-ingest in the case of failure". There needs to be another solution for that, something more along the lines of pub/sub with a reprocessing pipeline, but that seems out of scope for this issue.
As an aside, one of the issues we've run into before with one of our other projects that has a similar data pipeline is that reprocessing raises the issue of duplicates (@mdeicas has been looking at this).
"I would say that this will not quite directly meet the use case of 're-ingest in the case of failure'."
Oh yes, this is not the solution to "re-ingest in case of a failure". This is for when your database blows up and you have to start from scratch.
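As a sketch of that scenario (all names and callbacks here are hypothetical, not GUAC's actual pipeline API), a rebuild could walk the stored blob keys and re-ingest each document:

// Hypothetical recovery sketch: rebuild from the blob store after total data
// loss. The keys list and the get and ingest callbacks stand in for real
// pipeline code; Type, Format, and Encoding would be re-detected during parsing.
func reingestAll(ctx context.Context, keys []string,
    get func(ctx context.Context, key string) ([]byte, error),
    ingest func(doc *Document) error) error {
    for _, key := range keys {
        blob, err := get(ctx, key)
        if err != nil {
            return err
        }
        doc := &Document{
            Blob:              blob,
            SourceInformation: SourceInformation{DocumentRef: key},
        }
        if err := ingest(doc); err != nil {
            return err
        }
    }
    return nil
}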
"As an aside, one of the issues we've run into before ... is that reprocessing raises the issue of duplicates."
Interesting, if there are lessons learned we can apply here, that would be great.
Sorry for the late response, but I think the lesson learned is that an ingestion pipeline may have been designed with the assumption of only ingesting documents once, or otherwise to be idempotent, and so it won't support re-ingesting documents to pick up new parsing features. It might be prudent to document this somewhere for clients?
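To illustrate that pitfall with a hypothetical guard (none of this is GUAC code): a pipeline that deduplicates purely on document identity will never re-parse old documents:

// Hypothetical illustration of the pitfall: deduplicating purely on the blob
// key makes ingestion idempotent, but also silently drops re-ingestion
// attempts meant to pick up new parsing features.
var processed = map[string]bool{}

func shouldIngest(doc *Document) bool {
    key := doc.SourceInformation.DocumentRef
    if processed[key] {
        return false // dropped, even if the parser has since improved
    }
    processed[key] = true
    return true
}

// One illustrative mitigation: include a parser version in the dedup key,
// e.g. key := doc.SourceInformation.DocumentRef + "@" + parserVersion.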
Hmm, that is an interesting case. I added it to our agenda to discuss.