guac
[ingestion/data-quality issue] JSON lines files are rejected
Describe the bug
When trying to ingest a file of JSON lines of predicates, ingestion fails because the document type is unknown.
To Reproduce
- Create a file to collect with the contents:
lumb@lumb:~/ssci_metadata/sample_jsonl$ cat lines.intoto.jsonl
{"payloadType":"application/vnd.in-toto+json","payload":"eyJfdHlwZSI6Imh0dHBzOi8vaW4tdG90by5pby9TdGF0ZW1lbnQvdjAuMSIsInN1YmplY3QiOlt7Im5hbWUiOiJjdXJsLTcuNzIuMC50YXIuYnoyIiwiZGlnZXN0Ijp7InNoYTI1NiI6ImQ0ZDU4OTlhMzg2OGZiYjZhZTE4NTZjM2U1NWEzMmNlMzU5MTNkZTM5NTZkMTk3M2NhY2NkMzdiZDAxNzRmYTIifX1dLCJwcmVkaWNhdGVUeXBlIjoiaHR0cHM6Ly9zbHNhLmRldi9wcm92ZW5hbmNlL3YwLjEiLCJwcmVkaWNhdGUiOnsiYnVpbGRlciI6eyJpZCI6Im1haWx0bzpwZXJzb25AZXhhbXBsZS5jb20ifSwicmVjaXBlIjp7InR5cGUiOiJodHRwczovL2V4YW1wbGUuY29tL01ha2VmaWxlIiwiZGVmaW5lZEluTWF0ZXJpYWwiOjAsImVudHJ5UG9pbnQiOiJzcmM6Zm9vIn0sIm1ldGFkYXRhIjp7ImJ1aWxkSW52b2NhdGlvbklkIjoiU29tZUJ1aWxkSWQiLCJidWlsZFN0YXJ0ZWRPbiI6IjE5ODYtMTItMThUMTU6MjA6MzArMDg6MDAiLCJidWlsZEZpbmlzaGVkT24iOiIxOTg2LTEyLTE4VDE2OjIwOjMwKzA4OjAwIiwiY29tcGxldGVuZXNzIjp7ImFyZ3VtZW50cyI6dHJ1ZSwiZW52aXJvbm1lbnQiOmZhbHNlLCJtYXRlcmlhbHMiOnRydWV9LCJyZXByb2R1Y2libGUiOmZhbHNlfSwibWF0ZXJpYWxzIjpbeyJ1cmkiOiJodHRwczovL2V4YW1wbGUuY29tL2V4YW1wbGUtMS4yLjMudGFyLmd6IiwiZGlnZXN0Ijp7InNoYTI1NiI6IjEyMzQuLi4ifX1dfX0=","signatures":[{"sig":"MIGIAkIBA9e9+cgYpo46iIOpRKhDCE+tOBtUDKlZsdKP70EGze5yvb8pOAH1i85T8bgvO70qai6kGMl6gSsAWoa05lBT3QACQgHMmDi9bs4CyFC3Ed7EgKPNgEVW9iLGFfoZRjjXHxx6leEyZc9lFRUzrKZkV+fiEg5a1bNeEtgLTz2aPH4ipUnIaA==","keyid":"MyKey"}]}
{"payloadType":"application/vnd.in-toto+json","payload":"eyJfdHlwZSI6Imh0dHBzOi8vaW4tdG90by5pby9TdGF0ZW1lbnQvdjAuMSIsInN1YmplY3QiOlt7Im5hbWUiOiJjdXJsLTcuNzIuMC50YXIuYnoyIiwiZGlnZXN0Ijp7InNoYTI1NiI6ImQ0ZDU4OTlhMzg2OGZiYjZhZTE4NTZjM2U1NWEzMmNlMzU5MTNkZTM5NTZkMTk3M2NhY2NkMzdiZDAxNzRmYTIifX1dLCJwcmVkaWNhdGVUeXBlIjoiaHR0cHM6Ly9zbHNhLmRldi9wcm92ZW5hbmNlL3YwLjEiLCJwcmVkaWNhdGUiOnsiYnVpbGRlciI6eyJpZCI6Im1haWx0bzpwZXJzb25AZXhhbXBsZS5jb20ifSwicmVjaXBlIjp7InR5cGUiOiJodHRwczovL2V4YW1wbGUuY29tL01ha2VmaWxlIiwiZGVmaW5lZEluTWF0ZXJpYWwiOjAsImVudHJ5UG9pbnQiOiJzcmM6Zm9vIn0sIm1ldGFkYXRhIjp7ImJ1aWxkSW52b2NhdGlvbklkIjoiU29tZUJ1aWxkSWQiLCJidWlsZFN0YXJ0ZWRPbiI6IjE5ODYtMTItMThUMTU6MjA6MzArMDg6MDAiLCJidWlsZEZpbmlzaGVkT24iOiIxOTg2LTEyLTE4VDE2OjIwOjMwKzA4OjAwIiwiY29tcGxldGVuZXNzIjp7ImFyZ3VtZW50cyI6dHJ1ZSwiZW52aXJvbm1lbnQiOmZhbHNlLCJtYXRlcmlhbHMiOnRydWV9LCJyZXByb2R1Y2libGUiOmZhbHNlfSwibWF0ZXJpYWxzIjpbeyJ1cmkiOiJodHRwczovL2V4YW1wbGUuY29tL2V4YW1wbGUtMS4yLjMudGFyLmd6IiwiZGlnZXN0Ijp7InNoYTI1NiI6IjEyMzQuLi4ifX1dfX0=","signatures":[{"sig":"MIGIAkIBA9e9+cgYpo46iIOpRKhDCE+tOBtUDKlZsdKP70EGze5yvb8pOAH1i85T8bgvO70qai6kGMl6gSsAWoa05lBT3QACQgHMmDi9bs4CyFC3Ed7EgKPNgEVW9iLGFfoZRjjXHxx6leEyZc9lFRUzrKZkV+fiEg5a1bNeEtgLTz2aPH4ipUnIaA==","keyid":"MyKey"}]}
- Start guacgql in a process
- Run: bin/guacone collect files ~/ssci_metadata/sample_jsonl/
- Get the error message shown below.
Expected behavior
An unknown document type should be expected while the file is still being parsed, so processing should not fail at that stage (https://github.com/guacsec/guac/blob/main/pkg/handler/processor/process/process.go#L206).
Perhaps for JSON_LINES we could create a new DocumentType called Opaque, which would skip validation?
Screenshots
lumb@lumb:~/git/guac$ bin/guacone collect files ~/ssci_metadata/sample_jsonl/
{"level":"info","ts":1728571875.5916176,"caller":"logging/logger.go:79","msg":"Logging at info level","guac-version":"v0.9.1"}
{"level":"info","ts":1728571875.5923378,"caller":"cli/init.go:65","msg":"Using config file: /usr/local/google/home/lumb/git/guac/guac.yaml","guac-version":"v0.9.1"}
{"level":"error","ts":1728571875.5989473,"caller":"collector/collector.go:108","msg":"emit error: unable to ingest document: unable to process doc: invalid document format type: JSON_LINES, format: JSON_LINES, document: UNKNOWN","guac-version":"v0.9.1","documentHash":"sha256_53990c51d4765c7fddbff145b82aff94468b9c8c014bdda14ece36ac4ae05fc7","stacktrace":"github.com/guacsec/guac/pkg/handler/collector.Collect\n\t/usr/local/google/home/lumb/git/guac/pkg/handler/collector/collector.go:108\ngithub.com/guacsec/guac/cmd/guacone/cmd.init.func7\n\t/usr/local/google/home/lumb/git/guac/cmd/guacone/cmd/files.go:151\ngithub.com/spf13/cobra.(*Command).execute\n\t/usr/local/google/home/lumb/go/pkg/mod/github.com/spf13/[email protected]/command.go:989\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/usr/local/google/home/lumb/go/pkg/mod/github.com/spf13/[email protected]/command.go:1117\ngithub.com/spf13/cobra.(*Command).Execute\n\t/usr/local/google/home/lumb/go/pkg/mod/github.com/spf13/[email protected]/command.go:1041\ngithub.com/guacsec/guac/cmd/guacone/cmd.Execute\n\t/usr/local/google/home/lumb/git/guac/cmd/guacone/cmd/root.go:57\nmain.main\n\t/usr/local/google/home/lumb/git/guac/cmd/guacone/main.go:23\nruntime.main\n\t/usr/lib/google-golang/src/runtime/proc.go:272"}
{"level":"info","ts":1728571875.5990932,"caller":"cmd/files.go:144","msg":"collector ended gracefully","guac-version":"v0.9.1"}
{"level":"fatal","ts":1728571875.5991073,"caller":"cmd/files.go:156","msg":"completed ingestion with error, 0 of 1 were successful - the following files did not ingest successfully: file:////usr/local/google/home/lumb/ssci_metadata/sample_jsonl/lines.intoto.jsonl","guac-version":"v0.9.1","stacktrace":"github.com/guacsec/guac/cmd/guacone/cmd.init.func7\n\t/usr/local/google/home/lumb/git/guac/cmd/guacone/cmd/files.go:156\ngithub.com/spf13/cobra.(*Command).execute\n\t/usr/local/google/home/lumb/go/pkg/mod/github.com/spf13/[email protected]/command.go:989\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/usr/local/google/home/lumb/go/pkg/mod/github.com/spf13/[email protected]/command.go:1117\ngithub.com/spf13/cobra.(*Command).Execute\n\t/usr/local/google/home/lumb/go/pkg/mod/github.com/spf13/[email protected]/command.go:1041\ngithub.com/guacsec/guac/cmd/guacone/cmd.Execute\n\t/usr/local/google/home/lumb/git/guac/cmd/guacone/cmd/root.go:57\nmain.main\n\t/usr/local/google/home/lumb/git/guac/cmd/guacone/main.go:23\nruntime.main\n\t/usr/lib/google-golang/src/runtime/proc.go:272"}
GUAC version
head