Loader is failing with JSON issues.
The loader function is failing.
The last successful import was 2022-03-03 18:02:27 UTC.
Digging through the logs I find the following error:
additionalErrors: [
0: {
code: 3
message: "Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 211; errors: 1. Please look into the errors[] collection for more details."
}
1: {
code: 3
message: "Error while reading data, error message: JSON processing encountered too many errors, giving up. Rows: 209621; errors: 1; max bad: 0; error percent: 0"
}
2: {
code: 3
message: "Error while reading data, error message: JSON parsing error in row starting at position 0: Row size is larger than: 104857600."
}]
error: {
code: 3
message: "Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 211; errors: 1. Please look into the errors[] collection for more details."
}
state: "DONE"
}
$ gsutil ls -l 'gs://ossf-malware-analysis-results/**' > allpackagewithsizes.out
$ cat allpackagewithsizes.out | sort -n | tail
64417903 2022-02-02T22:19:07Z gs://ossf-malware-analysis-results/pypi/pqrs/0.8.0.json
65228324 2022-03-03T16:52:06Z gs://ossf-malware-analysis-results/pypi/foodx-devops-tools/0.13.1.post1.json
65338726 2022-02-23T17:42:14Z gs://ossf-malware-analysis-results/pypi/foodx-devops-tools/0.13.1.json
65628551 2022-02-11T16:28:02Z gs://ossf-malware-analysis-results/pypi/ansible-kernel/1.0.0.json
125303369 2022-03-03T19:58:44Z gs://ossf-malware-analysis-results/npm/@nuskin/ns-signup/3.34.289.json
125361942 2022-03-07T22:51:09Z gs://ossf-malware-analysis-results/npm/@nuskin/ns-signup/3.34.290.json
125365214 2022-03-08T00:26:53Z gs://ossf-malware-analysis-results/npm/@nuskin/ns-signup/3.34.291.json
125408813 2022-03-03T19:22:35Z gs://ossf-malware-analysis-results/npm/@nuskin/ns-aem/3.34.289.json
125469811 2022-03-07T22:52:29Z gs://ossf-malware-analysis-results/npm/@nuskin/ns-aem/3.34.290.json
125474362 2022-03-08T00:29:46Z gs://ossf-malware-analysis-results/npm/@nuskin/ns-aem/3.34.291.json
Inspecting the huge packages shows that the recent addition of git causes these packages to clone a git repo with a very large submodule (thousands of files), producing result files larger than the row size limit (104857600 bytes) seen in the error above.
This failure is somewhat silent.
We should probably call .Wait() on the job and log the errors it returns, to avoid having to dig through BigQuery job logs to find failures.
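Roughly what that could look like with the cloud.google.com/go/bigquery client (a sketch, not the actual loader code; the package and function names here are illustrative):

package loader // hypothetical package name

import (
	"context"
	"log"

	"cloud.google.com/go/bigquery"
)

// waitAndLogJob blocks until the BigQuery load job finishes and logs any
// errors it reports, so failures surface in our own logs rather than only
// in the BigQuery job history.
func waitAndLogJob(ctx context.Context, job *bigquery.Job) error {
	status, err := job.Wait(ctx)
	if err != nil {
		// Wait itself failed (e.g. context cancelled); report it.
		return err
	}
	// status.Errors is the "errors[] collection" from the job status above.
	for _, e := range status.Errors {
		log.Printf("load job %s error: %v", job.ID(), e)
	}
	// Err() is non-nil if the job completed unsuccessfully.
	return status.Err()
}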
Yep, we definitely need to surface these a bit better. A potential issue with Wait() in the current GCP function is the max timeout we can set on the function itself (540 seconds), but maybe that's plenty for the foreseeable future.
Might be able to push the job ID onto a Pub/Sub queue and have it checked periodically to examine the outcome of the job.
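One possible shape for that, using cloud.google.com/go/pubsub; the topic name and message format are just assumptions, not anything this project defines:

package loader // hypothetical package name

import (
	"context"

	"cloud.google.com/go/pubsub"
)

// enqueueJobID publishes the BigQuery job ID so a separate, periodically
// triggered checker can look up the job and examine its outcome later,
// outside the 540-second function timeout.
func enqueueJobID(ctx context.Context, client *pubsub.Client, jobID string) error {
	topic := client.Topic("bigquery-load-jobs") // hypothetical topic name
	res := topic.Publish(ctx, &pubsub.Message{Data: []byte(jobID)})
	_, err := res.Get(ctx) // block until the publish is acknowledged
	return err
}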
Work left on this issue: add some alerting/logging to make it easier to find the job ID and see the outcome of an import.
Fixed by #947, which calls .Wait() and prints any errors to stdout (which get logged to the Cloud console).