package-analysis

Loader is failing with JSON issues.

Open calebbrown opened this issue 3 years ago • 4 comments

The loader function is failing.

The last successful import was 2022-03-03 18:02:27 UTC.

Digging through the logs, I found the following error:

additionalErrors: [
0: {
code: 3
message: "Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 211; errors: 1. Please look into the errors[] collection for more details."
}
1: {
code: 3
message: "Error while reading data, error message: JSON processing encountered too many errors, giving up. Rows: 209621; errors: 1; max bad: 0; error percent: 0"
}
2: {
code: 3
message: "Error while reading data, error message: JSON parsing error in row starting at position 0: Row size is larger than: 104857600."
}]
error: {
code: 3
message: "Error while reading data, error message: JSON table encountered too many errors, giving up. Rows: 211; errors: 1. Please look into the errors[] collection for more details."
}
state: "DONE"
}

calebbrown avatar Mar 08 '22 21:03 calebbrown

$ gsutil ls -l 'gs://ossf-malware-analysis-results/**' > allpackagewithsizes.out
$ cat allpackagewithsizes.out | sort -n | tail
  64417903  2022-02-02T22:19:07Z  gs://ossf-malware-analysis-results/pypi/pqrs/0.8.0.json
  65228324  2022-03-03T16:52:06Z  gs://ossf-malware-analysis-results/pypi/foodx-devops-tools/0.13.1.post1.json
  65338726  2022-02-23T17:42:14Z  gs://ossf-malware-analysis-results/pypi/foodx-devops-tools/0.13.1.json
  65628551  2022-02-11T16:28:02Z  gs://ossf-malware-analysis-results/pypi/ansible-kernel/1.0.0.json
 125303369  2022-03-03T19:58:44Z  gs://ossf-malware-analysis-results/npm/@nuskin/ns-signup/3.34.289.json
 125361942  2022-03-07T22:51:09Z  gs://ossf-malware-analysis-results/npm/@nuskin/ns-signup/3.34.290.json
 125365214  2022-03-08T00:26:53Z  gs://ossf-malware-analysis-results/npm/@nuskin/ns-signup/3.34.291.json
 125408813  2022-03-03T19:22:35Z  gs://ossf-malware-analysis-results/npm/@nuskin/ns-aem/3.34.289.json
 125469811  2022-03-07T22:52:29Z  gs://ossf-malware-analysis-results/npm/@nuskin/ns-aem/3.34.290.json
 125474362  2022-03-08T00:29:46Z  gs://ossf-malware-analysis-results/npm/@nuskin/ns-aem/3.34.291.json

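Inspecting the huge packages shows that the recent addition of git causes these packages to clone a git repo that has a very large submodule (thousands of files). The six largest result files are about 125 MB each, which exceeds the 104857600-byte row limit reported in the error above.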
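The size check above can be automated. Below is a minimal Go sketch that parses `gsutil ls -l` output and flags objects larger than the row limit from the error message. It assumes each result file is loaded as a single JSON row and that object names contain no spaces; the `object` type and `oversized` function are hypothetical names for illustration only.

```go
package main

import (
	"bufio"
	"fmt"
	"strconv"
	"strings"
)

// maxRowBytes is the row-size limit quoted in the BigQuery error message.
const maxRowBytes = 104857600

// object is one entry of `gsutil ls -l` output: size in bytes and object URL.
type object struct {
	Size int64
	URL  string
}

// oversized parses `gsutil ls -l` output and returns the objects whose size
// exceeds limit. Lines that do not start with an integer size (such as the
// trailing TOTAL summary) are skipped.
func oversized(listing string, limit int64) []object {
	var out []object
	sc := bufio.NewScanner(strings.NewReader(listing))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 3 {
			continue
		}
		size, err := strconv.ParseInt(fields[0], 10, 64)
		if err != nil {
			continue
		}
		if size > limit {
			// Assumption: the URL is the last field (no spaces in object names).
			out = append(out, object{Size: size, URL: fields[len(fields)-1]})
		}
	}
	return out
}

func main() {
	listing := `  65628551  2022-02-11T16:28:02Z  gs://ossf-malware-analysis-results/pypi/ansible-kernel/1.0.0.json
 125303369  2022-03-03T19:58:44Z  gs://ossf-malware-analysis-results/npm/@nuskin/ns-signup/3.34.289.json
 125474362  2022-03-08T00:29:46Z  gs://ossf-malware-analysis-results/npm/@nuskin/ns-aem/3.34.291.json`
	for _, o := range oversized(listing, maxRowBytes) {
		fmt.Printf("%d bytes over the limit: %s\n", o.Size-maxRowBytes, o.URL)
	}
}
```

Running a check like this before the import could let the loader skip or truncate the offending files instead of failing the whole batch.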

calebbrown avatar Mar 09 '22 01:03 calebbrown

This failure is somewhat silent.

We should probably call .Wait() on the job and log the errors it returns, so we don't have to dig through the BigQuery job logs to find failures.
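A rough Go sketch of what that could look like. To keep it runnable without GCP credentials, `jobStatus`, `waiter`, and `fakeJob` are hypothetical stand-ins; they are modelled on the Go BigQuery client, where `Job.Wait` returns a `*JobStatus` that has an `Err()` method and an `Errors` slice. This is an illustration under those assumptions, not the loader's actual code.

```go
package main

import (
	"context"
	"errors"
	"log"
)

// jobStatus is a hypothetical stand-in for the status returned by Wait.
type jobStatus struct {
	err    error   // terminal error of the job, nil on success
	errors []error // every error recorded, like additionalErrors in the job log
}

// Err returns the job's terminal error, or nil if the job succeeded.
func (s *jobStatus) Err() error { return s.err }

// waiter is the minimal (hypothetical) interface the loader needs from a job.
type waiter interface {
	ID() string
	Wait(ctx context.Context) (*jobStatus, error)
}

// waitAndLog blocks until the job finishes, logs every recorded error, and
// returns the terminal error so the caller can fail loudly.
func waitAndLog(ctx context.Context, job waiter) error {
	status, err := job.Wait(ctx)
	if err != nil {
		// The status could not be retrieved at all (e.g. context cancelled).
		log.Printf("load job %s: waiting failed: %v", job.ID(), err)
		return err
	}
	for i, e := range status.errors {
		log.Printf("load job %s: error %d: %v", job.ID(), i, e)
	}
	if err := status.Err(); err != nil {
		log.Printf("load job %s: failed: %v", job.ID(), err)
		return err
	}
	log.Printf("load job %s: completed successfully", job.ID())
	return nil
}

// fakeJob returns a canned status so the sketch runs offline.
type fakeJob struct {
	id     string
	status *jobStatus
}

func (j fakeJob) ID() string                               { return j.id }
func (j fakeJob) Wait(context.Context) (*jobStatus, error) { return j.status, nil }

// runDemo waits on a fake job that either fails with the row-size error or succeeds.
func runDemo(fail bool) error {
	status := &jobStatus{}
	if fail {
		rowErr := errors.New("JSON parsing error in row starting at position 0: Row size is larger than: 104857600")
		status = &jobStatus{err: rowErr, errors: []error{rowErr}}
	}
	return waitAndLog(context.Background(), fakeJob{id: "example-job", status: status})
}

func main() {
	if err := runDemo(true); err != nil {
		log.Printf("import failed: %v", err)
	}
}
```

Returning the error, rather than only logging it, would also let the function invocation itself be marked as failed.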

calebbrown avatar Mar 09 '22 02:03 calebbrown

> This failure is somewhat silent.
>
> We should probably call .Wait() on the job and log the errors it returns, so we don't have to dig through the BigQuery job logs to find failures.

Yep, we definitely need to surface these a bit better. A potential issue with Wait() in the current GCP function is the maximum timeout we can set on the function itself (540 seconds), but maybe that's plenty for the foreseeable future.
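One way to stay inside that limit is to bound the wait with a context deadline and treat "still running" as its own outcome. A minimal Go sketch, assuming the wait call honours context cancellation; `waitWithBudget`, `errStillRunning`, and `slowJob` are hypothetical names, and the 30-second headroom is an arbitrary choice for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// functionLimit is the maximum function timeout mentioned above.
const functionLimit = 540 * time.Second

// errStillRunning reports that the wait budget ran out before the job finished.
var errStillRunning = errors.New("job still running when wait budget expired")

// waitWithBudget runs wait under a deadline of budget. If the deadline
// expires first it returns errStillRunning, so the caller can log the job ID
// for later inspection instead of being killed by the platform.
func waitWithBudget(parent context.Context, budget time.Duration, wait func(context.Context) error) error {
	ctx, cancel := context.WithTimeout(parent, budget)
	defer cancel()
	err := wait(ctx)
	if errors.Is(err, context.DeadlineExceeded) {
		return errStillRunning
	}
	return err
}

// slowJob simulates a job that takes d to finish and honours cancellation.
func slowJob(d time.Duration) func(context.Context) error {
	return func(ctx context.Context) error {
		select {
		case <-time.After(d):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// demo runs a simulated job of jobMillis under a budget of budgetMillis.
func demo(jobMillis, budgetMillis int) error {
	budget := time.Duration(budgetMillis) * time.Millisecond
	return waitWithBudget(context.Background(), budget, slowJob(time.Duration(jobMillis)*time.Millisecond))
}

func main() {
	// In a real function the budget would sit below functionLimit,
	// leaving headroom to log the outcome before the platform timeout.
	fmt.Println("example production budget:", functionLimit-30*time.Second)
	fmt.Println("fast job:", demo(10, 200))
	fmt.Println("slow job:", demo(500, 50))
}
```

Giving up on the wait only stops the function from blocking; whether the underlying job keeps running depends on the client, so the job ID should still be logged for follow-up.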

oliverchang avatar Mar 09 '22 02:03 oliverchang

We might be able to push the job ID onto a Pub/Sub queue and have it checked periodically to examine the outcome of the job.
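As a rough illustration of the periodic check, here is a Go sketch of a single polling pass. A plain slice stands in for the queue and `checkFunc` stands in for a job-status lookup; both are assumptions, as are the names `pollOnce`, `outcome`, and `errRowTooLarge`. In a real setup the still-pending IDs would be re-queued rather than returned.

```go
package main

import (
	"errors"
	"fmt"
)

// errRowTooLarge is an example terminal error, mirroring the log in this issue.
var errRowTooLarge = errors.New("row size is larger than 104857600")

// checkFunc reports whether a job has finished and, if so, its terminal error.
// It is a hypothetical stand-in for a status lookup against the job service.
type checkFunc func(jobID string) (done bool, err error)

// outcome records the result of a finished job.
type outcome struct {
	JobID string
	Err   error
}

// pollOnce checks every pending job once. Jobs that are not done yet are
// returned in still so they can be checked again on the next pass; finished
// jobs are returned with their terminal error (nil on success).
func pollOnce(pending []string, check checkFunc) (still []string, finished []outcome) {
	for _, id := range pending {
		done, err := check(id)
		if !done {
			still = append(still, id)
			continue
		}
		finished = append(finished, outcome{JobID: id, Err: err})
	}
	return still, finished
}

func main() {
	// Canned statuses so the sketch runs without any external service.
	check := func(id string) (bool, error) {
		switch id {
		case "job-ok":
			return true, nil
		case "job-bad":
			return true, errRowTooLarge
		default:
			return false, nil
		}
	}
	still, finished := pollOnce([]string{"job-ok", "job-bad", "job-running"}, check)
	for _, o := range finished {
		fmt.Printf("job %s finished, error: %v\n", o.JobID, o.Err)
	}
	fmt.Println("still pending:", still)
}
```

This keeps the loader function short-lived, at the cost of a second component that has to run on a schedule.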

calebbrown avatar Mar 09 '22 02:03 calebbrown

Work left on this issue: add some alerting/logging to make it easier to find the job ID and see the outcome of an import.

calebbrown avatar Dec 21 '22 00:12 calebbrown

Fixed by #947, which calls .Wait() and prints any errors to stdout (which gets logged to the Cloud console).

maxfisher-g avatar Dec 15 '23 03:12 maxfisher-g