embulk-output-bigquery
Split output files every 4 GB to comply with the BigQuery quota policy
BigQuery enforces the following quota policy on load jobs.
It is therefore better to split output files so that each one stays within 4 GB.
File Type | Compressed | Uncompressed |
---|---|---|
CSV | 4 GB | With newlines in strings: 4 GB; without newlines in strings: 5 TB |
JSON | 4 GB | 5 TB |
Problems
- Files have to be split at a line ending (CRLF/LF/CR), not purely by file size.
- Splitting the data before output is better than splitting the output file afterwards, because Embulk runs multiple tasks across multiple CPU cores.
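To illustrate the first point, here is a minimal Python sketch of size-limited splitting that only cuts at line endings. It is not how embulk-output-bigquery works internally; `split_file`, the `part-00000.csv` naming, and the `out_dir` parameter are hypothetical names chosen for this example. It assumes lines end in LF or CRLF (bare CR is not handled), that no quoted field contains a newline, and that there is no header row to repeat in each part.

```python
import os
from typing import List

# 4 GiB, i.e. 4294967296 bytes, the "Max allowed size" quoted in the load error.
MAX_BYTES = 4 * 1024 ** 3


def split_file(path: str, out_dir: str, max_bytes: int = MAX_BYTES) -> List[str]:
    """Copy `path` into numbered parts of at most `max_bytes` bytes each.

    A part only ever ends after a complete line, so no record is cut in half.
    Hypothetical helper for illustration; not part of Embulk or the plugin.
    """
    os.makedirs(out_dir, exist_ok=True)
    outputs: List[str] = []
    out = None
    size = 0
    try:
        with open(path, "rb") as src:
            # Iterating a binary file yields lines terminated by b"\n",
            # which covers both LF and CRLF endings.
            for line in src:
                if len(line) > max_bytes:
                    raise ValueError("a single line exceeds the part size limit")
                # Start a new part when this line would push us over the limit.
                if out is None or size + len(line) > max_bytes:
                    if out is not None:
                        out.close()
                    name = os.path.join(out_dir, f"part-{len(outputs):05d}.csv")
                    out = open(name, "wb")
                    outputs.append(name)
                    size = 0
                out.write(line)
                size += len(line)
    finally:
        if out is not None:
            out.close()
    return outputs
```

Calling `split_file("big.csv", "parts")` would return the list of part paths in order. This splits an already written file in a single process, which is exactly the approach the second point argues against for Embulk itself; it is shown only to make the line-boundary requirement concrete.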
I have encountered this problem.
Caused by: org.jruby.exceptions.RaiseException: (Error) failed during waiting a Load job, get_job(myproject, embulk_load_job_513c2da9-2e73-498d-b57a-493ab53860af), errors:[{:reason=>"invalid", :message=>"Error while reading table: XXXX, error message: Input CSV files are not splittable and at least one of the files is larger than the maximum allowed size. Size is: 7505312411. Max allowed size is: 4294967296."}]
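For reference, the two numbers in that message can be checked with a few lines of Python. This is only arithmetic on the values quoted in the error; the variable names are made up for the example.

```python
import math

GIB = 1024 ** 3

max_allowed = 4294967296  # "Max allowed size is" from the error message
actual_size = 7505312411  # "Size is" from the error message

# The limit is exactly 4 GiB.
assert max_allowed == 4 * GIB

# Smallest number of parts that keeps every part within the limit.
min_parts = math.ceil(actual_size / max_allowed)

print(f"file is {actual_size / GIB:.2f} GiB, needs at least {min_parts} parts")
```

So the failing file is just under 7 GiB and would have to be split into at least two parts to load in compressed form.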
Hello, @kosukekurimoto. Have you tried uncompressed mode? Its limit is up to 5 TB.
@hiroyuki-sato
Thank you for the advice. I found the relevant documentation: https://cloud.google.com/bigquery/quotas?hl=ja#load_jobs
I will try again with compression: NONE.