embulk-output-bigquery Split file each 4GB for BigQuery Quota Policy

Open sakama opened this issue 9 years ago • 3 comments

BigQuery has the following quota policy for load jobs, so it would be better to split the output into files of at most 4 GB each:

File Type	Compressed	Uncompressed
CSV	4 GB	4 GB (with new-lines in strings); 5 TB (without new-lines in strings)
JSON	4 GB	5 TB

Problems

Files have to be split at an end-of-line newline (CRLF/LF/CR), not purely by file size.
Splitting the data before it is written is better than splitting the output file afterwards, because Embulk runs multiple tasks across multiple CPU cores.
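
To illustrate the first problem, here is a minimal Python sketch of size-limited splitting that only cuts after a line terminator. It is not the plugin's implementation: the function name `split_on_line_boundaries` and the `max_bytes` parameter are hypothetical, the data is held in memory for brevity, and it assumes no newlines are embedded inside quoted CSV fields.

```python
from typing import Iterator, List


def split_on_line_boundaries(data: bytes, max_bytes: int) -> Iterator[bytes]:
    """Yield chunks of at most max_bytes, cutting only after a line terminator."""
    current: List[bytes] = []
    size = 0
    # bytes.splitlines(keepends=True) recognizes LF, CR, and CRLF terminators.
    for line in data.splitlines(keepends=True):
        if len(line) > max_bytes:
            raise ValueError("a single line exceeds max_bytes")
        # Start a new chunk when adding this line would exceed the limit.
        if current and size + len(line) > max_bytes:
            yield b"".join(current)
            current, size = [], 0
        current.append(line)
        size += len(line)
    if current:
        yield b"".join(current)


sample = b"a,1\r\nb,2\nc,3\rd,4\n"
chunks = list(split_on_line_boundaries(sample, max_bytes=8))
```

With the tiny 8-byte limit used here, every chunk ends on a line terminator and the chunks concatenate back to the original input. A real splitter would stream the data and would also need a CSV-aware scan if fields can contain newlines.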

Apr 23 '15 01:04 sakama

I have encountered this problem.

Caused by: org.jruby.exceptions.RaiseException: (Error) failed during waiting a Load job, get_job(myproject, embulk_load_job_513c2da9-2e73-498d-b57a-493ab53860af), errors:[{:reason=>"invalid", :message=>"Error while reading table: XXXX, error message: Input CSV files are not splittable and at least one of the files is larger than the maximum allowed size. Size is: 7505312411. Max allowed size is: 4294967296."}]
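
For reference, the numbers in this error can be checked with a little arithmetic. The sketch below only restates the two sizes from the message; the variable names are illustrative.

```python
MAX_BYTES = 4 * 1024**3      # the "Max allowed size" in the error message
file_size = 7505312411       # the "Size is" value in the error message

# Ceiling division: minimum number of parts if each part must fit the limit.
parts_needed = -(-file_size // MAX_BYTES)
over_limit = file_size > MAX_BYTES
```

So the reported limit is exactly 4 GiB, and this file would have to be split into at least two parts to fit under it.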

Mar 19 '20 04:03 kosukekurimoto

Hello @kosukekurimoto, have you tried uncompressed mode? Its limit is 5 TB.

Mar 19 '20 14:03 hiroyuki-sato

@hiroyuki-sato

Hello @kosukekurimoto, have you tried uncompressed mode? Its limit is 5 TB.

Thank you for the advice. I found the relevant documentation: https://cloud.google.com/bigquery/quotas?hl=ja#load_jobs

I will try again with compression: NONE.

Mar 21 '20 16:03 kosukekurimoto
