embulk-output-bigquery

Split files every 4 GB for BigQuery quota policy

Open sakama opened this issue 9 years ago • 3 comments

BigQuery enforces the following quota policy on the size of files in a load job.

So it is better to split output files into pieces of at most 4 GB.

File Type   Compressed   Uncompressed
CSV         4 GB         With new-lines in strings: 4 GB
                         Without new-lines in strings: 5 TB
JSON        4 GB         5 TB

Problems

  • Files have to be split at a line ending (CRLF/LF/CR), not only by file size.
  • Splitting the data before it is written is better than splitting the output file afterwards, because Embulk runs multiple tasks across multiple CPU cores.
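The first problem can be illustrated with a minimal sketch. It is not code from embulk-output-bigquery; the function name `chunk_boundaries` and the constant `MAX_BYTES` are hypothetical. It assumes LF or CRLF line endings (CR-only endings are not handled) and no newlines embedded inside quoted CSV fields, and it only computes byte offsets so nothing large is held in memory.

```python
import io

# Assumption: 4 GiB, the compressed-file limit from the quota table above.
MAX_BYTES = 4 * 1024**3


def chunk_boundaries(src, max_bytes=MAX_BYTES):
    """Yield (start, end) byte offsets of chunks that end on a line boundary.

    src is a binary file-like object. Iterating it splits on b"\n",
    so a CRLF pair is never cut in half. A single line longer than
    max_bytes still becomes its own (oversized) chunk.
    """
    start = pos = 0
    for line in src:
        # Close the current chunk if adding this line would exceed the limit.
        if pos > start and pos + len(line) - start > max_bytes:
            yield (start, pos)
            start = pos
        pos += len(line)
    if pos > start:
        yield (start, pos)


# Tiny demonstration with an artificially small limit.
data = io.BytesIO(b"a,1\r\nbb,2\r\nccc,3\r\n")
for start, end in chunk_boundaries(data, max_bytes=12):
    print(start, end)
```

Each (start, end) pair could then be copied to its own output file. This only addresses the line-ending problem; it does nothing about the second point, which is that splitting before output fits Embulk's parallel tasks better.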

sakama, Apr 23 '15 01:04

I have encountered this problem.

Caused by: org.jruby.exceptions.RaiseException: (Error) failed during waiting a Load job, get_job(myproject, embulk_load_job_513c2da9-2e73-498d-b57a-493ab53860af), errors:[{:reason=>"invalid", :message=>"Error while reading table: XXXX, error message: Input CSV files are not splittable and at least one of the files is larger than the maximum allowed size. Size is: 7505312411. Max allowed size is: 4294967296."}]
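For reference, a quick check of the two numbers in that message (a sketch only; the variable names are mine, the values are copied from the error):

```python
import math

size_bytes = 7505312411   # "Size is" from the error message
limit_bytes = 4294967296  # "Max allowed size is" from the error message

# The limit is exactly 4 GiB.
assert limit_bytes == 4 * 1024**3

# How far the file is over the limit, and the minimum number of
# pieces needed if it were split into parts no larger than the limit.
excess_bytes = size_bytes - limit_bytes
min_parts = math.ceil(size_bytes / limit_bytes)
print(excess_bytes, min_parts)
```

So the rejected file is roughly 1.75 times the limit, and at least two pieces would be needed to load it under the 4 GB rule.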

kosukekurimoto, Mar 19 '20 04:03

Hello, @kosukekurimoto Have you tried uncompressed mode? Its limit is 5 TB.

hiroyuki-sato, Mar 19 '20 14:03

@hiroyuki-sato

Hello, @kosukekurimoto Have you tried uncompressed mode? Its limit is 5 TB.

Thank you for the advice. I found the relevant documentation: https://cloud.google.com/bigquery/quotas?hl=ja#load_jobs

I will try again with compression: NONE.

kosukekurimoto, Mar 21 '20 16:03