embulk-output-bigquery
Create the table in the destination project when `destination_project` is set
Summary
In the embulk-output-bigquery plugin, you can set the billing project and the destination project separately. The documentation for `destination_project` says: "A destination project to which the data will be loaded. Use this if you want to separate a billing project (the `project` value) and a destination project (the `destination_project` value)."
However, if `destination_project` is set and the destination table or temporary table does not exist, the plugin tries to create the table in the billing project instead of the destination project. This causes `dataset not found` and `permission denied` errors.
I want the table to be created in the destination project when `destination_project` is set.
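The desired behavior can be sketched in a few lines of Ruby (a hypothetical simplification; the helper name and config shape are illustrative, not the plugin's actual code):

```ruby
# Hypothetical sketch of the project resolution this issue asks for.
# `task` stands in for the plugin's parsed configuration.
def effective_destination_project(task)
  # Tables and datasets should live in destination_project when it is set;
  # the billing project is only the fallback.
  task['destination_project'] || task['project']
end

effective_destination_project('project' => 'xxx', 'destination_project' => 'yyy')
# => "yyy" (the buggy behavior effectively uses "xxx" here)
effective_destination_project('project' => 'xxx')
# => "xxx"
```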

Setting
Excerpt of the setting:
out:
  type: bigquery
  auth_method: json_key
  mode: 'replace'
  project: '{xxx}'
  destination_project: '{yyy}'
  dataset: '{aaa}'
  json_keyfile: '{service_account.json}'
  gcs_bucket: '{gcs_bucket}'
  auto_create_table: true
  compression: true
  auto_create_gcs_bucket: false
  allow_quoted_newlines: true
The service account has the following roles:
- in the yyy project: roles/bigquery.jobUser, roles/bigquery.dataEditor, roles/storage.objectAdmin
- in the xxx project: roles/bigquery.jobUser
Error Log
Excerpt of the log:
2022-12-22 15:22:37.071 +0900 [INFO] (0001:transaction): embulk-output-bigquery: Get dataset... yyy:aaa
2022-12-22 15:22:38.501 +0900 [INFO] (0001:transaction): embulk-output-bigquery: Create table... yyy:aaa.LOAD_TEMP_91f517c0_43ce_46e6_84a3_55e0f3714602_table
2022-12-22 15:22:39.020 +0900 [ERROR] (0001:transaction): embulk-output-bigquery: insert_table(xxx, 0_temp, us/eu, {:table_reference=>{:table_id=>"LOAD_TEMP_91f517c0_43ce_46e6_84a3_55e0f3714602_table"}, :schema=>{:fields=>[{:name=>"id", :type=>"INTEGER"}, {:name=>"url", :type=>"STRING"}, {:name=>"title", :type=>"STRING"}, {:name=>"created_at", :type=>"TIMESTAMP"}, {:name=>"updated_at", :type=>"TIMESTAMP"}]}}, {}), response:{:status_code=>404, :message=>"notFound: Not found: Dataset xxx:aaa", :error_class=>Google::Apis::ClientError}
org.embulk.exec.PartialExecutionException: org.jruby.exceptions.RaiseException: (Error) failed to create table yyy:aaa.LOAD_TEMP_91f517c0_43ce_46e6_84a3_55e0f3714602_table in us/eu, response:{:status_code=>404, :message=>"notFound: Not found: Dataset xxx:aaa", :error_class=>Google::Apis::ClientError}
@hiroyuki-sato Hi, could you review it? I want to separate the destination project {yyy} from the billing project {xxx}, but without this commit, it always tries to insert into the billing project and fails.
It needs to insert into the destination project instead.
@p-eye It seems this PR needs additional modification. The trocco-io version already uses this feature https://github.com/trocco-io/embulk-output-bigquery/commit/8343629792b3b15e0f3fe89eab50eb5edd84dd1b
I'll take a look at the difference.
@kashira202111, @p-eye I misunderstood this PR. Could you provide an example that reproduces the problem? (For my test.) @kashira202111 Could you resolve the conflicts?
@hiroyuki-sato Thank you for your reply, I'm going to resolve the conflicts. Please wait a moment.
@kashira202111 Thank you for your work, and sorry for keeping you waiting for the review.
I want to confirm the following.
- Current behavior: raises `dataset not found` and `permission denied` errors (I want to know how to reproduce these errors).
- Whether this PR changes the current behavior or not.

If my understanding is correct, this PR is a fix for the error, so we don't need to consider compatibility. But if this PR changes the current behavior, we need to take care of compatibility. Please tell me the details about the above.
> Current behavior: raises `dataset not found` and `permission denied` errors (I want to know how to reproduce these errors)

The important purpose of this PR is to create the table in the `destination_project` project, not in the `project` project. The current behavior creates the table in `project` regardless of the `destination_project` setting.
To reproduce this error, you must prepare the following:
- Use `auto_create_table: true`.
- Use both the `project` parameter and the `destination_project` parameter, with `project` and `destination_project` not equal.
- Use `auto_create_dataset: false`.
- In the xxx project, the 'aaa' dataset does not exist; in the yyy project, it does.
- Then the `dataset not found` error is raised.
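The reproduction conditions above can be captured in a small predicate (the names and config shape here are purely illustrative, not the plugin's internals):

```ruby
# Illustrative check of the reproduction conditions: the 404 is expected only
# when the plugin must auto-create the table, is not allowed to create the
# dataset, and the dataset exists solely in the destination project.
def repro_conditions_met?(config, datasets_by_project)
  config['auto_create_table'] &&
    !config['auto_create_dataset'] &&
    config['destination_project'] &&
    config['destination_project'] != config['project'] &&
    !datasets_by_project.fetch(config['project'], []).include?(config['dataset']) &&
    datasets_by_project.fetch(config['destination_project'], []).include?(config['dataset'])
end

config = {
  'project' => 'xxx', 'destination_project' => 'yyy', 'dataset' => 'aaa',
  'auto_create_table' => true, 'auto_create_dataset' => false
}
repro_conditions_met?(config, 'xxx' => [], 'yyy' => ['aaa'])  # => true
```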
out:
type: bigquery
auth_method: json_key
mode: 'replace'
project: 'xxx'
destination_project: 'yyy'
dataset: 'aaa'
json_keyfile: '{service_account.json}'
auto_create_dataset: false
auto_create_table: true
@hiroyuki-sato I am sorry that my English is so bad. Could you check again?
@kashira202111 Thank you for your work. I'll review this PR in a week. Please wait.
@kashira202111 I can't take time for review. Please wait for a while.
Hello, @kashira202111 and @p-eye
Does this PR fix the problem if a user uses the `gcs_bucket` parameter in the configuration?
If you don't use `gcs_bucket`, you can load data into the table which you specified in `destination_project`. The following configuration created the bqtest2.mydataset.hello_embulk table. Is my understanding correct?
And I have not set up `gcs_bucket` for testing yet. (I don't use the `gcs_bucket` parameter in my project.) Please wait for me to set it up.
As a workaround, can you remove `gcs_bucket`?
out:
  type: bigquery
  mode: replace
  project: bqtest
  destination_project: bqtest2
  dataset: mydataset
  auth_method: service_account
  json_keyfile: 'key_file'
  auto_create_dataset: false
  auto_create_table: true
  table: hello_embulk
@hiroyuki-sato Thanks for the review.
> Does this PR fix the problem if a user uses the `gcs_bucket` parameter in the configuration?
I'm going to check this. Please wait for a while.
@hiroyuki-sato
With this commit, we can load data into the `destination_project` both when using the `gcs_bucket` parameter and when not using it.
However, I found another problem with `gcs_bucket` in the latest release; after that problem was fixed, I confirmed that I can load data. I will file the problem as an issue later. Please wait for a while.
@hiroyuki-sato
This is the problem that I mentioned above.
https://github.com/embulk/embulk-output-bigquery/pull/164
@kashira202111
I would like to double-check this issue.
Without applying this PR, I could load data into the `destination_project` table:
https://github.com/embulk/embulk-output-bigquery/pull/148#issuecomment-2212447130
So, I don't completely understand the problem yet. Could you explain the problem again?
@hiroyuki-sato Hi. I'm sorry that my PR title and description led you to misunderstand. I rewrote the PR title and description.
Could you review this PR again?
@kashira202111 Thank you for proposing this PR. I confirmed this PR fixes the problem.
Expected behavior: load data into the table in the `destination_project` project and remove all intermediate tables.
However, without this PR, it behaves as follows.

| Dataset in `project` | Actual behavior |
|---|---|
| Not exists | Raises an exception (below) |
| Already exists | Creates unnecessary tables (temp and destination) |
org.embulk.exec.PartialExecutionException: org.jruby.exceptions.StandardError: (Error) failed to create table bqtest2-428605:mydataset.LOAD_TEMP_93248bb6_8d52_42b8_8226_67856107508b_hello_embulk in us/eu, response:{:status_code=>404, :message=>"notFound: Not found: Dataset bqtest-253710:mydataset", :error_class=>Google::Apis::ClientError}
at org.embulk.exec.BulkLoader$LoaderState.buildPartialExecuteException(BulkLoader.java:340)
at org.embulk.exec.BulkLoader.doRun(BulkLoader.java:580)
at org.embulk.exec.BulkLoader.access$000(BulkLoader.java:36)
at org.embulk.exec.BulkLoader$1.run(BulkLoader.java:353)
at org.embulk.exec.BulkLoader$1.run(BulkLoader.java:350)
at org.embulk.spi.ExecInternal.doWith(ExecInternal.java:26)
at org.embulk.exec.BulkLoader.run(BulkLoader.java:350)
at org.embulk.EmbulkEmbed.run(EmbulkEmbed.java:278)
at org.embulk.EmbulkRunner.runInternal(EmbulkRunner.java:288)
at org.embulk.EmbulkRunner.run(EmbulkRunner.java:153)
at org.embulk.cli.EmbulkRun.runInternal(EmbulkRun.java:115)
at org.embulk.cli.EmbulkRun.run(EmbulkRun.java:24)
at org.embulk.cli.Main.main(Main.java:53)
Suppressed: java.lang.NullPointerException
at org.embulk.exec.BulkLoader.doCleanup(BulkLoader.java:477)
at org.embulk.exec.BulkLoader$3.run(BulkLoader.java:411)
at org.embulk.exec.BulkLoader$3.run(BulkLoader.java:408)
at org.embulk.spi.ExecInternal.doWith(ExecInternal.java:26)
at org.embulk.exec.BulkLoader.cleanup(BulkLoader.java:408)
at org.embulk.EmbulkEmbed.run(EmbulkEmbed.java:283)
... 5 more
Caused by: org.jruby.exceptions.StandardError: (Error) failed to create table bqtest2-428605:mydataset.LOAD_TEMP_93248bb6_8d52_42b8_8226_67856107508b_hello_embulk in us/eu, response:{:status_code=>404, :message=>"notFound: Not found: Dataset bqtest-253710:mydataset", :error_class=>Google::Apis::ClientError}
at RUBY.create_table_if_not_exists(/path/to/embulk-output-bigquery-0.7.1/lib/embulk/output/bigquery/bigquery_client.rb:462)
at RUBY.auto_create(/path/to/embulk-output-bigquery-0.7.1/lib/embulk/output/bigquery.rb:315)
at RUBY.transaction(/path/to/embulk-output-bigquery-0.7.1/lib/embulk/output/bigquery.rb:342)
at RUBY.transaction(/path/to/embulk-0.11.4-java/lib/embulk/output_plugin.rb:64)
Caused by: org.jruby.exceptions.StandardError: (ClientError) notFound: Not found: Dataset bqtest-253710:mydataset
at RUBY.check_status(/path/to/google-apis-core-0.14.0/lib/google/apis/core/http_command.rb:244)
at RUBY.check_status(/path/to/google-apis-core-0.14.0/lib/google/apis/core/api_command.rb:135)
at RUBY.process_response(/path/to/google-apis-core-0.14.0/lib/google/apis/core/http_command.rb:207)
at RUBY.execute_once(/path/to/google-apis-core-0.14.0/lib/google/apis/core/http_command.rb:326)
at RUBY.do_retry(/path/to/google-apis-core-0.14.0/lib/google/apis/core/http_command.rb:131)
at RUBY.retriable(/path/to/retriable-3.1.2/lib/retriable.rb:61)
at org.jruby.RubyFixnum.times(org/jruby/RubyFixnum.java:312)
at RUBY.retriable(/path/to/retriable-3.1.2/lib/retriable.rb:56)
at RUBY.do_retry(/path/to/google-apis-core-0.14.0/lib/google/apis/core/http_command.rb:128)
at RUBY.retriable(/path/to/retriable-3.1.2/lib/retriable.rb:61)
at org.jruby.RubyFixnum.times(org/jruby/RubyFixnum.java:312)
at RUBY.retriable(/path/to/retriable-3.1.2/lib/retriable.rb:56)
at RUBY.do_retry(/path/to/google-apis-core-0.14.0/lib/google/apis/core/http_command.rb:118)
at RUBY.execute(/path/to/google-apis-core-0.14.0/lib/google/apis/core/http_command.rb:109)
at RUBY.execute_or_queue_command(/path/to/google-apis-core-0.14.0/lib/google/apis/core/base_service.rb:477)
at RUBY.insert_table(/path/to/google-apis-bigquery_v2-0.65.0/lib/google/apis/bigquery_v2/service.rb:1442)
at RUBY.create_table_if_not_exists(/path/to/embulk-output-bigquery-0.7.1/lib/embulk/output/bigquery/bigquery_client.rb:451)
at RUBY.with_network_retry(/path/to/embulk-output-bigquery-0.7.1/lib/embulk/output/bigquery/google_client.rb:51)
at RUBY.create_table_if_not_exists(/path/to/embulk-output-bigquery-0.7.1/lib/embulk/output/bigquery/bigquery_client.rb:451)
... 3 more
Error: org.jruby.exceptions.StandardError: (Error) failed to create table bqtest2-428605:mydataset.LOAD_TEMP_93248bb6_8d52_42b8_8226_67856107508b_hello_embulk in us/eu, response:{:status_code=>404, :message=>"notFound: Not found: Dataset bqtest-253710:mydataset", :error_class=>Google::Apis::ClientError}
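Conceptually, the fix confirmed here is that data-location calls (dataset lookup, table creation) must target the destination project, while load jobs are still billed to `project`. A minimal sketch of that split, with illustrative names only (not the plugin's actual code):

```ruby
# Illustrative mapping: which project each kind of BigQuery call should
# target after the fix. Operation names here are stand-ins, not plugin APIs.
def project_for(operation, project, destination_project)
  data_project = destination_project || project
  case operation
  when :insert_job                                then project       # billed here
  when :get_dataset, :create_table, :delete_table then data_project  # data lives here
  else raise ArgumentError, "unknown operation: #{operation}"
  end
end

project_for(:create_table, 'bqtest', 'bqtest2')  # => "bqtest2"
project_for(:insert_job,   'bqtest', 'bqtest2')  # => "bqtest"
```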
@kashira202111 CI was fixed in #166. Could you rebase this PR to check CI?
@hiroyuki-sato Thank you. I rebased this PR!
@joker1007 CI passed. Please take a look when you have time.
Thanks!
@p-eye and @kashira202111 embulk-output-bigquery 0.7.2 has been released. 0.7.2 includes this PR. Please try it.
It works well, thanks to you guys.