
Create the table in the destination project when setting `destination_project`

Open kashira202111 opened this issue 2 years ago • 8 comments

Summary

In the embulk-output-bigquery plugin, you can set the billing project and the destination project separately.

destination_project: A destination project to which the data will be loaded. Use this if you want to separate a billing project (the project value) and a destination project (the destination_project value).

If you set destination_project and the destination table or temporary table does not exist, the plugin tries to create the table in the billing project instead of the destination project. This causes a 'dataset not found' error or a 'permission denied' error.

I want the plugin to create the table in the destination project when destination_project is set.
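
For illustration, here is a minimal Ruby sketch of the intended behavior (this is not the plugin's actual code; the argument names and error handling are assumptions, though create_table_if_not_exists and insert_table appear in the stack trace below): the table must be created via tables.insert in the destination project, while load jobs are still submitted to the billing project.

# Minimal sketch of the intended behavior (illustrative, not the plugin's code).
require 'google/apis/bigquery_v2'

# fields: an Array of Google::Apis::BigqueryV2::TableFieldSchema
def create_table_if_not_exists(client, project, destination_project, dataset, table_id, fields)
  # Fall back to the billing project when destination_project is not set.
  destination_project ||= project

  body = Google::Apis::BigqueryV2::Table.new(
    table_reference: Google::Apis::BigqueryV2::TableReference.new(table_id: table_id),
    schema: Google::Apis::BigqueryV2::TableSchema.new(fields: fields)
  )

  # The dataset (and therefore the table) lives in the destination project, so the
  # table has to be created there, not in the billing project that only runs jobs.
  client.insert_table(destination_project, dataset, body)
rescue Google::Apis::ClientError => e
  raise unless e.status_code == 409 # table already exists
end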


Setting

Excerpt of the setting

out:
  type: bigquery
  auth_method: json_key
  mode: 'replace'
  project: '{xxx}'
  destination_project: '{yyy}'
  dataset: '{aaa}'
  json_keyfile: '{service_account.json}'
  gcs_bucket: '{gcs_bucket}'
  auto_create_table: true
  compression: true
  auto_create_gcs_bucket: false
  allow_quoted_newlines: true

The service account has the roles below:

  • roles/bigquery.jobUser, roles/bigquery.dataEditor, roles/storage.objectAdmin in yyy project
  • roles/bigquery.jobUser in xxx project

Error Log

Excerpt of the log

2022-12-22 15:22:37.071 +0900 [INFO] (0001:transaction): embulk-output-bigquery: Get dataset... yyy:aaa
2022-12-22 15:22:38.501 +0900 [INFO] (0001:transaction): embulk-output-bigquery: Create table... yyy:aaa.LOAD_TEMP_91f517c0_43ce_46e6_84a3_55e0f3714602_table
2022-12-22 15:22:39.020 +0900 [ERROR] (0001:transaction): embulk-output-bigquery: insert_table(xxx, 0_temp, us/eu, {:table_reference=>{:table_id=>"LOAD_TEMP_91f517c0_43ce_46e6_84a3_55e0f3714602_table"}, :schema=>{:fields=>[{:name=>"id", :type=>"INTEGER"}, {:name=>"url", :type=>"STRING"}, {:name=>"title", :type=>"STRING"}, {:name=>"created_at", :type=>"TIMESTAMP"}, {:name=>"updated_at", :type=>"TIMESTAMP"}]}}, {}), response:{:status_code=>404, :message=>"notFound: Not found: Dataset xxx:aaa", :error_class=>Google::Apis::ClientError}
org.embulk.exec.PartialExecutionException: org.jruby.exceptions.RaiseException: (Error) failed to create table yyy:aaa.LOAD_TEMP_91f517c0_43ce_46e6_84a3_55e0f3714602_table in us/eu, response:{:status_code=>404, :message=>"notFound: Not found: Dataset xxx:aaa", :error_class=>Google::Apis::ClientError}

kashira202111 avatar Dec 22 '22 08:12 kashira202111

@hiroyuki-sato Hi, could you review it? I want to separate the destination project {yyy} from the billing project {xxx}, but without this commit, it always tries to insert into the billing project and fails.

It needs to insert into the destination project instead.

p-eye avatar Jun 23 '24 02:06 p-eye

@p-eye It seems this PR needs additional modification. The trocco-io version already uses this feature https://github.com/trocco-io/embulk-output-bigquery/commit/8343629792b3b15e0f3fe89eab50eb5edd84dd1b

I'll take a look at the difference.

hiroyuki-sato avatar Jun 23 '24 14:06 hiroyuki-sato

@kashira202111, @p-eye I misunderstood this PR. Could you provide a reproduction example (for my test)? @kashira202111 Could you resolve the conflicts?

hiroyuki-sato avatar Jun 24 '24 09:06 hiroyuki-sato

@hiroyuki-sato Thank you for your reply. I'm going to resolve the conflicts. Please wait a moment.

kashira202111 avatar Jun 24 '24 09:06 kashira202111

@kashira202111 Thank you for your work, and sorry for keeping you waiting for the review.

I want to confirm the following.

  • Current behavior: raises a 'dataset not found' error and 'permission denied' (I want to know how to reproduce this error)
  • Whether this PR changes the current behavior or not.

If my understanding is correct, this PR is a fix for the error, and we don't need to consider compatibility. But if this PR changes the current behavior, we need to take compatibility into account.

Please tell me the details about the above.

hiroyuki-sato avatar Jun 24 '24 10:06 hiroyuki-sato

  • Current behavior: raises a 'dataset not found' error and 'permission denied' (I want to know how to reproduce this error)

The main purpose of this PR is to create the table in the destination_project project, not in the billing project (project). The current behavior creates the table in project regardless of whether destination_project is set.

If you want to reproduce this error, you must prepare the following (the config excerpt and a verification sketch follow the list):

  • Use auto_create_table=true
  • Use both project parameter and destination_project parameter
    • project and destination_project are not equal.
  • Use auto_create_dataset=false
  • In the xxx project, the 'aaa' dataset does not exist. In the yyy project, the 'aaa' dataset exists.
    • Then the 'dataset not found' error is raised.
out:
  type: bigquery
  auth_method: json_key
  mode: 'replace'
  project: 'xxx'
  destination_project: 'yyy'
  dataset: 'aaa'
  json_keyfile: '{service_account.json}'
  auto_create_dataset: false
  auto_create_table: true

kashira202111 avatar Jun 25 '24 10:06 kashira202111

@hiroyuki-sato I am sorry that my English is so bad. Could you check again?

kashira202111 avatar Jun 25 '24 10:06 kashira202111

@kashira202111 Thank you for your work. I'll review this PR in a week. Please wait.

hiroyuki-sato avatar Jun 25 '24 12:06 hiroyuki-sato

@kashira202111 I can't make time for the review yet. Please wait for a while.

hiroyuki-sato avatar Jul 03 '24 13:07 hiroyuki-sato

Hello, @kashira202111 and @p-eye

Does this PR fix the problem if a user uses the gcs_bucket parameter in the configuration? If you don't use gcs_bucket, you can load data into the table you specified in the destination_project.

The following configuration created the bqtest2.mydataset.hello_embulk table.

Is my understanding correct? And I have not set up gcs_bucket for testing yet (I don't use the gcs_bucket parameter in my project). Please wait while I set it up.

As a workaround, can you remove gcs_bucket?

out:
  type: bigquery
  mode: replace
  project: bqtest
  destination_project: bqtest2
  dataset: mydataset
  auth_method: service_account
  json_keyfile: 'key_file'
  auto_create_dataset: false
  auto_create_table: true
  table: hello_embulk

hiroyuki-sato avatar Jul 07 '24 13:07 hiroyuki-sato

@hiroyuki-sato Thanks for the review.

Does this PR fix the problem if a user uses the gcs_bucket parameter in the configuration?

I'm going to check this. Please wait for a while.

kashira202111 avatar Jul 09 '24 09:07 kashira202111

@hiroyuki-sato

Both with and without the gcs_bucket parameter, we can load data into destination_project with this commit.

But I found another problem with gcs_bucket in the latest release. After that problem was fixed, I confirmed that I can load data.

kashira202111 avatar Jul 10 '24 09:07 kashira202111

I will report the problem later. Please wait for a while.

kashira202111 avatar Jul 10 '24 10:07 kashira202111

@hiroyuki-sato

This is the problem that I mentioned above.

https://github.com/embulk/embulk-output-bigquery/pull/164

kashira202111 avatar Jul 10 '24 14:07 kashira202111

@kashira202111

I would like to double-check this issue. Without applying this PR, I could load data into the destination_project table (https://github.com/embulk/embulk-output-bigquery/pull/148#issuecomment-2212447130), so I don't completely understand the problem yet.

Could you explain the problem again?

hiroyuki-sato avatar Jul 11 '24 13:07 hiroyuki-sato

@hiroyuki-sato Hi. I'm sorry that my PR title and description led to the misunderstanding. I rewrote the PR title and description.

Could you review this PR again?

kashira202111 avatar Jul 16 '24 06:07 kashira202111

@kashira202111 Thank you for proposing this PR. I confirmed this PR fixes the problem.

Expected behavior: load data into the table in the destination_project project and remove all intermediate tables. However, without this PR, it behaves as follows.

  • Dataset does not exist in project (the billing project): raises an exception (below).
  • Dataset already exists in project: creates unnecessary tables (temp and destination) there.

org.embulk.exec.PartialExecutionException: org.jruby.exceptions.StandardError: (Error) failed to create table bqtest2-428605:mydataset.LOAD_TEMP_93248bb6_8d52_42b8_8226_67856107508b_hello_embulk in us/eu, response:{:status_code=>404, :message=>"notFound: Not found: Dataset bqtest-253710:mydataset", :error_class=>Google::Apis::ClientError}
	at org.embulk.exec.BulkLoader$LoaderState.buildPartialExecuteException(BulkLoader.java:340)
	at org.embulk.exec.BulkLoader.doRun(BulkLoader.java:580)
	at org.embulk.exec.BulkLoader.access$000(BulkLoader.java:36)
	at org.embulk.exec.BulkLoader$1.run(BulkLoader.java:353)
	at org.embulk.exec.BulkLoader$1.run(BulkLoader.java:350)
	at org.embulk.spi.ExecInternal.doWith(ExecInternal.java:26)
	at org.embulk.exec.BulkLoader.run(BulkLoader.java:350)
	at org.embulk.EmbulkEmbed.run(EmbulkEmbed.java:278)
	at org.embulk.EmbulkRunner.runInternal(EmbulkRunner.java:288)
	at org.embulk.EmbulkRunner.run(EmbulkRunner.java:153)
	at org.embulk.cli.EmbulkRun.runInternal(EmbulkRun.java:115)
	at org.embulk.cli.EmbulkRun.run(EmbulkRun.java:24)
	at org.embulk.cli.Main.main(Main.java:53)
	Suppressed: java.lang.NullPointerException
		at org.embulk.exec.BulkLoader.doCleanup(BulkLoader.java:477)
		at org.embulk.exec.BulkLoader$3.run(BulkLoader.java:411)
		at org.embulk.exec.BulkLoader$3.run(BulkLoader.java:408)
		at org.embulk.spi.ExecInternal.doWith(ExecInternal.java:26)
		at org.embulk.exec.BulkLoader.cleanup(BulkLoader.java:408)
		at org.embulk.EmbulkEmbed.run(EmbulkEmbed.java:283)
		... 5 more
Caused by: org.jruby.exceptions.StandardError: (Error) failed to create table bqtest2-428605:mydataset.LOAD_TEMP_93248bb6_8d52_42b8_8226_67856107508b_hello_embulk in us/eu, response:{:status_code=>404, :message=>"notFound: Not found: Dataset bqtest-253710:mydataset", :error_class=>Google::Apis::ClientError}
	at RUBY.create_table_if_not_exists(/path/to/embulk-output-bigquery-0.7.1/lib/embulk/output/bigquery/bigquery_client.rb:462)
	at RUBY.auto_create(/path/to/embulk-output-bigquery-0.7.1/lib/embulk/output/bigquery.rb:315)
	at RUBY.transaction(/path/to/embulk-output-bigquery-0.7.1/lib/embulk/output/bigquery.rb:342)
	at RUBY.transaction(/path/to/embulk-0.11.4-java/lib/embulk/output_plugin.rb:64)
Caused by: org.jruby.exceptions.StandardError: (ClientError) notFound: Not found: Dataset bqtest-253710:mydataset
	at RUBY.check_status(/path/to/google-apis-core-0.14.0/lib/google/apis/core/http_command.rb:244)
	at RUBY.check_status(/path/to/google-apis-core-0.14.0/lib/google/apis/core/api_command.rb:135)
	at RUBY.process_response(/path/to/google-apis-core-0.14.0/lib/google/apis/core/http_command.rb:207)
	at RUBY.execute_once(/path/to/google-apis-core-0.14.0/lib/google/apis/core/http_command.rb:326)
	at RUBY.do_retry(/path/to/google-apis-core-0.14.0/lib/google/apis/core/http_command.rb:131)
	at RUBY.retriable(/path/to/retriable-3.1.2/lib/retriable.rb:61)
	at org.jruby.RubyFixnum.times(org/jruby/RubyFixnum.java:312)
	at RUBY.retriable(/path/to/retriable-3.1.2/lib/retriable.rb:56)
	at RUBY.do_retry(/path/to/google-apis-core-0.14.0/lib/google/apis/core/http_command.rb:128)
	at RUBY.retriable(/path/to/retriable-3.1.2/lib/retriable.rb:61)
	at org.jruby.RubyFixnum.times(org/jruby/RubyFixnum.java:312)
	at RUBY.retriable(/path/to/retriable-3.1.2/lib/retriable.rb:56)
	at RUBY.do_retry(/path/to/google-apis-core-0.14.0/lib/google/apis/core/http_command.rb:118)
	at RUBY.execute(/path/to/google-apis-core-0.14.0/lib/google/apis/core/http_command.rb:109)
	at RUBY.execute_or_queue_command(/path/to/google-apis-core-0.14.0/lib/google/apis/core/base_service.rb:477)
	at RUBY.insert_table(/path/to/google-apis-bigquery_v2-0.65.0/lib/google/apis/bigquery_v2/service.rb:1442)
	at RUBY.create_table_if_not_exists(/path/to/embulk-output-bigquery-0.7.1/lib/embulk/output/bigquery/bigquery_client.rb:451)
	at RUBY.with_network_retry(/path/to/embulk-output-bigquery-0.7.1/lib/embulk/output/bigquery/google_client.rb:51)
	at RUBY.create_table_if_not_exists(/path/to/embulk-output-bigquery-0.7.1/lib/embulk/output/bigquery/bigquery_client.rb:451)
	... 3 more

Error: org.jruby.exceptions.StandardError: (Error) failed to create table bqtest2-428605:mydataset.LOAD_TEMP_93248bb6_8d52_42b8_8226_67856107508b_hello_embulk in us/eu, response:{:status_code=>404, :message=>"notFound: Not found: Dataset bqtest-253710:mydataset", :error_class=>Google::Apis::ClientError}

hiroyuki-sato avatar Jul 17 '24 12:07 hiroyuki-sato

@kashira202111 CI is fixed in #166. Could you rebase this PR to check CI?

hiroyuki-sato avatar Jul 19 '24 04:07 hiroyuki-sato

@hiroyuki-sato Thank you. I rebased this PR!

kashira202111 avatar Jul 19 '24 07:07 kashira202111

@joker1007 CI passed. Please take a look when you have time.

hiroyuki-sato avatar Jul 19 '24 08:07 hiroyuki-sato

Thanks!

hiroyuki-sato avatar Jul 21 '24 02:07 hiroyuki-sato

@p-eye and @kashira202111 embulk-output-bigquery 0.7.2 has been released. 0.7.2 includes this PR. Please try it.

hiroyuki-sato avatar Jul 21 '24 11:07 hiroyuki-sato

It works well, thanks to you guys.

p-eye avatar Jul 22 '24 06:07 p-eye