
Create the table in the destination project when setting `destination_project`

Open kashira202111 opened this issue 2 years ago • 8 comments

Summary

In the embulk-output-bigquery plugin, you can set the billing project and the destination project separately.

destination_project: A destination project to which the data will be loaded. Use this if you want to separate a billing project (the project value) and a destination project (the destination_project value).

If you set destination_project and the destination table or temporary table does not exist, the plugin tries to create the table in the billing project instead of the destination project. This causes a 'dataset not found' error or a 'permission denied' error.

I want the plugin to create the table in the destination project when destination_project is set.
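
For illustration, here is a minimal Ruby sketch of the intended behavior (this is not the plugin's actual code; the argument names and error handling are assumptions, though create_table_if_not_exists and insert_table appear in the stack trace below): the table must be created via tables.insert in the destination project, while load jobs are still submitted to the billing project.

# Minimal sketch of the intended behavior (illustrative, not the plugin's code).
require 'google/apis/bigquery_v2'

# fields: an Array of Google::Apis::BigqueryV2::TableFieldSchema
def create_table_if_not_exists(client, project, destination_project, dataset, table_id, fields)
  # Fall back to the billing project when destination_project is not set.
  destination_project ||= project

  body = Google::Apis::BigqueryV2::Table.new(
    table_reference: Google::Apis::BigqueryV2::TableReference.new(table_id: table_id),
    schema: Google::Apis::BigqueryV2::TableSchema.new(fields: fields)
  )

  # The dataset (and therefore the table) lives in the destination project, so the
  # table has to be created there, not in the billing project that only runs jobs.
  client.insert_table(destination_project, dataset, body)
rescue Google::Apis::ClientError => e
  raise unless e.status_code == 409 # table already exists
end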


Setting

Excerpt of the setting

out:
  type: bigquery
  auth_method: json_key
  mode: 'replace'
  project: '{xxx}'
  destination_project: '{yyy}'
  dataset: '{aaa}'
  json_keyfile: '{service_account.json}'
  gcs_bucket: '{gcs_bucket}'
  auto_create_table: true
  compression: true
  auto_create_gcs_bucket: false
  allow_quoted_newlines: true

The service account has the roles below:

  • roles/bigquery.jobUser, roles/bigquery.dataEditor, roles/storage.objectAdmin in yyy project
  • roles/bigquery.jobUser in xxx project

Error Log

Excerpt of the log

2022-12-22 15:22:37.071 +0900 [INFO] (0001:transaction): embulk-output-bigquery: Get dataset... yyy:aaa
2022-12-22 15:22:38.501 +0900 [INFO] (0001:transaction): embulk-output-bigquery: Create table... yyy:aaa.LOAD_TEMP_91f517c0_43ce_46e6_84a3_55e0f3714602_table
2022-12-22 15:22:39.020 +0900 [ERROR] (0001:transaction): embulk-output-bigquery: insert_table(xxx, 0_temp, us/eu, {:table_reference=>{:table_id=>"LOAD_TEMP_91f517c0_43ce_46e6_84a3_55e0f3714602_table"}, :schema=>{:fields=>[{:name=>"id", :type=>"INTEGER"}, {:name=>"url", :type=>"STRING"}, {:name=>"title", :type=>"STRING"}, {:name=>"created_at", :type=>"TIMESTAMP"}, {:name=>"updated_at", :type=>"TIMESTAMP"}]}}, {}), response:{:status_code=>404, :message=>"notFound: Not found: Dataset xxx:aaa", :error_class=>Google::Apis::ClientError}
org.embulk.exec.PartialExecutionException: org.jruby.exceptions.RaiseException: (Error) failed to create table yyy:aaa.LOAD_TEMP_91f517c0_43ce_46e6_84a3_55e0f3714602_table in us/eu, response:{:status_code=>404, :message=>"notFound: Not found: Dataset xxx:aaa", :error_class=>Google::Apis::ClientError}

kashira202111 avatar Dec 22 '22 08:12 kashira202111

@hiroyuki-sato Hi, could you review it? I want to separate the destination project {yyy} from the billing project {xxx}, but without this commit, it always tries to insert into the billing project and fails.

It needs to insert into the destination project instead.

p-eye avatar Jun 23 '24 02:06 p-eye

@p-eye It seems this PR needs additional modification. The trocco-io version already uses this feature https://github.com/trocco-io/embulk-output-bigquery/commit/8343629792b3b15e0f3fe89eab50eb5edd84dd1b

I'll take a look at the difference.

hiroyuki-sato avatar Jun 23 '24 14:06 hiroyuki-sato

@kashira202111, @p-eye I misunderstood this PR. Could you provide a reproduction example (for my test)? @kashira202111 Could you resolve the conflicts?

hiroyuki-sato avatar Jun 24 '24 09:06 hiroyuki-sato

@hiroyuki-sato Thank you for your reply. I'm going to resolve the conflicts. Please wait a moment.

kashira202111 avatar Jun 24 '24 09:06 kashira202111

@kashira202111 Thank you for your work, and sorry for keeping you waiting for the review.

I want to confirm the following.

  • Current behavior: raises a 'dataset not found' error and 'permission denied' (I want to know how to reproduce this error)
  • Whether this PR changes the current behavior or not.

If my understanding is correct, this PR is a fix for the error, and we don't need to consider compatibility. But if this PR changes the current behavior, we need to take compatibility into account.

Please tell me the details about the above.

hiroyuki-sato avatar Jun 24 '24 10:06 hiroyuki-sato

  • Current behavior: raises a 'dataset not found' error and 'permission denied' (I want to know how to reproduce this error)

The main purpose of this PR is to create the table in the destination_project project, not in the billing project (project). The current behavior creates the table in project regardless of whether destination_project is set.

If you want to reproduce this error, you must prepare the following (the config excerpt and a verification sketch follow the list):

  • Use auto_create_table=true
  • Use both project parameter and destination_project parameter
    • project and destination_project are not equal.
  • Use auto_create_dataset=false
  • In the xxx project, the 'aaa' dataset does not exist. In the yyy project, the 'aaa' dataset exists.
    • Then the 'dataset not found' error is raised.
out:
  type: bigquery
  auth_method: json_key
  mode: 'replace'
  project: 'xxx'
  destination_project: 'yyy'
  dataset: 'aaa'
  json_keyfile: '{service_account.json}'
  auto_create_dataset: false
  auto_create_table: true

kashira202111 avatar Jun 25 '24 10:06 kashira202111

@hiroyuki-sato I am sorry that my English is so bad. Could you check again?

kashira202111 avatar Jun 25 '24 10:06 kashira202111

@kashira202111 Thank you for your work. I'll review this PR in a week. Please wait.

hiroyuki-sato avatar Jun 25 '24 12:06 hiroyuki-sato

@kashira202111 I can't make time for the review yet. Please wait for a while.

hiroyuki-sato avatar Jul 03 '24 13:07 hiroyuki-sato

Hello, @kashira202111 and @p-eye

Does this PR fix the problem if a user uses the gcs_bucket parameter in the configuration? If you don't use gcs_bucket, you can load data into the table you specified in the destination_project.

The following configuration created the bqtest2.mydataset.hello_embulk table.

Is my understanding correct? And I have not set up gcs_bucket for testing yet (I don't use the gcs_bucket parameter in my project). Please wait while I set it up.

As a workaround, can you remove gcs_bucket?

out:
  type: bigquery
  mode: replace
  project: bqtest
  destination_project: bqtest2
  dataset: mydataset
  auth_method: service_account
  json_keyfile: 'key_file'
  auto_create_dataset: false
  auto_create_table: true
  table: hello_embulk

hiroyuki-sato avatar Jul 07 '24 13:07 hiroyuki-sato

@hiroyuki-sato Thanks for the review.

Does this PR fix the problem if a user uses the gcs_bucket parameter in the configuration?

I'm going to check this. Please wait for a while.

kashira202111 avatar Jul 09 '24 09:07 kashira202111

@hiroyuki-sato

Both with and without the gcs_bucket parameter, we can load data into destination_project with this commit.

But I found another problem with gcs_bucket in the latest release. After that problem was fixed, I confirmed that I can load data.

kashira202111 avatar Jul 10 '24 09:07 kashira202111

I will report the problem later. Please wait for a while.

kashira202111 avatar Jul 10 '24 10:07 kashira202111

@hiroyuki-sato

This is the problem that I mentioned above.

https://github.com/embulk/embulk-output-bigquery/pull/164

kashira202111 avatar Jul 10 '24 14:07 kashira202111

@kashira202111

I would like to double-check this issue. Without applying this PR, I could load data into the destination_project table (https://github.com/embulk/embulk-output-bigquery/pull/148#issuecomment-2212447130), so I don't completely understand the problem yet.

Could you explain the problem again?

hiroyuki-sato avatar Jul 11 '24 13:07 hiroyuki-sato

@hiroyuki-sato Hi. I'm sorry that my PR title and description led to the misunderstanding. I rewrote the PR title and description.

Could you review this PR again?

kashira202111 avatar Jul 16 '24 06:07 kashira202111

@kashira202111 Thank you for proposing this PR. I confirmed this PR fixes the problem.

Expected behavior: load data into the table in the destination_project project and remove all intermediate tables. However, without this PR, it behaves as follows.

  • Dataset does not exist in project (the billing project): raises an exception (below).
  • Dataset already exists in project: creates unnecessary tables (temp and destination) there.

org.embulk.exec.PartialExecutionException: org.jruby.exceptions.StandardError: (Error) failed to create table bqtest2-428605:mydataset.LOAD_TEMP_93248bb6_8d52_42b8_8226_67856107508b_hello_embulk in us/eu, response:{:status_code=>404, :message=>"notFound: Not found: Dataset bqtest-253710:mydataset", :error_class=>Google::Apis::ClientError}
	at org.embulk.exec.BulkLoader$LoaderState.buildPartialExecuteException(BulkLoader.java:340)
	at org.embulk.exec.BulkLoader.doRun(BulkLoader.java:580)
	at org.embulk.exec.BulkLoader.access$000(BulkLoader.java:36)
	at org.embulk.exec.BulkLoader$1.run(BulkLoader.java:353)
	at org.embulk.exec.BulkLoader$1.run(BulkLoader.java:350)
	at org.embulk.spi.ExecInternal.doWith(ExecInternal.java:26)
	at org.embulk.exec.BulkLoader.run(BulkLoader.java:350)
	at org.embulk.EmbulkEmbed.run(EmbulkEmbed.java:278)
	at org.embulk.EmbulkRunner.runInternal(EmbulkRunner.java:288)
	at org.embulk.EmbulkRunner.run(EmbulkRunner.java:153)
	at org.embulk.cli.EmbulkRun.runInternal(EmbulkRun.java:115)
	at org.embulk.cli.EmbulkRun.run(EmbulkRun.java:24)
	at org.embulk.cli.Main.main(Main.java:53)
	Suppressed: java.lang.NullPointerException
		at org.embulk.exec.BulkLoader.doCleanup(BulkLoader.java:477)
		at org.embulk.exec.BulkLoader$3.run(BulkLoader.java:411)
		at org.embulk.exec.BulkLoader$3.run(BulkLoader.java:408)
		at org.embulk.spi.ExecInternal.doWith(ExecInternal.java:26)
		at org.embulk.exec.BulkLoader.cleanup(BulkLoader.java:408)
		at org.embulk.EmbulkEmbed.run(EmbulkEmbed.java:283)
		... 5 more
Caused by: org.jruby.exceptions.StandardError: (Error) failed to create table bqtest2-428605:mydataset.LOAD_TEMP_93248bb6_8d52_42b8_8226_67856107508b_hello_embulk in us/eu, response:{:status_code=>404, :message=>"notFound: Not found: Dataset bqtest-253710:mydataset", :error_class=>Google::Apis::ClientError}
	at RUBY.create_table_if_not_exists(/path/to/embulk-output-bigquery-0.7.1/lib/embulk/output/bigquery/bigquery_client.rb:462)
	at RUBY.auto_create(/path/to/embulk-output-bigquery-0.7.1/lib/embulk/output/bigquery.rb:315)
	at RUBY.transaction(/path/to/embulk-output-bigquery-0.7.1/lib/embulk/output/bigquery.rb:342)
	at RUBY.transaction(/path/to/embulk-0.11.4-java/lib/embulk/output_plugin.rb:64)
Caused by: org.jruby.exceptions.StandardError: (ClientError) notFound: Not found: Dataset bqtest-253710:mydataset
	at RUBY.check_status(/path/to/google-apis-core-0.14.0/lib/google/apis/core/http_command.rb:244)
	at RUBY.check_status(/path/to/google-apis-core-0.14.0/lib/google/apis/core/api_command.rb:135)
	at RUBY.process_response(/path/to/google-apis-core-0.14.0/lib/google/apis/core/http_command.rb:207)
	at RUBY.execute_once(/path/to/google-apis-core-0.14.0/lib/google/apis/core/http_command.rb:326)
	at RUBY.do_retry(/path/to/google-apis-core-0.14.0/lib/google/apis/core/http_command.rb:131)
	at RUBY.retriable(/path/to/retriable-3.1.2/lib/retriable.rb:61)
	at org.jruby.RubyFixnum.times(org/jruby/RubyFixnum.java:312)
	at RUBY.retriable(/path/to/retriable-3.1.2/lib/retriable.rb:56)
	at RUBY.do_retry(/path/to/google-apis-core-0.14.0/lib/google/apis/core/http_command.rb:128)
	at RUBY.retriable(/path/to/retriable-3.1.2/lib/retriable.rb:61)
	at org.jruby.RubyFixnum.times(org/jruby/RubyFixnum.java:312)
	at RUBY.retriable(/path/to/retriable-3.1.2/lib/retriable.rb:56)
	at RUBY.do_retry(/path/to/google-apis-core-0.14.0/lib/google/apis/core/http_command.rb:118)
	at RUBY.execute(/path/to/google-apis-core-0.14.0/lib/google/apis/core/http_command.rb:109)
	at RUBY.execute_or_queue_command(/path/to/google-apis-core-0.14.0/lib/google/apis/core/base_service.rb:477)
	at RUBY.insert_table(/path/to/google-apis-bigquery_v2-0.65.0/lib/google/apis/bigquery_v2/service.rb:1442)
	at RUBY.create_table_if_not_exists(/path/to/embulk-output-bigquery-0.7.1/lib/embulk/output/bigquery/bigquery_client.rb:451)
	at RUBY.with_network_retry(/path/to/embulk-output-bigquery-0.7.1/lib/embulk/output/bigquery/google_client.rb:51)
	at RUBY.create_table_if_not_exists(/path/to/embulk-output-bigquery-0.7.1/lib/embulk/output/bigquery/bigquery_client.rb:451)
	... 3 more

Error: org.jruby.exceptions.StandardError: (Error) failed to create table bqtest2-428605:mydataset.LOAD_TEMP_93248bb6_8d52_42b8_8226_67856107508b_hello_embulk in us/eu, response:{:status_code=>404, :message=>"notFound: Not found: Dataset bqtest-253710:mydataset", :error_class=>Google::Apis::ClientError}

hiroyuki-sato avatar Jul 17 '24 12:07 hiroyuki-sato

@kashira202111 CI is fixed in #166. Could you rebase this PR to check CI?

hiroyuki-sato avatar Jul 19 '24 04:07 hiroyuki-sato

@hiroyuki-sato Thank you. I rebased this PR!

kashira202111 avatar Jul 19 '24 07:07 kashira202111

@joker1007 CI passed. Please take a look when you have time.

hiroyuki-sato avatar Jul 19 '24 08:07 hiroyuki-sato

Thanks!

hiroyuki-sato avatar Jul 21 '24 02:07 hiroyuki-sato

@p-eye and @kashira202111 embulk-output-bigquery 0.7.2 has been released. 0.7.2 includes this PR. Please try it.

hiroyuki-sato avatar Jul 21 '24 11:07 hiroyuki-sato

It works well, thanks to you guys.

p-eye avatar Jul 22 '24 06:07 p-eye