
enable import from GCS emulator without `PublicHost`

Open totem3 opened this issue 1 year ago • 0 comments

fixes #209

Summary of problem:

There is an issue with the job that imports files from GCS when the GCS Emulator is used. As detailed in issue #209, attempts to import data from the GCS Emulator sometimes fail.

This happens when publicHost is not set in the GCS Emulator, or when the emulator is accessed through a host other than publicHost.

We have spent quite some time investigating this issue, and considering there's already an issue created with comments on it, we believe there is value in making it work without needing to set a publicHost.

Cause:

The problem arises due to two different URL formats used for accessing objects in the GCS Emulator:

  • /storage/v1/b/{bucketName}/o/{objectName}
  • /{bucketName}/{objectName}

The second URL pattern is only valid for requests made to the emulator's publicHost. When downloading files from GCS (using client.Bucket(...).Object(...).NewReader()), the Go GCS SDK uses the latter URL format, which requires a valid publicHost and results in errors if one is not set.
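To make the two formats concrete, here is a minimal sketch that builds both URLs for the same object. The helper names, the host `http://localhost:4443`, and the bucket and object names are made up for illustration; they are not taken from the SDK or the emulator.

```go
package main

import (
	"fmt"
	"net/url"
)

// jsonAPIObjectURL builds the JSON API style URL:
//   {host}/storage/v1/b/{bucketName}/o/{objectName}
// The object name is a single path segment here, so any "/" inside it
// is percent-encoded.
func jsonAPIObjectURL(host, bucket, object string) string {
	return fmt.Sprintf("%s/storage/v1/b/%s/o/%s",
		host, url.PathEscape(bucket), url.PathEscape(object))
}

// directObjectURL builds the prefix-less URL:
//   {host}/{bucketName}/{objectName}
// Only the host, bucket name, and object path are used; there is no
// "storage/v1" API prefix.
func directObjectURL(host, bucket, object string) string {
	u := url.URL{Path: "/" + bucket + "/" + object}
	return host + u.EscapedPath()
}

func main() {
	// Hypothetical emulator address and object, for illustration only.
	host := "http://localhost:4443"
	fmt.Println(jsonAPIObjectURL(host, "my-bucket", "data/file.csv"))
	fmt.Println(directObjectURL(host, "my-bucket", "data/file.csv"))
}
```

With these inputs the first URL has the path `/storage/v1/b/my-bucket/o/data%2Ffile.csv` and the second has the path `/my-bucket/data/file.csv`. Only the second depends on publicHost in the emulator.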

The issue can be pinpointed in the code here: When building the URL for data reading, the method at google-cloud-go#L788-L793 is used. This method does not take the API prefix (storage/v1) into account, considering only the host, bucket name, and object path. It is internally used in the NewReader method at bigquery-emulator#L1087.

However, this problem does not occur with the JSON API, because it uses the former URL format even when reading data (google-api-go-client#L12441).

This issue seems to be specific to the Emulator and not a problem with standard GCS usage, likely due to the ability to access objects directly through URLs without an API Prefix on storage.googleapis.com.

Changes made in this PR:

I have enabled the option to use the JSON API, so that imports work even when publicHost is not set for the emulator. Since the JSON download API was introduced in v1.30.0, I have upgraded the cloud.google.com/go/storage version accordingly.

This might be more a problem with the Go GCS SDK than with the BigQuery Emulator. So, if this fix isn't right, please let me know. If that's the case, I'm thinking of making another PR to add guidelines in the README about setting a publicHost for the GCS Emulator.

Thank you for maintaining such a great product.

totem3, Nov 23 '23 14:11