Add support for Google Workload Identity to Cube Store
Is your feature request related to a problem? Please describe. Cube Store currently only supports authenticating to Google Cloud Storage with a JSON service account credential. For security hardening, Google recommends avoiding this method where possible.
Describe the solution you'd like Implement Workload Identity as described in https://cloud.google.com/kubernetes-engine/docs/concepts/workload-identity
Describe alternatives you've considered The alternative is to keep using the existing hard-coded secrets method.
If you are interested in working on this issue, please leave a comment below and we will be happy to assign the issue to you. If this is the first time you are contributing a Pull Request to Cube.js, please check our contribution guidelines. You can also post any questions while contributing in the #contributors channel in the Cube.js Slack.
Hi, I looked into it a bit.
It seems that if you don't provide credentials, Cube.js (the JS BigQuery/Storage client side) works correctly. Cube Store, however, uses a Rust package that doesn't have that option; there is an open PR for adding it.
The package uses gcp_auth to handle authentication to GCP. Maybe we can use it directly instead of waiting for a fix in the package, or maybe we can switch packages? (See the sketch after the list below.)
The changes needed:
- cubejs: make CUBEJS_DB_BQ_CREDENTIALS optional
- cubestore: make CUBESTORE_GCP_CREDENTIALS optional and handle the missing case, either via gcp_auth or, if possible, by switching to a package that already handles it
Another alternative: publishing a separate environment variable that makes working with Workload Identity explicit.
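Not from the thread, but to illustrate the "use it directly" idea: a minimal sketch of fetching a Storage token with gcp_auth 0.7.x, which probes, among other sources, the metadata server that Workload Identity exposes. The function name and scope are illustrative.
use gcp_auth::AuthenticationManager;

// Sketch only: obtain a GCS access token without a JSON key. On GKE with
// Workload Identity, gcp_auth falls through to the instance metadata server.
async fn fetch_storage_token() -> Result<String, gcp_auth::Error> {
    let manager = AuthenticationManager::new().await?;
    let token = manager
        .get_token(&["https://www.googleapis.com/auth/devstorage.read_write"])
        .await?;
    Ok(token.as_str().to_string())
}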
Hello,
I got Workload Identity working with the following changes; it would be awesome if anyone with Rust experience could run the tests and make a PR. Workload Identity makes a huge difference to us, but we don't have the resources for Rust development, so this is just a hack to get it working with old versions and toolchains.
In cubestore/cubestore/Cargo.toml, upgrade from cloud-storage 0.7.0 to the 0.11.1 fork by @dark0dave. Additionally, make sure to lock gcp_auth to 0.7.4 and time to 0.3.17, otherwise it won't compile with the toolchain channel = "nightly-2022-11-03". The original cloud-storage does not seem to be actively maintained.
cloud-storage = { git = "https://github.com/dark0dave/cloud-storage-rs", branch = "master", features = ["global-client", "sync", "native-tls"] }
gcp_auth = { version = "=0.7.4" }
time = { version = "=0.3.17" }
- From remotefs/gcs.rs, remove ensure_credentials_init to drop the requirement on the environment variable (a sketch of an optional fallback follows after the listing below).
- To adapt remotefs/gcs.rs to cloud-storage 0.11.1, I updated list_with_metadata and the imports:
use cloud_storage::{ListRequest, Object};
...
async fn list_with_metadata(
&self,
remote_prefix: String,
) -> Result<Vec<RemoteFile>, CubeError> {
let prefix = self.gcs_path(&remote_prefix);
let request = ListRequest {
prefix: Some(prefix.into()),
..Default::default()
};
let list = Object::list(self.bucket.as_str(), request)
.await?
.map_ok(|object_list| object_list.items)
.try_concat()
.await?;
let leading_slash = Regex::new(format!("^{}", self.gcs_path("")).as_str()).unwrap();
let result: Vec<RemoteFile> = list.into_iter()
.map(|obj| RemoteFile {
remote_path: leading_slash.replace(&obj.name, NoExpand("")).to_string(),
updated: obj.updated.clone(),
file_size: obj.size,
})
.collect::<Vec<RemoteFile>>();
let mut pages_count = result.len() / 1_000;
if result.len() % 1_000 > 0 {
pages_count += 1;
}
if pages_count > 100 {
log::warn!("gcs list returned more than 100 pages: {}", pages_count);
}
app_metrics::REMOTE_FS_OPERATION_CORE.add_with_tags(
pages_count as i64,
Some(&vec![
"operation:list".to_string(),
"driver:gcs".to_string(),
]),
);
Ok(result)
}
Fix any warnings, run cargo build, and hopefully it should work. I tested this on cubestore v0.35.6.
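For the ensure_credentials_init removal above, here is a minimal sketch of what an optional fallback could look like. The function name is illustrative, and the SERVICE_ACCOUNT_JSON handoff to cloud-storage is an assumption, not the actual Cube Store code.
fn init_gcs_credentials() {
    match std::env::var("CUBESTORE_GCP_CREDENTIALS") {
        // An explicit key was provided: hand it to cloud-storage as before.
        Ok(credentials) => std::env::set_var("SERVICE_ACCOUNT_JSON", credentials),
        // No key set: rely on ambient credentials such as Workload Identity.
        Err(_) => log::info!(
            "CUBESTORE_GCP_CREDENTIALS not set; relying on ambient GCP credentials"
        ),
    }
}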
Allow me to update the code snippet to save others some time on the map_ok error (just add the correct import from futures):
use cloud_storage::{ListRequest, Object};
use futures::{TryStreamExt, StreamExt};
...
async fn list_with_metadata(
&self,
remote_prefix: String,
) -> Result<Vec<RemoteFile>, CubeError> {
let prefix = self.gcs_path(&remote_prefix);
let request = ListRequest {
prefix: Some(prefix.into()),
..Default::default()
};
let list = Object::list(self.bucket.as_str(), request)
.await?
.map_ok(|object_list| object_list.items)
.try_concat()
.await?;
let leading_slash = Regex::new(format!("^{}", self.gcs_path("")).as_str()).unwrap();
let result: Vec<RemoteFile> = list.into_iter()
.map(|obj| RemoteFile {
remote_path: leading_slash.replace(&obj.name, NoExpand("")).to_string(),
updated: obj.updated.clone(),
file_size: obj.size,
})
.collect::<Vec<RemoteFile>>();
let mut pages_count = result.len() / 1_000;
if result.len() % 1_000 > 0 {
pages_count += 1;
}
if pages_count > 100 {
log::warn!("gcs list returned more than 100 pages: {}", pages_count);
}
app_metrics::REMOTE_FS_OPERATION_CORE.add_with_tags(
pages_count as i64,
Some(&vec![
"operation:list".to_string(),
"driver:gcs".to_string(),
]),
);
Ok(result)
}
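With that import in place, a standalone smoke test along these lines should list objects without any credential variables set (assuming the same cloud-storage fork and features; the bucket name is a placeholder):
use cloud_storage::{ListRequest, Object};
use futures::TryStreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // "my-bucket" is a placeholder; with Workload Identity there is no need
    // to set SERVICE_ACCOUNT or CUBESTORE_GCP_CREDENTIALS.
    let objects = Object::list("my-bucket", ListRequest::default())
        .await?
        .map_ok(|object_list| object_list.items)
        .try_concat()
        .await?;
    println!("listed {} objects", objects.len());
    Ok(())
}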
@codingchili Thanks for contributing this! ❤️
Indeed, it would be nice if anyone could make a PR and bring it over the finish line 😄
@igorlukanin Opened a PR
I managed to get this to work in my environment; however, I did not write any additional tests for it. Please also note that it uses a forked GitHub repo as a dependency (cloud-storage-rs by @dark0dave).