
Redshift Data Sharing

anmolsgandhi opened this issue 1 year ago · 5 comments

Description:

Enable seamless data integration with Redshift as a new data source in ‘data.all’. This feature enhances collaboration by allowing users to easily publish, discover, and share Redshift data within the data.all platform. Users can securely configure a Redshift instance, streamlining the process of making Redshift datasets accessible.

Details:

Adding Redshift Instance and Publishing Tables

  • Users initiate the process by selecting “Create Dataset” and choosing Redshift from the dropdown menu.
  • The interface guides users through a secure credential input, ensuring a streamlined and secure configuration process.
  • Once configured, the dataset owners can select specific tables to publish to the ‘data.all’ catalog, ensuring a controlled inclusion of Redshift data.

Tables Available for Discovery

  • Cataloged Redshift tables automatically become part of the ‘data.all’ catalog, visible to users exploring datasets within the platform.
  • The catalog provides detailed metadata for each table, facilitating a comprehensive understanding of available data.
  • Users can navigate the ‘data.all’ UI to effortlessly discover and explore Redshift tables.
  • Dataset owners can edit metadata for each table, such as description and tags.

Self-service Share Process for Redshift Data Sharing

  • Consumers interested in specific Redshift tables initiate the share process by selecting the desired dataset.
  • Owners of the shared Redshift tables within data.all Datasets receive access requests, with an easy-to-use interface for managing permissions and approvals.
  • Upon approval, the shared Redshift data becomes dynamically accessible to consumers, maintaining a consistent and user-friendly experience.

Benefits:

  • Additional Data Source Integration: The added capability of Redshift as a new data source enhances flexibility, enabling users to integrate diverse data sources beyond S3, expanding the platform’s utility.
  • User-Friendly Configuration: A guided process ensures Redshift instances are connected with securely handled credentials.
  • Efficient Discovery: Automated cataloging promotes effortless exploration of Redshift tables within the ‘data.all’ catalog.
  • Streamlined Sharing Workflow: The self-service share process maintains simplicity and consistency across different types of data, allowing users to request and access Redshift data seamlessly as they do with S3 data.

@dlpzx

anmolsgandhi commented Jan 09 '24

Design

This design is up to date with the latest implementation changes

Assumptions

  • Redshift clusters/namespaces are created and maintained by DevOps teams outside of data.all
  • Database admin teams manage users in their clusters/namespaces outside of data.all
  • Data producers and consumers can access their clusters/namespaces with the access provided by the database admin teams.
  • Data producers create tables in Redshift outside of data.all
  • Data.all requires, for the data producers that are going to publish data, a Redshift user of the type IAM:user or a database user with credentials stored in AWS Secrets Manager (in the diagram this is the basis for Authorization 1). Data.all needs permissions to use the IAM role or to access the secret. This user needs permissions to create datashares.
  • Data.all requires a Redshift user of the type IAM:user for the data.all PivotRole in all accounts with a Redshift cluster. This user needs permissions to create datashares. In the diagram this is the basis for Authorizations 2 and 3.
  • The data.all share request principal will be a REDSHIFT ROLE.
  • Data consumers register their Redshift roles as Redshift Consumption Roles. Database admins control which roles are created in Redshift and which roles are attached to which user/group. To isolate data.all access grants from other access grants, we recommend that database admins create dedicated Redshift roles. For example, for projectXYZ a group of Redshift users needs permissions to data in another cluster. The database admin should create a Redshift role DAProjectXYZ and attach it to the relevant roles/users/groups in Redshift. Data consumers then register the role in data.all and request access to the data they need (a hedged sketch of this admin-side setup follows this list).
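
As a rough illustration of the admin-side setup above (performed outside data.all): all names (cluster, role, user, secret) are placeholders, and the SQL is standard Redshift role DDL, issued here through the Redshift Data API.

```python
# Hedged sketch of the admin-side role setup described above, run outside
# data.all. Names are illustrative.
import boto3

rsd = boto3.client("redshift-data")

statements = [
    "CREATE ROLE DAProjectXYZ;",                  # dedicated role for data.all grants
    "GRANT ROLE DAProjectXYZ TO ROLE analysts;",  # attach to an existing Redshift role
    'GRANT ROLE DAProjectXYZ TO "project_user";', # ...or directly to a user
]
for sql in statements:
    rsd.execute_statement(
        ClusterIdentifier="consumer-cluster",                 # illustrative
        Database="dev",
        SecretArn="arn:aws:secretsmanager:...:admin-secret",  # db admin credentials
        Sql=sql,
    )
```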

HLD and User experience

Initial design with Lake Formation ---> NOT USED ANYMORE

During implementation we realized that datashares with Lake Formation do not bring much value when they are not used to actually share further in Lake Formation (it just puts metadata in Glue). It might be useful in the future if we integrate with IAM Identity Center, as the Redshift-LF integration works far better with IAM IC, but for the moment we won't be using it. If in the future we want to revert the changes, the code is in commits: https://github.com/data-dot-all/dataall/commit/bf476fc2a7e82ad3275530b24aab62858d718ffd, https://github.com/data-dot-all/dataall/commit/ef9662bd74379ba442b2f2f47e3de20457990b96 and https://github.com/data-dot-all/dataall/commit/17970d758a7977dd57c77f467bc135166c5fa159.


Initial data sharing design with data.all Redshift consumption roles ---> NOT USED ANYMORE

There are 3 reasons why this design has been further improved:

  • In this design we assume that users take the necessary actions outside data.all on the pivot role so that it can process datashares in the source and target clusters. It also connects to the cluster in a different way from how we do it for dataset publishing, which adds more code, more IAM policies, and more Redshift features that we use without really needing them.

  • In addition, we are creating a data.all abstraction, Redshift consumer roles, which is yet another layer of complexity for users to interact with.

  • Finally, data.all does not ensure that the user has taken the necessary preliminary actions before opening a share request. There is no visibility on whether the namespace used for the share request can be accessed by data.all, which can lead to errors during sharing (when the actual error lies in the onboarding of the Redshift cluster). We should separate cluster onboarding steps from sharing steps as much as possible.

(Diagram: Redshift-dataall-without-warehouses-with-consumptionRoles_UPDATED.drawio)

Current design

We add more guardrails on the onboarding of clusters by requiring that a pivot role connection is created for each cluster used. This is a prerequisite for creating other types of connections and for opening share requests.

(Diagram: Redshift-dataall-without-warehouses-with-consumptionRoles_UPDATED_2.drawio)

Following the numbering above:

  1. Outside of data.all, Database Admin Teams manage Redshift cluster users.
    1. For data producers - they create a regular Redshift user and optionally (mandatory for Redshift Serverless) store the credentials in Secrets Manager, ⚠️ [NOT IMPLEMENTED YET] or they create a Redshift user of the type IAM:user that allows IAM federation
    2. For data consumers - They create Redshift roles and attach them to users
  2. Outside of data.all, Database Admin Teams in the data producer and in the data consumer clusters create a user in Redshift for the data.all IAM pivot role and optionally (mandatory for Redshift Serverless) store the credentials in Secrets Manager, ⚠️ [NOT IMPLEMENTED YET] or they create a Redshift user of the type IAM:user that allows IAM federation
  3. Outside of data.all, Data producers work in Redshift and create tables
  4. In data.all UI, Data producers create a data.all pivot role Connection. Without a valid pivot role connection, no other connection can be created!
  5. In data.all UI, Data producers create a data.all Connection.
    1. When creating a connection, users need to introduce:
      1. The Redshift user (IAM:user) IAM role or SecretArn created by their db admins
      2. Environment where the cluster is
      3. Namespace/cluster id
      4. Database
      5. A data.all Team that owns the connection. Only members of the Team can use it. (similar to consumption IAM roles)
    2. Connections are going to be used to AUTHORIZE the import of data and maybe in next steps to open Redshift QueryEditorV2. There are different types of Redshift users:
      1. ⚠️ [NOT IMPLEMENTED YET] Federated users (the IAM role is stored). The role created has permissions to be used as federated user in Redshift by data.all.
      2. [IMPLEMENTED] AWS Secrets Manager (the secretArn is stored). Customers will need to tag the secret in order for data.all to be able to access it (see the secret-tagging sketch after this numbered list).
      3. NEXT STEPS - IAM Identity Center - it cannot be used at the moment for the publication of data.
      4. NEVER - username and password. From data.all we want to avoid having to secure passwords in transit.
  6. In data.all UI, Data producers import a Redshift dataset in data.all specifying:
    1. Select the Environment and the Connection to use for import
    2. The Team that owns the Connection also will own the Dataset
    3. Redshift schema and selection of tables to be imported from that schema
  7. Under-the-hood, when a dataset is imported, ~data.all creates a datashare between Redshift and the Glue Catalog using the authorization of the Connection~ the metadata for the schema and tables imported is stored. Dataset and tables are indexed in the data catalog.
  8. In data.all UI, Data producers can fetch the schema of the imported tables in the dataset/Data tab~click on “Sync tables” in the imported dataset as we do with S3/Glue datasets. Tables appear in data.all~. Users can ListDatasets, which lists S3 and Redshift datasets.
  9. Under-the-hood, when the data producer opens the schema of a table, data.all uses the Redshift Data API to read the table details from Redshift (see the Data API sketch after this numbered list). ~clicks sync-tables, data.all reads from the glue database created as part of the datashare from Redshift to Glue Catalog~
  10. In data.all UI, data consumers can discover RS tables and datasets in Catalog
  11. In data.all UI, Database consumers create a data.all pivot role Connection for the target cluster. Without a pivot role connection that is valid, the share request cannot be created.
  12. In data.all UI, data consumers can create a share request by selecting the dataset or tables. They submit the request
    1. In the share request they select the target environment and target group
    2. A dropdown lists the namespaces with pivot role connections in the environment.
    3. Data.all checks that the target group has permissions to use Redshift in the environment
    4. Optionally, Users manually input the Redshift role that is the recipient of the request. If the Redshift role is specified the share request is granted to the role, if not, to the namespace.
  13. In data.all UI, data producers approve the request
  14. Under-the-hood, data.all creates a datashare in the data producer's cluster/namespace
  15. Under-the-hood, data.all associates the datashare to the data consumer's cluster and grants permissions to the Redshift role (if specified)
  16. Data consumers will access the data through:
  • BI tools: Quicksight, Tableau, Power BI, Qlik (JDBC/ODBC connections)
  • SQL clients: DBeaver, SQL Workbench (JDBC/ODBC connections)
  • ETL workloads in Redshift
  • Ad-hoc queries in Redshift Query Editor
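
As a rough illustration of the Secrets Manager option in step 5, a producer team might store credentials along these lines. The secret name and tag keys here are placeholders; the data.all docs define the exact Redshift-specific and data.all tags the secret must carry.

```python
# Hedged sketch of storing producer credentials for a data.all connection.
import json

import boto3

sm = boto3.client("secretsmanager")
sm.create_secret(
    Name="dataall-redshift-producer-user",  # illustrative name
    SecretString=json.dumps({"username": "producer_user", "password": "..."}),
    Tags=[
        {"Key": "dataall", "Value": "True"},   # placeholder data.all tag
        {"Key": "Redshift", "Value": "True"},  # placeholder Redshift tag
    ],
)
```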

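And for step 9, a minimal sketch of how a table's schema can be fetched with the Redshift Data API (boto3 `redshift-data`), as data.all does under the hood; all identifiers are illustrative.

```python
# Minimal sketch of step 9: reading table details via the Redshift Data API.
import boto3

rsd = boto3.client("redshift-data")
resp = rsd.describe_table(
    ClusterIdentifier="producer-cluster",
    Database="dev",
    Schema="public",
    Table="customer",
    SecretArn="arn:aws:secretsmanager:...:connection-secret",
)
for col in resp["ColumnList"]:
    print(col["name"], col["typeName"])
```
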
User experience

Redshift connection

**Create/Delete and list**

In the creation we check that the connection is valid by listing databases in Redshift and making sure the selected database is part of the cluster/workgroup (see the sketch below). Serverless clusters do not accept a db user for federation (see example API call). At the moment db users are disabled for serverless; in the future we can think of assuming an IAM role and then doing federation.
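
A minimal sketch of that validation, assuming the boto3 `redshift-data` client; identifiers are illustrative, and exactly one of `cluster_id` / `workgroup` is expected (serverless must authenticate via a secret).

```python
# Minimal sketch of the connection validation described above.
import boto3

rsd = boto3.client("redshift-data")

def database_exists(database, secret_arn, cluster_id=None, workgroup=None):
    kwargs = {"Database": database, "SecretArn": secret_arn}
    if cluster_id:
        kwargs["ClusterIdentifier"] = cluster_id
    else:
        kwargs["WorkgroupName"] = workgroup
    # list_databases returns the databases visible in the cluster/workgroup
    return database in rsd.list_databases(**kwargs)["Databases"]
```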

If any parameter in the connection form is invalid, it will throw an error. If the team does not have permissions to create a connection in the environment, or does not have tenant permissions for Redshift, it will also throw an error.

https://github.com/user-attachments/assets/28ed3e9a-2aaf-4707-977f-2bd3f7c14aa0


Redshift dataset

Import Form

https://github.com/user-attachments/assets/a6927c3d-9a97-4431-9743-783c00de2162

List Datasets view, with icons for S3 and Redshift

**List Datasets in Environment**

https://github.com/user-attachments/assets/308050b5-9b6e-40eb-bea0-e1d4209a4dd5

Dataset view

Dataset edit form, Tables tab and schema modal

https://github.com/user-attachments/assets/89c69669-59bf-4ff0-a889-523581d87d25

Table view, Columns tab and Table edit form

https://github.com/user-attachments/assets/1a8437d0-4b51-4330-833a-e3c0b495e591

Delete Table, Dataset

https://github.com/user-attachments/assets/0254fe55-119b-409f-bb09-b25c65eb861a

They get deleted and removed from the catalog


**Catalog indexing**

https://github.com/user-attachments/assets/509d3f83-7a23-46cb-8d2b-98bb1c254280

**Feed, Votes**

https://github.com/user-attachments/assets/afa18a3b-5f03-4dda-b763-f2b8480da6b6

Glossary


**Redshift permissions controls**

In the admin settings and in the environment team invitation form we can define the Redshift permissions applied to teams.

https://github.com/user-attachments/assets/1b95965c-0366-412a-8ce0-8cdef19e523d

Permissions

IAM permissions

IAM permissions are granted solely to the pivot role. List and describe permissions are granted on all resources where needed, while write operations on Redshift workgroups, namespaces and clusters are restricted to the resources that have been onboarded to data.all in the form of Connections. Every time a connection is added to the environment, the pivot role gets updated (the environment stack gets redeployed). A sketch of this policy shape follows.
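
As an illustration only (not the exact generated policy), the statements could take roughly this shape; action names and ARNs are placeholder examples.

```python
# Illustrative shape of the pivot-role policy split described above: broad
# read/list access, write access scoped to onboarded resources only.
pivot_role_redshift_statements = [
    {
        "Sid": "RedshiftRead",
        "Effect": "Allow",
        "Action": [
            "redshift:Describe*",
            "redshift-serverless:List*",
            "redshift-data:Describe*",
        ],
        "Resource": "*",
    },
    {
        "Sid": "RedshiftWriteOnboardedOnly",
        "Effect": "Allow",
        "Action": ["redshift-data:ExecuteStatement"],
        "Resource": [
            # only clusters/workgroups registered as data.all Connections
            "arn:aws:redshift:eu-west-1:111111111111:cluster:onboarded-cluster",
            "arn:aws:redshift-serverless:eu-west-1:111111111111:workgroup/example-workgroup-id",
        ],
    },
]
```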

data.all application permissions

Here we are referring to permissions guarding API calls. These are not IAM permissions but data.all-specific permissions that can be of the type tenant-level, environment-level, or group-level. For more info, check the Permission model section in the docs.

To avoid complex permission-backfilling migrations, or the risk of overly permissive ones, I am taking dataset sharing and future extensions into account when deciding which permissions to include.

Redshift Connection permissions

All API-facing methods of RedshiftConnectionService are protected by permission decorators (a toy sketch of this pattern follows the permission list below).

  • Tenant permissions
    • MANAGE_REDSHIFT_CONNECTION - 👀 ⚠️ Initially this permission was not defined, assuming that connections were controlled as part of MANAGE_REDSHIFT_DATASETS. However, the actions on Redshift connections are sensitive enough to deserve dedicated restriction by the data.all admin, so it was added back. This permission is applied to create/delete connections. Users without this permission can still import Redshift datasets using connections (i.e., perform get/list operations). Second warning ⚠️ At the moment this permission won't do much, since a connection has an admin group that creates the connection and then uses it; but if in the future we want to share a connection, it will be useful.
  • Environment permissions - granted when inviting a team to an environment
    • CREATE_REDSHIFT_CONNECTION - to limit which groups in an environment are allowed to create connections in the environment. Applied to create_redshift_connection
    • LIST_ENVIRONMENT_REDSHIFT_CONNECTIONS - to prevent users outside an environment from fetching that environment's connections.
  • Group permissions
    • GET_REDSHIFT_CONNECTION - to prevent unauthorized users (not belonging to the connection owner team) from getting the details of the connection. Applied to multiple operations that get info from the connection and granted to the dataset admin team. 👀 In the future extensible to non-admin groups that could use the connection without being its admins.
    • DELETE_REDSHIFT_CONNECTION - to prevent unauthorized users from deleting a connection. Applied to delete_redshift_connection and granted ONLY to the connection admin team.

Connections are not editable at the moment, so there is no UPDATE_CONNECTIONS permission.
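
As a self-contained toy illustration of the guard pattern (the real data.all decorators live in the permissions services and resolve grants from the database, so the details differ):

```python
# Toy version of the guard pattern: a decorator checks a permission before
# the service method runs.
from functools import wraps

GRANTS = {("team-a", "CREATE_REDSHIFT_CONNECTION")}  # toy permission store

def requires(permission):
    def outer(fn):
        @wraps(fn)
        def inner(team, *args, **kwargs):
            if (team, permission) not in GRANTS:
                raise PermissionError(f"{team} lacks {permission}")
            return fn(team, *args, **kwargs)
        return inner
    return outer

@requires("CREATE_REDSHIFT_CONNECTION")
def create_redshift_connection(team, environment_uri):
    return f"connection created in {environment_uri}"

print(create_redshift_connection("team-a", "env-1"))  # allowed
# create_redshift_connection("team-b", "env-1")       # raises PermissionError
```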

Redshift Dataset permissions

  • Tenant permissions
    • MANAGE_REDSHIFT_DATASETS to limit at the application level which teams can work with Redshift datasets. Applied to all methods of RedshiftDatasetService. If the tenant says no, then it is a no.
  • Environment permissions - granted when inviting a team to an environment
    • IMPORT_REDSHIFT_DATASET to limit which groups in an environment are allowed to import a redshift dataset in the environment. Applied to import_redshift_dataset
  • Group permissions
    • UPDATE_REDSHIFT_DATASET and DELETE_REDSHIFT_DATASET - to prevent unauthorized users from updating/deleting a dataset. Applied to update_redshift_dataset and delete_redshift_dataset and granted ONLY to the dataset admin team.
    • ADD_TABLES_REDSHIFT_DATASET - to limit the users that can add tables to a dataset. ⚠️ It could be considered part of UPDATE_REDSHIFT_DATASET, but it is better to be specific, as each is a different action in nature.
    • GET_REDSHIFT_DATASET - limits getting dataset details. Applied to any method that fetches data for the dataset.
    • GET_REDSHIFT_DATASET_TABLE - limits getting table details. Applied to any method that fetches data for the table. Needed when we share Redshift tables.
    • DELETE_REDSHIFT_DATASET_TABLE - to prevent unauthorized users from deleting a table. Applied to delete_redshift_table and granted ONLY to the dataset admin team (the one that added the table).
    • UPDATE_REDSHIFT_DATASET_TABLE - to prevent unauthorized users from updating a table. Applied to update_redshift_table and granted ONLY to the dataset admin team (the one that added the table).

Sharing with Redshift

We will share Redshift tables. We could have decided to implement full dataset sharing, but sharing with more granularity is more aligned with least-privilege principles.

Datashares only work for encrypted clusters. Therefore we should add guardrails preventing shares for non-encrypted clusters, or directly disable the onboarding of clusters that are not encrypted. In the connection we should store the encryption type of the cluster.

Alternative 1: datashare per share request

When a share request is approved:

  1. Create datashare (in source account)
  2. Add schema to the datashare (in source account)
  3. Add share-requested tables to the datashare (in source account)
  4. Grant the consumer cluster access to the datashare (in source account)
  5. Create local database from datashare (in target account) - WITH PERMISSIONS optional
  6. Create external schema in local database (in target account)
  7. Grant usage access to the Redshift role on the local database and schema (in target account)

When revoking tables:

  1. Remove table from datashare
  2. If no more tables in the share request -> clean-up: delete external schema, local db, revoke access to datashare (if needed) and delete datashare

Alternative 2: datashare per dataset

When a share request is approved:

  1. Create datashare (in source account) if it does not already exist
  2. Add schema to the datashare (in source account) if not done already
  3. Add tables to the datashare (in source account) if not already added
  4. Grant the consumer cluster access to the datashare (in source account) if not done already
  5. Create local database from datashare (in target account) if not done already - WITH PERMISSIONS is needed
  6. Create external schema in local database (in target account) if not done already
  7. Grant granular usage access to the Redshift role on the local database, schema and share-requested tables (in target account) ALWAYS

When revoking tables:

  1. Revoke permissions (revert step 7)
  2. If the table is not shared in any share request -> clean-up table: remove from datashare
  3. If no more tables in the datashare -> clean-up datashare: delete external schema, local db, revoke access to datashare (if needed) and delete datashare

Alternative 3: datashare per dataset-requester namespace

Same steps as alternative 2, but in this case we create a different datashare for each target namespace. A sketch of the underlying SQL follows.
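
A sketch of the SQL sequence behind Alternative 3, issued through the Redshift Data API; cluster identifiers, secret ARNs, datashare/database/role names and namespace GUIDs are all illustrative, and the exact grants should be checked against the Redshift datasharing docs.

```python
# Sketch of the datashare DDL for Alternative 3 (datashare per dataset +
# requester namespace). All identifiers are placeholders.
import boto3

rsd = boto3.client("redshift-data")

def run(sql, **target):
    rsd.execute_statement(Sql=sql, Database="dev", **target)

producer = {"ClusterIdentifier": "producer-cluster", "SecretArn": "arn:aws:secretsmanager:...:producer"}
consumer = {"ClusterIdentifier": "consumer-cluster", "SecretArn": "arn:aws:secretsmanager:...:consumer"}

# Producer side (steps 1-4)
run("CREATE DATASHARE ds_dataset1_ns2;", **producer)
run("ALTER DATASHARE ds_dataset1_ns2 ADD SCHEMA public;", **producer)
run("ALTER DATASHARE ds_dataset1_ns2 ADD TABLE public.customer;", **producer)
run("GRANT USAGE ON DATASHARE ds_dataset1_ns2 TO NAMESPACE '<consumer-namespace-guid>';", **producer)

# Consumer side (steps 5-7); WITH PERMISSIONS enables the granular grants below
run("CREATE DATABASE ds_db FROM DATASHARE ds_dataset1_ns2 OF NAMESPACE '<producer-namespace-guid>' WITH PERMISSIONS;", **consumer)
run("CREATE EXTERNAL SCHEMA ds_schema FROM REDSHIFT DATABASE 'ds_db' SCHEMA 'public';", **consumer)
run("GRANT USAGE ON DATABASE ds_db TO ROLE DAProjectXYZ;", **consumer)
run("GRANT SELECT ON ds_db.public.customer TO ROLE DAProjectXYZ;", **consumer)
```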

Comparison

  • Simplicity of implementation: they are pretty similar. Alternative 1 has more steps but each share is isolated. Alternative 2 is very fast for additional shares but has a more complex revoke. Alternative 3 is slightly more difficult than 2.
  • User experience: this is the main difference ❗ End users will query from the external schema (`SELECT * FROM "dev"."serv_db_public"."customer";`); having many external schemas means complex names with IDs, which might not be straightforward to use. Plus, it can also be confusing for the database admins. So alternatives 2 and 3 are definitely more user friendly.
  • Security/Data Governance: with the WITH PERMISSIONS clause we can restrict access on the consumer side, so there should be no downside to sharing the same datashare across multiple end consumers (Redshift roles) - I verified that we can grant permissions to a single table in the datashare and that the end user does not get permissions to other tables in the datashare; they can list and describe them but cannot select from them. Between alternative 2 and alternative 3, the latter offers more security. In the end, with alternative 2 the db admins of multiple target namespaces would have access to all the items of the datashare, which might include permissions to tables that are not granted explicitly to a particular namespace.

----> decision: Alternative 3 offers the nicest, most secure experience for users

Limitations

All alternatives stay within Redshift service quotas. In principle the max number of dbs in a cluster is 60 (provisioned cluster) or 100 (serverless), but this excludes databases created from datashares, so we are safe. As for datasharing limitations, we should document them: https://docs.aws.amazon.com/redshift/latest/dg/considerations.html.

As for sharing between a Redshift provisioned cluster and a serverless cluster, the documentation states that it is possible: https://docs.aws.amazon.com/redshift/latest/mgmt/serverless-datasharing.html

More constraints: you can create only one consumer database for one datashare on a consumer cluster; you can't create multiple consumer databases referring to the same datashare. --> accounted for in https://github.com/data-dot-all/dataall/pull/1467

dlpzx commented Mar 13 '24

Implementation plan


Pre-requisites

To implement the design I will open multiple pull requests (the list might vary):

  • [X] Done Pre-reqs: Refactor current datasets into S3 Datasets and Base datasets (#1123)
  • [x] Done Pre-reqs: Refactor current dataset sharing into S3 sharing and base sharing (#1283)

Redshift datasets

  • [x] Done New Redshift Dataset module using Base datasets + publish to catalog logic. Introduce Redshift Connections (https://github.com/data-dot-all/dataall/pull/1424)
    • [X] Redshift Connections + checks
    • [x] Redshift Dataset import
    • [X] Redshift Tables view
    • [x] Add and remove tables
    • [X] Delete and edit Dataset
    • [x] Add catalog indexer for Dataset and tables
    • [x] Add logic for glossary, feed
    • [x] Polish IAM permissions
    • [x] Polish data.all permissions
    • [x] Polish frontend views
    • [x] Migrations and backfilling
    • [x] Add unit testing for connections - check comments below
    • [X] Add unit testing for datasets - 94% coverage (leaving out glossary and votes which should be tested in their own modules)

Redshift data sharing

  • [x] In-Progress New Redshift data sharing module using base sharing
    • [X] Edit Connections and introduce mandatory connection as requirement to open share request (https://github.com/data-dot-all/dataall/pull/1451)
    • [X] Add encryption guardrails and store encryption type in connections (https://github.com/data-dot-all/dataall/pull/1447)
    • [X] Adapt FE share request view for Redshift (https://github.com/data-dot-all/dataall/pull/1458)
    • [x] not needed! ~Implement a "selector" in shares_base that distinguishes the type of dataset (only some processors should be used for each dataset type, meaning that the processors should register the dataset type)~ dataall.modules.shares_base.db.share_object_repositories.ShareObjectRepository.list_shareable_items_of_type already filters by datasetUri, meaning that it will return the Redshift tables for a Redshift datasetUri and the tables and folders for an S3 datasetUri.
    • [X] Create shares module baseline - processors, types, client, cdk permissions (https://github.com/data-dot-all/dataall/pull/1461)
    • [x] Adapt shareobject principals and creation to Redshift (https://github.com/data-dot-all/dataall/pull/1462)
    • [X] Sharing processor and manager logic (https://github.com/data-dot-all/dataall/pull/1467)
    • [x] Pivot role IAM permissions in cdk pivot role (same PR as previous) ((https://github.com/data-dot-all/dataall/pull/1467))
    • [X] Polish FE (PR: https://github.com/data-dot-all/dataall/pull/1477): redirects, principal
    • [X] Unit testing for redshift dataset shares (for processor is already part of (https://github.com/data-dot-all/dataall/pull/1467))

Related tasks needed for release

  • [x] Data sharing guardrails: add checks on the creation of the share object (PR https://github.com/data-dot-all/dataall/pull/1484): users can only open one request per namespace role on the same dataset, and the Redshift role must exist in the namespace
  • [x] List Datasets currently lists S3 datasets, Redshift datasets and S3 shared datasets, but it does not list Redshift shared datasets. There are no checks on delete, nor resolution of shared roles. https://github.com/data-dot-all/dataall/pull/1511

Documentation (also needed for release)

  • [ ] Documentation PR for redshift-datasets - include a section on secrets creation (the secret needs to be tagged with the Redshift-specific tag and the data.all tag) https://github.com/data-dot-all/dataall/pull/1512
  • [ ] Documentation PR for redshift-datasets-sharing https://github.com/data-dot-all/dataall/pull/1519

Integration testing -----> tracked in https://github.com/data-dot-all/dataall/issues/1510

Wait for #1409

  • [ ] redshift-datasets
  • [ ] redshift-datasets-shares

Redshift next steps ---> tracked in https://github.com/data-dot-all/dataall/issues/1509

  • [ ] Add Connections of IAM Federation type - next steps!
  • [ ] Use getEnums API call to return clusterTypes with utils implemented in #1435
  • [ ] Extract more common dataset_base code from redshift datasets and s3 datasets
    • [ ] Common FE elements in import/create S3 dataset and import Redshift dataset
    • [ ] Common FE elements in edit datasets
    • [ ] Common resolvers (resolve_dataset_environment, resolve_dataset_owners_group, resolve_dataset_stewards_group)
    • [ ] Common updateDataset API call
    • [ ] Common ModifyRedshiftDatasetInput
  • [ ] Following the pattern set by @SofiaSazonova in #1435, I think we should start thinking about how to detangle the UI from config.json. Here we could have a query that returns all the enabled modules. Originally posted by @petrkalos in https://github.com/data-dot-all/dataall/pull/1424#discussion_r1697871999

NOT Redshift tasks out of scope

  • [ ] Move glossary, feed, indexer targets to enums in their respective modules
  • [ ] Rename S3 permission descriptions in the team invite permission toggle list to clearly specify they are S3/Glue datasets
  • [ ] Create a unit-test directory and migrate the current tests to unit tests - check this commit. I started it but reverted the changes as it was getting too complex to be added in the initial PR
  • [ ] Generic search filter and input in input_types API calls
  • [ ] Common styled DataGrid component with cell borders for dark theme

dlpzx commented Mar 18 '24

@dlpzx I've read through the design and watched your video as well (it was very helpful as it answered some of my questions).

Overall I don't see any big problems but I do have some concerns.

  1. Addition of a new UI "Warehouses" to manage Redshift connections. I find this UI a bit awkward. My first instinct is that this should be a TAB under an environment and not a separate UI outside an environment, especially because you cannot have a connection that is not part of an environment. I think this would also simplify creating connections, because then the environment is already pre-defined and you can also make the connection be owned by the same team that is creating it.

I would also want to make sure that there's a consistent user experience when registering consumer roles or Redshift consumer connections. Even today I find it weird that we register consumer roles in the "Teams" tab under environments; I don't think that's intuitive. Perhaps with the addition of Redshift connections we can instead add a new tab on the environment, "Consumer Connections" or something similar, where you can manage your consumer IAM roles, Redshift consumer connections, etc.

Also I don't really feel that this new type "Warehouses" is actually going to be reusable for anything else other than Redshift so I think it's misleading.

I would like to hear your arguments why you think it would be much better to put this as a new UI on the left main bar vs making it a new tab on the environment.

  2. For sure make Redshift modular so that it can be fully disabled, as for example we don't use Redshift at all and don't want our users to be confused.

  3. We need to check security. Absolutely make sure to scan all infrastructure with checkov and that the permissions are as tight as possible.

  4. I'd really like to see part 2 of your video to understand better how Redshift consumer connections should work.

Thank you!

zsaltys commented Mar 28 '24

I really like how descriptive the design is. Answered most of my questions too! I have a few pending though:

  1. Will a dataset be able to have S3, Glue and Redshift data? Will I be able to create such a dataset?
  2. Will the share UI be the same as the one being used today?
  3. Will all the other modules like QS, Sagemaker, Worksheets be available to use for Redshift too?
  4. Why are we calling it "Warehouses"? How is it any different from a data store like Glue or S3?
  5. Can you provide more information on how data consumers will interact with Redshift data using BI tools and SQL clients? Will consumers have to set up anything extra on their end to be able to use these tools?

anushka-singh commented Mar 28 '24

Thanks @zsaltys and @anushka-singh for the input, you went straight to the tricky points.

  • @zsaltys Regarding point 1, initially I placed it inside environments, but then I questioned whether we even needed to place a warehouse inside an environment - let's say you are using Snowflake and it is not linked to an AWS account. What we can do is place it inside environments, because I agree that the user experience is nicer that way. But then, if we need to link other warehouses with non-AWS links, we can work on creating non-AWS data.all Environments (something that opens the door to multi-cloud...). In short, happy to change it. 2 - absolutely. 3 - let's prioritize for 2.5. 4 - I have not recorded it yet, I have been focusing on #1123 the last week. Please have a look.
  • @anushka-singh thanks for the questions! I think you need to have a look at #1123 for questions 1 and 2. The idea is to have a generic Dataset model and specific Dataset classes that inherit from this model. Instead of adding functionalities to the existing Dataset module, we have opted to make it extensible. For question 2 - yes, very similar, but we need to check the details.
  • For question 3, we would need to check case by case what the integration is: for QuickSight, how the data sharing works; for SageMaker, if there is any library to connect with a Redshift user or with IAM:role federation then they can access the data. Worksheets depend on the Athena connectors; in that case we would need to see whether it is worth it or whether we can open the RS Query Editor instead.
  • I called it Warehouses with the idea of making it abstract enough to cover other warehousing technologies (also outside AWS)
  • For 5, most probably. I will add more details

DESIGN UPDATED WITH THE FEEDBACK!

dlpzx commented Apr 08 '24

Closing this issue, remaining tasks will be tracked in the corresponding documentation pull requests and follow-up github issues

dlpzx commented Sep 05 '24