Limiting Catalog search results based on Organizations

Open sandeephs1 opened this issue 1 year ago • 5 comments

In the catalog, we want to limit search results to the organization the user belongs to.

Here is the scenario. The table below lists each team, user, and environment with its associated organization.

Team        User     Environment    Organization
mgm_usteam  mgm_u1   mgm-music-us   mgm
mgm_euteam  mgm_eu2  mgm-sports-eu  mgm
alexa_team  x_u1     alexa-android  alexa

mgm and alexa are the 2 organizations: mgm has 2 environments, 8 datasets, and 8 tables; alexa has 2 environments, 6 datasets, and 6 tables. The table below lists the datasets and tables.

Organization  Environment    Dataset              Tables
mgm           mgm-music-us   mgm_music_eng        mgm_music_eng_alltimebest
mgm           mgm-music-us   mgm_music_eng        mgm_music_eng_2023
mgm           mgm-music-us   mgm_music_esp        mgm_music_restofesp
mgm           mgm-music-us   mgm_music_esp        mgm_music_esp_best23
mgm           mgm-sports-eu  mgm-sports-cricket   mgm-sports-cric_worldcup
mgm           mgm-sports-eu  mgm-sports-cricket   mgm-sports-cric_ipl
mgm           mgm-sports-eu  mgm-sports-football  mgm-sports-fb_wc
mgm           mgm-sports-eu  mgm-sports-football  mgm-sports-fb_laliga
alexa         alexa-android  alexa-android-jp     alx_anrd_jp_mgm
alexa         alexa-android  alexa-android-jp     alx_anrd_jp_yt
alexa         alexa-android  alexa-android-it     ale_droid_it_mgm
alexa         alexa-android  alexa-android-it     ale_droid_it_yt
alexa         lexa-wear      wear-os-events       events_music
alexa         lexa-wear      wear-os-sensor       sensor_sports

When user 'mgm_u1' searches, the returned results should not exceed (8 datasets + 8 tables), narrowed further by the search condition. This user should not be shown the datasets and tables associated with 'alexa'.

Similarly, alexa user 'x_u1' should not be shown 'mgm' objects.

How can this be achieved?

sandeephs1, Jan 11 '24 14:01

Hi @sandeephs1 this is a cool feature :) Basically you want a filtered version of the catalog based on the user's organizations. What is the motivation behind the feature? Do you need to restrict access to the metadata, or is it more of a usability problem?

dlpzx, Jan 16 '24 15:01

Hi @dlpzx, we have multiple products, and our customers can subscribe to any of them.

customers -> Organization in Data.All
product -> Environment in Data.All

Since multiple customers will be onboarded on "Data.All", we need to be able to meet data governance requirements. Currently, Catalog search returns all matching datasets/tables irrespective of the Organization the user belongs to, which would be a data breach.

We want to avoid this situation by restricting catalog search results based on the "user-Organization" relation. In a way, this is a multi-tenancy feature.

sandeephs1, Jan 17 '24 10:01

Hi @sandeephs1 thanks for the quick response! Based on your requirements, these are the high-level changes that we would need to implement:

  1. Add another field "organizationUri" in the data catalog; for that we need to modify the mappings of the OpenSearch index
  2. Add the information of the "organizationUri" to each item that is added to the catalog
  3. Backfill existing items (we need to investigate the best approach)
  4. Modify the search API calls to filter by the user's organizations (list of "organizationUri")
  5. Introduce a configuration parameter to enable or disable "organization_data_catalog_isolation" and make this feature configurable in the code, probably only the part that limits the search API calls.
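
Steps 1, 4 and 5 could look roughly like the sketch below. It assumes the index stores "organizationUri" as a keyword field and that the search handler works with an OpenSearch query body as a plain dict. The names ORG_ISOLATION_ENABLED, ORGANIZATION_MAPPING and restrict_to_organizations are hypothetical and are not part of data.all.

```python
import copy

# Hypothetical flag standing in for the proposed
# "organization_data_catalog_isolation" configuration parameter.
ORG_ISOLATION_ENABLED = True

# Assumed mapping fragment for the new field; "keyword" allows exact-match filtering.
ORGANIZATION_MAPPING = {"properties": {"organizationUri": {"type": "keyword"}}}


def restrict_to_organizations(query_body, organization_uris, enabled=ORG_ISOLATION_ENABLED):
    """Return a copy of an OpenSearch query body limited to the given organizations."""
    restricted = copy.deepcopy(query_body)
    if not enabled:
        return restricted

    original_query = restricted.get("query", {"match_all": {}})
    if organization_uris:
        org_filter = {"terms": {"organizationUri": sorted(organization_uris)}}
    else:
        # Fail closed: a user without organizations gets no results.
        org_filter = {"match_none": {}}

    # Keep the user's query as-is and add the organization restriction as a filter.
    restricted["query"] = {"bool": {"must": [original_query], "filter": [org_filter]}}
    return restricted
```

Wrapping the original query in a bool query keeps scoring untouched, since the filter clause only narrows the result set. With the flag disabled the query body is returned unchanged, which matches the idea of making only the search restriction configurable.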

Do you have the bandwidth to implement this feature? We are happy to provide guidance and coaching throughout the process. We will consider it in our roadmap, but other features might be prioritized.

dlpzx, Jan 17 '24 14:01

Thanks @dlpzx, we were thinking along similar lines, and your input is definitely a help. We will implement this feature and update you.

sandeephs1, Jan 17 '24 14:01

Updates [offline discussion]

@sandeephs1 and his team have a proposed implementation in which:

  • data.all admins will manually create Cognito groups for each of the Organizations. These groups need to follow the naming convention: <SOME-NAME>-accesscontrol-<organizationUri>
  • search_handler Lambda fetches the user's groups
  • search_handler Lambda looks for a particular Cognito group based on a naming convention (contains -accesscontrol-<organizationUri>) and extracts the organizationUri from it
  • catalog search API query run_query filters the results based on the organizationUri field
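
As an illustration of the group-name parsing in this proposal, here is a minimal sketch. It assumes the user's group names are already available as a list of strings; the function name organization_uris_from_groups and the sample values are hypothetical.

```python
MARKER = "-accesscontrol-"


def organization_uris_from_groups(groups):
    """Collect organizationUri values from groups named <SOME-NAME>-accesscontrol-<organizationUri>."""
    uris = []
    for group in groups:
        # rpartition returns ("", "", group) when the marker is absent.
        prefix, marker, uri = group.rpartition(MARKER)
        # Require a non-empty name before the marker and a non-empty URI after it.
        if marker and prefix and uri and uri not in uris:
            uris.append(uri)
    return uris


# Made-up group names: only the second one follows the convention.
print(organization_uris_from_groups(["mgm_usteam", "mgm-accesscontrol-org123"]))
```

Groups that do not match the convention are ignored, so a user who belongs to no access-control group ends up with an empty list, and the search filter should then return nothing rather than everything.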

Remarks from data.all team

First of all, we want to highlight that data.all was designed to work in a single-tenancy scenario. The most secure way to achieve multi-tenancy would be to deploy data.all multiple times, once for each tenant.

But we understand and are interested in the multi-tenancy scenario that you present. We cannot implement this feature in the next couple of weeks, but we can work together on designs and are happy to guide you to contribute back. If the feature is part of the open-source repository we will manage and take ownership of bugs, issues and enhancements.

Here are some remarks that we have pointed out during our internal discussions:

  • Since there will be multiple tenants sharing the same central infrastructure (data.all backend and frontend), we need to make sure that, security-wise, tenants cannot damage other tenants. For example, could a tenant team throttle the API Gateway for all other tenants? The answer is no because of the WAF rules implemented, but I wanted to give an example of the type of infrastructure security checks that we need to pass.
  • We need to verify all data.all API calls to ensure that all are decorated and are checking that users cannot access resources that do not belong to them. [We can help you with this task :) Findings will be added to this comment.]
  • We also need to identify if there are any API calls that need ADDITIONAL checks. You already implemented the changes for the searchCatalog API call to filter results based on your organization. We also noticed that the createShareObject API call needs to be restricted to check that the requested target Dataset is part of your organization. Also, the inviteTeam APIs for Organization and Environment will need to restrict which groups can be added to each of them.
  • In your design we see potential improvements. We would like to avoid the string manipulation involved in obtaining the user's organizationUris. For that, we can use, or at least take inspiration from, the helper methods that are defined for the generic GraphQL APIs.
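
To make the additional check for createShareObject more concrete, here is a minimal sketch. It assumes the dataset's organizationUri and the requester's organizationUris have already been resolved. The exception class and function name are hypothetical and do not reflect how data.all implements its permission checks.

```python
class OrganizationAccessError(Exception):
    """Raised when a request targets a resource outside the user's organizations."""


def assert_dataset_in_user_organizations(dataset_organization_uri, user_organization_uris):
    """Reject a share request whose target dataset belongs to another organization."""
    if dataset_organization_uri not in set(user_organization_uris):
        raise OrganizationAccessError(
            f"Dataset organization {dataset_organization_uri!r} "
            "is not one of the requester's organizations"
        )
```

A check like this would run before the share object is created, so a request for a dataset in another organization fails early instead of relying on the catalog filter alone.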

Let's keep on working on this!

dlpzx, Jan 29 '24 16:01