dataall
Limiting Catalog search results based on Organizations
In the catalog, we want to limit search results to the organization the user belongs to.
Here is the scenario. The table below lists each team, user, environment, and associated organization:
Team | User | Environment | organization |
---|---|---|---|
mgm_usteam | mgm_u1 | mgm-music-us | mgm |
mgm_euteam | mgm_eu2 | mgm-sports-eu | mgm |
alexa_team | x_u1 | alexa-android | alexa |
mgm and alexa are the 2 organizations:
- mgm has 2 environments, 8 datasets, 8 tables
- alexa has 2 environments, 6 datasets, 6 tables

The table below lists the datasets and tables:
Organization | Environment | Dataset | Tables |
---|---|---|---|
mgm | mgm-music-us | mgm_music_eng | mgm_music_eng_alltimebest |
mgm | mgm-music-us | mgm_music_eng | mgm_music_eng_2023 |
mgm | mgm-music-us | mgm_music_esp | mgm_music_restofesp |
mgm | mgm-music-us | mgm_music_esp | mgm_music_esp_best23 |
mgm | mgm-sports-eu | mgm-sports-cricket | mgm-sports-cric_worldcup |
mgm | mgm-sports-eu | mgm-sports-cricket | mgm-sports-cric_ipl |
mgm | mgm-sports-eu | mgm-sports-football | mgm-sports-fb_wc |
mgm | mgm-sports-eu | mgm-sports-football | mgm-sports-fb_laliga |
alexa | alexa-android | alexa-android-jp | alx_anrd_jp_mgm |
alexa | alexa-android | alexa-android-jp | alx_anrd_jp_yt |
alexa | alexa-android | alexa-android-it | ale_droid_it_mgm |
alexa | alexa-android | alexa-android-it | ale_droid_it_yt |
alexa | lexa-wear | wear-os-events | events_music |
alexa | lexa-wear | wear-os-sensor | sensor_sports |
When user 'mgm_u1' searches, the returned results should not exceed the mgm objects (8 datasets + 8 tables) that match the search condition. This user should not be shown the datasets and tables associated with 'alexa'.
Similarly, alexa user 'x_u1' should not be shown 'mgm' objects.
How can this be achieved?
Hi @sandeephs1 this is a cool feature :) Basically you want a filtered version of the catalog based on the user's organizations. What is the motivation behind the feature? Do you need to restrict access to the metadata, or is it more of a usability problem?
Hi @dlpzx We have multiple products, and our customers can subscribe to any of them.
- customers -> Organization in Data.All
- product -> Environment in Data.All

Since multiple customers will be onboarded on "Data.All", it should be able to meet data governance requirements. Currently, Catalog search presents all matching datasets/tables irrespective of the Organization the user belongs to, so it would be a data breach.
We want to avoid this situation by restricting catalog search results based on the "user-Organization" relation. In a way, this is a multi-tenancy feature.
Hi @sandeephs1 thanks for the quick response! Based on your requirements, these are the high-level changes that we would need to implement:
- Add another field "organizationUri" to the data catalog. For that we need to modify the mappings of the OpenSearch index
- Add the "organizationUri" information to each item that is added to the catalog
- Backfill existing items (we need to investigate the best approach)
- Modify the search API calls to filter by the user's organizations (list of "organizationUri")
- Introduce a configuration parameter to enable or disable "organization_data_catalog_isolation" and make this feature configurable in the code (probably only the part that limits the search API calls)
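As a rough illustration of the search-filtering and configuration steps above, here is a minimal sketch of how a catalog query could be wrapped in an organization filter. It assumes catalog documents carry an "organizationUri" field mapped as a keyword in the OpenSearch index. The function name `restrict_to_organizations`, the `isolation_enabled` flag, and the sample `label`/`description` fields are hypothetical and not taken from the data.all codebase.

```python
from typing import Any, Dict, List


def restrict_to_organizations(
    query: Dict[str, Any],
    organization_uris: List[str],
    isolation_enabled: bool = True,
) -> Dict[str, Any]:
    """Wrap a search query so that only documents from the given organizations match.

    `isolation_enabled` stands in for the proposed
    "organization_data_catalog_isolation" configuration parameter (assumption).
    """
    if not isolation_enabled:
        # Feature switched off: return the query unchanged.
        return query
    return {
        "bool": {
            # The user's original search condition still has to match.
            "must": [query],
            # A `terms` filter matches documents whose field equals any listed value.
            # An empty list matches nothing, so a user without organizations sees no results.
            "filter": [{"terms": {"organizationUri": organization_uris}}],
        }
    }


# Hypothetical usage: "mgm" stands in for a real organizationUri value.
user_query = {"multi_match": {"query": "music", "fields": ["label", "description"]}}
search_body = {"query": restrict_to_organizations(user_query, ["mgm"])}
```

Putting the organization condition in the `filter` clause rather than `must` keeps it out of relevance scoring, so result ranking for the user's own search terms is unaffected.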
Do you have the bandwidth to implement this feature? We are happy to provide guidance and coaching throughout the process. We will consider it in our roadmap, but other features might be prioritized.
Thanks @dlpzx we were also thinking along similar lines, and your input definitely helps. We will implement this feature and update you.
Updates [offline discussion]
@sandeephs1 and his team have a proposed implementation in which:
- Manually, data.all admins will create Cognito groups for each of the Organizations. These groups need to follow the naming convention `<SOME-NAME>-accesscontrol-<organizationUri>`
- The `search_handler` Lambda fetches the user's groups
- The `search_handler` Lambda looks for a particular Cognito group based on the naming convention (contains `-accesscontrol-<organizationUri>`) and extracts the `organizationUri` from it
- The catalog search API query `run_query` filters the results based on the `organizationUri` field
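To make the group-name convention concrete, here is a minimal sketch of the extraction step described above. It is not the team's actual implementation: the helper name `extract_organization_uris` and the sample group names are hypothetical, and it assumes the user's Cognito groups are already available as a list of strings.

```python
from typing import Iterable, List

# Marker taken from the naming convention <SOME-NAME>-accesscontrol-<organizationUri>.
MARKER = "-accesscontrol-"


def extract_organization_uris(groups: Iterable[str]) -> List[str]:
    """Return the organizationUri of every group that follows the naming convention."""
    uris: List[str] = []
    for group in groups:
        # partition splits on the first occurrence of the marker;
        # `marker` is empty when the group name does not contain it.
        _prefix, marker, uri = group.partition(MARKER)
        if marker and uri and uri not in uris:
            uris.append(uri)
    return uris


# Hypothetical group names; only the first follows the convention.
user_groups = ["mgm-accesscontrol-org123", "mgm_usteam"]
organization_uris = extract_organization_uris(user_groups)
```

Groups that do not contain the marker, such as regular team groups, are ignored, and duplicate organizationUris are collapsed while keeping their first-seen order.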
Remarks from data.all team
First of all, we want to highlight that data.all was designed to work in a single-tenancy scenario. The most secure way to achieve multi-tenancy would be to deploy data.all multiple times, once for each tenant.
But we understand and are interested in the multi-tenancy scenario that you present. We cannot implement this feature in the next couple of weeks, but we can work together on designs and are happy to guide you to contribute back. If the feature is part of the open-source repository we will manage and take ownership of bugs, issues and enhancements.
Here are some remarks that we have pointed out during our internal discussions:
- Since there will be multiple tenants sharing the same central infrastructure (data.all backend and frontend), we need to make sure that, security-wise, tenants cannot damage other tenants. For example, could some tenant teams throttle the API Gateway for all other tenants? The answer is no because of the WAF rules implemented, but I wanted to give an example of the types of infrastructure security checks that we need to fulfill.
- We need to verify all data.all API calls to ensure that all are decorated and are checking that users cannot access resources that do not belong to them. [We can help you with this task :) Findings will be added to this comment.]
- We also need to identify if there are any API calls that need ADDITIONAL checks. You already implemented the changes for the `searchCatalog` API call to filter results based on your organization. We also noticed that the `createShareObject` API call needs to be restricted to check that the target requested Dataset is part of your organization. Also, the `inviteTeam` APIs for Organization and Environment will need to restrict which groups can be added to each of them.
- In your design we see potential improvements. We would like to avoid the string manipulation involved in obtaining the user's organizationUris. For that, we can use or at least take inspiration from the helper methods that are defined for the generic graphql APIs.
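As an illustration of the kind of additional check mentioned above for `createShareObject`, here is a minimal sketch of an organization guard. It is not data.all code: the exception class, the function name `ensure_same_organization`, and the sample URIs are hypothetical, and it assumes the dataset's organizationUri and the user's organizationUris have already been resolved by the caller.

```python
from typing import Iterable


class OrganizationAccessDenied(Exception):
    """Raised when a request targets a resource outside the caller's organizations."""


def ensure_same_organization(
    resource_organization_uri: str,
    user_organization_uris: Iterable[str],
) -> None:
    """Raise unless the resource belongs to one of the user's organizations."""
    if resource_organization_uri not in set(user_organization_uris):
        raise OrganizationAccessDenied(
            f"Resource in organization {resource_organization_uri!r} "
            "is not part of the user's organizations"
        )


# Hypothetical usage before creating a share request: passes silently because
# the dataset's organization ("mgm") is among the user's organizations.
ensure_same_organization("mgm", ["mgm"])
```

A check like this would run before the share object is created, so a request from an 'mgm' user for an 'alexa' dataset is rejected rather than filtered after the fact.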
Let's keep on working on this!