dataall
dataall copied to clipboard
Adding AWS HealthOmics as a Module in "Play" tools
Feature or Bugfix
- Feature
Background
Currently, data.all has integrations to AWS services such as SageMaker, Athena, and QuickSight through the “Play” modules/tools Notebooks, Worksheets, and Dashboards, respectively. This is valuable and convenient for end users who want to process and share their data but may not want to or know how to use these services in the AWS Console.
Researchers, scientists, and bioinformaticians often fall into this end user category. One such AWS service that is popular amongst the research community is AWS HealthOmics. HealthOmics helps users process and generate insights from genomics and other biological data stored in the cloud. Raw genomic data can be sent to a HealthOmics managed workflow (aka Ready2Run workflow) that can perform various tasks such as quality control, read alignment, transcript assembly, and gene expression quantification. The output can then be stored in a manageable format for querying/visualizing/sharing.
AWS HealthOmics Integration
This feature contains both modularized backend and frontend changes for adding HealthOmics as a “Play” module/tool to data.all. It specifically adds the capability to view and instantiate HealthOmics Ready2Run workflows as runs that can output and save omic data as data.all Datasets.
Consumption Patterns
- data.all Worksheets: Users can use Worksheets to make data easier to query and combine with other forms of health data.
- data.all Notebooks/Studio: Users can Notebooks and Studio to build, train, and deploy novel machine learning algorithms on the multiomic and multimodal data.
- data.all Dashboards: Users can use the transformed data in Dashboards for advanced analytics and visualizations.
Considerations
- Linked Environment and Dataset Region: The HealthOmics run must be performed in a data.all Linked Environment that is located in an AWS Region that supports AWS HealthOmics. Similarly, the data.all source and destination Dataset must live in same AWS Region as where the user will perform the HealthOmics run.
- Ready2Run Workflow Support: Currently, only Ready2Run workflows are supported. Ready2Run are pre-built workflows designed by industry leading third-party software companies like Sentieon, Inc. and NVIDIA, as well as common open-source pipelines such as AlphaFold for protein structure prediction. Ready2Run workflows do not require the management of software tools or workflow scripts. Bring your own, also known as Private, workflows where you bring a custom workflow script, are not yet supported. Please note that some Ready2Run workflows require a subscription/license from the software provider to run.
User Journey
This example user journey depicts an end-to-end process from viewing available HealthOmics Ready2Run workflows to instantiating a run and viewing its output in a data.all Worksheet.
-
Initiation:
- User navigates to the "Omics" section within data.all and browses Ready2Run workflows
- User can also search for a specific workflow directly
- After clicking on a workflow, users see a detailed view of it with a full description of what it does
- Creation: After clicking on a workflow, users see a detailed view and hit “Create Run”. Users fill in the run creation form with the following parameters:
-
Workflow ID: Immutable ID of the Ready2Run workflow * Run Name: Customizable name of the run user will submit * Environment: data.all Environment AWS Account where the HealthOmics run will be (NOTE: the Environment must be in an AWS Region supported by HealthOmics, ex: N. Virginia or London) * Region: Pre-populated from the Environment and immutable Region where the run will be * Owners: data.all group who owns the run * Select S3 Output Destination: data.all Dataset where the output omics data will reside (NOTE: please create this prior to kicking off a run) * Run Parameters: JSON parameter input in the format expected by the Ready2Run workflow. It will be pre-populated with the correct fields, and users will paste in their data in the appropriate fields. For example, the raw input data in S3 that will be processed in the run. (NOTE: the input data does not have to be in a data.all Dataset, as long as it is accessible. For example, raw genomic data may be hosted publicly on the AWS Registry of Open Data, and the S3 URI can be provided in a field here)
-
History:
- Users navigate to Run tab at the top to view a history of the data.all-initiated Ready2Run workflows they’ve kicked off. (NOTE: run history deletion is still in progress)
-
Data Consumption:
-
In Worksheets:
- Users can select (or create) a new Worksheet.
-
In Worksheets:
- Users can then query the data using SQL
Relates
- Github Issue - #563
Security
Please answer the questions below briefly where applicable, or write N/A
. Based on
OWASP 10.
- Does this PR introduce or modify any input fields or queries - this includes
fetching data from storage outside the application (e.g. a database, an S3 bucket)? Yes
- Is the input sanitized? Yes
- What precautions are you taking before deserializing the data you consume? N/A
- Is injection prevented by parametrizing queries? N/A
- Have you ensured no
eval
or similar functions are used? N/A
- Does this PR introduce any functionality or component that requires authorization? Yes
- How have you ensured it respects the existing AuthN/AuthZ mechanisms? Yes
- Are you logging failed auth attempts? N/A
- Are you using or adding any cryptographic features? No
- Do you use a standard proven implementations?
- Are the used keys controlled by the customer? Where are they stored?
- Are you introducing any new policies/roles/users? Yes
- Have you used the least-privilege principle? How? Yes, through scoped policies added to the role
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
Looking at the PR some of the changes are unclear why they are happening in this PR would be good if the author commented on any file change that is strictly not HealthOmics related why it is changing and what is being done.
I also see some things that could be better modularized like frontend API files..
Im also concerned we're creating a new VPC even for those who don't plan to use this module.. Im not sure if Im wrong or not. But if that's the case then we should make sure this is controlled via settings if the module is enabled or not as VPCs cost money.
@ironspur5 can you also list and describe the functionalities that this feature allows from a user perspective? Could you include some screenshots? That way we can identify missing API calls, or potential bugs in the design
@ironspur5 can you also list and describe the functionalities that this feature allows from a user perspective? Could you include some screenshots? That way we can identify missing API calls, or potential bugs in the design
Updated PR with details and screenshots!
Hi @ironspur5, take a look at my latest commits. I have re-created the frontend views as we discussed. I also started the code clean-up. There are still a couple of things to do before the PR is ready. From a user experience/features point of view:
- the listRuns frontend view is not very user friendly, we need to re-design it a bit. Add filters ...
- In that view we need to think about what happens after a run: what happens if it fails, how do we delete them...
- in the workflow detail we should should show the user runs for that workflow
From a code perspective:
- modify frontend views accordingly
- create deleteRun methods @ironspur5 investigate and maybe some design
- limit IAM permissions to omics (currently omics:* in *) @ironspur5
- add count resources functions @dlpzx
- add integration testing @dlpzx implement framework
- fix migration scripts
- REVIEW ALL
In addition we need to perform:
- end-to-end testing under different scenarios
One step closer 🚀 Here is an screenshot of the Omics view after changes:
Next steps 20 Feb:
- Scope down permissions based on resource @ironspur5 @dlpzx
- fix migration script @dlpzx
- add designs for omics runs to PR @ironspur5
- implementation of omics runs view @ironspur5
- deployed version @dlpzx
- implement more integration tests
- end-ro-end validation and REVIEW ALL
Hi all! I have committed a few changes to address scoped down permissions. I am also sharing my design UI ideas for the view Run page and updated view Runs page. This also illustrates the proposed idea to allow a re-run from an old run.
Next steps 5 Mar:
- Scope down permissions based on resource @ironspur5 @dlpzx - needs testing
- fix migration script @dlpzx - done
- add designs for omics runs to PR @ironspur5 - completed
- implementation of omics runs view @ironspur5 - needs testing
- deployed version @dlpzx - in-progress
- implement more integration tests @ironspur5 and @dlpzx half and half - at least 75%
- end-to-end validation and REVIEW ALL - @dlpzx @ironspur5 comment on the issue with acceptance criteria
- end-user feedback collection
- Defining the tests
- Doing the test
Testing
General testing
- [X] CICD pipeline runs successfully
- [X] Migration scripts are executed -> new tables for omics
- [ ] Run ECS task omics workflow fetcher -> it returns all workflows and registers them in data.all RDS (without duplicates for READY-TO-RUN workflows) --> error
dataallPivotRole-cdk is not authorized to perform: omics:ListWorkflows on resource: arn:aws:omics:us-east-1::workflow/*
- [X] Enable
omics
module and deploy -> Omics appears in left side pane and Omics workflows and Runs are visible by user - [ ] Disable
omics
module -> Omics APIs are not callable, Omics FE views are not rendered
Environments
- [X] In Environments creation tab we can enable/disable
omics
from the environment - [X] Environment creation ---> issue in permissions of omics:
File "/dataall/modules/omics/cdkpivot_role_omics_policy.py", line 36, in get_statements 2024-03-11T14:22:17.201Z f'arn:aws:omics:{self.region}:{self.account}:run/{self.resource_prefix}*', 2024-03-11T14:22:17.201Z AttributeError: 'PivotRoleStatementSet' object has no attribute 'resource_prefix'
---> fixed!
Workflows
Runs
Next steps 26 Mar:
- implementation of omics runs view @ironspur5 - needs testing
- add Omics as an environment feature (toggle in env view) @dlpzx
- search and filters in workflows page @dlpzx
- fix account Id graphql OmicsRun model @dlpzx
- clean-up postgres duplicates in RDS in real deployment @dlpzx
- remove private workflows from this first iteration (leave the infra, just disable it at the moment) @ironspur5
- end-to-end validation and REVIEW ALL - @ironspur5 comment on the issue with acceptance criteria
- end-user feedback collection
- Defining the tests ******
- implement more integration tests @ironspur5 and @dlpzx half and half - at least 75%
I also noticed that we will need to add some migration scripts for the creation of the new permissions in the database! @ironspur5
So here is the updated list of todos:
- implementation of omics runs view @ironspur5 - needs testing
- add Omics as an environment feature (toggle in env view) @dlpzx - DONE! ✅
- search and filters in workflows page @dlpzx - DONE! ✅
- fix account Id graphql OmicsRun model @dlpzx - DONE! ✅
- clean-up postgres duplicates in RDS in real deployment @dlpzx - DONE! ✅ (I need to review it a bit, thanks @ironspur5 for picking it up)
- remove private workflows from this first iteration (leave the infra, just disable it at the moment) @ironspur5 - DONE! ✅
- end-to-end validation and REVIEW ALL - @ironspur5 comment on the issue with acceptance criteria
- end-user feedback collection
- Defining the tests ******
- implement more integration tests @ironspur5 and @dlpzx half and half - at least 75%
- Ensure permissions get created when the module is enabled. This issue is also present for other modules, @petrkalos is looking into it in #944
Before publishing this feature we need to (written in no particular order)
- Review the code end-to-end and PR review (I will add some TODOs)
- Finish writing integration tests that ensure the correct functioning of the newly added code
- Technical validation - we go trough the common and edge cases of using Omics and ensure that all functionalities work
- User validation - Omics users try out the module and provide feedback on enhancements
1. Code review findings
Added as TODOs in the Omics module code
2. Integration Tests
We need to cover new API calls as ECS tasks. We should test all scenarios and not only "happy scenarios". We do not need to test AWS calls.
- [X] ECS Omics Workflows fetcher task: test for new workflows, test with existing workflows, test with multiple environments
- [ ] API calls for Workflows
- [ ] API calls for Runs
More TBD
3. Technical Validation
Generic tests
- [ ] Deploy fresh deployment of data.all with Omics disabled- check that Omics functionalities are disabled: API calls are not reachable, ECS task not deployed and frontend views do not show any Omics reference in side pane, in environments edit/create/detail views.
- [ ] On the same fresh deployment of data.all enable Omics - the reverse: API calls are reachable, ECS task deployed, Omics visible in frontend: in side pane, in Omics views, and in environment create/edit/details view.
ECS Omics Workflow fetcher task
- [ ] Verify that it is triggered daily. On a deployment with 2 data.all Environments: Check ECS logs and verify that existing workflows get updated and that there are no duplicates after some days.
API calls for Environments
- [ ] Create new environment with Omics enabled
- [ ] .....
API calls for Workflows
- [ ] TBD
API calls for Runs
- [ ] TBD
Backwards compatibility
Pre-reqs: existing deployment with main
code and with at least 1 data.all environment
- [ ] Update an existing deployment (from
main
) with the new Omics code but WITH OMICS DISABLED: CICD pipeline updates successfully, migration scripts create the necessary RDS tables and backfill permissions. - [ ] Update the data.all environment - check that the stack is not changed and succeeds
- [ ] Enable Omics and check API calls are reachable, ECS task deployed, Omics visible in frontend: in side pane, in Omics views, and in environment create/edit/details view.
- [ ] Update the environment, check that pivot role and team IAM roles contain Omics IAM policy statements. Check that stack updates successfully.
4. User Validation
TBD with users
April 9th
- update with changes from
main
✅ - issue with search filter refreshing with every type - @dlpzx ✅
- issue with creation of OmicsRun - @ironspur5
- Add omics run to the resources in startRun permission to pivot role:
"arn:aws:omics:us-east-1:11111111:run/*"
in pivot role - Add passRole to Omics to pivot role: - Add tag omics permissions to pivot role - Add trust to omics to environment team role - Add CloudWatchLogs permissions to environment team role - REVIEW deleteRuns and other run methods @ironspur5
- implementation of omics runs view @ironspur5 - needs testing
- implement more integration tests @ironspur5 and @dlpzx half and half - at least 75%
- Workflows APIs - @dlpzx stretch goal ✅
- Runs APIs✅
- end-to-end validation and REVIEW ALL - complete comment above!
- end-user feedback collection
- Defining the tests ******
PassRole permissions:
{
"Condition": {
"StringEquals": {
"iam:PassedToService": [
"omics.amazonaws.com"
]
}
},
"Action": "iam:PassRole",
"Resource": "arn:aws:iam::111111111111:role/dataall*",
"Effect": "Allow",
"Sid": "PassRoleOmics"
},
Tag resources and Run in StartRun:
{
"Action": [
"omics:GetWorkflow",
"omics:StartRun",
"omics:TagResource"
],
"Resource": [
"arn:aws:omics:us-east-1:111111111111:workflow/*",
"arn:aws:omics:us-east-1::workflow/*",
"arn:aws:omics:us-east-1:111111111111:run/*"
],
"Effect": "Allow",
"Sid": "OmicsWorkflowActions"
},
Trust policy:
{
"Effect": "Allow",
"Principal": {
"Service": "omics.amazonaws.com"
},
"Action": "sts:AssumeRole"
},
CloudWatch logs:
{
"Version": "2012-10-17",
"Statement": [
{
"Action": [
"logs:CreateLogGroup",
"logs:CreateLogStream",
"logs:PutLogEvents"
],
"Resource": [
"arn:aws:logs:us-east-1:1111111111111:log-group:/aws/omics/*",
"arn:aws:logs:us-east-1:1111111111111:log-group:/aws/omics/*:log-stream:*"
],
"Effect": "Allow"
}
]
}
Hi all, I have recorded a video demo of this feature:
https://github.com/data-dot-all/dataall/assets/28816838/b36777be-b4bd-48d2-b9cc-09a4f0d2bf0a
First a foremost - this is a great contribution from @ironspur5 and team! All of your work and collaboration is super appreciated and I think this is an amazing starting point for health science applications and data consumption patterns in data.all :)
I left some comments throughout the PR - most are small changes or touch ups but let me know if any questions or disagreements. Additionally, some high-level notes I too:
Notes:
- For existing DA Deployment (a heads up on what we need to call out during release @dlpzx)
- Existing Groups need to be given MANAGE_OMICS permission from tenant
- Need to update envs after enabling omics module
- Need to run omics workflow fetcher at least 1 time before can start runs
- Any need for an Omits VPC Endpoint? Similar to how we have VPC interface endpoints for other supported services
Enhancement Ideas (can be picked up after this PR)
- Permissions
- Is there any need for more in depth permission model? Currently we have:
- MANAGE_OMICS_RUNS (Tenant Level)
- CREATE_OMICS_RUNS (on Env Record)
- DELETE_OMICS_RUN (on OmicsRun Record)
- Especially if we move to supporting Private Workflows May Need more verbose permissions for update, get, list, for one if not both of omics workflow and omics runs
- Is there any need for more in depth permission model? Currently we have:
- Private Workflow Types, RunGroups, etc.
- Search + Filters on Omics Run History
- Better UX JSON editing for parameter template
- Provide Output Locations specific to Table Location or Folder Location rather than entire Dataset
Tests I performed:
- [x] CICD Pipeline Completes
- [x] Env Stack Update - adds Omics permissions to pivot role and env group roles (if they have omics env group permission)
- [x] Omics Permissions Show (both in Admin Tenant Permissions and Env Group Permissions + can edit appropriately for each)
- [x] ECS Workflow Fetch Task Successful (IMPORTANT: Omics Module needs to be enabled - left comment on file)
- [x] Workflows Paginated and showing on Omics Workflow Tab
- [x] Workflow Card with Additional Description and Parameter Template Shows Properly
- [x] Create Run from Workflow (Tested ESM Fold Workflow - mimiced Video)
- [x] Run History Shows OmicsRuns for the Users Teams (other users in separate team cannot view Omics Runs)
- [x] Run Status Updates when Job Completes
- [x] Output Data to correct S3 Output (tested both same dataset and a different owned dataset)
- [x] Omics Run Creator Team can Delete Job Run as well
- [x] Test with Shared Dataset as Input
- [x] Crawl + Query Output metrics in Worksheets
A final suggestion left on omics_workflow_fetcher task
And 1 last thought - should we restrict omics runs to follow a pattern using NamingConventionService
similar to how we do with other resources?
- If so think we can also then restrict the pivot role permissions to
arn:aws:omics:{self.region}:{self.account}:run/{self.env_resource_prefix}*
(right now it is wildcard on name)
If we do not implement naming convention like above - we may want to remove resource prefix restriction on env group role
arn:aws:omics:{self.region}:{self.account}:run/{self.resource_prefix}*
(currently implemented as such in backend/dataall/modules/omics/cdk/env_role_omics_policy.py
A final suggestion left on omics_workflow_fetcher task
And 1 last thought - should we restrict omics runs to follow a pattern using
NamingConventionService
similar to how we do with other resources?* If so think we can also then restrict the pivot role permissions to `arn:aws:omics:{self.region}:{self.account}:run/{self.env_resource_prefix}*` (right now it is wildcard on name)
If we do not implement naming convention like above - we may want to remove resource prefix restriction on env group role
arn:aws:omics:{self.region}:{self.account}:run/{self.resource_prefix}*
(currently implemented as such inbackend/dataall/modules/omics/cdk/env_role_omics_policy.py
I am not 100% certain, but I also don't think the self.env_resource_prefix is actually doing anything. Run ARNs don't include any naming/prefix in their format, just the Run ID. I think I put this in there to match what other modules looked like. But the tag key condition to prevent unauthorized run access is really what I was intending for.
Great work on this initiative team - the collaboration and coordination across the teams, all focused on delivering the best possible experience for our customers, has been awesome. Let's continue this collaboration for other initiatives as well.
Thank you for all your support in getting this added, team!!