dataall Adding AWS HealthOmics as a Module in "Play" tools

Feature or Bugfix

Feature

Background

Currently, data.all has integrations to AWS services such as SageMaker, Athena, and QuickSight through the “Play” modules/tools Notebooks, Worksheets, and Dashboards, respectively. This is valuable and convenient for end users who want to process and share their data but may not want to or know how to use these services in the AWS Console.

Researchers, scientists, and bioinformaticians often fall into this end user category. One such AWS service that is popular amongst the research community is AWS HealthOmics. HealthOmics helps users process and generate insights from genomics and other biological data stored in the cloud. Raw genomic data can be sent to a HealthOmics managed workflow (aka Ready2Run workflow) that can perform various tasks such as quality control, read alignment, transcript assembly, and gene expression quantification. The output can then be stored in a manageable format for querying/visualizing/sharing.

AWS HealthOmics Integration

This feature contains both modularized backend and frontend changes for adding HealthOmics as a “Play” module/tool to data.all. It specifically adds the capability to view and instantiate HealthOmics Ready2Run workflows as runs that can output and save omic data as data.all Datasets.

Consumption Patterns

data.all Worksheets: Users can use Worksheets to make data easier to query and combine with other forms of health data.
data.all Notebooks/Studio: Users can Notebooks and Studio to build, train, and deploy novel machine learning algorithms on the multiomic and multimodal data.
data.all Dashboards: Users can use the transformed data in Dashboards for advanced analytics and visualizations.

Considerations

Linked Environment and Dataset Region: The HealthOmics run must be performed in a data.all Linked Environment that is located in an AWS Region that supports AWS HealthOmics. Similarly, the data.all source and destination Dataset must live in same AWS Region as where the user will perform the HealthOmics run.
Ready2Run Workflow Support: Currently, only Ready2Run workflows are supported. Ready2Run are pre-built workflows designed by industry leading third-party software companies like Sentieon, Inc. and NVIDIA, as well as common open-source pipelines such as AlphaFold for protein structure prediction. Ready2Run workflows do not require the management of software tools or workflow scripts. Bring your own, also known as Private, workflows where you bring a custom workflow script, are not yet supported. Please note that some Ready2Run workflows require a subscription/license from the software provider to run.

User Journey

This example user journey depicts an end-to-end process from viewing available HealthOmics Ready2Run workflows to instantiating a run and viewing its output in a data.all Worksheet.

Initiation:
- User navigates to the "Omics" section within data.all and browses Ready2Run workflows

User can also search for a specific workflow directly

After clicking on a workflow, users see a detailed view of it with a full description of what it does

Creation: After clicking on a workflow, users see a detailed view and hit “Create Run”. Users fill in the run creation form with the following parameters:

Workflow ID: Immutable ID of the Ready2Run workflow * Run Name: Customizable name of the run user will submit * Environment: data.all Environment AWS Account where the HealthOmics run will be (NOTE: the Environment must be in an AWS Region supported by HealthOmics, ex: N. Virginia or London) * Region: Pre-populated from the Environment and immutable Region where the run will be * Owners: data.all group who owns the run * Select S3 Output Destination: data.all Dataset where the output omics data will reside (NOTE: please create this prior to kicking off a run) * Run Parameters: JSON parameter input in the format expected by the Ready2Run workflow. It will be pre-populated with the correct fields, and users will paste in their data in the appropriate fields. For example, the raw input data in S3 that will be processed in the run. (NOTE: the input data does not have to be in a data.all Dataset, as long as it is accessible. For example, raw genomic data may be hosted publicly on the AWS Registry of Open Data, and the S3 URI can be provided in a field here)
History:
- Users navigate to Run tab at the top to view a history of the data.all-initiated Ready2Run workflows they’ve kicked off. (NOTE: run history deletion is still in progress)

Screenshot 2024-01-17 at 11 40 01 PM

Data Consumption:
- In Worksheets:
  - Users can select (or create) a new Worksheet.

Screenshot 2024-01-17 at 11 41 20 PM

Users can then query the data using SQL

Relates

Github Issue - #563

Security

Please answer the questions below briefly where applicable, or write N/A. Based on OWASP 10.

Does this PR introduce or modify any input fields or queries - this includes fetching data from storage outside the application (e.g. a database, an S3 bucket)? Yes
- Is the input sanitized? Yes
- What precautions are you taking before deserializing the data you consume? N/A
- Is injection prevented by parametrizing queries? N/A
- Have you ensured no eval or similar functions are used? N/A
Does this PR introduce any functionality or component that requires authorization? Yes
- How have you ensured it respects the existing AuthN/AuthZ mechanisms? Yes
- Are you logging failed auth attempts? N/A
Are you using or adding any cryptographic features? No
- Do you use a standard proven implementations?
- Are the used keys controlled by the customer? Where are they stored?
Are you introducing any new policies/roles/users? Yes
- Have you used the least-privilege principle? How? Yes, through scoped policies added to the role

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

Jan 09 '24 17:01 ironspur5

Looking at the PR some of the changes are unclear why they are happening in this PR would be good if the author commented on any file change that is strictly not HealthOmics related why it is changing and what is being done.

I also see some things that could be better modularized like frontend API files..

Im also concerned we're creating a new VPC even for those who don't plan to use this module.. Im not sure if Im wrong or not. But if that's the case then we should make sure this is controlled via settings if the module is enabled or not as VPCs cost money.

Jan 09 '24 17:01 zsaltys

@ironspur5 can you also list and describe the functionalities that this feature allows from a user perspective? Could you include some screenshots? That way we can identify missing API calls, or potential bugs in the design

Jan 11 '24 08:01 dlpzx

@ironspur5 can you also list and describe the functionalities that this feature allows from a user perspective? Could you include some screenshots? That way we can identify missing API calls, or potential bugs in the design

Updated PR with details and screenshots!

Jan 23 '24 02:01 ironspur5

Hi @ironspur5, take a look at my latest commits. I have re-created the frontend views as we discussed. I also started the code clean-up. There are still a couple of things to do before the PR is ready. From a user experience/features point of view:

the listRuns frontend view is not very user friendly, we need to re-design it a bit. Add filters ...
In that view we need to think about what happens after a run: what happens if it fails, how do we delete them...
in the workflow detail we should should show the user runs for that workflow

From a code perspective:

modify frontend views accordingly
create deleteRun methods @ironspur5 investigate and maybe some design
limit IAM permissions to omics (currently omics:* in *) @ironspur5
add count resources functions @dlpzx
add integration testing @dlpzx implement framework
fix migration scripts
REVIEW ALL

In addition we need to perform:

end-to-end testing under different scenarios

One step closer 🚀 Here is an screenshot of the Omics view after changes:

Feb 02 '24 09:02 dlpzx

Next steps 20 Feb:

Scope down permissions based on resource @ironspur5 @dlpzx
fix migration script @dlpzx
add designs for omics runs to PR @ironspur5
implementation of omics runs view @ironspur5
deployed version @dlpzx
implement more integration tests
end-ro-end validation and REVIEW ALL

Feb 20 '24 14:02 dlpzx

Hi all! I have committed a few changes to address scoped down permissions. I am also sharing my design UI ideas for the view Run page and updated view Runs page. This also illustrates the proposed idea to allow a re-run from an old run.

dataall-omics-runs-ui

Mar 01 '24 21:03 ironspur5

Next steps 5 Mar:

Scope down permissions based on resource @ironspur5 @dlpzx - needs testing
fix migration script @dlpzx - done
add designs for omics runs to PR @ironspur5 - completed
implementation of omics runs view @ironspur5 - needs testing
deployed version @dlpzx - in-progress
implement more integration tests @ironspur5 and @dlpzx half and half - at least 75%
end-to-end validation and REVIEW ALL - @dlpzx @ironspur5 comment on the issue with acceptance criteria
- end-user feedback collection
- Defining the tests
- Doing the test

Mar 05 '24 14:03 dlpzx

Testing

General testing

[X] CICD pipeline runs successfully
[X] Migration scripts are executed -> new tables for omics
[ ] Run ECS task omics workflow fetcher -> it returns all workflows and registers them in data.all RDS (without duplicates for READY-TO-RUN workflows) --> error dataallPivotRole-cdk is not authorized to perform: omics:ListWorkflows on resource: arn:aws:omics:us-east-1::workflow/*
[X] Enable omics module and deploy -> Omics appears in left side pane and Omics workflows and Runs are visible by user
[ ] Disable omics module -> Omics APIs are not callable, Omics FE views are not rendered

Environments

[X] In Environments creation tab we can enable/disable omics from the environment
[X] Environment creation ---> issue in permissions of omics: File "/dataall/modules/omics/cdkpivot_role_omics_policy.py", line 36, in get_statements 2024-03-11T14:22:17.201Z f'arn:aws:omics:{self.region}:{self.account}:run/{self.resource_prefix}*', 2024-03-11T14:22:17.201Z AttributeError: 'PivotRoleStatementSet' object has no attribute 'resource_prefix' ---> fixed!

Workflows

Runs

Mar 11 '24 12:03 dlpzx

Next steps 26 Mar:

implementation of omics runs view @ironspur5 - needs testing
add Omics as an environment feature (toggle in env view) @dlpzx
search and filters in workflows page @dlpzx
fix account Id graphql OmicsRun model @dlpzx
clean-up postgres duplicates in RDS in real deployment @dlpzx
remove private workflows from this first iteration (leave the infra, just disable it at the moment) @ironspur5
end-to-end validation and REVIEW ALL - @ironspur5 comment on the issue with acceptance criteria
- end-user feedback collection
- Defining the tests ******
implement more integration tests @ironspur5 and @dlpzx half and half - at least 75%

Mar 26 '24 13:03 dlpzx

I also noticed that we will need to add some migration scripts for the creation of the new permissions in the database! @ironspur5

So here is the updated list of todos:

implementation of omics runs view @ironspur5 - needs testing
add Omics as an environment feature (toggle in env view) @dlpzx - DONE! ✅
search and filters in workflows page @dlpzx - DONE! ✅
fix account Id graphql OmicsRun model @dlpzx - DONE! ✅
clean-up postgres duplicates in RDS in real deployment @dlpzx - DONE! ✅ (I need to review it a bit, thanks @ironspur5 for picking it up)
remove private workflows from this first iteration (leave the infra, just disable it at the moment) @ironspur5 - DONE! ✅
end-to-end validation and REVIEW ALL - @ironspur5 comment on the issue with acceptance criteria
- end-user feedback collection
- Defining the tests ******
implement more integration tests @ironspur5 and @dlpzx half and half - at least 75%
Ensure permissions get created when the module is enabled. This issue is also present for other modules, @petrkalos is looking into it in #944

Apr 03 '24 07:04 dlpzx

Before publishing this feature we need to (written in no particular order)

Review the code end-to-end and PR review (I will add some TODOs)
Finish writing integration tests that ensure the correct functioning of the newly added code
Technical validation - we go trough the common and edge cases of using Omics and ensure that all functionalities work
User validation - Omics users try out the module and provide feedback on enhancements

1. Code review findings

Added as TODOs in the Omics module code

2. Integration Tests

We need to cover new API calls as ECS tasks. We should test all scenarios and not only "happy scenarios". We do not need to test AWS calls.

[X] ECS Omics Workflows fetcher task: test for new workflows, test with existing workflows, test with multiple environments
[ ] API calls for Workflows
[ ] API calls for Runs

More TBD

3. Technical Validation

Generic tests

[ ] Deploy fresh deployment of data.all with Omics disabled- check that Omics functionalities are disabled: API calls are not reachable, ECS task not deployed and frontend views do not show any Omics reference in side pane, in environments edit/create/detail views.
[ ] On the same fresh deployment of data.all enable Omics - the reverse: API calls are reachable, ECS task deployed, Omics visible in frontend: in side pane, in Omics views, and in environment create/edit/details view.

ECS Omics Workflow fetcher task

[ ] Verify that it is triggered daily. On a deployment with 2 data.all Environments: Check ECS logs and verify that existing workflows get updated and that there are no duplicates after some days.

API calls for Environments

[ ] Create new environment with Omics enabled
[ ] .....

API calls for Workflows

[ ] TBD

API calls for Runs

[ ] TBD

Backwards compatibility

Pre-reqs: existing deployment with main code and with at least 1 data.all environment

[ ] Update an existing deployment (from main) with the new Omics code but WITH OMICS DISABLED: CICD pipeline updates successfully, migration scripts create the necessary RDS tables and backfill permissions.
[ ] Update the data.all environment - check that the stack is not changed and succeeds
[ ] Enable Omics and check API calls are reachable, ECS task deployed, Omics visible in frontend: in side pane, in Omics views, and in environment create/edit/details view.
[ ] Update the environment, check that pivot role and team IAM roles contain Omics IAM policy statements. Check that stack updates successfully.

4. User Validation

TBD with users

Apr 04 '24 06:04 dlpzx

April 9th

update with changes from main ✅
issue with search filter refreshing with every type - @dlpzx ✅
issue with creation of OmicsRun - @ironspur5 - Add omics run to the resources in startRun permission to pivot role: "arn:aws:omics:us-east-1:11111111:run/*" in pivot role - Add passRole to Omics to pivot role: - Add tag omics permissions to pivot role - Add trust to omics to environment team role - Add CloudWatchLogs permissions to environment team role
REVIEW deleteRuns and other run methods @ironspur5
implementation of omics runs view @ironspur5 - needs testing
implement more integration tests @ironspur5 and @dlpzx half and half - at least 75%
- Workflows APIs - @dlpzx stretch goal ✅
- Runs APIs✅
end-to-end validation and REVIEW ALL - complete comment above!
- end-user feedback collection
- Defining the tests ******

PassRole permissions:

        {
            "Condition": {
                "StringEquals": {
                    "iam:PassedToService": [
                        "omics.amazonaws.com"
                    ]
                }
            },
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::111111111111:role/dataall*",
            "Effect": "Allow",
            "Sid": "PassRoleOmics"
        },

Tag resources and Run in StartRun:

        {
            "Action": [
                "omics:GetWorkflow",
                "omics:StartRun",
                "omics:TagResource"
            ],
            "Resource": [
                "arn:aws:omics:us-east-1:111111111111:workflow/*",
                "arn:aws:omics:us-east-1::workflow/*",
                "arn:aws:omics:us-east-1:111111111111:run/*"
            ],
            "Effect": "Allow",
            "Sid": "OmicsWorkflowActions"
        },

Trust policy:

        {
            "Effect": "Allow",
            "Principal": {
                "Service": "omics.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        },

CloudWatch logs:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": [
                "arn:aws:logs:us-east-1:1111111111111:log-group:/aws/omics/*",
                "arn:aws:logs:us-east-1:1111111111111:log-group:/aws/omics/*:log-stream:*"
            ],
            "Effect": "Allow"
        }
    ]
}

Apr 09 '24 14:04 dlpzx

Hi all, I have recorded a video demo of this feature:

https://github.com/data-dot-all/dataall/assets/28816838/b36777be-b4bd-48d2-b9cc-09a4f0d2bf0a

May 23 '24 21:05 ironspur5

First a foremost - this is a great contribution from @ironspur5 and team! All of your work and collaboration is super appreciated and I think this is an amazing starting point for health science applications and data consumption patterns in data.all :)

I left some comments throughout the PR - most are small changes or touch ups but let me know if any questions or disagreements. Additionally, some high-level notes I too:

Notes:

For existing DA Deployment (a heads up on what we need to call out during release @dlpzx)
- Existing Groups need to be given MANAGE_OMICS permission from tenant
- Need to update envs after enabling omics module
- Need to run omics workflow fetcher at least 1 time before can start runs
Any need for an Omits VPC Endpoint? Similar to how we have VPC interface endpoints for other supported services

Enhancement Ideas (can be picked up after this PR)

Permissions
- Is there any need for more in depth permission model? Currently we have:
  - MANAGE_OMICS_RUNS (Tenant Level)
  - CREATE_OMICS_RUNS (on Env Record)
  - DELETE_OMICS_RUN (on OmicsRun Record)
- Especially if we move to supporting Private Workflows May Need more verbose permissions for update, get, list, for one if not both of omics workflow and omics runs
Private Workflow Types, RunGroups, etc.
Search + Filters on Omics Run History
Better UX JSON editing for parameter template
Provide Output Locations specific to Table Location or Folder Location rather than entire Dataset

Tests I performed:

[x] CICD Pipeline Completes
[x] Env Stack Update - adds Omics permissions to pivot role and env group roles (if they have omics env group permission)
[x] Omics Permissions Show (both in Admin Tenant Permissions and Env Group Permissions + can edit appropriately for each)
[x] ECS Workflow Fetch Task Successful (IMPORTANT: Omics Module needs to be enabled - left comment on file)
[x] Workflows Paginated and showing on Omics Workflow Tab
[x] Workflow Card with Additional Description and Parameter Template Shows Properly
[x] Create Run from Workflow (Tested ESM Fold Workflow - mimiced Video)
[x] Run History Shows OmicsRuns for the Users Teams (other users in separate team cannot view Omics Runs)
[x] Run Status Updates when Job Completes
[x] Output Data to correct S3 Output (tested both same dataset and a different owned dataset)
[x] Omics Run Creator Team can Delete Job Run as well
[x] Test with Shared Dataset as Input
[x] Crawl + Query Output metrics in Worksheets

May 31 '24 19:05 noah-paige

A final suggestion left on omics_workflow_fetcher task

And 1 last thought - should we restrict omics runs to follow a pattern using NamingConventionService similar to how we do with other resources?

If so think we can also then restrict the pivot role permissions to arn:aws:omics:{self.region}:{self.account}:run/{self.env_resource_prefix}* (right now it is wildcard on name)

If we do not implement naming convention like above - we may want to remove resource prefix restriction on env group role arn:aws:omics:{self.region}:{self.account}:run/{self.resource_prefix}* (currently implemented as such in backend/dataall/modules/omics/cdk/env_role_omics_policy.py

Jun 07 '24 20:06 noah-paige

A final suggestion left on omics_workflow_fetcher task

And 1 last thought - should we restrict omics runs to follow a pattern using NamingConventionService similar to how we do with other resources?
* If so think we can also then restrict the pivot role permissions to `arn:aws:omics:{self.region}:{self.account}:run/{self.env_resource_prefix}*` (right now it is wildcard on name)
If we do not implement naming convention like above - we may want to remove resource prefix restriction on env group role arn:aws:omics:{self.region}:{self.account}:run/{self.resource_prefix}* (currently implemented as such in backend/dataall/modules/omics/cdk/env_role_omics_policy.py

I am not 100% certain, but I also don't think the self.env_resource_prefix is actually doing anything. Run ARNs don't include any naming/prefix in their format, just the Run ID. I think I put this in there to match what other modules looked like. But the tag key condition to prevent unauthorized run access is really what I was intending for.

Jun 10 '24 15:06 ironspur5

Great work on this initiative team - the collaboration and coordination across the teams, all focused on delivering the best possible experience for our customers, has been awesome. Let's continue this collaboration for other initiatives as well.

Jun 14 '24 13:06 anmolsgandhi

Thank you for all your support in getting this added, team!!

Jun 14 '24 14:06 ironspur5

dataall dataall copied to clipboard

Adding AWS HealthOmics as a Module in "Play" tools

Feature or Bugfix

Background

AWS HealthOmics Integration

Consumption Patterns

Considerations

User Journey

Relates

Security

Testing

General testing

Environments

Workflows

Runs

1. Code review findings

2. Integration Tests

3. Technical Validation

Backwards compatibility

4. User Validation

dataall
dataall copied to clipboard