
Merge Invitae local changes used to build recent UTA

Open bsgiles73 opened this issue 7 months ago • 8 comments

Overview

The UTA build process has had several long-standing technical issues that needed to be addressed before recurring builds and data releases could resume. The goal of this work was to get the project into a state where they can. The requirements for this work are listed below.

Requirements

  • Ability to run a UTA/SeqRepo build with minimal intervention.
  • Provide an alternative to retrieve alignments for NCBI RefSeq transcripts.
  • Provide a way to introduce UTA schema modifications.
  • Make the build process more transparent.

Changes

Docker

  • Introduce Docker to containerize the UTA build environment.
  • Use docker compose with entry point scripts to provide more visibility into the build workflow:
      1. seqrepo-pull: Pull the latest data version of SeqRepo locally.
      2. ncbi-download: Download the files from NCBI needed by the build pipeline.
      3. uta-extract: Extract and transform data from the downloaded files.
      4. seqrepo-load: Load novel sequences into SeqRepo.
      5. uta-load: Load genes, associated accessions, transcripts, and alignments into UTA.
  • This allows the build to run on any system that has Docker installed and enough disk space (~35 GB).
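
The stage order listed in this section can be sketched as a small wrapper. This is only an illustration, not a script from this PR: `run_stage`, `run_build`, and the `DRY_RUN` switch are hypothetical names, and the sketch assumes the environment variables from "How to test" are already exported.

```shell
# Stage names are the compose services listed in this section, in build order.
STAGES="seqrepo-pull ncbi-download uta-extract seqrepo-load uta-load"

# DRY_RUN is a hypothetical switch for this sketch: when set to 1, the
# command is printed instead of executed.
run_stage() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "docker compose run $1"
  else
    docker compose run "$1"
  fi
}

# Run every stage in order and stop at the first failure.
run_build() {
  for stage in $STAGES; do
    run_stage "$stage" || { echo "stage failed: $stage" >&2; return 1; }
  done
}
```

After sourcing the snippet, `DRY_RUN=1 run_build` prints the five commands without touching Docker, which is a cheap way to confirm the order before a multi-hour load.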

Alembic

  • Introduce Alembic to make schema/model changes easy and transparent.
  • Starting from an initial migration file that matches the current UTA schema, several additional changes were made:
      • add a model for the assocacs table
      • add gene_id to the gene and transcript tables
      • make gene_id the primary key for gene and the foreign key for transcript -> gene
      • add a column to transcript for the codon table
      • create a translation_exception table to hold translation exceptions parsed from RefSeq files at NCBI
      • create a materialized view for tx_exon_aln_v that can be used in a future HGVS UTA dataprovider

New NCBI input files

  • Review and determine the minimum set of NCBI files needed for a UTA/SeqRepo build. (etc/ncbi-files.txt)
  • Downloading files is the first step in the build process.
  • Transcript exon structure is still determined from RefSeq mRNA_Prot GBFF files.
  • Alignments are parsed from NCBI genome builds annotated with RefSeq transcripts (GFF file format).

One time workflows

  • Included in this PR are the code and configuration for several pre-build workflows that were run to get the UTA database and SeqRepo ready for the latest build:
      1. misc/gene-update/docker-compose-gene-update.yml: The entry point script added the initial Alembic migration, added the gene_id columns, performed the data backfill, and applied the rest of the schema changes.
      2. misc/mito-transcripts/docker-compose-mito-extract.yml: Extract and transform mitochondrial gene sequences from NC_012920.1 so they could be loaded into UTA.
      3. misc/refseq-historical-backfill/docker-compose-backfill.yml: Extract and transform RefSeq transcripts and alignments from "refseq/H_sapiens/historical/GRCh38/GCF_000001405.40-RS_2023_03_historical".

Results of latest build

Historical RefSeq Backfill

  1. Extract intermediate files from NCBI RefSeq backfill (~50 minutes)
     docker compose -f docker-compose.yml -f misc/refseq-historical-backfill/docker-compose-backfill.yml run uta-extract-historical
  2. SeqRepo load for historical RefSeq backfill (~10 minutes)
     docker compose run seqrepo-load
  3. UTA load for historical RefSeq backfill (~4 hours)
     docker compose run uta-load
...
+-----------------------+------+---------+---------+-----+---------+--------+------------------------------------------------+
|         table         |  t   |    n1   |    n2   | nu1 |    nc   |  nu2   |                      cols                      |
+-----------------------+------+---------+---------+-----+---------+--------+------------------------------------------------+
| associated_accessions | 8.8  |  265048 |  274192 |  0  |  265048 |  9144  |              tx_ac,pro_ac,origin               |
|          exon         | 51.9 | 8311010 | 8658305 |  0  | 8311010 | 347295 |                       *                        |
|        exon_aln       | 36.5 | 5604227 | 5810798 |  0  | 5604227 | 206571 | exon_aln_id,tx_exon_id,alt_exon_id,cigar,added |
|        exon_set       | 6.5  |  894156 |  922894 |  45 |  894111 | 28783  |                       *                        |
|          gene         | 0.5  |  64092  |  64643  |  0  |  64092  |  551   |                    gene_id                     |
|          meta         | 0.0  |    5    |    5    |  1  |    4    |   1    |                       *                        |
|         origin        | 0.0  |    6    |    6    |  0  |    6    |   0    |                       *                        |
|          seq          | 27.8 |  340385 |  351449 |  0  |  340385 | 11064  |                       *                        |
|        seq_anno       | 2.8  |  360101 |  371704 |  0  |  360101 | 11603  |     seq_anno_id,seq_id,origin_id,ac,added      |
|       transcript      | 11.1 |  314264 |  325711 |  0  |  314264 | 11447  |                       ac                       |
+-----------------------+------+---------+---------+-----+---------+--------+------------------------------------------------+
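
The load summary does not define its columns. Reading the numbers, every row appears to satisfy n1 = nu1 + nc and n2 = nc + nu2, which suggests (as an assumption, not something stated in the output) that nu1 and nu2 count rows unique to the old and new schema and nc counts rows common to both. A minimal sketch of that consistency check, with `check_row` as a hypothetical helper:

```shell
# check_row n1 n2 nu1 nc nu2
# Succeeds when n1 == nu1 + nc and n2 == nc + nu2 (assumed column meaning).
check_row() {
  [ "$1" -eq $(( $3 + $4 )) ] && [ "$2" -eq $(( $4 + $5 )) ]
}

# Two rows copied from the backfill table.
check_row 894156 922894 45 894111 28783 && echo "exon_set: consistent"
check_row 314264 325711 0 314264 11447 && echo "transcript: consistent"
```

Under that reading, the exon_set row says 45 exon sets exist only in the old schema, 894111 are shared, and 28783 were added by the backfill.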

UTA/SeqRepo Build

  1. Run ncbi-download to start standard update (~10 minutes)
     docker compose run ncbi-download
  2. Run uta-extract to generate intermediate files from downloaded files
     docker compose run uta-extract
  3. Run SeqRepo load
     docker compose run seqrepo-load
  4. Run UTA load
     UTA_ETL_NEW_UTA_VERSION=uta_20240523 docker compose run uta-load
...
+-----------------------+------+---------+---------+-----+---------+---------+------------------------------------------------+
|         table         |  t   |    n1   |    n2   | nu1 |    nc   |   nu2   |                      cols                      |
+-----------------------+------+---------+---------+-----+---------+---------+------------------------------------------------+
| associated_accessions | 13.9 |  274192 |  405253 |  0  |  274192 |  131061 |              tx_ac,pro_ac,origin               |
|          exon         | 94.9 | 8658305 | 9716651 |  0  | 8658305 | 1058346 |                       *                        |
|        exon_aln       | 75.2 | 5810798 | 6847303 |  0  | 5810798 | 1036505 | exon_aln_id,tx_exon_id,alt_exon_id,cigar,added |
|        exon_set       | 13.6 |  922894 | 1022751 |  30 |  922864 |  99887  |                       *                        |
|          gene         | 3.3  |  64643  |  229123 |  0  |  64643  |  164480 |                    gene_id                     |
|          meta         | 0.0  |    5    |    5    |  1  |    4    |    1    |                       *                        |
|         origin        | 0.0  |    6    |    6    |  0  |    6    |    0    |                       *                        |
|          seq          | 48.3 |  351449 |  354745 |  0  |  351449 |   3296  |                       *                        |
|        seq_anno       | 3.9  |  371704 |  375097 |  0  |  371704 |   3393  |     seq_anno_id,seq_id,origin_id,ac,added      |
|       transcript      | 17.7 |  325711 |  328839 |  0  |  325711 |   3128  |                       ac                       |
+-----------------------+------+---------+---------+-----+---------+---------+------------------------------------------------+

How to test

Running from latest UTA release (uta_20210129b)

You will need to set some local working directories and a variable for the new UTA build artifact.

  1. Build the UTA image
     docker build --target uta -t uta-update .
  2. Set necessary env variables
     export UTA_ETL_OLD_UTA_IMAGE_TAG=uta_20210129b
     export UTA_ETL_NEW_UTA_VERSION=uta_20240522
     export UTA_ETL_NCBI_DIR=./ncbi-data
     export UTA_ETL_WORK_DIR=./output/artifacts
     export UTA_ETL_LOG_DIR=./output/logs
  3. Run gene id schema and data migration (~10-15 minutes)
     docker compose -f docker-compose.yml -f misc/gene-update/docker-compose-gene-update.yml run uta-gene-update
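
Before starting the migration it can help to confirm that the variables from this list are actually set in the current shell. The following is a minimal sketch: `check_uta_env` is a hypothetical helper, and creating the three directories up front is an assumption about what the compose files expect, not something this PR states.

```shell
# Variable names come from the export lines in this section.
check_uta_env() {
  missing=""
  for name in UTA_ETL_OLD_UTA_IMAGE_TAG UTA_ETL_NEW_UTA_VERSION \
              UTA_ETL_NCBI_DIR UTA_ETL_WORK_DIR UTA_ETL_LOG_DIR; do
    # Indirect lookup: read the value of the variable whose name is in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      missing="$missing $name"
    fi
  done
  if [ -n "$missing" ]; then
    echo "missing:$missing" >&2
    return 1
  fi
  # Assumption: the local working directories should exist before the build.
  mkdir -p "$UTA_ETL_NCBI_DIR" "$UTA_ETL_WORK_DIR" "$UTA_ETL_LOG_DIR"
}
```

Calling `check_uta_env` returns non-zero and names the unset variables on stderr, so it can gate the compose commands with `check_uta_env && docker compose ...`.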

Running the standard UTA build using the output artifact from the last step

  1. Pull SeqRepo (~30 minutes)
     docker compose run seqrepo-pull
  2. Download files from NCBI (~10 minutes)
     docker compose run ncbi-download
  3. Run uta-extract to generate intermediate files from downloaded files
     docker compose run uta-extract
  4. Run SeqRepo load
     docker compose run seqrepo-load
  5. Run UTA load
     UTA_ETL_OLD_UTA_VERSION=uta_20240522 \
     UTA_ETL_NEW_UTA_VERSION=uta_20240523 \
     docker compose run uta-load

bsgiles73, Jul 10 '24 21:07