motif-clustering
motif-clustering copied to clipboard
Clustering motif models to remove redundancy
Non-redundant TF motif matches genome-wide
Below describes the general workflow for clustering motif models to remove redundancy, generating an "archetype" motif, and then finally, performing genome-wide scans of motifs and remvoval of redundancy. Please note that this documentation is incomplete, as should be used a a rough guide only.
If you are looking for the final results (motif clusters, genome-wide scans and a browser shot) please see the following website: https://resources.altius.org/~jvierstra/projects/motif-clustering/
Note: The above link is for version 1.0 of the motif clusters. While we have yet to update the website, version 2.1-beta (latest) can be found at https://resources.altius.org/~jvierstra/projects/motif-clustering-v2.1beta/.
Contact me at jvierstra (at) altius.org
with any questions/requests/comments.
Requirements
- Python 3.5+
- numpy
- scipy
- seaborn
- matplotlib
- genome-tools (http://www.github.com/jvierstra/genome-tools)
- TOMTOM (http://meme-suite.org/doc/download.html)
- BEDOPS (http://bedops.readthedocs.io)
Versions
-
Version 2.0beta-human (https://resources.altius.org/~jvierstra/projects/motif-clustering-v2.0beta/)
- CIS-BP (homo sapien; only motifs w/ direct measurement)
- JASPAR2022
- BANP from Grand 2021
- 5,193 total motif model; 693 distinct clusters
-
Version 1.0 (complete documation: https://resources.altius.org/~jvierstra/projects/motif-clustering/)
- Jolma et al., Cell 2013 (Supplemental Table 2)
- JASPAR 2018
- HOCOMOCO version 11 (757 motif models; both human and mouse)
Quick start
Step 1: Download and preproces motifs
See runall
script in each motif database directory (databases/*
)
Step 1: Compute pair-wise motif similarity
Here we use TOMTOM to determine the similarity between all motif models (all pairwise) with the following code:
meme2meme databases/*/*.meme > tomtom/all.dbs.meme
tomtom \
-bfile /net/seq/data/projects/motifs/hg19.K36.mappable_only.5-order.markov \
-dist kullback \
-motif-pseudo 0.1 \
-text \
-min-overlap 1 \
tomtom/all.dbs.meme tomtom/all.dbs.meme \
> tomtom/tomtom.all.txt
I have a provided a script that will load this operation up on a SLURM parallel compute cluster (see bin/runall.tomtom.v2.0beta-human for an example)
Step 2: Hierarchically cluster motifs
After running TOMTOM, open up the provided Jupyter Notebook to perform the clustering and visualization
We perform hierarchical clustering (distance: correlation, complete linkage) from the TOMTOM similarity E-values. Below is a heatmap representation of motifs clustered by simililarity and clusters identified cutting the dendrogram at height 0.7.
Step 3: Process each cluster to build a motif archetype
Again, inside the notebook there is code that will process and visualize each motif cluster.
AC0002 (homeodomain) | AC0240 (CCAAT-box) |
---|---|
![]() |
![]() |
Step 4: Make HTML output
Run the BASH script bin/runall.make-html
to generate an HTML webpage (index.html) in the results
directory
Step 5: Scan genome using all motif models (individually and archetype)
I use the software package MOODS to find motif matches genome-wide. Its a great tool and that I highly reccomend. See bin/runall.scan_models for an example of how to do this on a SLURM cluster.
Step 5: Create working and browser tracks
To create a bigBed file from a bed9+4, we need to include an AutoSql file (bed_format.as)
table hg38_motifs_collapsed
"Collapsed motifs matches in hg38 (see: http://www.github.com/jvierstra/motif-clustering)"
(
string chrom; "Reference sequence chromosome or scaffold"
uint chromStart; "Start position of feature on chromosome"
uint chromEnd; "End position of feature on chromosome"
string name; "Name of motif"
uint score; "Score"
char[1] strand; "+ or - for strand"
uint thickStart; "Coding region start"
uint thickEnd; "Coding region end"
uint reserved; "itemRgb"
)
Make the tracks for the archetypes
bedToBigBed -as=bed_format.as -type=bed9+4 -tab moods.combined.all.bed chrom.sizes moods.combined.all.bb
awk -v OFS="\t" '{ print $1, $2, $3, $4, $11, $6, $10, $13}' moods.combined.all.bed | bgzip -c > moods.combined.all.bed.gz
tabix -p bed moods.combined.all.bed.gz
Make the tracks for the full motif scans.
fetchChromSizes hg38 > /tmp/chrom.sizes
awk -v OFS="\t" '{ print $1, 0, $2; }' /tmp/chrom.sizes | sort-bed - > /tmp/chrom.sizes.bed
bedops -e 100% moods.combined.all.bed /tmp/chrom.sizes.bed \
| awk -v OFS="\t" '{ print $1, $2, $3, $4, 0, $6, $2, $3, "0,0,0", $5, $7 }' > /tmp/moods
bedToBigBed -as=bed_format.as -type=bed9+2 -tab /tmp/moods chrom.sizes moods.combined.all.bb