YAMDA
YAMDA copied to clipboard
Keep finding the same motif
I asked for 10 motifs with -n 10, and found the same exact motif 10 times. Am I missing something here?
Can you describe to me your dataset and show me the output motifs and your command line? Motif discovery is a bit more art than science. I suspect you may have a dataset full of repeats.
Absolutely. And this is uncharted territory too. I am looking for cis regulatory elements that are involved in robust DNA methylation in the honey bee. What I have are 1kb windows around CG sites that are methylated across a wide variety of conditions. CG sites themselves are all in exons, so I don't expect an overwhelming number of repeats, but 1 kb is wide enough that I am probably getting into the introns too.
The multiply found motif is indeed repetitive.
Command line I used was python run_em.py -i ../positive.fa -j ../negative.fa -r -n 10 -o apis
I tried running with -e but now it just found that 1 motif and stopped. Maybe masking before motif finding is necessary?
That does sound like a very difficult dataset to analyze. I'm not sure if any motif discovery algorithm can get high quality motifs from it. Have you tried centrimo?
Thank you for your input nevertheless. I think I'd need a good set of candidates for centrimo. I have an instance of DREME running on this but it has been going on for a few days now. Unfortunately it won't write any output before it terminates, at least not on this machine, so I wait.
Masking low complexity/repetitive regions (window masker) unfortunately did not improve the situation. I also tried smaller windows to no avail. I confirmed it finds multiple distinct motifs in the test set.
All things considered maybe task at hand is not tractable, nevertheless YAMDA has been the only tool that I can at least rapidly iterate with, so thank you for your work! It's funny that the initial counts take majority of the time while GPU makes easy work out of the actual EM part.
Hello, I've been using YAMDA recently and have run into the same problem as littleblackfish. To generate my dataset, I clustered the genes of one domesticated and one wild tomato species into co-regulated groups based on their expression patterns (exact details aren't important). Each cluster has about 1000 genes in it (smallest has about 400, largest about 1900), includes genes from both species, and only includes genes that had variable expression patterns. I then extracted 1000 bp of sequence upstream of the transcription state site of each gene in each cluster to generate putative promoters. Before running YAMDA, I removed simple repeats from these promoters using:
python erase_annoying_sequences.py -i cluster1_promoters.fasta -o cluster1_promoters_noreps.fasta
Then, I applied YAMDA to each cluster of masked promoter sequences with:
python run_em.py -f 0.1 -r -e -maxs 20000000 -mins 20 -n 10 -w 8 -a dna -i cluster1_promoters_noreps.fasta -j background_promoters_noreps.fasta -oc yamda_output
I've applied MEME, HOMER, DREME, and DiNAMO to this same dataset with similar parameters as above and achieved decent results. However, 9/10 of the motifs discovered by YAMDA in each cluster are nearly identical to one another and the first motif is usually just a simple string of 8 G's. I've tried lowering the fudge parameter (-f) but that prevented YAMDA from discovering any motifs whatsoever.
Any advice you're able to give would be greatly appreciated. Thank you for creating an awesome open source motif discovery tool!
EDIT
YAMDA appears to give me different results when I lower the -maxs parameter. I reran my code as:
python run_em.py -f 0.1 -r -e -maxs 5000 -mins 20 -n 10 -w 8 -a dna -i cluster1_promoters_noreps.fasta -j background_promoters_noreps.fasta -oc yamda_output
And the output motifs were CCCCCCCC and GGGGGGGG, each repeated 5 times. It didn't matter which cluster of genes I input into YAMDA either. I always got this same result.
@milesroberts-123 It looks like you're using your own negative sequences. Can you try using a dinucleotide shuffled version instead? A lot of repetitive motifs suggests you have a lot of repeats in your dataset. I guess it's expected since these are promoter sequences, so the C/G content is high.
I also suggest removing the -e option. These are promoter sequences, not ChIP-seq sequences. Whereas ChIP-seq sequences, it's relative reasonable that each FASTA sequence has only one motif occurrence, I don't think you can say the same about promoter sequences, which can have many motifs. The -e option will delete the whole promoter sequence each time it discovers a motif in that sequence, which I'm guessing you don't want.
@daquang, thanks for your quick response! I'll definitely give that a try. Yes, I realized my silliness with including the -e option not long after making this post, so I'll remove it and let you know how it goes!
@daquang I created a dinucleotide shuffled version of my sequences like so:
fasta-shuffle-letters -kmer 2 cluster1_promoters_noreps.fasta cluster1_promoters_shuffled_noreps.fasta
Then I ran YAMDA using the shuffled sequences as background:
python run_em.py -f 0.1 -r -mins 20 -n 10 -w 8 -k 4 -a dna -i cluster1_promoters_noreps.fasta -j cluster1_promoters_shuffled_noreps.fasta -oc yamda_output
However, I'm still getting GGGGGGGG and CCCCCCCC for all of my motifs. Do you have any other suggestions?
@milesroberts-123 Hmm, why don't you try going into the erase_annoying_sequences script and modify the following line:
annoying_subsequences = [
'AAAAAA',
'TTTTTT',
'CCCGCCC',
'GGGCGGG'
]
and add 'CCCCC' and 'GGGGG' to the list of annoying sequences?
@daquang Great idea! I'll give it a try.
@daquang Removing the CCCCC and GGGGG subsequences improved the results immensely! There was also a TATATA repeat (way too long to be a TATA-box) in many of my sequences, which I removed, and that helped even more! There's now a CTCTCT repeat that comes up a lot in my motifs, but I'm gonna keep it because I have reason to believe it is relevant to what I'm studying.
Thanks again for all of your help!
To anyone else that finds this thread, I found it helpful to mask all dinucleotide repeats 6bp or longer and all single nucleotide repeats 5 bp or longer when using YAMDA or other motif algorithms on promoter sequences.
annoying_subsequences = [
'AAAAA',
'TTTTT',
'CCCGCCC',
'GGGCGGG',
'CCCCC',
'GGGGG',
'CTCTCT',
'ACACAC',
'GTGTGT',
'GCGCGC',
'GAGAGA',
'TATATA'
]