YAMDA icon indicating copy to clipboard operation
YAMDA copied to clipboard

Keep finding the same motif

Open littleblackfish opened this issue 4 years ago • 12 comments

I asked for 10 motifs with -n 10, and found the same exact motif 10 times. Am I missing something here?

littleblackfish avatar Jul 16 '20 17:07 littleblackfish

Can you describe to me your dataset and show me the output motifs and your command line? Motif discovery is a bit more art than science. I suspect you may have a dataset full of repeats.

daquang avatar Jul 16 '20 20:07 daquang

Absolutely. And this is uncharted territory too. I am looking for cis regulatory elements that are involved in robust DNA methylation in the honey bee. What I have are 1kb windows around CG sites that are methylated across a wide variety of conditions. CG sites themselves are all in exons, so I don't expect an overwhelming number of repeats, but 1 kb is wide enough that I am probably getting into the introns too.

The multiply found motif is indeed repetitive. logoM0

Command line I used was python run_em.py -i ../positive.fa -j ../negative.fa -r -n 10 -o apis

I tried running with -e but now it just found that 1 motif and stopped. Maybe masking before motif finding is necessary?

littleblackfish avatar Jul 16 '20 23:07 littleblackfish

That does sound like a very difficult dataset to analyze. I'm not sure if any motif discovery algorithm can get high quality motifs from it. Have you tried centrimo?

daquang avatar Jul 17 '20 06:07 daquang

Thank you for your input nevertheless. I think I'd need a good set of candidates for centrimo. I have an instance of DREME running on this but it has been going on for a few days now. Unfortunately it won't write any output before it terminates, at least not on this machine, so I wait.

Masking low complexity/repetitive regions (window masker) unfortunately did not improve the situation. I also tried smaller windows to no avail. I confirmed it finds multiple distinct motifs in the test set.

All things considered maybe task at hand is not tractable, nevertheless YAMDA has been the only tool that I can at least rapidly iterate with, so thank you for your work! It's funny that the initial counts take majority of the time while GPU makes easy work out of the actual EM part.

littleblackfish avatar Jul 17 '20 07:07 littleblackfish

Hello, I've been using YAMDA recently and have run into the same problem as littleblackfish. To generate my dataset, I clustered the genes of one domesticated and one wild tomato species into co-regulated groups based on their expression patterns (exact details aren't important). Each cluster has about 1000 genes in it (smallest has about 400, largest about 1900), includes genes from both species, and only includes genes that had variable expression patterns. I then extracted 1000 bp of sequence upstream of the transcription state site of each gene in each cluster to generate putative promoters. Before running YAMDA, I removed simple repeats from these promoters using:

python erase_annoying_sequences.py -i cluster1_promoters.fasta -o cluster1_promoters_noreps.fasta

Then, I applied YAMDA to each cluster of masked promoter sequences with:

python run_em.py -f 0.1 -r -e -maxs 20000000 -mins 20 -n 10 -w 8 -a dna -i cluster1_promoters_noreps.fasta -j background_promoters_noreps.fasta -oc yamda_output

I've applied MEME, HOMER, DREME, and DiNAMO to this same dataset with similar parameters as above and achieved decent results. However, 9/10 of the motifs discovered by YAMDA in each cluster are nearly identical to one another and the first motif is usually just a simple string of 8 G's. I've tried lowering the fudge parameter (-f) but that prevented YAMDA from discovering any motifs whatsoever.

Any advice you're able to give would be greatly appreciated. Thank you for creating an awesome open source motif discovery tool!

EDIT

YAMDA appears to give me different results when I lower the -maxs parameter. I reran my code as:

python run_em.py -f 0.1 -r -e -maxs 5000 -mins 20 -n 10 -w 8 -a dna -i cluster1_promoters_noreps.fasta -j background_promoters_noreps.fasta -oc yamda_output

And the output motifs were CCCCCCCC and GGGGGGGG, each repeated 5 times. It didn't matter which cluster of genes I input into YAMDA either. I always got this same result.

milesroberts-123 avatar Aug 09 '20 12:08 milesroberts-123

@milesroberts-123 It looks like you're using your own negative sequences. Can you try using a dinucleotide shuffled version instead? A lot of repetitive motifs suggests you have a lot of repeats in your dataset. I guess it's expected since these are promoter sequences, so the C/G content is high.

I also suggest removing the -e option. These are promoter sequences, not ChIP-seq sequences. Whereas ChIP-seq sequences, it's relative reasonable that each FASTA sequence has only one motif occurrence, I don't think you can say the same about promoter sequences, which can have many motifs. The -e option will delete the whole promoter sequence each time it discovers a motif in that sequence, which I'm guessing you don't want.

daquang avatar Aug 09 '20 18:08 daquang

@daquang, thanks for your quick response! I'll definitely give that a try. Yes, I realized my silliness with including the -e option not long after making this post, so I'll remove it and let you know how it goes!

milesroberts-123 avatar Aug 09 '20 18:08 milesroberts-123

@daquang I created a dinucleotide shuffled version of my sequences like so:

fasta-shuffle-letters -kmer 2 cluster1_promoters_noreps.fasta cluster1_promoters_shuffled_noreps.fasta

Then I ran YAMDA using the shuffled sequences as background:

python run_em.py -f 0.1 -r -mins 20 -n 10 -w 8 -k 4 -a dna -i cluster1_promoters_noreps.fasta -j cluster1_promoters_shuffled_noreps.fasta -oc yamda_output

However, I'm still getting GGGGGGGG and CCCCCCCC for all of my motifs. Do you have any other suggestions?

milesroberts-123 avatar Aug 10 '20 11:08 milesroberts-123

@milesroberts-123 Hmm, why don't you try going into the erase_annoying_sequences script and modify the following line:

annoying_subsequences = [
    'AAAAAA',
    'TTTTTT',
    'CCCGCCC',
    'GGGCGGG'
]

and add 'CCCCC' and 'GGGGG' to the list of annoying sequences?

daquang avatar Aug 10 '20 18:08 daquang

@daquang Great idea! I'll give it a try.

milesroberts-123 avatar Aug 10 '20 18:08 milesroberts-123

@daquang Removing the CCCCC and GGGGG subsequences improved the results immensely! There was also a TATATA repeat (way too long to be a TATA-box) in many of my sequences, which I removed, and that helped even more! There's now a CTCTCT repeat that comes up a lot in my motifs, but I'm gonna keep it because I have reason to believe it is relevant to what I'm studying.

Thanks again for all of your help!

milesroberts-123 avatar Aug 10 '20 22:08 milesroberts-123

To anyone else that finds this thread, I found it helpful to mask all dinucleotide repeats 6bp or longer and all single nucleotide repeats 5 bp or longer when using YAMDA or other motif algorithms on promoter sequences.

annoying_subsequences = [
        'AAAAA',
        'TTTTT',
        'CCCGCCC',
        'GGGCGGG',
        'CCCCC',
        'GGGGG',
        'CTCTCT',
        'ACACAC',
        'GTGTGT',
        'GCGCGC',
        'GAGAGA',
        'TATATA'
]

milesroberts-123 avatar Aug 11 '20 11:08 milesroberts-123