N's present in assembly after softmasking

Open NicMAlexandre opened this issue 3 years ago • 2 comments

Hello,

After generating a repeat library with RepeatModeler and other tools such as LTRharvest and LTR_retriever, I ran RepeatMasker with that library.

The assembly I provided does not contain any N's, so I am confused as to why the softmasked genome produced by the following command now has N's.

perl RepeatMasker/./RepeatMasker -gff -pa 24 -lib RM_Radish.fa consensus2.frby.2.fasta.modified -xsmall -dir Softmask

I get a total of 141 N's in the output assembly. Do you know what the cause of this could be?
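For anyone wanting to reproduce the count, here is a minimal sketch of how the input and the softmasked output could be compared. The function names `count_n` and `base_composition` are made up for illustration; they are not part of RepeatMasker. The second function simply tallies every character in the sequence lines, which would show whether the input contains anything other than A, C, G and T.

```shell
#!/usr/bin/env bash

# Count N/n characters in the sequence lines of a FASTA file.
# Header lines (starting with '>') are skipped so an 'N' in a name is not counted.
count_n() {
  grep -v '^>' "$1" | tr -cd 'Nn' | wc -c | tr -d ' '
}

# Tally every distinct character in the sequence lines (case-folded to upper case).
# Prints one "CHAR COUNT" pair per line, sorted by character.
base_composition() {
  awk '
    !/^>/ {
      n = length($0)
      for (i = 1; i <= n; i++) c[toupper(substr($0, i, 1))]++
    }
    END { for (k in c) print k, c[k] }
  ' "$1" | sort
}
```

Usage would be something like `count_n consensus2.frby.2.fasta.modified` followed by `count_n Softmask/consensus2.frby.2.fasta.modified.masked`, assuming the masked file follows RepeatMasker's usual `<input>.masked` naming inside the `-dir` directory.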

NicMAlexandre avatar Jul 13 '21 16:07 NicMAlexandre

I get a total of 141 N's in the output assembly. Do you know what the cause of this could be?

That is unusual. Do the Ns in the masked output match up with the positions of repeats listed in the .out report file?
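One way to check this is sketched below. It lists the 1-based position of every N in the masked FASTA and labels each one by whether it falls inside an interval from the .out report. The helper names `n_positions` and `annotate_n` are hypothetical, and the script assumes the usual .out column layout (query name in column 5, begin in column 6, end in column 7) and a non-empty report.

```shell
#!/usr/bin/env bash

# Print "seq<TAB>pos" (1-based, per sequence) for every N/n in a FASTA file.
n_positions() {
  awk '
    /^>/ { name = substr($1, 2); pos = 0; next }   # ID = first word of the header
    {
      line = toupper($0)
      n = length(line)
      for (i = 1; i <= n; i++)
        if (substr(line, i, 1) == "N") print name "\t" (pos + i)
      pos += n                                     # carry offset across wrapped lines
    }
  ' "$1"
}

# Label each N as "in_repeat" or "outside" relative to the .out intervals.
# usage: annotate_n masked.fasta report.out
annotate_n() {
  n_positions "$1" | awk '
    NR == FNR {                                    # first input: the .out report
      if ($1 ~ /^[0-9]+$/) {                       # data rows start with the SW score
        k = ++cnt[$5]
        b[$5, k] = $6 + 0
        e[$5, k] = $7 + 0
      }
      next
    }
    {                                              # second input: N positions on stdin
      hit = "outside"
      for (k = 1; k <= cnt[$1]; k++)
        if ($2 + 0 >= b[$1, k] && $2 + 0 <= e[$1, k]) { hit = "in_repeat"; break }
      print $1 "\t" $2 "\t" hit
    }
  ' "$2" -
}
```

If every N comes back as `in_repeat`, the Ns are tied to annotated repeats; if some are `outside`, they come from somewhere else in the processing.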

jebrosen avatar Jul 14 '21 21:07 jebrosen

Hi Jeb,

I reached out to Robert Hubley via email and sent him the relevant files, and he said he would get back to me. I will let you know once I get a response. There are very few N's relative to the total number of masked bases. Here are the steps I used to get to this point:

  1. RepeatModeler-2.0.1/BuildDatabase -name Radish consensus2.frby.2.fasta

  #repbase_radish.fasta taken from repbase website.

  2. cp repbase_radish.fasta RepeatModeler-2.0.1/Libraries/RepeatMasker.Lib

  3. RepeatModeler-2.0.1/RepeatModeler -database Radish -pa 5

  4. genometools/bin/./gt suffixerator -db consensus2.frby.2.fasta -indexname \t Radish_harvest.fa -tis -suf -lcp -des -ssp -sds -dna

  5. genometools/bin/./gt ltrharvest -index Radish_harvest.fa > genome.fa.harvest.scn

  6. LTR_FINDER_parallel/./LTR_FINDER_parallel -seq consensus2.frby.2.fasta -threads 10 -harvest_out -size 1000000 -time 300

  7. cat genome.fa.harvest.scn consensus2.frby.2.fasta.finder.combine.scn > genome.fa.rawLTR.scn

  #Sequence names are too long in my genome so I need to remove some header

  8. sed 's/|.*$//' consensus2.frby.2.fasta > consensus2.frby.2.fasta.modified

  9. LTR_retriever/./LTR_retriever -genome consensus2.frby.2.fasta.modified -inharvest genome.fa.rawLTR.scn -threads 10

  10. cat consensus2.frby.2.fasta.modified.LTRlib.redundant.fa RM_24887.MonNov91320352020/consensi.fa.classified > RM_Radish.fa

  11. perl RepeatMasker/./RepeatMasker -gff -pa 24 -lib RM_Radish.fa consensus2.frby.2.fasta.modified -xsmall -dir Softmask
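Since the `sed` step truncates every header at the first `|`, a quick sanity check on the modified FASTA could look like the sketch below. The helper names `duplicate_ids` and `long_ids` are made up for illustration, and the default length limit of 50 characters is an assumption that can be overridden with a second argument.

```shell
#!/usr/bin/env bash

# Print sequence IDs (first word of each header) that occur more than once.
duplicate_ids() {
  awk '/^>/ { print substr($1, 2) }' "$1" | sort | uniq -d
}

# Print IDs longer than max_len characters.
# usage: long_ids file.fasta [max_len]   (max_len defaults to 50, an assumed limit)
long_ids() {
  awk -v max="${2:-50}" '
    /^>/ { id = substr($1, 2); if (length(id) > max) print id }
  ' "$1"
}
```

Both functions printing nothing for `consensus2.frby.2.fasta.modified` would mean the trimmed names are still unique and within the chosen length.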

Nicolas

Best,

Nicolas Alexandre
PhD Candidate, Integrative Biology
Whiteman Lab, University of California - Berkeley

NicMAlexandre avatar Jul 14 '21 21:07 NicMAlexandre