Repeat Masker Output Files

Open sumin5784 opened this issue 3 years ago • 4 comments

Hello, I'm trying to generate a gene annotation file using RepeatMasker. Specifically, I need the transposable elements in lncRNA sequences. Currently, I'm using the Dfam library and the RMBlast search engine.

I passed the lncRNA FASTA file to RepeatMasker:

GenomeFasta="path/to/input/fasta/file"
RepeatMasker -species human -nolow -gff -u ${GenomeFasta}

And I got output files: fa.cat, fa.masked, fa.ori.out, fa.out, fa.out.gff, fa.tbl

I have two questions:

  1. How can I view or open these files? I opened the GFF file in RStudio, but I think it's a bit different from a usual GFF file.
  2. I need a gene annotation file in GFF or GTF format. How can I convert the RepeatMasker output to a gene annotation file? I'm trying to use bedtools, but I'm not sure how to feed these output files into it.

Any feedback would be appreciated. Thank you in advance for the help!

sumin5784, Jun 25 '21 00:06

How can I view or open these files? I opened the GFF file in RStudio, but I think it's a bit different from a usual GFF file.

Each of the output files is plain text and can be opened in most text editors. Can you explain more specifically how you opened the file in RStudio (e.g. which menu options or R code), and why you think it is different from a usual GFF file?

I need a gene annotation file in GFF or GTF format. How can I convert the RepeatMasker output to a gene annotation file? I'm trying to use bedtools, but I'm not sure how to feed these output files into it.

fa.out.gff is already in GFF(2) format, but we do also provide a script util/rmOutToGFF3.pl which can be used to convert RepeatMasker .out files to GFF(3) instead.
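If it helps to confirm that the .gff file really has the usual nine tab-separated columns, here is a minimal sketch using awk. The function name `gff_column_counts` and the file name in the usage comment are only illustrative assumptions, not part of RepeatMasker.

```shell
# Hypothetical helper (not shipped with RepeatMasker):
# tally the lines of a GFF file on stdin by their number of tab-separated fields.
gff_column_counts() {
  awk -F'\t' '
    /^#/ || NF == 0 { next }            # skip header/comment lines and empty lines
    { counts[NF]++ }                    # count lines per field total
    END { for (n in counts) print n "\t" counts[n] }
  '
}

# Usage (file name is an assumption):
#   gff_column_counts < input.fa.out.gff
```

A well-formed GFF file should report a single row whose first value is 9. Any other field count points at lines worth inspecting in a text editor.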

jebrosen, Jun 25 '21 01:06

Thank you for the feedback. I just figured it out: I'm using the read_rm function from biomartr.

Still, I need gene coordinates, as in GTF format, i.e. in chromosome/start/end form. For that, I was trying to use bedtools to get the gene coordinates in BED format. But I'm a bit confused: .fa.masked is not in FASTA format, so how can I convert it? Can rmOutToGFF3.pl generate a gene annotation file with chr/start/end?

Thank you

Or can I simply change the extension from .fa.masked to .fa and feed it into bedtools?

sumin5784, Jun 25 '21 01:06

Still, I need gene coordinates, as in GTF format, i.e. in chromosome/start/end form.

Yes, that and other information is included in the .gff file.

Can rmOutToGFF3.pl generate a gene annotation file with chr/start/end?

rmOutToGFF3.pl converts RM output to GFF3, which also contains that information.
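Since the sequence name, start, and end are columns 1, 4, and 5 of a GFF line, pulling them out into BED is mostly a matter of reordering columns. Below is a rough sketch with awk, not an official utility: the function name `gff_to_bed` is made up, and it assumes the repeat name is the first double-quoted string in column 9, which may not hold for every GFF flavor.

```shell
# Hypothetical helper (not shipped with RepeatMasker):
# convert GFF lines on stdin to six-column BED on stdout.
gff_to_bed() {
  awk -F'\t' -v OFS='\t' '
    /^#/ || NF < 9 { next }             # skip header lines and lines without 9 fields
    {
      name = "."
      # Assumption: the repeat name is the first double-quoted string in column 9
      if (match($9, /"[^"]+"/)) name = substr($9, RSTART + 1, RLENGTH - 2)
      # GFF is 1-based inclusive, BED is 0-based half-open, so only the start shifts
      print $1, $4 - 1, $5, name, $6, $7
    }
  '
}

# Usage (file names are assumptions):
#   gff_to_bed < input.fa.out.gff > repeats.bed
```

The GFF score is copied through unchanged into the BED score column. Some tools expect an integer between 0 and 1000 there, so that column may need adjusting depending on what reads the file next.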

For that, I was trying to use bedtools to get the gene coordinates in BED format. But I'm a bit confused: .fa.masked is not in FASTA format, so how can I convert it? Or can I simply change the extension from .fa.masked to .fa and feed it into bedtools?

The fa.masked file is already a FASTA file - you should not even need to change the file extension in order to use it with bedtools!
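As a quick sanity check that the masked file is ordinary FASTA, something like the sketch below counts the header lines and the masked bases. The function name `fasta_mask_summary` is made up for illustration, and it assumes masking with N, which is what RepeatMasker does when neither -x nor -xsmall is given. Any N already present in the input is counted too.

```shell
# Hypothetical helper (not shipped with RepeatMasker):
# report record count, total bases, and N bases for FASTA on stdin.
fasta_mask_summary() {
  awk '
    /^>/ { records++; next }            # each header line starts one record
    {
      total += length($0)               # sequence length before any substitution
      masked += gsub(/[Nn]/, "")        # gsub returns the number of N characters removed
    }
    END { printf "records=%d bases=%d masked=%d\n", records, total, masked }
  '
}

# Usage (file name is an assumption):
#   fasta_mask_summary < input.fa.masked
```

If the record count matches the number of sequences in the input FASTA, the file should be usable wherever a FASTA file is expected.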

jebrosen, Jun 25 '21 03:06

What do the last two columns in the GFF mean? Are they counts? Why are they different? I couldn't find an explanation anywhere. Thanks!

smallfishcui, Jun 26 '23 04:06