MIDAS Build own database: <genome

Hi, I would like to use MIDAS with my own genomes as they are not present in the reference database.

I'm able to get all the required input files but I'm not sure how to get the <genome_id>.genes file:

gene_id (CHAR) scaffold_id (CHAR) start (INT) end (INT) strand (+ or -) gene_type (CDS or RNA)

Is there any program which can produce that kind of file as as output, or do you have a recommendation about how to get that kind of file?

Thank you very much in advance.

Feb 23 '18 10:02 palomo11

If you run Prodigal on your .fna file, it will produce the necessary genes files (.ffn and .faa) as well as a .gff file that contains the coordinates of the genes. You can then convert the .gff to the .genes format. When I get the chance, I can add support for .gff files instead of the non-standard .genes files.

Thanks, Stephen

Feb 23 '18 20:02 snayfach

Hi Stephen,

Thanks for the advice. That's what I've done and it is working so far.

Mar 02 '18 14:03 palomo11

Hello, I am also trying to build my own database, I was just curious how can you convert the .gff file into the .genes file, is it just by renaming the file and changing the extension name from ".gff" to ".genes"? Or does MIDAS now support for .gff files instead of the .genes files?

I would appreciate any guidance on this!

Oct 12 '21 22:10 cirodri1

Hi,

Not sure if MIDAS now support *gff, otherwise you can get the .genes files as follows (I guess there must be simplest way of doing it, but this one works well for me):

#Starting with ".faa" generated with Prodigal for file in ./Genome_/.faa

do

awk 'sub(/^>/, "")' < $file > ${file%.faa}.txt

awk -F" # " '$1=$1' OFS="\t" ${file%.faa}.txt > ${file%.faa}_OK.txt

awk '{gsub("-1","-",$4)}1' OFS="\t" ${file%.faa}_OK.txt > ${file%.faa}_OK2.txt

awk '{gsub("1","+",$4)}1' OFS="\t" ${file%.faa}_OK2.txt > ${file%.faa}_OK3.txt

awk 'match($1,/_[0-9]+$/) {printf("%s\t%s\n", $1, substr($1,0,RSTART), substr($1,RSTART,RLENGTH))}' ${file%.faa}_OK3.txt > ${file%.faa}_OK4.txt

sed -i 's/.$//' ${file%.faa}_OK4.txt

awk -vOFS='\t' '{$5 = "CDS"; print}' ${file%.faa}_OK3.txt > ${file%.faa}_OK3_OK.txt

awk -vOFS='\t' 'NR==FNR {h[$1] = $2; next} {print $1,h[$1],$2,$3,$4,$5}' ${file%.faa}_OK4.txt ${file%.faa}_OK3_OK.txt > ${file%.faa}.genes

sed -i '1 i\gene_id\tscaffold_id\tstart\tend\tstrand\tgene_type' ${file%.faa}.genes

rm ./Genome_/.txt

done

Oct 13 '21 02:10 palomo11

Dear @palomo11 Thank you so much for your help! I tried the code you kindly shared with me but I got the following error: "sed: 1: "8_1C_n2-B.animalis_P19_ ...": invalid command code _ awk: invalid -v option

awk: invalid -v option"

The name of my .faa file is 8_1C_n2-B.animalis_P19.faa and the code I ran was: `for file in *.faa

do

awk 'sub(/^>/, "")' < $file > ${file%.faa}.txt

awk -F" # " '$1=$1' OFS="\t" ${file%.faa}.txt > ${file%.faa}_OK.txt

awk '{gsub("-1","-",$4)}1' OFS="\t" ${file%.faa}_OK.txt > ${file%.faa}_OK2.txt

awk '{gsub("1","+",$4)}1' OFS="\t" ${file%.faa}_OK2.txt > ${file%.faa}_OK3.txt

awk 'match($1,/_[0-9]+$/) {printf("%s\t%s\n", $1, substr($1,0,RSTART), substr($1,RSTART,RLENGTH))}' ${file%.faa}_OK3.txt > ${file%.faa}_OK4.txt

sed -i 's/.$//' ${file%.faa}_OK4.txt

awk -vOFS='\t' '{$5 = "CDS"; print}' ${file%.faa}_OK3.txt > ${file%.faa}_OK3_OK.txt

awk -vOFS='\t' 'NR==FNR {h[$1] = $2; next} {print $1,h[$1],$2,$3,$4,$5}' ${file%.faa}_OK4.txt ${file%.faa}_OK3_OK.txt > ${file%.faa}.genes

sed -i '1 i\gene_id\tscaffold_id\tstart\tend\tstrand\tgene_type' ${file%.faa}.genes

done`

I am attaching the files I got by running that code. Do you have any more guidance on this? Any help would be greatly appreciated!

8_1C_n2-B.animalis_P19_OK2.txt 8_1C_n2-B.animalis_P19_OK.txt 8_1C_n2-B.animalis_P19_OK3.txt 8_1C_n2-B.animalis_P19_OK4.txt 8_1C_n2-B.animalis_P19.txt

I also get files with names 8_1C_n2-B.animalis_P19_OK3_OK.txt 8_1C_n2-B.animalis_P19.genes but these are empty

Oct 16 '21 00:10 cirodri1

I think there should be a space after -v:

awk -v OFS='\t' '{$5 = "CDS"; print}' ${file%.faa}_OK3.txt > ${file%.faa}_OK3_OK.txt

awk -v OFS='\t' 'NR==FNR {h[$1] = $2; next} {print $1,h[$1],$2,$3,$4,$5}' ${file%.faa}_OK4.txt ${file%.faa}_OK3_OK.txt > ${file%.faa}.genes

sed -i '1 i\gene_id\tscaffold_id\tstart\tend\tstrand\tgene_type' ${file%.faa}.genes

Oct 24 '21 09:10 palomo11

Dear @palomo11 ,

Thank you so much for your help!

I was able to obtain a ".genes" file that is not empty, but I still get an error at the end that says "sed: 1: "8_1C_n2-B.animalis_P19_ ...": invalid command code _ sed: 1: "8_1C_n2-B.animalis_P19. ...": invalid command code _ "

The name of my .faa file is 8_1C_n2-B.animalis_P19.faa and the code I ran was:

`for file in *.faa

do

awk 'sub(/^>/, "")' < $file > ${file%.faa}.txt

awk -F" # " '$1=$1' OFS="\t" ${file%.faa}.txt > ${file%.faa}_OK.txt

awk '{gsub("-1","-",$4)}1' OFS="\t" ${file%.faa}_OK.txt > ${file%.faa}_OK2.txt

awk '{gsub("1","+",$4)}1' OFS="\t" ${file%.faa}_OK2.txt > ${file%.faa}_OK3.txt

awk 'match($1,/_[0-9]+$/) {printf("%s\t%s\n", $1, substr($1,0,RSTART), substr($1,RSTART,RLENGTH))}' ${file%.faa}_OK3.txt > ${file%.faa}_OK4.txt

sed -i 's/.$//' ${file%.faa}_OK4.txt

awk -v OFS='\t' '{$5 = "CDS"; print}' ${file%.faa}_OK3.txt > ${file%.faa}_OK3_OK.txt

awk -v OFS='\t' 'NR==FNR {h[$1] = $2; next} {print $1,h[$1],$2,$3,$4,$5}' ${file%.faa}_OK4.txt ${file%.faa}_OK3_OK.txt > ${file%.faa}.genes

sed -i '1 i\gene_id\tscaffold_id\tstart\tend\tstrand\tgene_type' ${file%.faa}.genes

done`

I am attaching below the .genes file I obtained (I changed the extension to .txt so I could attach it here), I still do not see the correct columns that MIDAS asks for: gene_id (CHAR) scaffold_id (CHAR) start (INT) end (INT) strand (+ or -) gene_type (CDS or RNA)

It seems I only have 3 columns, one with genome ID and a number attached to it, a possible protein name, and the gene type "CDS." 8_1C_n2-B.animalis_P19.genes.txt

Do you have any insights into what could be going wrong with the sed command?

Oct 25 '21 19:10 cirodri1

Hello! I just wanted to post an update for those struggling to make a ".genes" file. I was able to convert a .gff file obtained from Prodigal to a .genes file with the correct columns by using the program csvtk and doing the steps below:

To count number of columns in a .gff file:

csvtk dim --cols -t 8_1C_n2-B.animalis_P19.gff

Notes: The CSV parser requires all the lines have same number of fields/columns. Even lines with spaces will cause error. Use '-I/--ignore-illegal-row' to skip these lines if necessary. By default, csvtk handles CSV files, use flag -t for tab-delimited files.

To select fields/columns:

csvtk cut -f 1,3-5,7,9 --ignore-illegal-row -t 8_1C_n2-B.animalis_P19.gff > 8_1C_n2-B.animalis_P19.genes

Check the number of columns on the resulting .genes file:

csvtk dim --cols -t 8_1C_n2-B.animalis_P19.genes Result: 6

To rename fields/columns in genes file:

csvtk rename -f 1-6 -t -n scaffold_id,gene_type,start,end,strand,gene_id 8_1C_n2-B.animalis_P19.genes > 8_1C_n2-B.animalis_P19_renamed_columns.genes

Nov 30 '21 22:11 cirodri1

MIDAS
MIDAS copied to clipboard

Build own database: <genome_id>.genes

MIDAS MIDAS copied to clipboard

Build own database: <genome_id>.genes

MIDAS
MIDAS copied to clipboard