gapseq icon indicating copy to clipboard operation
gapseq copied to clipboard

Reconstruct models from protein fasta

Open glucksfall opened this issue 3 years ago • 2 comments

Dear @jotech,

Thank you for developing gapseq. I tried it, however, I have only protein fasta files which I think are incompatible right now. Do you plan to add support for protein fastas?

Best regards, Rodrigo

glucksfall avatar Apr 06 '21 20:04 glucksfall

Hi Rodrigo, thank you for the feature request. Supporting protein-fasta as input is on our TODO lists, but we cannot say when we will be able to implement this, as there are other improvements that we want to make before.

We think that using the genome sequence (or MAGs/bins) as some advantages over CDS/Protein-fasta files as input, because the reconstruction can provide also the information about the genomic organization of the enzyme-coding genes. This can help to further manually curate the models. But of course we also understand that protein fasta input support could be helpful to use gapseq in existing workflows.

Best Silvio

Waschina avatar Apr 09 '21 12:04 Waschina

I'd also like to see this implemented.

PedroMTQ avatar Apr 12 '21 12:04 PedroMTQ

It is now implemented. Please refer to #132

Waschina avatar Sep 30 '22 14:09 Waschina

Hello :) is it possible to use protein multifasta as input using the function "doall"? How?

Best, Arianna

arianccbasile avatar Nov 09 '23 13:11 arianccbasile

Hi Arianna,

The workflow should be able to recognise that the input is a protein multifasta. Thus, something like the following should work:

gapseq doall -K 16 ecoli.faa.gz

Or did you experience an error with protein input to "doall"?

Best Silvio

Waschina avatar Nov 09 '23 14:11 Waschina

Hi Silvio, Yes I have this final error which eventually causes the program to stop:

Error in gzfile(file, "rb") : cannot open the connection
Calls: readRDS -> gzfile
In addition: Warning message:
In gzfile(file, "rb") :
  cannot open compressed file './org-rxnWeights.RDS', probable reason 'No such file or directory'
Execution halted

But before that I have a range of other warnings/other errors

FASTA-Reader: Ignoring invalid residues at position(s): On line 23556: 2, 4-5, 8, 10, 12-13, 16, 18, 21, 23-24, 27, 30, 34-35, 38, 40-41, 45, 48, 52, 56, 58, 60, 67, 69
FASTA-Reader: Ignoring invalid residues at position(s): On line 23557: 7, 12, 15, 17-20, 23-24, 27-28, 36, 39-41, 51-53, 59, 67-68, 70
FASTA-Reader: Ignoring invalid residues at position(s): On line 23558: 3, 12-14, 24, 26, 32, 37, 39-41, 43-44, 46, 49-50, 53-55, 58, 66, 69
FASTA-Reader: Ignoring invalid residues at position(s): On line 23559: 1-4, 6, 15-17, 23, 25-26, 30, 32-33
FASTA-Reader: Ignoring invalid residues at position(s): On line 23561: 4-5, 7-8, 13-14, 20, 22, 24-25, 27, 35, 37, 47-48, 59-60, 67
FASTA-Reader: Ignoring invalid residues at position(s): On line 23562: 2, 4-5, 7, 11, 17, 19, 24, 31-34, 37-38, 44-45, 51, 56, 61, 68
FASTA-Reader: Ignoring invalid residues at position(s): On line 23563: 4-8, 10, 13-15, 19-20, 26, 31, 37, 42-43, 45, 50, 53, 57, 63, 67
FASTA-Reader: Ignoring invalid residues at position(s): On line 23564: 1, 4-6, 9-10, 13, 17, 25, 27, 30-31, 34, 37-38, 45, 52, 55, 59, 63, 68, 70
FASTA-Reader: Ignoring invalid residues at position(s): On line 23565: 1, 5, 9, 14, 17-18, 24-25, 32, 36-37, 39-40, 42-43, 45, 51, 53, 58, 61-62, 64-67
FASTA-Reader: Ignoring invalid residues at position(s): On line 23566: 3, 5, 16-17, 19, 25, 27, 29, 32-34, 37-38, 45, 47, 57, 59-60, 63, 65
FASTA-Reader: Ignoring invalid residues at position(s): On line 23567: 1, 5, 12, 15-17, 26, 28, 34, 36-37, 40-41, 52, 59, 61, 68-69
FASTA-Reader: Ignoring invalid residues at position(s): On line 23570: 5, 7, 9-10, 12-15, 18, 21, 24, 26-27, 32, 36-37, 39, 43, 50, 54-58, 61, 63-64, 66
FASTA-Reader: Ignoring invalid residues at position(s): On line 23571: 3-5, 7-8, 19, 24, 31-33, 35, 37-39, 43, 45, 47, 50, 54, 56-57, 59-60, 65-67
FASTA-Reader: Ignoring invalid residues at position(s): On line 23572: 5-6, 8, 11, 18-19, 23-25, 28-30, 34, 39, 41, 43-44, 49, 52-54, 56, 59, 70
FASTA-Reader: Ignoring invalid residues at position(s): On line 23573: 1, 3-4, 7, 17, 21, 27, 30-32, 34, 38, 41, 43, 46, 48, 50, 56-58, 61, 65, 67
FASTA-Reader: Ignoring invalid residues at position(s): On line 23574: 1, 3, 5, 8-9, 13, 15, 24-25, 28, 31-33, 35, 47, 50-51
FASTA-Reader: Ignoring invalid residues at position(s): On line 23576: 4-6, 13-15, 18, 23, 25, 32, 40-42, 45-47
FASTA-Reader: Ignoring invalid residues at position(s): On line 23578: 5, 8-10, 12, 14, 18, 21, 24, 26, 29-30, 32-33, 36-37, 40-41, 45, 48, 51-52, 59, 61-62, 64-65, 69-70
FASTA-Reader: Ignoring invalid residues at position(s): On line 23579: 2, 5-7, 9, 13, 16, 18-19, 21, 23, 29, 37, 42, 46
FASTA-Reader: Ignoring invalid residues at position(s): On line 23581: 5, 7-8, 11, 13-14, 26, 31, 34, 39-42, 45-46, 48-49, 52-54, 57, 62, 64-65, 70
FASTA-Reader: Ignoring invalid residues at position(s): On line 23582: 8, 10-12, 17
FASTA-Reader: Ignoring invalid residues at position(s): On line 23584: 2, 4-5, 7, 10, 12, 18, 29-32, 38, 41, 47, 49, 51, 53, 58, 61, 63, 66, 68
FASTA-Reader: Ignoring invalid residues at position(s): On line 23585: 2-3, 7, 9, 19-20, 26-27, 32, 37-38, 43-44, 46, 48, 52, 60, 62, 65, 69
FASTA-Reader: Ignoring invalid residues at position(s): On line 23586: 1-3, 7
FASTA-Reader: Ignoring invalid residues at position(s): On line 23588: 6-7, 9, 11, 13, 15-17, 19-20, 23, 26, 29, 33, 42, 49, 51, 54, 56, 60, 62, 65-66, 68
FASTA-Reader: Ignoring invalid residues at position(s): On line 23589: 6, 14-17, 19, 23, 31-33, 35, 50, 54, 56, 61, 63-64, 69
FASTA-Reader: Ignoring invalid residues at position(s): On line 23590: 3-6, 10-11, 14, 16, 18, 22, 25-29, 31, 36-37, 39, 41, 44, 46, 49, 61, 64-65, 67
FASTA-Reader: Ignoring invalid residues at position(s): On line 23591: 3, 5-6, 8-9, 12, 15, 17-18, 20, 25, 27, 30, 34, 38, 44, 52, 60-63, 68
FASTA-Reader: Ignoring invalid residues at position(s): On line 23592: 2-4, 7-8, 11-12, 14, 16, 21-22, 25, 37, 45, 47, 50-53, 59, 65-66
FASTA-Reader: Ignoring invalid residues at position(s): On line 23593: 5-7, 10, 17, 21, 24, 27, 30, 35, 39-40, 43
FASTA-Reader: Ignoring invalid residues at position(s): On line 23595: 5, 8, 11, 13, 16, 19, 21-22, 24, 26, 29, 32, 34-35, 37, 56, 58, 60, 62, 66, 69-70
FASTA-Reader: Ignoring invalid residues at position(s): On line 23596: 10, 12, 14, 16-17, 22, 28, 30, 34-35, 39-40, 42, 44, 52, 54, 58-59, 63, 68
FASTA-Reader: Ignoring invalid residues at position(s): On line 23597: 1-2, 4-7, 9-13, 15, 18-19, 22-23, 25, 29, 31, 33, 35, 38-39, 45, 51-52, 54, 56, 61, 63, 67, 70
FASTA-Reader: Ignoring invalid residues at position(s): On line 23598: 4, 11, 14, 16, 27, 31-32, 37, 46-48, 53, 57, 67-69
FASTA-Reader: Ignoring invalid residues at position(s): On line 23599: 1, 3, 5, 11, 14

Found transporter for substances:

Found transporter without reaction in DB:
Total number of found substance transporter for P_merdae_carveme: 0
Running time: 56 s.
[barrnap] ERROR: nhmmer failed to run - Error: Invalid alphabet type in target for nhmmer. Expect DNA or RNA.
index file P_merdae_carveme.faa.tmp.fai not found, generating...
Warning: [blastn] Query is Empty!
Warning message:
In .Call2("fasta_index", filexp_list, nrec, skip, seek.first.rec,  :
  reading FASTA file P_merdae_carveme.faa.tmp: ignored 445073 invalid one-letter sequence codes

Predicted biomass:  Gram_pos 
Error in .checkTypos(e, names_x) : 
  Object 'pathway' not found amongst DHLAXANAU-RXN, 1,2-dichloroethane dehalogenase, 3.8.1.5, V4, V5 and 14 more
Calls: build_draft_model_from_blast_results ... tryCatchList -> tryCatchOne -> <Anonymous> -> .checkTypos
No traceback available 
Error in saveRDS(mod$mod, file = paste0(model.name, "-draft.RDS")) : 
  object 'mod' not found
10: stop("Object '", used, "' not found amongst ", paste(head(ref, 
        5L), collapse = ", "), if (length(ref) <= 5L) "" else paste(" and", 
        length(ref) - 5L, "more"))
9: .checkTypos(e, names_x)
8: value[[3L]](cond)
7: tryCatchOne(expr, names, parentenv, handlers[[1L]])
6: tryCatchList(expr, classes, parentenv, handlers)
5: tryCatch(eval(.massagei(isub), x, ienv), error = function(e) {
       if (grepl(":=.*defined for use in j.*only", e$message)) 
           stop("Operator := detected in i, the first argument inside DT[...], but is only valid in the second argument, j. Most often, this happens when forgetting the first comma (e.g. DT[newvar := 5] instead of DT[ , new_var := 5]). Please double-check the syntax. Run traceback(), and debugger() to get a line number.")
       else .checkTypos(e, names_x)
   })
4: `[.data.table`(dt, pathway != "|PWY-6168|")
3: dt[pathway != "|PWY-6168|"]
2: prepare_candidate_reaction_tables(blast.res, transporter.res, 
       high.evi.rxn.BS, min.bs.for.core)
1: build_draft_model_from_blast_results(blast.res = blast.res, transporter.res = transporter.res, 
       biomass = biomass, model.name = model.name, genome.seq = genome.seq, 
       high.evi.rxn.BS = high.evi.rxn.BS, min.bs.for.core = min.bs.for.core, 
       script.dir = script.dir, pathway.pred = pathway.pred)
Error in saveRDS(mod$cand.rxns, file = paste0(model.name, "-rxnWeights.RDS")) : 
  object 'mod' not found
1: saveRDS(mod$mod, file = paste0(model.name, "-draft.RDS"))
Error in saveRDS(mod$rxn_x_genes, file = paste0(model.name, "-rxnXgenes.RDS")) : 
  object 'mod' not found
1: saveRDS(mod$cand.rxns, file = paste0(model.name, "-rxnWeights.RDS"))
Error in write_gapseq_sbml(mod$mod, paste0(model.name, "-draft")) : 
  object 'mod' not found
1: saveRDS(mod$rxn_x_genes, file = paste0(model.name, "-rxnXgenes.RDS"))
Error in gzfile(file, "rb") : cannot open the connection
Calls: readRDS -> gzfile
In addition: Warning message:
In gzfile(file, "rb") :
  cannot open compressed file './P_merdae_carveme-draft.RDS', probable reason 'No such file or directory'

arianccbasile avatar Nov 09 '23 16:11 arianccbasile

Could you check the first log lines from the doall-call?

It should look similar to this:

/tmp/tmp.UG83rhZH9N
Protein fasta detected.
Predicted taxonomy: Bacteria
Checking updates for Bacteria /home/silvio/Software/gapseq/src/../dat/seq/Bacteria
Reference sequences are up-to-date.

If in your case it doesn't say "Protein fasta detected", then there's an issue with the protein/nucleotide fasta recognition.

Waschina avatar Nov 09 '23 16:11 Waschina

Loading rhel7/default-peta4
  Loading requirement: dot slurm turbovnc/2.0.1 vgl/3.1/64 singularity/current
    rhel7/global intel/compilers/2017.4 intel/mkl/2017.4 intel/impi/2017.4/intel
    intel/libs/idb/2017.4 intel/libs/tbb/2017.4 intel/libs/ipp/2017.4
    intel/libs/daal/2017.4 intel/bundles/complib/2017.4 cmake/latest

P_merdae_carveme.faa

DEPRECATION NOTICE: Option '-k' is deprecated. To disable multi-threading use '-K 1' instead.
/tmp/tmp.oYvS5auQmh
Protein fasta detected.

Seems to be fine, also it works well with nucleotides.

Best, Arianna

arianccbasile avatar Nov 09 '23 16:11 arianccbasile

Yes, this looks fine.

It's the "FASTA-Reader: Ignoring invalid residues at position(s): On line" messages that worry me. It occurs in the transporter prediction script. Is there any line in the logs that says "Nucleotide fasta detected"? Could you provide the output of "gapseq test"?

Waschina avatar Nov 09 '23 16:11 Waschina

First I ran: grep "Nucleotide fasta detected" slurm-30880283.out And the output is empty.

Also gapseq test seems fine to me:


gapseq version: 1.2 9b30141
linux-gnu
#1 SMP Mon Oct 16 15:30:09 BST 2023 


#######################
#Checking dependencies#
#######################
GNU Awk 5.1.0, API: 3.0
sed (GNU sed) 4.8
grep (GNU grep) 3.4
This is perl 5, version 32, subversion 1 (v5.32.1) built for x86_64-linux-thread-multi
tblastn: 2.12.0+
exonerate from exonerate version 2.4.0
bedtools v2.30.0
barrnap 0.9 - rapid ribosomal RNA prediction
R version 4.1.3 (2022-03-10) -- "One Push-Up"
R scripting front-end version 4.1.3 (2022-03-10)
git version 2.35.3
GNU parallel 20220222

Missing dependencies: 0


#####################
#Checking R packages#
#####################
data.table 1.14.2 
stringr 1.4.0 
sybil 2.2.0 
getopt 1.20.3 
doParallel 1.0.17 
foreach 1.5.2 
R.utils 2.11.0 
stringi 1.7.6 
glpkAPI 1.3.3 
BiocManager 1.30.17 
Biostrings 2.62.0 
jsonlite 1.8.0 
CHNOSZ 1.4.3 

Missing R packages:  0 


##############################
#Checking basic functionality#
##############################
Optimization test: OK 
Building full model: OK 
Blast test: OK

Passed tests: 3/3

arianccbasile avatar Nov 09 '23 16:11 arianccbasile

Okay, thank you.

The issue is most likely due to your gapseq version. There was a bug-fix in September 2022 (https://github.com/jotech/gapseq/commit/642549cdbd76bd2e71efc3f629781bae60da4834) in the protein fasta handling in the transporter prediction. The version you are using (9b30141) is from April 2022.

Could you update your gapseq version?

Waschina avatar Nov 09 '23 16:11 Waschina

Hello Silvio, I updated the version but I got a similar error with the file -rxnWeights.RDS not found. I checked the head of the log and also in this case:

P_merdae_carveme.faa
/tmp/tmp.Q3weGHA9av
Protein fasta detected.

I spotted also this error which refers to a pseudo gene in my multifasta:

Parse failed (sequence file ./Pmerdae/gapseq/P_merdae_carveme.faa):
Line 289: illegal character -

And this is the output of head -n 290 to check what is going on there. I guess the - is the problem but I am not sure how to tackle it.

>lcl|CP102286.1_prot_NQ542_00215_43 [locus_tag=NQ542_00215] [protein=hydroxylamine reductase] [pseudo=true] [location=complement(53738..54120)] [gbkey=CDS]
-ALADKVVNAVKSGAIRKFVVMAGCDGRMKKRNYYTEFAEQLPDDCVILTAGFAKYRYNKLSLGDINGIP
RVLDTGQCNDSYSLALTAMKLQDVFGLEGM*PKSWSKISGSTPYKLRKKI*NSSS*H

The output of gapseq test is fine. Any other suggestion?

Best, Arianna

arianccbasile avatar Nov 10 '23 15:11 arianccbasile

Hi Arianna,

mhh, I am not sure what's happening here. Could you share all the logs or even the genome fasta file (if it is public)? When I can reproduce the error, it will be easier to locate the source of the issue.

Waschina avatar Nov 10 '23 21:11 Waschina

Hi Sivio, I probably found the origin of the problem, it seems the software cannot find the files of the database. Which is curious because it can find them when inputing a nucleotide fasta file: Checking updates for Bacteria /rds/user/ab2851/hpc-work/programs/gapseq/src/../dat/seq/Bacteria Updating reference sequences to latest version... Reference sequences updated. ATTENTION: gapseq sequence archives are missing! Sequences will be needed to be downloaded from uniprot directly which is rather slow. 3290 Error: Execution halted md5sum: 3.8.1.5.fasta: No such file or directory cat: 3.8.1.5.fasta: No such file or directory Error: Execution halted md5sum: 3.8.1.5.fasta: No such file or directory cat: 3.8.1.5.fasta: No such file or directory

I also have the suspect it is just looking in the wrong folder because inside seq/Bacteria there are sequences just divided in other subfolders. I am a bit lost here.

Best, Arianna

arianccbasile avatar Nov 24 '23 14:11 arianccbasile