codan fails and kills pipeline due to finding duplicate key(s)

Open laurabaxter21 opened this issue 1 year ago • 4 comments

I am running the latest run_finder-v1.1.0. Everything runs fine until the codan step (Braker is complete), which finds a duplicate key and kills the pipeline. Checking assemblies_psiclass_modified/combined/combined_split_transcripts_with_bad_SJ_redundancy_removed.fasta for duplicated sequence IDs, I find 2: C2.27447_0_covsplit.0 and C7.149167_0_covsplit.0. In both cases the copies sharing an ID have different sequences.

Could I just delete these from the FASTA/GTF files and continue from checkpoint 5?

assemblies_psiclass_modified/combined/cds_predict.error:

Traceback (most recent call last):
  File "/softwares/CODAN/CodAn-1.2/bin/codan.py", line 524, in <module>
    main()
  File "/softwares/CODAN/CodAn-1.2/bin/codan.py", line 506, in main
    codan_BOTH(options.transcripts, options.output_folder, options.model, options.cpu)
  File "/softwares/CODAN/CodAn-1.2/bin/codan.py", line 355, in codan_BOTH
    retrieveORF_BOTH(transcripts, outF+"minus.fa", outF)
  File "/softwares/CODAN/CodAn-1.2/bin/codan.py", line 147, in retrieveORF_BOTH
    record_dictP = SeqIO.index(transcripts, "fasta")
  File "/usr/lib/python3/dist-packages/Bio/SeqIO/__init__.py", line 979, in index
    return _IndexedSeqFileDict(
  File "/usr/lib/python3/dist-packages/Bio/File.py", line 350, in __init__
    raise ValueError("Duplicate key '%s'" % key)
ValueError: Duplicate key 'C2.27447_0_covsplit.0'
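For anyone who wants to run the same duplicate-ID check on their own FASTA file, here is a minimal sketch in plain Python. It assumes the record ID is the first whitespace-delimited token after ">", which is how Biopython keys FASTA records in SeqIO.index. The function name duplicated_fasta_ids is made up for illustration and is not part of finder or CodAn.

```python
from collections import Counter


def duplicated_fasta_ids(lines):
    """Return the sorted record IDs that occur more than once.

    `lines` is any iterable of FASTA lines (an open file handle works).
    The ID is taken as the first whitespace-delimited token after '>'.
    """
    counts = Counter(
        line[1:].split()[0]
        for line in lines
        # Only header lines with a non-empty ID are counted.
        if line.startswith(">") and line[1:].strip()
    )
    return sorted(seq_id for seq_id, n in counts.items() if n > 1)
```

Passing an open handle to the combined FASTA file, for example `duplicated_fasta_ids(open(path))`, should list the IDs that make SeqIO.index raise the ValueError above.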

laurabaxter21 avatar Mar 21 '23 22:03 laurabaxter21

Hello @laurabaxter21,

Thank you very much for your interest in finder. We have decided to focus our attention on developing the 2nd version of the software. As of now, we do not have the capabilities to support the older version due to a lack of personnel and I sincerely apologize for that. If you want to follow up on this please email me at [email protected] and I will do my best to help you out.

Thank you.

sagnikbanerjee15 avatar Mar 22 '23 00:03 sagnikbanerjee15

I am having the same issue. Did you figure out a solution?

DrDoom-EvoGen avatar Jun 07 '23 15:06 DrDoom-EvoGen

Hi, yes. As I recall, I just deleted the offending duplicated sequences from the FASTA file and their corresponding entries from the GTF file (they didn't seem critically important). Then I re-ran finder from checkpoint 5 and it completed OK.
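A rough sketch of that clean-up in plain Python, for reference. It rests on two assumptions that are not confirmed anywhere in this thread: that every copy of a duplicated ID is dropped, and that the GTF entries carry the same ID in a transcript_id attribute. The helper names drop_fasta_records and drop_gtf_entries are hypothetical, so check the result against your own files before restarting from the checkpoint.

```python
import re

# Assumption: GTF entries reference the FASTA ID as transcript_id "<ID>";
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def drop_fasta_records(lines, bad_ids):
    """Yield FASTA lines, skipping every record whose ID is in bad_ids."""
    keep = True
    for line in lines:
        if line.startswith(">"):
            tokens = line[1:].split()
            # The ID is the first whitespace-delimited token of the header.
            keep = not tokens or tokens[0] not in bad_ids
        if keep:
            yield line


def drop_gtf_entries(lines, bad_ids):
    """Yield GTF lines, skipping those whose transcript_id is in bad_ids."""
    for line in lines:
        match = TRANSCRIPT_ID.search(line)
        if match and match.group(1) in bad_ids:
            continue
        yield line
```

Both functions are generators over lines, so the filtered output can be written to new files and the originals kept as a backup.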

Hope that helps, Laura


laurabaxter21 avatar Jun 07 '23 15:06 laurabaxter21

That worked for me also.

Thank you!

Greg

DrDoom-EvoGen avatar Jun 07 '23 19:06 DrDoom-EvoGen