EMBLmyGFF3 icon indicating copy to clipboard operation
EMBLmyGFF3 copied to clipboard

Gene sorting compared to fasta

Open stephan-nylinder opened this issue 1 year ago • 4 comments

Describe the bug When running EMBLmyGFF3 on a fasta file with a specific order of the genes, it re-orders the genes in the flat file output in what seems to be alphanumerical order based on the gene names. Input gene order is not retained.

General (please complete the following information):

  • EMBLmyGFF3 2.2
  • EMBLmyGFF3 installation/use: Manual
  • OS: macOS and Windows 11

To Reproduce Any fasta+gff Any input parameters

Expected behavior To guarantee the gene order in the flat file output to be in the same order as in the fasta input

Screenshots None

Additional context None

stephan-nylinder avatar Feb 06 '24 10:02 stephan-nylinder

This behavior is related to biopython. Could you tell what version of biopython you use? see https://github.com/NBISweden/EMBLmyGFF3/issues/70

Juke34 avatar Feb 06 '24 10:02 Juke34

Seems to be 1.80

stephan-nylinder avatar Feb 06 '24 11:02 stephan-nylinder

Chiming in here. We noted the issue of contigs not being sorted according to the input order of the assembly file. The culprit here is not biopython, but rather BCBio.GFF. In EMBLmyGFF.py you have the following. At line line 1365:

seq_dict = SeqIO.to_dict( SeqIO.parse(infasta, "fasta") )

I have verified that the input order is preserved.

Later on, at line 1484:

for record in GFF.parse(infile, base_dict=seq_dict):

is where the input order is scrambled. GFF.parse calls a function GFF.parse_in_parts which looks as follows:

def parse_in_parts(self, gff_files, base_dict=None, limit_info=None,
            target_lines=None):
        """Parse a region of a GFF file specified, returning info as generated.

        target_lines -- The number of lines in the file which should be used
        for each partial parse. This should be determined based on available
        memory.
        """
        for results in self.parse_simple(gff_files, limit_info, target_lines):
            if base_dict is None:
                cur_dict = dict()
            else:
                cur_dict = copy.deepcopy(base_dict)
            cur_dict = self._results_to_features(cur_dict, results)
            all_ids = list(cur_dict.keys())
            all_ids.sort()
            for cur_id in all_ids:
                yield cur_dict[cur_id]

By commenting out the line all_ids.sort I can preserve input order.

percyfal avatar Mar 28 '24 10:03 percyfal

Excellent, thank you for the feedback @percyfal !

Juke34 avatar Mar 28 '24 12:03 Juke34