genotyphi icon indicating copy to clipboard operation
genotyphi copied to clipboard

Port to Python 3 and other changes

Open peterk87 opened this issue 3 years ago • 1 comments

Hello,

I was recently trying to use Genotyphi for subtyping S. Typhi genomes and encountered some issues integrating the tool with the rest of the tools in a workflow, so I forked and refactored the code to be compatible with Python 3.7+ and to understand it a bit better.

I hope the changes help the project.

Thanks, Peter

Changes:

  • conversion to Python 3 compatible code with 2to3
  • Python TempDirectory for BAM to VCF
  • Python subprocess piping of output from samtools mpileup to bcftools call to reduce disk IO
  • Converted print statements to console logging statements
  • added some type hints using the Python typing module
  • fixed line endings for Linux (initially I couldn't get the script to execute due to an issue with \r characters)
  • tabs to spaces (as recommended in PEP8)
  • BioPython for FASTA parsing with SimpleFastaParser
  • default write output to standard output so users can run $genotyphi --args ... > genotyphi-output.txt or pipe it into other programs
  • removed prepending of timestamp to output file path which could cause issues when users provide an absolute file path for output
  • refactoring for readability (extracted methods, f-string literals)

Also, one thing I noticed is that multiple lists are used to describe linked data where one dict of tuples could be used instead, e.g.

loci = [1, 2, 3, ...]
snp_alleles = ['A', 'G', 'T', 'C', ...]
groups = ['0.1', '0.0.1', ...]

The above could be converted to the following:

snp_markers = {1: ('A', '0.1'), 2: ('G', '0.0.1'), ...}

This change would make it easier to add/update loci/groups/alleles, and when checking if a variant from a VCF file is one of the genotyping markers, search for the key in a snp_markers dict would be in constant time rather than linear time with loci.index(variant_location), e.g.

def checkSNP(vcf_line_split, this_groups, proportions, args):
    location = int(vcf_line_split[1])
    try:
        snp_allele, group = snp_markers[location]
        ...
    except KeyError:
        pass
    return this_groups, proportions

peterk87 avatar Sep 04 '20 20:09 peterk87

Thanks Peter. Converting to py3 is something we know needs to be done, but have been putting off doing ourselves as it was not needed for our own workflows and we have had other priorities. So - we very much appreciate you making and sharing these changes. We will review & test the code and merge if all looks good.

Re the use of dictionary - of course you are right, and the current setup is a legacy from early code development that we never went back to optimise. We are actually planning to change the code to read the genotype info into a dict from a simple text file, to make it as easy as possible to change/add new groups and also to see/reuse the genotype definitions.

katholt avatar Sep 05 '20 11:09 katholt