`pyani anim` loads sequences into database before checking class/label files

Open widdowquinn opened this issue 5 years ago • 3 comments

Summary:

The `pyani anim` command should check that the class and label files are correctly formatted before committing sequences to the database. This would save time when those files contain errors.

Description:

Currently, all sequences are processed and loaded into the database, and only then are the label/class files checked. If one of those files contains an error, the time spent loading the sequences is wasted. It would be better to check the formats first.

pyani Version:

v0.3.0dev

Python Version:

3.6

Operating System:

CentOS6

widdowquinn avatar Mar 14 '19 12:03 widdowquinn

@widdowquinn, is the checking you mean here the `if inhash in label_dict:` bit? I'm trying to determine whether this issue is still relevant and, if so, what should be done about it. I believe the function below is the only thing called from `subcmd_anim` that uses labels and classes.

The only part that looks like it deals with the formatting of those labels and classes is the `load_classes_labels()` call near the top.

def add_run_genomes(
    session, run, indir: Path, classpath: Path, labelpath: Path, **kwargs
) -> List:
    """Add genomes for a run to the database.
    :param session:       live SQLAlchemy session of pyani database
    :param run:           Run object describing the parent pyani run
    :param indir:         path to the directory containing genomes
    :param classpath:     path to the file containing class information for each genome
    :param labelpath:     path to the file containing label information for each genome
    This function expects a single directory (indir) containing all FASTA files
    for a run, and optional paths to plain text files that contain information
    on class and label strings for each genome.
    If the genome already exists in the database, then a Genome object is recovered
    from the database. Otherwise, a new Genome object is created. All Genome objects
    will be associated with the passed Run object.
    The session changes are committed once all genomes and labels are added to the
    database without error, as a single transaction.
    """
    # Get list of genome files and paths to class and labels files
    infiles = get_fasta_and_hash_paths(indir)  # paired FASTA/hash files
    class_data = {}  # type: Dict[str,str]
    label_data = {}  # type: Dict[str,str]
    all_keys = []  # type: List[str]
    if classpath:
        class_data = load_classes_labels(classpath)
        all_keys += list(class_data.keys())
    if labelpath:
        label_data = load_classes_labels(labelpath)
        all_keys += list(label_data.keys())

    # Make dictionary of labels and/or classes
    new_keys = set(all_keys)
    label_dict = {}  # type: Dict
    for key in new_keys:
        # Use .get() so a key present in only one of the two files does not raise KeyError
        label_dict[key] = LabelTuple(
            label_data.get(key) or "", class_data.get(key) or ""
        )

    # Get hash and sequence description for each FASTA/hash pair, and add
    # to current session database
    genome_ids = []
    for fastafile, hashfile in infiles:
        try:
            inhash, _ = read_hash_string(hashfile)
            indesc = read_fasta_description(fastafile)
        except Exception:
            raise PyaniORMException("Could not read genome files for database import")
        abspath = fastafile.absolute()
        genome_len = get_genome_length(abspath)
        # If the genome is not already in the database, add it as a Genome object
        genome = session.query(Genome).filter(Genome.genome_hash == inhash).first()
        if not isinstance(genome, Genome):
            try:
                genome = Genome(
                    genome_hash=inhash,
                    path=str(abspath),
                    length=genome_len,
                    description=indesc,
                )
                session.add(genome)
            except Exception:
                raise PyaniORMException(f"Could not add genome {genome} to database")

        # Associate this genome with the current run
        try:
            genome.runs.append(run)
        except Exception:
            raise PyaniORMException(
                f"Could not associate genome {genome} with run {run}"
            )

        # If there's an associated class or label for the genome, add it
        if inhash in label_dict:
            try:
                session.add(
                    Label(
                        genome=genome,
                        run=run,
                        label=label_dict[inhash].label,
                        class_label=label_dict[inhash].class_label,
                    )
                )
            except Exception:
                raise PyaniORMException(
                    f"Could not add labels for {genome} to database."
                )
        genome_ids.append(genome.genome_id)

    try:
        session.commit()
    except Exception:
        raise PyaniORMException("Could not commit new genomes in database.")

    return genome_ids

baileythegreen avatar Apr 13 '22 14:04 baileythegreen

@widdowquinn Is this issue still relevant?

baileythegreen avatar Apr 29 '22 10:04 baileythegreen

I think I was referring to the order of processing, which was - approximately:

(1)

  1. Parse genome file and process
  2. Add genome to the database
  3. Parse labels/classes and process
  4. Add labels/classes info to the database

This meant that it was possible to create a new row in the database and then have the run fail because of a formatting or other error in the labels/classes files. This could probably be dealt with at the same time as #136.

A more sensible order of processing would be:

(2)

  1. Parse genome files and process
  2. Parse labels/classes and process
  3. Add genomes/labels/classes info to the database

It's still relevant if the order of operations looks like (1) and not like (2).
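As a rough illustration of ordering (2), here is a minimal, self-contained sketch. It is not pyani code: it uses the standard-library `sqlite3` module instead of the SQLAlchemy session, and `make_db`, `parse_labels`, `add_run`, `LabelFormatError`, the table layout, and the two-column tab-separated labels format are all hypothetical names and assumptions made for the example. Genomes are assumed to have been parsed already into `(hash, path)` pairs.

```python
import sqlite3
from typing import Dict, List, Tuple


class LabelFormatError(ValueError):
    """Raised when a labels/classes line is malformed (hypothetical name)."""


def make_db() -> sqlite3.Connection:
    """Create an in-memory database with a toy schema (assumed, not pyani's)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE genomes (hash TEXT PRIMARY KEY, path TEXT)")
    conn.execute("CREATE TABLE labels (hash TEXT, label TEXT)")
    return conn


def parse_labels(lines: List[str]) -> Dict[str, str]:
    """Parse 'hash<TAB>label' lines without touching the database.

    The two-column, tab-separated layout is an assumption for this sketch.
    """
    parsed = {}  # type: Dict[str, str]
    for lineno, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 2 or not all(fields):
            raise LabelFormatError(
                f"line {lineno}: expected 2 non-empty tab-separated fields"
            )
        parsed[fields[0]] = fields[1]
    return parsed


def add_run(
    conn: sqlite3.Connection,
    genomes: List[Tuple[str, str]],
    label_lines: List[str],
) -> int:
    """Validate all inputs first, then write everything in one transaction."""
    # Steps 1 and 2: parsing and validation; a format error is raised here,
    # before any row has been written.
    labels = parse_labels(label_lines)

    # Step 3: write genomes and labels together. Used as a context manager,
    # the connection commits on success and rolls back on an exception.
    with conn:
        conn.executemany("INSERT INTO genomes (hash, path) VALUES (?, ?)", genomes)
        conn.executemany(
            "INSERT INTO labels (hash, label) VALUES (?, ?)", list(labels.items())
        )
    return len(genomes)
```

With this ordering, a malformed labels file raises `LabelFormatError` before any genome row is inserted, so a failed run leaves the database unchanged.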

widdowquinn avatar Apr 29 '22 11:04 widdowquinn