pyani
`pyani anim` loads sequences into database before checking class/label files
Summary:
The `pyani anim` command should check that the class/label files are correctly formatted before committing sequences to the database. This would save time when one of those files contains an error.
Description:
Currently, all sequences are processed and loaded into the database, and then label/class files are checked. This slows things down if there's an error. It would be better to check the formats first.
pyani Version:
v0.3.0dev
Python Version:
3.6
Operating System:
CentOS6
@widdowquinn, is the checking you mean here the `if inhash in label_dict:` bit? I'm trying to determine if this issue is still relevant, and if so, what should be done about it. I believe the function below is the only thing called from `subcmd_anim` that uses `labels` and `classes`.
The only thing that looks like it deals with the formatting of those `labels` and `classes` is the `load_classes_labels()` call near the top.
```python
def add_run_genomes(
    session, run, indir: Path, classpath: Path, labelpath: Path, **kwargs
) -> List:
    """Add genomes for a run to the database.

    :param session: live SQLAlchemy session of pyani database
    :param run: Run object describing the parent pyani run
    :param indir: path to the directory containing genomes
    :param classpath: path to the file containing class information for each genome
    :param labelpath: path to the file containing class information for each genome

    This function expects a single directory (indir) containing all FASTA files
    for a run, and optional paths to plain text files that contain information
    on class and label strings for each genome.

    If the genome already exists in the database, then a Genome object is recovered
    from the database. Otherwise, a new Genome object is created. All Genome objects
    will be associated with the passed Run object.

    The session changes are committed once all genomes and labels are added to the
    database without error, as a single transaction.
    """
    # Get list of genome files and paths to class and labels files
    infiles = get_fasta_and_hash_paths(indir)  # paired FASTA/hash files
    class_data = {}  # type: Dict[str,str]
    label_data = {}  # type: Dict[str,str]
    all_keys = []  # type: List[str]
    if classpath:
        class_data = load_classes_labels(classpath)
        all_keys += list(class_data.keys())
    if labelpath:
        label_data = load_classes_labels(labelpath)
        all_keys += list(label_data.keys())

    # Make dictionary of labels and/or classes
    new_keys = set(all_keys)
    label_dict = {}  # type: Dict
    for key in new_keys:
        label_dict[key] = LabelTuple(label_data[key] or "", class_data[key] or "")

    # Get hash and sequence description for each FASTA/hash pair, and add
    # to current session database
    genome_ids = []
    for fastafile, hashfile in infiles:
        try:
            inhash, _ = read_hash_string(hashfile)
            indesc = read_fasta_description(fastafile)
        except Exception:
            raise PyaniORMException("Could not read genome files for database import")
        abspath = fastafile.absolute()
        genome_len = get_genome_length(abspath)

        # If the genome is not already in the database, add it as a Genome object
        genome = session.query(Genome).filter(Genome.genome_hash == inhash).first()
        if not isinstance(genome, Genome):
            try:
                genome = Genome(
                    genome_hash=inhash,
                    path=str(abspath),
                    length=genome_len,
                    description=indesc,
                )
                session.add(genome)
            except Exception:
                raise PyaniORMException(f"Could not add genome {genome} to database")

        # Associate this genome with the current run
        try:
            genome.runs.append(run)
        except Exception:
            raise PyaniORMException(
                f"Could not associate genome {genome} with run {run}"
            )

        # If there's an associated class or label for the genome, add it
        if inhash in label_dict:
            try:
                session.add(
                    Label(
                        genome=genome,
                        run=run,
                        label=label_dict[inhash].label,
                        class_label=label_dict[inhash].class_label,
                    )
                )
            except Exception:
                raise PyaniORMException(
                    f"Could not add labels for {genome} to database."
                )
        genome_ids.append(genome.genome_id)

    try:
        session.commit()
    except Exception:
        raise PyaniORMException("Could not commit new genomes in database.")

    return
```
@widdowquinn Is this issue still relevant?
I think I was referring to the order of processing, which was - approximately:
(1)
- Parse genome file and process
- Add genome to the database
- Parse labels/classes and process
- Add labels/classes info to the database
This meant that it was possible to generate a new row in the database, but then for the run to fail because of a formatting or other error in the labels/classes files. This could probably be dealt with at the same time as #136
A more sensible order of processing would be:
(2)
- Parse genome files and process
- Parse labels/classes and process
- Add genomes/labels/classes info to the database
It's still relevant if the order of operations looks like (1) and not like (2).
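For illustration, ordering (2) can be sketched as below. This is a minimal, self-contained sketch and not pyani's implementation: it uses the standard-library `sqlite3` module as a stand-in for the SQLAlchemy session, and the names `create_schema`, `parse_hash_table` and `add_genomes_and_labels`, the toy table layout, and the two-column tab-separated file format are all assumptions made for the example. The point it shows is that every input file is parsed and validated before the first write, and that all writes happen inside one transaction.

```python
import sqlite3
from pathlib import Path
from typing import Dict, List, Optional, Tuple


def create_schema(conn: sqlite3.Connection) -> None:
    """Create two toy tables standing in for the real database schema."""
    conn.execute("CREATE TABLE genomes (genome_hash TEXT PRIMARY KEY, path TEXT)")
    conn.execute(
        "CREATE TABLE labels (genome_hash TEXT, label TEXT, class_label TEXT)"
    )


def parse_hash_table(path: Path) -> Dict[str, str]:
    """Parse an assumed '<hash><TAB><value>' file, failing on malformed lines."""
    table = {}  # type: Dict[str, str]
    for lineno, line in enumerate(path.read_text().splitlines(), start=1):
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split("\t")
        if len(fields) != 2:
            raise ValueError(
                f"{path}:{lineno}: expected 2 tab-separated fields, "
                f"got {len(fields)}"
            )
        table[fields[0]] = fields[1]
    return table


def add_genomes_and_labels(
    conn: sqlite3.Connection,
    genomes: List[Tuple[str, str]],  # (genome_hash, path) pairs, already parsed
    labelpath: Optional[Path],
    classpath: Optional[Path],
) -> None:
    # Parse labels/classes first: a formatting error raises here, before
    # anything has been written to the database.
    labels = parse_hash_table(labelpath) if labelpath else {}
    classes = parse_hash_table(classpath) if classpath else {}

    # Only now touch the database. The connection context manager commits
    # if the block succeeds and rolls back if it raises.
    with conn:
        for genome_hash, path in genomes:
            conn.execute(
                "INSERT OR IGNORE INTO genomes VALUES (?, ?)", (genome_hash, path)
            )
            if genome_hash in labels or genome_hash in classes:
                conn.execute(
                    "INSERT INTO labels VALUES (?, ?, ?)",
                    (
                        genome_hash,
                        labels.get(genome_hash, ""),
                        classes.get(genome_hash, ""),
                    ),
                )
```

With this ordering, a malformed labels or classes file stops the run before any genome row exists, so there is nothing to clean up afterwards.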