python-bioformats
Inefficient retrieval of WellSample attribute values
Getting attribute values of a WellSample object is very inefficient. This becomes a major bottleneck when working with a large number of images, for example in the case of a 384-well plate.
Here is a small test program that demonstrates the problem and a potential solution:
#!/usr/bin/env python
import bioformats
import itertools
import logging
import argparse
import time

FORMAT = '%(asctime)-15s | %(message)s'
logging.basicConfig(format=FORMAT)
logger = logging.getLogger('test')

logging_level_verbosity = {
    0: logging.CRITICAL,
    1: logging.INFO,
    2: logging.DEBUG,
}

if __name__ == '__main__':

    parser = argparse.ArgumentParser('bioformats test program')
    parser.add_argument('-v', dest='verbosity', action='count', default=0,
                        help='increase logging verbosity')
    parser.add_argument('--hashing', action='store_true',
                        help='store reference image IDs in dictionary')
    args = parser.parse_args()
    logger.setLevel(logging_level_verbosity[args.verbosity])

    start_time_1 = time.time()

    metadata = bioformats.OMEXML()
    logger.info('prepare first plate')
    plate = metadata.PlatesDucktype(metadata.root_node).newPlate(name='default')
    plate.RowNamingConvention = 'letter'
    plate.ColumnNamingConvention = 'number'
    plate.Rows = 16
    plate.Columns = 24
    n_samples = 48

    logger.info('process first plate')
    well_count = 0
    sample_count = 0
    lookup = dict()
    for r, c in itertools.product(range(plate.Rows), range(plate.Columns)):
        logger.info('process well # %d: %s%.2d', well_count+1, chr(r+1+64), c+1)
        well = metadata.WellsDucktype(plate).new(row=r, column=c)
        samples = metadata.WellSampleDucktype(well.node)
        for s in xrange(n_samples):
            logger.debug('process sample # %d', s)
            samples.new(index=s)
            samples[s].ImageRef = 'Image:%d' % sample_count
            # Store the attribute value in a dictionary for subsequent hashing
            lookup[(r, c, s)] = samples[s].ImageRef
            sample_count += 1
        well_count += 1

    end_time_1 = time.time() - start_time_1
    logger.info('processing first plate took %s seconds', end_time_1)

    start_time_2 = time.time()

    logger.info('prepare second plate')
    new_metadata = bioformats.OMEXML()
    new_plate = new_metadata.PlatesDucktype(
        new_metadata.root_node).newPlate(name='default')
    new_plate.RowNamingConvention = 'letter'
    new_plate.ColumnNamingConvention = 'number'
    new_plate.Rows = plate.Rows
    new_plate.Columns = plate.Columns
    n_samples = len(plate.Well[0].Sample)

    logger.info('process second plate')
    for w, well_id in enumerate(plate.Well):
        logger.info('process well # %d: %s', w+1, well_id)
        # note: the new wells must be attached to new_plate, not the source plate
        well = new_metadata.WellsDucktype(new_plate).new(
            row=plate.Well[w].Row, column=plate.Well[w].Column)
        samples = new_metadata.WellSampleDucktype(well.node)
        for s in xrange(n_samples):
            logger.debug('process well sample # %d', s)
            samples.new(index=s)
            if args.hashing:
                samples[s].ImageRef = lookup[(well.Row, well.Column, s)]
            else:
                samples[s].ImageRef = plate.Well[w].Sample[s].ImageRef

    end_time_2 = time.time() - start_time_2
    logger.info('processing second plate took %s seconds', end_time_2)
Running the program with the --hashing argument

    test_bioformats.py -v --hashing

takes less than 5 minutes:

    2015-12-02 11:04:13,044 | processing second plate took 206.617892981 seconds

while running it in default mode

    test_bioformats.py -v

takes almost 1 hour:

    2015-12-02 12:17:57,098 | processing second plate took 3459.09807611 seconds
I assume the problem lies in the implementation of the __getitem__ methods of the WellsDucktype and WellSampleDucktype classes; the major bottleneck is __getitem__ of WellsDucktype.
Changing the code in the above example to:
logger.info('process second plate')
for w, well_id in enumerate(plate.Well):
    logger.info('process well # %d: %s', w+1, well_id)
    well = new_metadata.WellsDucktype(new_plate).new(
        row=plate.Well[w].Row, column=plate.Well[w].Column)
    samples = new_metadata.WellSampleDucktype(well.node)
    currently_processed_well = plate.Well[w]  # look up the well object only once
    for s in xrange(n_samples):
        logger.debug('process well sample # %d', s)
        samples.new(index=s)
        if args.hashing:
            samples[s].ImageRef = lookup[(well.Row, well.Column, s)]
        else:
            samples[s].ImageRef = currently_processed_well.Sample[s].ImageRef
boosts the performance of

    test_bioformats.py -v

to

    2015-12-02 12:39:40,283 | processing second plate took 284.545585871 seconds
The for loops in __getitem__ should probably be replaced by hashing.
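To illustrate, here is a minimal sketch of what a hashed lookup could look like, assuming __getitem__ currently scans all child nodes on every access; the CachedWells class, the _index attribute, and the node layout are hypothetical stand-ins, not the actual omexml.py internals:

# Hypothetical sketch: replace the per-lookup linear scan with a
# dictionary that is built once and reused.
class CachedWells(object):
    def __init__(self, well_nodes):
        self.well_nodes = well_nodes   # list of <Well> XML nodes
        self._index = None             # lazily built lookup table

    def _build_index(self):
        # one pass over all children instead of one pass per lookup
        self._index = dict((node.get('ID'), node)
                           for node in self.well_nodes)

    def __getitem__(self, well_id):
        if self._index is None:
            self._build_index()
        return self._index[well_id]    # O(1) instead of O(n)

The trade-off is cache invalidation: any mutation of the underlying XML tree would have to reset _index, which is exactly the concern raised in the reply below.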
Thanks, Markus. You're right about the caching speedup; I was aiming for simplicity and a low memory footprint there. The problem with a cache here is that the underlying XML document could conceivably be changed, and the cache would not reflect the changes.

The other problem is that __iter__ returns well names; in your case, the problem would have been solved if it returned wells instead. What do you think about adding functions to WellsDucktype and WellSampleDucktype that return iterators over Well and WellSample instances (WellsDucktype.get_wells() and WellSampleDucktype.get_samples())? The code would then look like this:
for w, src_well in enumerate(plate.Well.get_wells()):
    logger.info('process well # %d: %s', w+1, src_well.ID)
    well = new_metadata.WellsDucktype(new_plate).new(
        row=src_well.Row, column=src_well.Column)
    samples = new_metadata.WellSampleDucktype(well.node)
    for s, src_sample in enumerate(src_well.get_samples()):
        logger.debug('process well sample # %d', s)
        sample = samples.new(index=s)  # needs WellSampleDucktype.new() to return the sample
        sample.ImageRef = src_sample.ImageRef
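For reference, a rough sketch of what get_wells() might look like, written against xml.etree directly; the WellView wrapper and the unqualified 'Well' tag are assumptions for illustration (the real omexml.py works with namespace-qualified OME-XML nodes):

# Illustrative sketch only, not the actual omexml.py implementation.
import xml.etree.ElementTree as ET

class WellView(object):
    '''Hypothetical thin wrapper exposing Well attributes.'''
    def __init__(self, node):
        self.node = node
        self.ID = node.get('ID')
        self.Row = int(node.get('Row'))
        self.Column = int(node.get('Column'))

def get_wells(plate_node):
    '''Yield one wrapper per <Well> child, in document order.'''
    for node in plate_node.findall('Well'):
        yield WellView(node)

# toy usage
plate_node = ET.fromstring(
    '<Plate><Well ID="Well:0" Row="0" Column="0"/>'
    '<Well ID="Well:1" Row="0" Column="1"/></Plate>')
for well in get_wells(plate_node):
    print('%s row=%d col=%d' % (well.ID, well.Row, well.Column))

Since iteration walks the children once in document order, each well object is materialized exactly once and no per-well __getitem__ scan is needed.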
Realistically, I may have some trouble finding time to do this, but I'd take a pull request for the changes needed.
Thank you for your immediate feedback!
I would welcome an __iter__ method that returns well objects. Since WellsDucktype inherits from dict, a dict-like iterator would be the most intuitive in my opinion:
for well_id, src_well in plate.Well.iteritems():
    well = new_metadata.WellsDucktype(new_plate).new(
        row=src_well.Row, column=src_well.Column)
    ...
This should also be compatible with the current implementation, since

    for well_id in plate.Well:
        src_well = plate.Well[well_id]
        ...

would still work.
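Something along these lines could back that API; the class below is a simplified stand-in (the real WellsDucktype inherits from dict and resolves IDs against namespaced OME-XML nodes inside __getitem__, which is exactly the cost iteritems() would amortize into a single pass):

# Hypothetical stand-in for WellsDucktype (Python 2, to match the thread).
class WellsSketch(object):
    def __init__(self, plate_node):
        self.plate_node = plate_node

    def __iter__(self):
        # current behaviour: yield well IDs only
        for node in self.plate_node.findall('Well'):
            yield node.get('ID')

    def __getitem__(self, well_id):
        # stands in for the linear scan the real __getitem__ performs
        for node in self.plate_node.findall('Well'):
            if node.get('ID') == well_id:
                return node
        raise KeyError(well_id)

    def iteritems(self):
        # proposed: one pass yielding (ID, well) pairs, so callers never
        # trigger the O(n) lookup once per well
        for node in self.plate_node.findall('Well'):
            yield node.get('ID'), node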
The coupling with the underlying XML document is a nice feature, in particular because of the to_xml() method. However, it also brings some problems with it. Do you think one could decouple the two and separate XML parsing/dumping from the Python object structure? I would be happy to help.
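To make that idea concrete, here is one possible shape for such a decoupling, sketched with hypothetical names and a deliberately simplified schema (the real OME model stores ImageRef as a child element, among other differences): plain Python containers for wells and samples, with parsing and dumping as separate one-shot functions instead of properties that touch the DOM on every attribute access.

# Hypothetical sketch of a decoupled design; names and schema simplified.
import xml.etree.ElementTree as ET

class WellSampleData(object):
    def __init__(self, index, image_ref):
        self.Index = index
        self.ImageRef = image_ref

class WellData(object):
    def __init__(self, row, column):
        self.Row = row
        self.Column = column
        self.Sample = []                      # plain list: O(1) indexed access

def parse_well(node):
    '''Build plain objects from the XML tree once, up front.'''
    well = WellData(int(node.get('Row')), int(node.get('Column')))
    for s in node.findall('WellSample'):
        well.Sample.append(WellSampleData(int(s.get('Index')),
                                          s.get('ImageRef')))
    return well

def dump_well(well):
    '''Serialize plain objects back to XML on demand, so to_xml()-style output stays possible.'''
    node = ET.Element('Well', Row=str(well.Row), Column=str(well.Column))
    for sample in well.Sample:
        ET.SubElement(node, 'WellSample', Index=str(sample.Index),
                      ImageRef=sample.ImageRef)
    return node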