
Large dataset write speed

mfreer opened this issue • 5 comments

Hey all,

I've been working on some code to convert binary image datasets into NetCDF format. The motivation is to aid users of these images by providing a common format, since the source data come in multiple esoteric formats that are difficult for new users to process. Each file contains a large number of images (typically several million), each roughly 128 x 128.

I've found that writing these large datasets to NetCDF is rather slow. I've tried various chunking schemes, but I haven't been able to achieve any significant performance increase. Are there any methods or tricks I might be missing to improve my write speeds?

Here's a very simple code snippet that shows what I'm looking to improve. On my machine (a 2016 MBP), the write of 100,000 'images' below takes around 60 s to complete... Any suggestions for improvement would be greatly appreciated!

#!/usr/bin/env python
# coding: utf-8

import netCDF4 as nc
import numpy
import time

rootgrp = nc.Dataset('testrun.nc', 'w')

# both 'Time' and 'Slices' are unlimited dimensions
rootgrp.createDimension('Time', None)
rootgrp.createDimension('Slices', None)
rootgrp.createDimension('Array', 128)

test1 = rootgrp.createVariable('test1', 'u2', ('Time', 'Slices', 'Array'),
                               chunksizes=(10000, 128, 128))

# numpy.zeros defaults to float64, so this array is cast to u2 on write
a = numpy.zeros((100000, 128, 128))

print('starting write')
t0 = time.time()
test1[:] = a
print('write finished: ', time.time() - t0)

rootgrp.close()

mfreer · Dec 21 '19

I'm guessing it has something to do with there being more than one unlimited dimension. Do you really need both 'Time' and 'Slices' to be unlimited?

jswhit · Dec 24 '19

The 'Time' dimension will likely need to stay unlimited, since the number of images isn't known until the dataset is fully decompressed and processed. As for the 'Slices' dimension, not all images have the same number of slices; however, there is a defined maximum, which depends on the source of the images. From a speed point of view, would it be better to set the 'Slices' dimension to this maximum from the beginning?

mfreer · Dec 30 '19

I think having the 'Slices' dimension be fixed would speed things up considerably.
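
For reference, a minimal sketch of what that change could look like against the snippet above. MAX_SLICES and the filename here are placeholders, not values from the original code; substitute the source-dependent maximum:

import netCDF4 as nc

MAX_SLICES = 128  # hypothetical source-dependent maximum

rootgrp = nc.Dataset('testrun_fixed.nc', 'w')

rootgrp.createDimension('Time', None)          # still unlimited
rootgrp.createDimension('Slices', MAX_SLICES)  # fixed at the known maximum
rootgrp.createDimension('Array', 128)

test1 = rootgrp.createVariable('test1', 'u2', ('Time', 'Slices', 'Array'))

rootgrp.close()

Images with fewer than MAX_SLICES slices would simply occupy a subset of the 'Slices' range, with the remainder left at the variable's fill value.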

jswhit · Jan 22 '20

Thanks all for the input. I've recently had a chance to do some testing, comparing a fixed versus an unlimited 'Slices' dimension. Using the code above, there was only a minor difference in write speed (22 s fixed vs 24 s unlimited), so not as significant as I was hoping.

Is there anything else that could be affecting the write speed? I thought the chunk size might have some impact, but I haven't found a combination that gives any significant improvement...

mfreer · Feb 25 '20

Chunksizes can have a large impact on read and write speed. See https://www.unidata.ucar.edu/software/netcdf/docs/netcdf_perf_chunking.html and https://www.unidata.ucar.edu/blogs/developer/en/entry/chunking_data_why_it_matters.
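
As a rough sketch of what chunk-aligned writes could look like for this access pattern: IMAGES_PER_CHUNK and the filename below are assumptions to benchmark, not recommendations, and 'Slices' is taken as fixed at 128 per the earlier discussion:

import netCDF4 as nc
import numpy

IMAGES_PER_CHUNK = 256  # 256 x 128 x 128 u2 values is ~8 MB per chunk; tune by benchmarking

rootgrp = nc.Dataset('testrun_chunked.nc', 'w')

rootgrp.createDimension('Time', None)
rootgrp.createDimension('Slices', 128)
rootgrp.createDimension('Array', 128)

# chunk a modest number of whole images together
test1 = rootgrp.createVariable('test1', 'u2', ('Time', 'Slices', 'Array'),
                               chunksizes=(IMAGES_PER_CHUNK, 128, 128))

# match the variable dtype up front to avoid a float64 -> u2 cast on write
a = numpy.zeros((100000, 128, 128), dtype='u2')

# write one chunk-aligned batch at a time so each write touches whole chunks
for start in range(0, a.shape[0], IMAGES_PER_CHUNK):
    stop = min(start + IMAGES_PER_CHUNK, a.shape[0])
    test1[start:stop] = a[start:stop]

rootgrp.close()

Variable.chunking() can be used to confirm the chunk shape the library actually chose.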

jswhit · Feb 26 '20