
Slow read times with multi-timestep file using Python bindings

Open rmchurch opened this issue 8 years ago • 6 comments

I have a bp file that has hundreds of 1D arrays written at every timestep, with a total of ~7000 timesteps. Using the Python bindings, reading a single variable is quite slow:

f = ad.file(file)
key = f.var.keys()[0]
print key,f[key]
e_radial_mom_flux_ExB_df_avg AdiosVar (varid=107, dtype=dtype('float64'), ndim=1, dims=(167L,), nsteps=6961)
%time data = f[key][...]
CPU times: user 394 ms, sys: 942 ms, total: 1.34 s                                                          
Wall time: 1min 3s

If I convert using bp2h5, the conversion takes a long time (~30min), but the reading is much faster:

f = h5py.File(file)
print f[key]
<HDF5 dataset "e_radial_mom_flux_ExB_df_avg": shape (6961, 167), type "<f8">
%time data = f[key][...]
CPU times: user 7.65 ms, sys: 1.12 ms, total: 8.77 ms
Wall time: 8.78 ms

I assume this is because the h5 file stores the data as a 2D array, whereas in the original bp file the data for a single variable may not be contiguous due to the timestepping. Is there any way to improve this situation, either by changing the way the file is written or by changing how I read the data from Python? I often want to read in all of the data from the file, but this takes a long time even though it's not much data.
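For illustration, here is a minimal sketch (a hypothetical layout, not the actual BP or HDF5 on-disk format) of the difference: when all steps of one variable are stored back to back, as in the converted h5 file, everything comes out of one bulk read plus a reshape, whereas step-interleaved records have to be gathered one slice at a time:

```python
import numpy as np

nsteps, n = 6961, 167  # shapes from the bp file above

# Contiguous layout (as in the converted HDF5 file): all steps of one
# variable stored back to back -> a single bulk read plus a reshape.
flat = np.arange(nsteps * n, dtype=np.float64)
data = flat.reshape(nsteps, n)

# Step-interleaved layout (as in the original bp file): each step's
# record sits at a different offset, so reading one variable means
# gathering nsteps separate slices.
record_len = 500  # hypothetical per-step record size (all variables)
interleaved = np.zeros(nsteps * record_len)
offsets = np.arange(nsteps) * record_len
for i, off in enumerate(offsets):
    interleaved[off:off + n] = data[i]  # variable fills the first n slots

gathered = np.stack([interleaved[off:off + n] for off in offsets])
assert np.array_equal(gathered, data)
```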

rmchurch avatar Feb 28 '18 21:02 rmchurch

Michael,

Can you make your file available for us at OLCF or NERSC? Is this a single bp file or a directory with many subfiles? This is the diagnostics written by a single process, right?

I just made a test file on my VM with 200 variables and 7000 steps (each variable a 5x5 2D array), and the read time is fast.

AdiosVar (varid=7, name='v001', dtype=dtype('int32'), ndim=2, dims=(5L, 5L), nsteps=7000, attrs=[]) 0.37046790123

This is my python test reader:

#!/usr/bin/python
import numpy
import adios
from timeit import default_timer as timer

f = adios.file('many_vars.bp')
v = f.var['v001']
print v
s = timer()
data = v.read()
e = timer()
print(e - s)
f.close()

Thanks Norbert


pnorbert avatar Mar 05 '18 15:03 pnorbert

The data is on Edison: /scratch2/scratchdirs/rchurchi/xgca_rmc_mira043redo/xgc.oneddiag.bp. It's a single file (yes, written by a single process), with no directories of subfiles (total size is only about 1.5 GB). I found that on the 2nd read of the same data the read time drops to 0.5 s; I'm not sure if this is caching done by Edison or by ADIOS.

rmchurch avatar Mar 05 '18 16:03 rmchurch

Okay, I see. ADIOS does not cache anything; it is the system that caches the data. With the current file format and read implementation, getting the array across all steps takes ~7000 consecutive seek-and-read operations, which is slow on remote disks. The next time, it's reading from cache and is much faster.
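The cost of those scattered reads can be sketched in plain Python (a hypothetical file layout standing in for the BP format, with a padding gap between records representing the other variables' data):

```python
# Gathering one variable from 7000 scattered per-step records costs
# 7000 seek+read calls -- the pattern described above.
import os
import struct
import tempfile

nsteps, n = 7000, 167
record = struct.pack('%dd' % n, *range(n))  # one step's payload (167 doubles)
gap = b'\x00' * 1024                        # stand-in for other variables' data

with tempfile.NamedTemporaryFile(delete=False) as f:
    path = f.name
    for _ in range(nsteps):
        f.write(record)
        f.write(gap)  # consecutive steps of one variable are not adjacent

# One seek + read per step, nsteps times.
stride = len(record) + len(gap)
with open(path, 'rb') as f:
    chunks = []
    for i in range(nsteps):
        f.seek(i * stride)
        chunks.append(f.read(len(record)))

os.remove(path)
assert b''.join(chunks) == record * nsteps
```

On a local disk the page cache hides most of this, but on a remote/parallel file system each seek+read pays network latency, which matches the one-minute first read versus the sub-second cached read.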

I wonder where the hdf5 file was when you got the data in a few milliseconds.


pnorbert avatar Mar 05 '18 17:03 pnorbert

HDF5 is in the same location, you can try it there also (just h5 suffix instead of bp).

rmchurch avatar Mar 05 '18 18:03 rmchurch

I meant, was it in cache or not.


pnorbert avatar Mar 05 '18 18:03 pnorbert

I don't think so. I tried both the bp and h5 files today, after having last accessed them a week ago (so I assume both were out of cache by now). Both had the same read timings as before, and both had the same characteristic that the second read of the same data took much less time (suggesting it was cached). The HDF5 data took about 100 ms on the first read, whereas the bp file took 1 minute on the first read.

rmchurch avatar Mar 05 '18 18:03 rmchurch