Order of processing steps
I'm experimenting with giving glmtools more data than I have in the past, and I noticed that it seems to run for a long time and then write all of the output files at the end. I'm wondering if this is required, or if output files could be generated one at a time. The other thing I'm curious about is that while it runs, memory usage stays relatively steady at ~1.5 GB, then jumps to ~2 GB when the output files are written.
So, for the tasks of reading the data, doing calculations, and writing output, is it possible to run the algorithm(s) per time resolution (1 minute by default for make_GLM_Grids.py)? Any idea if that would be faster? And how is memory kept near-constant if all or most of the data is being read in?
The processing doesn't assume that the grid time slices align with the start or end of data file chunks. A 3D (x,y,time) grid is created at the start of processing, and then as files are processed the flashes are "routed" to the right time window and plopped onto the grid at the necessary index. Then, as you observed, the grids are written to disk once all data have been processed.
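Roughly, that routing step looks like the sketch below. The names and grid layout here are illustrative, not the actual lmatools/glmtools internals: each flash gets a time index computed from its timestamp, independent of which file it arrived in, and is accumulated onto the preallocated 3D grid.

```python
import numpy as np

def route_to_grid(grid, t0, dt, flash_x_idx, flash_y_idx, flash_times):
    """Accumulate flash counts into a preallocated (nx, ny, ntimes) grid.

    Each flash is routed to the time slice containing its timestamp,
    regardless of which input file it came from or in what order.
    """
    t_idx = ((flash_times - t0) // dt).astype(int)
    # np.add.at handles repeated indices correctly (unbuffered accumulation)
    np.add.at(grid, (flash_x_idx, flash_y_idx, t_idx), 1)
    return grid

nx, ny, ntimes = 4, 4, 3
grid = np.zeros((nx, ny, ntimes))
# Three flashes: two land in minute 0, one in minute 2
route_to_grid(grid, t0=0.0, dt=60.0,
              flash_x_idx=np.array([0, 0, 1]),
              flash_y_idx=np.array([2, 2, 3]),
              flash_times=np.array([5.0, 30.0, 130.0]))
```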
One could rewrite lmatools.grid.make_grids.FlashGridder to use a list of 2D xarrays instead of a 3D numpy grid, and then age off those products based on some knowledge that all data for an interval had been processed.
The memory probably balloons due to the process of turning the numpy arrays into 2D xarray data structures as files are written, or maybe it's something in the low-level NetCDF writer where there's additional malloc.
I don't know how speed would be impacted. Even for 1 min of data over full disk, the NetCDF writer is something like 30% of the runtime based on profiling, if memory serves. Sure seems like there's room for optimization on write.
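For anyone wanting to re-check that number, the standard-library profiler is enough; a stand-in function is used here in place of the real writer.

```python
import cProfile
import io
import pstats

def fake_write():
    # Stand-in for the NetCDF writer; profile the real call instead
    sum(range(100000))

pr = cProfile.Profile()
pr.enable()
fake_write()
pr.disable()

# Report the top entries by cumulative time to see where the run goes
s = io.StringIO()
pstats.Stats(pr, stream=s).sort_stats("cumulative").print_stats(5)
report = s.getvalue()
```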
Ok, so are there calculations that depend on the 3D array? Like calls to numpy functions that use the time dimension? I assume that to come up with the final gridded data array the arrays are combined (summed, averaged, etc) along the time dimension.
I suppose in the future you could use dask to do this same set of operations without computing anything until it is actually needed. That might be the easiest way to get some performance out of it without restructuring the entire code base.
As for speed, I think it may really help in the full-disk cases in the future. In my experience, people assume that because their processing machine has the memory, things are going to perform well. However, asking the system for gigabytes of contiguous memory is usually neither easy nor fast, and if you end up having to copy the data (usually accidentally), that contiguous-allocation cost gets even worse.
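The pattern would look something like the following. This is a pure-Python stand-in for `dask.delayed` (so the example has no dask dependency); with real dask, each per-minute gridding task would be described up front and only computed when its result is requested.

```python
import numpy as np

def delayed(fn, *args):
    """Minimal stand-in for dask.delayed: wrap the work in a
    zero-argument callable and defer execution until it is invoked."""
    return lambda: fn(*args)

def grid_one_minute(minute):
    # Stand-in for gridding one minute of flash data
    return np.full((2, 2), minute)

# Nothing has been gridded yet; these are just task descriptions
lazy_grids = [delayed(grid_one_minute, m) for m in range(3)]

# Evaluate only what is needed, when it is needed (dask's .compute())
first = lazy_grids[0]()
```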
I looked at the key steps this afternoon, and I don't think there are any calculations that depend on the 3D array. In glmtools.grid.make_grids.GLMlutGridder.*_pipeline_setup the 3D grid is created and then a 2D slice is sent to the accum pipeline steps. The flashes_to_frames function (really, a coroutine) is the first step in each pipeline, and fans out to N_times accumulators, but doesn't know anything about the 3D grid, just what branch in the pipeline goes with which time.
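The coroutine fan-out can be sketched as below. The names (`accumulate`, `flashes_to_frames`) mirror the description above but the implementation is illustrative, not the actual glmtools pipeline: the router only knows which accumulator owns which time window, and each accumulator only ever sees a 2D slice.

```python
import numpy as np

def accumulate(grid_2d):
    """Target coroutine: accumulates flashes for one time window."""
    while True:
        x, y = yield
        grid_2d[x, y] += 1

def flashes_to_frames(t0, dt, targets):
    """Route each incoming flash to the accumulator for its time window.

    Knows nothing about any 3D grid, only which branch goes with which time.
    """
    while True:
        t, x, y = yield
        idx = int((t - t0) // dt)
        targets[idx].send((x, y))

# Two 1-minute windows, each with its own independent 2D grid
grids = [np.zeros((2, 2)) for _ in range(2)]
accumulators = [accumulate(g) for g in grids]
for a in accumulators:
    next(a)  # prime each coroutine to its first yield

router = flashes_to_frames(0.0, 60.0, accumulators)
next(router)

for flash in [(10.0, 0, 0), (70.0, 1, 1), (20.0, 0, 0)]:
    router.send(flash)
```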
On write, in glmtools.io.imagery.write_goes_imagery, the gridder.outgrids attribute (which isn't touched during the gridding process, based on my quick check) is processed one frame at a time, writing one file for each grid accumulation interval the user has selected.
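That per-frame write loop amounts to something like this sketch, where `np.save` stands in for the NetCDF writer and the `outgrids` layout and filename pattern are illustrative:

```python
import os
import tempfile

import numpy as np

# Stand-in for gridder.outgrids: an (nx, ny, ntimes) accumulation result
outgrids = np.arange(3 * 3 * 2).reshape(3, 3, 2)

outdir = tempfile.mkdtemp()
written = []
for t in range(outgrids.shape[-1]):
    frame = outgrids[..., t]  # one 2D grid per accumulation interval
    path = os.path.join(outdir, f"frame_{t:03d}.npy")
    np.save(path, frame)      # the real code writes NetCDF here
    written.append(path)
```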
So, my conclusion is that nothing depends on having the 3D array. I think it would be possible to allocate a bunch of individual 2D arrays in a list in glmtools.grid.make_grids.GLMlutGridder.*_pipeline_setup, and then process that list upon write without changing any of the accumulation code.
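The allocation change is small; the sketch below (names illustrative) shows the idea: same total storage, but N_times separate allocations instead of one multi-gigabyte contiguous block, with accumulation code indexing `frames[t]` instead of `grid[..., t]`.

```python
import numpy as np

nx, ny, n_times = 1000, 1000, 5

# A list of independent 2D arrays instead of one (nx, ny, n_times) block
frames = [np.zeros((nx, ny), dtype=np.float32) for _ in range(n_times)]

# Same total storage, but no single large contiguous allocation,
# and each frame can be written and freed independently
total_bytes = sum(f.nbytes for f in frames)
```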
That would solve the "big contiguous memory" problem associated with a 3D grid. It does not solve the problem of writing files as they are complete. There really isn't any logic in place to track that kind of information — the code presumes data files might come in any order and cross time boundaries.
> the code presumes data files might come in any order

Could they be sorted before or after calling the function, with that sorting becoming a documented "promise" of the functions that need it?
I don't see why not. Is there a coding pattern for implementing this sort of thing? I'd rather not poorly reinvent a standard abstraction. I can point to how to shut down an accumulator, though triggering the file write and incremental cleanup of the associated memory is harder for me to see, and might need some rearchitecting.
I was just thinking of `input_files = sorted(input_files)` and assuming that the files can be chronologically sorted by filename. I wouldn't worry about it right now; I would think about it more when a larger refactor happens, if you start doing more numba or dask work on the processing. In the end it would be great to have things load only the data they need, write out the results, and move on to the next thing. With dask this would come "by accident".
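For GOES GLM L2 LCFA files, a plain lexicographic sort already happens to be chronological since the start time is embedded in a fixed-position `_sYYYYJJJHHMMSSt` field, but an explicit key makes the promise self-documenting. The filenames below are made up for illustration.

```python
import re

def start_time_key(name):
    """Sort key: the 14-digit start-time field from a GLM L2 filename,
    e.g. the '20200470000000' in '_s20200470000000'."""
    m = re.search(r"_s(\d{14})", name)
    return m.group(1) if m else name

input_files = [
    "OR_GLM-L2-LCFA_G16_s20200470001000_e20200470001200_c20200470001230.nc",
    "OR_GLM-L2-LCFA_G16_s20200470000000_e20200470000200_c20200470000230.nc",
]
input_files = sorted(input_files, key=start_time_key)
```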
Maybe the next conference we are at together we can sit down and talk about this. If you need optimization earlier than that then we can do a video call.