
Internal data containers and saving simulation results to disk

Open cjayb opened this issue 3 years ago • 12 comments

We agree on the niceties of NumPy arrays for data output containers:

  • Any analysis of the time series we generate will involve np.something, so sooner or later the casting has to happen. We might find that large LFP arrays sampled at the default 40 kHz for several seconds and hundreds of trials become very inefficient to maintain as plain Python lists, with all the overhead involved (see the rough estimate after this list).
  • To help build consistency, we have discussed creating a base class (or mixins) for the data containers that the other objects inherit from. CellResponse is probably the trickiest one to convert.
  • We have to be careful how we index gids though when we deal with cell_response.vsoma etc.
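
As a rough illustration of the list-overhead point above (the numbers below are purely illustrative assumptions, apart from the 40 kHz default sampling rate):

# Back-of-envelope memory estimate: 40 kHz, 4 s per trial, 100 trials, 10 electrodes
n_samples = 40_000 * 4 * 100 * 10      # 160 million samples
ndarray_gb = n_samples * 8 / 1e9       # float64 ndarray: ~1.3 GB
list_gb = n_samples * (8 + 24) / 1e9   # list of CPython floats (8-byte pointer + 24-byte object): ~5.1 GB
print(ndarray_gb, list_gb)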

For long simulations on a bunch of headless cluster nodes, getting the sim results (net and dpl) to an interactive session will require a disk write. This could be an argument for keeping things like LFPArray._data as lists, not ndarrays, as the former can (in principle) be pickled. However, I don't think pickling long multi-trial simulation results is a feasible route. It will quickly become terribly inefficient, and potentially error-prone.

I'm starting this Issue to advocate a binary format, such as HDF5. We could also consider going the FIFF way, as the dependency on mne seems inevitable.

A relatively low-threshold approach might be to create HDF5 containers with:

  • a serialised dump of the Network, minus _data-like attributes, into one container
  • an explicit saving of all ndarrays into separate containers

I think the number of containers would be on the order of a handful, no more.
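
For illustration only, here is a minimal sketch of that two-part layout using h5py; the function name, the attribute filtering and the dataset path are assumptions, not existing hnn-core code:

import json
import h5py
import numpy as np

def save_simulation(fname, net, lfp_data):
    """Hypothetical sketch of the proposed two-part HDF5 layout."""
    with h5py.File(fname, 'w') as fid:
        # (1) serialised dump of the Network, minus bulky _data-like attributes
        net_state = {key: val for key, val in vars(net).items()
                     if not key.startswith('_')}
        fid.attrs['network_json'] = json.dumps(net_state, default=str)
        # (2) each large ndarray goes into its own compressed dataset
        fid.create_dataset('lfp/data', data=np.asarray(lfp_data),
                           compression='gzip')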

This is a spin-off of #340

cjayb avatar May 28 '21 13:05 cjayb

However, I don't think pickling long multi-trial simulation results is a feasible route. It will quickly become terribly inefficient, and potentially error-prone.

Let's not invent a problem that does not exist yet. MNE has so far relied on joblib, which uses pickling, and that works just fine. The alternative of saving files to disk is going to mess up user directories, as in the old HNN.

However, I think the storing of objects has to be thought through. We discussed this with Blake before, and HDF5 (for CellResponse) / npy / mat (for LFPArray and Dipole) makes sense to me (cf. https://github.com/jonescompneurolab/hnn-core/issues/159). Are we ready to deprecate the dipole.txt files? (cc @rythorpe)
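
As a tiny illustration of what deprecating the text route would mean in practice (the arrays here are made up, not hnn-core output):

import numpy as np

times = np.linspace(0, 170, 6800)                 # illustrative time axis (ms)
dpl_agg = np.random.randn(len(times))             # illustrative aggregate dipole moment
np.savetxt('dipole.txt', np.c_[times, dpl_agg])   # roughly what dipole.txt holds today
np.save('dipole.npy', np.c_[times, dpl_agg])      # binary alternative: compact and exact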

jasmainak avatar May 30 '21 02:05 jasmainak

Summarizing what we discussed today:

simulate_dipole -> simulate

After thinking a bit, I don't think it's straightforward to combine the dipole object with cell_response and LFP. Even if we did:

dipole.plot -> cell_response.plot_dipole

There is the question of what happens when you average: what happens to the spikes? Similarly, with LFP the dimensions of the array are (n_electrodes, n_trials, n_times) rather than (n_cells, n_trials, n_times). Another approach might be to do:

dipole, cell_response, lfp = simulate(net)

The only thing is that it's not super neat when you don't record LFP, for instance; then you have only two return arguments:

dipole, cell_response = simulate(net)

Another alternative is to pass LFPArray as an input to simulate(net), so you have:

dipole, cell_response = simulate(net, lfparray)

and then you can do:

lfparray.plot() 

etc. For saving to disk, each of these containers would have a save method that writes to HDF5 (all trials in one file).
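
To make that last alternative concrete, here is a sketch of how the call pattern might read in user code; all names (simulate, LFPArray, jones_2009_model, electrode_positions, the save signature) follow the proposal above and are not the current API:

net = jones_2009_model()                    # assumed network constructor
lfparray = LFPArray(electrode_positions)    # extracellular array is passed in, not returned
dipole, cell_response = simulate(net, lfparray, n_trials=3)

lfparray.plot()                             # the array keeps its own time axis and plots itself
dipole.save('dipole.hdf5')                  # each container writes all trials to one file
cell_response.save('cell_response.hdf5')
lfparray.save('lfp.hdf5')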

jasmainak avatar Jun 01 '21 17:06 jasmainak

I too dislike the varying number of output arguments; it's very opaque what's going on there. A monolithic Results container doesn't feel right either. I think I'm currently +1 for your last example. I guess semantically it's saying: "simulate this net, and also this extracellular array on the side". Makes sense in terms of the modularity we were discussing.

Oh, and the array keeps track of its own time, so it is actually disjoint from the net!

cjayb avatar Jun 01 '21 19:06 cjayb

+1 for the last example. It's the most intuitive to me.

rythorpe avatar Jul 02 '21 20:07 rythorpe

Hey @ntolley @jasmainak @rythorpe @cjayb, I am interested in the GSoC idea of developing IO routines for HNN-core outputs. I have started studying it and have a few queries:

  • Dipole and extracellular arrays are stored as text files and read as numpy arrays. Does this txt format need to be changed to some other format (HDF5), or do the existing read and write functions need to be modified?
  • Similarly, spike times for cell responses are stored in txt files, and read and write methods are already present. Do they need to be modified?
  • The params file is in a .json or .param format. The network is built using params, but currently connections cannot be added by reading the params file alone; the Jones model needs to be called to build the complete network. A function for storing the network and reading it back in HDF5 format is required, as given in the idea description. But is a function to build the network from a params file also required?

raj1701 avatar Mar 15 '23 10:03 raj1701

Hi @raj1701 we need a strategy to migrate from "old formats" to "new formats". I think the most straightforward approach is to support reading both the old and new formats, but writing only the new formats. This way, old files will still work, but new files will be created in the new format. The first step is to define the formats in a document and then write the IO functions following the agreed-upon definition.
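
A rough sketch of that read-both/write-new rule; read_dipole_txt, read_dipole_hdf5 and the save call are placeholders, not the final API:

def read_dipole(fname):
    # Reading stays backwards compatible with the legacy text files
    if fname.endswith('.txt'):
        return read_dipole_txt(fname)
    elif fname.endswith('.hdf5'):
        return read_dipole_hdf5(fname)
    raise ValueError(f'Unrecognised dipole file: {fname}')

def write_dipole(dpl, fname):
    # Writing only supports the new format
    if not fname.endswith('.hdf5'):
        raise ValueError('New files are written as .hdf5 only')
    dpl.save(fname)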

My preference is for HDF5 since it is cross-language (it can be read in both MATLAB and Python) as well as very flexible. The flexibility will allow us to use this format across all functions. And finally, being a binary format, it will be memory efficient. In your proposal, it would be important to identify your "plan of action". Part of that is to identify the potential challenges and roadblocks: e.g., defining the file format for the Network object might be challenging, particularly when you also consider the newer additions to the object such as calcium dynamics. The reader should be able to read the file and instantiate the connections directly.

Finally, I do want to note that there is another strong candidate interested in this proposal, but they have not yet shown full commitment through a pull request. However, I encourage you to select the proposal that you find most interesting and doable given your skill set, as that will determine your overall success.

jasmainak avatar Mar 15 '23 20:03 jasmainak

Thanks @jasmainak. I will try to come up with a plan of action and start working on the proposal.

raj1701 avatar Mar 16 '23 07:03 raj1701

Don't hesitate to share an early draft of your proposal on Google Docs.

jasmainak avatar Mar 16 '23 18:03 jasmainak

Yes, surely. I will keep updating the proposal according to the reviews I receive from you guys.

raj1701 avatar Mar 16 '23 20:03 raj1701

Hey @jasmainak @rythorpe @ntolley, can you please share your email addresses with me? I have worked on the project idea and detailed description sections of the proposal. Please review them when you get a chance. Thanks!

raj1701 avatar Mar 22 '23 09:03 raj1701

Hey @raj1701, my email address is ryan_thorpe at brown dot edu

rythorpe avatar Mar 22 '23 19:03 rythorpe

Thanks @rythorpe!!

raj1701 avatar Mar 23 '23 04:03 raj1701