Designing Spyglass from a user's perspective
Introduction
Right now Spyglass is designed from a coder's perspective. There are many tables that track different parameters and operations. People often copy notebooks and run those blindly without understanding parameters, even though the original intent was to get users to track and understand their parameters. Additionally, it is not always clear how to process a lot of data at once. It should be easy to identify which parameter sets are currently the recommended/most trustworthy. In this issue, I aim to outline what I think a user wants to do in processing their data.
It would be ideal to have a relatively small set of functions that accomplish these main tasks efficiently.
NOTE: This is sort of a meta-issue that could encompass several PRs, so it may be better suited as a discussion.
User Process
Step 1: Collect data, convert to NWB, ingest into Spyglass for processing
- I have just collected some data and converted it to NWB. I probably want to check that my data has faithfully been converted into NWB. This could be done pre-Spyglass or within Spyglass. I want visualization and other tools to check this.
- Assuming I trust that the data has been faithfully converted into NWB, I next want to ingest it into Spyglass. This is done by populate_all_common, which converts my NWB file into database tables that represent it. If I have done processing outside of Spyglass, I want that loaded into imported tables that I can access just like processed data (Trodes position should really be imported position). I would expect all the data from my NWB file to be imported (even sensor data). I might have trial information in the NWB file; it is a processed dataset.
- I would like to visualize my data to make sure the conversion occurred correctly. In particular, I want to make sure my electrode mappings are correct, my imported position is correct, time intervals in the experiment have been properly identified, my DIOs are correctly identified, and I have access to the raw video. I also want all my data to have units so I know what I'm dealing with (or some future user knows what they are looking at).
- I also want to quickly assess the quality of the recording (see IBL spike sorting white paper) and identify any bad channels.
- I probably want to start by ingesting one NWB file and checking it. But then I want to ingest a bunch of NWB files. I want to check some or all of these files to see that they have been ingested properly. This can be difficult with a lot of data.
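The batch-ingestion step above could be sketched as a loop that keeps going past individual failures and reports a summary at the end. `ingest_sessions` and `ingest_fn` are hypothetical names; `ingest_fn` stands in for whatever single-file ingestion call is used (e.g. populate_all_common):

```python
def ingest_sessions(nwb_paths, ingest_fn):
    """Ingest many NWB files, continuing past failures and reporting a summary.

    ingest_fn is any callable that ingests one file and raises on failure.
    """
    succeeded, failed = [], []
    for path in nwb_paths:
        try:
            ingest_fn(path)
            succeeded.append(path)
        except Exception as err:  # collect errors instead of stopping the batch
            failed.append((path, err))
    return succeeded, failed

# Usage with a dummy ingest function that rejects one file:
def fake_ingest(path):
    if "bad" in str(path):
        raise ValueError("conversion problem")

ok, bad = ingest_sessions(["a.nwb", "bad.nwb", "c.nwb"], fake_ingest)
# ok == ["a.nwb", "c.nwb"]; bad holds ("bad.nwb", <ValueError>)
```

Collecting failures rather than halting matters most for large batches, where one malformed file should not block the rest of the night's ingestion.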
Step 2: Process Data through Pipelines
Like before, I want to process one NWB file through the main data processing pipelines (Position, LFP, Spikesorting). I want to try default parameters that should just work. Then I want to verify that these worked through visualization. If these didn't work, then I will go back and determine what set of parameters should work. I will potentially need different sets of parameters for different brain areas.
Position Pipeline
- If I don't have imported position data, I need to be able to estimate position from the video using a pose estimation algorithm (DLC, SLEAP, etc.). I need to train a pose model or use a pre-trained model that will just work; I want to start with the pre-trained model to see how it works, then apply it to my data. I need to track which body parts these positions correspond to and how they relate to other body parts (skeleton). I want to use standard body part names so that they are searchable later. I also want to compute derivatives like velocity, speed, orientation, and acceleration. Potentially, if I have multiple sensor estimates (like accelerometers), I would like to be able to use these to compute more stable estimates (sensor fusion).
- If I have imported position data, I want to do some additional processing like smoothing or combining points for stability. I want to identify what body part this position represents and how it relates to other body parts (skeleton). I want to use standard body part names so that they are searchable later. I also want to compute derivatives like velocity, speed, orientation, and acceleration. Potentially if I have multiple sensor estimates (like acceleration), I would like to be able to use these to compute more stable estimates (sensor fusion).
- After I have estimated the position and applied any processing, I want to understand how well this worked. I want to be able to visualize the position(s) and derivatives on top of the original video. If I trained a model, I want metrics on how well this worked. I want to look for edge cases (occlusions, reflections, etc) in particular.
- I then want to process positions for multiple NWB files (in parallel if possible). Then check how well it worked. It would be nice to have tools to find potentially problematic estimates, but I also spot check with the video across different animals.
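The derivative step mentioned above (velocity, speed, orientation) can be sketched with simple central finite differences; a real pipeline would smooth the position first and handle gaps and occlusions. All names here are illustrative:

```python
import math

def kinematics(xs, ys, dt):
    """Central-difference velocity, speed, and heading from 2D position samples.

    Minimal sketch of the derivative computation; endpoints fall back to
    one-sided differences.
    """
    n = len(xs)
    vx = [(xs[min(i + 1, n - 1)] - xs[max(i - 1, 0)]) /
          (dt * (min(i + 1, n - 1) - max(i - 1, 0))) for i in range(n)]
    vy = [(ys[min(i + 1, n - 1)] - ys[max(i - 1, 0)]) /
          (dt * (min(i + 1, n - 1) - max(i - 1, 0))) for i in range(n)]
    speed = [math.hypot(a, b) for a, b in zip(vx, vy)]
    heading = [math.atan2(b, a) for a, b in zip(vx, vy)]
    return vx, vy, speed, heading

vx, vy, speed, heading = kinematics([0.0, 1.0, 2.0, 3.0], [0.0, 0.0, 0.0, 0.0], dt=1.0)
# constant rightward motion: vx == [1.0]*4, speed == [1.0]*4, heading all 0.0
```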
LFP Pipeline
- I want to filter the raw ephys trace into LFPs (0.5 - 400 Hz) and downsample to save space, using a default parameter set (because filter design is hard). 1000 Hz is a reasonable downsampled rate, since the Nyquist frequency is half the sampling rate and so stays above the 400 Hz band edge; you might want something slightly higher if you for some reason wanted to filter the data again in a high-frequency band, but most people should not need this.
- I want to make sure I detect any artifacts (does this need to be on referenced data?) and remove them. I want a good way of detecting artifacts so I might need to try different methods. I could possibly use an external signal to detect artifacts (do I need to detect artifacts separately for LFP and for spike sorting?)
- I mostly want referenced LFP, but there are occasions (like detecting theta from a reference in the callosum) in which I want the unreferenced LFP. I want to pick an electrode contact to use as the reference, or use something like common average referencing or Laplacian referencing (subtracting the average of the K-nearest-neighbor electrodes). I might want to try different types of referencing, but I want a sensible default. Would I want to reference differently per brain area? This seems less common. If I want to Laplacian reference, then I probably need to know the geometry of the probe.
- I want to filter all electrode contacts for all my data except for any bad contacts. Maybe group filtering by probe device? I probably don’t need to know what brain area everything corresponds to yet. Parallelize if possible.
- I want to be able to visualize the artifacts on top of the original trace.
- If I’ve imported LFP, I might still want to detect artifacts?
- I also might want to further filter the LFP (imported or not) into narrow bands (delta, theta, high gamma, low gamma, ripple). There are different definitions of frequency bands but I want a standardized set of filters to choose from (because again filter design is hard and consistency is good). I may want to select a set of electrodes to filter (potentially by brain region? could be even more granular but brain region would at least be a good starting point). Perhaps doing all LFPs is okay since we are working with downsampled data. Mostly this is a disk space concern.
- I want to be able to get the instantaneous amplitude and phase of the different bands.
- I want to be able to visualize the filtered band trace over the LFP trace to check the filtering. I also want to be able to visualize the artifacts on top of this.
- I eventually want to know which brain area corresponds to which electrode contact. I might want to know the spatial position in the brain as well (for CSD analyses, for example). If this does not come with the NWB file (where brain area is only defined at the granularity of the probe device anyway), then I need a way to assign brain areas to recording contacts. I do this through histology or prior annotation of brain areas based on ???. I need a table to put this in, and I need to use a consistent vocabulary (brain atlas) to label things.
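One of the referencing options above, common average referencing with bad channels excluded, is simple enough to sketch in a few lines (common *median* referencing would swap mean for median to resist artifacts). The data layout here is illustrative:

```python
def common_average_reference(traces, bad_channels=()):
    """Subtract the per-sample mean across good channels from every channel.

    traces: dict of channel_id -> list of samples (all the same length).
    Bad channels are excluded from the reference but still get referenced output.
    """
    good = [ch for ch in traces if ch not in set(bad_channels)]
    n_samples = len(next(iter(traces.values())))
    ref = [sum(traces[ch][t] for ch in good) / len(good) for t in range(n_samples)]
    referenced = {ch: [v - r for v, r in zip(traces[ch], ref)] for ch in traces}
    return referenced, ref

traces = {"e0": [1.0, 2.0], "e1": [3.0, 4.0], "e2": [100.0, 100.0]}
referenced, ref = common_average_reference(traces, bad_channels=["e2"])
# ref == [2.0, 3.0]; referenced["e0"] == [-1.0, -1.0]
```

Excluding "e2" from the reference shows why bad-channel detection should come first: a saturated contact would otherwise dominate the average.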
Spike Sorting Pipeline
- I want to spike sort with some default parameters. I potentially want to spike sort with different parameters for different brain regions which involves knowing which contacts belong to which brain region. I also want to do artifact detection (but this is different and separate from the LFP pipeline?) before the spike sorting. Then I want to run curation automatically at first and then refine it later.
- After establishing parameters, I would like to spike sort over many datasets. I would want to then visualize and check this (although this happens over the curation process as well).
- I also want to know which brain region the neuron corresponds to and possibly what type of neuron it is. This would involve accessing the spike waveforms or spike waveform features and then classifying things like the spike width and firing rate. It would be nice to have these features easily available to quickly filter and select these groups of neurons for analysis.
- If I have imported spikes, then I want to pull in the spike times and any associated features of the neuron.
- I might want a basic raster plot to make sure the spiking makes sense?
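A toy version of the cell-type filtering described above, using spike width and firing rate. The thresholds and labels are purely illustrative; real classification would be region- and species-specific and would likely use more waveform features:

```python
def label_unit(spike_width_ms, firing_rate_hz,
               width_thresh_ms=0.4, rate_thresh_hz=10.0):
    """Toy putative cell-type label from spike width and firing rate.

    Narrow, fast-firing units -> putative interneurons; wide, slow-firing
    units -> putative pyramidal cells; everything else unclassified.
    """
    if spike_width_ms < width_thresh_ms and firing_rate_hz > rate_thresh_hz:
        return "putative_interneuron"
    if spike_width_ms >= width_thresh_ms and firing_rate_hz <= rate_thresh_hz:
        return "putative_pyramidal"
    return "unclassified"

units = [
    {"unit_id": 1, "spike_width_ms": 0.25, "firing_rate_hz": 25.0},
    {"unit_id": 2, "spike_width_ms": 0.70, "firing_rate_hz": 1.5},
]
labels = {u["unit_id"]: label_unit(u["spike_width_ms"], u["firing_rate_hz"])
          for u in units}
# {1: "putative_interneuron", 2: "putative_pyramidal"}
```

The point is less the thresholds than the access pattern: if these features live in an easily fetched table, selecting a group of neurons for analysis becomes a one-liner.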
Process experimental cues/triggers
- I want to use things like the DIOs and Statescript to know when certain experimental or behavioral events occurred. If I have a trial structure, I want to know what the experimental condition was for a given time period. I want to be able to easily visualize these alongside the video to know that they are correct. I want to be able to easily find the time or time intervals of experimental or behavioral events.
- I also want to have some session/day related information so I can do things like characterize learning or experimental conditions that last per session. Currently some of this is given by some obscure strings or by parsing the name of the NWB file, but it would be nice to easily pull this up. Later analyses could add labels to this table for filtering or certain analyses.
- There are basically three levels of events I want to access quickly: within trial, trial-level, across session/days. Sometimes I will need to go from start and end times to per time bin labels. This can involve some sort of interpolation that I need access to easily for analysis. Interpolation is also useful when dealing with time series with different sampling rates.
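Going from start/end times to per-time-bin labels, as described above, can be sketched with a binary search over interval starts (interval and label names here are made up):

```python
from bisect import bisect_right

def label_time_bins(bin_times, trials):
    """Map each time bin to the label of the trial interval containing it.

    trials: list of (start, end, label), non-overlapping and sorted by start.
    Bins outside every interval get None.
    """
    starts = [s for s, _, _ in trials]
    labels = []
    for t in bin_times:
        i = bisect_right(starts, t) - 1  # last interval starting at or before t
        if i >= 0 and t < trials[i][1]:
            labels.append(trials[i][2])
        else:
            labels.append(None)
    return labels

trials = [(0.0, 2.0, "left"), (3.0, 5.0, "right")]
labels = label_time_bins([0.5, 2.5, 4.0], trials)
# ["left", None, "right"]
```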
Step 3: Process Data through Derived Pipelines
This step is similar to the above but depending on the goal of my analysis, I may only want to run a subset of these steps. I also want to be able to run these processes over many animals/days/sessions. I want to be able to visualize their outputs.
Ripple Times
- Depends on Position (speed), LFP (referenced)
- Visualization would involve LFP, filtered LFP at Ripple Band, consensus trace, speed, potentially MUA if available.
Multiunit Activity and High Multiunit Activity Times
- Multiunit Activity depends on Spike Sorting
- High Multiunit Activity Times depend on Spike Sorting, Position (speed)
- Visualization would be raster of all spikes, speed.
Linearization (of Position)
- Depends on Position
- Visualization would involve 2D position on video.
Decoding
- Depends on Position, Spike Sorting, potentially ripple times for excluding non-local times.
- Visualization would involve position, speed, raster.
Behavioral syllables
- Depends on Position
I would like to run these processing steps for all (or most) of my NWB files. This probably involves looping over animals and sessions. I want to be able to re-run all my data with a single script.
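The single re-run script could be a loop over session keys and an ordered list of pipeline steps, recording per-step failures instead of halting; all names and the key structure here are hypothetical:

```python
def rerun_all(sessions, steps):
    """Re-run an ordered list of pipeline steps over many sessions.

    sessions: list of key dicts (e.g. {"animal": ..., "session": ...});
    steps: list of (name, fn) where fn(key) runs one pipeline stage.
    On a failure, later steps for that session are skipped, since they
    usually depend on earlier outputs.
    """
    report = []
    for key in sessions:
        for name, fn in steps:
            try:
                fn(key)
                report.append((key["animal"], key["session"], name, "ok"))
            except Exception as err:
                report.append((key["animal"], key["session"], name, f"failed: {err}"))
                break
    return report

sessions = [{"animal": "rat1", "session": 1}, {"animal": "rat2", "session": 1}]
steps = [("position", lambda key: None), ("lfp", lambda key: None)]
report = rerun_all(sessions, steps)
# four "ok" rows: two sessions x two steps
```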
Step 4: Analysis outside spyglass
I would like to use a package with commonly done analyses.
This is obviously inspired by and related to: https://github.com/LorenFrankLab/spyglass/issues/529
Ideal workflow?
```mermaid
graph TD;
  common --> bad_channel_detection;
  bad_channel_detection --> reference_ephys;
  reference_ephys --> filter_downsample_ephys;
  filter_downsample_ephys --> artifact_detection;
  artifact_detection --> whitening --> spike_sorting;
  artifact_detection --> LFP;
  LFP --> LFPBand;
  common --> train_position --> position_inference;
  position_inference --> behavior_syllable;
  position_inference --> decoding;
  spike_sorting --> decoding;
  LFPBand --> ripple_detection;
  position_inference --> ripple_detection;
```
- Collapse MUA and Linearization into Spike_Sorting and Position respectively?
- Spikeinterface tutorials filter and then reference. Our spike sorting pipeline references and then filters. This is probably more correct.
- Does artifact detection always happen after filter and downsample? This justifies our current artifact detection happening separately in LFP and Spike Sorting, since you would get separate kinds of events. We do use different code for both, which should probably be combined. If we use common average/median referencing, then perhaps an artifact detection step before referencing makes sense, to avoid artifacts affecting the average/median (although the median should be more robust to outliers).
- whitening and common average referencing are kind of doing the same thing.
Here's the LLM answer, which I think is not a crazy proposal:
The core tension you've identified is between the developer's need for granular control, explicit parameter tracking, and provenance (which DataJoint tables excel at) and the user's desire for efficient, task-oriented functions with sensible defaults, integrated validation, and easy batch processing.
Here's a breakdown of how your outlined "User Process" relates to the current Spyglass structure and potential high-level functions/abstractions that could bridge the gap, building on some of our previous discussion:
Step 1: Ingestion and Validation
- User Wants: Simple ingestion from NWB, validation tools (pre-Spyglass or within), visualization of ingested data (electrode mapping, position, intervals, DIOs, video access), handling of pre-processed imported data (e.g., Trodes position), easy batch ingestion and checking.
- Current Spyglass: `populate_all_common` handles ingestion into common tables. `ImportedLFP`, `ImportedSpikesorting`, and `ImportedPose` exist but might need streamlining (as you noted for Trodes position). Visualization often requires fetching data from specific tables and using external plotting libraries or custom helper functions (like `sortingview`). Batching requires looping `populate_all_common` over NWB files.
- Proposed High-Level Solution:
  - `spyglass.ingest_session(nwb_path, validate=True, visualize=False, import_processed=True)`:
    - Wraps `populate_all_common`.
    - `validate=True` could add checks for expected metadata (e.g., electrode locations, task info). Pre-Spyglass NWB validation tools are likely separate, but Spyglass could check for essential fields post-ingestion.
    - `visualize=True` could optionally trigger standard plots: electrode geometry (`common_electrode`), basic raw trace snippet (`common_ephys`), position trace (`common_position`/`ImportedPose`), interval list summary (`common_interval.IntervalList().plot_intervals`).
    - `import_processed=True` could trigger population of relevant `Imported*` tables if corresponding data exists in the NWB file. Standardize how imported data (like Trodes position) is handled, perhaps always routing through an `ImportedPose` table.
  - Visualization Suite: Develop a more integrated `spyglass.visualize` module with functions tied to primary keys (e.g., `spyglass.visualize.plot_electrode_geometry(nwb_file_name)`, `spyglass.visualize.plot_position_over_video(key)`) that abstract the data fetching and plotting.
  - Batch Ingestion: A simple helper `spyglass.ingest_sessions(nwb_path_list, parallel=True, **kwargs)` to loop `ingest_session`.
Step 2: Process Data through Pipelines (Position, LFP, Spike Sorting)
- User Wants: Run pipelines with default parameters first, visualize outputs for verification, iterate on parameters if needed (potentially varying by brain region), batch process efficiently.
- Current Spyglass: Requires selecting parameters from `*Params` tables, inserting into `*Selection` tables, and calling `populate` on computed tables. High-level pipeline runners (`populate_spyglass_*`) exist but still require specifying many keys/parameter names. Visualization requires fetching results and plotting. Batching involves looping these steps.
- Proposed High-Level Solution:
  - Pipeline Runner Functions: Create dedicated functions for each major pipeline:
    - `spyglass.run_position_pipeline(nwb_file_name, interval_name, params_name='default', source='video'/'imported', visualize=True, **kwargs)`: Handles DLC model selection/inference or imported position processing and derivative calculation, inserts into `PositionOutput`, and optionally calls visualization.
    - `spyglass.run_lfp_pipeline(nwb_file_name, electrode_group_name, interval_name, lfp_params='default', artifact_params='default', band_params=None, visualize=True, **kwargs)`: Wraps the logic previously discussed (potentially including the suggested referencing refactor); handles base LFP, artifact detection, and optional band extraction via a simplified `band_params` dict structure.
    - `spyglass.run_spikesorting_pipeline(nwb_file_name, electrode_group_name, interval_name, sorting_params='default', artifact_params='default', curation_params='auto', visualize=True, **kwargs)`: Manages artifact detection (potentially with sorting-specific parameters), sorter execution, automatic curation, and insertion into `SpikeSortingOutput`.
  - Intelligent Defaults: These functions should query the relevant `*Params` tables for the entry named 'default' (or another specified name). Parameter tables need well-vetted default entries.
  - Simplified Parameter Iteration: Provide helpers to easily view available `*Params` names and their contents. The runner functions accept `params_name` arguments, making it easy to switch between parameter sets ('default', 'CA1_optimized', etc.).
  - Integrated Visualization: The `visualize=True` flag in runners triggers standard plots for the pipeline output (e.g., position overlay, LFP trace with artifacts, raster plot).
  - Batch Processing Function: `spyglass.batch_run_pipeline(pipeline_func, keys_list, parallel=True, **kwargs)`: Takes one of the `run_*_pipeline` functions and a list of primary keys (e.g., a list of dicts with `nwb_file_name`, `interval_name`, etc.) and executes the pipeline across them.
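The proposed `batch_run_pipeline` could look roughly like this; `pipeline_func` stands in for any of the hypothetical `run_*_pipeline` functions, and the real version would forward `**kwargs` and collect populate errors per key:

```python
from concurrent.futures import ThreadPoolExecutor

def batch_run_pipeline(pipeline_func, keys_list, parallel=True, max_workers=4):
    """Apply one pipeline-runner callable to a list of primary-key dicts.

    Sequential by default semantics are preserved when parallel=False;
    otherwise a thread pool fans the keys out.
    """
    if not parallel:
        return [pipeline_func(key) for key in keys_list]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(pipeline_func, keys_list))

keys = [{"nwb_file_name": f"sess{i}.nwb"} for i in range(3)]
results = batch_run_pipeline(lambda key: key["nwb_file_name"], keys)
# ["sess0.nwb", "sess1.nwb", "sess2.nwb"]
```

Threads are a reasonable default when the heavy lifting happens in the database or in subprocess-based sorters; CPU-bound Python steps would want processes instead.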
Handling Specific Pipeline Points:
- Position: `run_position_pipeline` should handle both DLC/SLEAP (using pre-trained models by default, linking to training tables) and imported data flows. Standardization of body part names needs convention enforcement (perhaps via `PoseGroup`). Sensor fusion is advanced and likely beyond default pipelines initially.
- LFP:
  - `run_lfp_pipeline` should use default filters (needs `FirFilterParameters` populated with standards).
  - Artifact detection should be included by default (using `artifact_params='default'`). Clarify whether LFP/spike sorting artifacts need separate detection (the current design allows this via `spikesorting.v1.ArtifactDetection` vs. `lfp.v1.LFPArtifactDetection`).
  - Referencing needs simplification (as discussed previously; ideally an explicit step or simplified options within `run_lfp_pipeline`). The default could be no reference, or common average if geometry is available.
  - Filtering all contacts by default is reasonable. `band_params` allows specifying subsets if needed.
  - Getting amplitude/phase: Add methods to `LFPBandV1` (like the existing `compute_signal_phase`, `compute_signal_power`) easily callable via the merge table: `(LFPOutput & key).compute_signal_phase(...)`.
  - Assigning brain areas: Requires populating `common_region.BrainRegion` and `common_region.ElectrodeRegion` or similar, using standard atlases. This is often manual or semi-automated based on histology/probe tracks. Helper functions `spyglass.get_units_by_region(region_name)` or `spyglass.get_lfp_by_region(region_name)` would be useful.
- Spike Sorting:
  - `run_spikesorting_pipeline` uses default sorter params. Brain region-specific sorting requires selecting appropriate `electrode_group_name` inputs (which could map to regions).
  - Automatic curation (via `MetricCuration` or similar) should be the default. Manual curation involves later interaction with specific curation tables/figurls.
  - Neuron features (spike width, firing rate): These could be computed in a downstream table (e.g., `UnitFeatures`) dependent on `SpikeSortingOutput`. `UnitWaveformFeatures` already computes some. Provide easy access: `(SpikeSortingOutput & key).fetch_unit_features()`.
Step 3: Process Experimental Cues/Triggers & Derived Pipelines
- User Wants: Easy access to DIOs, trial structure, session info; run derived pipelines (Ripples, MUA, etc.) easily; visualize outputs.
- Current Spyglass: DIOs in `common_dio`; trial info often in the NWB epochs table -> `IntervalList`; session info in `Session` or `Nwbfile`. Derived pipelines have their own tables and often high-level populator functions.
- Proposed High-Level Solution:
  - Event/Trial Access: `spyglass.get_session_events(key, event_types=['dio', 'trials'])`: Function to retrieve and potentially align different event types (from `DioEvents`, relevant `IntervalList` entries for trials) into a common structure (e.g., pandas DataFrame). Needs a standardized way to identify trial intervals.
  - Interpolation Helpers: Enhance/expose interpolation utilities (like those in `common_interval` or `ripple.v1.ripple.interpolate_to_new_time`) for aligning different time series.
  - Derived Pipeline Runners: Similar in structure to the Step 2 runners, e.g., `spyglass.run_ripple_detection(key, ripple_params_name='default', visualize=True)`, `spyglass.run_mua_detection(key, mua_params_name='default', visualize=True)`. These would find the necessary inputs (Position, LFP, Spikes) via the merge tables based on the provided session/interval key.
  - Batching: Use the same `spyglass.batch_run_pipeline` function with the derived pipeline runners.
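A minimal stand-in for the proposed `get_session_events`, merging DIO events and trial intervals into one time-sorted structure (plain dicts here instead of a pandas DataFrame; all names are illustrative):

```python
def get_session_events(dio_events, trial_intervals):
    """Merge DIO events and trial intervals into one time-sorted event list.

    dio_events: list of (time, dio_name);
    trial_intervals: list of (start, end, condition).
    Trial intervals contribute a start and an end row each.
    """
    rows = [{"time": t, "kind": "dio", "name": name} for t, name in dio_events]
    rows += [{"time": s, "kind": "trial_start", "name": cond}
             for s, _, cond in trial_intervals]
    rows += [{"time": e, "kind": "trial_end", "name": cond}
             for _, e, cond in trial_intervals]
    return sorted(rows, key=lambda r: r["time"])

events = get_session_events(
    dio_events=[(1.2, "poke_left"), (0.4, "light_on")],
    trial_intervals=[(0.0, 2.0, "left_trial")],
)
# times in order: 0.0 (trial_start), 0.4 (light_on), 1.2 (poke_left), 2.0 (trial_end)
```

A flat, time-sorted table like this makes "find the time of event X within trial Y" a filter rather than a join across several tables.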
Step 4: Analysis Outside Spyglass
- User Wants: Easy export to common analysis packages/formats.
- Current Spyglass: Fetching data via `.fetch_nwb()`, `.fetch1_dataframe()`, etc., returns standard Python/pandas/NWB objects.
- Proposed High-Level Solution: Continue providing easy fetch methods that return standard, usable formats. Potentially add export helpers for specific external tools if common workflows emerge (e.g., `(SpikeSortingOutput & key).export_to_phy()`). `SpyglassMixin` already provides basic fetching capabilities.
Summary:
Your proposal correctly identifies the need for a user-centric layer on top of the existing DataJoint structure. The key is to create a set of intuitive, task-oriented functions (`ingest_session`, `run_*_pipeline`, `batch_run_pipeline`, `visualize.*`, `get_*_events`) that encapsulate the underlying table interactions, parameter selections, and population calls. This retains the benefits of DataJoint (provenance, reproducibility) while offering the streamlined workflow users desire. This would indeed be a significant undertaking, likely involving multiple PRs focused on ingestion, each pipeline, visualization, and batching.