spyglass
`IntervalList` is hard to use and highly redundant in contents
Description
`IntervalList` is a central table shared across many pipelines. It is sufficiently generic to suit the needs of any pipeline, but I would argue that the lack of pipeline-specific fields has overburdened the interval list name as a place to store data. A lack of uniqueness in the stored numpy arrays has led to redundant contents and slow fetch times.
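For context, the table boils down to a session-scoped, free-form name pointing at a `longblob` of times. Roughly (a paraphrase for orientation only; attribute types and lengths are approximate, and the authoritative definition lives in `spyglass.common.common_interval`):

```python
# Paraphrased shape of IntervalList for this discussion, not the source definition.
definition = """
-> Session                        # session key (nwb_file_name)
interval_list_name: varchar(200)  # free-form name; conventions vary by pipeline
---
valid_times: longblob             # numpy array of [start, stop] times
"""
```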
In our production database, this is the only core schema with a backup >1GB, and it has been increasing in size. This may slow down day-to-day operations, and it makes restoring a backup more cumbersome. Full backups of the database now exceed the file size limits of common data transfer methods. Isolating just this table as a candidate for data size reduction:
Interval lists: 131345
... with unique valid_times: 33876, 25.7916%
... sets of repeated valid_times: 5718, 4.3534%
... sets with >10 repeated valid_times: 3327, 2.5330%
... with no downstream: 2031, 1.5463%
... size reduction if no repeated valid_times: 69.8550%
Qualitatively, most of the redundancy comes from registering a new interval list for each electrode in each session, even though valid_times is unlikely to change across electrodes within a probe, even after processing steps.
Status-quo structure:

- Pros:
  - Can help standardize an interval as a data type - but this is not enforced when inserting the blob.
  - Permits easier comparison of entries across pipelines - if this has happened, I have not seen examples.
- Cons:
  - Pipelines use very different naming conventions.
  - Data stored in name substrings is much harder to compress, store, and search, and it is often redundant with the keys it is linked to. For example, the interval list name `SubjectDate_InitialInterval_Electrode_ProcessingStep_valid` would be more searchable as a datajoint `QueryExpression` like `Session * IntervalList & uuid_based_key` (see the sketch after this list).
  - Rather than reuse existing entries with the same `valid_times`, new times are added.
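To make the searchability point concrete, here is a minimal sketch of the two lookup styles; the `owning_table`/`key` arguments stand in for whichever pipeline table actually produced the entry:

```python
from spyglass.common import IntervalList, Session


def entries_by_name(pattern: str):
    """Name-based lookup: string-match a pipeline-specific naming convention."""
    return IntervalList & f"interval_list_name LIKE '{pattern}'"


def entries_by_key(owning_table, key: dict):
    """Key-based lookup: restrict via the table that owns the entry, i.e. a
    datajoint QueryExpression such as Session * IntervalList & uuid_based_key."""
    return Session * IntervalList & (owning_table & key)
```

The second form survives naming-convention changes because it only touches real attributes.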
Script used to examine redundancy
from hashlib import sha256

import datajoint as dj
import numpy as np

from spyglass.common import IntervalList
from spyglass.utils.dj_helper_fn import get_child_tables
from spyglass.utils.logging import logger

schema = dj.schema("cbroz_temp")


@schema
class IntervalListHashes(dj.Computed):
    definition = """
    -> IntervalList
    valid_times_hash: char(64)  # unique value for each set of valid_times
    """

    @classmethod
    def make_hash(cls, valid_times):
        # low collision hash function
        return sha256(
            b"".join(arr.tobytes() for arr in valid_times)
        ).hexdigest()

    def make(self, key):
        valid_times = (IntervalList() & key).fetch1("valid_times")

        # Normalize different valid_times formats.
        if not isinstance(valid_times, list):
            valid_times = [valid_times]
        first_element = valid_times[0] if valid_times else None
        if isinstance(first_element, tuple):
            valid_times = [np.array(arr) for arr in valid_times]

        self.insert1(dict(key, valid_times_hash=self.make_hash(valid_times)))


@schema
class IntervalListHashCounts(dj.Computed):
    definition = """
    valid_times_hash: char(64)
    ---
    count: int
    valid_times: longblob
    description = "" : varchar(64)
    """

    class IntervalListEntry(dj.Part):
        definition = """
        -> master
        -> IntervalList
        ---
        is_orphaned: bool
        """

    # Lazy load properties. These are cached after first call.
    _orphans = None
    _non_orphans = None
    _orphan_status = None
    _key_source = None

    @property
    def key_source(self):
        """Determine populate candidates without foreign key."""
        if self._key_source:
            return self._key_source

        # Populate functionally upstream table
        if len(IntervalList - IntervalListHashes) > 10:
            IntervalListHashes.populate(display_progress=True)
        if len(IntervalList - IntervalListHashes) > 10:
            raise ValueError("IntervalListHashes table is not fully populated.")

        # In-memory set difference. Only return hash bc primary key above.
        logger.debug("Fetching HashCount key source")
        self._key_source = (
            dj.U("valid_times_hash").aggr(
                IntervalListHashes - self, count="COUNT(*)"
            )
            & "count > 1"  # only populate repeated valid_times
        )
        return self._key_source

    @property
    def part_format(self):
        """Empty in-memory table used to assemble orphan status."""
        return dj.U(
            "nwb_file_name",
            "interval_list_name",
            "valid_times_hash",
            "is_orphaned",
        )

    @property
    def orphans(self):
        if self._orphans:
            return self._orphans
        logger.debug("Fetching orphans")
        # Similar to `cleanup` logic. Ignores tables here under `cbroz`
        interval_list_valid_children = [
            t
            for t in get_child_tables(IntervalList)
            if "cbroz" not in t.full_table_name
        ]
        self._orphans = self.part_format & (
            IntervalListHashes - interval_list_valid_children
        ).proj(is_orphaned="1")
        return self._orphans

    @property
    def non_orphans(self):
        if self._non_orphans:
            return self._non_orphans
        logger.debug("Fetching non-orphans")
        self._non_orphans = self.part_format & (
            IntervalListHashes - self.orphans  # All others are non-orphans
        ).proj(is_orphaned="0")
        return self._non_orphans

    @property
    def orphan_status(self):
        if not self._orphan_status:  # Lazy load set union
            self._orphan_status = self.orphans + self.non_orphans
        return self._orphan_status

    def is_orphaned(self, key) -> bool:
        """Check orphan status of a key. Used in list comprehension method."""
        return True if key in self.orphans else False

    def get_time_example(self, full_key):
        """Fetch a single example of valid_times for a given hash.

        Times will be the same across all entries with the same hash.
        """
        logger.debug("Fetching time example")
        return (IntervalList * full_key).fetch("valid_times", limit=1)[0]

    def make(self, key):
        # Caching key_source is faster, but possible collisions if
        # IntervalList is getting new entries. Counts may be out of date.
        if key in self:
            logger.debug("Key already exists")
            return

        full_key = IntervalListHashes & key
        full_key_len = len(full_key)

        logger.debug("Assembling master key")
        master_key = dict(
            valid_times_hash=key["valid_times_hash"],
            count=full_key_len,
            valid_times=self.get_time_example(full_key),
        )

        # Join method is faster for large numbers;
        # comprehension is faster for small numbers. Arbitrary cutoff at 20.
        if full_key_len > 20:
            logger.debug("Assembling part keys, join method")
            part_keys = (self.orphan_status * full_key).fetch(as_dict=True)
        else:
            logger.debug("Assembling part keys, comprehension method")
            part_keys = [
                {**entry, "is_orphaned": self.is_orphaned(entry)}
                for entry in full_key
            ]

        logger.debug(f"Inserting {len(part_keys)} keys")
        self.insert1(master_key)
        self.IntervalListEntry.insert(part_keys)


def summary():
    hash_counts = dj.U("valid_times_hash").aggr(
        IntervalListHashes, count="COUNT(*)"
    )

    n_entries = len(IntervalList())
    n_unique = len(hash_counts & "count=1")
    unique_percent = n_unique / n_entries * 100
    n_repeated = len(hash_counts & "count>1")
    repeated_percent = n_repeated / n_entries * 100
    n_repeat_10 = len(hash_counts & "count>10")
    repeat_10_percent = n_repeat_10 / n_entries * 100
    n_orphans = len(IntervalListHashCounts().orphans)
    orphan_percent = n_orphans / n_entries * 100
    possible_reduction = 100 - len(hash_counts) / n_entries * 100

    print(f"Interval lists: {n_entries}")
    print(f"... with unique valid_times: {n_unique}, {unique_percent:.4f}%")
    print(
        f"... sets of repeated valid_times: {n_repeated}, "
        + f"{repeated_percent:.4f}%"
    )
    print(
        f"... sets with >10 repeated valid_times: {n_repeat_10}, "
        + f"{repeat_10_percent:.4f}%"
    )
    print(f"... with no downstream: {n_orphans}, " + f"{orphan_percent:.4f}%")
    print(
        f"... size reduction if no repeated valid_times: "
        + f"{possible_reduction:.4f}%"
    )
    return hash_counts
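For reference, the figures quoted at the top of this issue came from running something along these lines against our production database (the exact invocation may have differed):

```python
# key_source populates IntervalListHashes first, if needed, then this table.
IntervalListHashCounts.populate(display_progress=True)
hash_counts = summary()  # prints the counts shown above
```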
Alternatives
New methods, future breaking changes:
- Non-datajoint methods of standardizing interval lists as a data type. Python's dataclass module feels like a good fit; abstract base classes or mixins can help support uniformity of interval lists across different pipelines (see the sketch after this list).
- `IntervalList` could be a table pattern, like params tables, that outlines a solution and provides a template to be adapted for use in individual pipelines, each of which would use its own set of primary keys to avoid redundant entries.
- If the interval list needs to be central, it could use a hash of the blob as the primary key (see the script above) and use part tables to associate the same set of times with multiple human-readable strings.
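As a minimal sketch of the dataclass idea from the first bullet (names and validation rules are illustrative only, not a concrete API proposal):

```python
from dataclasses import dataclass
from hashlib import sha256

import numpy as np


@dataclass(frozen=True)
class Interval:
    """Illustrative interval-list type; not something spyglass provides today."""

    times: np.ndarray  # (n, 2) array of [start, stop] pairs

    def __post_init__(self):
        arr = np.atleast_2d(np.asarray(self.times, dtype=float))
        if arr.shape[-1] != 2:
            raise ValueError("expected an (n, 2) array of [start, stop] times")
        object.__setattr__(self, "times", arr)  # bypass frozen to normalize

    def content_hash(self) -> str:
        """Stable identity for deduplication, mirroring the script above."""
        return sha256(self.times.tobytes()).hexdigest()
```

Shared set-logic helpers (intersection, union, gap handling) could live on the same class or a mixin so every pipeline handles intervals the same way.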
Adjusting the current model is trickier. Any departure is going to make interval list names less informative.
- The hash functionality in the Python script above could be used to cross-check a possible new entry against existing entries for redundant matches. It would be up to each pipeline to implement a function for searching existing names for possible matches (see the sketch after this list).
- Adjust artifact detection to first assume times will be redundant across electrodes and only register new entries if this is not the case.
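A rough sketch of the cross-check mentioned above, assuming the same hashing approach as the script (the helper names are hypothetical, not an existing spyglass API):

```python
from hashlib import sha256

import numpy as np

from spyglass.common import IntervalList


def _times_hash(valid_times) -> str:
    """Mirror of IntervalListHashes.make_hash for a single candidate entry."""
    arrays = valid_times if isinstance(valid_times, list) else [valid_times]
    return sha256(b"".join(np.asarray(a).tobytes() for a in arrays)).hexdigest()


def find_matching_interval_lists(nwb_file_name: str, valid_times) -> list:
    """Names of existing entries in this session with identical valid_times."""
    entries = (IntervalList & {"nwb_file_name": nwb_file_name}).fetch(
        "interval_list_name", "valid_times", as_dict=True
    )
    target = _times_hash(valid_times)
    return [
        e["interval_list_name"]
        for e in entries
        if _times_hash(e["valid_times"]) == target
    ]
```

A pipeline could call this before insertion and reuse the first match instead of registering a new name.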