strain RRID
Papers often include RRID tags for the strains of the animals they use, but it is currently difficult to get this data into NWB and DANDI metadata. Ultimately, I would like to be able to query DANDI for sessions from animals of a specific strain by querying on an RRID. The DANDI Schema can support RRIDs for strains, but this information is rarely, if ever, present in NWB files.
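For context, here is a sketch of the kind of query this would enable, using the existing dandi Python client. The metadata path used here (`wasAttributedTo[].strain.identifier`) is an assumption based on the DANDI Schema, not something the archive populates today:

```python
from dandi.dandiapi import DandiAPIClient

TARGET_RRID = "RRID:IMSR_JAX:000664"  # C57BL/6J

with DandiAPIClient() as client:
    dandiset = client.get_dandiset("000000")  # placeholder dandiset ID
    for asset in dandiset.get_assets():
        meta = asset.get_raw_metadata()
        # assumption: strain RRIDs would live under wasAttributedTo[].strain.identifier
        for participant in meta.get("wasAttributedTo", []):
            if (participant.get("strain") or {}).get("identifier") == TARGET_RRID:
                print(asset.path)
```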
I would like to propose a new, optional attribute of `subject.strain` called `rrid` or `id`. This would be accessible in pynwb as `nwb.subject.strain_id` or something like that. Once we add this, we can add logic to the DANDI CLI to automatically pull the strain ID information and add it to the asset metadata.
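For illustration, the proposed usage might look something like this. This is purely hypothetical: neither the `strain_id` attribute nor the constructor argument exists in pynwb today, and the name is not settled.

```python
# Hypothetical -- "strain_id" is the proposed attribute, not current pynwb API.
subject = Subject(
    subject_id="sub-001",
    species="Mus musculus",
    strain="C57BL/6J",
    strain_id="RRID:IMSR_JAX:000664",  # proposed optional RRID for the strain
)
nwb.subject = subject
print(nwb.subject.strain_id)  # -> "RRID:IMSR_JAX:000664"
```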
This is what HERD (HDMF External Resources Data) is designed to do -- annotate arbitrary free-form text fields with CURIEs/URIs that point to entities in ontologies or other external resources and controlled vocabularies, in one centralized place.
Unfortunately, HERD integration with NWB is not yet finalized or polished. cc @oruebel. We need to finish this; specifically, these three PRs need to have their errors addressed and then be merged:
- https://github.com/NeurodataWithoutBorders/pynwb/pull/2111
- https://github.com/hdmf-dev/hdmf/pull/1292
- https://github.com/NeurodataWithoutBorders/nwb-schema/pull/646
If this is high priority, I can move it up the to-do list.
Here is an example of how to do it currently:
```python
from datetime import datetime

from pynwb import NWBFile, NWBHDF5IO
from pynwb.file import Subject
from pynwb.resources import HERD

nwb = NWBFile(
    session_description="a test NWB file",
    identifier="NWB123",
    session_start_time=datetime.now().astimezone(),
)

subject = Subject(
    subject_id="sub-001",
    age="10 months",
    description="A test subject",
    species="Mus musculus",
    strain="C57BL/6J",
)
nwb.subject = subject

# annotate the free-text strain field with its RRID via HERD
herd = HERD()
herd.add_ref(
    file=nwb,
    container=subject,
    attribute="strain",
    key="C57BL/6J",
    entity_id="RRID:IMSR_JAX:000664",
    entity_uri="https://www.jax.org/strain/000664",
)
nwb.link_resources(herd)

with NWBHDF5IO("strain_herd_example.nwb", mode="w", herd_path="strain_herd_example_herd.zip") as io:
    io.write(nwb, herd=herd)
# this creates 2 files:
# strain_herd_example.nwb - the NWB file
# strain_herd_example_herd.zip - a zip file containing 6 tsv files with the normalized HERD data
# these tables will be moved into the NWB file when
# https://github.com/NeurodataWithoutBorders/pynwb/pull/2111 is merged

io = NWBHDF5IO("strain_herd_example.nwb", mode="r", herd_path="strain_herd_example_herd.zip")
read_nwb = io.read()
print("Subject strain:", read_nwb.subject.strain)
print()

# show all the linked resources in the NWB file
print(read_nwb.get_linked_resources().to_dataframe())
print()

# get the entities for the subject's strain attribute
print(read_nwb.get_linked_resources().get_object_entities(file=read_nwb, container=read_nwb.subject, relative_path="strain"))
io.close()
```
Ideally it would look more like:
```python
nwb.external_resources.add_ref(
    container=subject,
    attribute="strain",
    key="C57BL/6J",
    entity_id="RRID:IMSR_JAX:000664",
    entity_uri="https://www.jax.org/strain/000664",
)

with NWBHDF5IO("strain_herd_example.nwb", mode="w") as io:
    io.write(nwb)

io = NWBHDF5IO("strain_herd_example.nwb", mode="r")
read_nwb = io.read()
print("Subject strain:", read_nwb.subject.strain)
print()

# show all the linked resources in the NWB file
print(read_nwb.external_resources.to_dataframe())
print()

# get the entities for the subject's strain attribute
print(read_nwb.external_resources.get_object_entities(file=read_nwb, container=read_nwb.subject, relative_path="strain"))
io.close()
```
OK, though in the last line it's a bit awkward to have `read_nwb` twice.
```python
print(read_nwb.external_resources.get_object_entities(file=read_nwb, container=read_nwb.subject, relative_path="strain"))
```
> `file=read_nwb, container=read_nwb.subject`
Are you referring to this? If so, since the read IO is by now cached on the container, I think we could allow `file` to have a default of `None` and use `container.read_io` by default.
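For illustration, the call from the example above could then drop the explicit `file` argument (hypothetical; this default does not exist yet):

```python
# today: file must be passed explicitly
read_nwb.external_resources.get_object_entities(file=read_nwb, container=read_nwb.subject, relative_path="strain")

# with the proposed default (hypothetical): file=None falls back to the
# read IO cached on the container
read_nwb.external_resources.get_object_entities(container=read_nwb.subject, relative_path="strain")
```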
`read_nwb` occurs 3 times: `read_nwb.external_resources.get_object_entities(file=read_nwb, container=read_nwb.subject, ...)`
@rly thanks for the example code. I can confirm this works on my end.
I am trying to build a pipeline where this information makes its way to DANDI asset metadata. I could imagine a path that works using the existing implementation, which stores the HERD information in external files. I suppose my issue now is that the DANDI CLI would need to know that the .zip file contains HERD contents. To proceed we could:
1. wait for the internal HERD tables to be implemented
2. fix the .zip HERD files to a specific naming convention like `annotations.herd.zip`
3. allow for an arg to the `dandi organize` command specifying the name of the HERD file
4. both 2 and 3 (see the sketch after this list)
5. maybe waiting for LinkML is an option? Does that offer yet another way to represent this information?
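To make options 2 and 3 concrete, here is a minimal sketch of how the CLI could locate a HERD sidecar under those rules. The function name and signature are assumptions; none of this exists in the actual DANDI CLI.

```python
from pathlib import Path

def find_herd_sidecar(dandiset_dir, explicit_name=None):
    """Hypothetical lookup for a HERD zip in a dandiset directory."""
    if explicit_name is not None:
        # option 3: the user names the HERD file via a CLI arg
        candidate = Path(dandiset_dir) / explicit_name
        return candidate if candidate.exists() else None
    # option 2: a fixed naming convention
    candidate = Path(dandiset_dir) / "annotations.herd.zip"
    return candidate if candidate.exists() else None

print(find_herd_sidecar(".", explicit_name="strain_herd_example_herd.zip"))
```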
@oruebel and I favor option 1. @yarikoptic favored internal HERD tables as well.
What if we want to annotate data that already exists on the archive? In that case, wouldn't it be better to have an external option?
Yes, we would still have an external option for that reason. However, zip files are currently not allowed to be uploaded to DANDI, and DANDI probably should not allow uploading arbitrary .zip files. Maybe DANDI could allow `*.herd.zip`, or a combination of `annotations.herd.zip` and `[nwb_file_stem].herd.zip`? The `[nwb_file_stem].herd.zip` option is useful in case users want to create a HERD per file instead of one for the whole dandiset. @yarikoptic What do you think?