SLiM icon indicating copy to clipboard operation
SLiM copied to clipboard

[Feature request] record Individual.tag / Individual.tagF in treeseq individual metadata

Open clwgg opened this issue 2 years ago • 12 comments

Hi! The title pretty much says it all: I think it would be really cool if the tag/tagF properties of individuals would be propagated to the individual metadata in the tree sequence output. I don't know how feasible/difficult this is with how the metadata encoding is handled, but just wanted to put it out there. Please let me know what you think :)

clwgg avatar Jun 27 '22 18:06 clwgg

Yep, agreed. There's a lot of stuff that could be persisted, and we can't (i.e. won't :->) do it all, but persisting the tag/tagF values would probably be particularly useful for people. @petrelharp what do you think?

bhaller avatar Jun 27 '22 20:06 bhaller

I've wondered about this also! Maybe it's something we should discuss when discussing mutation metadata? In principle it wouldn't be hard to optionally include tag (optionally for everyone all at once), so that for instance when writing out the tree sequence there was an option like saveTags that if T would include the tag in the metadata schema. It'd complexify the code for reading in tree sequences a bit (check which schema is being used), but not too bad? We wouldn't have to do anything special on the pyslim side.

petrelharp avatar Jul 03 '22 14:07 petrelharp

Pondering this. So, the difficulty is that tag values are on most of SLiM's classes – not just Individual and Mutation, but all over the place. It would be weird to persist just some subset of those tag values (people would think they had all been persisted when only some actually had, and it would lead to bugs and complaints); but persisting them all requires that we find a spot in the .trees metadata for all of them, and support two different metadata formats for all of them (for backward compatibility, even if we don't support the no-tags metadata format for new saves). It's just a major headache, in terms of I/O code and file versioning and stuff. And it would still only take us a small fraction of the total distance; we still wouldn't be persisting values set with setValue(), for example, which are also supported on most classes. If we're going to do tag/tagF, maybe it would be best to do it for SLiM 4, and consider breaking backward compatibility. But SLiM 4 is already pretty much frozen for release (although nothing is really forcing it to be so), and pyslim has already been updated to match that putative SLiM 4 "final", and so forth. Thoughts?

bhaller avatar Jul 03 '22 15:07 bhaller

Huh. An alternative is to not do this at all, and provide an example of how people can write out arbitrary extra information into top-level metadata?

petrelharp avatar Jul 03 '22 18:07 petrelharp

Huh. An alternative is to not do this at all, and provide an example of how people can write out arbitrary extra information into top-level metadata?

That might be the best path. Does it solve the problem? Suppose a user wants to have a tag value associated with every individual kept by simplify – not just extant individuals, and not just individuals that have been explicitly remembered. That's quite hard to do now, isn't it? I don't know. This has been a problem forever, but it's also pretty hard to fix right. I'm not sure what to do about it.

bhaller avatar Jul 03 '22 18:07 bhaller

Right, good point. It would not save the tags for no-longer-extant individuals. It would let people save-and-restore state that depends on tags, though, which is a major use case.

Hm - and, people could stick a few floats' worth of information into spatial location, for no-longer-alive individuals, unless they're doing a 3d simulation.

On the tree sequence side, having different versions of the metadata that optionally have or don't have tags is no problem at all; no change needed (we'd just have to make sure to stick in the right metadata schema when we write out the tree sequence). So, the complexification would be all in the various metadata handling code in SLiM. Do-able, but I don't know if it's worth the bother? Other things seem more important, and once tag was in there, there'd be other information people want to get as well...

@clwgg - for your use case, how about sticking it in individual.z?

petrelharp avatar Jul 03 '22 20:07 petrelharp

individual.z only gets written out if the simulation is set to 3D, right? So you'd need to configure with an additional dimension, to get the storage you want to use. I don't think there would be any negative side effect to that; if interactions are configured "xy" they should just ignore the z. Kinda gross though. :->

Yes, saving tag values for extant individuals in top-level metadata certainly does get you something useful, even if it doesn't give you everything. Maybe it's good enough for the time being. I don't like the idea of putting in a half-solution that might turn out to the wrong path forward.

bhaller avatar Jul 03 '22 20:07 bhaller

It's always written out (and is 0.0 if the sim doesn't use a z), but you can't access it in SLiM if you don't set the simulation to 3D. It's a bit gross, but not too bad - you've even got an example of using z for a phenotype in the manual IIRC?

petrelharp avatar Jul 04 '22 14:07 petrelharp

It's always written out (and is 0.0 if the sim doesn't use a z), but you can't access it in SLiM if you don't set the simulation to 3D. It's a bit gross, but not too bad - you've even got an example of using z for a phenotype in the manual IIRC?

Huh – I thought the location data had 0/1/2/3 entries per individual, based upon the spatiality of the model. But it does look like the code always writes out all three. What am I remembering? Is there some other variable-number-of-entries column that I'm thinking of?

Yes, there's an example model that uses the third dimension to store phenotypes. It actually uses those in InteractionType, to get competition strength that depends on phenotype as well as spatial position; but if you made the interactions depend only on (x,y) then I guess you would indeed be able to use z for scratch storage that would be persisted. I'm not sure I like the idea of this becoming a standard/recommended technique, but it'll do for now. There's definitely a bug here that ought to be fixed, I'm just not sure when to fix it, or how far the fix should go. So I'm going to leave this issue open, but mark it long-term, since there's a workaround for you for now, @clwgg. Two, actually, since you could also throw the tag values into top-level metadata and restore them yourself. Sound good?

bhaller avatar Jul 04 '22 17:07 bhaller

Huh – I thought the location data had 0/1/2/3 entries per individual, based upon the spatiality of the model. But it does look like the code always writes out all three. What am I remembering? Is there some other variable-number-of-entries column that I'm thinking of?

The only variable-length thing we have is the number of SLiM-mutations in the mutation list for mutation metadata.

petrelharp avatar Jul 05 '22 14:07 petrelharp

The only variable-length thing we have is the number of SLiM-mutations in the mutation list for mutation metadata.

Weird – I could've sworn. Delusional I guess. :->

bhaller avatar Jul 05 '22 15:07 bhaller

Thanks for the discussion! I just wanted to say that putting the values into Individual.z works fine for my use case.

clwgg avatar Jul 05 '22 17:07 clwgg

OK, I pondered this and decided that this is a doc bug. Persisting tag information, and other such information, in the tree-sequence metadata is really quite easy. We just didn't document how to do it. I've remedied that with a revision to recipe 17.3 that shows the technique:

SLiM_Manual_17_3.pdf

I think this suffices to close the bug.

bhaller avatar Jul 26 '23 01:07 bhaller

I love it. I'd make a modification, though: currently you have

inds = sortBy(sim.subpopulations.individuals, "pedigreeID");
tags = rdunif(size(inds), 0, 100000);
inds.tag = tags;
metadataDict = Dictionary("tags", tags);

and so matching tags to individuals relies on the pedigreeID ordering. This doesn't provide any way to error-check, and also means that once the tree sequence is modified (eg simplified) the tags won't be match-able any more. It also requires knowing and being able to identify exactly which individuals (the Alive ones) SLiM is saving the info for.

So, instead I'd also store the pedigreeID of the saved individuals, so like

inds = sortBy(sim.subpopulations.individuals, "pedigreeID");
tags = rdunif(size(inds), 0, 100000);
inds.tag = tags;
metadataDict = Dictionary("tags", tags, "pedigreeIDs", inds.pedigreeID);

and also add to the read-back-in code something that checks whether the metadata IDs match the individual IDs. (However, this seems more important for downstream python analysis than the SLiM example shown.)

petrelharp avatar Jul 27 '23 18:07 petrelharp

OK, good points, will do.

bhaller avatar Jul 27 '23 22:07 bhaller

...I'd make a modification, though: ...

OK @petrelharp, here's a new draft for the section:

SLiM_Manual_17_3.pdf

I kept the previous version of the recipe as the "canonical" recipe, except for adding in a write of the pedigree ID vector to the metadata as suggested. In a "coda" to the recipe, I show how one could use that pedigree ID vector to correctly assign tag values back into individuals even if the extant individuals in the tree sequence have changed or been reordered. That version of the read code is substantially more complex, though, and much harder to understand, so I decided to keep the "canonical" recipe using the simple read code – which will suffice for 99.9% of users of this recipe, I suspect. I'd appreciate another read-through, especially of the new page-ish of stuff at the end.

Optimistically closing this issue again. :->

bhaller avatar Jul 29 '23 01:07 bhaller

Looks good; FYI the code could be a good bit simpler if we assume that all individuals in SLiM have recorded tags in metadata (but not necessarily vice-versa):

tags = metadataDict.getValue("tags");
ids = metadataDict.getValue("ids");
inds = sim.subpopulations.individuals;
matches = match(inds.pedigreeID, ids);
inds.tag = tags[matches];

Note the switch in order in the call to match( ).

petrelharp avatar Jul 29 '23 04:07 petrelharp

Yes; do you think that's a safe assumption? I thought you were saying before that you wanted the code to be rewritten to be robust against possible changes to the tree sequence. I guess that would usually involve removing individuals (simplification), but it could involve adding them, too, couldn't it? Or do you think that remote possibility should be disregarded, for purposes of greater clarity? It certainly does clean up the recipe.

bhaller avatar Jul 29 '23 14:07 bhaller

I think it's safe, as long as the possibility is mentioned. Workflows that add individuals seem less common. Gee, a consistency check could be added that raises an error if any of the matches are less than 0.

petrelharp avatar Jul 30 '23 19:07 petrelharp

OK, didn't add the consistency check; just explained the concepts in the text, which I think suffices. (The user presumably knows whether they have added new individuals or not. :->) Thanks for the review @petrelharp, and thanks @clwgg for the original nudge; I think this recipe will be a very useful addition!

bhaller avatar Aug 07 '23 02:08 bhaller