
Convert phylogenies taxa.csv to one overall mapping file

Open SimonGreenhill opened this issue 5 years ago • 14 comments

We should standardise the mapping system in the phylogenies directory to have an overall mapping of glottocode -> (soc_id, xd_id) as a single file.

So, all DPLACE society mappings are handled in ./phylogenies/taxa.csv, and within each phylogenies subdirectory taxa.csv maps the tree names (tip labels) to glottocodes only.

This means that changes to society :: glottocode mappings happen in one place, which cuts down the risk of changes to one dataset not being propagated to the others.

SimonGreenhill avatar Sep 30 '19 20:09 SimonGreenhill

That's mostly how it is now - we should just delete the soc_id and xd_id columns in taxa.csv. Right now, these two columns are more of a "checksum". The overall mapping of societies to xd_id and glottocode is not in a single file, though, but in datasets/*/societies.csv. This structure was chosen since it seemed to help with curation - where changes are typically isolated to one dataset at a time. I'd agree, though, that we might want to revisit that decision, since @kirbykat curates the societies in one aggregate spreadsheet anyway :)

xrotwang avatar Oct 01 '19 06:10 xrotwang

ok, so if we remove the soc_id and xd_id from ./phylogenies/*/taxa.csv, there will be no effect on loading etc.? Let's do that as step one.

I'm happy to keep the mappings in datasets/*/societies.csv as in our experience these are pretty stable, while the phylogenies directory keeps getting additions.

Perhaps we can enhance the check function to make sure there's a 1:1 mapping of glottocodes:soc_ids to flag any conflicts?

SimonGreenhill avatar Oct 01 '19 07:10 SimonGreenhill

Ok, will do that. I think the reason for having soc_id (and xd_id) in taxa.csv was that one could imagine having phylogenies with a taxon that we know maps to a particular society, i.e. we could have a "better" mapping than with just glottocodes. But since we only (ever?) have phylogenies based on linguistic data, mapping to anything other than glottocodes does not really make sense.

xrotwang avatar Oct 01 '19 08:10 xrotwang

Thanks! Yes, initially a per-phylogeny mapping made a lot of sense, but I think now it just risks letting incorrect/incomplete mappings percolate. If we do encounter a phylogeny with a 'better' mapping, we can revisit this then.

SimonGreenhill avatar Oct 01 '19 08:10 SimonGreenhill

Oh and there is an actual benefit of having soc_id in taxa.csv for checks: It makes it possible to detect cases where a glottocode may only have been changed in one place (which is possibly what you meant above?). But then, the mappings for phylogeny and for society need not be identical (but should be compatible), so checking for obsolete mappings to bookkeeping might be enough.

xrotwang avatar Oct 01 '19 08:10 xrotwang

yes -- I was thinking we should check that glottocode X is always mapped to soc_id Y and xd_id Z
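A minimal sketch of what such a check could look like, not the existing check function. It assumes the rows come from CSV files with columns named `glottocode`, `soc_id` and `xd_id`; the column names, the helper names `find_conflicts` and `read_rows`, and the IDs in the usage note are all illustrative assumptions.

```python
import csv
from collections import defaultdict


def find_conflicts(rows):
    """Return glottocodes that are mapped to more than one (soc_id, xd_id) pair.

    `rows` is any iterable of dicts with the (assumed) keys
    'glottocode', 'soc_id' and 'xd_id'.
    """
    seen = defaultdict(set)
    for row in rows:
        glottocode = (row.get("glottocode") or "").strip()
        if not glottocode:
            continue  # rows without a glottocode cannot conflict
        pair = (
            (row.get("soc_id") or "").strip(),
            (row.get("xd_id") or "").strip(),
        )
        seen[glottocode].add(pair)
    return {g: sorted(pairs) for g, pairs in seen.items() if len(pairs) > 1}


def read_rows(paths):
    """Yield the rows of several CSV files as dicts (hypothetical helper)."""
    for path in paths:
        with open(path, newline="", encoding="utf-8") as handle:
            yield from csv.DictReader(handle)
```

With made-up rows where `bbbb2222` appears once with `S2` and once with `S3`, `find_conflicts` reports only `bbbb2222`, together with both pairs, so the conflict can be resolved by hand.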

SimonGreenhill avatar Oct 01 '19 09:10 SimonGreenhill

During extensive work on Trinidat this past week I've stumbled across a number of dubious Glottocode mappings in various taxa.csv files - sometimes mapping phylogeny leaves to Glottocodes for subfamilies instead of languages or dialects, sometimes mapping to now retired Glottocodes. Is it safe for me to make changes to the individual taxa.csv files fixing these this afternoon, or should I just keep a list and wait until this issue has been solved?

lmaurits avatar Oct 04 '19 12:10 lmaurits

If you could batch these changes into one PR, that would be cool.

xrotwang avatar Oct 04 '19 12:10 xrotwang

Hi @lmaurits - I curate the xd_id to society, and xd_id to glottocode mappings. If you find things that don’t make sense, please let me know. However, as @xrotwang says, the mappings in the taxa.csv files haven’t been updated since ~2016. We are now (as I understand) mapping phylogeny tips to societies via glottocodes (i.e., we use the “glottocode-xd_id” and “xd_id-soc_id” mapping files).

To elaborate on what @xrotwang said about the taxa.csv files a few days ago, in case the history/logic is useful to understand:

The reason we originally mapped phylogeny tips to societies (and not just to glottocodes) is that there are cases where it makes sense to “prioritise” a particular society-tip match.

For example, if a phylogeny was built using a word list for a particular language “variety*” (information that is sometimes published with the phylogeny), that variety may be a better match for one of several societies within a d-place dataset that could technically be matched to the tip (e.g. one of several candidate matching societies within the EA, or within Binford). If those societies share a glottocode, and we use a tip->glottocode->xd_id->society mapping system, then all societies in a dataset with that glottocode will have an equal chance of being matched to the tip. (And, as far as I know, there would be no way to “note” the preference for one candidate society over another).

*sometimes this is an official glottolog dialect, but not always. For example, the variety is sometimes indicated by a community name, which happens to be the community to which the D-PLACE cultural data refer.
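To make the tip->glottocode->xd_id->society chain concrete, here is a small sketch under stated assumptions: the three lookup tables are plain dictionaries, and the function name and all IDs are invented for illustration, not taken from the D-PLACE scripts.

```python
def societies_for_tip(tip, tip_to_glottocode, glottocode_to_xd_id, xd_id_to_societies):
    """Follow tip -> glottocode -> xd_id -> societies and return all candidates.

    Every society reachable through the chain is returned; nothing in this
    scheme records a preference for one candidate over another.
    """
    glottocode = tip_to_glottocode.get(tip)
    xd_id = glottocode_to_xd_id.get(glottocode)
    return sorted(xd_id_to_societies.get(xd_id, []))


# Made-up data: two societies share one glottocode via the same xd_id.
candidates = societies_for_tip(
    "TipA",
    {"TipA": "aaaa1111"},
    {"aaaa1111": "xd1"},
    {"xd1": ["S2", "S1"]},
)
```

Here `candidates` holds both `S1` and `S2`, which is the situation described above: the chain alone cannot say which of the two is the better match for the tip.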

Another potentially relevant reason we listed candidate society matches in the taxa.csv files: sometimes a phylogeny tip is matched to a glottolog dialect that is not matched to a d-place society, but which is the sister of a dialect that is matched to a d-place society. In this case, we used the taxa.csv file to indicate that the tip should be matched to the sister-matched society, even if that was not a perfect match on the glottocode level.

If I remember right, an example of this second type of situation is that one of our phylogenies includes “Northern Haida” (a dialect) as a tip. While there are no cultural data for Northern Haida-speaking societies in d-place, there are for Southern Haida-speaking societies (also a dialect-level match). I mention this because I hope the current scripts take this scenario into account - it would be too bad to not match this dialect-level tip of the phylogeny to Southern Haida; i.e., to lose the potential to include this cultural data in an analysis where most other tips are linked to language-level glottocodes and thus to cultural data for any societies also matched to that language, or matched to one of that language’s child dialects.
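One way such a sister-dialect fallback could be expressed is sketched below. This is an assumption about how it might be done, not a description of the current scripts: `parent_of` is a hypothetical glottocode-to-parent lookup, and the glottocodes in the example are placeholders rather than the real Haida codes.

```python
def societies_with_sister_fallback(glottocode, parent_of, societies_by_glottocode):
    """Return societies for a glottocode, falling back to its sister languoids.

    If the glottocode itself has no societies, collect the societies of all
    other languoids that share the same parent in `parent_of`.
    """
    direct = societies_by_glottocode.get(glottocode, [])
    if direct:
        return sorted(direct)
    parent = parent_of.get(glottocode)
    if parent is None:
        return []  # no parent known, so no sisters to fall back on
    found = set()
    for other, other_parent in parent_of.items():
        if other_parent == parent and other != glottocode:
            found.update(societies_by_glottocode.get(other, []))
    return sorted(found)


# Placeholder codes: a northern and a southern dialect under one parent.
parent_of = {"nort0001": "pare0001", "sout0001": "pare0001"}
societies_by_glottocode = {"sout0001": ["S_south"]}
matched = societies_with_sister_fallback("nort0001", parent_of, societies_by_glottocode)
```

In this toy example the northern tip has no societies of its own, so `matched` contains the society attached to the southern sister dialect.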

I hope that makes sense (and is relevant).

kirbykat avatar Oct 04 '19 13:10 kirbykat

@xrotwang @SimonGreenhill - I think it would be great to replace the disparate language-> xd_id mapping files with a single file, as that is indeed how I maintain this info.

I think it still makes sense to keep society -> xd_id mappings in the individual society files, as these should almost never need to be updated (in theory only when a fundamental correction to cross-dataset mappings is needed - the xd_ids indicate approx. correspondences among societies in different datasets, and this is the reason they all are assigned the same dialect/language).

kirbykat avatar Oct 04 '19 13:10 kirbykat

Hi @kirbykat! I'm not working directly on anything to do with xd_ids (although I appreciate the extra insight into how things are linked up). Basically the majority of the unusual situations I am noting are where the tips of language phylogenies are mapped to glottocodes which are not either languages or dialects. This doesn't make sense for my purposes, and in some cases it seems like we really can get a language or dialect level mapping (e.g. the taxa.csv for the Honkola et al 2013 Uralic phylogeny currently maps the "Nenets" language to the glottocode for the Nenets subfamily, but Terhi's paper makes it very clear the data is for the Tundra Nenets language, which has its own glottocode). Has this sort of thing been done deliberately for principled reasons to do with society mappings? Or should I feel free to fix these when I can? I'm happy to request your approval on the PR so you can double check all the changes.
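For what it's worth, the kind of check I have in mind can be sketched roughly as follows. It assumes a lookup from glottocode to level ("family", "language" or "dialect") is available from somewhere, for example built from Glottolog data; the `taxon` and `glottocode` column names, the function name and the codes in the example are illustrative assumptions.

```python
def dubious_mappings(taxa_rows, levels):
    """List tips whose glottocode is not a known language or dialect.

    `taxa_rows` is an iterable of dicts with (assumed) keys 'taxon' and
    'glottocode'; `levels` maps glottocode -> level string. Codes missing
    from `levels` are reported as unknown, which would also catch retired
    codes if the lookup only contains current ones.
    """
    problems = []
    for row in taxa_rows:
        code = row["glottocode"]
        level = levels.get(code)
        if level is None:
            problems.append((row["taxon"], code, "unknown or retired"))
        elif level not in {"language", "dialect"}:
            problems.append((row["taxon"], code, level))
    return problems


# Placeholder codes, not real Glottocodes.
levels = {"fami0001": "family", "lang0001": "language"}
taxa_rows = [
    {"taxon": "TipA", "glottocode": "fami0001"},
    {"taxon": "TipB", "glottocode": "lang0001"},
    {"taxon": "TipC", "glottocode": "gone0001"},
]
report = dubious_mappings(taxa_rows, levels)
```

Here `report` flags `TipA` (mapped to a family-level code) and `TipC` (code not in the lookup), and leaves `TipB` alone.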

lmaurits avatar Oct 04 '19 13:10 lmaurits

We do make sure that the same xd_id for each society is always mapped to the same glottocode (at least in datasets/*/societies.csv). As I said above, xd_id in taxa.csv is/should be more or less ignored.

xrotwang avatar Oct 04 '19 13:10 xrotwang

> Basically the majority of the unusual situations I am noting are where the tips of language phylogenies are mapped to glottocodes which are not either languages or dialects. Has this sort of thing been done deliberately for principled reasons to do with society mappings? Or should I feel free to fix these when I can? I'm happy to request your approval on the PR so you can double check all the changes.

I think you can assume that any glottocode matches for any tip not currently linked to a D-PLACE society/xd_id have not been looked at closely (there is probably no principled reason for the match).

If there is a society-xd_id match (not sure if you have this info at your fingertips, but maybe if you are using the taxa.csv files), then more attention was likely paid to the glottocode assignment.

So, IMO you should feel free to change them (but @SimonGreenhill and @xrotwang have worked with these files more recently - I don't manage them). Feel free to tag me if you find something matched to D-PLACE that seems incorrect, or I can try to keep up with the changes!

kirbykat avatar Oct 04 '19 14:10 kirbykat

that first line should read *any glottocode matches

kirbykat avatar Oct 04 '19 14:10 kirbykat

The curation of cross-dataset IDs has basically stopped for the time being. If it is resumed at some point, it should result in a separate CLDF dataset, referring to the societies in the now separated D-PLACE constituent datasets.

xrotwang avatar Nov 21 '23 14:11 xrotwang