citeproc-rs

Demo disambiguation favors disambiguate-add-names

Open fbennett opened this issue 5 years ago • 16 comments

In the current disambiguation demo, citeproc-rs is adding names where disambiguation can be done by adding names or initials to existing partners. Steps to reproduce:

  1. Align the names, changing "Amadeus" to "Ariadne" to yield:

    K. Camembert et al. / K. Camembert et al.

  2. Add a char to the first given name in the first item to yield:

    Kurtz Camembert et al. / Kurt Camembert et al.

  3. Add a char to the last family name in the first item.

Expected Further disambiguation is unnecessary, and with the style set to et-al-use-first="1" and et-al-min="1" or et-al-min="2", a single name should be preferred. The result should remain:

Kurtz Camembert et al. / Kurt Camembert et al.

Actual Result The first given name is reverted to an initial, and two names are added:

K. Camembert, A. Rossi, I. Irrelevantz / K. Camembert, A. Rossi, I. Irrelevant

fbennett avatar Feb 21 '20 23:02 fbennett

It favours disambiguate-add-names because the spec demands adding names first, and stopping if that works. In this case, it does. “Adding names” here does not start with the previous attempt at disambiguation (i.e. Kurt/Kurtz); it starts from zero every time, which is the only deterministic approach. So it looks like this for the first reference:

K. Camembert et al. (ambiguous, of degree 2, i.e. could have been produced by either citekey/citekey2)

Perform step 1 of disambiguation, add names until no further names yield a reduction in degree. It expands out to

K. Camembert, A. Rossi, I. Irrelevantz (unambiguous, degree 1)

And stops as successful. The same happens to the second cite, but it resolves to the second reference/citekey2 without the z. If adding names does not reduce the degree, then the name count bumper is reset to the last value that reduced it, or 0 if it was never reduced. Hence disambiguate-add-givenname can start with any number of additional names, as long as adding them reduced the degree.
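The add-names pass described above can be sketched roughly as follows. This is only an illustration under simplifying assumptions, not citeproc-rs's actual code: `render`, `degree` and `add_names` are hypothetical helpers, a reference is reduced to a list of (given, family) pairs, every name is rendered as an initial plus family name, and the given name "Ira" is made up because the thread only shows the initial.

```python
from typing import List, Sequence, Tuple

Name = Tuple[str, str]  # (given, family); a toy model of a reference's authors


def render(ref: Sequence[Name], shown: int) -> str:
    """Render the first `shown` names as initials, adding 'et al.' if truncated."""
    parts = [f"{given[0]}. {family}" for given, family in ref[:shown]]
    text = ", ".join(parts)
    return text + " et al." if shown < len(ref) else text


def degree(ref: Sequence[Name], shown: int, library: List[Sequence[Name]]) -> int:
    """Count the references that would produce the same rendered cite."""
    target = render(ref, shown)
    return sum(1 for other in library if render(other, shown) == target)


def add_names(ref: Sequence[Name], library: List[Sequence[Name]], base: int = 1):
    """Add names from scratch, keeping the last name count that reduced the degree."""
    best_shown, best_degree = base, degree(ref, base, library)
    for shown in range(base + 1, len(ref) + 1):
        if best_degree == 1:
            break  # already unambiguous, stop as successful
        current = degree(ref, shown, library)
        if current < best_degree:
            best_shown, best_degree = shown, current
    return best_shown, best_degree


# The two references from the demo after steps 1-3 ("Ira" is a placeholder).
KURTZ = [("Kurtz", "Camembert"), ("Ariadne", "Rossi"), ("Ira", "Irrelevantz")]
KURT = [("Kurt", "Camembert"), ("Ariadne", "Rossi"), ("Ira", "Irrelevant")]
LIBRARY = [KURTZ, KURT]

shown, remaining = add_names(KURTZ, LIBRARY)
print(render(KURTZ, shown), "| degree", remaining)
```

Adding the second name alone does not reduce the degree, so the count that sticks is the one that includes the third name, which is why both extra names appear in the output.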

(You’ll note we use different language to describe the process, and citeproc-rs doesn’t have “sets of ambiguous cites”, only sets of references that could have produced a cite. If you want me to expound a little, then I can, but I think it’s clear what’s going on in this example.)

cormacrelf avatar Feb 22 '20 01:02 cormacrelf

Sticking with Kurt/Kurtz is the expected behaviour if you have givenname-disambiguation-rule="all-names", and this works. This is because those given names have been globally expanded before add-names comes into play. But the demo and the default are by-cite, which means the givenname disambiguation only occurs in step 2 of the process described in the spec, and not globally or at any point before add-names.

cormacrelf avatar Feb 22 '20 01:02 cormacrelf

You're right that it follows the specification exactly, but it's not what a copy editor would expect (and not what citeproc-js currently produces), and I'm pretty sure that users will complain.

fbennett avatar Feb 22 '20 01:02 fbennett

Are you saying citeproc-js reverses the first two disambiguation passes, because it’s simply better?

cormacrelf avatar Feb 22 '20 02:02 cormacrelf

Yes, with by-cite it attempts given name addition/expansion before adding a name, to keep the cite as compact as possible.
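To make the difference in ordering concrete, here is a toy comparison of the two strategies. It is a sketch under assumptions, not code from citeproc-js or citeproc-rs: `render`, `is_unique` and `disambiguate` are hypothetical names, expansion is all-or-nothing for the names shown, and the given name "Ira" is invented for illustration.

```python
def render(ref, shown, expand_given):
    """Render `shown` names, with full given names or initials, plus 'et al.'."""
    parts = []
    for given, family in ref[:shown]:
        first = given if expand_given else given[0] + "."
        parts.append(f"{first} {family}")
    text = ", ".join(parts)
    return text + " et al." if shown < len(ref) else text


def is_unique(ref, library, shown, expand_given):
    """True if no other reference in the library renders the same way."""
    target = render(ref, shown, expand_given)
    return sum(render(o, shown, expand_given) == target for o in library) == 1


def disambiguate(ref, library, givenname_first):
    """Try candidate forms in order and return the first unambiguous one."""
    counts = range(1, len(ref) + 1)
    if givenname_first:
        # Compact-first order: expand given names before adding another name.
        candidates = [(n, e) for n in counts for e in (False, True)]
    else:
        # Spec order: exhaust add-names first, then try given names.
        candidates = [(n, False) for n in counts] + [(n, True) for n in counts]
    for shown, expand in candidates:
        if is_unique(ref, library, shown, expand):
            return render(ref, shown, expand)
    return render(ref, 1, False)  # still ambiguous; e.g. year-suffix would follow


KURTZ = [("Kurtz", "Camembert"), ("Ariadne", "Rossi"), ("Ira", "Irrelevantz")]
KURT = [("Kurt", "Camembert"), ("Ariadne", "Rossi"), ("Ira", "Irrelevant")]
LIBRARY = [KURTZ, KURT]

print(disambiguate(KURTZ, LIBRARY, givenname_first=True))
print(disambiguate(KURTZ, LIBRARY, givenname_first=False))
```

With given names tried first, the cite stays at a single name; with names added first, the cite grows to three names before given names are ever considered.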

fbennett avatar Feb 22 '20 02:02 fbennett

Only with by-cite? I don’t immediately see why the others wouldn’t also benefit, e.g. primary-name getting the second name Kurtzed. Right?

cormacrelf avatar Feb 22 '20 03:02 cormacrelf

By second name, do you mean the A.Rossi names in the example? My understanding has been that styles that apply primary-name disambiguation do that and only that. The ones that I have seen mostly have year-suffix as a fallback, in case the limited disambiguation by name fails to resolve everything.

fbennett avatar Feb 22 '20 03:02 fbennett

(Following up on the speculation above, I checked the CSL repo. Out of 217 styles that use primary-name, just one (swedish-legal.csl) uses it without year-suffix as a disambiguation fallback. There were previously two such styles, but the other (international-journal-of-spatial-data-infrastructures-research) was amended on August 22 last year specifically to add the attribute.)

fbennett avatar Feb 22 '20 03:02 fbennett

Ah yes, my mistake. I think you’re right: only by-cite needs it. In all other GNDRs, names that would have initials and/or given names added during cite disambiguation have already been added globally. I think this also means that the description of all-names as being for both name and cite disambiguation is redundant. If all names are as unique as possible already, you can’t disambiguate a cite by expanding them.

The upshot is you ~~can always choose exactly one~~ never have to do more than one of add-names and add-givennames, and you choose which based only on the GNDR: by-cite needs add-givennames, which includes an add-names pass; for the rest, add-givennames is guaranteed to do nothing, so you can skip it. This will make the program faster!
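As a rough sketch of that selection rule (the function name `cite_disambiguation_pass` is hypothetical, and this ignores whether the style actually enables disambiguate-add-names or disambiguate-add-givenname):

```python
# The five givenname-disambiguation-rule (GNDR) values defined by CSL.
GNDR_VALUES = (
    "all-names",
    "all-names-with-initials",
    "primary-name",
    "primary-name-with-initials",
    "by-cite",
)


def cite_disambiguation_pass(gndr: str) -> str:
    """Pick the single name-based cite disambiguation pass for a GNDR."""
    if gndr not in GNDR_VALUES:
        raise ValueError(f"unknown givenname-disambiguation-rule: {gndr!r}")
    if gndr == "by-cite":
        # Given names are only expanded per cite, so this pass is needed;
        # in the scheme discussed here it includes an add-names pass.
        return "add-givenname"
    # Given names were already expanded globally, so expanding them again
    # per cite cannot help; only add-names can still change anything.
    return "add-names"


for rule in GNDR_VALUES:
    print(rule, "->", cite_disambiguation_pass(rule))
```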

cormacrelf avatar Feb 22 '20 03:02 cormacrelf

I think that's right, yes. Impressive and handy demo, by the way.

fbennett avatar Feb 22 '20 03:02 fbennett

Thanks for this discussion folks! Did either of you happen to write tests for the disambiguation behavior?

bwiernik avatar Nov 26 '20 01:11 bwiernik

No but I'll take it

cormacrelf avatar Nov 27 '20 06:11 cormacrelf

Cool. Did you see the clarified spec linked above?

bwiernik avatar Nov 27 '20 13:11 bwiernik

https://gist.github.com/cormacrelf/84bc9592cd10602d05a52bed938adece

While I was at it I wrote a better tool to convert between the two test case formats in case anyone feels like going full YAML anytime. I'll publish that tomorrow, it does have a caveat where it writes out fully parsed names and also won't handle ALL the weird sections/modes, but at least it does multiline strings for the csl: | field. I think you guys have some new sections (VERSION? I thought CSL had a version field for exactly that! 😄 ) that you can figure out.

cormacrelf avatar Nov 27 '20 18:11 cormacrelf

One note is that (it may be my setup that's wrong, but) citeproc-js fails that test by not writing out the "et al.". Source is in https://github.com/cormacrelf/citeproc-rs/tree/master/citeproc-js-runner; I already updated my citeproc/citeproc-test-runner packages.

cormacrelf avatar Nov 27 '20 18:11 cormacrelf

There is nothing wrong with your setup, and the test fails here as well. At first I thought, "WTF?" but on a closer look I realized it's caused by et-al-min="1"/et-al-use-first="1". With et-al-min="2" the logical outcome is the same, but "et al." is included. It's intentional, and me being an off-spec bad boy again. Suppressing "et al." in the former case was a hack to support bibliographies of individual members of staff. It should not be default behavior, and the test should pass. Thanks for flagging it.
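For reference, the on-spec truncation rule can be sketched like this. It is a simplified illustration, not citeproc-js code: `render_names` is a hypothetical helper, names are pre-rendered strings, and the check that truncation actually drops a name before "et al." is appended is an assumption of this sketch.

```python
def render_names(names, et_al_min, et_al_use_first):
    """Truncate a name list with 'et al.' once it reaches et-al-min names."""
    truncates = et_al_use_first < len(names)  # assumption: something must be cut
    if len(names) >= et_al_min and truncates:
        return ", ".join(names[:et_al_use_first]) + " et al."
    return ", ".join(names)


names = ["K. Camembert", "A. Rossi", "I. Irrelevant"]
print(render_names(names, et_al_min=1, et_al_use_first=1))
print(render_names(names, et_al_min=2, et_al_use_first=1))
```

Under this rule et-al-min="1" and et-al-min="2" give the same output for a three-name list, "et al." included, which is what the test expects.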

fbennett avatar Nov 28 '20 00:11 fbennett