Spec language for multiple numbers
Currently, the specification provides for recognition only of a single number through a cs:number node:
http://citationstyles.org/downloads/specification.html#number
This is a proposal to permit multiple values on cs:number. Proposed language for the specification:
- //If a variable rendered with cs:number contains no characters other than numbers, at least one space separating each number, and optionally one or more connecting comma, hyphen or ampersand characters, the variable is treated as a list of numbers. In this case, the intervening punctuation is ignored, and the list is sorted and rendered with appropriate connecting punctuation (e.g. "4, 3 & 5" rendered with form="ordinal" becomes "3rd-5th").//
- //If the variable contains no numbers or no spaces, it is rendered verbatim.//
- //In all other cases, the first number encountered is used for rendering (e.g. "12th edition" becomes "12").//
This change would permit the use of plural-form labels with the variables edition, volume, and number.
- Bitbucket: https://bitbucket.org/bdarcus/csl-docs/issue/5
- Originally Reported By: Frank Bennett
- Originally Created At: 2010-11-20 07:57:15
A small adjustment to the proposal:
- If a variable rendered with cs:number contains no characters other than numbers, at least one space separating each number, and optionally one or more connecting comma, hyphen or ampersand characters, the variable is treated as a list of numbers. In this case, the intervening punctuation is ignored, and the list is sorted and rendered with appropriate connecting punctuation (e.g. "4, 3 & 5" rendered with form="ordinal" becomes "3rd-5th").
- If the variable contains (1) no numbers, or (2) no spaces, or (3) both numbers and spaces, plus characters that are not hyphen, ampersand or comma, then it is rendered verbatim (e.g. "1 vol + 1 CD" is rendered as "1 vol + 1 CD").
- In all other cases, the first number encountered is used for rendering (e.g. "12th edition" becomes "12").
Original Comment By: Frank Bennett
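A rough sketch of the first rule only (the number-list case) may help when discussing it. It is not taken from citeproc-js; the function names are hypothetical, the ordinal suffixes are English-only, and collapsing only runs of three or more consecutive numbers into a range is an assumption inferred from the "3rd-5th" example.

```python
import re

# Numbers separated by runs of spaces, commas, ampersands or hyphens,
# where every separator run contains at least one space (rule 1).
LIST_RE = re.compile(r"[0-9]+(?:[,&-]* [ ,&-]*[0-9]+)+")


def ordinal(n: int) -> str:
    """English ordinal form, e.g. 3 -> '3rd' (locale handling omitted)."""
    if 10 <= n % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def runs(numbers):
    """Group sorted, de-duplicated integers into runs of consecutive values."""
    grouped = []
    for n in sorted(set(numbers)):
        if grouped and n == grouped[-1][-1] + 1:
            grouped[-1].append(n)
        else:
            grouped.append([n])
    return grouped


def render_number_list(value: str, fmt=ordinal):
    """Return the rendered list, or None if the value is not a number list."""
    value = value.strip()
    if not LIST_RE.fullmatch(value):
        return None  # left to the verbatim / first-number rules
    numbers = [int(tok) for tok in re.findall(r"[0-9]+", value)]
    parts = []
    for run in runs(numbers):
        if len(run) >= 3:  # assumption: only runs of 3+ become a range
            parts.append(f"{fmt(run[0])}-{fmt(run[-1])}")
        else:
            parts.extend(fmt(n) for n in run)
    return ", ".join(parts)


print(render_number_list("4, 3 & 5"))      # 3rd-5th
print(render_number_list("12th edition"))  # None
```

Returning None for anything that is not a pure number list keeps this sketch out of the verbatim and first-number cases, which the other two rules decide.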
- Are en-dashes and em-dashes also recognized (in addition to hyphens)?
- What happens when you sort on variable values that are parsed as number lists? Does "4, 3 & 5" (parsed: "3-5") sort before "3, 4 & 6" (parsed: "3, 4, 6")?
- I don't understand your "(2) no spaces". In CSL 1.0, "12th" is parsed as "12" when cs:number is used, right? If so, I'd like to keep that behavior.
Original Comment By: Rintze Zelle
Re en- and em-dashes, what do you think? I'm open either way.
Re sorting, that's a good point. Using the first number encountered is probably adequate. I'm not sure whether numeric variables sort numerically in citeproc-js yet, actually, so that will be the first port of call.
Re (2) no spaces, that's a bad description. Should be something like "numbers separated solely by non-space characters".
Original Comment By: Frank Bennett
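To make the sorting suggestion concrete, here is a minimal sketch of a sort key that uses the first number encountered. It assumes "first number" means the first run of digits in the raw, unparsed value, and that values without any number sort last; neither detail is settled in this thread, and `number_sort_key` is a hypothetical name.

```python
import re


def number_sort_key(value: str):
    """Sort numerically on the first run of digits; non-numeric values go last."""
    match = re.search(r"[0-9]+", value)
    if match is None:
        return (1, 0, value)
    return (0, int(match.group()), value)


values = ["4, 3 & 5", "3, 4 & 6", "12", "first"]
print(sorted(values, key=number_sort_key))
# ['3, 4 & 6', '4, 3 & 5', '12', 'first']
```

Under this assumption "3, 4 & 6" sorts before "4, 3 & 5", because the key looks at the raw string rather than at the sorted list.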
Apparently citeproc-js already does sort numeric variables numerically, and it's not broken by multiple-value variables. Haven't checked the details, but it passes a test okay.
Original Comment By: Frank Bennett
As for the parsing of values with a single number, I think it makes sense to render a value like "4a-c" verbatim. What would be the best logic to detect something like this? Should we scan for hyphens (or dashes) between the number and the nearest space (if present), so we still parse "12th Yellow-tailed Woolly Monkey" as "12"?
Original Comment By: Rintze Zelle
Could do that. Have added to the test for inspection.
js/src/tip/tests/fixtures/local/number_EditionOrdinalWithMultiple.txt
Original Comment By: Frank Bennett
Based on Frank's proposal, maybe this would work?
If a variable displayed with cs:number contains both digit and non-digit characters, an attempt is made to extract the numeric data. If the variable contains multiple numbers that are separated by spaces (e.g. "2 4"), optionally with commas (e.g. "2, 4"), ampersands (e.g. "2 & 4") or hyphens/em-dashes/en-dashes (for number ranges, e.g. "2 - 4"), the numbers are extracted, sorted and rendered with connecting commas and hyphens in the selected form (e.g. "1, 4, 3 & 5" becomes "1st, 3rd-5th" when rendered with form="ordinal").
Variables that contain (1) no numbers (e.g. "first edition"), (2) a hyphen, em-dash or en-dash in a word containing at least one digit (e.g. "4a-c"), (3) two or more numbers without a separating space (e.g. "2a6") or (4) two or more numbers and any character other than digits, spaces, hyphens, em-dashes, en-dashes, ampersands or commas (e.g. "1 vol + 1 CD"), are not parsed and are rendered verbatim. In all other cases, the first number that is encountered is extracted (e.g. "12" for "12th edition").
Variables can be tested for numeric content with the is-numeric conditional, e.g. "12th edition" tests "true" whereas "third edition" tests "false" (see Choose).
Original Comment By: Rintze Zelle
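The four "verbatim" conditions in the proposal above can be read as a small decision procedure. The sketch below is one possible reading, not the citeproc-js implementation; `classify` and its three return labels are hypothetical names, and a "word" is assumed to mean a whitespace-delimited token.

```python
import re

DASHES = "-\u2013\u2014"  # hyphen, en-dash, em-dash

# Anything other than digits, spaces, commas, ampersands and dashes.
OTHER_RE = re.compile(rf"[^0-9 ,&{re.escape(DASHES)}]")


def classify(value: str) -> str:
    """Return 'verbatim', 'list' or 'first-number' for a cs:number value."""
    numbers = re.findall(r"[0-9]+", value)
    if not numbers:
        return "verbatim"  # (1) no numbers
    for word in value.split():
        if re.search(r"[0-9]", word) and any(d in word for d in DASHES):
            return "verbatim"  # (2) dash inside a word that has a digit
        if len(re.findall(r"[0-9]+", word)) > 1:
            return "verbatim"  # (3) two numbers without a separating space
    if len(numbers) > 1:
        if OTHER_RE.search(value):
            return "verbatim"  # (4) several numbers plus other characters
        return "list"
    return "first-number"


for sample in ["first edition", "4a-c", "2a6", "1 vol + 1 CD",
               "12th edition", "1, 4, 3 & 5"]:
    print(f"{sample!r}: {classify(sample)}")
```

Checking dashes per word rather than per value is what lets "12th Yellow-tailed Woolly Monkey" still fall through to the first-number case while "4a-c" stays verbatim.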
@fbennett, I assume things have changed a bit over time with regard to multiple-number-recognition. Is it much work for you to update us to the current status (or point me to the relevant tests)? (that is, if you think this should go into the spec for 1.0.1)
A little bit, but it does need a full description. The test linked above currently passes, and covers all the cases I could think of. I've dropped the idea of collapsing sequential numbers to a range, and of inserting commas and ampersands and whatnot; you basically get any credibly numeric string back with ordinalization (or affixes or whatever) applied to the numbers, with the original punctuation joins in place. If it doesn't look like a number, then is-numeric will test false in cs:if, and the string will return unchanged in cs:number.
I can try to write up a description sometime, if it will help.
@fbennett, I've looked at the unit test, and have an amended proposal for the CSL specification. There are some deviations from the test, but I tried to come up with the simplest rule set for the recognition of numbers via cs:number and the is-numeric conditional that still captures most of the behavior in your test. Also, I think the specification shouldn't concern itself with the rescue of crappy metadata. Finally, trying to recognize labels (e.g. "edition" in "2nd edition") seems like a bad idea because of potential localization issues and the risk of overcomplicating stuff.
So, my new rules are:
Variables can be tested for numeric content with the is-numeric conditional. Content is considered numeric if it solely consists of numbers. Numbers may have prefixes and suffixes ("D2", "2b", "L2d"), and may be separated by a comma, hyphen, or ampersand, with or without spaces ("2, 3", "2-4", "2 & 4"). For example, "2nd" tests "true" whereas "second" and "2nd edition" test "false" (see Choose).
If a variable is rendered with cs:number, has numeric content (as determined by the rules for is-numeric) and contains multiple numbers, the content is formatted as:
- numbers separated by a hyphen are stripped from intervening spaces ("2 - 4" becomes "2-4"). Numbers separated by commas receive a space after the comma ("2,3" and "2 , 3" become "2, 3"), while numbers separated by ampersands receive a space before and after the ampersand ("2&3" becomes "2 & 3").
- numbers with prefixes or suffixes are never ordinalized or rendered in roman numerals. Numbers without affixes are individually transformed ("2, 3" can become "2nd, 3rd", "second, third" and "ii, iii").
- cs:label renders the plural ("multiple") form of the term if it uses a number variable with numeric content and multiple numbers ("2nd & 3rd editions")
With these rules, I only get different results for the corner cases:
- Editions 1–6th --- ‘Editions 1 - 6’ (would become "Editions 1 - 6")
- 42nd edition --- ‘“42 editionX”’ (would become "“42 editionX”")
- 42nd–47th editions --- ‘“42 - 47 editionz”’ (would become "“42 - 47 editionz”")
- 12 13 edition --- ‘12 13’ (would become "12 13")
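For reference, the amended rules are compact enough to sketch in a few lines. This is an illustration under stated assumptions rather than the behavior of any existing processor: prefixes and suffixes are assumed to be ASCII letters, only the "numeric" and "ordinal" forms are shown, ordinals are English-only, and all function names are hypothetical.

```python
import re

NUM = r"[A-Za-z]*[0-9]+[A-Za-z]*"  # a number with optional letter affixes
SEP = r"\s*[,&-]\s*"               # comma, ampersand or hyphen, spaces optional
NUMERIC_RE = re.compile(rf"{NUM}(?:{SEP}{NUM})*")
JOIN = {",": ", ", "&": " & ", "-": "-"}  # normalized spacing per delimiter


def is_numeric(value: str) -> bool:
    """True if the content consists solely of (possibly affixed) numbers."""
    return NUMERIC_RE.fullmatch(value.strip()) is not None


def ordinal(n: int) -> str:
    if 10 <= n % 100 <= 20:
        return f"{n}th"
    return f"{n}{ {1: 'st', 2: 'nd', 3: 'rd'}.get(n % 10, 'th') }"


def render_number(value: str, form: str = "numeric") -> str:
    """Normalize delimiters; transform only numbers without affixes."""
    value = value.strip()
    if not is_numeric(value):
        return value  # non-numeric content is returned unchanged
    # The capturing group keeps delimiters: number, delimiter, number, ...
    tokens = re.split(r"\s*([,&-])\s*", value)
    out = []
    for i, tok in enumerate(tokens):
        if i % 2:
            out.append(JOIN[tok])
        elif form == "ordinal" and tok.isdigit():
            out.append(ordinal(int(tok)))
        else:
            out.append(tok)  # affixed numbers are left as they are
    return "".join(out)


def uses_plural_label(value: str) -> bool:
    """cs:label picks the plural term for numeric content with 2+ numbers."""
    return is_numeric(value) and len(re.findall(r"[0-9]+", value)) > 1


print(is_numeric("2nd"), is_numeric("2nd edition"))  # True False
print(render_number("2 , 3", "ordinal"))             # 2nd, 3rd
print(render_number("12 13"))                        # 12 13
```

Because a bare space is not an accepted delimiter here, "12 13" and "Editions 1 - 6" both fail `is_numeric` and come back unchanged, which matches the corner cases listed above.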
I don't know what I think of this proposal, but like the precise spec writing!
(In reply to Rintze M. Zelle's proposal of Wed, Apr 25, 2012, 12:31 PM.)
Finally chiming in, sorry for the delay. A few tiny niggles and one suggestion, but the simplicity is good, and I agree on letting bad data lie.
In the first bullet point, "stripped from intervening spaces" should be "stripped of intervening spaces". There might be a slight increase in clarity if "receive a space" were changed to "receive exactly one space" (I found myself skipping back to the input description to be sure that spaces were permitted in input).
Converting a hyphen to en-dash (or whatever the localized range delimiter is) is friendly and good, but there are cases in which an explicit hyphen is desired. I have implemented \- as an escape for that purpose. Not sure if you want that in the specification, but I offer it up for what it's worth.
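A minimal sketch of how such an escape could work, assuming the range delimiter is an en-dash and that the escape is a literal backslash followed by a hyphen. This is only an illustration, not the actual citeproc-js code, and `render_range_delimiters` is a hypothetical name.

```python
import re


def render_range_delimiters(value: str, range_delimiter: str = "\u2013") -> str:
    """Turn unescaped hyphens into the range delimiter; '\\-' stays a hyphen."""
    # Replace every hyphen that is not preceded by a backslash.
    converted = re.sub(r"(?<!\\)-", range_delimiter, value)
    # Drop the escape character so the explicit hyphen is rendered as-is.
    return converted.replace("\\-", "-")


print(render_range_delimiters("2-4"))     # 2–4 (en-dash)
print(render_range_delimiters(r"2\-4"))   # 2-4 (explicit hyphen kept)
```

A localized delimiter could be passed through the `range_delimiter` parameter instead of the default en-dash.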
I took into account Frank's comments and reworked the specification:
https://github.com/citation-style-language/documentation/commit/ed9c9ec1b5bd9d4b6de2f36d91a222831ecd1019
That's done, right? So shouldn't we close this? (If not, the 1.0 label is not really helpful here...)