
References converted from YAML to CSL JSON render the note variable differently when it contains a colon

Open maybegeek opened this issue 2 years ago • 20 comments

Hi there,

given the references in yaml (bjork.yaml):

---
references:
- id: theid
  author:
    - literal: Björk
  issued:
    - year: 2019
  note: >-
    Bla: Blupp ... Foo: Bar
  title: The Title
  type: motion_picture
...

and converting this to csl json with:

pandoc -s bjork.yaml -f markdown -t csljson -o bjork.json

I get bjork.json with:

[
  {
    "author": [
      {
        "literal": "Björk"
      }
    ],
    "id": "theid",
    "issued": {
      "date-parts": [
        [
          2019
        ]
      ]
    },
    "note": "Bla: Blupp … Foo: Bar",
    "title": "The Title",
    "type": "motion_picture"
  }
]

my mwe csl style is (test.csl):

<?xml version="1.0" encoding="utf-8"?>
<style xmlns="http://purl.org/net/xbiblio/csl" class="note" version="1.0" demote-non-dropping-particle="sort-only" default-locale="de-DE">

<info>
  <title>test</title>
  <title-short>test</title-short>
  <id></id>
  <author>
    <name>Hans Dampf</name>
  </author>
  <category citation-format="note"/>
  <category field="humanities"/>
  <summary>test</summary>
  <updated>2021-10-13T11:00:00+02:00</updated>
</info>

<citation>
  <sort/>
  <layout/>
</citation>

<bibliography>
  <sort></sort>
  <layout>
    <group suffix="." delimiter=" /|\ ">
      <text variable="title" font-style="italic"/>
      <date variable="issued" prefix=" (" suffix=")">
        <date-part name="year" form="long"/>
      </date>
      <names variable="author">
        <name/>
      </names>
      <text variable="note"/>
    </group>
  </layout>
</bibliography>
</style>

With this style I want to show the difference in how the note variable is rendered when the note value of the JSON reference contains a colon. The YAML reference is handled as expected.

my test.md file:

---
lang: de-DE
csl: test.csl
nocite: |
  @*
...

Test YAML and JSON with CSL

as command for pandoc I use:

pandoc test.md --citeproc --output=ref-json.htm -s --metadata title="test" --bibliography=bjork.json

and

pandoc test.md --citeproc --output=ref-yaml.htm -s --metadata title="test" --bibliography=bjork.yaml

For YAML I get:

The Title /|\ (2019) /|\ Björk /|\ Bla: Blupp … Foo: Bar.

The JSON reference gives:

The Title /|\ (2019) /|\ Björk.

If no colon is present, the note value gets output.

The YAML file produces the expected output; the CSL JSON file does not.

To make a long post longer: I use Zotero (BBT) and the extra field with some cheater syntax. The handling of the colon separator and of key: value pairs should already be finished by the time the YAML or JSON reference files exist.

thanks for looking into this, best regards

maybegeek avatar Oct 13 '21 17:10 maybegeek

Here's a short demonstration of the issue:

% pandoc -s -f csljson -t csljson
[
{ "id": "a",
  "note": "a: b c" }
]
^D
[
  {
    "a": "b c",
    "id": "a",
    "type": ""
  }
]

jgm avatar Oct 13 '21 20:10 jgm

Pure citeproc repro:

λ>  decode "[{\"id\":\"a\",\"note\":\"a: b c\"}]" :: Maybe [Reference (CslJson Text)]
Just [Reference {referenceId = ItemId {unItemId = "a"}, referenceType = "", referenceDisambiguation = Nothing, referenceVariables = fromList [(Variable "a",FancyVal (CslConcat (CslText "b") (CslConcat (CslText " ") (CslText "c"))))]}]

jgm avatar Oct 13 '21 20:10 jgm

OK, I see that this is due to the following code in Citeproc.Types (l. 872):

    | k == "note" = do
        t' <- parseJSON v
        let (kvs, rest) = parseNote t'
         in (if T.null rest
                then id
                else \(Reference i' t'' d' m') ->
                       Reference i' t'' d' (M.insert "note" (TextVal rest) m'))
             <$> foldM go (Reference i t d m) (consolidateNameVariables kvs)

where

parseNote :: Text
          -> ([(Variable, Text)], Text)
parseNote t =
  either (const ([],t)) id $
    P.parseOnly ((,) <$> P.many' pNoteField <*> P.takeText) t
 where
  pNoteField = pBracedField <|> pLineField
  pLineField = do
    name <- pVarname
    _ <- P.char ':'
    val <- P.takeWhile (/='\n')
    () <$ P.char '\n' <|> P.endOfInput
    return (Variable $ CI.mk name, T.strip val)
  pBracedField = do
    _ <- P.string "{:"
    name <- pVarname
    _ <- P.char ':'
    val <- P.takeWhile (/='}')
    _ <- P.char '}'
    return (Variable $ CI.mk name, T.strip val)
  pVarname = P.takeWhile1 (\c -> isLetter c || c == '-')

So, it's intentional. For background, see https://github.com/jgm/pandoc-citeproc/issues/192
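
To see why the note from the original report disappears, here is a minimal Python sketch of the logic in the parseNote excerpt above. It is an illustration only, not citeproc's code: the names parse_note, _line_field, _braced_field and _varname_end are made up for this example, lower-casing the variable name only approximates the case-insensitive CI.mk, and edge cases may differ from attoparsec's behaviour.

```python
from typing import List, Optional, Tuple

Field = Tuple[str, str]  # (variable name, value)


def _varname_end(text: str, pos: int) -> int:
    """Return the index just past a run of letters or hyphens starting at pos."""
    end = pos
    while end < len(text) and (text[end].isalpha() or text[end] == "-"):
        end += 1
    return end


def _line_field(text: str, pos: int) -> Optional[Tuple[Field, int]]:
    """Try to read `name: value` up to the end of the line (or of the input)."""
    end = _varname_end(text, pos)
    if end == pos or end >= len(text) or text[end] != ":":
        return None
    newline = text.find("\n", end + 1)
    stop = len(text) if newline == -1 else newline
    value = text[end + 1:stop].strip()
    next_pos = stop if newline == -1 else stop + 1  # skip the newline itself
    return (text[pos:end].lower(), value), next_pos


def _braced_field(text: str, pos: int) -> Optional[Tuple[Field, int]]:
    """Try to read `{:name: value}`."""
    if not text.startswith("{:", pos):
        return None
    start = pos + 2
    end = _varname_end(text, start)
    if end == start or end >= len(text) or text[end] != ":":
        return None
    close = text.find("}", end + 1)
    if close == -1:
        return None
    return (text[start:end].lower(), text[end + 1:close].strip()), close + 1


def parse_note(text: str) -> Tuple[List[Field], str]:
    """Collect leading fields; whatever follows the last field is the remainder."""
    fields: List[Field] = []
    pos = 0
    while pos < len(text):
        result = _braced_field(text, pos) or _line_field(text, pos)
        if result is None:
            break
        field, pos = result
        fields.append(field)
    return fields, text[pos:]
```

Under these assumptions, parse_note("Bla: Blupp … Foo: Bar") yields a single field named bla and an empty remainder. That would be consistent with the JSON output reported above, where nothing is left to print as note.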

jgm avatar Oct 13 '21 20:10 jgm

Also see https://citeproc-js.readthedocs.io/en/latest/csl-json/markup.html#cheater-syntax-for-odd-fields. If we haven't implemented this correctly, we can revisit.

jgm avatar Oct 13 '21 20:10 jgm

thanks @jgm , definitely shorter : )

but is it a bug?

https://github.com/citation-style-language/schema/issues/277

If there are still "old" YAML or CSL JSON files around in the future, perhaps there will need to be a way of splitting the key:value pairs embedded in note into the future custom fields?

What I do not understand is the difference in handling between YAML and CSL JSON at the moment.

maybegeek avatar Oct 13 '21 20:10 maybegeek

Maybe @denismaier or @bwiernik or @bdarcus can comment. From the linked issue, it sounds as if CSL has added support for something like

"custom": {"a": "one", "b": "two"}

but I'm not sure what version it's in, and I'm not sure whether this change is supposed to go along with no longer parsing fields in the "note" field...

jgm avatar Oct 13 '21 22:10 jgm

> I'm not sure what version it's in ...

It's in the 1.1 branch.

> and I'm not sure whether this change is supposed to go along with no longer parsing fields in the "note" field...

It's definitely intended as a better solution to the same requirement. But I don't think parsing the note field was anything "official"?

bdarcus avatar Oct 13 '21 23:10 bdarcus

Yeah, parsing from note was always a citeproc-js hack that we have wanted to phase out.

@jgm BetterBibTeX moves CSL fields out of note when it generates CSL JSON or YAML, so Zotero users have an option to get cleaned note fields when outputting to pandoc. I am not sure if RStudio does similar cleaning, but I think they might or at least would likely be responsive to adding that. Those would probably be the major ways that pandoc would encounter CSL variables in note. So, it might be possible to retire parsing note from pandoc or at least move it to an optional flag.

bwiernik avatar Oct 14 '21 01:10 bwiernik

> I am not sure if RStudio does similar cleaning, but I think they might or at least would likely be responsive to adding that. Those would probably be the major ways that pandoc would encounter CSL variables in note.

I'm not sure, but won't RStudio users create their csl json or yaml files using some external tool, e.g. Zotero? If so, then that wouldn't be much of an RStudio issue, right?

denismaier avatar Oct 14 '21 07:10 denismaier

Hi there,

if I'm allowed to sum up:

A

Running pandoc with a *.yaml or a *.json file of references should output the same result

for

note: >-
  Bla: Blupp ... Foo: Bar

and

"note": "Bla: Blupp … Foo: Bar",

B

An optional flag for converting and parsing

"note": "Bla: Blupp … Foo: Bar",

to custom key:values could be a way of handling old exports of BBT json/yaml to whichever new structure there will be.

?
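
Proposal B could be sketched outside pandoc as a small preprocessing step. The Python below is a hypothetical helper, not an existing pandoc or citeproc option: note_to_custom is a made-up name, the "custom" object shape follows the example quoted earlier in this thread, and the regular expression is a deliberate simplification (ASCII names only, no braced {:name: value} fields).

```python
import re
from typing import Any, Dict

# Simplified assumption: a field is a leading "name: value" line whose name
# consists of ASCII letters and hyphens only.
_FIELD_LINE = re.compile(r"([A-Za-z-]+):([^\n]*)(?:\n|\Z)")


def note_to_custom(item: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a CSL JSON item with leading note fields moved to "custom"."""
    result = dict(item)
    note = item.get("note")
    if not isinstance(note, str):
        return result

    fields: Dict[str, str] = {}
    pos = 0
    while pos < len(note):
        match = _FIELD_LINE.match(note, pos)
        if match is None:
            break
        fields[match.group(1)] = match.group(2).strip()
        pos = match.end()

    if not fields:
        return result  # nothing that looks like a field: leave the item alone

    # Keep any existing custom entries and add the extracted ones.
    result["custom"] = {**item.get("custom", {}), **fields}
    rest = note[pos:]
    if rest:
        result["note"] = rest  # free text after the fields stays in note
    else:
        del result["note"]
    return result
```

For example, an item whose note is "original-date: 1999" followed by a line "Seen in Berlin" would come out with "custom": {"original-date": "1999"} and the note reduced to "Seen in Berlin". Whether a processor then reads "custom" is a separate question that this sketch does not address.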

maybegeek avatar Oct 14 '21 11:10 maybegeek

@jgm RStudio's visual editor has native integration from Zotero and can create a bib, json, or YAML file as users add citations from Zotero or DOIs

bwiernik avatar Oct 14 '21 17:10 bwiernik

Not sure what is best here. The cleanest option would be to add support for custom and disable note parsing. This would have the disadvantage that some people's existing workflows may break, and in ways that aren't obvious to them (a pretty big disadvantage).

We could think about adding an option to disable note parsing (maybe checking a metadata field). But people would have to know about this to use it. It's only going to affect people who want to use colons in note fields, and such people aren't going to know about it, in most cases.
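
Since affected users are unlikely to know about the parsing, one way to find the entries at risk is to scan a CSL JSON bibliography for notes that start like a field. The following Python is a rough sketch based only on the parseNote excerpt quoted earlier in this thread; note_would_be_parsed and affected_ids are hypothetical names, and the check may not match citeproc's behaviour in every edge case.

```python
from typing import Any, Dict, List


def note_would_be_parsed(note: str) -> bool:
    """Guess whether a note starts with `name:` or `{:name: ...}`."""
    braced = note.startswith("{:")
    start = 2 if braced else 0
    end = start
    # A variable name is a run of letters or hyphens.
    while end < len(note) and (note[end].isalpha() or note[end] == "-"):
        end += 1
    if end == start or end >= len(note) or note[end] != ":":
        return False
    # A braced field additionally needs its closing brace.
    return "}" in note[end + 1:] if braced else True


def affected_ids(references: List[Dict[str, Any]]) -> List[str]:
    """Return the ids of references whose note looks like it starts with a field."""
    return [
        str(ref.get("id"))
        for ref in references
        if isinstance(ref.get("note"), str) and note_would_be_parsed(ref["note"])
    ]
```

Run over the list loaded from bjork.json (for instance with json.load), affected_ids would report theid, because that note begins with "Bla:".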

jgm avatar Oct 15 '21 16:10 jgm

Wish I knew how common it was for pandoc users to use this note-parsing trick, and how common it is to want to use a note field for other purposes.

jgm avatar Oct 15 '21 16:10 jgm

People using reference managers already have their tooling, and thereby the parsing, in Zotero or in Zotero with BBT, where they use the cheater syntax to store their key:value pairs.

If, on the other hand, someone writes the bibliographic data by hand in YAML or CSL JSON, they would expect to get their data as written. Writing Original date: in Zotero's extra field is not done by choice; writing into note in YAML/JSON would be by choice, since there one could write the actual CSL-usable key:value directly.

Parsing once, in the reference manager / Zotero export, would be enough. If one wanted to parse in pandoc again, we could make that optional ... -t csljson+custom -o happy-new.json or ... -t csljson+cheater -o happy-new.json perhaps?

As for the different output from YAML and JSON bibliography files that contain a colon, under the same CSL style: I was surprised that the same data (just in a different structure) resulted in different handling of the note.

tough choice : )

maybegeek avatar Oct 15 '21 16:10 maybegeek

I don't recall why we added custom to 1.1 and not 1.0.

Would adding to 1.0 help?

bdarcus avatar Oct 15 '21 17:10 bdarcus

I don't think there's a problem adding it to 1.0

bwiernik avatar Nov 12 '21 20:11 bwiernik

> I don't think there's a problem adding it to 1.0

Let's do it then?

bdarcus avatar Nov 12 '21 20:11 bdarcus

Following up on this: was "custom" ever added to 1.0?

jgm avatar Jul 28 '22 19:07 jgm

No; I just created a linked issue to make sure we don't cause any problems if we do.

I can't imagine we do, but just in case.

bdarcus avatar Jul 28 '22 20:07 bdarcus

Turns out it's been there for a while! Not sure how I missed that.

https://github.com/citation-style-language/schema/commit/fde9bd61264c9bfab71f95dba4dc4ddbf8158561

bdarcus avatar Aug 06 '22 19:08 bdarcus