"type": null isn't valid csl

Open urspx opened this issue 4 years ago • 4 comments

Similar to #110, or rather concerning how that was solved – according to the CSL-JSON docs, the type field is

Required. The type field is a simple field containing a string value. CSL-JSON constrains the possible for values of the type field to a limited set of possible values (e.g., “book” or “article”). The type must be a valid CSL type under the schema of the installed style. See the schemata of CSL and CSL-M for their respective lists of valid types.

In terms of potential workflows, I've noticed that some APA7-style references processed into CSL-JSON on anystyle.io therefore fail to import into Zotero, which complains that a csl-json file containing "type": nulls is "not a supported format". Crudely replacing all of those nulls with "book"s fixes that problem, though.

(Also, thank you for the very valuable software!)

urspx avatar Jul 20 '20 22:07 urspx

Can you post some example references where no type is detected?

inukshuk avatar Jul 21 '20 07:07 inukshuk

Sure, here are some examples:

Department for Digital, Culture, Media & Sport (DCMS). (2019). Online Harms White Paper. https://www.gov.uk/government/consultations/online-harms-white-paper

Doteveryone. (2018). People, Power, and Technology - The 2018 Digital Understanding Report. https://www.doteveryone.org.uk/report/digital-understanding/

Doteveryone. (2018). People, Power, and Technology - The 2020 Digital Attitudes Report. https://www.doteveryone.org.uk/report/peoplepowertech2020/.

Forbrukerrådet. (2018). Deceived by Design: How tech companies use dark patterns to discourage us from exercising our rights to privacy. https://fil.forbrukerradet.no/wp-content/uploads/2018/06/2018-06-27-deceived-by-design-final.pdf

Hargittai, E. (2001). Second-level digital divide: Mapping differences in people's online skills. arXiv. https://arxiv.org/abs/cs/0109068

ICO. (2019). Adtech - Market Research Report. https://ico.org.uk/media/about-the-ico/documents/2614568/ico-ofcom-adtech-research-20190320.pdf

ITU. (2019). ITU-D Digital Inclusion. https://www.itu.int/en/ITU-D/Digital-Inclusion/Pages/default.aspx

Web Foundation. (2020). The web can help more in the fight against Covid-19. Here’s what we must do [Blog post]. https://webfoundation.org/2020/03/the-web-can-help-more-in-the-fight-against-covid-19-heres-what-we-must-do/.

Perhaps it would make sense to have some kind of way to force or override what gets outputted as 'type'? Although either way, having "type": null is never valid according to CSL, so it should never happen, right?

Since the type field should, if I understand it correctly, contain one of these types, perhaps "document" is the most general fallback option?

urspx avatar Jul 22 '20 09:07 urspx

Also, if I add [Report] to all except the Hargittai and Web Foundation entries, like so:

Department for Digital, Culture, Media & Sport (DCMS). (2019). Online Harms White Paper [Report]. https://www.gov.uk/government/consultations/online-harms-white-paper

Doteveryone. (2018). People, Power, and Technology - The 2018 Digital Understanding Report [Report]. https://www.doteveryone.org.uk/report/digital-understanding/

Doteveryone. (2018). People, Power, and Technology - The 2020 Digital Attitudes Report [Report]. https://www.doteveryone.org.uk/report/peoplepowertech2020/.

Forbrukerrådet. (2018). Deceived by Design: How tech companies use dark patterns to discourage us from exercising our rights to privacy [Report]. https://fil.forbrukerradet.no/wp-content/uploads/2018/06/2018-06-27-deceived-by-design-final.pdf

Hargittai, E. (2001). Second-level digital divide: Mapping differences in people's online skills. arXiv. https://arxiv.org/abs/cs/0109068

ICO. (2019). Adtech - Market Research Report [Report]. https://ico.org.uk/media/about-the-ico/documents/2614568/ico-ofcom-adtech-research-20190320.pdf

ITU. (2019). ITU-D Digital Inclusion [Report]. https://www.itu.int/en/ITU-D/Digital-Inclusion/Pages/default.aspx

Web Foundation. (2020). The web can help more in the fight against Covid-19. Here’s what we must do [Blog post]. https://webfoundation.org/2020/03/the-web-can-help-more-in-the-fight-against-covid-19-heres-what-we-must-do/.

And then make sure to have the [Report] tagged as 'genre' in the token editor and the rest of the title as title, I can bring the "type": null entries down to two: Hargittai and Web Foundation. So differently formatted references do help, but:

  1. From my understanding of APA 7 one can't expect all reports / other grey literature to be labeled with [Report]
  2. That still doesn't address the parser ultimately producing invalid CSL

urspx avatar Jul 22 '20 10:07 urspx

CSL always expects a type, yes, but we try to guess a reference's type based on certain criteria. I think it's much cleaner to leave the type empty if a reference can't be classified than to use some fallback type, because this way it is very easy to add missing types as an extra step. In many cases it will be much easier to define a safe fallback type at that point, for example if you know what kind of references you're parsing or what styles are used; that is pretty much what you suggest with overriding the default type. Obviously, we can tweak and improve the type classification (which is trivial, as you can see here), but there will always be cases where no type can be detected. Imagine a reference which yields only a title, or only a year: such references don't make much sense, but they are possible, and I much prefer not to classify them rather than insisting, say, that the reference '2020' must be of type 'book'.
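
A minimal sketch of such an extra step, in Python rather than anything AnyStyle ships: it takes parsed CSL-JSON items and fills in a fallback wherever `type` is null or absent. The function name `fill_missing_types` and the sample data are hypothetical, and the `"document"` fallback is just the suggestion made earlier in this thread; whether a given importer accepts it depends on the CSL version it supports.

```python
import json

# Assumption: "document" as a generic fallback, as suggested earlier in the
# thread. Choose whatever type is safe for the references you are parsing.
FALLBACK_TYPE = "document"


def fill_missing_types(items, fallback=FALLBACK_TYPE):
    """Return copies of CSL-JSON items with a null or absent 'type' replaced."""
    fixed = []
    for item in items:
        item = dict(item)  # shallow copy so the input list is left untouched
        if not item.get("type"):
            item["type"] = fallback
        fixed.append(item)
    return fixed


# Hypothetical sample resembling parser output with one unclassified entry.
raw = """[
  {"title": "Online Harms White Paper", "type": null},
  {"title": "Some Book", "type": "book"}
]"""

items = fill_missing_types(json.loads(raw))
print(json.dumps(items, indent=2))
```

Entries that already have a type are left alone, so the step can run on the whole export before importing it elsewhere.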

That said, and looking at your references, yes reports (white paper, tech-report, etc.) are really difficult to detect properly because of internal conventions. It would be worthwhile to add a dedicated 'report' normalizer to add fields like 'genre'; and additional training data that is labelled consistently would help, too. How did you tag "Online Harms White Paper" for example? Looking at the way we're using 'genre' in the core data (just search for 'genre') I think it would make sense to label the whole segment as 'genre' and then we could add additional criteria to the type classifier: 'report' should already be detected there, but something like 'white paper' is missing.

Web Foundation, I think, is something we just need to add to the training data (a handful of references with [Blog post]) and a corresponding entry in the type classifier to detect them as 'website'. Would you like to add a few samples to the core data?

With Hargittai I don't think there is anything we can do really. It's author, year, title, arXiv link: there's no way to tell from this what kind of publication it is, right? (Other than looking up the relevant info at arXiv; but I'd say that's a valid post-processing approach anyway: use AnyStyle to detect individual fields, then use those to search databases for matching fully structured data.)
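
As a rough illustration of the first half of that post-processing idea, the sketch below pulls an arXiv identifier out of a parsed item's `URL` field so it could be used as a lookup key. The helper name `arxiv_id`, the regular expression, and the sample item are assumptions for illustration; the database lookup itself is left out.

```python
import re

# Assumption: arXiv links look like https://arxiv.org/abs/<id> or .../pdf/<id>.
ARXIV_URL = re.compile(r"arxiv\.org/(?:abs|pdf)/([^\s?#]+)", re.IGNORECASE)


def arxiv_id(url):
    """Return the arXiv identifier embedded in a URL, or None if there is none."""
    match = ARXIV_URL.search(url or "")
    if not match:
        return None
    ident = match.group(1)
    if ident.lower().endswith(".pdf"):
        ident = ident[:-4]  # drop a trailing ".pdf" from PDF links
    return ident.rstrip("/.")  # drop trailing slash or sentence punctuation


# Hypothetical parsed item; CSL-JSON stores the link under the "URL" key.
item = {
    "title": "Second-level digital divide",
    "URL": "https://arxiv.org/abs/cs/0109068",
    "type": None,
}

print(arxiv_id(item.get("URL")))
```

The identifier could then be passed to whatever metadata source you trust in order to fetch a fully structured record, including its type.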

inukshuk avatar Jul 22 '20 12:07 inukshuk