Assuming DOI is Scholarly Article
What is the issue? Not all DOIs refer to scholarly articles, but that appears to be the default assumption. Where the type is not explicitly stated, we could in some cases improve the data through title parsing.
There are DOIs for non-scholarly content, and while Crossref has a content-type field, it isn't always used. There are some content types and title patterns we could detect to make a better P31 "instance of" statement (a sketch of checking the Crossref type follows the example cases below).
Example Case: 10.1002/pssr.202570011 is just cover art, while the article with the same title associated with it is 10.1002/pssr.202400306
Example Case: 10.1161/CIRC.150.SUPPL_1.4144717 is an abstract published in a journal supplement, not a full article
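As a concrete starting point, here is a minimal sketch (TypeScript, assuming the public Crossref REST API and a global fetch) of reading the declared content type for a DOI. The function name and error handling are my own, and as the cover-art example shows, the returned type cannot be trusted on its own:

```ts
// Minimal sketch: ask the Crossref REST API for a DOI's declared content
// type. The endpoint and the message.type field are part of the public
// Crossref API; everything else here is illustrative.
async function getCrossrefType(doi: string): Promise<string | undefined> {
  const res = await fetch(`https://api.crossref.org/works/${encodeURIComponent(doi)}`);
  if (!res.ok) return undefined; // DOI not registered with Crossref, or API error
  const json: any = await res.json();
  // message.type is e.g. "journal-article"; per this issue it is not
  // always accurate, so it can only ever be one signal among several.
  return json?.message?.type;
}

// Example: the cover-art DOI above still reports a journal-article type,
// so the type field alone cannot rule out non-article content.
getCrossrefType('10.1002/pssr.202570011').then(console.log);
```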
Why is this a problem? The imported data isn't as specific as it could be, and in some cases it can be made more specific.
How could this be addressed? Mixed content types are still importable. Some can be detected with a title prefix (inconsistent, but it would apply to some cases; see the sketch after this list), while others are harder:
Publisher content could be verified by resolving the DOI and looking for certain text; some prefixes could also help in the detection, and where we don't find the detection reliable we could ask the user to verify and make corrections.
Graphics or other images, but I'm not sure which Q-item is the right fit.
-> an image item (Q135274959), differentiated from the actual article (Q135274964), which means more accurate results and associations.
"Abstract *:" where * could be anything could appear between the abstract string and colon. Q333291 (or a subclass)
-> Q137260266
"Book Review:" Q637866
Example Case: 10.1177/02656914241287107h
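A hedged sketch of how the title-prefix heuristics above could look; the regexes, interface, and function name are assumptions, and only the Q-items come from this thread. Cover art is deliberately absent because, per the example case above, it shares the article's title and would need DOI resolution instead:

```ts
// Illustrative sketch of the title-prefix heuristics. Returns a
// *suggestion* rather than a final statement, since the prefixes are
// inconsistent and real articles with such titles exist.
interface TypeSuggestion {
  qid: string;          // candidate value for P31 "instance of"
  needsReview: boolean; // prefixes are unreliable, so always flag
}

function suggestInstanceOf(title: string): TypeSuggestion | null {
  // "Abstract:", "Abstract 4144717:", "Abstract TP123:" etc. -- anything
  // may appear between "Abstract" and the colon.
  if (/^abstract\b[^:]*:/i.test(title)) {
    return { qid: 'Q137260266', needsReview: true };
  }
  if (/^book review:/i.test(title)) {
    return { qid: 'Q637866', needsReview: true };
  }
  return null; // no signal: keep the default, which the user can still check
}
```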
What are good places to discuss this?
I'm not sure how much of this can be done automatically in Citation.js. The cover picture example specifies a type of journal-article (which incidentally is invalid, but accounted for already), and I don't see anything conclusive in the data suggesting it's not a scholarly article (apart from perhaps the missing page field, but that's probably not a reliable heuristic). There have been several other avoidable problems in Crossref data (some because of publishers, some because of content negotiation) that made importing data difficult in the past, including botched metadata for almost 2 million Science articles (reviews, news, opinions, etc.). At some point, I think it's on the user to check the quickstatements instead, and therefore I do agree with flagging suspicious metadata and allowing the quickstatements to be edited before submitting.
However, directly fixing the data based on heuristics like the one you describe is probably better suited to the publisher-specific scrapers (though I'm not involved in that part of Scholia).
"Abstract *:" where * could be anything could appear between the abstract string and colon. Q333291 (or a subclass)
(Although there are many items like that in Wikidata, there is also a nontrivial number of actual articles with titles like that (Q108441393, Q111096924).)
I hear what you're saying about reliable heuristics for automatic correction. I just don't think we should treat an item as a scholarly article when a heuristic indicates a need for further scrutiny and confirmation as part of the quickstatement generation workflow.
I'm also saying these heuristics should inject some friction to prompt a review. If we are not confident, because a heuristic suggests a good chance of something else or metadata is missing, we shouldn't just generate a clean statement for quick import.
Instead, suggest the user review the specific field or statements often affected by that heuristic. This could be a warning or a highlight that forces a user review of the content.
"Abstract *:" - ensure the "instance of" really is a scholarly article; we could provide a link to the abstract types to review for the best fit.
"Book Review:" - ask the user to check the "instance of", as it may be Q637866.
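To make the friction concrete, here is a sketch of what flagged quickstatement generation could look like. The QuickStatement shape, the reviewNote field, and Q13442814 ("scholarly article") as the default are my assumptions; suggestInstanceOf refers to the earlier sketch:

```ts
// Hedged sketch: instead of emitting a clean statement when a heuristic
// fires, attach a review note so the UI can highlight the statement and
// require confirmation before export. The QuickStatement shape is invented.
interface QuickStatement {
  subject: string;      // e.g. "LAST" for a freshly created item
  property: string;     // "P31" = instance of
  value: string;        // Q-item
  reviewNote?: string;  // present => block one-click import, ask for review
}

function buildInstanceOfStatement(title: string): QuickStatement {
  const suggestion = suggestInstanceOf(title); // from the earlier sketch
  if (suggestion) {
    // Real articles with titles like "Abstract..." exist (see above), so we
    // only *suggest* and force a review rather than silently rewriting P31.
    return {
      subject: 'LAST',
      property: 'P31',
      value: suggestion.qid,
      reviewNote: `Title prefix suggests this may not be a scholarly article; check "instance of" (suggested: ${suggestion.qid}).`,
    };
  }
  // No heuristic fired: default to scholarly article (Q13442814) as today.
  return { subject: 'LAST', property: 'P31', value: 'Q13442814' };
}
```

The design choice here is that a heuristic never changes the statement silently: it only swaps in a suggestion plus a note, and anything carrying a note would have to be confirmed or edited by the user before submission.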