
Creating an item for a DOI assumes English as a language

Open stuartyeates opened this issue 1 month ago • 7 comments

Describe the bug: The item https://www.wikidata.org/wiki/Q136927894 was created by entering the DOI into Scholia and letting it send me to QuickStatements to approve the output.

The crossref metadata https://api.crossref.org/works/10.54389/wjnt5711_29 doesn't claim that this is an English-language title, but now wikidata does.

stuartyeates avatar Nov 23 '25 08:11 stuartyeates

Is und (undetermined language) better?

larsgw avatar Nov 30 '25 13:11 larsgw

There are quite a few publishers that apparently do not send language info to Crossref or DataCite, many of which publish in English. In fact, half of the (6) test cases I've collected do not have language info. Is it a net benefit to set those to und too?

larsgw avatar Dec 02 '25 15:12 larsgw

@stuartyeates @brierjonOU any opinions?

larsgw avatar Dec 08 '25 18:12 larsgw

If the language is not provided, definitely default to undetermined language for the title.

It is more useful to do that than to default to English, given that DOI content is not exclusive to English. The simple solution would be to reflect what the metadata provider entered and use undetermined if nothing was provided. Cleanup and updates would then have a clear task (verify the language), and a bot would have a clear work set to follow up with an estimated language if the confidence is high enough. Not labeling undetermined language actually means a data quality issue should be created for this behavior, to identify any imports relying on an English assumption for review/cleanup.
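A minimal sketch of that simple solution, assuming the Crossref work record is already parsed into a dict with an optional `language` string and a `title` list, and that the output is a pipe-separated QuickStatements line using P1476 (title). The function names are hypothetical and not taken from Scholia's code; quote escaping in the title is left out.

```python
from typing import Any, Mapping, Optional

UNDETERMINED = "und"  # language code for "undetermined language"


def title_language(work: Mapping[str, Any]) -> str:
    """Return the language the metadata provider entered, else 'und'."""
    lang = work.get("language")
    if isinstance(lang, str) and lang.strip():
        return lang.strip().lower()
    return UNDETERMINED


def title_statement(work: Mapping[str, Any], item: str = "LAST") -> Optional[str]:
    """Build a monolingual title statement; None if the record has no title."""
    titles = work.get("title") or []
    if not titles:
        return None
    return f'{item}|P1476|{title_language(work)}:"{titles[0]}"'
```

With this shape, a record without `language` yields a `und:"..."` title, which gives later cleanup a clear set of statements to query for.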

A more complex solution might be to use either the browser's language estimation (or API/server side solution) as an intermediate step to assign a language and the confidence value for that match (note estimation and confidence values as a qualifier) or add a step to ask the user to select the language. Mozilla has a limited neural net model for the browser - https://mozilla.github.io/translations/ and there are other approaches.
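The estimation step could be gated roughly as follows. This is only a sketch: `LanguageGuess` stands in for whatever a browser or server-side detector returns, the 0.9 threshold is an arbitrary assumption, and the returned `source` label is a hypothetical hook for recording how the language was obtained.

```python
from typing import NamedTuple, Optional, Tuple


class LanguageGuess(NamedTuple):
    """Hypothetical detector output: a language code and a 0..1 confidence."""
    code: str
    confidence: float


def resolve_language(
    declared: Optional[str],
    guess: Optional[LanguageGuess],
    threshold: float = 0.9,  # assumed cut-off, to be tuned
) -> Tuple[str, str]:
    """Return (language code, source), preferring provider metadata."""
    if declared:
        return declared, "declared"
    if guess is not None and guess.confidence >= threshold:
        return guess.code, "estimated"
    return "und", "undetermined"
```

The `source` value (and, for estimates, the confidence) is what would be noted as a qualifier on the resulting statement.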

Related to this language handling: multi-language material and translations/associations (https://www.crossref.org/documentation/principles-practices/best-practices/multi-language/), i.e. looking for existing "hasTranslation" relations to associate the content being imported.

brierjonOU avatar Dec 08 '25 21:12 brierjonOU

If Crossref doesn't have the language and the Wikidata item for the journal has a single value for P407 (language of work or name), can we use that?
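As a sketch, the check could look like this, assuming the journal's P407 values have already been fetched as a list of QIDs (the fetching itself, and the function name, are assumptions and not shown):

```python
from typing import Optional, Sequence


def language_from_journal(journal_p407: Sequence[str]) -> Optional[str]:
    """Return the journal's language QID only when it is unambiguous."""
    distinct = set(journal_p407)
    if len(distinct) == 1:
        return next(iter(distinct))
    # No value, or several languages: do not guess.
    return None
```

For example, a journal with only Q1860 (English) would yield `"Q1860"`, while one listing both Q1860 and Q1321 (Spanish) would yield `None` and fall through to und.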

cheers stuart


stuartyeates avatar Dec 08 '25 21:12 stuartyeates

Nice. If we are relying on the journal language as the fallback, a qualifier would still be good to indicate, for provenance reasons, that it was inferred from that journal's P407.

brierjonOU avatar Dec 08 '25 22:12 brierjonOU

Maybe a reference with pr:P887 wd:Q105771503; pr:P518 wd:Q34770 (based on heuristic: inferred from DOI database lookup; applies to part: language)?

larsgw avatar Dec 09 '25 16:12 larsgw

Maybe a reference with pr:P887 wd:Q105771503; pr:P518 wd:Q34770 (based on heuristic: inferred from DOI database lookup; applies to part: language)?

If we're inferring the language from the journal then Q124004081 (inferred from publication venue/journal) seems more appropriate than the DOI lookup Q105771503.
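To make the two options concrete, here is a sketch that appends such a reference to a QuickStatements line. It assumes the pipe-separated command format, in which reference properties are written with an `S` prefix (so P887 becomes `S887`); the function name is hypothetical, and the QIDs are the ones mentioned in this thread.

```python
HEURISTIC_DOI_LOOKUP = "Q105771503"     # inferred from DOI database lookup
HEURISTIC_JOURNAL = "Q124004081"        # inferred from publication venue/journal
PART_LANGUAGE = "Q34770"                # applies to part: language


def with_language_reference(statement: str, inferred_from_journal: bool) -> str:
    """Append 'based on heuristic' and 'applies to part' as a reference."""
    heuristic = HEURISTIC_JOURNAL if inferred_from_journal else HEURISTIC_DOI_LOOKUP
    return "|".join([statement, "S887", heuristic, "S518", PART_LANGUAGE])
```

Keeping the heuristic as a parameter means the same helper covers both a language taken from the DOI metadata and one inferred from the journal's P407.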

There are cases where a journal provides the same DOI for multiple language versions, and I don't think that is reflected in the JSON. Example: https://www.wikidata.org/wiki/Q135261244 https://revistas.uned.ac.cr/index.php/revistacalidad/user/setLocale/es_ES?source=%2Findex.php%2Frevistacalidad%2Farticle%2Fview%2F4876 https://revistas.uned.ac.cr/index.php/revistacalidad/user/setLocale/en_US?source=%2Findex.php%2Frevistacalidad%2Farticle%2Fview%2F4876

The default metadata language would be correct,

brierjonOU avatar Dec 11 '25 20:12 brierjonOU