Flagging and detection of questionable metadata patterns
Is your feature request related to a problem? Please describe.
I keep encountering issues with the import script or the metadata source. I would like a way to flag a specific DOI and to track metadata sources, so that entire DOI ranges could be flagged for review and their import paused if enough issues are flagged during import or while viewing content.
Describe the solution you'd like
A means of coordinating and flagging for anyone else working on the same DOI or DOI source, to prevent tools from creating a large import that then needs to be cleaned up.
When using the DOI lookup for a DOI that does not yet have a Q item, it would be great to be able to flag questionable metadata to prevent import until the metadata source or bug is addressed. This could be a simple flag that places the DOI in a queue, to be retried after human review of the flagged content.
Describe alternatives you've considered
Compile patterns of known issues and create rules that evaluate incoming content and flag matching metadata patterns.
Additional context
Math titles have many symbols and need more robust testing, so I'd flag those. If the issue is Scholia-specific, prompt for a bug report. If the issue is metadata-specific, we could auto-draft a report to the DOI metadata source using the email field pulled from the DOI Name Values page, i.e. https://doi.org/10.4016/, and could create a process to alert the reporter and give the contacted email address a link to import once the metadata issue is corrected. This may need a text field entry, or could use mailto with a subject/body template for the user to send. By building this flagging into Scholia we might be able to identify classes of issues to look for beyond those that have been reported.
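The mailto idea could look roughly like the sketch below. The function name, the wording of the template, and the way the contact address is obtained are all assumptions for illustration; nothing here is existing Scholia code.

```python
from urllib.parse import quote


def build_metadata_report_mailto(recipient, doi, problem):
    """Build a mailto: link with a prefilled subject and body.

    Hypothetical helper: `recipient` is assumed to have been looked up
    elsewhere (for example from the DOI Name Values page).
    """
    subject = f"Possible metadata issue for DOI {doi}"
    body = (
        "Hello,\r\n\r\n"
        f"While looking up https://doi.org/{doi} we noticed the following "
        f"possible metadata issue:\r\n\r\n{problem}\r\n\r\n"
        "Thank you for taking a look."
    )
    # Percent-encode everything in the header values, including "/" and "&",
    # so that the DOI and the free-text description cannot break the link.
    return (
        f"mailto:{quote(recipient, safe='@')}"
        f"?subject={quote(subject, safe='')}"
        f"&body={quote(body, safe='')}"
    )


link = build_metadata_report_mailto(
    "contact@example.org", "10.1000/example", "Author list contains 'et al.'"
)
```

The user would still review and send the message from their own mail client, so no backend is needed for this part.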
Some metadata issues to evaluate might be repetitive content, e.g. author name strings known to be problematic such as "et al." and "al et.", or the same name string repeated. Example issue found on a test:
We currently do not have any backend to store "problems".
Possibly some kind of test for the "et al." problem could be implemented more easily, with a warning. The same goes for repeated names.
I wonder how widespread the problem is.
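A warning-only check for the "et al." and repeated-name cases might be sketched as follows. The function name and pattern list are hypothetical and would need to be wired into wherever the author strings are available before import.

```python
import re
from collections import Counter

# Hypothetical pattern; extend it as new placeholder strings are reported.
ET_AL_PATTERN = re.compile(r"\b(et\s+al|al\s+et)\b", re.IGNORECASE)


def check_author_strings(authors):
    """Return a list of warning messages for a list of author name strings."""
    warnings = []
    for name in authors:
        if ET_AL_PATTERN.search(name):
            warnings.append(f"placeholder author string: {name!r}")
    # Compare names case-insensitively and with whitespace collapsed.
    normalized = [" ".join(name.lower().split()) for name in authors]
    for name, count in Counter(normalized).items():
        if count > 1:
            warnings.append(f"author string repeated {count} times: {name!r}")
    return warnings
```

Because it only returns warnings, it needs no backend: the lookup page could simply display them next to the import data.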
Yeah, I get how not having a backend complicates this request.
An alternative implementation would be to tell the user, from the DOI import page, how to create a bug report for a particular DOI, and to document where a user can go and how to add a new pattern to check if they wish to attempt a fix. I for one don't know where to add a pattern, as I'm not oriented to Scholia's code. A pointer from the DOI lookup page to documentation on how and where to add a pattern or check would be a helpful orientation for contributing fixes like these.
- For et al. / al et I found a handful - see #2722
- Others that might fit in the types of issues: #2720 #2723
- There are symbols and other import issues reported in titles in the bug list.
- Lots of issues with math symbols, especially from physics and math, for example APS Physical Review journals: 10.1103/PhysRev.132.1819
- https://doi.org/10.1103/PhysRevC.83.054605 is processed with spaces missing in the title: https://www.wikidata.org/wiki/Q63387978
- https://arxiv.org/abs/2105.06541 is processed with characters bounded by $: https://www.wikidata.org/wiki/Q136172654, similar to https://www.wikidata.org/wiki/Q136172659, while https://www.wikidata.org/wiki/Q136172633 imported without $.
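Some of the title problems listed above could be caught with simple heuristics. The sketch below is an assumption about what such checks might look like; the names and the length threshold are made up, and the checks will produce false positives (for instance a title that mentions two dollar amounts).

```python
import re

DOLLAR_MATH = re.compile(r"\$[^$]+\$")      # text bounded by $...$
LATEX_COMMAND = re.compile(r"\\[A-Za-z]+")  # e.g. \alpha, \mathrm
MAX_TOKEN_LENGTH = 30  # hypothetical threshold hinting at missing spaces


def check_title(title):
    """Return a list of warning messages for a title string."""
    warnings = []
    if DOLLAR_MATH.search(title):
        warnings.append("title contains $-delimited math")
    if LATEX_COMMAND.search(title):
        warnings.append("title contains a LaTeX command")
    longest = max((len(token) for token in title.split()), default=0)
    if longest > MAX_TOKEN_LENGTH:
        warnings.append(f"title has a {longest}-character token (missing spaces?)")
    return warnings
```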
I'm open to alternatives, but a feature that allows coordinated flagging without coding skills could broaden the identification of data issues.
Add flagging of titles that were modified to mark a work as retracted or withdrawn, and when detected use a statement instead of the modified title.
Examples: Title modified by adding the string "RETRACTED ARTICLE: ". https://scholia.toolforge.org/doi/10.1007/s40615-014-0010-x
Additionally, this is not flagged as retracted on lookup.
Title Modified by adding "WITHDRAWN—" see https://scholia.toolforge.org/work/Q128114984
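Detecting these prefixes and separating them from the title could be sketched like this. The prefix list only covers the two examples above plus a bare "RETRACTED", and the function name is hypothetical; how the detected status would be turned into a statement is left open.

```python
import re

# Upper-case notice word, then a colon, hyphen, en dash (U+2013)
# or em dash (U+2014) as separator.
NOTICE_PREFIX = re.compile(
    r"^\s*(RETRACTED ARTICLE|RETRACTED|WITHDRAWN)\s*[:\u2014\u2013-]\s*"
)


def split_notice_prefix(title):
    """Return (status, cleaned_title); status is None if no prefix is found."""
    match = NOTICE_PREFIX.match(title)
    if not match:
        return None, title
    status = match.group(1).split()[0].lower()  # "retracted" or "withdrawn"
    return status, title[match.end():]
```

The match is deliberately case-sensitive and requires a separator, so an ordinary title that merely starts with the word "Retracted" is left alone.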
Crosscheck of import data: Crossref display vs. JSON vs. import. In this case the JSON listed more authors than were displayed, while the Wikidata import did not get the first names. This is a metadata supplier or Crossref issue, but it is a data quality matter worth validating. I've reported this specific issue to Crossref, but it may call for validation checks to ensure author string imports are complete.
Pull the display authors and the JSON author strings, compare whether the lists have the same length, and flag the import if the lengths do not match, as a potential source issue?
https://www.wikidata.org/w/index.php?title=Q137188037&oldid=2437305526
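A completeness check along these lines could be sketched as below. It assumes the author entries come from the Crossref JSON as dictionaries with "given" and "family" keys, and that the names prepared for import are available as a plain list; the function name is hypothetical.

```python
def check_author_completeness(crossref_authors, imported_names):
    """Compare Crossref JSON author entries with the names prepared for import.

    `crossref_authors` is a list of dicts as found under "author" in the
    Crossref JSON; `imported_names` is a list of strings.
    """
    warnings = []
    if len(crossref_authors) != len(imported_names):
        warnings.append(
            f"author count mismatch: source has {len(crossref_authors)}, "
            f"import has {len(imported_names)}"
        )
    for position, author in enumerate(crossref_authors, start=1):
        # Only the family name is required, so a missing given name is
        # valid metadata, but it is still worth a human look before import.
        if author.get("family") and not author.get("given"):
            warnings.append(
                f"author {position} has no given name: {author['family']!r}"
            )
    return warnings
```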
According to Crossref, only the surname is required in the metadata, and the display discrepancy on their end is a known issue, but it highlights some assumptions that should be accounted for.
A much broader data handling issue than #2731 is a Crossref trove of unescaped items, "&amp;" and many more: https://search.crossref.org/search/works?q=%26amp%3B&from_ui=yes
Issue reported to Crossref, asking them at the very least to flag such records in the metadata JSON.
https://github.com/WDscholia/scholia/issues/2726#issuecomment-3607349056 Crossref has identified a parsing issue on their side related to escaping, and the fix will resolve some of the escape issues.
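Until the source data is fixed, leftover HTML entities could be detected, and optionally decoded, on the import side. This is a minimal sketch using Python's standard `html` module; the function names and the round limit are assumptions.

```python
import html
import re

# Named (&amp;), decimal (&#38;) and hexadecimal (&#x26;) character references.
ENTITY = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def find_escaped_entities(text):
    """Return the character references still present in `text`."""
    return [match.group(0) for match in ENTITY.finditer(text)]


def unescape_fully(text, max_rounds=3):
    """Decode entities repeatedly, to handle double escaping like &amp;amp;."""
    for _ in range(max_rounds):
        unescaped = html.unescape(text)
        if unescaped == text:
            break
        text = unescaped
    return text
```

Flagging with `find_escaped_entities` is the conservative option; decoding automatically risks altering a title that legitimately contains such a string.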
