inspire-next
CitationAnalysis: one example
I don't know what the underlying problem is in this case, so I can't describe a general error.
The Data
recid:1596893 has one reference in the pdf for 2 papers:
[19] Geant4 collaboration, J. Allison et al., Geant4 developments and applications, IEEE Trans. Nucl. Sci. 53 (2006) 270;
Geant4 collaboration, S. Agostinelli et al., Geant4: A simulation toolkit, Nucl. Instrum. Meth. A506 (2003) 250.
In legacy this is split into 2 references:
001596893 999C5 $$adoi:10.1109/TNS.2006.869826$$cGeant4 Collaboration$$djournal$$hJ. Allison$$o19$$sIEEE Trans.Nucl.Sci.,53,270$$y2006
001596893 999C5 $$adoi:10.1016/S0168-9002(03)01368-8$$cGeant4 Collaboration$$djournal$$hS. Agostinelli$$o19$$sNucl.Instrum.Methods Phys.Res., Sect.,A506,250$$y2003
There is exactly 1 record matching either of these queries:
rawref:"Nucl.Instrum.Methods Phys.Res., Sect.,A506,250"
or
doi:10.1016/S0168-9002(03)01368-8
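For anyone who wants to inspect the legacy lines programmatically, here is a minimal sketch that splits a 999C5 line into its subfields. It assumes each subfield starts with `$$` followed by a one-character code, as in the two lines above; the function name `parse_marc_subfields` is made up for this example and is not part of inspire-next or legacy.

```python
def parse_marc_subfields(line):
    """Split a legacy 999C5 line into a {code: value} dict.

    Assumption: every subfield is introduced by '$$' plus a one-character code.
    A repeated code would overwrite the earlier value in this simple sketch.
    """
    # Drop everything before the first '$$' (recid and tag).
    _, _, payload = line.partition("$$")
    fields = {}
    for chunk in payload.split("$$"):
        if chunk:
            fields[chunk[0]] = chunk[1:]
    return fields


line = (
    "001596893 999C5 $$adoi:10.1016/S0168-9002(03)01368-8$$cGeant4 Collaboration"
    "$$djournal$$hS. Agostinelli$$o19"
    "$$sNucl.Instrum.Methods Phys.Res., Sect.,A506,250$$y2003"
)
ref = parse_marc_subfields(line)
print(ref["a"])  # doi:10.1016/S0168-9002(03)01368-8
print(ref["s"])  # Nucl.Instrum.Methods Phys.Res., Sect.,A506,250
```

Note that the `$$s` value of the second reference contains a comma inside the journal name, which the first reference does not.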
Current behavior
new-citations.tsv contains 2 corresponding lines:
1596893 715388 715388 {'record': {'$ref': 'http://localhost:5000/api/literature/715388'}, 'reference': {'authors': [{'full_name': 'Allison, J.'}], 'publication_info': {'page_start': '270', 'journal_title': 'IEEE Trans.Nucl.Sci.', 'year': 2006, 'journal_volume': '53', 'artid': '270'}, 'label': '19', 'dois': ['10.1109/TNS.2006.869826'], 'collaborations': ['Geant4 Collaboration']}, 'recid': 715388, 'curated_relation': False}
1596893 0 593382 {'reference': {'authors': [{'full_name': 'Agostinelli, S.'}], 'publication_info': {'page_start': 'A506', 'journal_title': 'Nucl.Instrum.Methods Phys.Res.', 'year': 2003, 'journal_volume': ' Sect.', 'artid': 'A506'}, 'label': '19', 'dois': ['10.1016/S0168-9002(03)01368-8'], 'collaborations': ['Geant4 Collaboration']}}
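I have not looked at the actual parser, so the following is only a guess at how the wrong publication_info could arise. A naive left-to-right split of the `$$s` value on commas reproduces exactly the fields seen above, because the journal name itself contains a comma. Both function names are hypothetical.

```python
def naive_pubnote_split(pubnote):
    # Hypothetical parser: split on every comma and take the first three pieces.
    parts = pubnote.split(",")
    return {
        "journal_title": parts[0],
        "journal_volume": parts[1],
        "page_start": parts[2],
    }


def right_anchored_pubnote_split(pubnote):
    # Split from the right, so commas inside the journal title are preserved.
    title, volume, page = pubnote.rsplit(",", 2)
    return {"journal_title": title, "journal_volume": volume, "page_start": page}


first = "IEEE Trans.Nucl.Sci.,53,270"
second = "Nucl.Instrum.Methods Phys.Res., Sect.,A506,250"

print(naive_pubnote_split(first))
print(naive_pubnote_split(second))
print(right_anchored_pubnote_split(second))
```

The naive split gives volume ' Sect.' and page 'A506' for the second reference, as in the tsv line. Splitting from the right yields volume 'A506' and page '250'. Normalizing that to 'Nucl.Instrum.Meth.A', volume '506' would still be a separate step that is not shown here.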
Expected behavior
The second line should be
1596893 593382 593382 {'record': {'$ref': 'http://localhost:5000/api/literature/593382'}, 'reference': {'authors': [{'full_name': 'Agostinelli, S.'}], 'publication_info': {'page_start': '250', 'journal_title': 'Nucl.Instrum.Meth.A', 'year': 2003, 'journal_volume': '506', 'artid': '250'}, 'label': '19', 'dois': ['10.1016/S0168-9002(03)01368-8'], 'collaborations': ['Geant4 Collaboration']}}
I.e. the publication_info should be parsed as
{'page_start': '250', 'journal_title': 'Nucl.Instrum.Meth.A', 'year': 2003, 'journal_volume': '506', 'artid': '250'}
and there should be 593382 instead of 0 in the second column of the second line. The record should be found via the DOI even if the publication_info is wrong.
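To make the expectation concrete, here is a rough sketch of a DOI-first lookup. It is not the labs matcher; `match_reference` and the two in-memory indexes are hypothetical stand-ins for whatever search the matcher really performs. The only point is that a unique DOI hit is accepted before publication_info is consulted.

```python
def match_reference(reference, doi_index, pubinfo_index):
    """Return the matched recid, or 0 if there is no unique match.

    doi_index maps lowercased DOIs to sets of recids; pubinfo_index maps
    (journal_title, journal_volume, page_start) to sets of recids.
    Both are assumptions made for this sketch.
    """
    # 1. Try the DOIs first; DOI names are case-insensitive.
    for doi in reference.get("dois", []):
        recids = doi_index.get(doi.lower(), set())
        if len(recids) == 1:
            return next(iter(recids))
    # 2. Fall back to publication_info only if no DOI matched uniquely.
    info = reference.get("publication_info", {})
    key = (info.get("journal_title"), info.get("journal_volume"), info.get("page_start"))
    recids = pubinfo_index.get(key, set())
    return next(iter(recids)) if len(recids) == 1 else 0


doi_index = {"10.1016/s0168-9002(03)01368-8": {593382}}
pubinfo_index = {("Nucl.Instrum.Meth.A", "506", "250"): {593382}}

# The reference as it currently appears in new-citations.tsv (wrong publication_info).
reference = {
    "publication_info": {
        "page_start": "A506",
        "journal_title": "Nucl.Instrum.Methods Phys.Res.",
        "journal_volume": " Sect.",
    },
    "dois": ["10.1016/S0168-9002(03)01368-8"],
}
print(match_reference(reference, doi_index, pubinfo_index))  # 593382
```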
Comment
Are the metadata used on labs for the citation analysis those given in new-citations.tsv? I don't know whether this is just a single case or a more common mistake. Most of the 0's for citations that are in the INSPIRE index are due to several records matching one reference.
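To check how common this is, one could list all rows that have 0 in the second column but still carry a DOI. This is a sketch only. It assumes that new-citations.tsv is tab-separated with four columns (citing recid, matched recid, expected recid, reference dict) and that the last column is a Python literal, as the two lines above suggest. `unmatched_with_doi` is a made-up name.

```python
import ast


def unmatched_with_doi(lines):
    """Yield (citing_recid, expected_recid, dois) for rows whose second column is 0."""
    for line in lines:
        if not line.strip():
            continue
        citing, matched, expected, payload = line.rstrip("\n").split("\t", 3)
        if matched != "0":
            continue
        # The last column looks like a Python dict repr, so parse it safely.
        reference = ast.literal_eval(payload).get("reference", {})
        dois = reference.get("dois", [])
        if dois:
            yield int(citing), int(expected), dois


rows = [
    "1596893\t715388\t715388\t{'reference': {'dois': ['10.1109/TNS.2006.869826']}, 'recid': 715388, 'curated_relation': False}",
    "1596893\t0\t593382\t{'reference': {'dois': ['10.1016/S0168-9002(03)01368-8']}}",
]
for hit in unmatched_with_doi(rows):
    print(hit)
```

With a real file the same generator could be fed from `open("new-citations.tsv")`. Rows where several records match one reference would still need to be separated out by hand.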
Are the metadata used on labs for the citation analysis those given in new-citations.tsv?
Yes
In this case, legacy wasn't able to recognize the reference for recid 593382; the labs reference matcher seems to be working fine.
legacy@legacy is able to recognize the reference for recid 593382. As I said there is exactly one record matching.
Maybe it is the parsing error that causes the legacy re-indexing of that messed-up info to fail. If you want to compare legacy and labs performance, you have to get both right.
@salmanmaq ?