inspire-next

CitationAnalysis: one example

Open ksachs opened this issue 6 years ago • 3 comments

I don't know what the underlying problem is in this case, so I can't describe it as a general error.

The Data

recid:1596893 has one reference entry in the PDF that covers 2 papers:

 [19] Geant4 collaboration, J. Allison et al., Geant4 developments and applications, IEEE Trans. Nucl. Sci. 53 (2006) 270; 
Geant4 collaboration, S. Agostinelli et al., Geant4: A simulation toolkit, Nucl. Instrum. Meth. A506 (2003) 250.

In legacy this is split into 2 references:

001596893 999C5 $$adoi:10.1109/TNS.2006.869826$$cGeant4 Collaboration$$djournal$$hJ. Allison$$o19$$sIEEE Trans.Nucl.Sci.,53,270$$y2006
001596893 999C5 $$adoi:10.1016/S0168-9002(03)01368-8$$cGeant4 Collaboration$$djournal$$hS. Agostinelli$$o19$$sNucl.Instrum.Methods Phys.Res., Sect.,A506,250$$y2003
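
To make the two legacy lines easier to compare, here is a minimal sketch that splits such a line into its `$$`-delimited subfields. The function name `parse_marc_line` is hypothetical, and the reading of the subfield codes (a = DOI, c = collaboration, h = author, o = label, s = pubnote, y = year) is only inferred from the two sample lines above, not taken from any specification.

```python
def parse_marc_line(line):
    """Split one legacy line into (recid, tag, subfields).

    Hypothetical helper: subfields maps each one-letter code to the
    list of values found after '$$<code>'.
    """
    head, _, body = line.partition("$$")
    recid, tag = head.split()
    subfields = {}
    for chunk in body.split("$$"):
        if chunk:
            code, value = chunk[0], chunk[1:]
            subfields.setdefault(code, []).append(value)
    return recid, tag, subfields


line = (
    "001596893 999C5 $$adoi:10.1016/S0168-9002(03)01368-8"
    "$$cGeant4 Collaboration$$djournal$$hS. Agostinelli$$o19"
    "$$sNucl.Instrum.Methods Phys.Res., Sect.,A506,250$$y2003"
)
recid, tag, sub = parse_marc_line(line)
print(sub["s"])  # the raw pubnote, commas included
```

Note that the `s` subfield of the second reference contains three commas, whereas the first reference (`IEEE Trans.Nucl.Sci.,53,270`) contains only two.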

The search rawref:"Nucl.Instrum.Methods Phys.Res., Sect.,A506,250" or doi:10.1016/S0168-9002(03)01368-8 returns exactly 1 record.

Current behavior

new-citations.tsv contains 2 corresponding lines:

1596893	715388	715388	{'record': {'$ref': 'http://localhost:5000/api/literature/715388'}, 'reference': {'authors': [{'full_name': 'Allison, J.'}], 'publication_info': {'page_start': '270', 'journal_title': 'IEEE Trans.Nucl.Sci.', 'year': 2006, 'journal_volume': '53', 'artid': '270'}, 'label': '19', 'dois': ['10.1109/TNS.2006.869826'], 'collaborations': ['Geant4 Collaboration']}, 'recid': 715388, 'curated_relation': False}
1596893	0	593382	{'reference': {'authors': [{'full_name': 'Agostinelli, S.'}], 'publication_info': {'page_start': 'A506', 'journal_title': 'Nucl.Instrum.Methods Phys.Res.', 'year': 2003, 'journal_volume': ' Sect.', 'artid': 'A506'}, 'label': '19', 'dois': ['10.1016/S0168-9002(03)01368-8'], 'collaborations': ['Geant4 Collaboration']}}
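
The publication_info in the second line is consistent with a pubnote split that assumes exactly three comma-separated fields (title, volume, page). Whether the actual parser works this way is an assumption; the sketch below only shows that such a split reproduces the observed values. Both function names are hypothetical.

```python
def naive_pubnote_split(pubnote):
    """Assumed faulty behaviour: take the first three comma-separated fields."""
    title, volume, page = pubnote.split(",")[:3]
    return {
        "journal_title": title,
        "journal_volume": volume,
        "page_start": page,
        "artid": page,
    }


def right_anchored_split(pubnote):
    """Split from the right, so commas inside the journal title stay in the title."""
    title, volume, page = pubnote.rsplit(",", 2)
    return {
        "journal_title": title,
        "journal_volume": volume,
        "page_start": page,
        "artid": page,
    }


pubnote = "Nucl.Instrum.Methods Phys.Res., Sect.,A506,250"
print(naive_pubnote_split(pubnote))   # volume ' Sect.', page 'A506', as in the TSV line
print(right_anchored_split(pubnote))  # volume 'A506', page '250'
```

The right-anchored variant only recovers the page. Normalising the title to `Nucl.Instrum.Meth.A` and the volume to `506` would need a journal-name lookup that is not attempted here.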

Expected behavior

The second line should be:

1596893	593382	593382	{'record': {'$ref': 'http://localhost:5000/api/literature/593382'}, 'reference': {'authors': [{'full_name': 'Agostinelli, S.'}], 'publication_info': {'page_start': '250', 'journal_title': 'Nucl.Instrum.Meth.A', 'year': 2003, 'journal_volume': '506', 'artid': '250'}, 'label': '19', 'dois': ['10.1016/S0168-9002(03)01368-8'], 'collaborations': ['Geant4 Collaboration']}}

I.e. the publication_info should be parsed as {'page_start': '250', 'journal_title': 'Nucl.Instrum.Meth.A', 'year': 2003, 'journal_volume': '506', 'artid': '250'}.

There should be 593382 instead of 0 in the second column of the second line. The record should be found via DOI despite the wrong publication_info.
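
The "found via DOI despite the wrong publication_info" expectation can be sketched as a matcher that tries the DOI first and only then falls back to publication_info. This is an illustration, not the labs matcher: `match_reference`, `doi_index` and `pubinfo_index` are hypothetical names, and the two dictionaries stand in for whatever lookup the real system uses.

```python
def match_reference(reference, doi_index, pubinfo_index):
    """Return the matched recid, or 0 if nothing matches (as in the TSV)."""
    # 1. A DOI hit wins, regardless of how publication_info was parsed.
    for doi in reference.get("dois", []):
        recid = doi_index.get(doi.lower())
        if recid is not None:
            return recid
    # 2. Fall back to (journal, volume, page).
    info = reference.get("publication_info", {})
    key = (
        info.get("journal_title"),
        info.get("journal_volume"),
        info.get("page_start"),
    )
    return pubinfo_index.get(key, 0)


# Hypothetical in-memory indexes holding only the record from this issue.
doi_index = {"10.1016/s0168-9002(03)01368-8": 593382}
pubinfo_index = {("Nucl.Instrum.Meth.A", "506", "250"): 593382}

# The reference as it currently appears in new-citations.tsv (mis-parsed).
reference = {
    "publication_info": {
        "page_start": "A506",
        "journal_title": "Nucl.Instrum.Methods Phys.Res.",
        "year": 2003,
        "journal_volume": " Sect.",
        "artid": "A506",
    },
    "label": "19",
    "dois": ["10.1016/S0168-9002(03)01368-8"],
}
print(match_reference(reference, doi_index, pubinfo_index))
```

With the DOI removed from `reference`, the same call returns 0, because the mis-parsed publication_info does not match any key.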

Comment

Are the metadata used on labs for the citation analysis those given in new-citations.tsv? I don't know whether this is just a single case or a more common mistake. Most of the 0's for citations that are in the INSPIRE index are due to several records matching one reference.
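
To check whether this is a single case, one could list every row of new-citations.tsv that has 0 in the second column while the third column holds a recid. The sketch below assumes four tab-separated columns with a Python-literal dict in the fourth, as in the two lines quoted above. The function name and the names `second` and `third` are mine, and the sample payloads are abbreviated.

```python
import ast
import csv
import io


def unmatched_rows(tsv_file):
    """Yield (citing_recid, third_column_recid, reference) for rows whose
    second column is 0 while the third column is non-zero."""
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 4:
            continue  # skip blank or malformed lines
        citing, second, third, payload = row[:4]
        if int(second) == 0 and int(third) != 0:
            yield int(citing), int(third), ast.literal_eval(payload)["reference"]


# Abbreviated stand-in for the two lines from this issue.
sample = (
    "1596893\t715388\t715388\t{'reference': {'label': '19'}, 'recid': 715388}\n"
    "1596893\t0\t593382\t{'reference': {'label': '19', "
    "'dois': ['10.1016/S0168-9002(03)01368-8']}}\n"
)
rows = list(unmatched_rows(io.StringIO(sample)))
for citing, third, reference in rows:
    print(citing, third, reference.get("dois"))
```

Rows where the reference carries a DOI but the second column is still 0 would be the ones most similar to this case.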

ksachs, Apr 09 '18 13:04

Are the metadata used on labs for the citation analysis those given in new-citations.tsv?

Yes

In this case, legacy wasn't able to recognize the reference for recid 593382; the labs reference matcher seems to be working fine.

StellaCh, Apr 18 '18 14:04

legacy@legacy is able to recognize the reference for recid 593382. As I said there is exactly one record matching.

Maybe the parsing error is what causes the legacy re-indexing of that messed-up info to fail. If you want to compare legacy and labs performance, you have to get both right.

ksachs, Apr 23 '18 10:04

@salmanmaq ?

StellaCh, Apr 24 '18 08:04