Multiple issues with analyses on countries
Hi!
While checking the results of the analyses produced for a database of 4872 documents, I noticed some inconsistencies in the country-based analyses.
- More countries are recognized in the analyses of most cited countries, corresponding authors, and the countries' collaboration network than in Countries' Scientific Production. In particular, Countries' Scientific Production is the only analysis that doesn't recognize Hong Kong. I found issues with other countries too, but I'll stick to HK for simplicity. Hong Kong is searchable in both the C1 and RP columns. I also assume that the most cited countries, countries' collaboration network, and Countries' Scientific Production analyses all retrieve their data from C1 (I'm certain for the first two because I checked the code, but I couldn't find the code for the last one). I therefore conclude that the problem lies in the list of countries embedded in the code: it must simply be different for most cited countries and for Countries' Scientific Production. I have no other explanation. What do you think? Where can I find the list of countries used in these functions, so I can check that the code recognizes them properly?
- A quick manual search in Excel, filtering on the C1 column, gave a significantly higher number of total citations for individual countries than the corresponding function in biblioshiny/bibliometrix does. When filtering on the RP column, the number is closer to the one produced by the package but still higher. I looked into the code and couldn't figure out why. Does this function look at the number of documents or of authors (i.e., the C1 or the RP column)? What could explain this inconsistency in the results?
Thank you very much in advance!
P.S. Thanks a lot for the amazing package!
A little update on the first issue
Manual experimentation with the data in Excel showed that:
- information for the Country production analysis is taken from C1 (addresses of all authors)
- Hong Kong is not on the list of countries inside the function, nor is it counted as China
- Taiwan, by contrast, is counted as China
- most of the manual searches in Excel produced the same result as biblioshiny, with only minor differences in the numbers for a few countries
- however, the results for the USA and the UK still differ a lot: manually I find almost 500 more counts for the USA and about 50 fewer for the UK than the analysis in biblioshiny does.
Conclusion: having a look into the list of countries and how they're identified in the database would help a lot but I can't find it myself anywhere. I would really appreciate it if you could share it.
The second issue is not solved yet either, but after the same manual experimentation with the data in Excel I'm now certain that it takes its information from the RP column.
An update on the second issue
I experimented with the database in Excel here too and came to the following conclusions:
- information is taken from RP column
- it doesn't double count (if more than 1 corr author is from one country the citations would be scored for this country only once)
- Hong Kong here is a separate country and is recognized, whereas Taiwan is still counted as China; Macau is not recognized by any of those analyses
- obviously, the code also won't recognize a country that is not delimited properly (e.g., UNIVERSITY OF TEHRANIRAN)
- if there are two or more corresponding authors from different countries, the citations are credited to the country that comes first in alphabetical order (e.g., if the corresponding authors are from Iran and Canada, in whichever order, Canada gets the citations from this paper, not Iran)
I find the last point to be a huge limitation of this package because, basically, the name of the country defines how many total citations it will score in the end (e.g., Austria vs. China or even more "unlucky" Turkey). Therefore, you can't rely on the produced results.
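To make the effect concrete, here is a minimal Python sketch of the behaviour I inferred from my Excel experiments. It is not the package's code: the function names and the toy data are my own assumptions, and it only mimics the rule "credit the alphabetically first corresponding-author country".

```python
from collections import defaultdict

def first_alphabetical_country(countries):
    """Return the country that sorts first alphabetically (the inferred rule)."""
    return min(countries)

def total_citations_by_country(papers):
    """Sum citations per country, crediting only the alphabetically first one."""
    totals = defaultdict(int)
    for corr_countries, citations in papers:
        totals[first_alphabetical_country(corr_countries)] += citations
    return dict(totals)

# Toy data (made up): (countries of the corresponding authors, times cited)
papers = [
    (["IRAN", "CANADA"], 40),
    (["TURKEY", "AUSTRIA"], 25),
    (["CHINA", "TURKEY"], 10),
]

totals = total_citations_by_country(papers)
print(totals)
```

In this toy example Turkey appears on two papers but ends up with no citations at all, purely because of where its name falls in the alphabet.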
Generally, I don't really understand the reason why only corresponding authors are included in this analysis. If the scientific productivity of countries includes all nationalities of all authors, why not do the same here (without double counting)? The fact that a corresponding author comes from the USA, for example, doesn't mean that all the citation scores should count only for the USA when other authors from other countries contributed to the research too.
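What I have in mind could look roughly like the following Python sketch. It is only an illustration of the counting rule I'm suggesting (every distinct author country is credited once per paper), with a hypothetical function name and made-up data; it says nothing about how the package is implemented.

```python
from collections import defaultdict

def citations_all_countries(papers):
    """Credit a paper's citations once to every distinct author country."""
    totals = defaultdict(int)
    for author_countries, citations in papers:
        # set() removes duplicates, so two authors from the same
        # country do not double count the paper for that country
        for country in set(author_countries):
            totals[country] += citations
    return dict(totals)

# Toy data (made up): (countries of all authors, times cited)
papers = [
    (["USA", "USA", "IRAN"], 30),
    (["CANADA", "IRAN"], 12),
]

result = citations_all_countries(papers)
print(result)
```

With this rule the per-country totals add up to more than the number of citations in the database, because each paper is counted in full for every contributing country. That is the same trade-off the country production analysis already makes.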
Further issues
- country production doesn't recognize Russia in "Russian Federation", Congo in "Democratic Republic of Congo" (and generally doesn't recognize Congo even when it's written simply as "Congo"), or Brunei in "Brunei Darussalam", whereas most cited countries recognizes them when they're written this way; "Viet Nam" is not counted as Vietnam by country production. It's really necessary that the list of countries embedded in the code of these functions be publicly available.
- most cited countries attributes the record "HUNTER, C.A.; NATIONAL RENEWABLE ENERGY LABORATORYUNITED STATES; EMAIL: CHAD[email protected]" to Chad
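My guess is that the Chad case comes from plain substring matching over the whole RP string, including the e-mail address. The Python sketch below shows that idea and one possible mitigation (dropping e-mail tokens before matching). Everything in it is an assumption for illustration: the tiny country list, the function names, and the e-mail address, which is a made-up placeholder because the real one is not visible in my export.

```python
import re

# Hypothetical, tiny country list just for this example
COUNTRIES = ["CHAD", "IRAN", "UNITED STATES"]

# Matches an optional "EMAIL:" label followed by a token containing "@"
EMAIL_PART = re.compile(r"(E-?MAIL:\s*)?\S+@\S+", re.IGNORECASE)

def strip_emails(address):
    """Remove e-mail addresses (and an 'EMAIL:' label) from the string."""
    return EMAIL_PART.sub("", address)

def match_countries(address, countries=COUNTRIES):
    """Naive rule: a country matches if its name occurs anywhere in the text."""
    text = address.upper()
    return [c for c in countries if c in text]

record = ("HUNTER, C.A.; NATIONAL RENEWABLE ENERGY LABORATORYUNITED STATES; "
          "EMAIL: CHAD.SOMEONE@EXAMPLE.ORG")  # placeholder e-mail

naive = match_countries(record)
cleaned = match_countries(strip_emails(record))
print(naive)
print(cleaned)
```

Note that requiring whole-word matches would not help here: "CHAD" is a whole word in the e-mail address, while "UNITED STATES" is glued to "LABORATORY" and would then be missed.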
Thanks a lot for this detailed report about country recognition. The country data we use to identify the countries of author affiliations is stored in the data.frame countries. You can access it using the following command: data("countries")
We are aware of some inconsistencies in the identification of some countries and are working to resolve them. Unfortunately, however, the country field does not exist in most databases and must be extracted from the strings through heuristic rules.
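As a rough illustration of what such heuristic rules involve, here is a Python sketch. It is not the code used in bibliometrix: the alias table, the known-country set, and the function name are all hypothetical, and it assumes the country is the last comma-separated part of the affiliation string.

```python
# Hypothetical alias table mapping name variants to one canonical name
ALIASES = {
    "RUSSIAN FEDERATION": "RUSSIA",
    "VIET NAM": "VIETNAM",
    "BRUNEI DARUSSALAM": "BRUNEI",
}

# Hypothetical list of canonical country names
KNOWN = {"RUSSIA", "VIETNAM", "BRUNEI", "HONG KONG", "IRAN"}

def country_from_affiliation(affiliation):
    """Guess the country from the last comma-separated part of the string."""
    last = affiliation.rsplit(",", 1)[-1].strip(" .;").upper()
    name = ALIASES.get(last, last)  # normalize known variants
    return name if name in KNOWN else None  # None means "not recognized"

examples = [
    "SOME UNIVERSITY, MOSCOW, RUSSIAN FEDERATION",
    "SOME UNIVERSITY, HONG KONG",
    "UNIVERSITY OF TEHRANIRAN",
]
guesses = [country_from_affiliation(a) for a in examples]
print(guesses)
```

Even this small rule shows the trade-off: it handles name variants through the alias table, but it cannot recover a country from a string with missing delimiters such as the last example.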