Years with zero citations cause get_article_cite_history() to fail

Open joelmcg opened this issue 3 years ago • 3 comments

When using get_article_cite_history(), an article that has years with zero citations triggers one of two failures. First, the call may stop with an error indicating that the length of years does not match the length of vals:

get_article_cite_history("wSXViPYAAAAJ", "KlAtU1dfN6UC")

Error in data.frame(year = years, cites = vals) : 
  arguments imply differing number of rows: 17, 16

Second, years that should show zero citations may be filled in with incorrect values:

get_article_cite_history("wSXViPYAAAAJ", "9ZlFYXVOiuMC")

   year cites        pubid
1  2005     1 9ZlFYXVOiuMC
2  2006     1 9ZlFYXVOiuMC
3  2007     1 9ZlFYXVOiuMC
4  2008     1 9ZlFYXVOiuMC
5  2009     1 9ZlFYXVOiuMC
6  2010     1 9ZlFYXVOiuMC
7  2011     1 9ZlFYXVOiuMC
8  2012     1 9ZlFYXVOiuMC
9  2013     1 9ZlFYXVOiuMC
10 2014     1 9ZlFYXVOiuMC
11 2015     1 9ZlFYXVOiuMC
12 2016     1 9ZlFYXVOiuMC
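Both symptoms are consistent with how base R's data.frame() treats columns of unequal length: it stops when the shorter length does not divide the longer one, and silently recycles the shorter vector when it does. The sketch below reproduces each case with made-up vectors. The 17/16 split mirrors the error above, but the three-value vector is purely an assumption for illustration, since the thread does not show what was actually scraped for the second article.

```r
# Hypothetical scraped vectors: 17 year labels but only 16 bar values
years <- 2005:2021
vals  <- rep(1, 16)

# 16 does not divide 17, so data.frame() stops with an error;
# tryCatch() captures the message instead of aborting the script
mismatch <- tryCatch(
  data.frame(year = years, cites = vals),
  error = function(e) conditionMessage(e)
)
print(mismatch)

# 12 year labels but only 3 bar values (assumed): 3 divides 12, so the
# values are recycled without any warning and every year gets a count
recycled <- data.frame(year = 2005:2016, cites = c(1, 1, 1))
print(recycled)
```

If this is what happens inside the function, the second case is the more dangerous one, because the result looks valid and nothing signals that the zero-citation years were overwritten.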

The correct citation history for this article contains many zeros:

https://scholar.google.com/citations?view_op=view_citation&hl=en&user=wSXViPYAAAAJ&cstart=20&pagesize=80&citation_for_view=wSXViPYAAAAJ:9ZlFYXVOiuMC

Thanks for looking into this!

Cheers, Joel

joelmcg, Sep 09 '21 17:09

I am having the same issue!

get_article_cite_history("QtuhiVMAAAAJ", "IjCSPb-OGe4C")

Error in data.frame(year = years, cites = vals) : 
  arguments imply differing number of rows: 16, 15

I have confirmed on the Google Scholar page that the problem occurs for articles that have years with no citations.

Thank you so much!

Also having the same issue ... took a while to figure it out - any year with zero citations causes get_article_cite_history to die.

rmwaterhouse, Nov 07 '21 13:11

I'm pretty sure the issue has to do with a dependency and/or a conflict upstream. If I modify get_article_cite_history() so that my only change is making the rvest namespace explicit for the related functions, everything works as intended.

For example, here is the original get_article_cite_history() function:

get_article_cite_history <- function(id, article) {
    site <- getOption("scholar_site")
    id <- tidy_id(id)
    url_base <- paste0(site, "/citations?", "view_op=view_citation&hl=en&citation_for_view=")
    url_tail <- paste(id, article, sep = ":")
    url <- paste0(url_base, url_tail)
    res <- get_scholar_resp(url)
    if (is.null(res)) 
        return(NA)
    httr::stop_for_status(res, "get user id / article information")
    doc <- read_html(res)
    years <- doc %>% html_nodes(".gsc_oci_g_t") %>% html_text() %>% 
        as.numeric()
    vals <- doc %>% html_nodes(".gsc_oci_g_al") %>% html_text() %>% 
        as.numeric()
    df <- data.frame(year = years, cites = vals)
    if (nrow(df) > 0) {
        df <- merge(data.frame(year = min(years):max(years)), 
            df, all.x = TRUE)
        df[is.na(df)] <- 0
        df$pubid <- article
    }
    else {
        df$pubid <- vector(mode = mode(article))
    }
    return(df)
}

Here is my modified function (called get_article_cite_history_2()):

get_article_cite_history_2 <- function (id, article) {
    
    site <- getOption("scholar_site")
    id <- tidy_id(id)
    url_base <- paste0(site, "/citations?",
                       "view_op=view_citation&hl=en&citation_for_view=")
    url_tail <- paste(id, article, sep=":")
    url <- paste0(url_base, url_tail)
    
    res <- get_scholar_resp(url)
    httr::stop_for_status(res, "get user id / article information")
    doc <- rvest::read_html(res)
    
    ## Inspect the bar chart to retrieve the citation values and years
    years <- doc %>%
        rvest::html_nodes(".gsc_oci_g_a") %>% 
        rvest::html_attr('href') %>% 
        stringr::str_match("as_ylo=(\\d{4})&") %>% 
        "["(,2) %>% 
        as.numeric()
    vals <- doc %>%
        rvest::html_nodes(".gsc_oci_g_al") %>% 
        rvest::html_text() %>% 
        as.numeric()
    
    df <- data.frame(year = years, cites = vals)
    if(nrow(df)>0) {
        ## There may be undefined years in the sequence so fill in these gaps
        df <- merge(data.frame(year=min(years):max(years)),
                    df, all.x=TRUE)
        df[is.na(df)] <- 0
        df$pubid <- article
    } else {
        # complete the 0 row data.frame to be consistent with normal results
        df$pubid <- vector(mode = mode(article))
    }
    return(df)
}
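Comparing the two listings as posted, the modified version differs in more than the rvest:: prefixes: it reads the years from the href of each .gsc_oci_g_a bar link instead of from the .gsc_oci_g_t text nodes. If bars exist only for years with at least one citation (an assumption about the page layout, not something confirmed in this thread), years and vals would always have the same length, and the merge() step would then fill the gaps with zeros. A minimal sketch of that logic, using hypothetical href strings and a base R stand-in for the stringr::str_match() call:

```r
# Hypothetical href values, shaped like the links on the citation bars;
# 2018 is absent because it is assumed to have no bar
hrefs <- c(
  "/scholar?hl=en&cites=111&as_ylo=2016&as_yhi=2016",
  "/scholar?hl=en&cites=111&as_ylo=2017&as_yhi=2017",
  "/scholar?hl=en&cites=111&as_ylo=2019&as_yhi=2019",
  "/scholar?hl=en&cites=111&as_ylo=2020&as_yhi=2020",
  "/scholar?hl=en&cites=111&as_ylo=2021&as_yhi=2021"
)
vals <- c(3, 1, 1, 1, 5)

# Base R equivalent of stringr::str_match(...)[, 2]: element 1 of each
# match is the full match, element 2 is the first capture group
matches <- regmatches(hrefs, regexec("as_ylo=([0-9]{4})&", hrefs))
years <- as.numeric(vapply(matches, function(m) m[2], character(1)))

# Same gap filling as in the function: left-join onto the full year
# range, then turn the resulting NA counts into zeros
df <- data.frame(year = years, cites = vals)
df <- merge(data.frame(year = min(years):max(years)), df, all.x = TRUE)
df[is.na(df)] <- 0
print(df)
```

With these inputs, 2018 is restored as a row with zero citations, while the years that had a bar keep their own counts.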

The output from running each of these:

> scholar::get_article_cite_history("eD9_J3wAAAAJ", "_FxGoFyzp5QC")
Error in data.frame(year = years, cites = vals) : 
  arguments imply differing number of rows: 6, 5
> get_article_cite_history_2("eD9_J3wAAAAJ", "_FxGoFyzp5QC")
  year cites        pubid
1 2016     3 _FxGoFyzp5QC
2 2017     1 _FxGoFyzp5QC
3 2018     0 _FxGoFyzp5QC
4 2019     1 _FxGoFyzp5QC
5 2020     1 _FxGoFyzp5QC
6 2021     5 _FxGoFyzp5QC

A suboptimal workaround right now is to replace get_article_cite_history() with the function I wrote above after calling library(scholar), but this seems like something a dev can patch quickly.

mkiang, Nov 25 '21 08:11