
Reproducible bibliographic references

Open antaldaniel opened this issue 5 years ago • 18 comments

I am thinking about how to create .bib files for the data that is downloaded by the eurostat package. I have code that downloads my most important data and updates the .bib files that cite it (for example, the date the data was accessed), but it is not a fully general solution.

I use the following template and add it to a collected .bib file:

`@misc{eurostat_sbs_na_dt_r2_year,
  title = {Annual detailed enterprise statistics for trade {(NACE Rev. 2 G)} [sbs_na_dt_r2]},
  url = {https://ec.europa.eu/eurostat/web/products-datasets/-/sbs_na_dt_r2},
  language = {en},
  year = {year},
  urldate = {not_dated},
  publisher = {{Eurostat}},
  author = {{Eurostat}},
  keywords = {structural business indicators, dataset, statistics, Eurostat}
}`

I change the statistics product code sbs_na_dt_r2 in the unique identifier, use the current date for urldate, and replace year with the year component of the download date.
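The substitution described here can be sketched in a few lines of base R. This is only an illustration: `make_bib_entry()` is a hypothetical helper name, not a function of the eurostat package, and the title is assumed to be supplied by the caller.

```r
# Hypothetical helper (not part of the eurostat package): fills the template
# for one dataset code. The title must be passed in by the caller.
make_bib_entry <- function(code, title, keywords = character(0),
                           accessed = Sys.Date()) {
  year <- format(accessed, "%Y")  # year component of the download date
  fields <- c(
    title     = paste0(title, " [", code, "]"),
    url       = paste0("https://ec.europa.eu/eurostat/web/products-datasets/-/", code),
    language  = "en",
    year      = year,
    urldate   = format(accessed, "%Y-%m-%d"),
    publisher = "{Eurostat}",
    author    = "{Eurostat}"
  )
  if (length(keywords) > 0) {
    fields <- c(fields, keywords = paste(keywords, collapse = ", "))
  }
  # One "  name = {value}" line per field, separated by commas
  body <- paste0("  ", names(fields), " = {", fields, "}", collapse = ",\n")
  paste0("@misc{eurostat_", code, "_", year, ",\n", body, "\n}")
}

cat(make_bib_entry(
  "sbs_na_dt_r2",
  "Annual detailed enterprise statistics for trade {(NACE Rev. 2 G)}",
  keywords = c("dataset", "statistics", "Eurostat"),
  accessed = as.Date("2018-11-10")
))
```

The resulting string can be appended to a collected .bib file with `cat(..., file = "my.bib", append = TRUE)`.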

I think that the title could be created by get_eurostat_dic, but I have no idea how to create a URL to the data. I wonder if there is any metadata directory that could be used to create a permanent reference, either to a reproducible download address or to a metadata description?

I think that, in the spirit of truly reproducible research, it would be reasonable not only to update Eurostat statistics in an RMarkdown document, but also to update the details of the .bib file. I had the misfortune that Eurostat completely removed an earlier data product, so I think full documentation would be good.

Of course, I just used a simple bib template from Zotero, but maybe using some DataCite metadata best practices could help. I'd gladly create a new function if somebody can point me in the right direction with the URL issue.

antaldaniel avatar Nov 10 '18 22:11 antaldaniel

This is a really neat idea. @jhuovari and @pbiecek are more familiar with this part of the pkg, let's first see if they have a comment.

antagomir avatar Nov 11 '18 11:11 antagomir

I think you get the title best with:

label_eurostat_tables("sbs_na_dt_r2")

You can build the URL to the data from the identifier. The bulk data is at: https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/BulkDownloadListing?file=data/sbs_na_dt_r2.tsv.gz

But I think a more user-friendly link could be: https://appsso.eurostat.ec.europa.eu/nui/show.do?dataset=sbs_na_dt_r2&lang=en

Hope this helps.
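As a rough sketch, the URL patterns mentioned in this thread can be generated from a dataset code as below. `eurostat_urls()` is a hypothetical helper name, and whether every dataset resolves under each pattern is an assumption that is not checked here.

```r
# Hypothetical helper: builds the URL patterns quoted in this thread from a
# dataset code. No request is made, so the links are not validated.
eurostat_urls <- function(code) {
  c(
    product = paste0("https://ec.europa.eu/eurostat/web/products-datasets/-/", code),
    bulk    = paste0("https://ec.europa.eu/eurostat/estat-navtree-portlet-prod/",
                     "BulkDownloadListing?file=data/", code, ".tsv.gz"),
    browser = paste0("https://appsso.eurostat.ec.europa.eu/nui/show.do?dataset=",
                     code, "&lang=en")
  )
}

eurostat_urls("sbs_na_dt_r2")
```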

jhuovari avatar Nov 12 '18 06:11 jhuovari

Yes, I came to the same conclusion. I am just wondering if there are exceptions to the

https://appsso.eurostat.ec.europa.eu/nui/show.do?dataset=sbs_na_dt_r2&lang=en

link. So far I have not seen a dataset that would not open this way, in which case the task is very easy. I'll create a pull request later this week.

antaldaniel avatar Nov 12 '18 09:11 antaldaniel

Cool idea. What about a function like this:

library(eurostat)  # provides get_eurostat_toc()

get_bibentry <- function(code = "t2020_rk310", toBibtex = FALSE) {
  # Look up the dataset in the Eurostat table of contents
  toc <- get_eurostat_toc()
  toc <- toc[toc$code == code, ]

  if (nrow(toc) == 0) {
    warning(paste0("Code ", code, " not found"))
    return(NULL)
  }

  entry <- bibentry(
    bibtype = "misc",
    title = paste0(toc$title[1], " [", code, "]"),
    url = paste0("https://ec.europa.eu/eurostat/web/products-datasets/-/", code),
    language = "en",
    # Note: this is the full last-update date (e.g. 12.11.2018), not only the year
    year = paste0(toc$`last update of data`[1]),
    publisher = "Eurostat",
    author = "Eurostat"
  )
  if (toBibtex) {
    toBibtex(entry)
  } else {
    entry
  }
}

Then you can do things like this:

> get_bibentry("sbs_na_dt_r2")
Eurostat (12.11.2018). “Annual detailed enterprise statistics for trade
(NACE Rev. 2 G) [sbs_na_dt_r2].” <URL:
https://ec.europa.eu/eurostat/web/products-datasets/-/sbs_na_dt_r2>.

> get_bibentry("sbs_na_dt_r2", toBibtex = TRUE)
@Misc{,
  title = {Annual detailed enterprise statistics for trade (NACE Rev. 2 G) [sbs_na_dt_r2]},
  url = {https://ec.europa.eu/eurostat/web/products-datasets/-/sbs_na_dt_r2},
  language = {en},
  year = {12.11.2018},
  publisher = {Eurostat},
  author = {{Eurostat}},
}

pbiecek avatar Nov 12 '18 23:11 pbiecek

Beautiful. How about replacing "toBibtex" argument with "format" (or similar)? This would then become: get_bibentry("sbs_na_dt_r2", format = "bibtex") or get_bibentry("sbs_na_dt_r2", format = "plaintext"). Later it would be possible to add other formats (RIS etc) if needs arise.
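A minimal sketch of such a `format` argument, using only base R (the utils package). `format_entry()` is a hypothetical name chosen so that the example runs without downloading the table of contents; it takes an already built bibentry object.

```r
# Hypothetical helper: renders an existing bibentry in the requested format.
# match.arg() rejects any value that is not listed in the default vector.
format_entry <- function(entry, format = c("bibtex", "plaintext")) {
  fmt <- match.arg(format)
  switch(fmt,
    bibtex    = paste(utils::toBibtex(entry), collapse = "\n"),
    plaintext = paste(base::format(entry, style = "text"), collapse = "\n")
  )
}

entry <- utils::bibentry(
  bibtype = "misc",
  title = "Annual detailed enterprise statistics for trade (NACE Rev. 2 G) [sbs_na_dt_r2]",
  url = "https://ec.europa.eu/eurostat/web/products-datasets/-/sbs_na_dt_r2",
  year = "2018",
  author = "Eurostat"
)

cat(format_entry(entry, format = "bibtex"), "\n")
cat(format_entry(entry, format = "plaintext"), "\n")
```

Adding another format later (RIS, for example) would then only need one more value in the default vector and one more branch in `switch()`.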

antagomir avatar Nov 13 '18 08:11 antagomir

Very nice, and much simpler than I thought. I was trying to figure out how the URL changes in the interactive data viewer, but your solution is far better and more elegant.

I'd probably add optional keywords and the URL date, where keywords can be a parameter of the function passed as a vector, or have some default like c("Eurostat", "statistics", "dataset")

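# Collapse the keyword vector into a single comma-separated field
if (length(keywords) > 1) {
  keywords <- paste0('{', paste(keywords, collapse = ', '), '}')
}

# Date on which the URL was accessed
urldate <- paste0('{', as.character(Sys.Date()), '}')

# Proposed unique identifier for the entry
paste0("@misc_eurostat_", code, "_", substr(as.character(Sys.Date()), 1, 4))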


entry <- bibentry(
  bibtype = "misc",
  title = paste0(toc$title[1]," [",code,"]"),
  url = paste0("https://ec.europa.eu/eurostat/web/products-datasets/-/",code),
  language = "en",
  year = paste0(toc$`last update of data`[1]),
  publisher = "Eurostat",
  author = "Eurostat" ,  
  urldate = urldate,
  keywords = keywords
)

I know that urldate is logically superfluous, but it may be required by many formatting guides. Furthermore, I wonder how to add unique identifiers to the bib entries so that they can be used immediately in knitr, which means adding

paste0("@misc_eurostat_", code, "_", substr(as.character(Sys.Date()), 1, 4))

to the bibentry.
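For what it is worth, `utils::bibentry()` has a `key` argument that `toBibtex()` writes out as the citation key. The sketch below uses it with a key of the form eurostat_<code>_<year>; that key format is an assumption made for illustration (citation keys are normally written without a leading @), not an agreed convention.

```r
code <- "sbs_na_dt_r2"
accessed <- as.Date("2018-11-13")  # fixed date so the example is reproducible

entry <- utils::bibentry(
  bibtype = "misc",
  # Assumed key format: eurostat_<code>_<year of access>
  key = paste0("eurostat_", code, "_", format(accessed, "%Y")),
  title = paste0("Annual detailed enterprise statistics for trade ",
                 "(NACE Rev. 2 G) [", code, "]"),
  url = paste0("https://ec.europa.eu/eurostat/web/products-datasets/-/", code),
  year = format(accessed, "%Y"),
  author = "Eurostat",
  urldate = as.character(accessed)  # extra fields are passed through as-is
)

bib <- utils::toBibtex(entry)
print(bib)
```

In an RMarkdown document the entry could then be cited as `[@eurostat_sbs_na_dt_r2_2018]` once the output is saved to the .bib file listed in the YAML header.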

antaldaniel avatar Nov 13 '18 08:11 antaldaniel

My take on the issue. It depends on the rOpenSci package RefManageR, but it creates Biblatex output that can be attached to a journal article or bookdown book immediately, or imported into Zotero.

My only concern is the trailing comma after the last metadata field; I don't know if it will cause any issue. Any further comments?

Compared to @pbiecek's function, this adds three extras:

  • use of keywords,
  • creating a unique ID key for Biblatex,
  • three format choices (bibentry, bibtex, biblatex)
get_bibentry <- function(code = c("tran_hv_frtra", "t2020_rk310", "tec00001"),
                         keywords = list(c("railways", "freight", "transport"),
                                         c("railways", "passengers", "modal split")),
                         format = "Biblatex") {

  toc <- get_eurostat_toc()
  toc <- toc[toc$code %in% code, ]
  toc <- toc[!duplicated(toc), ]

  urldate <- as.character(Sys.Date())

  if (nrow(toc) == 0) {
    warning(paste0("Code ", paste(code, collapse = ", "), " not found"))
    return(NULL)
  }

  # Unique key: dataset code plus last update date, with dots replaced by dashes
  eurostat_id <- paste0(toc$code, "_",
                        gsub("\\.", "-", toc$`last update of data`))

  for (i in seq_len(nrow(toc))) {

    # Use the i-th keyword vector if the user supplied a non-empty one
    if (!is.null(keywords) && length(keywords) >= i &&
        any(nchar(keywords[[i]]) > 0)) {
      keyword_entry <- paste(keywords[[i]], collapse = ", ")
    } else {
      keyword_entry <- NULL
    }

    # Index everything by the toc row so that key, title, url and year
    # always describe the same dataset
    entry <- RefManageR::BibEntry(
      bibtype = "misc",
      key = eurostat_id[i],
      title = paste0(toc$title[i], " [", toc$code[i], "]"),
      url = paste0("https://ec.europa.eu/eurostat/web/products-datasets/-/", toc$code[i]),
      language = "en",
      year = paste0(toc$`last update of data`[i]),
      publisher = "Eurostat",
      author = "Eurostat",
      keywords = keyword_entry,
      urldate = urldate
    )

    if (i > 1) {
      entries <- c(entries, entry)
    } else {
      entries <- entry
    }
  }

  if (format == "Bibtex") {
    entries <- toBibtex(entries)
  } else if (format == "Biblatex") {
    entries <- RefManageR::toBiblatex(entries)
  }

  entries
}

antaldaniel avatar Jan 17 '19 12:01 antaldaniel

I created a pull request with the new function, documentation and unit tests. However, if you can, take a look at my last comment about the superfluous comma.

antaldaniel avatar Jan 17 '19 13:01 antaldaniel

Thanks, excellent. Let us try to get this merged asap.

antagomir avatar Jan 17 '19 13:01 antagomir

It seems to me that there is an error in the function get_bibentry in the eurostat package (version 3.3.5). Line 16 of the function code is code = c("tran_hv_frtra", "t2020_rk310", "tec00001"), which overwrites the code requested by the user.

pompm avatar May 10 '19 09:05 pompm

I just tried it with the default format 'Biblatex' and with the alternatives 'bibentry' and 'Bibtex', and it worked well for me on a Windows computer. Can you somehow reproduce the error?

antaldaniel avatar May 10 '19 09:05 antaldaniel

Hi, here is an example of my problem. (If I copy the definition of the function get_bibentry and remove line 16, everything is OK.) Marek

> version
               _
platform       x86_64-pc-linux-gnu
arch           x86_64
os             linux-gnu
system         x86_64, linux-gnu
status
major          3
minor          3.3
year           2017
month          03
day            06
svn rev        72310
language       R
version.string R version 3.3.3 (2017-03-06)
nickname       Another Canoe

> packageVersion("eurostat")
[1] ‘3.3.5’

> get_bibentry(code="sbs_na_dt_r2")
@misc{tran_hv_frtra_30-04-2019,
  title = {Volume of freight transport relative to GDP [tran_hv_frtra]},
  url = {https://ec.europa.eu/eurostat/web/products-datasets/-/tran_hv_frtra},
  year = {30.04.2019},
  publisher = {Eurostat},
  author = {{Eurostat}},
  month = {kvě},
  note = {Last visited on 05/10/2019},
}

@misc{tec00001_08-05-2019,
  title = {Gross domestic product at market prices [t2020_rk310]},
  url = {https://ec.europa.eu/eurostat/web/products-datasets/-/t2020_rk310},
  year = {30.04.2019},
  publisher = {Eurostat},
  author = {{Eurostat}},
  month = {kvě},
  note = {Last visited on 05/10/2019},
}

@misc{t2020_rk310_21-03-2019,
  title = {Modal split of passenger transport [tec00001]},
  url = {https://ec.europa.eu/eurostat/web/products-datasets/-/tec00001},
  year = {30.04.2019},
  publisher = {Eurostat},
  author = {{Eurostat}},
  month = {kvě},
  note = {Last visited on 05/10/2019},
}

pompm avatar May 10 '19 11:05 pompm

Indeed, a leftover line hardcodes the dataset codes. Sorry. I will correct it a.s.a.p. and create a pull request.
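The report above also shows titles paired with the wrong code in square brackets, which is what happens when the rows of the table of contents are in a different order than the requested codes. One way to guard against both problems is sketched below. It is an illustration only, not the change that was merged: `lookup_codes()` is a hypothetical helper, and the small data frame is a toy stand-in for the result of get_eurostat_toc().

```r
# Toy stand-in for get_eurostat_toc(); the real table has more columns.
toc <- data.frame(
  title = c("Volume of freight transport relative to GDP",
            "Gross domestic product at market prices",
            "Modal split of passenger transport"),
  code = c("tran_hv_frtra", "tec00001", "t2020_rk310"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: returns only the rows the user asked for, in the
# order they were requested, and warns about codes that are missing.
lookup_codes <- function(toc, code) {
  idx <- match(code, toc$code)
  if (anyNA(idx)) {
    warning("Code(s) not found: ", paste(code[is.na(idx)], collapse = ", "))
  }
  toc[idx[!is.na(idx)], , drop = FALSE]
}

res <- lookup_codes(toc, c("t2020_rk310", "tec00001"))
# Title and code now come from the same row
paste0(res$title, " [", res$code, "]")
```

A unit test along these lines (requested codes in, identical codes out) would have caught the hardcoded default.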

antaldaniel avatar May 10 '19 11:05 antaldaniel

@pompm thanks for the report! Bibtex and Biblatex entries can be tricky anyway, so let me know if you have other issues using them.

antaldaniel avatar May 10 '19 11:05 antaldaniel

Can we close this one?

antagomir avatar Jan 26 '20 09:01 antagomir

Yes, we can close this.

antaldaniel avatar Feb 06 '20 11:02 antaldaniel

I just got info from CRAN that RefManageR will be deprecated and removed from CRAN on 2020-10-21 due to lack of maintenance. If this happens, this part of the eurostat R pkg will become defunct.

We can either remove this functionality or implement the necessary parts directly in our pkg. The RefManageR pkg has a GPL2/3 license, so we could not borrow the code from there directly without changing the eurostat R pkg license.

antagomir avatar Oct 08 '20 15:10 antagomir

@antaldaniel if you have an opinion about this, it would be good to hear it: the deadline is Wednesday (Oct 21).

However, I just noticed that RefManageR also allows the BSD3 license (we have BSD2). I think BSD2 allows us to switch to BSD3 (or even GPL2/3). I think I will just switch to BSD3 and copy the missing functions into our (eurostat) package before RefManageR is deprecated, and then inform all authors about the change. If anyone objects, we can switch back to the BSD2 license and remove the bib functionality.

antagomir avatar Oct 17 '20 15:10 antagomir