readxl
Support reading from more general inputs
Right now readxl reads only from xlsx and xls files.
In the fullness of time, some functionality for supporting more general inputs will be pulled out of readr, at which point readxl can exploit that. We might implement an interim solution for some of these in the meantime.
This issue will cover all related feature requests (all of which I'm now closing):
- #192 input via rawConnection
- #183 URL support?
- #115 align all read_* decompression and download behavior with read_delim()
- #93 Read file directly from memory
- #518 Feature Request: ability to read from HDFS
Question: Along with this support for more general inputs, will that also signal a transition to readr style col_types? Having that kind of unity between the packages would be welcome.
The col_types transition is #198. I can't think of any reason the implementations of these two things would be inextricably tied together. They'd more likely just happen around the same time because I was working on a new release.
fread's approach is to download the file & proceed to use plain-old fread on that.
@jennybc If you're happy with that logic, I'd be happy to file a PR.
@MichaelChirico Yeah, that is an option we've contemplated re: "faking" read from a URL. I'm definitely interested in a PR that downloads to a temp file, then reads that.
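For anyone wanting that today, here is a minimal sketch of the temp-file approach. The helper name `read_excel_url()` is hypothetical, not part of readxl, and it assumes the curl and readxl packages are installed:

```r
# Hypothetical helper, not a readxl function: download a remote
# workbook to a temp file, then read the local copy.
read_excel_url <- function(url, ...) {
  tf <- tempfile(fileext = paste0(".", tools::file_ext(url)))
  on.exit(unlink(tf), add = TRUE)
  curl::curl_download(url, tf)
  readxl::read_excel(tf, ...)
}

# Usage (illustrative URL):
# dat <- read_excel_url("https://example.com/report.xlsx")
```

Note the naive extension detection via `tools::file_ext()` would need more care for URLs carrying query strings.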
Does this include reading files from within a zip file using the unz() command like below:
zzz <- read_xlsx(unz("data/myzip.zip", "one_of_many.xlsx"))
This would be a handy option for allowing users to access compressed files.
@jknowles I rather doubt it. It's especially hard to see that working for xlsx, where we explicitly unpack that into individual XML files.
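That said, a workaround is to extract the workbook to a temporary directory first with `unzip()`, since readxl needs a real file path. The wrapper below is a sketch, not a readxl API; the paths echo the example above:

```r
# Sketch: pull one workbook out of a zip archive, then read it.
# unzip() returns the path(s) of the extracted file(s).
read_xlsx_from_zip <- function(zipfile, file, ...) {
  path <- unzip(zipfile, files = file, exdir = tempdir())
  readxl::read_xlsx(path, ...)
}

# Usage (illustrative paths):
# zzz <- read_xlsx_from_zip("data/myzip.zip", "one_of_many.xlsx")
```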
I started to tackle the URL piece of this as part of #454 and there is also another separate PR for URL download in #426. But Slack discussion has convinced me I really should tackle all of these together and soon-ish. But not for this week's release, which gets the security-patched libxls out there.
Notes from team discussion:
- Even though `download.file()` is better behaved now wrt https, it may still be wise to prefer `curl::curl_download()`.
- Switch over to accepting a connection vs. just a file. Consult what readr and xml2 do: "if given a connection they just use that, otherwise they create the appropriate connection first based on the protocol / file extension(s)". Do it this way so that curl can go in Suggests.
- This probably means that curl will handle `file://` URLs, so can adopt this suggestion at the same time.
- Relevant code in the IDE
- R source re: reading from `file://`
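A rough sketch of the dispatch described above; the helper name `standardise_path()` and the protocol regex are assumptions, not actual readxl or readr code:

```r
# Sketch: remote inputs get downloaded to a temp file; local paths
# pass through untouched. curl is only touched on the remote branch,
# which is what lets it live in Suggests rather than Imports.
standardise_path <- function(path) {
  if (grepl("^(http|https|ftp)://", path)) {
    tf <- tempfile(fileext = paste0(".", tools::file_ext(path)))
    curl::curl_download(path, tf)
    return(tf)
  }
  path
}
```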
Would appreciate if you could reproduce/summarize the part of the discussion regarding why to keep using curl for https as this might inform the choice in data.table::fread as well.
Thanks again!
> why to keep using curl for https
It was a rather vague suggestion, based on past experience with things "just working" as expected and consistently across OSes.
Has there been any update here? What's the typical workaround people are using? (not considering writing to disk)
[Update, for others who land here after getting stuck] A possible workaround (if files are NOT large) is to download to a temporary file, as described in one of the comments here:
tmp <- tempfile(fileext = ".xlsx")
httr::GET(url = "url/to/file",
          httr::write_disk(tmp))
xljh <- read_excel(tmp, skip = n)
Sorry. I am a little confused. I saw a commit that added the ability to read from URL in April 2018: https://github.com/tidyverse/readxl/blob/839d0230832f98fd4f0db38e4a9284983aca0604/R/utils.R#L6-L10 Somehow that edit never got integrated. Was it taken out for a specific reason?
@igordot That commit was in a PR that I ultimately did not merge https://github.com/tidyverse/readxl/pull/454.
@jennybc Thanks for clarifying. I am just not sure if you are postponing URL handling or giving up on it entirely.
No, definitely not giving up on it. Looking back at that, it looks like I did not want it wrapped up with some of the other things I was doing in that PR.
I think this is the deal: to do the temp file thing to "fake" reading from URL is easy and perhaps I should just do that. But the declared goal here is noticeably higher, which is why it hasn't been handled yet.
@jennybc Thank you for clarifying again. Dealing with remote data is complicated, so I understand the concern. If my opinion makes any difference, you can add the quick temp file fix for now. If anyone runs into any issues with it, they can always just not use a URL (same as they would now).
right now I'm reading excel files from a URL using
tf = tempfile(fileext = ".xlsx")
curl::curl_download(url, tf)
readxl::read_excel(tf, ...)
But I could also see a more generic approach using a .con method
con = curl::curl(url)
read_excel.con(con, ...)
close(con)
or a .raw method
raw = curl::curl_fetch_memory(url)$content
read_excel.raw(raw, ...)
library(openxlsx)
url <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vQH8DfJhj4-fiEw91FkcQXB47cCNSPMJo6AnehnGinq1pALxuQXqVwnZQcKtN0Ulw/pub?output=xlsx"
data <- read.xlsx(url)
This works well.
If you are reading a public Google Sheet, you can also use googlesheets4 without having it transit through the xlsx format. Or a private Sheet, for that matter.
@jennybc thank you. It worked once I converted the Excel file to Google Sheets format.
I used googlesheets4 package
googlesheets4::read_sheet("https://docs.google.com/spreadsheets/d/1wyhnEPoRB-JMssCWD_xslh_9G1JQiPXRuT_I0ibW600/edit#gid=74876023")
Hi,
Based on the thread content, I am not sure whether anything has been done to allow input via rawConnection objects. Can you please confirm whether that is the case?
If it is possible to read from raw connections, can you please indicate how to properly do this?
Thanks
readxl cannot read from rawConnection objects.
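The workaround is the same temp-file trick as for URLs: drain the connection into a raw vector, write the bytes to disk, and read that. The helper name below is hypothetical, not a readxl API:

```r
# Workaround sketch: readxl only accepts file paths, so persist the
# connection's bytes to a temp file before reading.
read_excel_rawcon <- function(con, ...) {
  tf <- tempfile(fileext = ".xlsx")
  on.exit(unlink(tf), add = TRUE)
  writeBin(readBin(con, what = "raw", n = 1e8), tf)
  readxl::read_excel(tf, ...)
}

# Usage (bytes would be a raw vector holding a complete xlsx file):
# con <- rawConnection(bytes)
# dat <- read_excel_rawcon(con)
# close(con)
```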
Am I correct that the main roadblock to reading from, say, a rawConnection (i.e. from memory) is the `unzip()` step, which only works with file paths? It looks like libxls can read from memory via `xls_open_buffer`, while rapidxml already reads from memory once `unzip()` has done its work.