readxl Support reading from more general inputs

Right now readxl reads only from xlsx and xls files.

In the fullness of time, some functionality for supporting more general inputs will be pulled out of readr, at which point readxl can exploit that. We might implement an interim solution for some of these in the meantime.

This issue will cover all related feature requests (all of which I'm now closing):

#192 input via rawConnection
#183 URL support?
#115 align all read_* decompression and download behavior with read_delim()
#93 Read file directly from memory
#518 Feature Request: ability to read from HDFS

Mar 01 '17 23:03 jennybc

Question: Along with this support for more general inputs, will that also signal a transition to readr style col_types? Having that kind of unity between the packages would be welcome.

Jul 31 '17 20:07 jjcad

The col_types transition is #198. I can't think of any reason that implementation of these two things would be inextricably tied together. They'd more likely happen around the same time because it meant I was working on a new release.

Jul 31 '17 20:07 jennybc

fread's approach is to download the file & proceed to use plaid-old fread on that.

@jennybc If you're happy with that logic, I'd be happy to file a PR.

Feb 19 '18 10:02 MichaelChirico

@MichaelChirico Yeah, that is an option we've contemplated re: "faking" read from a URL. I'm definitely interested in a PR that downloads to a temp file, then reads that.

Feb 19 '18 16:02 jennybc

Does this include reading files from within a zip file using the unz() command like below:

zzz <- read_xlsx(unz("data/myzip.zip", "one_of_many.xlsx"))

This would be a handy option for allowing users to access compressed files.

Mar 22 '18 15:03 jknowles

@jknowles I rather doubt it. It's especially hard to see that working for xlsx, where we explicitly unpack that into individual XML files.

Mar 22 '18 15:03 jennybc

I started to tackle the URL piece of this as part of #454 and there is also another separate PR for URL download in #426. But Slack discussion has convinced me I really should tackle all of these together and soon-ish. But not for this week's release, which gets the security-patched libxls out there.

Notes from team discussion:

Even though download.file() is better behaved now wrt https, it may still be wise to prefer curl::curl_download().
Switch over to accepting a connection vs. just a file. Consult what readr and xml2 do: "if given a connection they just use that, otherwise they create the appropriate connection first based on the protocol / file extension(s)". Do it like so so that curl can go in Suggests.
This probably means that curl will handle file:// URLs, so can adopt this suggestion at same time.
Relevant code in the IDE
R source re: reading from file://

Apr 16 '18 19:04 jennybc

Would appreciate if you could reproduce/summarize the part of the discussion regarding why to keep using curl for https as this might inform the choice in data.table::fread as well.

Thanks again!

Apr 17 '18 00:04 MichaelChirico

why to keep using curl for https

It was a rather vague suggestion, based on past experience with things "just working" as expected and consistently across OSes.

Apr 17 '18 01:04 jennybc

Has there been any update here? What's the typical workaround people are using? (not considering writing to disk)

[Update, for others who land here after getting stuck] A possible workaround (if files are NOT large) is to create a temporary file as defined in one of the comments here.

tmp <- tempfile(fileext = ".xlsx")

httr::GET(url = "url/to/file",
            write_disk( tmp) )

xljh <- read_excel(tmp, skip = n)

Nov 13 '18 05:11 electricstorm

Sorry. I am a little confused. I saw a commit that added the ability to read from URL in April 2018: https://github.com/tidyverse/readxl/blob/839d0230832f98fd4f0db38e4a9284983aca0604/R/utils.R#L6-L10 Somehow that edit never got integrated. Was it taken out for a specific reason?

Oct 31 '19 00:10 igordot

@igordot That commit was in a PR that I ultimately did not merge https://github.com/tidyverse/readxl/pull/454.

Oct 31 '19 12:10 jennybc

@jennybc Thanks for clarifying. I am just not sure if you are postponing URL handling or giving up on it entirely.

Oct 31 '19 15:10 igordot

No, definitely not giving up on it. Looking back at that, it looks like I did not want it wrapped up with some of the other things I was doing in that PR.

Oct 31 '19 15:10 jennybc

I think this is the deal: to do the temp file thing to "fake" reading from URL is easy and perhaps I should just do that. But the declared goal here is noticeably higher, which is why it hasn't been handled yet.

Oct 31 '19 15:10 jennybc

@jennybc Thank you for clarifying again. Dealing with remote data is complicated, so I understand the concern. If my opinion makes any difference, you can add the quick temp file fix for now. If anyone runs into any issues with it, they can always just not use a URL (same as they would now).

Oct 31 '19 15:10 igordot

right now I'm reading excel files from a URL using

  tf = tempfile(fileext = ".xlsx")
  curl::curl_download(url, tf)
  readxl::read_excel(tf, ...)

But I could also see a more generic approach using a .con method

con = curl::curl(url)
read_excel.con(con, ...)
close(con)

or a .raw method

raw = curl::curl_fetch_memory(url)$content
read_excel.raw(raw, ...)

Feb 26 '20 17:02 mkoohafkan

library(openxlsx)

url <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vQH8DfJhj4-fiEw91FkcQXB47cCNSPMJo6AnehnGinq1pALxuQXqVwnZQcKtN0Ulw/pub?output=xlsx"

data <- read.xlsx(url)

This works well.

Mar 10 '20 21:03 gbganalyst

If you are reading a public Google Sheet, you can also use googlesheets4 without having it transit through the xlsx format. Or a private Sheet, for that matter.

Mar 10 '20 23:03 jennybc

@jennybc thank you. It works when I converted the Excel file to google sheets format.

I used googlesheets4 package

googlesheets4::read_sheet("https://docs.google.com/spreadsheets/d/1wyhnEPoRB-JMssCWD_xslh_9G1JQiPXRuT_I0ibW600/edit#gid=74876023")

Mar 11 '20 18:03 gbganalyst

Hi,

Based upon the thread content, I am not sure if anything has been done to allow input via rawConnection objects. Can you please confirm if that is the case or not?

If it is possible to read from raw connections, can you please indicate how to properly do this?

Thanks

May 29 '20 18:05 sbihorel

readxl cannot read from rawConnection objects.

Aug 11 '20 15:08 jennybc

Am I correct that the main roadblock to reading from, say, rawConnection (i.e. from memory), is the unzip() which only works with file paths? It looks like libxls can read from memory via xls_open_buffer while rapidxml is already reading from memory after unzip() has done its work.

Mar 02 '22 02:03 stephematician