readxl icon indicating copy to clipboard operation
readxl copied to clipboard

Support reading from more general inputs

Open jennybc opened this issue 8 years ago • 23 comments

Right now readxl reads only from xlsx and xls files.

In the fullness of time, some functionality for supporting more general inputs will be pulled out of readr, at which point readxl can exploit that. We might implement an interim solution for some of these in the meantime.

This issue will cover all related feature requests (all of which I'm now closing):

  • #192 input via rawConnection
  • #183 URL support?
  • #115 align all read_* decompression and download behavior with read_delim()
  • #93 Read file directly from memory
  • #518 Feature Request: ability to read from HDFS

jennybc avatar Mar 01 '17 23:03 jennybc

Question: Along with this support for more general inputs, will that also signal a transition to readr style col_types? Having that kind of unity between the packages would be welcome.

jjcad avatar Jul 31 '17 20:07 jjcad

The col_types transition is #198. I can't think of any reason that implementation of these two things would be inextricably tied together. They'd more likely happen around the same time because it meant I was working on a new release.

jennybc avatar Jul 31 '17 20:07 jennybc

fread's approach is to download the file & proceed to use plaid-old fread on that.

@jennybc If you're happy with that logic, I'd be happy to file a PR.

MichaelChirico avatar Feb 19 '18 10:02 MichaelChirico

@MichaelChirico Yeah, that is an option we've contemplated re: "faking" read from a URL. I'm definitely interested in a PR that downloads to a temp file, then reads that.

jennybc avatar Feb 19 '18 16:02 jennybc

Does this include reading files from within a zip file using the unz() command like below:

zzz <- read_xlsx(unz("data/myzip.zip", "one_of_many.xlsx"))

This would be a handy option for allowing users to access compressed files.

jknowles avatar Mar 22 '18 15:03 jknowles

@jknowles I rather doubt it. It's especially hard to see that working for xlsx, where we explicitly unpack that into individual XML files.

jennybc avatar Mar 22 '18 15:03 jennybc

I started to tackle the URL piece of this as part of #454 and there is also another separate PR for URL download in #426. But Slack discussion has convinced me I really should tackle all of these together and soon-ish. But not for this week's release, which gets the security-patched libxls out there.

Notes from team discussion:

  • Even though download.file() is better behaved now wrt https, it may still be wise to prefer curl::curl_download().
  • Switch over to accepting a connection vs. just a file. Consult what readr and xml2 do: "if given a connection they just use that, otherwise they create the appropriate connection first based on the protocol / file extension(s)". Do it like so so that curl can go in Suggests.
  • This probably means that curl will handle file:// URLs, so can adopt this suggestion at same time.
  • Relevant code in the IDE
  • R source re: reading from file://

jennybc avatar Apr 16 '18 19:04 jennybc

Would appreciate if you could reproduce/summarize the part of the discussion regarding why to keep using curl for https as this might inform the choice in data.table::fread as well.

Thanks again!

MichaelChirico avatar Apr 17 '18 00:04 MichaelChirico

why to keep using curl for https

It was a rather vague suggestion, based on past experience with things "just working" as expected and consistently across OSes.

jennybc avatar Apr 17 '18 01:04 jennybc

Has there been any update here? What's the typical workaround people are using? (not considering writing to disk)

[Update, for others who land here after getting stuck] A possible workaround (if files are NOT large) is to create a temporary file as defined in one of the comments here.

tmp <- tempfile(fileext = ".xlsx")

httr::GET(url = "url/to/file",
            write_disk( tmp) )

xljh <- read_excel(tmp, skip = n)

electricstorm avatar Nov 13 '18 05:11 electricstorm

Sorry. I am a little confused. I saw a commit that added the ability to read from URL in April 2018: https://github.com/tidyverse/readxl/blob/839d0230832f98fd4f0db38e4a9284983aca0604/R/utils.R#L6-L10 Somehow that edit never got integrated. Was it taken out for a specific reason?

igordot avatar Oct 31 '19 00:10 igordot

@igordot That commit was in a PR that I ultimately did not merge https://github.com/tidyverse/readxl/pull/454.

jennybc avatar Oct 31 '19 12:10 jennybc

@jennybc Thanks for clarifying. I am just not sure if you are postponing URL handling or giving up on it entirely.

igordot avatar Oct 31 '19 15:10 igordot

No, definitely not giving up on it. Looking back at that, it looks like I did not want it wrapped up with some of the other things I was doing in that PR.

jennybc avatar Oct 31 '19 15:10 jennybc

I think this is the deal: to do the temp file thing to "fake" reading from URL is easy and perhaps I should just do that. But the declared goal here is noticeably higher, which is why it hasn't been handled yet.

jennybc avatar Oct 31 '19 15:10 jennybc

@jennybc Thank you for clarifying again. Dealing with remote data is complicated, so I understand the concern. If my opinion makes any difference, you can add the quick temp file fix for now. If anyone runs into any issues with it, they can always just not use a URL (same as they would now).

igordot avatar Oct 31 '19 15:10 igordot

right now I'm reading excel files from a URL using

  tf = tempfile(fileext = ".xlsx")
  curl::curl_download(url, tf)
  readxl::read_excel(tf, ...)

But I could also see a more generic approach using a .con method

con = curl::curl(url)
read_excel.con(con, ...)
close(con)

or a .raw method

raw = curl::curl_fetch_memory(url)$content
read_excel.raw(raw, ...)

mkoohafkan avatar Feb 26 '20 17:02 mkoohafkan

library(openxlsx)

url <- "https://docs.google.com/spreadsheets/d/e/2PACX-1vQH8DfJhj4-fiEw91FkcQXB47cCNSPMJo6AnehnGinq1pALxuQXqVwnZQcKtN0Ulw/pub?output=xlsx"

data <- read.xlsx(url)

This works well.

gbganalyst avatar Mar 10 '20 21:03 gbganalyst

If you are reading a public Google Sheet, you can also use googlesheets4 without having it transit through the xlsx format. Or a private Sheet, for that matter.

jennybc avatar Mar 10 '20 23:03 jennybc

@jennybc thank you. It works when I converted the Excel file to google sheets format.

I used googlesheets4 package

googlesheets4::read_sheet("https://docs.google.com/spreadsheets/d/1wyhnEPoRB-JMssCWD_xslh_9G1JQiPXRuT_I0ibW600/edit#gid=74876023")

gbganalyst avatar Mar 11 '20 18:03 gbganalyst

Hi,

Based upon the thread content, I am not sure if anything has been done to allow input via rawConnection objects. Can you please confirm if that is the case or not?

If it is possible to read from raw connections, can you please indicate how to properly do this?

Thanks

sbihorel avatar May 29 '20 18:05 sbihorel

readxl cannot read from rawConnection objects.

jennybc avatar Aug 11 '20 15:08 jennybc

Am I correct that the main roadblock to reading from, say, rawConnection (i.e. from memory), is the unzip() which only works with file paths? It looks like libxls can read from memory via xls_open_buffer while rapidxml is already reading from memory after unzip() has done its work.

stephematician avatar Mar 02 '22 02:03 stephematician