opendatatoronto

Strange error with get_resource for a .csv file = EOF within quoted string

Open jamiedtor opened this issue 3 years ago • 3 comments

First, excellent, super useful package. Thanks very much.

Second, I have hit one small snag. When I call get_resource() with the following code, the .csv file is parsed incorrectly.

active_building_permits <- search_packages("Active permits") %>%
  list_package_resources() %>%
  dplyr::filter(name == "Active permits (CSV)") %>%
  get_resource()

I get far fewer records than I should, and some information appears in the wrong columns. I also get the following warning:

In scan(file = file, what = what, sep = sep, quote = quote, dec = dec, : EOF within quoted string

I know similar things can happen when calling read.csv or read.table directly, because of stray quotes in the data. See https://kodlogs.com/33766/in-scan-file-file-what-what-sep-sep-quote-quote-dec-dec-eof-within-quoted-string
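For illustration, here is a minimal sketch of how a single stray quote can trigger this warning in base R. The file contents are made up (this is not the Toronto data), and quote = "" is only one possible workaround: it disables quoting entirely, so it would mis-split a file that has commas inside properly quoted fields.

```r
# Hypothetical four-row file; the second data row contains a stray,
# unbalanced double quote (an inch mark).
path <- tempfile(fileext = ".csv")
writeLines(c(
  "id,description",
  "1,Garage",
  "2,Deck 6\" high",
  "3,Porch",
  "4,Shed"
), path)

# Default quoting: the stray quote opens a quoted string that never closes,
# so read.csv() should warn and the rows after it get swallowed.
bad <- read.csv(path)

# Quoting disabled: every double quote is treated as ordinary text.
ok <- read.csv(path, quote = "")

nrow(bad)
nrow(ok)
```

If the two row counts differ, unbalanced quotes are a likely culprit.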

But I'm not sure what is happening here.

jamiedtor, Sep 09 '21 23:09

Hi, thanks for the issue!

I have actually run into this problem myself, with this exact data set! The issue is definitely with the underlying CSV: read.csv() doesn't seem to parse it properly, but readr::read_csv() does. Unfortunately, ckanr (the package that opendatatoronto uses to access the portal) currently uses read.csv() rather than readr::read_csv().

I'll open an issue over on ckanr with this - I'm the maintainer on that too so will have a think about how to handle it.

In the meantime, you can access the file more manually by using ckanr functions and reading the CSV yourself - here is some code to do that:

library(opendatatoronto)
library(ckanr)
#> Loading required package: DBI
library(readr)

active_building_permits <- search_packages("Active permits") %>% 
  list_package_resources() %>% dplyr::filter(name == "Active permits (CSV)")

active_building_permits_id <- active_building_permits[["id"]]
  
# Look up the resource's metadata, which includes its download URL
resource <- resource_show(active_building_permits_id, url = "https://ckan0.cf.opendata.inter.prod-toronto.ca/", as = "list")

# Make a directory to save into
dir <- tempdir()
resource_dir <- fs::dir_create(file.path(dir, active_building_permits_id))

# Save the ZIP file
save_path <- ckan_fetch(resource[["url"]], store = "disk", path = file.path(dir, active_building_permits_id, "res.zip"))

# Unzip it
csv_files <- unzip(save_path[["path"]], exdir = resource_dir)

# Read it in 
res <- read_csv(csv_files)
#> Rows: 246434 Columns: 30
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr (16): PERMIT_NUM, REVISION_NUM, PERMIT_TYPE, STRUCTURE_TYPE, WORK, STREE...
#> dbl (13): GEO_ID, APPLICATION_DATE, ISSUED_DATE, DWELLING_UNITS_CREATED, DWE...
#> lgl  (1): COMPLETED_DATE
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

dim(res)
#> [1] 246434     30

# Compare with reading the same file via read.csv()
bad_res <- read.csv(csv_files)
#> Warning in scan(file = file, what = what, sep = sep, quote = quote, dec = dec, :
#> EOF within quoted string

dim(bad_res)
#> [1] 135718     30
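A rough way to spot this kind of silent row loss without readr is to compare the number of physical data lines in the file with the number of rows read.csv() returns. The sketch below uses a made-up file, and compare_row_counts() is a hypothetical helper, not part of opendatatoronto or ckanr. A mismatch is only a hint, since a file with legitimate line breaks inside quoted fields would also have more lines than rows.

```r
# Hypothetical helper: count data lines on disk vs. rows parsed by read.csv().
compare_row_counts <- function(path) {
  n_lines <- length(readLines(path, warn = FALSE)) - 1L  # minus the header line
  n_rows <- nrow(suppressWarnings(read.csv(path)))       # warning muffled on purpose
  c(lines = n_lines, parsed = n_rows)
}

# Made-up example file with a stray quote in the second data row
path <- tempfile(fileext = ".csv")
writeLines(c(
  "id,description",
  "1,Garage",
  "2,Deck 6\" high",
  "3,Porch",
  "4,Shed"
), path)

counts <- compare_row_counts(path)
counts

# TRUE suggests rows were merged or dropped during parsing
counts[["parsed"]] < counts[["lines"]]
```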

Hope this is helpful in the meantime!

sharlagelfand, Sep 15 '21 21:09

This works perfectly. Thanks so much for the quick fix (and the great package). Shall I close the issue since the manual code works, or would you like me to leave it open as a placeholder?

jamiedtor, Sep 20 '21 14:09

Great, so glad it worked for you! Let's leave it open as a placeholder - thanks!

sharlagelfand, Sep 20 '21 18:09