
Performance considerations; shift away from readr?

Open bpbond opened this issue 7 years ago • 13 comments

As I noted in #1073, it's not clear that we need the readr dependency at this point; we could simply pass the column class information to read.csv.

@ashiklom noted on Slack as well:

Have we ever considered using data.table::fread instead (especially since we already import it)?

Two other thoughts I had:

  • Since data.table is so fast, we may be able to achieve speed gains without much re-writing by using dtplyr for at least some functions. E.g. dtplyr::left_join -- https://github.com/hadley/dtplyr/blob/master/R/joins.R
  • One package to keep an eye on is vroom, which uses R 3.5's new ALTREP framework (which, as far as I understand, is black magic), to read data lazily and therefore claims to achieve blisteringly fast speeds. Currently in active development, and not on CRAN, but maybe worth a GitHub star: https://github.com/jimhester/vroom
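To make the base-R option above concrete, here is a minimal sketch of dropping readr in favor of read.csv, assuming column types are specified with a readr-style string like "cin" (the translation helper is hypothetical, not part of gcamdata):

```r
# Hypothetical helper: map a readr-style type string ("c" = character,
# "i" = integer, "n"/"d" = numeric) onto read.csv's colClasses argument
readr_types_to_colclasses <- function(types) {
  map <- c(c = "character", i = "integer", n = "numeric", d = "numeric")
  unname(map[strsplit(types, "")[[1]]])
}

# read.csv handles comment lines natively via comment.char,
# which data.table::fread currently does not
dat <- read.csv("some_file.csv",
                colClasses = readr_types_to_colclasses("cin"),
                comment.char = "#",
                stringsAsFactors = FALSE)
```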

bpbond avatar Jan 05 '19 13:01 bpbond

Re performance see also #1068

bpbond avatar Jan 05 '19 13:01 bpbond

fread does not currently support comment characters (see https://github.com/Rdatatable/data.table/issues/856) so we'd need to work around that.

bpbond avatar Jan 11 '19 15:01 bpbond

fread offers a pre-processing step that makes this easy

fread(cmd = paste("sed '/^#/d'", fn))

...but Windows? This might be a deal-breaker.
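A Windows-safe alternative, assuming a data.table version recent enough to have fread's text argument (added in 1.11.0), is to strip comment lines in R itself before parsing, avoiding shell tools entirely:

```r
library(data.table)

# Read raw lines, drop full-line comments, and parse what remains.
# Slower than fread-ing the file directly, but needs no external commands.
lines <- readLines(fn)
dat <- fread(text = lines[!grepl("^#", lines)])
```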

bpbond avatar Jan 11 '19 15:01 bpbond

Since we already have a function for reading the comment header, how about something like this?

header <- read_comment_header(file)
data <- data.table::fread(file, skip = length(header))

ashiklom avatar Jan 11 '19 16:01 ashiklom

Thanks. Yeah I'm just not sure if there are comments elsewhere in some files.

bpbond avatar Jan 11 '19 16:01 bpbond

Just checked and there are, but only a handful:

aglu/Mueller_crops.csv:6:# Source: Mueller, N. D.; Gerber, J. S.; Johnston, M.; Ray, D. K.; Ramankutty, N.; Foley, J. A. Closing yield gaps through nutrient and water management. Nature 2012, 490 (7419), 254–257.,
energy/A54.tranSubsector_shrwt.csv:59:# Passthrough tranSubsectors are listed below
energy/A54.tranSubsector_logit.csv:58:# Passthrough tranSubsectors are listed below,,,,absolute-cost-logit
energy/A54.tranSubsector_VOTT.csv:31:# Pass-through tranSubsectors
energy/A54.tranSubsector_VOTT_ssp1.csv:31:# Pass-through tranSubsectors

If we wanted to continue to support comments in the file bodies (which I personally think is a bad idea, but I understand the rationale), then that does rule out fread for now. But if we fix these 5 files, then the skip = length(header) approach should work.
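Assuming the comment-header helper works roughly like the one mentioned above (read_comment_header here is a sketch, not the actual gcamdata function), the skip-based approach might look like:

```r
library(data.table)

# Sketch: collect the run of leading comment lines, then skip past them
read_comment_header <- function(file) {
  lines <- readLines(file)
  n <- match(FALSE, grepl("^#", lines)) - 1  # length of leading comment block
  lines[seq_len(n)]
}

file <- "energy/A54.tranSubsector_shrwt.csv"  # illustrative path
header <- read_comment_header(file)
dat <- fread(file, skip = length(header))
```

Note this only handles a comment block at the top of the file, which is why the five files with mid-body comments would first need fixing.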

ashiklom avatar Jan 11 '19 16:01 ashiklom

Huh, thanks for checking! As usual, it's too bad about Windows because the cmd = thing above is super slick and easy.

bpbond avatar Jan 11 '19 16:01 bpbond

@bpbond As discussed earlier, readr tries to parse integers written in scientific notation as doubles. Changing the column type to 'i' does not solve this problem when there are mixed data types in a column. You get the following error:

Error : (converted from warning) 1 parsing failure.
  row  col   expected                actual   file
  7    2009  no trailing characters  .14E+05  'C:/Users/nara836/Documents/R/win-library/3.6/gcamdata/extdata/aglu/FAO/FAO_Fert_Prod_tN_RESOURCESTAT.csv'

Error in load_csv_files(unfound_chunk$input, unfound_chunk$optional, quiet = TRUE) : Error or warning while reading FAO_Fert_Prod_tN_RESOURCESTAT.csv

The workaround I have been using is gathering all values into one column, converting the entire "Value" column to the desired type (e.g. with as.numeric), and then spreading back.
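The gather/convert/spread workaround above can be sketched with tidyr as follows (the column names here are illustrative, not the actual FAO file's):

```r
library(dplyr)
library(tidyr)

# Gather the year columns into one long "value" column, coerce it
# (as.numeric handles "1.14E+05"-style scientific notation), then spread back
dat_fixed <- dat %>%
  gather(year, value, -country, -item) %>%
  mutate(value = as.numeric(value)) %>%
  spread(year, value)
```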

kanishkan91 avatar Aug 27 '19 21:08 kanishkan91

@kanishkan91 what is the dataset that is reporting data with scientific notation that should be read as integer? I can't think of any case where that would be desirable, and don't see any in the current gcamdata repo.

pkyle avatar Aug 27 '19 21:08 pkyle

@pkyle I found this in the FAO data. When a number is too large, the FAO dataset presents it in scientific notation rather than as a plain number. This needs to be converted back, or the driver crashes with the above-mentioned error.

kanishkan91 avatar Aug 27 '19 21:08 kanishkan91

I'm sorry I missed the error; what is the problem with having all of these data as doubles? Does that cause an error? Quantities (e.g. tonnes of fertilizer) are not really integers; they're just rounded to the nearest unit.

pkyle avatar Aug 27 '19 21:08 pkyle

@pkyle Thanks for helping debug this. If you use a column type of 'n' in the metadata, the problem goes away. But we have historically been reading in data from FAO as integers. I will update that. But others may run into this error elsewhere, where they actually want to read data in as integers.

kanishkan91 avatar Aug 29 '19 20:08 kanishkan91

Great! Just to clarify and put in writing what we discussed, the reason data columns are input as i is simply that we carried over readr's estimates (where they worked, anyway!) when we first rolled out the manually specified column types. However, the integer column type should be reserved for things like years and item codes that can't have decimal values. Value data that happens to be rounded to the nearest unit should be specified as n, not i.
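The i-versus-n distinction can be seen directly in readr: parsing a scientific-notation value with the integer type produces a parsing failure, while the number type accepts it. A minimal illustration (not gcamdata code):

```r
library(readr)

csv <- "item,value\nfertilizer,1.14E+05\n"

# col_types "ci": readr refuses the scientific-notation value as integer
# and emits a parsing-failure warning, like the one reported above
read_csv(csv, col_types = "ci")

# col_types "cn": parsed cleanly as a double (114000)
read_csv(csv, col_types = "cn")
```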

pkyle avatar Aug 29 '19 20:08 pkyle