
Support datatypes: array, map, struct

Open DyfanJones opened this issue 3 years ago • 20 comments

Currently noctua doesn't support AWS Athena data types [array, map, struct]

DyfanJones avatar Jan 26 '21 16:01 DyfanJones

The data-types branch attempts to parse the AWS Athena array, map and row data types.

library(DBI)

con <- dbConnect(noctua::athena(), convert_array = T)

query1 <- "SELECT
ARRAY [CAST(4 AS VARCHAR), CAST(5 AS VARCHAR)]
AS items"

query2 <- 
"SELECT
ARRAY[CAST(MAP(ARRAY['a1', 'a2', 'a3'], ARRAY[1, 2, 3]) AS JSON)] ||
  ARRAY[CAST(MAP(ARRAY['b1', 'b2', 'b3'], ARRAY[4, 5, 6]) AS JSON)]
AS items"

query3 <- "SELECT
CAST(
  ROW('Bob', 38) AS ROW(name VARCHAR, age INTEGER)
) AS users"

query4 <- "SELECT ARRAY[
  CAST(ROW('Bob', 38) AS ROW(name VARCHAR, age INTEGER)),
  CAST(ROW('Alice', 35) AS ROW(name VARCHAR, age INTEGER)),
  CAST(ROW('Jane', 27) AS ROW(name VARCHAR, age INTEGER))
  ] AS users"

query5 <- "SELECT
    CAST(
      ROW('aws.amazon.com', ROW(true)) AS ROW(hostname VARCHAR, flaggedActivity ROW(isNew BOOLEAN))
    ) AS sites"

query6 <- 
  "SELECT
    sites.hostname,
    sites.flaggedactivity.isnew
    FROM (
      SELECT
      CAST(
        ROW('aws.amazon.com', ROW(true)) AS ROW(hostname VARCHAR, flaggedActivity ROW(isNew BOOLEAN))
      ) AS sites
    ) as dataset"

query7 <- "SELECT MAP(
  ARRAY['first', 'last', 'age'],
  ARRAY['Bob', 'Smith', '35']
) AS user"

df1 <- dbGetQuery(con, query1)
# Info: (Data scanned: 0 Bytes)

df2 <- dbGetQuery(con, query2)
# Info: (Data scanned: 0 Bytes)

df3 <- dbGetQuery(con, query3)
# Info: (Data scanned: 0 Bytes)

df4 <- dbGetQuery(con, query4)
# Info: (Data scanned: 0 Bytes)
# Warning message:
# Column `users` was unable to be converted. Returning column `users` as character 

df5 <- dbGetQuery(con, query5)
# Info: (Data scanned: 0 Bytes)

df6 <- dbGetQuery(con, query6)
# Info: (Data scanned: 0 Bytes)

df7 <- dbGetQuery(con, query7)
# Info: (Data scanned: 0 Bytes)

AWS Athena SQL Query Resource:

DyfanJones avatar Jan 26 '21 17:01 DyfanJones

What the above queries return from AWS Athena:

query1:

+-------+
| items |
+-------+
| [4,5] |
+-------+

query2:

+--------------------------------------------------+
| items                                            |
+--------------------------------------------------+
| [{"a1":1,"a2":2,"a3":3}, {"b1":4,"b2":5,"b3":6}] |
+--------------------------------------------------+

query3:

+--------------------+
| users              |
+--------------------+
| {NAME=Bob, AGE=38} |
+--------------------+

query4:

+-----------------------------------------------------------------+
| users                                                           |
+-----------------------------------------------------------------+
| [{NAME=Bob, AGE=38}, {NAME=Alice, AGE=35}, {NAME=Jane, AGE=27}] |
+-----------------------------------------------------------------+

query5:

+----------------------------------------------------------+
| sites                                                    |
+----------------------------------------------------------+
| {HOSTNAME=aws.amazon.com, FLAGGEDACTIVITY={ISNEW=true}}  |
+----------------------------------------------------------+

query6:

+------------------------+
| hostname       | isnew |
+------------------------+
| aws.amazon.com | true  |
+------------------------+

query7:

+---------------------------------+
| user                            |
+---------------------------------+
| {last=Smith, first=Bob, age=35} |
+---------------------------------+

Note: map and row types appear to be returned in the format {object=something}. jsonlite::parse_json is unable to parse this format and returns errors, so noctua will focus on array data types and leave the other types as characters.
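For illustration (assuming jsonlite is installed), arrays cast to JSON parse cleanly, while the map/row-style output does not:

library(jsonlite)

# arrays cast to JSON come back as valid JSON, so parse_json handles them
parse_json('[{"a1":1,"a2":2,"a3":3},{"b1":4,"b2":5,"b3":6}]')

# the map/row output is not valid JSON (unquoted keys, "=" instead of ":"),
# so parse_json errors
try(parse_json("{NAME=Bob, AGE=38}"))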

DyfanJones avatar Jan 26 '21 17:01 DyfanJones

What is returned in R when convert_array = TRUE:

df1:
#        items
# 1: <list[2]>

df1$items[[1]]
# [[1]]
# [1] 4
# 
# [[2]]
# [1] 5

df2:
#        items
# 1: <list[2]>

df2$items[[1]]
# [[1]]
# [[1]]$a1
# [1] 1
# 
# [[1]]$a2
# [1] 2
# 
# [[1]]$a3
# [1] 3
# 
# 
# [[2]]
# [[2]]$b1
# [1] 4
# 
# [[2]]$b2
# [1] 5
# 
# [[2]]$b3
# [1] 6

df3:
#                 users
# 1: {name=Bob, age=38}

df4:
#                                                              users
# 1: [{name=Bob, age=38}, {name=Alice, age=35}, {name=Jane, age=27}]

df5:
#                                                     sites
# 1: {hostname=aws.amazon.com, flaggedactivity={isnew=true}}

df6:
#          hostname isnew
# 1: aws.amazon.com  TRUE

df7:
#                              user
# 1: {last=Smith, first=Bob, age=35}

DyfanJones avatar Jan 26 '21 17:01 DyfanJones

As JSON can be parsed in different ways, it might be worth letting users provide a custom parser if they don't like the default.

For example:

library(DBI)

# use default parse: jsonlite::parse_json
con1 <- dbConnect(noctua::athena())

# leave arrays as characters
con2 <- dbConnect(noctua::athena(), convert_array = NULL)

# use custom json parser
con3 <- dbConnect(noctua::athena(), convert_array = jsonlite::fromJSON)

DyfanJones avatar Jan 26 '21 18:01 DyfanJones

Another question: should the convert_array parameter be in dbConnect? Would it be better to have it in noctua_options so that arrays can be parsed in multiple ways .... or even both?

DyfanJones avatar Jan 26 '21 18:01 DyfanJones

Added support for more AWS Athena data types: varbinary, ipaddress and json.

I am pretty happy with the binary method currently proposed in branch: data-types

library(DBI)
library(data.table)

# default binary conversion connection
con1 <- dbConnect(noctua::athena())
query <- "SELECT to_utf8('helloworld') as hi"

dt1 = dbGetQuery(con1, query)

dt1
#                       hi
# 1: 68,65,6c,6c,6f,77,...

sapply(dt1, class)
#     hi 
# "list"

con2 <- dbConnect(noctua::athena(), binary = "character")
dt2 = dbGetQuery(con2, query)

dt2
#                               hi
# 1: 68 65 6c 6c 6f 77 6f 72 6c 64

sapply(dt2, class)
#          hi 
# "character"

The default binary data type conversion method returns a list of raw vectors. This makes it really easy for R users to convert it back to whatever they need.

dt1[, string := lapply(hi, rawToChar)]
dt1
#                       hi     string
# 1: 68,65,6c,6c,6f,77,... helloworld

DyfanJones avatar Jan 27 '21 15:01 DyfanJones

For completeness, here is the dplyr equivalent of the data.table example above:

library(DBI)
library(dplyr)

noctua::noctua_options("vroom")
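# the positional argument above sets noctua's file_parser option to "vroom",
# so query results come back as tibbles rather than data.tables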

# default binary conversion connection
con1 <- dbConnect(noctua::athena())
query <- "SELECT to_utf8('helloworld') as hi"

dt1 = dbGetQuery(con1, query)
dt1
# A tibble: 1 x 1
#  hi        
#  <list>    
#1 <raw [10]>

con2 <- dbConnect(noctua::athena(), binary = "character")
dt2 = dbGetQuery(con2, query)

dt2
# A tibble: 1 x 1
#  hi                           
#  <chr>                        
#1 68 65 6c 6c 6f 77 6f 72 6c 64

For the dplyr approach, a nested column of raw vectors is provided. To get at the data, users can do the following:

library(purrr)
dt1 %>% mutate(string = map_chr(hi, rawToChar))
# A tibble: 1 x 2
#  hi         string    
#  <list>     <chr>     
#1 <raw [10]> helloworld

Note: sapply can be used instead of map_chr in this example.
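For example, the equivalent call with sapply:

dt1 %>% mutate(string = sapply(hi, rawToChar))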

DyfanJones avatar Jan 27 '21 15:01 DyfanJones

Wow, looks amazing! You sure work fast with these :) A quick first note on your question: I would vote for having the convert_array parameter in noctua_options. This would allow switching without establishing a new connection (which can be pretty slow with an extensive Glue catalogue in RStudio). More importantly, this somehow feels related to the data.parser option, so having these in the same place would make sense to me.

OssiLehtinen avatar Jan 27 '21 17:01 OssiLehtinen

@OssiLehtinen I've been pondering it all last night, thinking it over :P I am guessing you would say the same for the binary and bigint parameters?

I am more than happy to have it in noctua_options as well to make it more flexible.

DyfanJones avatar Jan 27 '21 17:01 DyfanJones

Added parameters for bigint, binary and json conversion within noctua_options.

noctua_options allows conversion methods to be changed, with the bonus of not affecting the other conversion methods if a user only changes one of them. For example:

library(DBI)
library(noctua)

# default conversion methods
con <- dbConnect(noctua::athena())

# change json conversion method
noctua_options(json = "character")
noctua:::athena_option_env$json
# [1] "character"

# change json conversion to custom method
noctua_options(json = jsonify::from_json)
noctua:::athena_option_env$json
# function (json, simplify = TRUE, fill_na = FALSE, buffer_size = 1024) 
# {
#   json_to_r(json, simplify, fill_na, buffer_size)
# }
# <bytecode: 0x7f823b9f6830>
#   <environment: namespace:jsonify>

# change bigint conversion without affecting custom json conversion methods
noctua_options(bigint = "numeric")
noctua:::athena_option_env$json
# function (json, simplify = TRUE, fill_na = FALSE, buffer_size = 1024) 
# {
#   json_to_r(json, simplify, fill_na, buffer_size)
# }
# <bytecode: 0x7f823b9f6830>
#   <environment: namespace:jsonify>

noctua:::athena_option_env$bigint
# [1] "numeric"

# change binary conversion without affect, bigint or json methods
noctua_options(binary = "character")
noctua:::athena_option_env$json
# function (json, simplify = TRUE, fill_na = FALSE, buffer_size = 1024) 
# {
#   json_to_r(json, simplify, fill_na, buffer_size)
# }
# <bytecode: 0x7f823b9f6830>
#   <environment: namespace:jsonify>

noctua:::athena_option_env$bigint
# [1] "numeric"

noctua:::athena_option_env$binary
# [1] "character"

Note: the convert_array parameter has been renamed to json. This feels more appropriate, as noctua already has bigint conversion within dbConnect.

DyfanJones avatar Jan 27 '21 17:01 DyfanJones

Looks great after some testing too!

It is a shame that athena returns the weirdly formatted stuff with structs and maps. We tried brainstorming a bit with some colleagues on how to parse those but couldn't come up with anything even remotely elegant.

One note on the bigint part: If I use:

noctua_options(bigint = "integer")

and query a table with big bigints (which overflow 32-bit ints), I get the following message and the column remains int64:

Attempt to override column 1 <<bigint_col>> of inherent type 'int64' down to 'int32' ignored. Only overrides to a higher type are currently supported. If this was intended, please coerce to the lower type afterwards.

So the behaviour and returned data types depend on the data entries, which makes things a bit unpredictable. Of course it is a silly thing to cast an overflowing int64 to int, but still.

With the vroom parser, I consistently get int columns, but the overflowing entries come out as NAs. I'm not sure which is better, to be honest.
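For reference, a minimal sketch of the "coerce to the lower type afterwards" route the data.table message points at (not from the branch; assumes the default data.table file parser and bigint = "integer64"):

library(DBI)
library(data.table)

con <- dbConnect(noctua::athena())   # default: bigint = "integer64"

# 2147483648 is one past the 32-bit integer range
dt <- dbGetQuery(con, "SELECT CAST(2147483648 AS BIGINT) AS bigint_col")

# explicit downcast after the read; out-of-range values come back as NA with a warning
dt[, bigint_col := as.integer(bigint_col)]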

OssiLehtinen avatar Jan 28 '21 13:01 OssiLehtinen

Ah thanks @OssiLehtinen. I am currently looking at how to improve performance for the current conversion methods, possibly with some improvements to the binary and json methods 😄 .

I will have a look at fread to see if we can force integers.

DyfanJones avatar Jan 28 '21 14:01 DyfanJones

Three different methods for parsing a character vector of JSON strings:

# method 1
# collapse vector of json strings and create 1 big json:
method_1 <- function(string) jsonlite::parse_json(paste0("[", paste(string, collapse = ","), "]"))

# method 2
# parse each json object
method_2 <- function(string) lapply(string, jsonlite::parse_json)

# method 3
# chunk up json strings then collapse json chunks before parsing them 
method_3 <- function(string, fun=jsonlite::parse_json, min_chunk = 10000L){
  if(length(string) < min_chunk){
    output <- fun(paste0("[", paste(string, collapse = ","), "]"))
  } else {
    len <- max(ceiling(length(string)/20), min_chunk)
    split_string <- split_vec(string, len)
    output <- unlist(
      lapply(split_string, function(i) fun(create_json_string(i))),
      recursive = FALSE
    )
  }
  return(output)
}

split_vec <- function(vec, len, max_len = length(vec)){
  chunks <- seq(1, max_len, len)
  ll <- Map(function(i) list(), 1:length(chunks))
  for (i in seq_along(chunks))
    ll[[i]] <- vec[chunks[i]:min(chunks[i]+(len-1), max_len)]
  return(ll)
}

create_json_string <- function(string){paste0("[", paste(string, collapse = ","), "]")}

method 1:

  • Pros: only calls the JSON parser once
  • Cons: can be at risk of reaching the memory limit for characters

method 2:

  • Pros: simple to implement
  • Cons: the JSON parser is called for each JSON string within the character vector

method 3:

  • Pros: a safer implementation of method 1, as the character vector is split into chunks to reduce the chance of reaching memory limitations
  • Cons: more difficult to implement; has the overhead of splitting the initial vector into chunks

DyfanJones avatar Jan 28 '21 18:01 DyfanJones

library(microbenchmark)
library(ggplot2)

n <- 1e6 # 10, 10,000 , 1,000,000
string <- rep(jsonlite::toJSON(iris[1,]), n)
bench <- microbenchmark(
  "paste method" = method_1(string),
  "lapply method" = method_2(string),
  "chunk paste method" = method_3(string),
  times = 100
)

autoplot(bench) +
  labs(title = "Speed of parsing json string",
       subtitle = sprintf("Number of json strings: %s", n))

Edit: updated charts to reflect the updated method 3.

[Benchmark charts: json_parser_10, json_parser_10000, json_parse_1000000]

From these benchmarks it looks like the paste method and the chunk paste method are the winners, and the overhead of chunking the vector doesn't seem noticeable. Based on these results, noctua will opt for the chunk paste method, as it provides extra safety against memory limitations with no noticeable chunking overhead.

DyfanJones avatar Jan 28 '21 19:01 DyfanJones

Quick benchmark with the json parser and binary parser in place:

library(data.table)

test_data <- function(N = 10000L, seed = 142L){
  set.seed(seed)
  data.table(
    id=1:N,
    original=sapply(1:N, function(x) paste(sample(letters, sample(5:10, 1)), collapse = ",")),
    json = jsonlite::toJSON(iris[1,]))
}

output_test_data <- function(N = 10000L, temp_file){
  dt_list <- test_data(N)
  dt_list[, raw := lapply(original, charToRaw)]
  dt_list[, raw_string := sapply(raw, function(x) paste(x, collapse = " "))]
  fwrite(dt_list[,.(id, original, raw_string, json)],
         temp_file,
         quote = T)
}

test_file <- tempfile()
size <- 1e6
output_test_data(size, test_file)

data_type <- list(list(Name = c("id", "original", "raw_string", "json"),
                       Type = c("integer", "string", "varbinary", "json")))

method <- "method"
class(method) <- "athena_data.table"
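# `method` is a dummy first argument for the internal S3 method called directly below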

# noctua::noctua_options(json="auto")
# noctua::noctua_options(binary="raw")
system.time({
  dt <- noctua:::athena_read.athena_data.table(
    method,
    test_file,
    data_type)
})
# user  system elapsed 
# 23.470   0.659  24.166 

# noctua::noctua_options(binary="raw")
noctua::noctua_options(json="character")
system.time({
  dt <- noctua:::athena_read.athena_data.table(
    method,
    test_file,
    data_type)
})
# user  system elapsed 
# 14.639   0.232  15.268 

noctua::noctua_options(binary="character")
noctua::noctua_options(json="character")
system.time({
  dt <- noctua:::athena_read.athena_data.table(
    method,
    test_file,
    data_type)
})
# user  system elapsed 
# 7.824   0.053   7.899 

The slowest parser is the one converting the "raw" string into actual raw vectors. This is because the raw conversion needs to be called on each string; I don't believe it can be called in chunks, but if it could, that would speed up the conversion.

Overall the conversions don't slow down reading that much, and users can always turn them off and implement their own if desired :)
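For example, a rough sketch of switching both converters off and parsing manually afterwards, reusing the test data built above (the column names json and raw_string come from that example):

noctua::noctua_options(json = "character")
noctua::noctua_options(binary = "character")

dt <- noctua:::athena_read.athena_data.table(method, test_file, data_type)

# parse only the columns you need, with whichever parser you prefer
dt[, json_parsed := lapply(json, jsonlite::parse_json)]
dt[, raw_parsed := lapply(strsplit(raw_string, " ", fixed = TRUE),
                          function(x) as.raw(as.hexmode(x)))]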

DyfanJones avatar Jan 28 '21 20:01 DyfanJones

I think I have thought of a way to improve the speed of the raw conversion:

library(data.table)

N <- 1e6
set.seed(142)
dt_list <- data.table(
  id=1:N,
  original=sapply(1:N, function(x) paste(sample(letters, sample(5:10, 1)), collapse = ",")))
dt_list[, raw := lapply(original, charToRaw)]
dt_list[, raw_string := sapply(raw, function(x) paste(x, collapse = " "))]

# method 1:
# Takes a string and converts it to raw
hex2raw <- function(string){
  split_str = strsplit(string, split = " ", fixed = TRUE)
  return(lapply(split_str, function(x) as.raw(as.hexmode(x))))
}

# method 2:
hex2raw_v2 <- function(string){
  split_str <- strsplit(string, split = " ", fixed = TRUE)
  output <- as.raw(as.hexmode(unlist(split_str)))
  split_raw_v1(output, sapply(split_str, length))
}

# helper function to split raw vector back into list format
split_raw_v1 <- function(vec, splits){
  ll <- Map(function(i) list(), 1:length(splits))
  Slices <- cumsum(c(1, splits))
  for (i in seq_along(splits))
    ll[[i]] <- vec[Slices[i]:(Slices[i+1]-1)]
  return(ll)
}

# method 3:
hex2raw_v3 <- function(string){
  split_str <- strsplit(string, split = " ", fixed = TRUE)
  output <- as.raw(as.hexmode(unlist(split_str)))
  split_raw_v2(output, sapply(split_str, length))
}

split_raw_v2 <- function(vec, splits){
  start <- cumsum(c(1, splits))
  lapply(seq_along(splits), function(i) vec[start[i]:(start[i+1]-1)])
}

# method 4:
hex2raw_v4 <- function(string){
  split_str <- strsplit(string, split = " ", fixed = TRUE)
  output <- as.raw(as.hexmode(unlist(split_str)))
  split_raw_v3(output, sapply(split_str, length))
}

# helper function to split raw vector back into list format
split_raw_v3 <- function(vec, splits){
  start <- cumsum(c(1, splits))
  end <- start[-1]-1
  lapply(seq_along(splits), function(i) vec[start[i]:end[i]])
}

method 1: This method applies the raw conversion to each element of the list using an lapply.

method 2: This method applies the raw conversion to the whole vector (so it is only called once), then splits the raw vector back into the correct list format using a for loop.

method 3: This method applies the raw conversion to the whole vector (so it is only called once), then splits the raw vector back into the correct list format using an lapply.

method 4: The same as method 3; the only difference is that the end positions of the splits are computed before the lapply (probably virtually no improvement, but it should help with readability).
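A sketch of how the four methods could be compared, reusing dt_list from above and the same microbenchmark/autoplot pattern as the JSON benchmark (the labels are mine):

library(microbenchmark)
library(ggplot2)

bench <- microbenchmark(
  "per-element conversion"      = hex2raw(dt_list$raw_string),
  "single conversion + for"     = hex2raw_v2(dt_list$raw_string),
  "single conversion + lapply"  = hex2raw_v3(dt_list$raw_string),
  "single conversion + lapply2" = hex2raw_v4(dt_list$raw_string),
  times = 10
)

autoplot(bench) +
  labs(title = "Speed of converting hex strings to raw vectors",
       subtitle = sprintf("Number of strings: %s", N))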

DyfanJones avatar Jan 29 '21 14:01 DyfanJones

[Benchmark chart: raw_conversion]

I am really happy to see that this approach of converting everything to raw first and splitting the raw vector afterwards is proving to be the fastest. Based on this, noctua will use method 4.
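As a quick sanity check of the chosen method against the "helloworld" hex string from the earlier binary example:

hex2raw_v4("68 65 6c 6c 6f 77 6f 72 6c 64")
# [[1]]
# [1] 68 65 6c 6c 6f 77 6f 72 6c 64

rawToChar(hex2raw_v4("68 65 6c 6c 6f 77 6f 72 6c 64")[[1]])
# [1] "helloworld"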

DyfanJones avatar Jan 29 '21 14:01 DyfanJones

Merged PR #138

DyfanJones avatar Jan 29 '21 16:01 DyfanJones

Not 100% sure what to do with data.table and bigint, as I can't find a method to force integer.

> Looks great after some testing too!
>
> It is a shame that athena returns the weirdly formatted stuff with structs and maps. We tried brainstorming a bit with some colleagues on how to parse those but couldn't come up with anything even remotely elegant.
>
> One note on the bigint part: If I use:
>
> noctua_options(bigint = "integer")
>
> and query a table with big bigints (which overflow 32-bit ints), I get the following message and the column remains int64:
>
> Attempt to override column 1 <<bigint_col>> of inherent type 'int64' down to 'int32' ignored. Only overrides to a higher type are currently supported. If this was intended, please coerce to the lower type afterwards.
>
> So the behaviour and returned data types depend on the data entries, which makes things a bit unpredictable. Of course it is a silly thing to cast an overflowing int64 to int, but still.
>
> With the vroom parser, I consistently get int columns, but the overflowing entries come out as NAs. I'm not sure which is better, to be honest.

Possibly I will just document this behaviour so users are aware of what will happen if they cast a big integer as a plain integer and the cast simply fails.

DyfanJones avatar Jan 29 '21 16:01 DyfanJones

Sounds reasonable. It also helps that data.table alerts the user if casting fails.

> Possibly I will just document this behaviour so users are aware of what will happen if they cast a big integer as a plain integer and the cast simply fails.

OssiLehtinen avatar Feb 01 '21 08:02 OssiLehtinen