jsonlite
jsonlite copied to clipboard
simplifyDataFrame can produce columns that don't line up with the rest of the data.frame
I need to parse json that has the following form:
[
{
"key": "item 1",
"points": [
{
"nested_key": "item 1,1",
"nested_value": 11,
"nested_string": "point 1 of item 1"
},
{
"nested_key": "item 1,2",
"nested_value": 12,
"nested_string": "point 2 of item 1"
},
{
"nested_key": "item 1,3",
"nested_value": 13,
"nested_string": "point 3 of item 1"
}
],
"attribute_counts": {
"color": {"blue": 57, "red": 3012},
"size": {"12": 2354, "15": 103},
"name": {"Linda": 234, "Julie": 1}
}
},
{
"key": "item 2",
"points": [
{
"nested_key": "item 2,1",
"nested_value": 21,
"nested_string": "point 1 of item 2"
},
{
"nested_key": "item 2,2",
"nested_value": 22,
"nested_string": "point 2 of item 2"
},
{
"nested_key": "item 2,3",
"nested_value": 23,
"nested_string": "point 3 of item 2"
},
{
"nested_key": "item 2,4",
"nested_value": 24,
"nested_string": "point 4 of item 2"
}
],
"attribute_counts": {
"color": {"blue": 123, "red": 232},
"size": {"12": 623, "15": 143},
"name": {"Bob": 234, "Sam": 1}
}
}
]
The problem is with the attribute_counts
field. It seems that since this is an object, but not an array of objects (like points
is, which fromJSON
handles with ease) it does not get lined up according to row (i.e., in the data.frame
row corresponding to the top level object it comes from) but rather the attribute counts from all top level fields are rbind
ed together in a way that does not seem intentional.
All is well if I use simplifyVector=FALSE
(except that this data would be much easier to work with as a data.frame
- but from a "correctness" perspective)
lst <- fromJSON("path/to/pasted/json/above", simplifyVector = FALSE)
length(lst)
# [1] 2
lapply(lst, length)
# [[1]]
# [1] 3
#
# [[2]]
# [1] 3
The transpose of this also looks good
tlst <- jsonlite:::transpose_list(lst, names(lst[[1]]))
length(tlst)
# [1] 3
lapply(tlst, length)
# [[1]]
# [1] 2
#
# [[2]]
# [1] 2
#
# [[3]]
# [1] 2
But things go wrong when I use simplifyVector=TRUE
df <- fromJSON("path/to/pasted/json/above", simplifyVector = TRUE)
colnames(df)
# [1] "key" "points" "attribute_counts"
lapply(df, length)
# $key
# [1] 2
#
# $points
# [1] 2
#
# $attribute_counts
# [1] 3 # <- yikes!
lapply(df, class)
# $key
# [1] "character"
#
# $points
# [1] "list"
#
# $attribute_counts
# [1] "data.frame"
tibble::as.tibble(df)
# Error: Column `attribute_counts` must be a 1d atomic vector or a list
The attribute_counts
for both items 1 and 2 are lumped together in their own data.frame
. The column breaks what a data.frame
is since the column lengths do not match.
Isolating the attribute_counts
data.frame
does not help either because each field contains a data.frame
instead of a list of nested data.frame
s. Base data.frame
will print it in an unhelpful way, but tibble::as.tibble
and data.table::as.data.table
won't work with it.
df$attribute_counts
# color.blue color.red size.12 size.15 name.Linda name.Julie name.Bob name.Sam
# 1 57 3012 2354 103 234 1 NA NA
# 2 123 232 623 143 NA NA 234 1
lapply(df$attribute_counts, class)
# $color
# [1] "data.frame"
#
# $size
# [1] "data.frame"
#
# $name
# [1] "data.frame"
tibble::as.tibble(df$attribute_counts)
# Error: Columns `color`, `size`, `name` must be 1d atomic vectors or lists
data.table::as.data.table(df$attribute_counts)
# Error in FUN(X[[i]], ...) :
# Invalid column: it has dimensions. Can't format it. If it's the result of data.table(table()), use as.data.table(table()) instead.
I'm not sure what the correct behavior ought to be, but either parsing the attribute_counts
for each item into a 1-row data.frame
or leaving it as a nested list
seem like fine options to me.
Thanks
This is expected. You cannot convert attribute_counts
to a tibble because it has nested data. There is no obvious way to flatten it into a rectangle.
One thing you can do is convert the nested fields:
out <- jsonlite::fromJSON(json)
tibble::as.tibble(out$attribute_counts$color)
tibble::as.tibble(out$attribute_counts$size)
tibble::as.tibble(out$attribute_counts$name)
But for such complex data you may be better of using simplifyVector=FALSE
and manually writing code to convert the data into the required structure.
Thanks for the quick response. Unfortunately I'm after a more general solution (I have lots of different data types that combine nested arrays and non-array objects at the top level which I need to work with).
I see your point about not being able to coerce everything into a rectangle, but given the relationship between data.frame
rows and top level json objects, it seems to me that leaving attribute_counts
(and similar structures) as list
columns would be more usable than the current return value. Thus, df$attribute_counts
would be a list
of length nrow(df)
and df$attribute_counts[[1]]
would be a list
(not a data.frame
) containing the attribute counts from the original json corresponding to item 1.
Perhaps I should ask, is the behavior here expected because it is useful in another context or is it expected because jamming all json objects into data.frame
s doesn't always work?