jsonlite
jsonlite copied to clipboard
fromJSON: Manually set how null is handled
Can you include an option for setting how JSON null is handled? I'd prefer they be transformed to NA rather than NULL.
E.g. fromJSON("{\"Missing_Data\":null}", null=NA)
should give me
$Missing_Data
NA
rather than
$Missing_Data
NULL
I have the same issue. Having the missing data as NULL results in me not being able to coerce into a data frame. See my gist for an example: https://gist.github.com/safferli/846e91a2cc3bef6cee06
resp1
has a NULL value, and chockes on coercion.
The other json packages have nullValue
as an option, something like that is missing for jsonlite.
I also think this would be a useful feature. Also the link by safferli to the gist should be: https://gist.github.com/safferli/5cdf28465fd7cec7ef6b
Trying to reproduce this but the API only gives me You need to subscribe to a plan before consuming the API
. Can you give me an example of the full JSON structure as returned by the server?
Here is a simple example (I am working with nested json files that are a hundred megs): {"balance":1000.21,"num":100,"nickname":null,"is_vip":true,"name":"foo"}
If you read this in, you will see 5 elements. If you do an unlist on this, it will drop to 4 elements. The unlist command doesn't like a null in the json: http://stackoverflow.com/questions/2991514/r-preventing-unlist-to-drop-null-values
My quick reading on the Internet shows that there are different schools of thoughts with handling nulls in json, so I am just pointing out what isn't working for me.
http://www.lively-web.org/R-libraries/RJSONIO/doc/missingValues.html
Thanks for trying to figure this out
You probably shouldn't be using unlist
in the fist place. If the JSON objects appear in an array, they turn into missing values automatically:
json <- '[
{"balance":1000.21,"num":100,"nickname":null,"is_vip":true,"name":"foo"},
{"balance":999,"num":100,"nickname":"jimmy","is_vip":false,"name":"bar"},
{"balance":1012,"num":100,"nickname":null,"is_vip":true,"name":"baz"}
]'
fromJSON(json)
## balance num nickname is_vip name
## 1 1000.21 100 <NA> TRUE foo
## 2 999.00 100 jimmy FALSE bar
## 3 1012.00 100 <NA> TRUE baz
Also within an array the null
values become na
:
fromJSON('[1,2,null,4]')
## [1] 1 2 NA 4
Can you give me a complete example of real world data where things go wrong for you?
So here is the more complicated workflow I am using. I want to get NBA Movement data into a data frame. I found some great work and sample code here: http://tcbanalytics.com/blog/nba-movement-data-R.html#.VnnHGYRiOYV
This is just for one event in a game. I want to be able to look at all events in a game. I grabbed a JSON for an entire game from here: wget https://github.com/neilmj/BasketballData/blob/master/2016.NBA.Raw.SportVU.Game.Logs/12.18.2015.DET.at.CHI.7z?raw=true
I then try to run the script against it and get stymied because there are nulls in this dataset. What happens is that when the unlist commands run, I end up with different lengths for quarters,game.clock, and shot.clocks. This makes it difficult to combine everything into one dataframe. If there were NAs, then it would make it easier to combine.
The way I currently handle it is doing ifelse to change the nulls to NA
I readily concede there are probably better ways to handle the JSON. I was just following the script and trying to get this JSON to work within it.
I hope this helps and appreciate you trying to figure this out.
I am trying to do the same with game log data from nba.com/stats web site and jsonlite::fromJSON seems to handle NULL values very well:
#### Player game logs URL: one record per player per game played ####
gameLogsURL <- paste("http://stats.nba.com/stats/leaguegamelog?Counter=1000&Direction=DESC&LeagueID=00&PlayerOrTeam=P&Season=2016-17&SeasonType=Regular+Season&Sorter=PTS")
#### Import game logs data from JSON ####
# use jsonlite::fromJSON to handle NULL values
gameLogsData <- jsonlite::fromJSON(gameLogsURL, simplifyDataFrame = TRUE)
# Save into a data frame and add column names
gameLogs <- data.frame(gameLogsData$resultSets$rowSet)
colnames(gameLogs) <- gameLogsData$resultSets$headers[[1]]
I think that this would help resolve some behavior that is showing up in purrr, eg #181 - map_chr
will 'die trying' if it encounters NULL, but can convert NA to NA_character
.
Any movement on this? A lot of things choke on NULL
s, would be great to be able to coerce them to NA
s
I find the null
coercion that jsonlite
is doing a bit odd. I'm using jq
to query a single large JSON object and return a much smaller JSON object to be coerced into a tibble. If I structure the results of my jq
query as an object, the null
values in my JSON object are not coerced, and the coercion to tibble fails. If I just wrap that object in an array, then the coercion to NA
happens.
In short, what's the motivation for this behavior:
> fromJSON('{"key":null}')
$key
NULL
> fromJSON('[{"key":null}]')
key
1 NA
Same here!
Same here!
Same here!
df = jsonlite::fromJSON(rjson::toJSON(YOUR_R_List), simplifyDataFrame = TRUE) OR: df = jsonlite::fromJSON(YOUR_JSON, simplifyDataFrame = TRUE)
Then, in the data frame (df) the null data have replaced with "NA" automatically.
has there been any moves here? this issue, and the solution of looping though all elements of a list and replacing NULL
s with NA
s is too time consuming in each request.
To be clear, what I'm receiving looks like this
{
"debtor_attributes":{
"external_id":null,
"phone_number":"034987345",
"email":"[email protected]"
},
"delivery_address":{
"house":"15",
"street":"My street",
"city":"Berlin",
"postal_code":"11111"
},
"order_attributes":{
"amount":42,
"created_at":"2021-04-28 12:43:06",
},
"other_attributes":{
"A":99,
"B":null,
"C":null,
"D":null
}
}
which I need to handle as either four dataframes or four lists. The important bit is, those null
s must all be cast to na
.
I'm having this issue with a Plumber API. Is ther an alternative?
I would also like to see a parameter added to fromJSON that allows setting nulls's to NA with simplifyVector = FALSE
.
Most of my data is nested lists of lists of lists and I don't want the whole structure as a data.frame. Occasionally I will want to transform some of the nested lists into data.frames however due to a probable bug in dplyr (https://github.com/tidyverse/dplyr/issues/5882, https://github.com/tidyverse/dplyr/issues/5749) the NULL's are not handled well at all.
For those of you stuck with this issue, a quick and dirty fix function (relying on jsonlite internals):
null_to_na_recurse <- function(obj) {
if (is.list(obj)) {
obj <- jsonlite:::null_to_na(obj)
obj <- lapply(obj, null_to_na_recurse)
}
return(obj)
}
jsonlite::null_to_na isn't exported from jsonlite....
@chrisknoll you need :::
(3 colons). So yes it is not a public function exposed by the package, but thanks to how R works you can still use it, albeit under risk the devs will change the function from under you and then you have a problem. It seems to be the only solution for this though.
Sorry, bad eyes! I didn't see the three :
Question to @jeroen , I think there's some disconnect between what you are explaining what works vs. what people are saying the problem they are having. Your above examples say that it works by converting javascript null
to R NA
when the item is found in an array. But, that's not the case that people are having issues with. For example, if you use a webservice to get contact information for a person, you may have a object with 3 fields, an id, name, and contact_number:
{"id":1, "name":"customer name", contact_number: null}
so calling jsonlite::fromJSON with this:
> jsonlite::fromJSON('{"id":1, "name":"customer name", "contact_number": null}')
$id
[1] 1
$name
[1] "customer name"
$contact_number
NULL
But instead if we supported an array of contact numbers:
> jsonlite::fromJSON('{"id":1, "name":"customer name", "contact_numbers": [1,null,3]}')
$id
[1] 1
$name
[1] "customer name"
$contact_numbers
[1] 1 NA 3
And if we put the initial request into an array:
> jsonlite::fromJSON('[{"id":1, "name":"customer name", "contact_number": null}]')
id name contact_number
1 1 customer name NA
This is actually returned in the form of a dataframe. I think all of this is captured in the jsonlite documentation as the expected behavior, and I don't think that's what people take issue with.
The simple issue is that a null javascript value becomes NA in some contexts (in the contexts that R doesn't like NULLs in such as a dataframe column value or a value in a vector) but in other places it's set as a NULL. Can you explain why you couldn't just associate javascript nulls to R's NA?
As you explained in another thread, R's NULL is actually a zero pairwise list, while NA is a logical(0). It seems that R's NA has a much closer relationship to javascript's null, so is there a technical reason why you can just assign R's NA to javascript's null in all cases?
There is a fourth case:
> jsonlite::fromJSON('[{"id":1, "name":"customer name", "contact_number": null}]', simplifyVector = FALSE)
[[1]]
[[1]]$id
[1] 1
[[1]]$name
[1] "customer name"
[[1]]$contact_number
NULL
...which is particularly necessary for many of my use cases and for complex nested elements in general where the top N levels should not be a dataframe but at some point there is an object which SHOULD be convertable into a dataframe, however you can't do so because of the NULL's.
On a side note:
NA is a logical(0)
This is incorrect, NA is actually a vector with a type of "logical" and also has a length of 1 (you also have NA_character_ for length 1 character vector for example). logical(0) is a vector of type logical with a length of 0 (zero). NULL (which incidentally is typeless and a special case in R) also has a length of 0 (zero). This means that NULL actually IS a better fit for json null since we don't know the value's type but do know that it exists and has no value. That said, there are cases when you don't care if something exists but isn't set, you just need a placeholder "no value" - hence where NA comes in.
I stand corrected about the technical definition of NA, but still feel NA's meaning (as in a missing value) is closer to null in other languages than NULL (which seems to have double-meaning in R since NULL promotes to an empty list in some cases). I think there's a lot of problems with R's interpretation of what NULL is (ie: length(NULL) should probably return error or NULL and not '0' indcating it's something that exists that you can get a length of...but i digress).
At the core if the problem here is that there's a technical limitation in R where you can't put R NULL in a vector:
> c(1,null,2)
[1] 1 2
So nulls are lost in while NA's are not. So if we need to retain the 'fact' that something is 'null' in a list, you have no other option than to use NA. NAs also happen to work outside of a vector, so using NAs everywhere would be universal, although it sounds like this would break your usecases @avsdev-cw .
@avsdev-cw just for the sake of context: can you explain in your use case where you're depending on some behavior of NULL that precludes the use of NA? Put another way, if jsonlite was using NA universally for nulls, would you no longer be able to use jsonlite?
@avsdev-cw just for the sake of context: can you explain in your use case where you're depending on some behavior of NULL that precludes the use of NA? Put another way, if jsonlite was using NA universally for nulls, would you no longer be able to use jsonlite?
The key point of my request for a flag rather than doing it by default is to preserve default behavior. But sure, an example might be a question's answer: Q: Which of the following items do you own? .... null = not answered, NA = none x,y,z = chosen options
Arguably you could use an empty vector. But this use case is handled fine by NOT forcing the type into a data.frame and retaining it as a list. Like I said, in list form, you could just check for existence of a name within the list to distinguish not set from empty value.
For reference, an example complex data return (ignore that it's a rubbish schema...but you do get a lot of them from json api's) which used to be broken but now works (dplyr::bind_rows has been fixed to handle NULL):
> utils::str(jsonlite::fromJSON('[{"respondee": {"name": "fred"}, "surveys": [{"name": "Survey 1...", "details": [{"q": "q1", "a": "a1"}, {"q": "q2", "a": "a2", "other": null}, {"q": "q3", "a": "a3", "other": "something here"}]}, {"name": "Survey 2..."}]}]'))
'data.frame': 1 obs. of 2 variables:
$ respondee:'data.frame': 1 obs. of 1 variable:
..$ name: chr "fred"
$ surveys :List of 1
..$ :'data.frame': 2 obs. of 2 variables:
.. ..$ name : chr "Survey 1..." "Survey 2..."
.. ..$ details:List of 2
.. .. ..$ :'data.frame': 3 obs. of 3 variables:
.. .. .. ..$ q : chr "q1" "q2" "q3"
.. .. .. ..$ a : chr "a1" "a2" "a3"
.. .. .. ..$ other: chr NA NA "something here"
.. .. ..$ : NULL
> utils::str(jsonlite::fromJSON('[{"respondee": {"name": "fred"}, "surveys": [{"name": "Survey 1...", "details": [{"q": "q1", "a": "a1"}, {"q": "q2", "a": "a2", "other": null}, {"q": "q3", "a": "a3", "other": "something here"}]}, {"name": "Survey 2..."}]}]', simplifyVector = FALSE))
List of 1
$ :List of 2
..$ respondee:List of 1
.. ..$ name: chr "fred"
..$ surveys :List of 2
.. ..$ :List of 2
.. .. ..$ name : chr "Survey 1..."
.. .. ..$ details:List of 3
.. .. .. ..$ :List of 2
.. .. .. .. ..$ q: chr "q1"
.. .. .. .. ..$ a: chr "a1"
.. .. .. ..$ :List of 3
.. .. .. .. ..$ q : chr "q2"
.. .. .. .. ..$ a : chr "a2"
.. .. .. .. ..$ other: NULL
.. .. .. ..$ :List of 3
.. .. .. .. ..$ q : chr "q3"
.. .. .. .. ..$ a : chr "a3"
.. .. .. .. ..$ other: chr "something here"
.. ..$ :List of 1
.. .. ..$ name: chr "Survey 2..."
> data.frame(res[[1]]$surveys[[1]]$details)
Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, :
arguments imply differing number of rows: 1, 0
# This used to be broken:
> dplyr::bind_rows(res[[1]]$surveys[[1]]$details)
# A tibble: 3 x 3
q a other
<chr> <chr> <chr>
1 q1 a1 NA
2 q2 a2 NA
3 q3 a3 something here
Sight digression: take a look at this for a brain bender:
> jsonlite::toJSON(list(choices = vector(), something = NULL))
{"choices":[],"something":{}}
>jsonlite::fromJSON(jsonlite::toJSON(list(choices = vector(), something = NULL)))
$choices
list()
$something
named list()
Is the new value of "something" right or wrong? Should it be NA or should it be NULL?
Is the new value of "something" right or wrong? Should it be NA or should it be NULL?
So, you're showing what I did in my first example, but with the added step of demonstrating that fromJSON
and toJSON
is not symmetrical (and this has been discussed in other issues, and @jeroen asserts that this is by design).
But, your question brings me back to the argument about which is the closer approximation to a javascript null: NA or NULL: In R, NULL is defined as a pairwise list, and so the reason why you're getting the result you're getting is that the R NULL is being interpreted as a empty pairwise list, aka {}
in the toJSON, which then deserializes into an an empty list() during fromJSON(). To take your example further:
> jsonlite::toJSON(list(choices = vector(), something = NA))
{"choices":[],"something":[null]}
The array containing null isn't what we want (because something isn't a collection but rather a single value, but again, this is by design in jsonlite). You need to unbox the value:
> jsonlite::toJSON(list(choices = vector(), something = jsonlite::unbox(NA)))
{"choices":[],"something":null}
That's what you'd get, and is why my position is that we should avoid R NULLs in processing JSON and treat javascript nulls as NA.
Re: your survey example: 'none' should be a valid choice so I'm not sure why you'd use a 'missing value indicator' aka:NA to represent none. It's valid to represent the selection as an empty array, but I'd say that if 'none' is a choice, then 'none' should be in the collection. You'd just have to do additional data sanitization to ensure that if they do say 'none' then they can't put other choices in that collection. If that complicates the logic behind asking the question of your data 'who said 'none'?, it's easier to look for people who's choices is a zero length list vs. searching within each list element to find 'none'. But in any case, a NA value of 'choices' would mean the question was not answered.
But I don't want to derail the topic at hand (NULL vs. NA) into a discussion object modeling philosophy, Do you at least agree now that NA is a better fit to represent the absence of value than NULL?