unconf17 icon indicating copy to clipboard operation
unconf17 copied to clipboard

package (or expanded functionality to tidyjson?) for creating nested data frame from JSON data

Open sctyner opened this issue 7 years ago • 11 comments

I've recently become a fan of the purrr package, so much so that when I was working with a bunch of JSON data at a hackathon this past weekend, I kept wishing that I could use purrr and tidyr functionality to just magically turn all the JSON data into a tidy nested data frame.

I used the function jsonlite::fromJSON() to convert the JSON data to a data frame, but it did not have the nice tidyverse structure I've come to know and love.

Is there a way to get nice nested tidy data frames from JSON? I've looked into tidyjson a bit but it's not exactly what I would imagine if I had to pick the best way to get JSON data in to tidy, nested data frames. (I'm not sure if it does nesting TBH; I haven't looked into it that much.) This seems like something that would be extremely helpful for Shiny App developers who depend on API data, or anyone using JSON data in R with the tidyverse packages.

sctyner avatar May 15 '17 17:05 sctyner

@hadley and I talk about this regularly. It is a real need! I have some ideas about verbs for such a package. In some sense jsonlite::flatten() does this, but then you end up needing more control. However, you don't want to do every last thing by hand, either. I definitely know what you mean!

To be more of a Debbie Downer, I'm not sure how amenable this project is to the Unconf format. I think it requires development of a very principled worldview and then a very disciplined implementation. It feels really hard to do in this setting.

jennybc avatar May 17 '17 02:05 jennybc

@sctyner I think this is a great idea. While I agree that actually implementing this in a robust package might be better saved for later, I'd love to see a unconf discussion group synthesize some of the knowledge on how best to go about this (and how not to!) Almost every time I run into this issue I think "if only I could mind-meld with Jenny right now..."

One thing I've been playing with recently in this area: json-ld provides a method called "Flatten", which is often coupled to a second method called "Frame" which basically lets you write a little JSON template which specifies what data you want and how you want it structured. Note that JSON-LD methods can be applied to any JSON file, and yes @jeroen has an awesome little jsonld package providing convenient access to these methods. Probably not the right hammer for every task, but one might imagine a package implementing a similar notion of flatten & frame if not directly calling these existing methods.

cboettig avatar May 17 '17 03:05 cboettig

What sort of data json data structures do you encounter for which a logical mapping into a data frame would exist, but jsonlite::toJSON(..., flatten = TRUE) isn't well suited?

jeroen avatar May 17 '17 07:05 jeroen

@jeroen I will try to note down specific examples as I re-encounter. But in the abstract, here are some of the reasons I set flatten = FALSE in fromJSON():

  • I want to retain the original list intact as a list column in a tibble I work on.
  • I want to pull info out of it selectively and possibly not at once.
  • The items I want to use to create simple atomic vectors in the tibble aren't necessarily at the same level of hierarchy in the original list.
  • I want list-columns.

Basically it feels like there are a lot of actions that must be part of flatten(). And I want access to them myself.

I also have a more general discomfort that the structure of the thing I get back depends on the data. I'm happy to use flatten = TRUE when first getting to know an API but then I gradually creep towards doing the list processing myself.

jennybc avatar May 17 '17 15:05 jennybc

@jennybc That's exactly it. I wish I had said it as elegantly! 😄

sctyner avatar May 17 '17 16:05 sctyner

Here's a toy example of some JSON that doesn't flatten to a data.frame. Very nested structures are usually the source of issues for me, even with purrr, because often I want to pull data found at various different levels of nesting into a single row for the data.frame I care about.

Below I also give an example of how one might address this using the idea of a frame.

library("jsonlite")
library("jsonld")
library("magrittr")

json <-'{
      "@id": "http://example.org/library",
      "@type": "ex:Library",
      "ex:contains": {
        "@id": "http://example.org/library/the-republic",
        "@type": "ex:Book",
        "ex:contains": {
          "@id": "http://example.org/library/the-republic#introduction",
          "@type": "ex:Chapter",
          "dc:description": "An introductory chapter on The Republic.",
          "dc:title": "The Introduction"
        },
        "dc:creator": "Plato",
        "dc:title": "The Republic"
      }
    }
  '
df <-fromJSON(json, flatten = TRUE)

## not a df
class(df)

Note that df is still a (rather cumbersome!) list. Like @jennybc says, this is particularly annoying because the type/structure is unpredictable (depends on how much a nesting a given element might have), so hard to program around, so we usually wind not flattening the data (but having to iterate over some often ugly nesting).

framing solution

Let's imagine I just want to pull out book titles from the middle of that nested structure. Here's a frame for that:

frame <-
'{
  "@explicit": "true",
  "@type": "ex:Book",
  "dc:title": {}
}'
jsonld_frame(json, frame) %>% fromJSON()

gives us:

                                      @id   @type     dc:title
1 http://example.org/library/the-republic ex:Book The Republic

How about a data frame with the title and creator for all objects, regardless of nesting depth:

frame <-
'{
  "@explicit": "true",
  "@id": {},
  "dc:title": {"@default": "NA"},
  "dc:creator": {"@default": "NA"}
}'

jsonld_frame(json, frame) %>% fromJSON()
                                                 @id      @type dc:creator         dc:title
1                           http://example.org/library ex:Library         NA               NA
2              http://example.org/library/the-republic    ex:Book      Plato     The Republic
3 http://example.org/library/the-republic#introduction ex:Chapter         NA The Introduction

As an aside, I think this is potentially relevant to the schema issue @stephlocke mentions in https://github.com/ropensci/unconf17/issues/85#issuecomment-302045099 . Framing was designed explicitly for the problem Steph mentions, where you want your tool to get a consistent data object back even if the data provider is doing things like adding more data.

cboettig avatar May 17 '17 19:05 cboettig

Thanks @cboettig, I'll definitely see if I can get this schema framing going in the context of some of my API packages.

stephlocke avatar May 17 '17 22:05 stephlocke

@stephlocke cool, I'd be curious to hear how it goes. In case it's helpful, here's official docs on writing a frame. the json-ld playground is also useful for testing & sharing frames.

Maybe it would just be re-inventing the wheel, but I could also see a tidyjson package merely borrowing the framing notion as as a way of saying "give me these elements, where-ever you find them, and give them back to me in this structure". (Just realizing in my examples above, they all request an unnested structure, and then piping that through fromJSON is resulting in simple data.frames, but as you'll see you could say "I want a list nested just like so" instead.)

cboettig avatar May 17 '17 22:05 cboettig

We are over by the stage at a small round table. There's three of us: Katie, Kelly, and myself!

sctyner avatar May 25 '17 17:05 sctyner

I am planning to try to revive the SurveyMonkey -> R API package (RMonkey, which broke when SurveyMonkey changed their API and is not currently maintained) in the near future and would love to use whatever you come up with to parse the (very nested) JSON if at all applicable.

sfirke avatar May 26 '17 23:05 sfirke

Repo: https://github.com/ropenscilabs/tidyerjson

stefaniebutland avatar Oct 24 '17 18:10 stefaniebutland