frictionless-r
API Brainstorm Thread
I'm starting this thread to brainstorm some of the ideas I mention in #198 and #251. It leans into the idea of data packages and table resources being their own class, not just lightweight descriptors. In this approach:
- A data package object would be a list of resource objects. Properties would be stored in its attributes and accessed with `get_prop()` and `set_props()`. These functions would ensure the object was always valid.
- A table resource object would be a list of field objects. Properties would be stored in its attributes, as with data package objects.
- When a table resource object is read with `read_resource()`, the result would be a tibble AND a table resource object, so you could manage a data frame and its frictionless metadata simultaneously.
Although it does introduce a lot of implementation complexity in some areas, I think it potentially simplifies user experience and reduces complexity in other areas:
- users no longer have to keep their loaded data frames synchronized with their descriptor metadata, because a loaded resource tibble IS a table resource object in all of its metadata glory
- we can easily carry context around in an object (e.g. the working directory of a descriptor; #251), without it polluting the rest of the descriptor attributes
- validation is streamlined because properties are always modified through functions that ensure the object stays valid
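To make the attribute-based idea concrete, here is a minimal base R sketch of how properties and context could live in attributes while all changes go through a validating setter. Everything in it is an assumption for illustration: `new_package_sketch()`, `validate_props()`, the `"props"` and `"directory"` attribute names, and the single validation rule are hypothetical, and the `get_prop()`, `set_props()` and `get_descriptor()` defined here are stand-ins for the proposed functions, not existing frictionless functions.

```r
# Hypothetical sketch only: not the frictionless implementation.

# The object itself is a list of resources; properties and context are
# stored in separate attributes so they never mix.
new_package_sketch <- function(resources = list(), props = list(),
                               directory = ".") {
  obj <- structure(resources, class = "datapackage_sketch")
  attr(obj, "props") <- props
  attr(obj, "directory") <- directory  # context, kept out of the descriptor
  obj
}

# A single illustrative rule; a real validator would check much more.
validate_props <- function(props) {
  id <- props$id
  if (!is.null(id) && !(is.character(id) && length(id) == 1)) {
    stop("`id` must be a single string.", call. = FALSE)
  }
  invisible(props)
}

get_prop <- function(x, name) {
  attr(x, "props", exact = TRUE)[[name]]
}

# Validation happens before the object is touched, so a failed update
# leaves the original object unchanged.
set_props <- function(x, ...) {
  props <- utils::modifyList(attr(x, "props", exact = TRUE), list(...))
  validate_props(props)
  attr(x, "props") <- props
  x
}

# The raw descriptor is just the stored properties, without the context.
get_descriptor <- function(x) {
  attr(x, "props", exact = TRUE)
}

pkg <- new_package_sketch(
  props = list(name = "example_package", id = "old-id")
)
pkg <- set_props(pkg, id = "new-id")
get_prop(pkg, "id")

# An invalid value is rejected; here the error message is captured.
result <- tryCatch(
  set_props(pkg, id = 42),
  error = function(e) conditionMessage(e)
)
```

Because `utils::modifyList()` drops elements that are set to `NULL`, this sketch also gives `set_props(x, some_prop = NULL)` the meaning "remove that property".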
It's also a pretty big departure from the current architecture, so I totally understand if you don't want to go in this direction... I'm mostly sharing this just to get more ideas / possibilities flowing.
pkg <- example_package()
pkg
#> A Data Package with 3 resources:
#> • deployments
#> • observations
#> • media
#> Use `get_descriptor()` to print the Data Package as a list.
# Instead of using `unclass()`, we use `get_descriptor()` to convert the
# data package object into a raw descriptor object (list)
get_descriptor(pkg)
#> $name
#> [1] "example_package"
#>
#> $id
#> [1] "115f49c1-8603-463e-a908-68de98327266"
#>
#> $created
#> [1] "2021-03-02T17:22:33Z"
#>
#> $image
#> ...
# Instead of setting properties directly on the data package object, we get
# and set properties using `get_prop()` and `set_props()`. This allows us to
# validate the properties before setting them, so the data package object
# is always guaranteed to be valid.
get_prop(pkg, "id")
#> [1] "115f49c1-8603-463e-a908-68de98327266"
pkg <- set_props(pkg, id = "new-id")
get_prop(pkg, "id")
#> [1] "new-id"
# Because all properties are stored as attributes in the data package object,
# we can have the object's items refer directly to the child resources
# of the data package:
pkg$deployments
#> A Table Resource with 5 fields:
#> • deployment_id (string)
#> • longitude (number)
#> • latitude (number)
#> • start (date)
#> • comments (string)
#> Use `get_descriptor()` to print the Table Resource as a list.
#> Use `read_resource()` to load the data of this Table Resource.
# As with a data package object, we can use `get_descriptor()` to convert
# the resource object into a raw descriptor object (list)
get_descriptor(pkg$deployments)
#> $name
#> [1] "deployments"
#>
#> $path
#> [1] "<...>"
#>
#> $profile
#> [1] "tabular-data-resource"
#>
#> $title
#> [1] "Camera trap deployments"
#> ...
# As with data package objects, we use get_prop() and set_props() to work with
# properties:
get_prop(pkg$deployments, "title")
#> [1] "Camera trap deployments"
pkg$deployments <- set_props(pkg$deployments, title = "Camera trap deployments (modified)")
get_prop(pkg$deployments, "title")
#> [1] "Camera trap deployments (modified)"
# We let the child items of table resource objects refer to field objects:
pkg$deployments$deployment_id
#> A Field:
#> • name: deployment_id
#> • type: string
#> • constraints: {required: TRUE, unique: TRUE}
#> Use `get_descriptor()` to print the Field as a list.
# And as usual, we can convert to a raw descriptor via `get_descriptor()`:
get_descriptor(pkg$deployments$deployment_id)
#> $name
#> [1] "deployment_id"
#>
#> $type
#> [1] "string"
#>
#> $constraints
#> $constraints$required
#> [1] TRUE
#>
#> $constraints$unique
#> [1] TRUE
# (Also, `get_prop()` and `set_props()` would work with field objects)
# Where this approach gets really interesting is when we start loading the data
# from resources:
rsc <- read_resource(pkg$deployments)
rsc
#> # A Table Resource tibble: 3 × 5
#> deployment_id longitude latitude start comments
#> <chr> <dbl> <dbl> <date> <chr>
#> 1 1 4.62 50.8 2020-09-25 NA
#> 2 2 4.64 50.8 2020-10-01 "On \"forêt\" road."
#> 3 3 4.65 50.8 2020-10-05 "Malfunction/no photos, data"
# Notice the header in the printout -- this is not your average tibble!
# What we get here is a subclassed tibble: it is still a tibble, but it ALSO
# keeps track of the resource metadata. This means `get_prop()` and
# `set_props()` can still be used!
get_prop(rsc, "title")
#> [1] "Camera trap deployments (modified)"
rsc <- set_props(rsc, title = "Camera trap deployments (modified again)")
get_prop(rsc, "title")
#> [1] "Camera trap deployments (modified again)"
# We can also still use `get_descriptor()` with the tibble!
get_descriptor(rsc)
#> $name
#> [1] "deployments"
#>
#> $path
#> [1] "<...>"
#>
#> $profile
#> [1] "tabular-data-resource"
#>
#> $title
#> [1] "Camera trap deployments (modified again)"
#> ...
# Properties of fields could be set in tidy pipelines, and new fields
# could be created by adding columns:
rsc <- rsc |>
  mutate(
    deployment_id = set_props(deployment_id, title = "New deployment ID title"),
  ) |>
  mutate(
    new_field = start + 1,
    new_field = set_props(new_field, title = "The day after the start day"),
  )
# What's cool about this is that we can now use `get_descriptor()` to get the
# descriptor of the resource tibble, and it will include the new field in the
# resulting schema.
# And we can update our package with the new resource at any time:
pkg$deployments <- rsc
# We could also update the resource's path to control how the resource
# will be saved when we write the package to disk:
pkg$deployments <- set_props(pkg$deployments, path = "deployments_new.csv")
# Or set the path to NULL to have the resource embed the tibble data in the
# "data" prop when it's converted to a descriptor:
pkg$deployments <- set_props(pkg$deployments, path = NULL)
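
The "data frame that is also a resource" idea from the walkthrough above can be sketched in base R. This is only a rough illustration under stated assumptions: it uses a plain `data.frame` instead of a tibble to stay dependency-free, and `new_resource_sketch()`, `set_field_props()`, `field_type()`, the attribute names, and the type mapping are all hypothetical. The `get_descriptor()` defined here is a stand-in for the proposed function. The point it tries to show is that the schema can be derived from the columns at the moment the descriptor is requested, so a newly added column shows up without extra bookkeeping.

```r
# Hypothetical sketch only: not the frictionless implementation.

# Attach resource properties to a data frame and add a subclass on top.
new_resource_sketch <- function(df, props = list()) {
  stopifnot(is.data.frame(df))
  attr(df, "props") <- props
  class(df) <- union("resource_sketch", class(df))
  df
}

# Field-level properties live as an attribute on the column itself.
set_field_props <- function(col, ...) {
  current <- as.list(attr(col, "field_props", exact = TRUE))
  attr(col, "field_props") <- utils::modifyList(current, list(...))
  col
}

# Deliberately simplified mapping from R types to field types.
field_type <- function(col) {
  if (inherits(col, "Date")) {
    "date"
  } else if (is.numeric(col)) {
    "number"
  } else if (is.logical(col)) {
    "boolean"
  } else {
    "string"
  }
}

# Build the descriptor on demand: resource properties plus a schema
# derived from whatever columns currently exist.
get_descriptor <- function(x) {
  props <- attr(x, "props", exact = TRUE)
  fields <- lapply(names(x), function(nm) {
    col <- x[[nm]]
    c(
      list(name = nm, type = field_type(col)),
      as.list(attr(col, "field_props", exact = TRUE))
    )
  })
  props$schema <- list(fields = fields)
  props
}

rsc <- new_resource_sketch(
  data.frame(
    deployment_id = c("1", "2"),
    start = as.Date(c("2020-09-25", "2020-10-01")),
    stringsAsFactors = FALSE
  ),
  props = list(name = "deployments")
)

# Adding a column with field-level properties, without touching the schema.
rsc$new_field <- set_field_props(
  rsc$start + 1,
  title = "The day after the start day"
)

desc <- get_descriptor(rsc)
field_names <- vapply(desc$schema$fields, function(f) f$name, character(1))
```

A real implementation would also have to make sure these attributes survive the verbs used in a tidy pipeline, which this sketch does not attempt.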