dataframe
Add API to unfold object-columns into properties via reflection
As with `unfold()` in krangl (see https://holgerbrandl.github.io/krangl/data_model/#to-type-or-not-to-type), it would be great if kotlin-df could provide a similar API. Clearly, this could be done manually using `df.add()`, but with many attributes this is very tedious. And in krangl we've seen that this can be done very efficiently via reflection.
Example (from krangl, but still missing in kdf):
data class City(val name:String, val code:Int)
data class Person(val name:String, val address:City)
val persons : List<Person> = listOf(
    Person("Max", City("Dresden", 12309)),
    Person("Anna", City("Berlin", 10115))
)
val personsDF: DataFrame = persons.asDataFrame() // <- Also seems to be a missing API in kdf?
// unfold City attributes into different columns
personsDF.unfold<City>("address") // <- Missing API in kdf
For impl see https://github.com/holgerbrandl/krangl/blob/49418b2ca6ee6ae165c034e56e4da77e4707f7ad/src/main/kotlin/krangl/Builder.kt#L115
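To illustrate what a reflection-based unfold does, here is a minimal sketch in plain Kotlin. It uses neither krangl nor kotlin-df: rows are modelled as simple maps, and the `Row` alias and the helpers `propertiesOf` and `unfoldColumn` are hypothetical names invented for this illustration. New columns are prefixed with the source column name, an assumption made here to avoid the clash between `Person.name` and `City.name`.

```kotlin
import java.lang.reflect.Modifier

// Toy model (assumption): a table is a list of rows, a row maps column name -> value.
typealias Row = Map<String, Any?>

data class City(val name: String, val code: Int)
data class Person(val name: String, val address: City)

// Hypothetical helper: read all instance fields of an object via Java reflection.
fun propertiesOf(obj: Any): Map<String, Any?> =
    obj.javaClass.declaredFields
        .filter { !it.isSynthetic && !Modifier.isStatic(it.modifiers) }
        .associate { field ->
            field.setAccessible(true) // data class backing fields are private
            field.name to field.get(obj)
        }

// Hypothetical helper: replace an object column by one prefixed column per property.
fun List<Row>.unfoldColumn(column: String): List<Row> = map { row ->
    val value = row[column] ?: return@map row // rows without a value stay untouched
    (row - column) + propertiesOf(value).mapKeys { (key, _) -> "${column}_$key" }
}

val personRows: List<Row> = listOf(
    Person("Max", City("Dresden", 12309)),
    Person("Anna", City("Berlin", 10115)),
).map { mapOf("name" to it.name, "address" to it.address) }

// Each row now has the columns name, address_name and address_code.
val unfoldedRows = personRows.unfoldColumn("address")
```

The point of the sketch is only that no per-attribute code is needed: the set of new columns is discovered from the class itself.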
`persons.asDataFrame()` -- I believe `persons.toDataFrame()` does the trick: https://kotlin.github.io/dataframe/createdataframe.html#todataframe
@koperagen Thanks for the pointer.
I still think that the key aspect of this ticket remains valid: once we have objects of type E in a column of a data-frame, we cannot conveniently expand E's properties into columns (without writing it out attribute by attribute using `add()`). That's what `unfold` was/is doing in krangl, and I used it quite frequently.
Indeed. We'll introduce this API. I also made an example of how it can be done now.
class RepositoryInfo(val data: Any)
fun download(url: String) = RepositoryInfo("fancy response from the API")
val interestingRepos = dataFrameOf("name", "url")(
"dataframe", "/dataframe",
"kotlin", "/kotlin",
)
val initialData = interestingRepos
.add("response") { download("url"()) }
class WebScrappingInitialData(val name: String, val url: String, val response: RepositoryInfo)
val df = initialData.cast<WebScrappingInitialData>(verify = true)
val response by column<RepositoryInfo>()
df
    .replace(response)
    .with { it.asIterable().toDataFrame().asColumnGroup(it.name()) }
    //.ungroup(response)
    .print()
df
    .replace("response")
    .with {
        (it as DataColumn<RepositoryInfo>)
            .asIterable()
            .toDataFrame {
                // todo a better example for this dsl
                properties(maxDepth = 1) {
                }
            }
            .asColumnGroup(it.name())
    }
    .rename { columnGroup("response").children() }.into { "response_" + it.name() }
    .ungroup("response")
    .schema().print()
Actually, there is an undocumented API for that. I haven't found a good name for it, so it's currently called `read`:
data class City(val name:String, val code:Int)
data class Person(val name:String, val address:City)
val persons : List<Person> = listOf(
    Person("Max", City("Dresden", 12309)),
    Person("Anna", City("Berlin", 10115))
)
val personsDF = persons.toDataFrame()
personsDF.read(Person::address)
Thank you, @holgerbrandl, `unfold` seems to be a good name for this operation. Did you take it from some other dataframe library?
And also, the `List` -> `DataFrame` conversion is called `toDataFrame` instead of `asDataFrame`, because in the Kotlin stdlib `as` means wrapping and `to` means copying: `asSequence`, but `toList`.
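The `as` vs `to` convention can be seen with stdlib functions alone. The following is a small sketch; the function name `asVersusTo` is made up for this illustration, and `asReversed` is used instead of `asSequence` only because a live list view makes the difference easy to observe.

```kotlin
// Returns (contents of the view, contents of the snapshot) after the source list was mutated.
fun asVersusTo(): Pair<List<Int>, List<Int>> {
    val numbers = mutableListOf(1, 2, 3)

    // "as": a live view that wraps the original list, no elements are copied.
    val reversedView = numbers.asReversed()

    // "to": an independent copy taken at the time of the call.
    val snapshot = numbers.toList()

    numbers.add(4)

    // The view reflects the later mutation, the snapshot does not.
    return reversedView.toList() to snapshot
}
```

Calling `println(asVersusTo())` prints `([4, 3, 2, 1], [1, 2, 3])`: the view sees the added element, the copy does not.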
The name was inspired by the terminology section in https://tidyr.tidyverse.org/. Although in tidyr they ended up using `unnest_wider` (see https://tidyr.tidyverse.org/reference/hoist.html). `unnest` is also a nice candidate imho.
Regarding to vs as, indeed naming was/is wrong in krangl. I'm a slow learner so it took me forever to memorize the semantics here. :-)
Added as `unfold()` :)
Just tried v0.9.1, and the unfold works nicely. Thanks @Jolanrensen for picking up the idea.
The design is great from a programmer's perspective, but imho a bit too complex from a data-science one (which is less technical).
I've figured out that I have to `flatten()` the result to make the columns visible, which imho could and should be one step. How could I prefix the attributes with the unfolded attribute name (plus `_`) to keep an overview in complex tables with many columns? How can I keep the original column? I could cherry-pick attributes of interest with a `select` on the final result, but it would imho be more efficient to cherry-pick variables when unfolding, to minimize the memory footprint of the dataframe.
So in short, I think the parameters from the krangl-unfold are missing:
- attributes: List<String>? = null,
- keep: Boolean = true,
- addPrefix: Boolean = false,
The unfold implementation in kdf supports unfolding multiple attributes at once, but I have not yet come across a use-case for that in my daily work. Rather, I typically want more control over how a specific column is unfolded.
Disclaimer: It could well be that I had missed some details here, because when playing with the new unfold API the published docs were still about v0.8.
fyi, I've added the convenience wrapper described above in https://github.com/holgerbrandl/kdfutils/blob/5f2226eb8e112b8a02f5bd6f96775fbcbc286fb1/src/main/kotlin/com/github/holgerbrandl/kdfutils/Unfold.kt#L38
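For readers who want to see what the three parameters mean in practice, here is a plain-Kotlin sketch on a toy table model. It is not the krangl or kdfutils implementation: `TableRow` and `unfoldWith` are hypothetical names, rows are simple maps, and, unlike krangl's `attributes: List<String>?`, the attributes are passed as property references and must be listed explicitly.

```kotlin
import kotlin.reflect.KProperty1

// Toy model (assumption): a row maps column name -> value.
typealias TableRow = Map<String, Any?>

data class City(val name: String, val code: Int)

// Hypothetical wrapper mirroring the three krangl-style parameters.
inline fun <reified C : Any> List<TableRow>.unfoldWith(
    column: String,
    attributes: List<KProperty1<C, *>>, // properties to cherry-pick while unfolding
    keep: Boolean = true,               // keep the original object column
    addPrefix: Boolean = false,         // name new columns "<column>_<property>"
): List<TableRow> = map { row ->
    val value = row[column] as? C ?: return@map row // skip rows without a C value
    val unfolded = attributes.associate { prop ->
        val name = if (addPrefix) "${column}_${prop.name}" else prop.name
        name to prop.get(value)
    }
    (if (keep) row else row - column) + unfolded
}

val cityRows: List<TableRow> = listOf(
    mapOf("name" to "Max", "city" to City("Dresden", 12309)),
    mapOf("name" to "Anna", "city" to City("Berlin", 10115)),
)

// Only the code is unfolded, the object column is dropped, the new column is city_code.
val withCode = cityRows.unfoldWith<City>("city", listOf(City::code), keep = false, addPrefix = true)
```

Note that with `addPrefix = false`, unfolding `City::name` in this sketch would silently overwrite the person's `name` column, which is one reason a prefix option is useful.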
@holgerbrandl I use the convenience extension
inline fun <T, reified C> DataFrame<T>.unfold(column: ColumnReference<C>, noinline body: CreateDataFrameDsl<C>.() -> Unit) =
    replace(column).with { it.values().toDataFrame(body).asColumnGroup(it.name()) }
so that I can use the full power of `CreateDataFrameDsl` to control how to unfold it.
It is basically a copy-paste of the `unfold` implementation in kdf; the only difference is that the DSL is exposed as a parameter instead of using the hardcoded default `{ properties() }`.
But this extension inherits the hierarchical nature of kdf, so unfolded columns are conveniently grouped, which is not what you want, so you might still need to `flatten()` the result.
Btw, what do you mean by "make the columns visible"? How are they not visible before?
@pacher Thanks for the pointer. Not sure how to create a ColumnReference more elegantly. `getColumn` does so, but requires referencing the df twice (`df.unfold(df.getColumn("foo"))`), which is arguably not pretty. But that would be a different question/thread imho. :-)
Regarding "How are they not visible before?"
data class City(val name: String, val code: Int)
data class Person(val name: String, val city: City)
val persons: List<Person> = listOf(
    Person("Max", City("Dresden", 12309)),
    Person("Anna", City("Berlin", 10115))
)
val personsDF = persons.toDataFrame()
personsDF.getColumn()
val unfoldPersons = personsDF.unfold("city")
unfoldPersons.print()
unfoldPersons["code"]
prints
name city
0 Max { name:Dresden, code:12309 }
1 Anna { name:Berlin, code:10115 }
and fails with
Exception in thread "main" java.lang.IllegalArgumentException: Column not found: 'code'
So `unfold` did not fulfill its contract, which imho is making the attribute `code` from the object-column `city` visible within my dataframe. I understand that this is by design and the user has to `flatten` first.
In more technical terms, I think `unfold` is intentionally more of a `map`-style operation, because it transforms a column without actually unfolding it in space (that is, into columns).
My concern here is already tracked in https://github.com/Kotlin/dataframe/issues/232
My intent above was just to point others, in case they struggle in the same way I do, to an alternative drop-in replacement for `unfold` (which is imho not trivial to write down without deeper insight into kdf's internals).
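The difference between a grouped result and a result "unfolded in space" can be shown on a toy model in plain Kotlin. This sketch does not use kdf: a column group is modelled as a nested map, and `NestedRow` and `flattenGroups` are hypothetical names. The rule that a child is prefixed with its parent's name only when its own name is already taken is an assumption of this sketch, not a description of how kdf's `flatten` resolves clashes.

```kotlin
// Toy model (assumption): a row is a map, and a column group is a nested map inside it.
typealias NestedRow = Map<String, Any?>

// Hypothetical helper: lift the children of every group to the top level.
// A child is prefixed with its parent's name only if its own name is already taken.
fun NestedRow.flattenGroups(): NestedRow {
    val topLevelNames = filterValues { it !is Map<*, *> }.keys
    val result = LinkedHashMap<String, Any?>()
    for ((key, value) in this) {
        if (value is Map<*, *>) {
            for ((childKey, childValue) in value) {
                val child = childKey.toString()
                val name = if (child in topLevelNames || child in result) "${key}_$child" else child
                result[name] = childValue
            }
        } else {
            result[key] = value
        }
    }
    return result
}

// A row shaped like the printed output: "city" is a group, not two top-level columns.
val afterUnfold: NestedRow = mapOf(
    "name" to "Max",
    "city" to mapOf("name" to "Dresden", "code" to 12309),
)

// afterUnfold["code"] is null, while the nested path city -> code resolves to 12309.
// After flattening, the columns are name, city_name and code.
val flattenedRow = afterUnfold.flattenGroups()
```

In this model the top-level lookup of `code` fails for the same reason as in the exception above: the value exists, but only one level down.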
> @pacher Thanks for the pointer. Not sure how to create a ColumnReference more elegantly. `getColumn` does so, but requires referencing the df twice (`df.unfold(df.getColumn("foo"))`), which is arguably not pretty. But that would be a different question/thread imho. :-)
I don't use strings, but `String.toColumnAccessor()` should do. An overload like
inline fun <T, reified C> DataFrame<T>.unfold(column: String, noinline body: CreateDataFrameDsl<C>.() -> Unit) = unfold(column.toColumnAccessor() as ColumnReference<C>, body)
> val unfoldPersons = personsDF.unfold("city")
> unfoldPersons["code"]
>
> So `unfold` did not fulfill its contract, which imho is making the attribute `code` from the object-column `city` visible within my dataframe. I understand that this is by design and the user has to `flatten` first.
Contract violation is a serious accusation ;-) I've never seen such a contract for `unfold` written anywhere and don't believe that it is broken. I guess the documentation could be improved, which is true for any project at any time ;). If you display this dataframe, the new columns are perfectly visible and nicely packed into a group to keep things tidy. Imho it is reasonable to expect to access the new columns through `city.code`, just like you would access a property of a `City` object in code.
What if I am unfolding already deeply nested column? It makes sense to have result "in place" instead of new columns suddenly popping up on the top level.
What if I am unfolding many/multiple columns at once? To make it "in space" one has to change column names and add number suffixes or something to avoid name clashing.
But as a programmer, how can I reference those new columns from code? With hierarchical dataframes I can just recursively select all those columns by "code" or, better yet, using `City::code`, with which even refactoring works. It is safe to rename `code` in the IDE or even delete it, because the compiler will tell me to fix its usages in column selectors.
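The refactoring argument can be sketched without kdf. In the plain-Kotlin toy model below, a row is a nested map and the hypothetical helper `valuesOf` selects values by a property reference instead of a string, so renaming `City.code` in the IDE updates the selector as well. The helper name and the map model are assumptions of this sketch; kdf's own column selectors work differently.

```kotlin
import kotlin.reflect.KProperty1

data class City(val name: String, val code: Int)

// Hypothetical helper: collect the values stored under the property's name at any depth.
fun Map<*, *>.valuesOf(prop: KProperty1<*, *>): List<Any?> {
    val found = mutableListOf<Any?>()
    for ((key, value) in this) {
        if (key == prop.name) {
            found.add(value)
        } else if (value is Map<*, *>) {
            found.addAll(value.valuesOf(prop)) // descend into the nested group
        }
    }
    return found
}

// Toy row (assumption): "city" is a nested group, as after a hierarchical unfold.
val nestedRow: Map<String, Any?> = mapOf(
    "name" to "Max",
    "city" to mapOf("name" to "Dresden", "code" to 12309),
)

// The selector is checked by the compiler: it breaks if City.code is removed.
val codes = nestedRow.valuesOf(City::code)
```

Because this sketch matches by name only, `nestedRow.valuesOf(City::name)` also picks up the top-level `name` column; a real selector would need the column path or type to tell them apart.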
> My concern here is already tracked in #232
Everything above would fit better in #232. I apologize for bringing it here.
> My intent above was just to point others, in case they struggle in the same way I do, to an alternative drop-in replacement for `unfold` (which is imho not trivial to write down without deeper insight into kdf's internals).
Your replacement reflects your vision that everything should be auto-flattened. That is exactly why I had to provide an alternative which fits better into the hierarchical design of kdf, because mixing the two approaches might not be the best for struggling users.
In fact, I made my extension so that I can make even more column groups!!! e.g.
personsDF.unfold {
    properties(name, city)
    "details" {
        "country" from getCountry(city)
        "population" from ...
        etc.
    }
}
The contract is imho defined by the name, which is about spreading or straightening out. If the natural-language meaning and the API do not agree, an API is not discoverable, making it very hard to learn & use. Sticking to the example from above, here's before and after with the built-in `unfold`:
# before
name city
0 Max City(name=Dresden, code=12309)
1 Anna City(name=Berlin, code=10115)
# after unfold
name city
0 Max { name:Dresden, code:12309 }
1 Anna { name:Berlin, code:10115 }
To me it's very obvious that nothing has been unfolded; the city column has been `map`ped instead.
> What if I am unfolding many/multiple columns at once? To make it "in space" one has to change column names and add number suffixes or something to avoid name clashing.
My referenced drop-in replacement takes care of that. It's baked in poorly, because I do not know the standard way that is used to resolve naming conflicts e.g. when doing a merge in kdf.
In fact, when dealing with data tables, typing is never possible because it dies the second I want to add a single column. It's nice from a programming perspective that kdf can apply schemas to data-frames, but in analytical workflows it's always AnyFrame. That's why solid string accessors and column-name disambiguation are imho key to enabling the library for data science.
> Everything above would fit better in https://github.com/Kotlin/dataframe/issues/232. I apologize for bringing it here.
Indeed, apologies from me as well as I started the argument :-)