Add API to unfold object-columns into properties via reflection

Open holgerbrandl opened this issue 2 years ago • 6 comments

As with unfold() in krangl (see https://holgerbrandl.github.io/krangl/data_model/#to-type-or-not-to-type), it would be great if kotlin-df could provide a similar API. Clearly, this could be done manually using df.add(), but with many attributes this is very tedious. And in krangl we've seen that this can be done very efficiently via reflection.

Example (from krangl, but still missing in kdf):

data class City(val name:String, val code:Int)
data class Person(val name:String, val address:City)

val persons : List<Person> = listOf(
    Person("Max", City("Dresden", 12309)),
    Person("Anna", City("Berlin", 10115))
)

val personsDF: DataFrame = persons.asDataFrame() // <- Also seems to be a missing API in kdf?

// unfold City attributes into different columns
personsDF.unfold<City>("address")  // <- Missing API in kdf

For impl see https://github.com/holgerbrandl/krangl/blob/49418b2ca6ee6ae165c034e56e4da77e4707f7ad/src/main/kotlin/krangl/Builder.kt#L115
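
To illustrate the reflection idea independently of any dataframe library, here is a minimal sketch in plain Kotlin. The helper name unfoldToColumns is hypothetical, and it uses Java reflection on the backing fields of a data class; it is not how krangl or kdf implement unfold, only an assumption-laden model of the principle.

```kotlin
import java.lang.reflect.Modifier

data class City(val name: String, val code: Int)
data class Person(val name: String, val address: City)

// Hypothetical helper (not a krangl or kdf API): reads every instance field
// of `type` via Java reflection and returns "column name -> column values".
fun <T : Any> unfoldToColumns(items: List<T>, type: Class<T>): Map<String, List<Any?>> =
    type.declaredFields
        .filter { !it.isSynthetic && !Modifier.isStatic(it.modifiers) }
        .associate { field ->
            field.isAccessible = true // data class backing fields are private
            field.name to items.map { field.get(it) }
        }

fun main() {
    val persons = listOf(
        Person("Max", City("Dresden", 12309)),
        Person("Anna", City("Berlin", 10115))
    )
    // One "column" per City property, without naming the properties by hand.
    val cityColumns = unfoldToColumns(persons.map { it.address }, City::class.java)
    println(cityColumns)
}
```

The point of the sketch is that the property list is discovered at runtime, so adding a property to City needs no change at the call site.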

holgerbrandl avatar Aug 28 '22 08:08 holgerbrandl

persons.asDataFrame(): I believe persons.toDataFrame() does the trick, see https://kotlin.github.io/dataframe/createdataframe.html#todataframe

koperagen avatar Aug 29 '22 12:08 koperagen

@koperagen Thanks for the pointer.

I still think the key aspect of this ticket is valid: once we have objects of type E in a column of a data frame, we cannot conveniently expand E's properties into columns (without writing it out attribute by attribute using add()). That's what unfold was/is doing in krangl, and I used it quite frequently.

holgerbrandl avatar Sep 02 '22 15:09 holgerbrandl

Indeed. We'll introduce this API. I also made an example of how it can be done now.

class RepositoryInfo(val data: Any)

fun download(url: String) = RepositoryInfo("fancy response from the API")

val interestingRepos = dataFrameOf("name", "url")(
    "dataframe", "/dataframe",
    "kotlin", "/kotlin",
)

val initialData = interestingRepos
    .add("response") { download("url"()) }

class WebScrappingInitialData(val name: String, val url: String, val response: RepositoryInfo)

val df = initialData.cast<WebScrappingInitialData>(verify = true)

val response by column<RepositoryInfo>()

df
    .replace(response)
    .with { it.asIterable().toDataFrame().asColumnGroup(it.name()) }
    //.ungroup(response)
    .print()

df
    .replace("response")
    .with { (it as DataColumn<RepositoryInfo>)
        .asIterable()
        .toDataFrame {
            // todo a better example for this dsl
            properties(maxDepth = 1) {
                
            }
        }
        .asColumnGroup(it.name())
    }
    .rename { columnGroup("response").children() }.into { "response_" + it.name() }
    .ungroup("response")
    .schema().print()

koperagen avatar Sep 02 '22 18:09 koperagen

Actually, there is an undocumented API for that. I haven't found a good name for it, so it's currently called read:

data class City(val name:String, val code:Int)
data class Person(val name:String, val address:City)

val persons : List<Person> = listOf(
    Person("Max", City("Dresden", 12309)),
    Person("Anna", City("Berlin", 10115))
)

val personsDF = persons.toDataFrame()

personsDF.read(Person::address)

Thank you, @holgerbrandl, unfold seems to be a good name for this operation. Did you take it from some other dataframe library?

nikitinas avatar Oct 13 '22 23:10 nikitinas

Also, the List -> DataFrame conversion is called toDataFrame instead of asDataFrame, because in the Kotlin stdlib "as" means wrapping and "to" means copying: asSequence, but toList.
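
The as/to convention can be seen with the standard library alone. The following small sketch uses asReversed (a live view) and toList (a copy); the function name asVersusTo is only for illustration.

```kotlin
// Returns a live view and an independent copy of the same list,
// taken before the source list is modified.
fun asVersusTo(): Pair<List<Int>, List<Int>> {
    val source = mutableListOf(1, 2, 3)

    val view = source.asReversed() // "as": wraps `source`, no elements copied
    val copy = source.toList()     // "to": independent snapshot

    source.add(4)                  // mutate the original afterwards

    return view to copy
}

fun main() {
    val (view, copy) = asVersusTo()
    println(view) // [4, 3, 2, 1] -> the view sees the later change
    println(copy) // [1, 2, 3]    -> the copy does not
}
```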

nikitinas avatar Oct 13 '22 23:10 nikitinas

The name was inspired by the terminology section in https://tidyr.tidyverse.org/. Although in tidyr they ended up using unnest_wider (see https://tidyr.tidyverse.org/reference/hoist.html). unnest is also a nice candidate imho.

Regarding to vs as, indeed naming was/is wrong in krangl. I'm a slow learner so it took me forever to memorize the semantics here. :-)

holgerbrandl avatar Oct 14 '22 15:10 holgerbrandl

added as unfold() :)

Jolanrensen avatar Dec 20 '22 13:12 Jolanrensen

Just tried v0.9.1, and the unfold works nicely. Thanks @Jolanrensen for picking up the idea.

The design is great from a programmer's perspective, but imho a bit too complex from a data-science one (which is less technical).

I've figured out that I have to flatten() the result to make the columns visible, which imho could and should be one step. How could I prefix the attributes with the unfolded attribute name (plus _) to keep the overview in complex tables with many columns? How do I keep the original column? I could cherry-pick attributes of interest with a select on the final result, but imho it would be more efficient to cherry-pick variables when unfolding, to minimize the memory footprint of the dataframe.

So in short, I think the parameters from the krangl-unfold are missing:

  • attributes: List<String>? = null,
  • keep: Boolean = true,
  • addPrefix: Boolean = false,
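
As a rough illustration of what these three parameters could mean, here is a plain-Kotlin sketch that models a row as a map. The function unfoldRow and its exact semantics are assumptions inferred from the parameter names above; it is neither the krangl nor the kdf implementation.

```kotlin
import java.lang.reflect.Modifier

data class City(val name: String, val code: Int)

// Hypothetical model: a "row" is a map from column name to value.
fun unfoldRow(
    row: Map<String, Any?>,
    column: String,
    attributes: List<String>? = null, // null = unfold all properties
    keep: Boolean = true,             // keep the original object column
    addPrefix: Boolean = false,       // name new columns "<column>_<property>"
): Map<String, Any?> {
    val obj = row[column] ?: return row
    val result = row.toMutableMap()
    if (!keep) result.remove(column)
    obj.javaClass.declaredFields
        .filter { !it.isSynthetic && !Modifier.isStatic(it.modifiers) }
        .filter { attributes == null || it.name in attributes }
        .forEach { field ->
            field.isAccessible = true
            val target = if (addPrefix) "${column}_${field.name}" else field.name
            result[target] = field.get(obj) // without a prefix, equal names overwrite
        }
    return result
}

fun main() {
    val row = mapOf("name" to "Max", "city" to City("Dresden", 12309))
    // Only "code" is unfolded, the object column is dropped, the name is prefixed.
    println(unfoldRow(row, "city", attributes = listOf("code"), keep = false, addPrefix = true))
}
```

Note that in this model, unfolding without a prefix would overwrite the person's name with the city's name, which is one reason a prefix option is useful.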

The unfold impl in kdf supports unfolding multiple attributes at once, but I have not yet come across a use-case for that in my daily work. Rather, I typically want more control over how to unfold a specific column.

Disclaimer: It could well be that I had missed some details here, because when playing with the new unfold API the published docs were still about v0.8.

holgerbrandl avatar Jan 22 '23 20:01 holgerbrandl

fyi, I've added the convenience wrapper described above in https://github.com/holgerbrandl/kdfutils/blob/5f2226eb8e112b8a02f5bd6f96775fbcbc286fb1/src/main/kotlin/com/github/holgerbrandl/kdfutils/Unfold.kt#L38

holgerbrandl avatar Feb 04 '23 22:02 holgerbrandl

@holgerbrandl I use the convenience extension

inline fun <T, reified C> DataFrame<T>.unfold(column: ColumnReference<C>, noinline body: CreateDataFrameDsl<C>.() -> Unit) =
    replace(column).with { it.values().toDataFrame(body).asColumnGroup(it.name()) }

so that I can use the full power of CreateDataFrameDsl to control how to unfold it. It is basically a copy-paste of kdf's unfold implementation; the only difference is that the DSL is exposed as a parameter instead of using the hardcoded default { properties() }.

But this extension inherits the hierarchical nature of kdf, so unfolded columns are conveniently grouped, which is not what you want, so you might still need to flatten() the result. Btw, what do you mean by making the columns visible? How are they not visible before?

pacher avatar Feb 05 '23 09:02 pacher

@pacher Thanks for the pointer. Not sure how to create a ColumnReference more elegantly. getColumn does so, but requires referencing the df twice (df.unfold(df.getColumn("foo"))), which is arguably not pretty. But that would be a different question/thread imho. :-)

Regarding "How are they not visible before?"

data class City(val name: String, val code: Int)
data class Person(val name: String, val city: City)

val persons: List<Person> = listOf(
    Person("Max", City("Dresden", 12309)),
    Person("Anna", City("Berlin", 10115))
)

val personsDF = persons.toDataFrame()

val unfoldPersons = personsDF.unfold("city")
unfoldPersons.print()
unfoldPersons["code"]

prints

   name                         city
 0  Max { name:Dresden, code:12309 }
 1 Anna  { name:Berlin, code:10115 }

and fails with

Exception in thread "main" java.lang.IllegalArgumentException: Column not found: 'code'

So unfold did not fulfill its contract which imho is making the attribute code from the object-column city visible within my dataframe. I understand that is by design and the user has to flatten first.

In more technical terms, I think unfold is intentionally more of a map-style operation, because it transforms a column without actually unfolding it in space (that is, into columns).

My concern here is already tracked in https://github.com/Kotlin/dataframe/issues/232

My intent from above was just to point others, in case they struggle in the same way I do, to an alternative drop-in replacement of unfold (which is imho not trivial to write down without deeper insights into kdf's internals)

holgerbrandl avatar Feb 05 '23 17:02 holgerbrandl

@pacher Thanks for the pointer. Not sure how to create a ColumnReference more elegantly. getColumn does so, but requires referencing the df twice (df.unfold(df.getColumn("foo"))), which is arguably not pretty. But that would be a different question/thread imho. :-)

I don't use strings, but String.toColumnAccessor() should do. For example, an overload like:

inline fun <T, reified C> DataFrame<T>.unfold(column: String, noinline body: CreateDataFrameDsl<C>.() -> Unit) =
    unfold(column.toColumnAccessor() as ColumnReference<C>, body)

val unfoldPersons = personsDF.unfold("city")
unfoldPersons["code"]

So unfold did not fulfill its contract which imho is making the attribute code from the object-column city visible within my dataframe. I understand that is by design and the user has to flatten first.

Contract violation is a serious accusation ;-) I've never seen such a contract for unfold written anywhere and don't believe that it is broken. I guess documentation could be improved, which is true for any project at any time ;) . If you display this dataframe, new columns are perfectly visible and nicely packed into a group to keep things tidy. Imho it is reasonable to expect accessing new columns through city.code, just like you would access a property of City object in code.

What if I am unfolding an already deeply nested column? It makes sense to have the result "in place" instead of new columns suddenly popping up on the top level. What if I am unfolding multiple columns at once? To make it "in space", one has to change column names and add number suffixes or something similar to avoid name clashes. But as a programmer, how can I reference those new columns from code? With hierarchical dataframes I can just recursively select all those columns by "code", or better yet by City::code, with which even refactoring works. It is safe to rename code in the IDE or even delete it, because the compiler will tell me to fix its usages in column selectors.

My concern here is already tracked in #232

Everything above would better fit to #232. I apologize for bringing it here

My intent from above was just to point others, in case they struggle in the same way I do, to an alternative drop-in replacement of unfold (which is imho not trivial to write down without deeper insights into kdf's internals)

Your replacement reflects your vision that everything should be auto-flattened. That is exactly why I had to provide an alternative which fits better into the hierarchical design of kdf, because mixing the two approaches might not be the best for struggling users.

In fact, I made my extension so that I can make even more column groups!!! e.g.

personsDF.unfold {
    properties(name, city)
    "details" {
        "country" from getCountry(city)
        "population" from ...
        etc.    
    }
}

pacher avatar Feb 05 '23 21:02 pacher

The contract is imho defined by the name, which is about spreading or straightening out. If natural-language meaning and API do not agree, an API is not discoverable, making it very hard to learn and use. Sticking to the example from above, here's before and after with the built-in unfold

# before
   name                           city
 0  Max City(name=Dresden, code=12309)
 1 Anna  City(name=Berlin, code=10115)

# after unfold
   name                         city
 0  Max { name:Dresden, code:12309 }
 1 Anna  { name:Berlin, code:10115 }

To me it's very obvious that nothing has been unfolded; the city column has been mapped instead.

What if I am unfolding many/multiple columns at once? To make it "in space" one has to change column names and add number suffixes or something to avoid name clashing.

My referenced drop-in replacement takes care of that. It's baked in poorly, because I do not know the standard way that is used to resolve naming conflicts e.g. when doing a merge in kdf.
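
For readers following the naming discussion, here is one possible disambiguation policy sketched in plain Kotlin: nested groups are modelled as nested maps, and flattening always prefixes a child with its parent's name. The function flattenColumns and the "_" separator are assumptions for illustration only; this is not a description of how kdf or the referenced replacement resolves clashes.

```kotlin
// Hypothetical model: a row whose column groups are nested maps.
// Flattening joins the path to each leaf with "_", so clashing leaf
// names such as "name" and "city.name" end up as distinct columns.
fun flattenColumns(row: Map<String, Any?>, prefix: String = ""): Map<String, Any?> {
    val flat = LinkedHashMap<String, Any?>()
    for ((key, value) in row) {
        val path = if (prefix.isEmpty()) key else "${prefix}_$key"
        if (value is Map<*, *>) {
            @Suppress("UNCHECKED_CAST")
            flat.putAll(flattenColumns(value as Map<String, Any?>, path))
        } else {
            flat[path] = value
        }
    }
    return flat
}

fun main() {
    val row = mapOf(
        "name" to "Max",
        "city" to mapOf("name" to "Dresden", "code" to 12309)
    )
    println(flattenColumns(row)) // {name=Max, city_name=Dresden, city_code=12309}
}
```

Always prefixing is the simplest rule that never clashes within one group hierarchy; a policy that prefixes only on conflict gives shorter names but makes the resulting column names depend on which other columns happen to exist.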

In fact, when dealing with data tables, typing is never possible because it dies the second I want to add a single column. It's nice from a programming perspective that kdf can apply schemas to data frames, but in analytical workflows it's always AnyFrame. That's why solid string accessors and column-name disambiguation are imho key to enabling the library for data science.

Everything above would better fit to https://github.com/Kotlin/dataframe/issues/232. I apologize for bringing it here

Indeed, apologies from me as well as I started the argument :-)

holgerbrandl avatar Feb 06 '23 07:02 holgerbrandl