krangl icon indicating copy to clipboard operation
krangl copied to clipboard

Support replaceNA?

Open alphaho opened this issue 4 years ago • 4 comments

I've found it a bit counter-intuitive to replace NA with the value I want, as there's no API to call directly.

For now, I've used the below logic to do the job. May I know is this the proper way to do so? Do we need to add such API to krangl? (I'm happy to raise a PR if the below snippet is the officital way)

val myColumnName = "whatever-column-I-want-to-replace-NA"
df.addColumn(myColumnName to { it[myColumnName].values().map { it ?: 0 } }) 

alphaho avatar Mar 16 '21 07:03 alphaho

Related to https://github.com/holgerbrandl/krangl/issues/110

alphaho avatar Mar 16 '21 08:03 alphaho

Thanks for the suggestion. Indeed I find the current approach clumsy as well.

I was thinking about


fun DataFrame.replaceNA(value: Any,  columnSelect: ColumnSelector) =
    colSelectAsNames(columnSelect).fold(this) { df, column, ->
        df.addColumn(column) {
            it[column].values().map {
                it ?: value
            }
        }
    }

// note: colSelectAsNames is an internal method as of now

fun main() {
    val sales = dataFrameOf("product", "sales", "city")(
        "A", 32, "Dresden",
        null, 12.3, "New York",
        "A", 24, "Boston",
        "B", null, "Boston",
        "B", 44, null
    )

    sales
        .replaceNA("AAA") { listOf("city", "product") }
        .apply {
            print()
            schema()
        }
    sales
        .replaceNA("AAA") { listOf("city") }
        .apply {
            print()
            schema()
        }

    sales
        .replaceNA(0) { listOf("sales") }
        .apply {
            print()
            schema()
        }
}

But maybe that's too much as well, and a per column extension would be the most intuitive solution. For sure a PR would be great here. I'm happy to help as well, but would value your input first regarding an intuitive API design.

holgerbrandl avatar Mar 30 '21 20:03 holgerbrandl

You are right. Calling sales.replaceNA("AAA") { listOf("city", "product") } may not be as intuitive as it would be in Pandas, which will be df['DataFrame Column'] = df['DataFrame Column'].fillna(0) .

I think at least there are two problems here:

  1. one is the ColumnSelector (e.g. { listOf("city", "product")}) seems a bit verbose to use. The first thought that comes to my mind is whether we can get rid of calling listOf, but only require vararg column names.

  2. the other is immutability puts some constraints in defining such APIs. That's what I believe to be the reason that the extension function is defined on DataFrame instead of a column. How about we have some API like:

fun DataFrame.replaceColumn(columnName: String, update: (DataCol) -> DataCol) {
    val df = this
    df.addColumn(columnName) {
        update(it[columnName])
    }
}

// would need some help to make this work:
fun DataCol.replaceNA(defaultValue: Any): DataCol = values().map { it?: defaultValue}

sales.replaceColumn("product") { 
  it.replaceNA("AAA")
}.replaceColumn("sales") {
  it.replaceNA(0)
}

Another kind of API may be like the following. But it seems a bit harder to set up, as I think we would need some kind of mutable column to collect all the mutations:

sales.replaceColumns("product", "sales") {  productColumn, salesColumn ->
  productColumn.replaceNA("AAA")
  salesColumn.replaceNA(0)
}

This is to mimic the style of buildList API or buildString which take a function to mutate MutableList / StringBuilder to build up the new immutable value (DataCol in the case of above API). I think this can be more friendly to new users, as it provides some familiarity to the users (using pandas) while keeping the DataFrame and DataCol immutable, even though it may have some performance drawbacks.

How do you think about these two APIs?

alphaho avatar Apr 10 '21 09:04 alphaho

Sorry for the delay & and thanks for the proposal.

Rearding 1. It's not that simple to replace listOf with varargs, because there are multiple implementations of ColumnSelect and listOf is just one of them. See https://github.com/holgerbrandl/krangl/blob/17be47ab78f61d1e2db68d7e99ed3a078985bdf0/src/main/kotlin/krangl/Select.kt#L52-L52

Regarding 2. addColumn will already replace an existing column. However, I agree that the signature of replaceColumn may make it more intuitive. This does not affect immutability (which even in your example is still in place)

I don't think that we can express replaceColumns in kotlin unless we provide different implementations (one for each number of arguments).

With some weeks since my last comment, I actually like the idea of sales.replaceNA("AAA") { listOf("city", "product") } except for the method name which might be better replaceNaWith or replaceNaWithIn.

holgerbrandl avatar Apr 21 '21 20:04 holgerbrandl