Create DataFrame from list of rows where each row is Map

Open PoslavskySV opened this issue 3 years ago • 9 comments

Hi guys,

it would be nice to add a method for creating a DataFrame from a list of rows represented as general Maps. Right now when I do:

val rows : List<Map<String, Any?>>

val df = rows.toDataFrame()

I get a weird result: a DataFrame whose columns come from the properties of the Map class. It would be more intuitive to get a DataFrame whose columns come from the keys of the maps. Does that make sense to you?

PoslavskySV avatar Feb 17 '22 21:02 PoslavskySV

Does this map only have primitive values? I'm asking because in that case we can simply create an overload for List<Map<String, Any?>>. But if you need to convert objects inside this Map into dataframe structures, then the Map class should be supported as a special case in the existing Iterable<*>.toDataFrame().

koperagen avatar Mar 06 '22 23:03 koperagen

I am not sure I understood your question. Map values can have arbitrary types (the same for each Map in the iterable). For example, suppose I have a list (or any other iterable) of maps with the following structure:

{
    "col1" -> String,
    "col2" -> Double,
    "col3" -> MyEnum
}

then from rows.toDataFrame() I would expect to obtain an AnyFrame with the columns col1 of type String, col2 of type Double, etc. Does that sound meaningful?
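
For illustration, here is a minimal stdlib-only sketch of the pivot being asked for. It is not the library's implementation: `rowsToColumns`, the constants of `MyEnum`, and the sample values are made up for this example, and plain Kotlin collections stand in for DataFrame columns. Keys missing from a row are assumed to become null.

```kotlin
// Hypothetical enum standing in for MyEnum from the structure above.
enum class MyEnum { A, B }

// Pivot row maps into columns keyed by map key.
// A key that is missing from a row yields null in that row's position.
fun rowsToColumns(rows: List<Map<String, Any?>>): Map<String, List<Any?>> {
    // Union of all keys, in first-seen order (LinkedHashSet keeps insertion order).
    val keys = LinkedHashSet<String>()
    rows.forEach { keys.addAll(it.keys) }
    // associateWith preserves the iteration order of the keys.
    return keys.associateWith { key -> rows.map { row -> row[key] } }
}

fun main() {
    val rows: List<Map<String, Any?>> = listOf(
        mapOf("col1" to "x", "col2" to 1.5, "col3" to MyEnum.A),
        mapOf("col1" to "y", "col2" to 2.5), // col3 is missing in this row
    )
    val columns = rowsToColumns(rows)
    println(columns.keys)    // [col1, col2, col3]
    println(columns["col3"]) // [A, null]
}
```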

PoslavskySV avatar Mar 09 '22 20:03 PoslavskySV

Yes. Let me clarify the question. If you have col4 in your map with the type class Name(val firstName: String, val lastName: String), we can convert it in two ways:

  1. To DataColumn<Name>
  2. To ColumnGroup with 2 columns, firstName and lastName (the way Iterable<*>.toDataFrame(depth = 2) would work for classes)

Do you need 1 or 2?
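
The difference between the two options can be sketched with plain Kotlin collections. This is only an analogy, not DataFrame code: `objectColumn` and `unfoldNames` are hypothetical helpers, a `List<Name>` stands in for DataColumn<Name>, and a map of two lists stands in for a ColumnGroup.

```kotlin
// Name is the class from the question above; the helpers below are hypothetical.
data class Name(val firstName: String, val lastName: String)

// Option 1: keep the values of one key as objects, giving a single column of Name.
fun objectColumn(rows: List<Map<String, Any?>>, key: String): List<Name> =
    rows.map { it[key] as Name }

// Option 2: unfold each Name into two sub-columns, one per property.
fun unfoldNames(column: List<Name>): Map<String, List<String>> = mapOf(
    "firstName" to column.map { it.firstName },
    "lastName" to column.map { it.lastName },
)

fun main() {
    val rows: List<Map<String, Any?>> = listOf(
        mapOf("col4" to Name("Ada", "Lovelace")),
        mapOf("col4" to Name("Alan", "Turing")),
    )
    val option1 = objectColumn(rows, "col4")
    val option2 = unfoldNames(option1)
    println(option1) // [Name(firstName=Ada, lastName=Lovelace), Name(firstName=Alan, lastName=Turing)]
    println(option2) // {firstName=[Ada, Alan], lastName=[Lovelace, Turing]}
}
```

Going from option 1 to option 2 only needs the property values, whereas the reverse direction needs to know which class to construct.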

koperagen avatar Mar 09 '22 21:03 koperagen

Thank you for the clarification, now I see! I need option 1, and in general I feel it is the more natural choice here.

PoslavskySV avatar Mar 09 '22 23:03 PoslavskySV

@PoslavskySV Until the team implements a proper solution, you can use this extension function:

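import org.jetbrains.kotlinx.dataframe.AnyFrame
import org.jetbrains.kotlinx.dataframe.DataColumn
import org.jetbrains.kotlinx.dataframe.api.Infer
import org.jetbrains.kotlinx.dataframe.api.toDataFrame

fun List<Map<String, Any?>>.toDataFrame(): AnyFrame {
    // Column name -> values collected so far, in row order.
    val columns = mutableMapOf<String, MutableList<Any?>>()
    // Names of columns that contain at least one non-null value.
    val notNullCols = mutableSetOf<String>()
    val columnSize = size

    forEachIndexed { rowIndex, row ->
        for (col in row.keys) {
            if (columns[col] == null)
                columns[col] = mutableListOf()

            // Note: empty strings are treated as missing values.
            val value =
                if (row[col].let { it is String && it.isEmpty() }) null
                else row[col]

            if (value != null)
                notNullCols += col

            // Pad with nulls for earlier rows that did not have this key.
            while (columns[col]!!.size < rowIndex)
                columns[col]!! += null

            columns[col]!! += value
        }
    }

    return columns
        // Note: columns that contain only nulls are dropped.
        .filter { it.value.isNotEmpty() && it.key in notNullCols }
        .also { map ->
            // Pad with nulls for trailing rows that did not have this key.
            for ((_, value) in map) {
                while (value.size < columnSize)
                    value += null
            }
        }
        .map { (key, value) -> DataColumn.create(key, value, infer = Infer.Type) }
        .toDataFrame()
}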

...

listOf(
    mapOf("a" to "1", "b" to "2", "c" to "3"),
    mapOf("a" to "4", "b" to "5", "c" to "6"),
    mapOf("a" to "7", "b" to "8", "c" to "9"),
).toDataFrame().print(borders = true)

// ⌌-----------⌍
// |  | a| b| c|
// |--|--|--|--|
// | 0| 1| 2| 3|
// | 1| 4| 5| 6|
// | 2| 7| 8| 9|
// ⌎-----------⌏

ian-k avatar Apr 13 '23 20:04 ian-k

@ian-k I took the liberty of refactoring your example a bit so it's clearer how it works :). Thanks for sharing!

Jolanrensen avatar Apr 17 '23 10:04 Jolanrensen

@koperagen I think it should be option 1, if it gets implemented. As discussed in many other places, it is easy to unfold Name into columns, but there is no easy way to fold it back if one day you do need DataColumn<Name>.

pacher avatar Apr 17 '23 11:04 pacher

What is the work required to close this issue? Is it just to include the extension function above in the library somewhere? Looking for a couple of 'good first issue' tickets to contribute to.

engineerdan avatar Oct 16 '23 11:10 engineerdan

Implementation can be put here. We usually split it into API and implementation parts, with the API simply delegating the work to the impl:

kotlin/org/jetbrains/kotlinx/dataframe/api/toDataFrame.kt

kotlin/org/jetbrains/kotlinx/dataframe/impl/api/toDataFrame.kt
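
As a rough sketch of that API/impl split, assuming hypothetical names (`toColumns`, `toColumnsImpl`) and plain Kotlin collections in place of DataFrame types, the two parts could look like this. In the real project they would sit in the two separate files; they are shown together here only to keep the example self-contained.

```kotlin
// impl part: does the actual work (hypothetical name, not part of the library).
internal fun toColumnsImpl(rows: List<Map<String, Any?>>): Map<String, List<Any?>> =
    rows.flatMap { it.keys }   // all keys from all rows
        .distinct()            // union of keys, first-seen order
        .associateWith { key -> rows.map { it[key] } } // missing keys become null

// API part: the public entry point, which only delegates to the impl.
fun List<Map<String, Any?>>.toColumns(): Map<String, List<Any?>> = toColumnsImpl(this)

fun main() {
    val rows: List<Map<String, Any?>> = listOf(
        mapOf("a" to "1", "b" to "2"),
        mapOf("a" to "4"),
    )
    println(rows.toColumns()) // {a=[1, 4], b=[2, null]}
}
```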

Tests for the API can go here: kotlin/org/jetbrains/kotlinx/dataframe/testSets/person/DataFrameTests.kt

Then you can add documentation, similar to the existing sections here: https://github.com/Kotlin/dataframe/blob/5b425a7fcde397426e6d14a9c5398c6cd5c91c7a/docs/StardustDocs/topics/createDataFrame.md

Here's a template for it :)

`DataFrame` from `List<Map<String, Any?>>`:

<!---FUN createDataFrameFromListOfMap-->

<!---END-->

The code sample there is inserted by the korro Gradle task. You need to add a test in kotlin/org/jetbrains/kotlinx/dataframe/samples/api/Create.kt with code that uses this new function, and then run the task. You can use some of the existing ones as examples. If all goes well, createDataFrame.md will be updated.

Extra info about korro can be found here: https://github.com/Kotlin/dataframe/blob/master/docs/contributions.md. But don't hesitate to ask.

koperagen avatar Oct 16 '23 12:10 koperagen