dataframe readExcel created <Comparable> column

Hello, found this annoying situation where a schema would be printed as

F1: Comparable
F2: String
F3: String
F4: String
F5: Comparable
F6: String

This creates 3 kinds of issues:

1. If a filter is used, sometimes only one of the entries is retrieved even if there are 2 matching the given filter.
2. If an update or sort operation is used, this error occurs -> class java.lang.Double cannot be cast to class java.lang.String (java.lang.Double and java.lang.String are in module java.base of loader 'bootstrap')
3. If convert { "F1"<String>() }.to<String>() is called, this error occurs -> Can't find converter from kotlin.Comparable<> to kotlin.String*

For situation 1 I tried to update or convert the column to a String, which is how I discovered situations 2 & 3.

Thanks

Aug 14 '22 21:08 LeandroC89

Hi! Would it be helpful and clear what's going on if schema was printed like this?

F1: Comparable (String & Double)
F2: String
F3: String
F4: String
F5: Comparable (String & Double)
F6: String

Aug 16 '22 12:08 koperagen

Hi, my issue is not with the schema itself but with using Comparable as a type.

If you have 2 occurrences of the same value in the same column, the Excel filter detects both instances, but the DataFrame filter may return only 1. (I suppose this is because one is considered a String and the other a Double. Whatever the reason, it is a huge risk, as filter suddenly becomes unreliable.)
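A minimal plain-Kotlin sketch of why such a filter can miss rows, assuming the column holds the same spreadsheet value once as a String and once as a Double. It uses no DataFrame API; `mixedColumn`, `matchExactly`, `asNumberOrNull` and `matchLoosely` are hypothetical names introduced only for illustration.

```kotlin
// Hypothetical stand-in for a column that was read with mixed cell types:
// the same value appears once as text and once as a number.
val mixedColumn: List<Any?> = listOf("42", 42.0, "abc", null)

// Plain equality: a String key never equals a Double cell.
fun matchExactly(values: List<Any?>, key: String): List<Any?> =
    values.filter { it == key }

// Interprets a cell as a number when possible, otherwise returns null.
fun asNumberOrNull(value: Any?): Double? = when (value) {
    is Number -> value.toDouble()
    is String -> value.trim().toDoubleOrNull()
    else -> null
}

// Matches on text equality or, failing that, on numeric equality.
fun matchLoosely(values: List<Any?>, key: String): List<Any?> {
    val keyNumber = key.trim().toDoubleOrNull()
    return values.filter { cell ->
        cell == key || (keyNumber != null && asNumberOrNull(cell) == keyNumber)
    }
}
```

With this data, `matchExactly(mixedColumn, "42")` keeps one cell, while `matchLoosely(mixedColumn, "42")` keeps two.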

The same happens with the sort operation: instead of sorting, it throws an exception because it can't compare a String with a Double.
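The exception can be reproduced without the library. The sketch below assumes Kotlin on the JVM and sorts a mixed list through an unchecked Comparable cast, which is presumably close to what a sort over a Comparable-typed column ends up doing. `sortAsComparable` and `sortAsText` are hypothetical names.

```kotlin
// Compiles, but throws ClassCastException at runtime as soon as a String
// and a Double are compared with each other.
@Suppress("UNCHECKED_CAST")
fun sortAsComparable(values: List<Any>): List<Any> =
    values.sortedWith(Comparator<Any> { a, b -> (a as Comparable<Any>).compareTo(b) })

// Sorting on the text form avoids the cast, at the cost of
// lexicographic rather than numeric ordering.
fun sortAsText(values: List<Any>): List<Any> =
    values.sortedBy { it.toString() }
```

Sorting on the text form is only one possible choice; it orders "10" before "9", so numeric data may need a different key.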

If one tries to use convert to change types, it either isn't possible or you have to define the column type as either String or Double, since Comparable is not a valid type. And then you get the same exception as before.

If one tries to convert to String, it fails. Likewise for the update function.

My workaround was to add a new column holding the values of the Comparable-typed column converted with .toString(), and to use that column instead.

My point is, if Comparable is unreliable and clearly the library isn't prepared for it, wouldn't it be better to simply remove it? Make it so if a Text is found on a column of Number up until then, then the whole column would be type String.

I even ensured I had selected all populated Excel cells and had them typed as Text beforehand (not that Excel is any good at typings unless you go through each cell, double click and press Enter but that would take forever).

Thank you.

Aug 16 '22 23:08 LeandroC89

> Make it so if a Text is found on a column of Number up until then, then the whole column would be type String.

This sounds good, I think. If it means no loss of data (i.e. all those numbers can be converted back by something like it.toDoubleOrNull()), then we can probably do it.
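A rough plain-Kotlin sketch of that rule and of the round trip back to numbers. This is not the library's implementation; `unifyColumn` and `numbersBack` are hypothetical names, and the cells are assumed to be String, Double or null.

```kotlin
// If any cell is text, store every non-null cell as text;
// otherwise leave the column untouched.
fun unifyColumn(cells: List<Any?>): List<Any?> =
    if (cells.any { it is String }) cells.map { it?.toString() } else cells

// Recovers the numeric cells from a text column; non-numeric text becomes null.
fun numbersBack(cells: List<String?>): List<Double?> =
    cells.map { it?.toDoubleOrNull() }
```

On the JVM, a Double written with toString() parses back to the same value, so the numbers themselves survive. What is lost is the distinction between a numeric cell and a text cell that happens to contain a number.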

But I would still like to see if we can improve the experience of handling situations where this weird type shows up in input data.

Because, in fact, you can tidy up this column like this: df.convert { F1 }.with(Infer.Type) { (it as? Double)?.let { it.toString() } }

Another solution could be saving F1 as a ColumnGroup:

F1:
    string: String
    double: Double

So you could write df.convert { F1 }.with { it.string ?: it.double?.toString() }
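The shape of that idea can be modeled in plain Kotlin. The sketch below is only an analogy for the proposed ColumnGroup, not DataFrame code; `MixedCell`, `split` and `merge` are hypothetical names.

```kotlin
// One row of the proposed group: at most one of the two fields is set.
data class MixedCell(val string: String?, val double: Double?)

// Routes a raw cell value into the field matching its runtime type.
fun split(value: Any?): MixedCell = when (value) {
    is String -> MixedCell(string = value, double = null)
    is Double -> MixedCell(string = null, double = value)
    else -> MixedCell(string = null, double = null)
}

// Same expression as in the convert call above: prefer the text,
// fall back to the number rendered as text.
fun merge(cell: MixedCell): String? =
    cell.string ?: cell.double?.toString()
```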

What do you think? My concern is that the first solution is probably easy to miss, and the second one can be confusing, because you suddenly get a ColumnGroup instead of a DataColumn. Maybe we should print the schema after read operations by default in notebooks and add some extra information, idk

Aug 17 '22 11:08 koperagen

Hello, and thank you for your explanation.

I'm having some difficulty getting the inferType to properly work. I managed to get it like this: (I wanted nulls as "" for this case)

.convert { "F1"<Any>() }
    .with(inferType = true) { it.toString() }

I used Any because Comparable is not a valid Column Type, whether used for String access or for Column Accessors (Using String would cause an issue with InferType)

I still had to deal with the ".0" resulting from the Double to String conversion, but it's already something I can work with.
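One way to handle both the ".0" suffix and the nulls in a single step, sketched in plain Kotlin. `cellText` is a hypothetical helper, and the 1e15 bound is an assumption chosen so that the Long conversion stays exact.

```kotlin
import kotlin.math.abs

// Renders a cell as text: null becomes "", whole-number doubles drop
// their ".0" suffix, everything else uses toString().
fun cellText(value: Any?): String = when (value) {
    null -> ""
    is Double ->
        if (value % 1.0 == 0.0 && abs(value) < 1e15) value.toLong().toString()
        else value.toString()
    else -> value.toString()
}
```

A function like this could be used as the body of the convert lambda in place of the bare toString() call, assuming the lambda receives the raw cell value.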

But as for the Comparable use case:

It can cause issues when using sort
Filters may become unreliable if data isn't properly converted beforehand
Column<Comparable> is not accepted, having to resort to using one of the column types or Any

Does it make sense to keep it as a possible column type when reading a file? (Just adding some remarks, I already have a workaround for my main issue thanks to your reply 👍 )

Thank you!

Aug 19 '22 08:08 LeandroC89

@koperagen had other ideas for handling multiple types in one column, which I summarized here: https://github.com/Kotlin/dataframe/issues/466.

Oct 09 '23 14:10 Jolanrensen
