Additional feedback from dataframe operations

Open koperagen opened this issue 2 years ago • 12 comments

Some read/write operations may yield unexpected results, for example

  1. Reading a CSV file with a non-default delimiter could produce a dataframe with one column instead of 5. There is %trackExecutions, which prints the generated schema, but it either prints after each cell or doesn't print at all.
  2. Dataframe saves nested objects with just "toString", so you can't read them back. It might be helpful to print paths to all columns with non-serializable objects.
  3. ...?
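To make the first case concrete, here is a minimal sketch of the kind of feedback that could be produced. It uses plain Kotlin strings instead of the DataFrame reader, and `headerColumns` and `delimiterWarning` are hypothetical names invented for this illustration, not library functions.

```kotlin
// Hypothetical helpers, not part of the Kotlin DataFrame API.
// Splits the header line with the given delimiter.
fun headerColumns(headerLine: String, delimiter: Char): List<String> =
    headerLine.split(delimiter).map { it.trim() }

// Returns a warning when the header parsed into a single column
// but contains another common delimiter; returns null otherwise.
fun delimiterWarning(headerLine: String, delimiter: Char): String? {
    val columns = headerColumns(headerLine, delimiter)
    if (columns.size > 1) return null
    val candidates = listOf(',', ';', '\t', '|').filter { it != delimiter && it in headerLine }
    return candidates.firstOrNull()?.let {
        "Parsed 1 column with delimiter '$delimiter'; the header also contains '$it'"
    }
}
```

For a header such as `a;b;c;d;e` read with `','`, `delimiterWarning` returns a message instead of silently yielding one column; with `';'` it returns null.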

One of the possible solutions is some sort of logging: https://github.com/Kotlin/dataframe/issues/138#issuecomment-1210644392

I think we need more specific use cases

koperagen avatar Aug 11 '22 17:08 koperagen

else -> {
    this.setCellValue(any.toString())
}

here might be one of them. Or you are looking for something more specific?
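A rough sketch of how such a fallback could be surfaced to the user: before writing, collect the names of columns that would end up going through `toString()`. Columns are modelled as a plain `Map` here, and both function names and the list of "natively writable" types are assumptions made for illustration.

```kotlin
// Hypothetical sketch: columns are a plain Map, not DataFrame columns.
// Assumed set of types a writer can store without falling back to toString().
fun isNativelyWritable(value: Any?): Boolean =
    value == null || value is Number || value is Boolean || value is String || value is Char

// Returns the names of columns containing at least one value
// that a writer would have to store via toString().
fun columnsWrittenViaToString(columns: Map<String, List<Any?>>): List<String> =
    columns.filterValues { values -> values.any { !isNativelyWritable(it) } }.keys.toList()
```

A writer could print this list once per call, which would point directly at the columns that cannot be read back.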

The same case I have in saving to Arrow now.

Also I want to make saving to Arrow with some pre-defined schema and print warnings (or throw exceptions in strict mode) if some column does not exist in actual data or can not be converted to target type.

Kopilov avatar Aug 15 '22 14:08 Kopilov

Also I want to make saving to Arrow with some pre-defined schema and print warnings (or throw exceptions in strict mode) if some column does not exist in actual data or can not be converted to target type.

Sounds like AnyFrame.convertTo

here might be one of them. Or you are looking for something more specific?

this is a good example

koperagen avatar Aug 16 '22 12:08 koperagen

Sounds like AnyFrame.convertTo

Does it sound good? Saving "as is" would be the default behavior, but since Arrow supports explicit schemas, I want to use them too. Another system might expect data with some declared schema (and expose it), like inserting into an existing SQL table.

Kopilov avatar Aug 16 '22 15:08 Kopilov

Sounds like AnyFrame.convertTo

Does it sound good? Saving "as is" would be the default behavior, but since Arrow supports explicit schemas, I want to use them too. Another system might expect data with some declared schema (and expose it), like inserting into an existing SQL table.

I probably misunderstood a little bit. So you want to make some dataframe to match specific schema and then save it, like

val df1: DataFrame<YourSchema> = df.yourFunction<YourSchema>(Mode.Warning)
df1.writeArrow()

or df.writeArrow("file.feather", arrowSchema, Mode.Warning)?

koperagen avatar Aug 16 '22 17:08 koperagen

So you want to make some dataframe to match specific schema and then save it

Actually, yes.

df.writeArrow("file.feather", arrowSchema, Mode.Warning)

Like this. Currently it looks like df.arrowWriter(arrowSchema)... // some combination of target format and sink. If arrowSchema is not provided, it is generated from the actual data.

I am thinking about user-friendly API now.

Kopilov avatar Aug 17 '22 12:08 Kopilov

Main internal logic is here but flags are still not exposed

Kopilov avatar Aug 17 '22 13:08 Kopilov

Regarding logging / throwing exceptions, maybe some kind of callback parameter could help? Like df.arrowWriter(..., SchemaMissmatch.Ignore / SchemaMissmatch.Throw / SchemaMissmatch.Callback { // log here }). SchemaMissmatch.Ignore and SchemaMissmatch.Throw are implementations of the same interface as Callback: the first does nothing and the second just throws an exception.

So the library is not dependent on any specific logging framework, and you can even use this callback to update the UI.
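The callback idea could be sketched roughly as follows. Only the names `SchemaMissmatch`, `Ignore`, `Throw` and `Callback` come from the comment above (spelling kept as written there); the method `onMismatch`, the helper `checkColumns` and the message text are assumptions for illustration, not an existing API.

```kotlin
// Hypothetical design sketch, not an existing DataFrame or Arrow API.
interface SchemaMissmatch {
    fun onMismatch(message: String)

    // Does nothing: mismatches are silently skipped.
    object Ignore : SchemaMissmatch {
        override fun onMismatch(message: String) = Unit
    }

    // Fails on the first mismatch.
    object Throw : SchemaMissmatch {
        override fun onMismatch(message: String): Unit = throw IllegalArgumentException(message)
    }

    // Delegates to a user-supplied handler, e.g. a logger call or a UI update.
    class Callback(private val handler: (String) -> Unit) : SchemaMissmatch {
        override fun onMismatch(message: String) = handler(message)
    }
}

// Reports every expected column that is absent from the actual data.
fun checkColumns(expected: Set<String>, actual: Set<String>, onMismatch: SchemaMissmatch) {
    for (name in expected - actual) {
        onMismatch.onMismatch("Column '$name' is missing in the actual data")
    }
}
```

With this shape, `checkColumns(expected, actual, SchemaMissmatch.Callback { println(it) })` prints a line per missing column, while passing `SchemaMissmatch.Throw` turns the same check into strict mode.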

koperagen avatar Aug 18 '22 19:08 koperagen

Maybe, thanks

Kopilov avatar Aug 19 '22 06:08 Kopilov

Returning to the issue topic, what do you think about additional dataframe feedback in general?

Kopilov avatar Aug 19 '22 06:08 Kopilov

Returning to the issue topic, what do you think about additional dataframe feedback in general?

I'm concerned that logging could be a slippery slope. If a library starts to produce additional output, it can easily become irrelevant noise if it's turned on by default. So it should be turned off by default, and people can opt in to logging if they need extra information to find errors, right?

I should probably see how it's used in other libraries. Do you have an example?

koperagen avatar Aug 22 '22 20:08 koperagen

I think using standard logging levels would be enough.

An example of using loggers inside a library can be found in Apache Arrow: injecting the logger, writing logs.

Here I have made an example of running a tiny application with different levels turned off/on: https://github.com/Kopilov/testLoggingLevels. Were you looking for something like this?
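As a small illustration of the opt-in idea with standard levels, here is a sketch using `java.util.logging` from the JDK. The logger name and both function names are hypothetical; the library does not define them, and a real implementation might use a different logging facade.

```kotlin
import java.util.logging.ConsoleHandler
import java.util.logging.Level
import java.util.logging.Logger

// Hypothetical logger name, chosen only for this sketch.
val ioLogger: Logger = Logger.getLogger("dataframe.io.sketch")

// Logs schema details at FINE, which the default INFO configuration filters out.
fun reportParsedColumns(columnNames: List<String>) {
    ioLogger.fine { "Parsed ${columnNames.size} column(s): $columnNames" }
}

// Opt-in switch for users who want the extra diagnostics.
fun enableIoDiagnostics() {
    ioLogger.level = Level.FINE
    // The default console handler passes only INFO and above, so attach one that passes FINE.
    ioLogger.addHandler(ConsoleHandler().apply { level = Level.FINE })
}
```

Under the default JDK logging configuration, `reportParsedColumns` stays silent until `enableIoDiagnostics()` is called, so the extra output is not noise for users who never asked for it.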

Kopilov avatar Aug 23 '22 14:08 Kopilov

Parsing falls into this category of implicit operations, especially parsing of Double, which can produce different results for the same input depending on the locale. https://github.com/Kotlin/dataframe/issues/568
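The locale dependence can be reproduced with the JDK alone. This sketch uses `java.text.NumberFormat` rather than the DataFrame parser, so it only illustrates the general effect; `parseNumber` is a hypothetical helper name.

```kotlin
import java.text.NumberFormat
import java.util.Locale

// Parses a number using the conventions of the given locale.
fun parseNumber(text: String, locale: Locale): Double =
    NumberFormat.getInstance(locale).parse(text).toDouble()
```

`parseNumber("1,5", Locale.GERMANY)` gives 1.5 because the comma is the decimal separator there, while `parseNumber("1,5", Locale.US)` gives 15.0 because the comma is treated as a grouping separator. Neither call reports that anything ambiguous happened.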

koperagen avatar Feb 05 '24 17:02 koperagen