Additional feedback from dataframe operations

Some read/write operations may yield unexpected results, for example:

Reading a CSV file with a non-default delimiter could produce a dataframe with one column instead of 5. There is %trackExecutions, which prints the generated schema, but it either prints after every cell or doesn't print at all.
dataframe saves nested objects with just "toString", so you can't read them back. It might be helpful to print the paths of all columns that contain non-serializable objects.
...?
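
To make the CSV case concrete, here is a minimal sketch of the kind of feedback that could be produced. It uses only the Kotlin standard library; `suspectedDelimiter` and its list of candidate characters are hypothetical and are not part of the Kotlin DataFrame API.

```kotlin
// Hypothetical helper, not part of Kotlin DataFrame.
// If the header does not split on the configured delimiter but contains
// another common delimiter, return that character as a hint for a warning.
fun suspectedDelimiter(headerLine: String, delimiter: Char = ','): Char? {
    // The configured delimiter produced more than one column: nothing suspicious.
    if (headerLine.split(delimiter).size > 1) return null
    // Candidate list is an assumption made for this sketch.
    return listOf(';', '\t', '|', ',').firstOrNull { it != delimiter && it in headerLine }
}

fun main() {
    val hint = suspectedDelimiter("id;name;age;city;zip")
    if (hint != null) {
        println("Only one column was read; the file may be delimited by '$hint'")
    }
}
```

A reader could pass such a hint to whatever reporting mechanism is chosen instead of printing it directly.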

One of the possible solutions is some sort of logging: https://github.com/Kotlin/dataframe/issues/138#issuecomment-1210644392

I think we need more specific use cases

Aug 11 '22 17:08 koperagen

else -> {
    this.setCellValue(any.toString())
}

here might be one of them. Or you are looking for something more specific?

I have the same case when saving to Arrow now.
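
A rough sketch of how the columns that hit such a `toString` fallback could be listed for the user. Columns are modelled as a plain map here, and the set of types treated as "supported" is an assumption made for illustration; neither reflects the actual writer code.

```kotlin
// Hypothetical helper, not part of Kotlin DataFrame.
// Returns the names of columns containing at least one value that a writer
// with a toString() fallback would silently convert to text.
fun columnsWithFallbackToString(columns: Map<String, List<Any?>>): List<String> =
    columns.filterValues { values ->
        // Assumption: numbers, strings and booleans are written natively.
        values.any { it != null && it !is Number && it !is String && it !is Boolean }
    }.keys.toList()

fun main() {
    val columns = mapOf(
        "id" to listOf(1, 2),
        "tags" to listOf(listOf("a", "b"), listOf("c")),
    )
    println("Columns saved via toString(): ${columnsWithFallbackToString(columns)}")
}
```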

I also want to support saving to Arrow with a pre-defined schema, printing warnings (or throwing exceptions in strict mode) if a column does not exist in the actual data or cannot be converted to the target type.

Aug 15 '22 14:08 Kopilov

I also want to support saving to Arrow with a pre-defined schema, printing warnings (or throwing exceptions in strict mode) if a column does not exist in the actual data or cannot be converted to the target type.

Sounds like AnyFrame.convertTo

here might be one of them. Or you are looking for something more specific?

this is a good example

Aug 16 '22 12:08 koperagen

Sounds like AnyFrame.convertTo

Does it sound good? Saving "as is" would be the default behavior, but since Arrow supports explicit schemas, I want to use them as well. Some other system might expect the data in a declared schema (and expose that schema), like inserting into an existing SQL table.

Aug 16 '22 15:08 Kopilov

Sounds like AnyFrame.convertTo

Does it sound good? Saving "as is" would be the default behavior, but since Arrow supports explicit schemas, I want to use them as well. Some other system might expect the data in a declared schema (and expose that schema), like inserting into an existing SQL table.

I probably misunderstood a little bit. So you want to make a dataframe match a specific schema and then save it, like

val df1: DataFrame<YourSchema> = df.yourFunction<YourSchema>(Mode.Warning)
df1.writeArrow()

or df.writeArrow("file.feather", arrowSchema, Mode.Warning)?

Aug 16 '22 17:08 koperagen

So you want to make a dataframe match a specific schema and then save it

Actually, yes.

df.writeArrow("file.feather", arrowSchema, Mode.Warning)

Like this. Currently it looks like df.arrowWriter(arrowSchema)...//some combinations of target format and sink. If arrowSchema is not provided, it is generated from the actual data.

I am thinking about user-friendly API now.

Aug 17 '22 12:08 Kopilov

Main internal logic is here but flags are still not exposed

Aug 17 '22 13:08 Kopilov

Regarding logging / throwing exceptions, maybe some kind of callback parameter could help? Like df.arrowWriter(..., SchemaMissmatch.Ignore / SchemaMissmatch.Throw / SchemaMissmatch.Callback { // log here }). SchemaMissmatch.Ignore and SchemaMissmatch.Throw are implementations of the same interface as Callback: one does nothing and the other just throws an exception.

So the library is not dependent on any specific logging framework, and you can even use this callback to update UI
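
A minimal sketch of how this callback idea could be shaped, using only the Kotlin standard library. The names follow the spelling used in this thread; the `report` signature and the `checkColumns` helper are assumptions made for illustration and do not exist in the library.

```kotlin
// Hypothetical sketch: none of these types exist in Kotlin DataFrame.
interface SchemaMissmatch {
    fun report(column: String, problem: String)

    // Does nothing: mismatches are silently skipped.
    object Ignore : SchemaMissmatch {
        override fun report(column: String, problem: String) = Unit
    }

    // Strict mode: the first mismatch aborts the operation.
    object Throw : SchemaMissmatch {
        override fun report(column: String, problem: String): Unit =
            throw IllegalStateException("$column: $problem")
    }

    // Delegates to user code, e.g. a logger or a UI update.
    class Callback(private val handler: (String, String) -> Unit) : SchemaMissmatch {
        override fun report(column: String, problem: String) = handler(column, problem)
    }
}

// Stand-in for the writer's schema check: reports expected columns that are absent.
fun checkColumns(expected: Set<String>, actual: Set<String>, onMismatch: SchemaMissmatch) {
    for (name in expected - actual) onMismatch.report(name, "column is missing from the data")
}

fun main() {
    val warnings = mutableListOf<String>()
    val collect = SchemaMissmatch.Callback { column, problem -> warnings += "$column: $problem" }
    checkColumns(setOf("id", "name"), setOf("id"), collect)
    println(warnings)
}
```

With this shape the writer only ever calls `report`, so the choice between ignoring, failing, and logging stays entirely with the caller.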

Aug 18 '22 19:08 koperagen

Maybe, thanks

Aug 19 '22 06:08 Kopilov

Returning to the issue topic, what do you think about additional dataframe feedback in general?

Aug 19 '22 06:08 Kopilov

Returning to the issue topic, what do you think about additional dataframe feedback in general?

I'm concerned that logging could be a slippery slope. If a library starts to produce additional output, it could easily become irrelevant noise if it's turned on by default. So it should be turned off by default, and people can opt in to logging if they need extra information to find errors, right?

I should probably look at how it's used in other libraries. Do you have an example?

Aug 22 '22 20:08 koperagen

I think using standard logging levels would be enough.

Example of using loggers inside the library can be found in Apache Arrow: injecting the logger, writing logs.

Here I have made an example of running a tiny application with different levels turned off/on: https://github.com/Kopilov/testLoggingLevels. Were you looking for something like this?
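
As a self-contained illustration of opt-in levels, here is a sketch based on java.util.logging from the JDK (chosen only to avoid extra dependencies; it is not what the linked project or any particular library uses). The logger name and the `logSchemaInfo` function are hypothetical. It assumes the default JDK logging configuration, where the effective level is INFO.

```kotlin
import java.util.logging.Level
import java.util.logging.Logger

// Hypothetical logger name, used only for this sketch.
private val logger: Logger = Logger.getLogger("example.dataframe.io")

// Emits a diagnostic message at FINE level and returns whether that level
// is currently enabled, so callers can see the opt-in effect.
fun logSchemaInfo(columnCount: Int): Boolean {
    // The lambda is evaluated only when FINE is enabled for this logger.
    logger.fine { "Read CSV with $columnCount column(s)" }
    return logger.isLoggable(Level.FINE)
}

fun main() {
    println(logSchemaInfo(1))   // FINE is below the default INFO level
    logger.level = Level.FINE   // the user opts in
    println(logSchemaInfo(1))
}
```

To actually see FINE messages on the console, the handler's level has to be lowered as well; the sketch only shows that the library side stays silent until the user changes the level.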

Aug 23 '22 14:08 Kopilov

Parsing falls into this category of implicit operations, especially parsing of Double, which can produce different results for the same input depending on the locale. https://github.com/Kotlin/dataframe/issues/568
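
The locale effect can be reproduced with the JDK alone. The sketch below uses java.text.NumberFormat and only illustrates the general behaviour; it is not the parser used by dataframe, and `parseWithLocale` is a name made up for this example.

```kotlin
import java.text.NumberFormat
import java.util.Locale

// Parses text using the number conventions of the given locale (plain JDK, not DataFrame).
fun parseWithLocale(text: String, locale: Locale): Double =
    NumberFormat.getInstance(locale).parse(text).toDouble()

fun main() {
    // Same input, different results:
    println(parseWithLocale("1,5", Locale.US))      // comma is treated as a grouping separator
    println(parseWithLocale("1,5", Locale.GERMANY)) // comma is treated as the decimal separator
}
```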

Feb 05 '24 17:02 koperagen
