Additional feedback from dataframe operations
Some read/write operations may yield unexpected results, for example:
- Reading a CSV file with a non-default delimiter can produce a dataframe with one column instead of 5. There is a `%trackExecutions` that prints the generated schema, but it either prints after each cell or doesn't print at all.
- The dataframe saves nested objects with just `toString`, so you can't read them back. It might be helpful to print the paths to all columns with non-serializable objects.
- ...?

One possible solution is some sort of logging: https://github.com/Kotlin/dataframe/issues/138#issuecomment-1210644392
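The delimiter pitfall from the first bullet can be sketched in plain Kotlin (this is only an illustration of the failure mode, not the library's CSV reader):

```kotlin
// Hypothetical illustration: a semicolon-separated row read with the default
// comma delimiter collapses into a single "column".
fun splitRow(line: String, delimiter: Char): List<String> =
    line.split(delimiter)

fun main() {
    val line = "id;name;age;city;score" // semicolon-separated data
    println(splitRow(line, ',').size)   // 1 -- the whole line becomes one column
    println(splitRow(line, ';').size)   // 5 -- the expected column count
}
```

Without any schema output, the user only discovers this later, when column access fails.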
I think we need more specific use cases
```kotlin
else -> {
    this.setCellValue(any.toString())
}
```
Here might be one of them. Or are you looking for something more specific?
I have the same case in saving to Arrow now.
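The `toString` fallback above can be illustrated with a plain Kotlin round trip (a hypothetical sketch, not the library's writer): once a nested object is flattened to a string, the structure is unrecoverable.

```kotlin
// A nested object in a cell, written out via the toString fallback.
data class Point(val x: Int, val y: Int)

fun main() {
    val cell: Any = listOf(Point(1, 2), Point(3, 4))
    val written = cell.toString()  // what ends up in the file
    println(written)               // [Point(x=1, y=2), Point(x=3, y=4)]

    // Reading it back yields only a String; the original List<Point> cannot
    // be reconstructed without a real serialization format.
    val readBack: Any = written
    println(readBack is List<*>)   // false
}
```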
Also, I want to make saving to Arrow work with a pre-defined schema and print warnings (or throw exceptions in strict mode) if some column does not exist in the actual data or cannot be converted to the target type.
> Also, I want to make saving to Arrow work with a pre-defined schema and print warnings (or throw exceptions in strict mode) if some column does not exist in the actual data or cannot be converted to the target type.
Sounds like `AnyFrame.convertTo`
> Here might be one of them. Or are you looking for something more specific?
This is a good example.
> Sounds like `AnyFrame.convertTo`
Does it sound good? Saving "as is" would be the default behavior, but since Arrow supports explicit schemas, I want to use them as well. Some other system might expect the data to come with a declared schema (and expose it), like inserting into an existing SQL table.
> Sounds like `AnyFrame.convertTo`
>
> Does it sound good? Saving "as is" would be the default behavior, but since Arrow supports explicit schemas, I want to use them as well. Some other system might expect the data to come with a declared schema (and expose it), like inserting into an existing SQL table.
I probably misunderstood a little bit. So you want to make some dataframe match a specific schema and then save it, like
```kotlin
val df1: DataFrame<YourSchema> = df.yourFunction<YourSchema>(Mode.Warning)
df1.writeArrow()
```
or
```kotlin
df.writeArrow("file.feather", arrowSchema, Mode.Warning)
```
?
> So you want to make some dataframe match a specific schema and then save it
Actually, yes.
> `df.writeArrow("file.feather", arrowSchema, Mode.Warning)`
Like this. Currently it looks like:

```kotlin
df.arrowWriter(arrowSchema)... // some combination of target format and sink
```

If `arrowSchema` is not provided, it is generated from the actual data.
I am thinking about a user-friendly API now.
The main internal logic is here, but the flags are still not exposed.
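"Generated from the actual data" could look roughly like the following sketch: inspect each column's values and pick the narrowest type that fits. All names here are invented for illustration; this is not the dataframe-arrow internals.

```kotlin
// Hypothetical schema inference: try Int, then Double, else fall back to text.
fun inferColumnType(values: List<String>): String = when {
    values.all { it.toIntOrNull() != null } -> "int32"
    values.all { it.toDoubleOrNull() != null } -> "float64"
    else -> "utf8"
}

fun main() {
    println(inferColumnType(listOf("1", "2", "3")))    // int32
    println(inferColumnType(listOf("1.5", "2", "3")))  // float64
    println(inferColumnType(listOf("a", "2", "3")))    // utf8
}
```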
Regarding logging / throwing exceptions, maybe some kind of callback parameter could help? Like

```kotlin
df.arrowWriter(..., SchemaMismatch.Ignore / SchemaMismatch.Throw / SchemaMismatch.Callback { /* log here */ })
```

`SchemaMismatch.Ignore` and `SchemaMismatch.Throw` are implementations of the same interface as `Callback`; one does nothing and the second just throws an exception. This way the library does not depend on any specific logging framework, and you can even use this callback to update a UI.
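The proposal above can be sketched in a few lines of plain Kotlin. Everything here (`SchemaMismatchHandler`, `writeColumn`) is invented for this discussion and is not an existing API:

```kotlin
// The callback interface; Ignore and Throw are just two built-in implementations.
fun interface SchemaMismatchHandler {
    fun onMismatch(message: String)
}

object Ignore : SchemaMismatchHandler {
    override fun onMismatch(message: String) { /* do nothing */ }
}

object Throw : SchemaMismatchHandler {
    override fun onMismatch(message: String): Unit = throw IllegalStateException(message)
}

// A writer reports problems through the handler instead of logging directly.
fun writeColumn(name: String, exists: Boolean, handler: SchemaMismatchHandler) {
    if (!exists) handler.onMismatch("Column '$name' is missing in actual data")
}

fun main() {
    writeColumn("age", exists = false, Ignore)                 // silently skipped
    writeColumn("age", exists = false) { msg -> println(msg) } // custom callback logs
    try {
        writeColumn("age", exists = false, Throw)              // strict mode
    } catch (e: IllegalStateException) {
        println("strict mode failed: ${e.message}")
    }
}
```

Because `SchemaMismatchHandler` is a `fun interface`, users can pass a plain lambda, so no logging framework leaks into the library's API.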
Maybe, thanks.
Returning to the issue topic, what do you think about additional dataframe feedback generally?
> Returning to the issue topic, what do you think about additional dataframe feedback generally?
I'm concerned that logging could be a slippery slope. If a library starts to produce additional output, it can easily become irrelevant noise if it's turned on by default. So it should be turned off by default, and people can opt in to logging if they need extra information to find errors, right?
Probably I should see how it's used in other libraries. Do you have an example?
I think using standard logging levels would be enough.
An example of using loggers inside a library can be found in Apache Arrow: injecting the logger, writing logs.
Here I have made an example of running a tiny application with different levels turned off/on: https://github.com/Kopilov/testLoggingLevels. Were you looking for something like this?
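The opt-in idea with standard levels can be sketched with `java.util.logging` (chosen here only because it ships with the JDK; `CsvReader` and its message are invented for illustration):

```kotlin
import java.util.logging.Level
import java.util.logging.Logger

object CsvReader {
    private val logger = Logger.getLogger(CsvReader::class.java.name)

    fun read(line: String, delimiter: Char): List<String> {
        val columns = line.split(delimiter)
        // FINE is below the default INFO threshold, so this record is dropped
        // unless the user explicitly lowers the logger's level.
        logger.log(Level.FINE, "Generated schema with ${columns.size} column(s)")
        return columns
    }
}

fun main() {
    CsvReader.read("a;b;c", ',') // silent by default

    // The user opts in; the logger now accepts FINE records
    // (attached handlers still filter by their own level).
    Logger.getLogger(CsvReader::class.java.name).level = Level.FINE
    CsvReader.read("a;b;c", ',')
}
```

The library stays quiet by default, and the extra diagnostics cost nothing to users who never enable them.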
Parsing falls into this category of implicit operations, especially parsing of `Double`, which can produce different results for the same input depending on the locale: https://github.com/Kotlin/dataframe/issues/568
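The locale dependence is easy to reproduce with the JDK's own `NumberFormat` (just an illustration of the pitfall, independent of the dataframe parser):

```kotlin
import java.text.NumberFormat
import java.util.Locale

fun main() {
    val input = "1.234"
    val us = NumberFormat.getInstance(Locale.US).parse(input).toDouble()
    val de = NumberFormat.getInstance(Locale.GERMANY).parse(input).toDouble()
    println(us) // 1.234  -- '.' is the decimal separator in the US locale
    println(de) // 1234.0 -- '.' is the grouping separator in the German locale
}
```

The same input string silently yields values that differ by three orders of magnitude, which is exactly the kind of case where some feedback from the library would help.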