DataFrame icon indicating copy to clipboard operation
DataFrame copied to clipboard

Support of temporal column types

Open sgoeschl opened this issue 4 years ago • 18 comments

Hi Alexander,

I'm looking at "DataFrame" to use it in of my pet projects

  • How to support temporal column types, e.g. LocalDate or LocalDateTime? Want I want to do is to hide a parsed CSV or Excel sheet behind "DataFrame" but Excel has temporal column types
  • Did you have a look at Commons CSV? Since you strive for minimal dependencies it might be not an option but I wanted to ask :-)

Siegfried

sgoeschl avatar May 12 '20 15:05 sgoeschl

Hi Siegfried,

Thanks for your suggestions. I plan to rework column types to allow easy definition of custom value types. This would allow the definition of DateColumns. Further, it would require the option to add hints for parsing (e.g. date format within a column).

A bridge for Commons CSV is a great idea to improve file format support.

nRo avatar May 18 '20 17:05 nRo

Hi Alexander,

drop me a note when there is something to integrate - seems that I some unexpected free time in summer due to COVID-19 :-)

Siegfried

sgoeschl avatar May 18 '20 18:05 sgoeschl

Hi Siegfried,

Help is always greatly appreciated :)

I hope that I can work on custom value type support this weekend. Further, I will try to create issues for other features and enhancements to better track progress.

nRo avatar May 19 '20 14:05 nRo

I'm ready to kick into action - https://issues.apache.org/jira/browse/FREEMARKER-144

sgoeschl avatar May 30 '20 20:05 sgoeschl

So I started working on custom value type support in this branch: value-type-abstraction and create an issue #22 There are still some TODOs to support custom value types, described in #22

This approach would allow the creation of a Date ValueType that can support different formats for parsing and writing

nRo avatar May 31 '20 16:05 nRo

I got a proof of concept running integrating "nRo/Dataframe" into Apache FreeMarker CLI

sgoeschl avatar Jun 01 '20 09:06 sgoeschl

Thats great!

Regarding temporal column types. How would you expect date formats are handled per default? Some kind of auto detection, or a default format that can be changed by giving hints when the data is read.

nRo avatar Jun 02 '20 16:06 nRo

You make the decision - both approaches sound good to me :-)

What I want to do:

  • Transform Apache Commons CSV into a DataFrame and here I only handle Strings (partly implemented)
  • Transform a Map into a DataFrame (partly implemented)
  • Transform an Excel sheet (POI & getDateCellValue) to a DataFrame

So I guess I come from side of defining the DataFrame and then populating it

sgoeschl avatar Jun 03 '20 07:06 sgoeschl

Okay, thanks for the feedbacks. Unfortunately, I don't have much time at the moment. But I will keep working on that over the next days.

nRo avatar Jun 07 '20 16:06 nRo

Mostly finished but not really happy with the code :-)

sgoeschl avatar Jun 17 '20 18:06 sgoeschl

Is the feature branch stable enough to do some preliminary integration?

sgoeschl avatar Jun 17 '20 18:06 sgoeschl

The feature branch is pretty stable. It only needs support for autodetection of custom types and some tests.

nRo avatar Jun 18 '20 16:06 nRo

sorry for the delay, the custom value feature branch is now merged into master #22 .

I will now start looking into temporal column types. I suppose that three different types would be required:

  • Date
  • DateTime
  • Timespan

Each should be able to handle different formats. Possible operations for these tables would be:

  • convert them from one type into another
  • operations for different time units (add minutes, hours, days,...)
  • operations between columns (calculate difference,...)

Any other suggestions?

nRo avatar Jul 26 '20 16:07 nRo

Hi, little bit confused by "Timespan" - is this a time-only value, e.g. "13:42:23" or do you mean "3:24h"?

Regarding operations - little to no suggestion from my side - I would mostly filter / sort / query on temporal columns coming from CSV and Excel.

sgoeschl avatar Aug 05 '20 09:08 sgoeschl

Hi, with "Timespan" I mean a type like Java 8 Duration. Subtraction of two Dates would result in a Timespan for example

nRo avatar Aug 11 '20 17:08 nRo

Ack - assuming that I understood the things correctly

  • A column to be read would consist of either Date (e.g. 14.10.2019) or DateTime in some format (e.g. 2019-10-14T12:00:00)
  • Substractions of dates would result in a timespan

A few questions along the line

  • Is it possible to read timespans from CSV? Usually they don't have a qualifier such as seconds or day as cell values
  • Would timestamps be supported, e .g. "12:01:31"?

sgoeschl avatar Aug 11 '20 19:08 sgoeschl

sorry for the delay again

  • A column to be read would consist of either Date (e.g. 14.10.2019) or DateTime in some format (e.g. 2019-10-14T12:00:00)

exactly. some points I am still not sure about:

  • how to pass format information to the CSV parser (should be rather simple)
  • implement autodetection for temporal columns
  • How to handle Timezones
  • Substractions of dates would result in a timespan

yes. Thats how I would do it. So that if two dates are extracted you get information about how many hours, minutes, seconds,... passed between those dates.

  • Is it possible to read timespans from CSV? Usually they don't have a qualifier such as seconds or day as cell values
  • Would timestamps be supported, e .g. "12:01:31"?

Timestamps could be supported by adding parsing hints to the CSV parser. They could either just be epoch miliseconds (Long) or a format like you described.

Autodetection for timestamps is more difficult. A timestamp could also be a Long column.

nRo avatar Aug 22 '20 10:08 nRo

Hi Alexander,

Don't worry about the delays - it is an open-source project after all :-)

Thanks in advance,

Siegfried Goeschl

On 22.08.2020, at 12:57, Alexander Grün [email protected] wrote:

sorry for the delay again

A column to be read would consist of either Date (e.g. 14.10.2019) or DateTime in some format (e.g. 2019-10-14T12:00:00) exactly. some points I am still not sure about:

how to pass format information to the CSV parser (should be rather simple) implement autodetection for temporal columns How to handle Timezones Substractions of dates would result in a timespan yes. Thats how I would do it. So that if two dates are extracted you get information about how many hours, minutes, seconds,... passed between those dates.

Is it possible to read timespans from CSV? Usually they don't have a qualifier such as seconds or day as cell values Would timestamps be supported, e .g. "12:01:31"? Timestamps could be supported by adding parsing hints to the CSV parser. They could either just be epoch miliseconds (Long) or a format like you described.

Autodetection for timestamps is more difficult. A timestamp could also be a Long column.

— You are receiving this because you authored the thread. Reply to this email directly, view it on GitHub https://github.com/nRo/DataFrame/issues/21#issuecomment-678626254, or unsubscribe https://github.com/notifications/unsubscribe-auth/AABR6IDWFI4OUYM5E3IRVRDSB6QBHANCNFSM4M66LYMQ.

sgoeschl avatar Aug 23 '20 18:08 sgoeschl