readDelimiter variant for Regex as delimiter

Open dave08 opened this issue 1 year ago • 5 comments

Since this function exists specifically to read delimited text, it might be useful to have an overload that takes a Regex as the delimiter. This could be used for command-line output tables, which are usually space-separated but sometimes contain a single space inside a column value, so I need to use "\s\s+" to read them in correctly.

dave08 avatar Jun 20 '24 11:06 dave08

Hi. The library we're using now only has String and Char options for the delimiter. Is your file a CSV/TSV or just a plain txt with some special format you want to parse?

koperagen avatar Jun 20 '24 12:06 koperagen

Say I have (output from kubectl get namespaces):

NAME                     STATUS   AGE      LABELS
argo-events              Active   2y77d    app.kubernetes.io/instance=argo-events,kubernetes.io/metadata.name=argo-events
argo-workflows           Active   2y77d    app.kubernetes.io/instance=argo-workflows,kubernetes.io/metadata.name=argo-workflows
argocd                   Active   5y18d    kubernetes.io/metadata.name=argocd
beta                     Active   4y235d   kubernetes.io/metadata.name=beta

Then I have multiple spaces as delimiters...

In some command line outputs, I have two words in one column:

NAME                                                                     CLUSTER        CDS        LDS        EDS        RDS          ECDS         ISTIOD                             VERSION
foo-5fcd67944f-2t97k.dev                                           Kubernetes     SYNCED     SYNCED     SYNCED     SYNCED       NOT SENT     istiod-1-18-7-dbcdbb5f4-nth9n      1.18.7
foo-6f8bf4c9b9-qrwf9.prod                                          Kubernetes     SYNCED     SYNCED     SYNCED     SYNCED       NOT SENT     istiod-1-16-7-6d46d45875-gxtzw     1.16.7

Like that NOT SENT... that's where a regex can help here. It's not just tabs, it's a bunch of spaces.
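To make the request concrete, here is a minimal sketch of the behaviour using only the Kotlin standard library. `parseSpaceSeparatedTable` and `columnGap` are hypothetical names for illustration, not part of the DataFrame API, and the sample is a shortened version of the output above.

```kotlin
// Hypothetical helper, not part of the DataFrame API.
// A column gap is a run of two or more whitespace characters.
val columnGap = Regex("""\s\s+""")

fun parseSpaceSeparatedTable(text: String): List<Map<String, String>> {
    val lines = text.lines().filter { it.isNotBlank() }
    if (lines.isEmpty()) return emptyList()
    // The first non-blank line holds the column names.
    val header = lines.first().trim().split(columnGap)
    return lines.drop(1).map { line ->
        val cells = line.trim().split(columnGap)
        // Pair each column name with its cell; missing trailing cells become "".
        header.mapIndexed { i, name -> name to cells.getOrElse(i) { "" } }.toMap()
    }
}

fun main() {
    val output = """
        NAME                        CLUSTER        ECDS         VERSION
        foo-5fcd67944f-2t97k.dev    Kubernetes     NOT SENT     1.18.7
        foo-6f8bf4c9b9-qrwf9.prod   Kubernetes     NOT SENT     1.16.7
    """.trimIndent()
    val rows = parseSpaceSeparatedTable(output)
    println(rows[0]["ECDS"]) // prints "NOT SENT"
}
```

This only works when every cell is non-empty: an empty cell in the middle of a row merges into the neighbouring gap and shifts the remaining values one column to the left.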

Also, how would you parse Markdown tables (or similar)...? Unless the library trims all those extra spaces... but I guess with markdown there might be more complications than just a delimiter.

dave08 avatar Jun 20 '24 12:06 dave08

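Good questions indeed. I think such tables should be parsed by readDelimStr in the future. For now I can only suggest something like this for Markdown.

import org.jetbrains.kotlinx.dataframe.api.*

// Strip the outer pipes, split on the inner ones, and trim each cell.
fun String.markdownCells() = trim('|').split("|").map { it.trim() }

val s = """
| Month    | Savings |
| -------- | ------- |
| January  | $250    |
| February | $80     |
| March    | $420    |""".trimIndent()

val lines = s.lineSequence()
// Skip the header and separator rows, load the rest as a single "value" column,
// then split that column into cells named after the header row.
lines.drop(2).toList().toDataFrame().split { value }.by { it.markdownCells() }.into(lines.first().markdownCells())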

koperagen avatar Jun 20 '24 13:06 koperagen

I think that's a bit of an advanced technique for most people with this kind of use case... and it involves parsing in two steps...

I wonder if some kind of readDSL would be better here... it could possibly work by line and give helpers for extracting the titles and values?

dave08 avatar Jun 20 '24 15:06 dave08

Please share the desired API or usage examples that you have in mind. Maybe something like this could be added.

koperagen avatar Jun 20 '24 15:06 koperagen

I'm closing this for now. We're working on a new CSV implementation based on Deephaven CSV https://github.com/Kotlin/dataframe/issues/827 since it's faster and lighter. Unfortunately, it doesn't allow Regexes as delimiters either, just a Char. It does have multiple other options, like ignoreSurroundingSpaces, which can trim leading and trailing blanks around values, and it can recognize quote characters. That might help :).

We plan to have an experimental version of it in 0.15. If that still does not work, I'd recommend modifying the string manually, potentially adding quote characters and then parsing it as delimStr.
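As a rough illustration of that manual workaround, the sketch below rewrites space-aligned output into quoted, comma-separated text using only the Kotlin standard library. `toQuotedCsv` is a hypothetical helper name, and the choice of comma as the target delimiter is an assumption; the result would then be passed to whichever delimited-text reader you use, which is not called here.

```kotlin
// Hypothetical helper, not part of the DataFrame API.
// Turns columns separated by 2+ whitespace characters into quoted CSV.
fun toQuotedCsv(text: String): String {
    val gap = Regex("""\s\s+""")
    return text.lines()
        .filter { it.isNotBlank() }
        .joinToString("\n") { line ->
            line.trim().split(gap).joinToString(",") { cell ->
                // Double any embedded quotes, then wrap the whole cell in quotes
                // so values containing spaces or commas stay in one column.
                "\"" + cell.replace("\"", "\"\"") + "\""
            }
        }
}

fun main() {
    val output = """
        NAME                        ECDS         VERSION
        foo-5fcd67944f-2t97k.dev    NOT SENT     1.18.7
    """.trimIndent()
    // The printed text can be fed to a CSV reader with ',' as the delimiter.
    println(toQuotedCsv(output))
}
```

Quoting every cell keeps values such as the comma-separated LABELS column from the kubectl output intact when the text is later read with a comma delimiter.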

Edit: well, apparently it seems to have some issues with delimiter = ' ', ignoreSurroundingSpaces = true. I'll make an issue over at deephaven XD https://github.com/deephaven/deephaven-csv/issues/212

Jolanrensen avatar Oct 22 '24 11:10 Jolanrensen

Hi, since you mentioned you were developing your own CSV library, I thought I would comment here.

Whether you decide to use Deephaven's CSV library or develop your own, there are a variety of things we learned along the way that may benefit you. We used some clever ideas for high performance and also some cute tricks for automatic "type inference". I'd be happy to discuss in more detail in some appropriate forum if you would find that helpful. Best, Corey Kosak @ Deephaven

kosak avatar Oct 26 '24 04:10 kosak

@kosak We're not developing our own CSV library. We're simply replacing our Apache commons CSV integration in DataFrame with Deephaven's :) exactly for the reasons you mentioned; performance, type inference, etc. Plus, while we currently don't store our data primitively, using Deephaven, that remains a viable option in the future.

Jolanrensen avatar Oct 28 '24 09:10 Jolanrensen

https://github.com/deephaven/deephaven-csv/issues/212 is merged :)

We'll add it in https://github.com/Kotlin/dataframe/pull/903. Simply set hasFixedWidthColumns = true and the column widths are determined by the width of the headers + spaces.

You can also manually specify fixedColumnWidths if this goes wrong.
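For readers unfamiliar with fixed-width parsing, the idea can be sketched with the Kotlin standard library alone. This is not the Deephaven or DataFrame implementation; `columnStarts` and `parseFixedWidth` are hypothetical names, and the sketch assumes header names contain no spaces and that values stay within the column started by their header.

```kotlin
// Hypothetical helpers illustrating fixed-width parsing; not the library's code.
// Each column starts where a header word starts.
fun columnStarts(header: String): List<Int> =
    Regex("""\S+""").findAll(header).map { it.range.first }.toList()

fun parseFixedWidth(text: String): List<List<String>> {
    val lines = text.lines().filter { it.isNotBlank() }
    if (lines.isEmpty()) return emptyList()
    val starts = columnStarts(lines.first())
    return lines.map { line ->
        starts.mapIndexed { i, start ->
            // A column runs from its start to the next column's start,
            // or to the end of the line for the last column.
            val end = if (i + 1 < starts.size) minOf(starts[i + 1], line.length) else line.length
            if (start >= line.length) "" else line.substring(start, end).trim()
        }
    }
}

fun main() {
    val output = """
        NAME             STATUS   AGE
        argo-events      Active   2y77d
        my app           Active   5y18d
    """.trimIndent()
    // The first row of the result is the header; later rows are the values.
    parseFixedWidth(output).forEach { println(it) }
}
```

Because columns are cut by position rather than by whitespace, a value containing a single space (or even several) stays in one cell, which is why this approach suits aligned command-line output.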

Jolanrensen avatar Nov 11 '24 12:11 Jolanrensen