readDelimiter variant for Regex as delimiter
Since this function exists specifically to read delimited text, it might be useful to have an overload that takes a Regex as the delimiter. This would help with command-line output tables, which are usually space-separated but sometimes contain a single space inside a column value, so I need to use "\s\s+" to read them correctly.
Hi. The library we're using now only supports String and Char delimiters. Is your file a CSV/TSV, or plain text with some special format you want to parse?
Say I have (output from kubectl get namespaces):
NAME             STATUS   AGE      LABELS
argo-events      Active   2y77d    app.kubernetes.io/instance=argo-events,kubernetes.io/metadata.name=argo-events
argo-workflows   Active   2y77d    app.kubernetes.io/instance=argo-workflows,kubernetes.io/metadata.name=argo-workflows
argocd           Active   5y18d    kubernetes.io/metadata.name=argocd
beta             Active   4y235d   kubernetes.io/metadata.name=beta
Then I have multiple spaces as delimiters...
In some command line outputs, I have two words in one column:
NAME                       CLUSTER     CDS     LDS     EDS     RDS     ECDS      ISTIOD                          VERSION
foo-5fcd67944f-2t97k.dev   Kubernetes  SYNCED  SYNCED  SYNCED  SYNCED  NOT SENT  istiod-1-18-7-dbcdbb5f4-nth9n   1.18.7
foo-6f8bf4c9b9-qrwf9.prod  Kubernetes  SYNCED  SYNCED  SYNCED  SYNCED  NOT SENT  istiod-1-16-7-6d46d45875-gxtzw  1.16.7
Note the NOT SENT value: that's where a regex can help. The columns aren't separated by tabs, just by runs of spaces.
Also, how would you parse Markdown tables (or similar)...? Unless the library trims all those extra spaces... but I guess with Markdown there might be more complications than just a delimiter.
Good questions indeed. I think such tables should be parsed by readDelimStr in the future. For now I can only suggest something like this for Markdown:
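For reference, the "\s\s+" idea can be sketched with the Kotlin standard library alone. This is a minimal sketch, not a DataFrame API: `parseAlignedTable` is a hypothetical helper, and the sample is an abridged version of the output above (fewer columns, one row).

```kotlin
// Hypothetical helper (not part of Kotlin DataFrame): splits whitespace-aligned
// CLI output on runs of two or more whitespace characters.
fun parseAlignedTable(text: String): List<Map<String, String>> {
    val delimiter = Regex("""\s\s+""")
    val lines = text.lines().map { it.trim() }.filter { it.isNotEmpty() }
    if (lines.isEmpty()) return emptyList()
    val header = lines.first().split(delimiter)
    return lines.drop(1).map { line ->
        val cells = line.split(delimiter)
        // Rows shorter than the header are padded with empty strings.
        header.mapIndexed { i, name -> name to cells.getOrElse(i) { "" } }.toMap()
    }
}

fun main() {
    // Abridged sample: a single space inside "NOT SENT" must not split the cell.
    val output = """
        NAME                        CLUSTER      CDS      ECDS        VERSION
        foo-5fcd67944f-2t97k.dev    Kubernetes   SYNCED   NOT SENT    1.18.7
    """.trimIndent()
    val rows = parseAlignedTable(output)
    println(rows.first()["ECDS"]) // prints "NOT SENT"
}
```

This only works while every pair of adjacent columns is separated by at least two spaces and no cell is empty; an empty cell would shift the remaining values one column to the left.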
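// Split one Markdown table row into trimmed cell values.
fun String.markdownCells() = trim('|').split("|").map { it.trim() }
val s = """
| Month | Savings |
| -------- | ------- |
| January | $250 |
| February | $80 |
| March | $420 |""".trimIndent()
val lines = s.lineSequence()
// Skip the header and separator rows, load the remaining lines as a single `value` column,
// then split that column into cells named after the header row.
lines.drop(2).toList().toDataFrame().split { value }.by { it.markdownCells() }.into(lines.first().markdownCells())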
I think that's a bit of an advanced technique for most people with this kind of use case... and it involves parsing in two steps...
I wonder if some kind of readDSL would be better here... it could possibly work by line and give helpers for extracting the titles and values?
Please share the API you'd like, or example usages you have in mind. Maybe something like this could be added.
I'm closing this for now. We're working on a new CSV implementation based on Deephaven CSV (https://github.com/Kotlin/dataframe/issues/827), since it's faster and lighter. Unfortunately, it doesn't allow a Regex as the delimiter either, just a Char.
It does have multiple other options like ignoreSurroundingSpaces, which can trim leading and trailing blanks around values and it can recognize quote characters. That might help :).
We plan to have an experimental version of it in 0.15. If that still does not work, I'd recommend modifying the string manually, potentially adding quote characters and then parsing it as delimStr.
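The "modify the string manually" workaround could look roughly like this. It is a sketch under assumptions: `alignedToCsv` is a hypothetical helper built only on the Kotlin standard library, and it assumes columns are separated by at least two spaces.

```kotlin
// Hypothetical helper: rewrites space-aligned text as comma-separated text
// with every value wrapped in double quotes.
fun alignedToCsv(text: String): String =
    text.lines()
        .map { it.trim() }
        .filter { it.isNotEmpty() }
        .joinToString("\n") { line ->
            line.split(Regex("""\s\s+"""))
                // Double any embedded quotes, then quote the whole value.
                .joinToString(",") { cell -> "\"" + cell.replace("\"", "\"\"") + "\"" }
        }

fun main() {
    val csv = alignedToCsv("NAME    STATUS    AGE\nargocd  Active    5y18d")
    println(csv)
    // The resulting text could then be passed to the library's delimited-string
    // reader; that call is omitted to keep this sketch stdlib-only.
}
```

Quoting every value keeps cells such as NOT SENT intact even if the text is later parsed with a plain Char delimiter.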
Edit: well, apparently it seems to have some issues with delimiter = ' ', ignoreSurroundingSpaces = true. I'll make an issue over at deephaven XD https://github.com/deephaven/deephaven-csv/issues/212
Hi, since you mentioned you were developing your own CSV library, I thought I would comment here.
Whether you decide to use Deephaven's CSV library or develop your own, there are a variety of things we learned along the way that may benefit you. We used some clever ideas for high performance and also some cute tricks for automatic "type inference". I'd be happy to discuss in more detail in some appropriate forum if you would find that helpful. Best, Corey Kosak @ Deephaven
@kosak We're not developing our own CSV library. We're simply replacing our Apache Commons CSV integration in DataFrame with Deephaven's :) exactly for the reasons you mentioned: performance, type inference, etc. Plus, while we currently don't store our data primitively, using Deephaven keeps that a viable option for the future.
https://github.com/deephaven/deephaven-csv/issues/212 is merged :)
We'll add it in https://github.com/Kotlin/dataframe/pull/903. Simply set hasFixedWidthColumns = true and the column widths are determined by the width of the headers + spaces.
You can also manually specify fixedColumnWidths if this goes wrong.
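To illustrate the general idea of deriving column widths from the header, here is a stdlib-only sketch. It is not Deephaven's implementation: `parseFixedWidth` is a hypothetical helper, and it assumes header names contain no spaces and that no value starts before, or runs past, its header's column.

```kotlin
// Hypothetical helper: each column starts where a header word starts and
// extends up to the start of the next header word (or the end of the line).
fun parseFixedWidth(text: String): List<List<String>> {
    val lines = text.lines().filter { it.isNotBlank() }
    if (lines.isEmpty()) return emptyList()
    val starts = Regex("""\S+""").findAll(lines.first()).map { it.range.first }.toList()
    return lines.map { line ->
        starts.mapIndexed { i, start ->
            val end = if (i + 1 < starts.size) minOf(starts[i + 1], line.length) else line.length
            // Lines shorter than a column's start yield an empty cell.
            if (start >= line.length) "" else line.substring(start, end).trim()
        }
    }
}

fun main() {
    val output = listOf(
        "NAME      ECDS      VERSION",
        "foo.dev   NOT SENT  1.18.7",
    ).joinToString("\n")
    // First row is the header; "NOT SENT" stays in one cell because the split is by position.
    parseFixedWidth(output).forEach(::println)
}
```

Because cells are cut by position rather than by delimiter, single spaces inside a value and empty cells are both handled, which the regex approach cannot guarantee.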