rust-csv
Support multi-character delimiters
What version of the csv crate are you using?
1.1.3
Briefly describe the question, bug or feature request.
This was briefly discussed in #47, but I'd like to see support for delimiters of multiple characters. For use cases where CSV files may contain arbitrary strings, a single-character delimiter is often not enough. In that case, the ability to specify a multi-character delimiter is very useful.
This feature is not supported by Python's csv module, but it is supported in pandas through the sep parameter (see https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html).
Include a complete program demonstrating a problem.
Example with git log:
$ git log --pretty=format:"%h|%s|%aN|%aE|%aD" -n 10 > test.csv
$ cat test.csv
331e56ec2|This is a redacted message.|Redacted Name|[email protected]|Wed, 7 Oct 2020 09:00:51 -0700
280c2120e|This is a redacted message.|Redacted Name|[email protected]|Tue, 6 Oct 2020 16:48:09 -0700
c0e58d9d6|This is a redacted message.|Redacted Name|[email protected]|Tue, 6 Oct 2020 16:42:50 -0700
>>> import pandas as pd
>>> pd.read_csv('test.csv', sep=r'\|')
331e56ec2 This is a redacted message. Redacted Name [email protected] Wed, 7 Oct 2020 09:00:51 -0700
0 280c2120e This is a redacted message. Redacted Name [email protected] Tue, 6 Oct 2020 16:48:09 -0700
1 c0e58d9d6 This is a redacted message. Redacted Name [email protected] Tue, 6 Oct 2020 16:42:50 -0700
Random example:
$ cat test2.csv
a|||g
b|||h
c|||i
d|||j
e|||k
f|||l
>>> import pandas as pd
>>> pd.read_csv('test2.csv', sep=r'\|\|\|')
   a  g
0  b  h
1  c  i
2  d  j
3  e  k
4  f  l
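For comparison, here is a minimal Rust sketch of what splitting on a multi-character delimiter looks like without the csv crate, using only the standard library. The helper name `split_records` is hypothetical and not part of csv. It assumes fields never contain the delimiter and that no quoting or escaping is needed, which is exactly the part a real CSV parser would have to handle.

```rust
/// Splits every non-empty line of `input` on a multi-character delimiter.
///
/// Hypothetical helper for illustration only: it performs no quoting or
/// escaping, so a field that contains the delimiter is split as well.
fn split_records<'a>(input: &'a str, delimiter: &str) -> Vec<Vec<&'a str>> {
    input
        .lines()
        // Skip blank lines, e.g. a trailing newline at the end of the file.
        .filter(|line| !line.is_empty())
        // `str::split` accepts a string pattern of any length.
        .map(|line| line.split(delimiter).collect())
        .collect()
}
```

Calling `split_records("a|||g\nb|||h\n", "|||")` yields one `Vec` per line, each holding the two fields of that line.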
Let me know if this sounds reasonable! If so, I'd be happy to help implement this.
Your example doesn't really require multiple delimiter support though. I would rather see a real world example where this is necessary.
In truth, I'm not opposed to this on principle. I'm opposed to it because it is not as easy to implement as you might think. It's quite the complication, actually. Compare the performance of pandas with this library, for example. And otherwise, virtually all CSV libraries do not support this, so it doesn't seem like it's either particularly important or worth the implementation complexity.
I'm dealing with testing a vendor-specified format that is quoted and whitespace-delimited:
- Quoted and non-quoted entries are supported.
- Double-quote is the quoting character.
- Entries with whitespace must be quoted.
- Double-quote and backslash characters inside a quoted entry must be escaped with a backslash.
- Escaping is not supported in non-quoted entries.
- Linefeed escape sequences (\n) are supported in quoted strings.
- Linefeed escape sequences are trimmed from the end of an entry.
csv works great for all of these requirements as long as I stick to a single space character delimiter. To be more flexible, I'd like to be able to specify the delimiter as one or more of the ASCII whitespace characters (\s+ regex, essentially). Would you consider incorporating this feature? I agree that this is not a CSV-like format, but csv is so good at parsing this kind of structured text data. Multiple delimiters would allow using all the other goodness on a wider range of formats with no(?) downsides, because it is an opt-in feature.
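One possible workaround, sketched below under stated assumptions, is to normalize each line before handing it to csv: collapse every run of spaces and tabs outside a quoted entry into a single space, so that a reader built with `ReaderBuilder::delimiter(b' ')` sees exactly one delimiter between entries. The function name `collapse_delimiters` is hypothetical. The sketch assumes one record per line, treats only space and tab as delimiter characters, and assumes a double quote always opens or closes a quoted entry.

```rust
/// Collapses each run of spaces/tabs outside a double-quoted entry into a
/// single space and drops leading and trailing runs.
///
/// Hypothetical pre-processing helper, not part of the csv crate. Inside
/// quotes, a backslash escapes the next character, as the format requires.
fn collapse_delimiters(line: &str) -> String {
    let mut out = String::with_capacity(line.len());
    let mut in_quotes = false;
    let mut escaped = false; // previous char was an unescaped backslash
    let mut pending_space = false; // a delimiter run is waiting to be emitted

    for ch in line.chars() {
        if in_quotes {
            // Quoted content is copied verbatim, whitespace included.
            out.push(ch);
            if escaped {
                escaped = false;
            } else if ch == '\\' {
                escaped = true;
            } else if ch == '"' {
                in_quotes = false;
            }
        } else if ch == ' ' || ch == '\t' {
            // Remember the run, but ignore it at the start of the line.
            pending_space = !out.is_empty();
        } else {
            // Emit one space for the whole run, only when an entry follows.
            if pending_space {
                out.push(' ');
                pending_space = false;
            }
            if ch == '"' {
                in_quotes = true;
            }
            out.push(ch);
        }
    }
    out
}
```

The normalized line keeps quoted whitespace intact, so the quoting and escaping rules can still be left to the parser that consumes it.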
@overhacked I'm not clear on why your format requires multi-byte delimiters.
But no, it's unlikely that I'll ever add this. You claim there are no downsides, and that may be true from the API perspective, but there are downsides with respect to performance and code complexity.