cassava
Implementation of space-delimited data
Encoding and decoding for all APIs, together with tests and documentation. Previously discussed at #14
Benchmarks don't indicate any performance regression for CSV.
I had a very brief look at the patch on my phone. Do we really need new top-level functions? Can't we just extend DecodeOptions?
I didn't give it much thought, but it looks like a good idea. The only question is how to extend DecodeOptions. `decDelimiter` only makes sense for CSV data.
data DecodeOptions = SpaceDelim | CSV Word8
Another option is to use a record and ignore the custom delimiter:
data DecodeOptions = DO { decDelim :: Word8, spaceDelim :: Bool }
Maybe the flag for skipping the header should go into DecodeOptions too.
Could you describe the grammar and escaping rules for space delimited data? Can there be multiple spaces between columns? Can there be other space characters than ASCII spaces?
I think we should be able to support this by adding fields to DecodeOptions for recognizing the separators, escaping the right characters, etc. Python does it this way. I don't think we need new top-level functions, except a convenient spaceDelimDecodeOptions :: DecodeOptions constant.
AFAIK there is no standard for space-delimited data (or whatever it's called), and if such a standard does exist, no one follows it; everyone just does whatever they like. So it's important to be permissive. Fields are separated by one or more spaces or tabs, or any mixture of them. Leading and trailing spaces should also be dropped. The following data has leading spaces, multiple spaces as separators, and trailing spaces:
1 a 1.22
123 bcd 3.3
88 c 0.4 ← invisible trailing space
1078 d 0.4
Escaping rules are a complicated matter. I work mostly with numbers, so I don't know what escaping schemes are used. I assumed CSV escaping.
Here is the grammar, written as Haskell-like pseudocode. I hope it's understandable:
row = many ws *> field `sepBy` many1 ws <* many ws
field = csvEscapedField <|> many1 notWS
ws = ' ' <|> '\t'
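To make the pseudocode concrete, here is a minimal sketch of that grammar for a single row, written with plain String functions rather than attoparsec on ByteString (which a real implementation would use). The function name parseRow and the permissive handling of unterminated quotes are assumptions for illustration, not part of cassava.

```haskell
-- Sketch of the row grammar: drop leading whitespace, then fields
-- separated by one-or-more spaces/tabs, with trailing whitespace dropped.
-- A field is either CSV-style quoted ("" is a literal quote) or a run
-- of one or more non-whitespace characters.
parseRow :: String -> [String]
parseRow = go . dropWhile isWS
  where
    isWS c = c == ' ' || c == '\t'

    go "" = []
    go s  = let (f, rest) = field s
            in f : go (dropWhile isWS rest)

    -- field = csvEscapedField <|> many1 notWS
    field ('"':s) = escaped s
    field s       = span (not . isWS) s

    -- Inside quotes everything is literal; "" decodes to a single quote.
    escaped ('"':'"':s) = let (f, r) = escaped s in ('"':f, r)
    escaped ('"':s)     = ("", s)
    escaped (c:s)       = let (f, r) = escaped s in (c:f, r)
    escaped ""          = ("", "")  -- be permissive about unterminated quotes
```

Note that unquoted fields are at least one character long, which keeps the grammar unambiguous on trailing whitespace.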
Would it be enough to change decDelimiter to a Parser ByteString? If we did that, you could parse space-delimited data using the current code by setting decDelimiter = many (' ' <|> '\t'). We just need to make sure this doesn't kill performance. We would also need to add a decStripWhitespace option.
Check out the Python (http://docs.python.org/3/library/csv.html) and Go (http://golang.org/pkg/encoding/csv/) CSV modules for ideas. They already support customizations like these.
Maybe. There could be some subtle issues; I need to think about it.
Here are the grammars for both CSV and space-delimited data.
CSV grammar
file = [header CRLF] record *(CRLF record) [CRLF]
header = name *(COMMA name)
record = field *(COMMA field)
name = field
field = escaped | non-escaped
escaped = DQUOTE *(TEXTDATA | COMMA | CR | LF | 2DQUOTE) DQUOTE
non-escaped = *TEXTDATA
COMMA = ',' or 1-byte delimiter
DQUOTE = "
CRLF = CR LF | LF
TEXTDATA = ^[ DQUOTE COMMA CR LF ]
LF = %x0A
CR = %x0D
Space delimited grammar
file = [header CRLF] record *(CRLF record) [CRLF]
header = *WS name *(+WS name) *WS
record = *WS field *(+WS field) *WS
name = field
field = escaped | non-escaped
escaped = DQUOTE *(TEXTDATA | WS | CR | LF | 2DQUOTE) DQUOTE
non-escaped = +TEXTDATA
DQUOTE = "
CRLF = CR LF | LF
TEXTDATA = ^[ DQUOTE WS CR LF ]
LF = %x0A
CR = %x0D
WS = %x20 | %x09
There are three differences: different separators, different field parsers, and the space-delimited parser drops leading/trailing spaces. The changes in the field parser are necessary. First, it needs to stop on both space and tab. Second, an unescaped field must be at least one character long. This is important; otherwise we have an ambiguous grammar. For example, the line "a " could be parsed as:
record 0WS (non-escaped "a") [] 1WS
record 0WS (non-escaped "a") [1WS (non-escaped "")] 0WS
I hope it's clear.
I've looked at both the Python and Go libraries. It looks like neither could be used to parse data using the grammar above. It looks like the only way to push grammar selection into the options is to either enumerate the grammars or to set the field and header parsers directly.
I thought about this a bit today. I didn't get much further than breaking down the changes into a diff:
-header = name *(COMMA name)
-record = field *(COMMA field)
+header = *WS name *(+WS name) *WS
+record = *WS field *(+WS field) *WS
-non-escaped = *TEXTDATA
+non-escaped = +TEXTDATA
-TEXTDATA = ^[ DQUOTE COMMA CR LF ]
+TEXTDATA = ^[ DQUOTE WS CR LF ]
I made some initial attempts at adding more fields to DecodeOptions, but it didn't work out.
I think the grammars are different enough that it's difficult to unify them using options. The best I can come up with:
data DecOpt = CSV Word8 | SpaceDelim
But then we lose record update syntax. It is possible to put everything into a record:
data DecOpt = DecOpt
{ isSpaceDelim :: Bool
, csvSeparator :: Word8
}
In this case fields have different meanings depending on the values of other fields: csvSeparator doesn't mean anything if isSpaceDelim is True.
There is also the type class approach. It could be extended to handle any CSV-like format. But are there any others?
class DecOpt a where
  -- Return parsers for the header and an ordinary record
  toDecOpt :: a -> (Parser, Parser)

data CsvOpt = CsvOpt Word8
instance DecOpt CsvOpt where
  toDecOpt (CsvOpt d) = (header d, record d)

data SpaceDelim = SpaceDelim
instance DecOpt SpaceDelim where
  toDecOpt _ = (headerTable, recordTable)
Let me start with the constraints I'm working with:
- The more general we make the library (e.g. by providing more top-level combinators), the more difficult it becomes for users to grasp. For example, if we double the number of top-level decode and encode functions, there's more stuff for the users to understand.
- Certain kinds of additions lead to a doubling of the number of functions in the API. For example, supporting files with and without headers (decode and decodeByName) led to such a doubling. I decided to try to only add new top-level functions if the return value differs (as in the case of decode and decodeByName) and handle any format differences using the options records.
- At some point we'll just be writing another generic parser library and we'll end up in the Turing tarpit, where everything can be done using cassava, but nothing can be done well and/or easily. I'm trying to constrain the problem domain we're working in to prevent that from happening. We don't want to end up with decodeOptions :: CsvOpts ... | SpaceDelimOpts ... | XmlOpts .... ;)
At the extreme, the user could just provide a Parser (Vector (Vector ByteString)) and we could offer a decodeUsing :: FromRecord a => Parser (Vector (Vector ByteString)) -> ByteString -> Vector a. That would be the most general option (although not the most efficient, as we would always need to construct the vector of vectors of byte strings in memory).
I'd like to avoid this if possible, especially if it's not needed. I went and looked at the output formats of Matlab, Octave, and Excel, and they all use a single separator (space or tab), with the exception of Excel's Formatted Text output, which uses several spaces.
I went ahead and checked the current field parser and it actually uses this grammar:
TEXTDATA = ^[ DQUOTE <escape-char> CR LF ]
so it almost does what your grammar specifies. If we made decDelimiter a predicate function, it should do exactly what your grammar does (i.e. disallow any kind of whitespace).
At first I thought we could change the parsing code for separators from word8 (decDelimiter opts) to many (word8 (decDelimiter opts)), but that doesn't work, as this CSV data would then parse with its empty field silently collapsed:
a,b,,c
Hence we'd have to change decDelimiter to be of type Parser (), as I mentioned before.
Well, I don't like the type class idea either. There aren't enough CSV-like formats to justify such generality.
I don't think that switching decDelimiter to Parser () is a good idea. The parser for unescaped fields requires a predicate on the delimiter character. Since we need a predicate anyway, we can just add a flag to choose between one character and one-or-more characters as the delimiter. In the same way, it's possible to add a flag for dropping leading/trailing whitespace.
data DecOpt = DecOpt
  { decDelim :: Word8 -> Bool
  , oneOrMoreDelim :: Bool
  , stripDelimOnEdge :: Bool
  }
Then the default options are:
csvOpts = DecOpt (== ',') False False
spaceDelimOpts = DecOpt (\c -> c == ' ' || c == '\t') True True
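To see how these three options would drive the separator logic, here is a small self-contained sketch of a line splitter. Char stands in for Word8 and quoting/escaping is omitted for brevity; splitLine is an illustrative name, not cassava's API.

```haskell
import Data.List (dropWhileEnd)

-- The proposed options record, with Char in place of Word8.
data DecOpt = DecOpt
  { decDelim         :: Char -> Bool
  , oneOrMoreDelim   :: Bool
  , stripDelimOnEdge :: Bool
  }

csvOpts, spaceDelimOpts :: DecOpt
csvOpts        = DecOpt (== ',') False False
spaceDelimOpts = DecOpt (\c -> c == ' ' || c == '\t') True True

-- Split one line into fields, honoring all three flags.
splitLine :: DecOpt -> String -> [String]
splitLine o = go . edges
  where
    -- Drop separators at the start/end of the line, if requested.
    edges | stripDelimOnEdge o = dropWhileEnd (decDelim o) . dropWhile (decDelim o)
          | otherwise          = id
    -- Collapse runs of separators into one, if requested.
    skip  | oneOrMoreDelim o   = dropWhile (decDelim o)
          | otherwise          = id
    go s = case break (decDelim o) s of
      (f, [])       -> [f]
      (f, _ : rest) -> f : go (skip rest)
```

With csvOpts the double comma in "a,b,,c" yields an empty field, while spaceDelimOpts collapses separator runs and trims the edges.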
I almost forgot about this
So does the following design for decode/encode options seem reasonable to you? It does manage to unify CSV and space-delimited data.
data DecOpt = DecOpt
  { decDelim :: Word8 -> Bool
    -- ^ Predicate for the separator character
  , oneOrMoreDelim :: Bool
    -- ^ Whether consecutive separator characters should be treated as a single separator
  , stripDelimOnEdge :: Bool
    -- ^ Whether separators at the start/end of a line should be dropped
  }
data EncOpt = EncOpt
  { encDelim :: Word8
  , extraCharsToEscape :: Word8 -> Bool
    -- ^ Any other characters which should be escaped
  }
Have you tried to implement it to make sure it actually works? :)
Not yet. Being naturally lazy, I like to avoid obviously wrong ideas as early as possible. But it's a simple enumeration of the differences in the grammars, so there shouldn't be any problems. Most of the pieces are already implemented and only need reshuffling.
I'll look into it.
I'm working on this on the https://github.com/tibbe/cassava/tree/space-delim branch.
I've implemented correct trimming of spaces. It turned out to be tricky, and I had to rewrite the record parser.
Correctly handling leading/trailing spaces
It's hard to strip whitespace correctly because:
a) It's a valid part of a field in CSV, so
"a,b,c " -> ["a","b","c "]
b) If we're using spaces as the delimiter, we get a spurious empty field at the end of the line:
"a b c " -> ["a","b","c",""]
The only reliable way to strip them is to read the whole line, strip the spaces, and parse the stripped line.
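The strip-then-parse approach can be sketched like this, with plain String functions (the real parser works on ByteString); splitOn and parseStripped are illustration names, not cassava's API.

```haskell
import Data.Char (isSpace)
import Data.List (dropWhileEnd)

-- Naive split on a single delimiter: a trailing space yields a
-- spurious empty field, exactly the problem described above.
splitOn :: Char -> String -> [String]
splitOn d s = case break (== d) s of
  (f, [])       -> [f]
  (f, _ : rest) -> f : splitOn d rest

-- Strip the whole line first, then parse the stripped line.
parseStripped :: String -> [String]
parseStripped = splitOn ' ' . dropWhileEnd isSpace . dropWhile isSpace
```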
I'm working on the space2 branch in my repo.
The implementation is mostly complete. There are no tests for encoding yet, and there are a few performance regressions too.
Only reliable way to strip them is to read whole line, strip spaces and parse stripped line.
I tried as well and came to the same conclusion.