cassava
Implementation of space-delimited data
Encoding and decoding for all APIs, together with tests and documentation. Previously discussed at #14
Benchmarks don't indicate any performance regression for CSV.
I had a very brief look at the patch on my phone. Do we really need new top-level functions? Can't we just extend DecodeOptions?
I didn't give it much thought, but it looks like a good idea. The only question is how to extend DecodeOptions. `decDelimiter` only makes sense for CSV data.
data DecodeOptions = SpaceDelim | CSV Word8
Another option is to use a record and ignore the custom delimiter:
data DecodeOptions = DO { decDelim :: Word8, spaceDelim :: Bool }
Maybe the flag for skipping the header should go into DecodeOptions too.
Could you describe the grammar and escaping rules for space delimited data? Can there be multiple spaces between columns? Can there be other space characters than ASCII spaces?
I think we should be able to support this by adding fields to DecodeOptions for recognizing the separators, escaping the right characters, etc. Python does it this way. I don't think we need new top-level functions, except a convenient spaceDelimDecodeOptions :: DecodeOptions constant.
AFAIK there is no standard for space-delimited data (or whatever it's called), and if such a standard does exist, no one follows it; everyone just does whatever they like. So it's important to be permissive. Fields are separated by one or more spaces or tabs, or any mixture of them. Leading and trailing spaces should also be dropped. The following data has leading spaces, multiple spaces as separators, and trailing spaces:
1 a 1.22
123 bcd 3.3
88 c 0.4 ← invisible trailing space
1078 d 0.4
Escaping rules are a complicated matter. I work mostly with numbers, so I don't know what escaping schemes are used. I assumed CSV escaping.
Here is the grammar, written as Haskell-like pseudocode. I hope it's understandable:
row = many ws *> field `sepBy` many1 ws <* many ws
field = csvEscapedField <|> many1 notWS
ws = ' ' <|> '\t'
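To make the pseudocode concrete, here is a minimal sketch of that grammar for a single row, written with plain String functions rather than attoparsec on ByteString (which a real implementation would use). The function name parseRow and the permissive handling of unterminated quotes are assumptions for illustration, not part of cassava.

```haskell
-- Sketch of the row grammar: drop leading whitespace, then fields
-- separated by one-or-more spaces/tabs, with trailing whitespace dropped.
-- A field is either CSV-style quoted ("" is a literal quote) or a run
-- of one or more non-whitespace characters.
parseRow :: String -> [String]
parseRow = go . dropWhile isWS
  where
    isWS c = c == ' ' || c == '\t'

    go "" = []
    go s  = let (f, rest) = field s
            in f : go (dropWhile isWS rest)

    -- field = csvEscapedField <|> many1 notWS
    field ('"':s) = escaped s
    field s       = span (not . isWS) s

    -- Inside quotes everything is literal; "" decodes to a single quote.
    escaped ('"':'"':s) = let (f, r) = escaped s in ('"':f, r)
    escaped ('"':s)     = ("", s)
    escaped (c:s)       = let (f, r) = escaped s in (c:f, r)
    escaped ""          = ("", "")  -- be permissive about unterminated quotes
```

Note that unquoted fields are at least one character long, which keeps the grammar unambiguous on trailing whitespace.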
Would it be enough to change decDelimiter to a Parser ByteString? If we did that, you could parse space-delimited data using the current code by setting decDelimiter = many (' ' <|> '\t'). We just need to make sure this doesn't kill performance. We would also need to add a decStripWhitespace option.
Check out the Python (http://docs.python.org/3/library/csv.html) and Go (http://golang.org/pkg/encoding/csv/) CSV modules for ideas. They already support customizations like these.
Maybe. There could be some subtle issues; I need to think about it.
Here are the grammars for both CSV and space-delimited data.
CSV grammar
file = [header CRLF] record *(CRLF record) [CRLF]
header = name *(COMMA name)
record = field *(COMMA field)
name = field
field = escaped | non-escaped
escaped = DQUOTE *(TEXTDATA | COMMA | CR | LF | 2DQUOTE) DQUOTE
non-escaped = *TEXTDATA
COMMA = ',' or 1-byte delimiter
DQUOTE = "
CRLF = CR LF | LF
TEXTDATA = ^[ DQUOTE COMMA CR LF ]
LF = %x0A
CR = %x0D
Space delimited grammar
file = [header CRLF] record *(CRLF record) [CRLF]
header = *WS name *(+WS name) *WS
record = *WS field *(+WS field) *WS
name = field
field = escaped | non-escaped
escaped = DQUOTE *(TEXTDATA | WS | CR | LF | 2DQUOTE) DQUOTE
non-escaped = +TEXTDATA
DQUOTE = "
CRLF = CR LF | LF
TEXTDATA = ^[ DQUOTE WS CR LF ]
LF = %x0A
CR = %x0D
WS = %x20 | %x09
There are three differences: different separators, different field parsers, and the space-delimited parser drops leading/trailing spaces. The changes in the field parser are necessary. First, it needs to stop on both space and tab. Second, an unescaped field must be at least one character long. This is important; otherwise we have an ambiguous grammar. For example, the line "a " could be parsed as:
record 0WS (non-escaped "a") [] 1WS
record 0WS (non-escaped "a") [1WS (non-escaped "")] 0WS
I hope it's clear.
I've looked at both the Python and Go libraries. It looks like neither could be used to parse data using the grammar above. It looks like the only way to push grammar selection into the options is to either enumerate the grammars or to set the field and header parsers directly.
I thought about this a bit today. I didn't get much further than breaking down the changes into a diff:
-header = name *(COMMA name)
-record = field *(COMMA field)
+header = *WS name *(+WS name) *WS
+record = *WS field *(+WS field) *WS
-non-escaped = *TEXTDATA
+non-escaped = +TEXTDATA
-TEXTDATA = ^[ DQUOTE COMMA CR LF ]
+TEXTDATA = ^[ DQUOTE WS CR LF ]
I made some initial attempts at adding more fields to DecodeOptions, but it didn't work out.
I think the grammars are different enough that it's difficult to unify them using options. The best I can come up with:
data DecOpt = CSV Word8 | SpaceDelim
But then we lose record update syntax. It is possible to put everything into a record:
data DecOpt = DecOpt
{ isSpaceDelim :: Bool
, csvSeparator :: Word8
}
In this case fields have different meanings depending on the values of other fields: csvSeparator doesn't mean anything if isSpaceDelim is True.
There is also the type class approach. It could be extended to handle any CSV-like format. But are there any others?
class DecOpt a where
  -- Return parsers for the header and an ordinary record
  toDecOpt :: a -> (Parser, Parser)

data CsvOpt = CsvOpt Word8
instance DecOpt CsvOpt where
  toDecOpt (CsvOpt d) = (header d, record d)

data SpaceDelim = SpaceDelim
instance DecOpt SpaceDelim where
  toDecOpt _ = (headerTable, recordTable)
Let me start with the constraints I'm working with:
- The more general we make the library (e.g. by providing more top-level combinators), the more difficult it becomes for users to grasp. For example, if we double the number of top-level decode and encode functions, there's more stuff for the users to understand.
- Certain kinds of additions lead to a doubling of the number of functions in the API. For example, supporting files with and without headers (decode and decodeByName) led to such a doubling. I decided to try to only add new top-level functions if the return value differs (as in the case of decode and decodeByName) and handle any format differences using the options records.
- At some point we'll just be writing another generic parser library and we'll end up in the Turing tarpit, where everything can be done using cassava, but nothing can be done well and/or easily. I'm trying to constrain the problem domain we're working in to prevent that from happening. We don't want to end up with decodeOptions :: CsvOpts ... | SpaceDelimOpts ... | XmlOpts .... ;)
At the extreme, the user could just provide a Parser (Vector (Vector ByteString)) and we could offer a decodeUsing :: FromRecord a => Parser (Vector (Vector ByteString)) -> ByteString -> Vector a. That would be the most general option (although not the most efficient, as we would always need to construct the vector of vectors of byte strings in memory).
I'd like to avoid this if possible, especially if it's not needed. I went and looked at the output formats of Matlab, Octave, and Excel, and they all use a single separator (space or tab), with the exception of Excel's Formatted Text output, which uses several spaces.
I went ahead and checked the current field parser and it actually uses this grammar:
TEXTDATA = ^[ DQUOTE <escape-char> CR LF ]
so it almost does what your grammar specifies. If we made decDelimiter a predicate function, it should do exactly what your grammar does (i.e. disallow any kind of whitespace).
At first I thought we could change the parsing code for separators from word8 (decDelimiter opts) to many (word8 (decDelimiter opts)), but that doesn't work, as this CSV data would then parse with its empty field silently collapsed:
a,b,,c
Hence we'd have to change decDelimiter to be of type Parser (), as I mentioned before.
Well, I don't like the type class idea either. There aren't enough CSV-like formats to justify such generality.
I don't think that switching decDelimiter to Parser () is a good idea. The parser for unescaped fields requires a predicate on the delimiter character. Since we need a predicate anyway, we can just add a flag to choose between one character and one-or-more characters as the delimiter. In the same way, it's possible to add a flag for dropping leading/trailing whitespace.
data DecOpt = DecOpt
  { decDelim :: Word8 -> Bool
  , oneOrMoreDelim :: Bool
  , stripDelimOnEdge :: Bool
  }
Then the default options are:
csvOpts = DecOpt (== ',') False False
spaceDelimOpts = DecOpt (\c -> c == ' ' || c == '\t') True True
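To see how these three options would drive the separator logic, here is a small self-contained sketch of a line splitter. Char stands in for Word8 and quoting/escaping is omitted for brevity; splitLine is an illustrative name, not cassava's API.

```haskell
import Data.List (dropWhileEnd)

-- The proposed options record, with Char in place of Word8.
data DecOpt = DecOpt
  { decDelim         :: Char -> Bool
  , oneOrMoreDelim   :: Bool
  , stripDelimOnEdge :: Bool
  }

csvOpts, spaceDelimOpts :: DecOpt
csvOpts        = DecOpt (== ',') False False
spaceDelimOpts = DecOpt (\c -> c == ' ' || c == '\t') True True

-- Split one line into fields, honoring all three flags.
splitLine :: DecOpt -> String -> [String]
splitLine o = go . edges
  where
    -- Drop separators at the start/end of the line, if requested.
    edges | stripDelimOnEdge o = dropWhileEnd (decDelim o) . dropWhile (decDelim o)
          | otherwise          = id
    -- Collapse runs of separators into one, if requested.
    skip  | oneOrMoreDelim o   = dropWhile (decDelim o)
          | otherwise          = id
    go s = case break (decDelim o) s of
      (f, [])       -> [f]
      (f, _ : rest) -> f : go (skip rest)
```

With csvOpts the double comma in "a,b,,c" yields an empty field, while spaceDelimOpts collapses separator runs and trims the edges.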
I almost forgot about this
So does the following design for decode/encode options seem reasonable to you? It does manage to unify CSV and space-delimited data.
data DecOpt = DecOpt
  { decDelim :: Word8 -> Bool
    -- ^ Predicate for the separator character
  , oneOrMoreDelim :: Bool
    -- ^ Whether consecutive separator characters should be treated as a single separator
  , stripDelimOnEdge :: Bool
    -- ^ Whether separators at the start/end of a line should be dropped
  }
data EncOpt = EncOpt
  { encDelim :: Word8
  , extraCharsToEscape :: Word8 -> Bool
    -- ^ Any other characters which should be escaped
  }
Have you tried to implement it to make sure it actually works? :)
Not yet. Being naturally lazy, I like to avoid obviously wrong ideas as early as possible. But it's a simple enumeration of the differences in the grammars, so there shouldn't be any problems. Most of the pieces are already implemented and only need reshuffling.
I'll look into it.
I'm working on this on the https://github.com/tibbe/cassava/tree/space-delim branch.
I've implemented correct trimming of spaces. It turned out to be tricky, and I had to rewrite the record parser.
Correctly handling leading/trailing spaces
It's hard to strip whitespace correctly because:
a) It's a valid part of a field in CSV, so
"a,b,c " -> ["a","b","c "]
b) If we're using spaces as the delimiter, we get a spurious empty field at the end of the line:
"a b c " -> ["a","b","c",""]
The only reliable way to strip them is to read the whole line, strip the spaces, and parse the stripped line.
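The strip-then-parse approach can be sketched like this, with plain String functions (the real parser works on ByteString); splitOn and parseStripped are illustration names, not cassava's API.

```haskell
import Data.Char (isSpace)
import Data.List (dropWhileEnd)

-- Naive split on a single delimiter: a trailing space yields a
-- spurious empty field, exactly the problem described above.
splitOn :: Char -> String -> [String]
splitOn d s = case break (== d) s of
  (f, [])       -> [f]
  (f, _ : rest) -> f : splitOn d rest

-- Strip the whole line first, then parse the stripped line.
parseStripped :: String -> [String]
parseStripped = splitOn ' ' . dropWhileEnd isSpace . dropWhile isSpace
```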
I'm working on the space2 branch in my repo.
The implementation is mostly complete. There are no tests for encoding yet, and there are a few performance regressions too.
Only reliable way to strip them is to read whole line, strip spaces and parse stripped line.
I tried as well and came to the same conclusion.