cassava
cassava copied to clipboard
Field parsers do no check for end of input
In numeric fields it could lead to silent truncation:
>>> import Data.Csv
>>> import Data.Vector (Vector)
>>> decode NoHeader "12.7" :: Either String (Vector (Only Int))
Right (fromList [Only {fromOnly = 12}])
Or worse:
>>> decode NoHeader "1e6" :: Either String (Vector (Only Int))
Right (fromList [Only {fromOnly = 1}])
Also it would be good idea to allow trailing (and probably leading) whitespaces since those are frequently added for readability in hand-written CSV
Good catch. This is a bit odd since decode
eventually ends up calling the parser called csv
, defined in Encoding.hs
:
csv :: FromRecord a => DecodeOptions -> AL.Parser (V.Vector a)
csv !opts = do
vals <- records
_ <- optional endOfLine
endOfInput
return $! V.fromList vals
where
records = do
!r <- record (decDelimiter opts)
if blankLine r
then (endOfLine *> records) <|> pure []
else case runParser (parseRecord r) of
Left msg -> fail $ "conversion error: " ++ msg
Right val -> do
!vals <- (endOfLine *> records) <|> pure []
return (val : vals)
Note the call to endOfInput
. Perhaps that attoparsec endOfInput
combinator is broken. Alternatively perhaps the decimal
combinator in attoparsec consumes the whole literal, even if it ignores part of it. I will have to investigate.
No decimal doesn't do anything like that
Prelude Data.Attoparsec Data.Attoparsec.ByteString.Char8> parse decimal "12.7"
Done ".7" 12
From what I can tell, all of the field parsing functions in Conversion.hs
are using the attoparsec parseOnly
function which silently ignores unconsumed input. The FromField
parsers should be changed to use a full attoparsec parse
and fail if the result isn't Done
with an empty leftover string.
The documentation for the attoparsec parseOnly
function neglects to mention that it succeeds even if there's left over input. But it's pretty clear from the source code:
parseOnly :: Parser a -> B.ByteString -> Either String a
parseOnly m s = case T.runParser m (I s) mempty Complete failK successK of
Fail _ _ err -> Left err
Done _ a -> Right a
_ -> error "parseOnly: impossible error!"
Great detective work. We should indeed switch to using parse
then. I've filed a bug with attoparsec to get the docs fixed.