parjs
parjs copied to clipboard
feature: Case insensitive string parsing
I encountered a use case where I would need a parser that could consume a string, but insensitive to the case of the letters it contains.
For example, I would like a parser that matches "foo"
, "Foo"
, "fOo"
, etc.
import { string } from "parjs";
const p = string("foo");
p.parse("foo");
// ParjsSuccess { value: 'foo', kind: 'OK' }
p.parse("Foo");
// ParjsFailure {
// trace: {
// userState: ParserUserState {},
// position: 0,
// reason: "expecting 'foo'",
// input: 'Foo',
// location: [Getter],
// stackTrace: [Array],
// kind: 'Soft'
// }
// }
Related:
- https://hackage.haskell.org/package/hsemail-1.3/docs/Text-ParserCombinators-Parsec-Rfc2234.html#1 has
caseString
- https://github.com/mrkkrp/megaparsec/blob/6fc90d777f5601f2fdadf2e0dc7259e2974fef23/Text/Megaparsec/Common.hs#L32C1-L42 has
string'
I think this is a good idea 👍🏻 I will try to find some time to implement it.
If you would like to give it a shot, let me know. It could be a nice task.
Looking at the reference implementation in the first link you provided:
-- |Case-insensitive variant of Parsec's 'char' function.
caseChar :: Char -> CharParser st Char
caseChar c = satisfy (\x -> toUpper x == toUpper c)
-- |Case-insensitive variant of Parsec's 'string' function.
caseString :: String -> CharParser st ()
caseString cs = mapM_ caseChar cs <?> cs
It looks like the parser checks each char.toUpperCase()
matches, one by one (probably? not sure but I think haskell usually evaluates lazily), or fails with the error message for the given string.
We have uniUpper
and upper
parsers, it would probably make sense to have a case insensitive version of each.
I tried to use different existing parsers and combinators, but never got something that worked to my liking. I followed a similar approach to the one you described, but the closest thing I got had funny results when used with the or
combinator (from memory as I don't have access to the laptop I tried on right now):
const p1 = caseString("foo");
const p2 = caseString("faa");
const p = p1.pipe(or(p2));
p.parse("FOO"); # worked
p2.parse("FAA"); # worked
p.parse("FAA"); # failed with "unexpected `a`"
I switched to implement it as a parser in this repo, as a close copy of the current string
implementation. Currently going through the process with my employer to be able to contribute it here...
Totally agree, this is required.
We do have an anyChar
parser that matches any character. It doesn't need a unicode version. uniUpper
is needed because of the uppercase restriction. Checking if a unicode character is upper case requires a lookup in a data structure (interval tree).
A case-insensitive version of string
would need to be written custom as @Pamplemousse said because after applying a parser once, most combinators go into hard fail mode and if the next application fails unexpectedly then it will be a hard failure.
I would love for you to contribute it :smile:.
A more general approach
While I think the case-instensitive string parser is needed, I also have another idea. Something interesting that hasn't been done before.
Suppose we require all parsers to implement an operator called caseInsensitive
which returns a case insensitive version of themselves. Combinators would implement it by delegating to the child parsers, while building block parsers would return a different parser.
For example, uniUpper.caseInsensitive
would be anyChar
. string
would give the case-insensitive version @Pamplemousse talked about.
The benefit is that you don't have to worry about case sensitivity when building the parser. You could build it however you want and then apply caseInsensitive
and get a case insensitive version.
For some combinators this might not work as expected. For example, there is no way to convert must
to a case-insensitive version. However, that's a parser that applies a predicate on the output, so it is different.
The question
The question is whether case insensitivity is general enough to warrant people who write parsers to implement the operation.