Earley
Earley copied to clipboard
WIP Disambiguation production
Definitely WIP at the moment,
The thrust of it is that the continuations are explicitly [a] -> [b] now, which for most productions is just a map, however for disambiguation, this structure is exposed.
Currently only nonterminals work, as, for example, the Pure case runs the continuation once for every individual element. What needs to change would be to maintain a list of a, and defer applying args. I had a quick look at jiggling things around towards that end without luck.
If this would be desirable even for just nonterminals, then I'd be happy to neaten things up from the state it's in at the moment.
Ok, I neatened it up a little and limited disambiguation to nonterminals.
Motivating example parsing an expression without precedence
-- λ> :main "A + B * C * G"
-- Ambiguous
-- ├╴ ((A+B)*(C*G))
-- ├╴ *
-- │ ├╴ Ambiguous
-- │ │ ├╴ ((A+B)*C)
-- │ │ └╴ (A+(B*C))
-- │ └╴ G
-- └╴ +
-- ├╴ A
-- └╴ Ambiguous
-- ├╴ (B*(C*G))
-- └╴ ((B*C)*G)
I'm sure there's still some neatening + testing to do!
As a further simplification to the implementation, I wonder if we could just make NonTerminal :: !(r e t a) -> !(Prod r e t ([a] -> [b])) -> Prod r e t b...
Ah, instead of making disambiguate operate in Grammar we can make the nonterminal 'live' during parse with something like this:
Disamb p d -> do
r <- mkRule p
scont' <- newConts =<< newSTRef [Cont C.id d args scont]
parse (State (NonTerminal r (pure id)) C.id pos scont' : ss) env
I don't think this should be a performance disadvantage compared to running in Grammar, however it feels bad putting pure id as the right prod for NonTerminal (I guess this would be solved by the previous comment). This also means that we end up calling the disamb function with an empty list some of the time which is also a little odd.
This is a neat idea. I had a little think about the API and implementation:
If we had
expose :: Prod r e t a -> Prod r e t [a]
and
conceal :: Prod r e t [a] -> Prod r e t a
(but perhaps with better chosen names), then it seems like we should be able to implement
disambiguate :: ([a] -> [b]) -> Prod r e t a -> Prod r e t b
disambiguate f = conceal . fmap f . expose
without it needing to be in grammar. This also seems to solve https://github.com/ollef/Earley/issues/62, because we can implement
constraint :: (a -> Bool) -> Prod r e t a -> Prod r e t a
constraint = disambiguate . filter
This is possible if we tweak the Prod type e.g. as follows (along the lines of one of your comments):
data Prod r e t a where
-- Applicative.
Terminal :: !(t -> Maybe a) -> !(Prod r e t (a -> [b])) -> Prod r e t b
NonTerminal :: !(r e t a) -> !(Prod r e t (a -> [b])) -> Prod r e t b
Pure :: [a] -> Prod r e t a
-- Monoid/Alternative. We have to special-case 'many' (though it can be done
-- with rules) to be able to satisfy the Alternative interface.
Alts :: ![Prod r e t a] -> !(Prod r e t (a -> [b])) -> Prod r e t b
Many :: !(Prod r e t a) -> !(Prod r e t ([a] -> [b])) -> Prod r e t b
-- Error reporting.
Named :: !(Prod r e t a) -> e -> Prod r e t a
because then we get
expose :: Prod r e t a -> Prod r e t [a]
expose (Terminal b p) = Terminal b (fmap (fmap pure) p)
expose (NonTerminal r p) = NonTerminal r (fmap (fmap pure) p)
expose (Pure x) = Pure [x]
expose (Alts as p) = Alts as (fmap (fmap pure) p)
expose (Many p q) = Many p (fmap (fmap pure) q)
expose (Named p n) = Named (expose p) n
conceal :: Prod r e t [a] -> Prod r e t a
conceal (Terminal b p) = Terminal b $ fmap (fmap concat) p
conceal (NonTerminal r p) = NonTerminal r $ fmap (fmap concat) p
conceal (Pure as) = Pure $ concat as
conceal (Alts as p) = Alts as $ fmap (fmap concat) p
conceal (Many p q) = Many p $ fmap (fmap concat) q
conceal (Named p n) = Named (conceal p) n
but I have not tried updating the other modules (parser etc).
Maybe it breaks down somewhere.
I like expose and conceal a lot! Much nicer than smashing it all into [a] -> [b].
It makes me worry a little bit from a performance point of view to be passing lists around everywhere in Prod and I feel like there's basically no chance of any fusion taking place in parse.
Just as an aside, @ollef, did you ever experiment with using Data.Array.ST for anywhere lists are used in parse? In fact, I wonder if this would be just slightly better done with Seq (or even Tree) given all the concatting... Must benchmark after
mmm, I think that keeping this to nonterminals might be the pragmatic thing to do, with your scheme above it doesn't seem possible to push Pure fs down into the leaves of Alts for example.
Attaching to nonterminals only seems important wrt parse as written, as they're the place where new continuations are made which act as "join" points in the continuation chain.
I haven't thought it through, but I wonder if it would be possible to push disambiguations all the way down to Alts, and have it be Alts :: ![Prod r e t a] -> !(Prod r e t ([a] -> b)) -> Prod r e t b
hmm, I've got a testcase where:
This works, and d is called with a list of length 2:
xx <-
disambiguate d $
(Foo <$ anyToken)
<|> (Bar <$ anyToken)
This doesn't work, d is called twice with a list of length 1 and the whole parse returns 2 results:
xx <-
disambiguate d $
(Foo <$ many anyToken)
<|> (Bar <$ anyToken)
Sadly I've long since flushed the context needed to easily understand what's going on :(
...
Oh, I still see several todos in the code, I guess I'm just hitting one of these cases past-me foresaw.
...
On a hunch I removed the first case in simplifyCont and this seems to sort things for at least this example, don't quite understand why though yet.