scalpel icon indicating copy to clipboard operation
scalpel copied to clipboard

Highly unintuitive behavior of `inSerial . stepNext . inSerial . stepNext`

Open bakhtiyarneyman opened this issue 3 years ago • 2 comments

Define lists as follows:

ghci> lists = inSerial . many . stepNext

When used not from under chroot, lists is idempotent.

ghci> scrapeStringLike "<foo><baz></baz></foo><bar></bar>" $ lists $ html anySelector
Just ["<foo><baz></baz></foo>","<bar></bar>"]
ghci> scrapeStringLike "<foo><baz></baz></foo><bar></bar>" $ lists $ lists $ html anySelector
Just [["<foo><baz></baz></foo>"],["<bar></bar>"]]
ghci> scrapeStringLike "<foo><baz></baz></foo><bar></bar>" $ lists $ lists $ lists $ html anySelector
Just [[["<foo><baz></baz></foo>"]],[["<bar></bar>"]]]

Not so much when used from under chroot:

ghci> scrapeStringLike "<outer><foo><baz></baz></foo><bar></bar></outer>" $ chroot "outer" $ lists $ html anySelector
Just ["<foo><baz></baz></foo>","<bar></bar>"]
ghci> scrapeStringLike "<outer><foo><baz></baz></foo><bar></bar></outer>" $ chroot "outer" $ lists $ lists $ html anySelector
Just [["<baz></baz>"],[]]
ghci> scrapeStringLike "<outer><foo><baz></baz></foo><bar></bar></outer>" $ chroot "outer" $ lists $ lists $ lists $ html anySelector
Just [[[]],[]]

This inconsistency makes it much harder to reason locally about the written code. IMHO for a good user experience, one of the two behaviors should be selected, and I think the former is better because it's less opinionated. (The second behavior could be obtained by interspersing chroots between lists calls).

I think, this can be fixed by setting ctxInChroot to False in toZipper (defined as a where binding for inSerial.)

bakhtiyarneyman avatar Oct 29 '22 23:10 bakhtiyarneyman

FWIW I don't particularly like the fact that inSerial changes the behavior when under a chroot. I find it unintuitive and it couples the combinators together. I'd prefer to have an explicit combinator that allows to drop inside the children of a node, rather than doing it implicitly in an ad-hoc way depending on whether we are under chroot or not.

Maybe something like:

children :: ScraperT str m a -> ScraperT str m a
root :: Selector -> ScraperT str m a -> ScraperT str m a
chroot = children . root 

This change would of course alter the behavior when getting attributes of the chrooted node.

bakhtiyarneyman avatar Oct 29 '22 23:10 bakhtiyarneyman

This does seem unintuitive :/

I think the reason for the discrepancy is that inSerial conceptually wraps the HTML in an outer set of tags when not in a chroot so that for HTML like <a>1</a><b>2</b>, inSerial would visit the <a> and <b> sub-trees.

It does seem broken that stacking multiple calls to inSerial wouldn't result in descending further into the tree.

fimad avatar Oct 30 '22 16:10 fimad