
String splitting algorithms could use optional "nesting characters"

Open tabatkins opened this issue 8 years ago • 7 comments

When splitting strings, it's reasonably common to want to split only on "top-level" instances of the split chars, with "nesting" characters, like parens, inside which you don't look for the splitting characters. For example, splitting a string on commas when the string can contain functions with comma-separated arguments.

Most of the strings I work with get parsed by CSS, which has a "split by top-level comma" algo already, so I don't have a concrete use for this in Infra just yet, but I use that algorithm commonly enough that I'd bet other people would benefit from having something like it available, at least for parens.

You'd need a list of start/end string pairs, and keep a stack of start strings seen that gets popped when the topmost end string is seen, and only trigger splitting when the stack is empty.
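
A minimal Python sketch of the stack approach described above. The function name `split_top_level`, its parameters, and the choice to treat an unmatched end string as ordinary text are all assumptions for illustration; none of this is defined by Infra, and escapes are not handled.

```python
def split_top_level(text, separator=",", pairs=(("(", ")"),)):
    """Split text on separator, skipping separators inside nesting pairs.

    Hypothetical helper: assumes separator and all start/end strings
    are non-empty.
    """
    if not separator:
        raise ValueError("separator must be non-empty")
    closers = dict(pairs)  # start string -> matching end string
    stack = []             # end strings still expected, innermost last
    parts = []
    current = []
    i = 0
    while i < len(text):
        # Pop the stack when the topmost expected end string appears.
        if stack and text.startswith(stack[-1], i):
            end = stack.pop()
            current.append(end)
            i += len(end)
            continue
        # Push when any start string appears.
        start = next((s for s in closers if text.startswith(s, i)), None)
        if start is not None:
            stack.append(closers[start])
            current.append(start)
            i += len(start)
            continue
        # Only split while the stack is empty, i.e. at the top level.
        if not stack and text.startswith(separator, i):
            parts.append("".join(current))
            current = []
            i += len(separator)
            continue
        current.append(text[i])
        i += 1
    parts.append("".join(current))
    return parts


print(split_top_level("a,f(b,g(c,d)),e"))
```

Checking for the end string before the start strings means a pair whose start and end are identical (such as quotes) still closes correctly. Input that ends with the stack non-empty is returned without any further splits; a real algorithm would need to decide whether that is an error.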

tabatkins avatar Mar 27 '17 15:03 tabatkins

It seems your comment got cut off. Is this used outside CSS parsers? Because inside CSS you also need to handle all kinds of other CSS rarities, such as escapes, and at that point you might as well invoke the CSS parser to be sure.

annevk avatar Mar 27 '17 15:03 annevk

Comment was cut off and I finished editing immediately, but apparently not before you checked the thread. Check again. ^_^

tabatkins avatar Mar 28 '17 00:03 tabatkins

srcset parser has something like this, but without a stack (it uses a dumb state machine).

zcorpan avatar Mar 28 '17 16:03 zcorpan

Looks like srcset only looks for parens to account for possible future CSS functions? Right now the only valid descriptors are 1w, 1x, and 1h.

If that's the case, then the algo is broken - it'll misparse at times once that starts being allowed. It needs to track nesting, and you're making my point for me. ^_^
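
To illustrate the misparse, here is a rough Python sketch comparing a single "in parens" flag with a nesting depth counter. Both functions are simplified stand-ins written for this example, with hypothetical names; neither is the actual srcset algorithm, which does more than split on commas.

```python
def split_with_flag(text, separator=","):
    """Splitter that only remembers whether it is inside parens (no nesting)."""
    parts, current, in_parens = [], [], False
    for ch in text:
        if ch == "(":
            in_parens = True
        elif ch == ")":
            in_parens = False  # any ")" leaves the parens state
        if ch == separator and not in_parens:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


def split_with_depth(text, separator=","):
    """Splitter that tracks how many parens are currently open."""
    parts, current, depth = [], [], 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")" and depth > 0:
            depth -= 1
        if ch == separator and depth == 0:
            parts.append("".join(current))
            current = []
        else:
            current.append(ch)
    parts.append("".join(current))
    return parts


nested = "f(g(1),2),x"
print(split_with_flag(nested))   # splits inside f(...) after the inner ")"
print(split_with_depth(nested))  # splits only at the top-level comma
```

The two agree on input with a single level of parens, such as `a(1,2),b`; they only diverge once parens nest.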

tabatkins avatar Mar 28 '17 21:03 tabatkins

If srcset needs to be compatible with CSS it also needs to handle escapes and should just be defined by the CSS parser, I think.

annevk avatar Mar 29 '17 06:03 annevk

It's not for CSS but for future descriptors like integrity(). The algorithm is intentionally "simple" and CSS compat is not a goal.

zcorpan avatar Mar 29 '17 17:03 zcorpan

I think I'll need this at least for Content-Type and possibly other HTTP headers, but I'm not sure yet whether I want those to operate on strings or bytes.

annevk avatar Apr 19 '18 09:04 annevk