Parsers.jl icon indicating copy to clipboard operation
Parsers.jl copied to clipboard

Unclear documentation for `getstring`

Open KristofferC opened this issue 3 years ago • 6 comments

Here is a part of the docs for getstring

If the actual parsed String is needed, however, you can pass your source and the res.val::PosLen to Parsers.getstring to get the actual parsed String value.

But getstring takes 3 arguments, the last one which is called e. I don't see any description of e in the docstring and how it should be used. Looking at the tests, it seems that 0x00 is passed in.

KristofferC avatar Jun 14 '22 13:06 KristofferC

I think it is the "escapechar", so usually you'd pass opts::Options.e

here's how Options documents that: https://github.com/JuliaData/Parsers.jl/blob/b1da070ccfe7c93013a408c8509d0fd605754917/src/Parsers.jl#L46

As you say, we should improve the docs!

nickrobinson251 avatar Jun 14 '22 13:06 nickrobinson251

I think it is the "escapechar", so usually you'd pass opts::Options.e

Hm, I have no Options struct. I just parsed as:

julia> str = "\"foo\" # comment"
"\"foo\" # comment"

julia> res = Parsers.xparse(String, str)
Parsers.Result{PosLen}(-32603, 15, PosLen(0x0000000000200003))

julia> Parsers.getstring(str, res.val, 0x00) # what should go as third arg?
"foo"

KristofferC avatar Jun 14 '22 13:06 KristofferC

ah, i see. I'd strongly recommend explicitly creating an Options and passing it around if calling xparse multiple times, otherwise a new Options object gets constructed on each call.

Anyway, I believe for getstring to return the expected result (if an escapechar was parsed) you need to pass the same e (third arg) as was used by the xparse call. I suppose tests use 0x00 since any value is fine if you know no escapechar was parsed.

The default for xparse/Options is e=UInt8('"'), so when not specifying a different escapechar, i'd say pass UInt8('"') to getstring.

Arguably the e argument of getstring should have a default matching the xparse/Options default. But i don't think the 2-arg xparse function is really recommended, which is probably why no corresponding 2-arg getstring function exists.

Also, is "foo" the result you expected here? That xparse call is using quoted=true. If you wanted "\"foo\" # comment" you may want to use quoted=false

nickrobinson251 avatar Jun 14 '22 14:06 nickrobinson251

Also, is "foo" the result you expected here? That xparse call is using quoted=true. If you wanted ""foo" # comment" you may want to use quoted=false

Actually, that is not correct, the default is quoted=false. So this parsed until the end of line and returns an error code.

julia> Parsers.Options().quoted
false

julia> res = Parsers.xparse(String, str)
Parsers.Result{PosLen}(-32603, 15, PosLen(0x0000000000200003))

julia> Parsers.text(res.code)
"encountered an opening quote character, initial value parsing succeeded, reached EOF, invalid delimiter"

julia> res = Parsers.xparse(String, str, 1, sizeof(str), Parsers.Options(; quoted=true))
Parsers.Result{PosLen}(5, 6, PosLen(0x0000000000200003))

julia> Parsers.text(res.code)
"encountered an opening quote character, initial value parsing succeeded"

KristofferC avatar Jun 15 '22 13:06 KristofferC

yikes, we have too many different xparse methods that set different defaults.

julia> Parsers.xparse(String, str)  # == Parsers.xparse(String, str; quoted=true)
options.quoted = true
Parsers.Result{PosLen}(-32603, 15, PosLen(0x0000000000200003))

julia> Parsers.xparse(String, str, 1, sizeof(str))
options.quoted = true
Parsers.Result{PosLen}(-32603, 15, PosLen(0x0000000000200003))

julia> Parsers.xparse(String, str, 1, sizeof(str), Parsers.Options())
options.quoted = false
Parsers.Result{PosLen}(33, 15, PosLen(0x000000000010000f))

julia> Parsers.xparse(String, str, 1, sizeof(str), Parsers.Options(; quoted=true))
options.quoted = true
Parsers.Result{PosLen}(5, 6, PosLen(0x0000000000200003))
  1. The first hits this, which passes quoted::Bool=true: https://github.com/JuliaData/Parsers.jl/blob/e2259a6e10452cf0d5c541183d455295415d1dbe/src/Parsers.jl#L211-L212

  2. The second hits this, which uses Parsers.XOPTIONS https://github.com/JuliaData/Parsers.jl/blob/e2259a6e10452cf0d5c541183d455295415d1dbe/src/Parsers.jl#L217-L218

    • and XOPTIONS has quoted=true https://github.com/JuliaData/Parsers.jl/blob/e2259a6e10452cf0d5c541183d455295415d1dbe/src/Parsers.jl#L164
    • (aside: Why does XOPTIONS exist?)
  3. The third hits the same method as 2, but passing in Parsers.Options() which has quoted=false: https://github.com/JuliaData/Parsers.jl/blob/e2259a6e10452cf0d5c541183d455295415d1dbe/src/Parsers.jl#L157

  4. The fourth hits the same method as 2/3, but passes in Parsers.Options(; quoted=true) to set that explicitly ...but returns a different answer to 1 (xparse(String, str; quoted=true)), because Options() defaults to delim=nothing whereas 1 sets delim=UInt8(',') https://github.com/JuliaData/Parsers.jl/blob/e2259a6e10452cf0d5c541183d455295415d1dbe/src/Parsers.jl#L149

This is now very off-topic from your original issue (sorry), so can move it to a new issue, but i think perhaps we could simplify the xparse interface to make this whole thing a little less confusing / more explicit.

I think i'd be in favour of requiring a user-given ::Options argument.

cc @quinnj


EDIT: opened #124

nickrobinson251 avatar Jun 15 '22 16:06 nickrobinson251

0x00 is fine to use as the 3rd argument if you're not dealing with quoted/escaped characters at all. Basically any byte that shouldn't occur in your string. Parsers.getstring just checks if the PosLen is escaped and, if so, will do the unescaping for you based on that 3rd argument, so if you know the PosLen will never be escaped (i.e. you're not using quote or escape chars), then it's fine to just ignore all of that.

@nickrobinson251, yeah, the xparse methods have kind of grown into a bit of a spaghetti mess; they were each born from different use-cases/motivations, but I agree it's gotten a bit intractable.

quinnj avatar Jun 15 '22 16:06 quinnj