serpent custom / __tostring printer results incorrectly put into quotes

I'm evaluating the feasibility of using Serpent as Metalua's pretty-printer. I'm stuck with the fact that the results of __tostring metamethods are put into quotes. There's a similar issue with custom printers. For instance:

token_mt = { }
function token_mt :__tostring()
    return string.format("token(%s, %s)", self.tag, self.value)
end

function token(tag, value)
    return setmetatable({tag=tag; value=value}, token_mt)
end

stream = { token("Keyword", "local"), token("Id", "x") }

s = require 'serpent'
print(s.block(stream))
-- incorrectly produces {"token(Keyword, local)", "token(Id, x)"} which can't be deserialized


pp = require'metalua.pprint'
pp.print(stream)
-- produces { token(Keyword, local), token(Id, x) }, readily accepted by loadstring().

Dec 30 '13 15:12 fab13n

I'd say that __tostring results should be put into quotes — they are, indeed, strings.

Maybe __serialize would help? (Didn't look into it closely.)

Dec 30 '13 15:12 agladysh

@agladysh Adding this produces the same result:

token_mt.__serialize=token_mt.__tostring
print(s.serialize(stream,{ }))

Besides, it could be debatable whether this is a bug for __tostring, but the same issue exists with custom printers.

Dec 30 '13 15:12 fab13n

@fab13n, I agree, it may make sense to handle __serialize differently. I think for consistency, __tostring should still be quoted.

Would it be sufficient for your purposes if __serialize output its result as is?

Dec 30 '13 17:12 pkulchenko

I think tostring results shouldn't be quoted by Serpent itself. Otherwise, we can't tell the difference between a generated output and the corresponding literal string. In the example above I couldn't tell the difference between the outputs corresponding to the two following lines:

stream1 = { token("Keyword", "local"), token("Id", "x") }
stream2 = { [[token("Keyword", "local")]], [[token("Id", "x")]] }
assert(s.line(stream1)==s.line(stream2))

Sure, the version with mandatory quotes is valid Lua, but a valid Lua that's semantically different from the original one. It's arguably more treacherous, rather than safer. If users want to add quotes, they can add them in their __tostring() implementation. But the point of a pretty-printer is, well, to print something pretty, or at least readable. Compilability is mandatory for serialization, not for display.

As for using serialization instead of pretty-printing when one intends to pretty-print in a prettier way, I find this rather confusing :-) Besides, I don't even know whether I can use readability-improving options, such as indentation, with s.serialize().

Dec 30 '13 17:12 fab13n

Yes, I do see this point and have struggled with the same considerations in the beginning. The main goal was to make the output consistent between serialization and pretty-printing, as it's convenient in many situations to take the "printed" output and paste it into a script. For example, when you use a console in the IDE to execute commands, is the result serialization or pretty-printing? It's probably mostly the latter, but I've seen on several occasions people taking the output from the console and pasting it into their code.

stream1 = { token("Keyword", "local"), token("Id", "x") }
stream2 = { [[token("Keyword", "local")]], [[token("Id", "x")]] }
assert(s.line(stream1)==s.line(stream2))

Right, but these are different values and I wanted them to look different. Similar to that, there is ambiguity with nil vs "nil", true vs. "true" and so on (which doesn't exist with the current output).
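The ambiguity described here can be reproduced without Serpent. The following is a minimal sketch, assuming a hypothetical `show` helper (not part of Serpent) that renders a flat array and optionally leaves string values unquoted:

```lua
-- Hypothetical helper, not part of Serpent: renders a flat array,
-- optionally leaving string values unquoted.
local function show(list, quote_strings)
  local out = {}
  for i = 1, #list do
    local v = list[i]
    if type(v) == "string" and quote_strings then
      out[i] = string.format("%q", v)
    else
      out[i] = tostring(v)
    end
  end
  return "{" .. table.concat(out, ", ") .. "}"
end

local booleans = { true, false }
local strings  = { "true", "false" }

-- Without quoting, the two different tables print identically.
assert(show(booleans, false) == show(strings, false))
-- With quoting, they stay distinguishable.
assert(show(booleans, true) ~= show(strings, true))
```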

I can add a parameter to disable quoting of string values, but would prefer to keep the current option as the default.

Dec 30 '13 18:12 pkulchenko

On Mon, Dec 30, 2013 at 7:01 PM, Paul Kulchenko wrote:

assert(s.line(stream1)==s.line(stream2))

Right, but these are different values and I wanted them to look different.

My point is, they are identical with the current implementation, and I want them to look different!

Dec 30 '13 18:12 fab13n

My point is, they are identical with the current implementation, and I want them to look different!

I understand, but they only look the same because your token() call returns its own representation.

There is still a difference between:

function token_mt :__tostring() return "true" end

-- and

function token_mt :__tostring() return true end

I pushed a change (in rawstring branch) that adds rawstring parameter:

print(s.block(stream, {rawstring = true}))

-- generates

{
  token(Keyword, local),
  token(Id, x)
} --[[table: 00C8C248]]

You can make it a default for all methods, so that your users don't need to specify it.

Will this work?

Dec 30 '13 19:12 pkulchenko

I'm not sure if this is a correct approach. I think this should be configurable on per-object level. It makes little sense to me to configure a thing like this once per serialization call.

Dec 30 '13 19:12 agladysh

I thought so as well; for example, adding something like a __raw attribute to the metatable would signal that the result string should not be quoted (and we wouldn't need the serialization parameter), but I'd prefer not to clash with (possible) Lua values, like __mode. Maybe _raw or _rawstring?

Dec 30 '13 20:12 pkulchenko

For example, when you use a console in the IDE to execute commands, is the result serialization or pretty-printing?

There is yet another use-case for pretty-printing while serializing — human-readable config files. This is a primary reason why in Lua Núcleo we have tpretty as well as tserialize (along with debug visualization of complex tables, of course).

So, we have at least three reasons to convert a Lua object to string:

1. Machine-readable serialization (can load data back, don't care about readability)
2. Human-readable serialization (can load data back, do care about readability)
3. Human-readable visualization (don't care to load data back, do care about readability)

Human-readable serialization and visualization should be highly configurable to conform to usage modes (single-line logging, multi-line config files) and to coding guidelines (indentation rules, line width etc.)

Dec 30 '13 20:12 agladysh

Maybe I'm missing something (didn't study serpent's design too closely yet, sorry), but I'm not sure why __tostring and __serialize are supported by default at all.

I'd start with a __serpent metamethod, which would have (read-only) access to the configuration and would allow the object author to emit custom serialized data, to be pasted literally into the resulting string.

Then I'd consider enabling __tostring support for human-readable visualization by default — but only there.

To me the __serialize metamethod name looks too generic: there is no well-established protocol for it, as far as I'm aware. But maybe it would make sense to support it as a generic fallback for serialization (and only there), accepting no parameters and returning a string to be pasted, again, literally into the output.

Dec 30 '13 20:12 agladysh

Furthermore, all this metamethod handling is sugar that will slow down the basic implementation. Why not leave it to the user to sort out in a custom serialization callback function? (Well, OK, and maybe provide some useful common implementations in an auxiliary module...)

Dec 30 '13 20:12 agladysh

Maybe _raw or _rawstring?

Please don't use such generic names to control module-specific behaviour.

If you really want a flag, use __serpent_raw or something.
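A per-object flag of this kind could look roughly like the sketch below. It is only an illustration: the `render` function is hypothetical and is not Serpent's code, and `__serpent_raw` is simply the name floated in this thread, not an existing Serpent feature.

```lua
-- Hypothetical sketch: neither `render` nor the `__serpent_raw` flag is
-- part of Serpent; the flag name is the one suggested in this thread.
local function render(value)
  local mt = getmetatable(value)
  if type(mt) == "table" and mt.__tostring then
    local text = tostring(value)
    -- Per-object opt-in: paste the text literally only when flagged.
    if mt.__serpent_raw then return text end
    return string.format("%q", text)
  end
  if type(value) == "string" then return string.format("%q", value) end
  return tostring(value)
end

local function token_tostring(self) return "token(" .. self.tag .. ")" end
local raw_mt    = { __tostring = token_tostring, __serpent_raw = true }
local quoted_mt = { __tostring = token_tostring }

local a = setmetatable({ tag = "Id" }, raw_mt)
local b = setmetatable({ tag = "Id" }, quoted_mt)

print(render(a))  -- pasted literally
print(render(b))  -- quoted, as in the current behaviour
```

With this shape, objects with different conventions can be mixed in a single call, since the decision is read from each object's metatable rather than from the call options.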

Dec 30 '13 20:12 agladysh

BTW, while we're considering __serpent (or whatever) custom API, there is a common usage pattern — at least in our code.

When I'm serializing to an output stream (say, stdout), I prefer to write there directly to be easier on GC. Compare Núcleo's tstr and tstr_cat.
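The difference between returning a string and writing to a sink can be sketched as follows. This is an assumption-laden illustration, not Núcleo's or Serpent's API: `serialize_cat` and `serialize` are hypothetical names, and only array-like tables are handled.

```lua
-- Hypothetical sketch (not Serpent's or Núcleo's API): emit pieces through
-- a `cat` callback instead of building one large string.
local function serialize_cat(cat, value)
  if type(value) == "table" then
    cat("{")
    for i = 1, #value do           -- array part only, to keep the sketch short
      if i > 1 then cat(",") end
      serialize_cat(cat, value[i])
    end
    cat("}")
  elseif type(value) == "string" then
    cat(string.format("%q", value))
  else
    cat(tostring(value))
  end
end

-- Write straight to stdout: no intermediate result string is kept around.
serialize_cat(function(s) io.write(s) end, { 1, "two", { 3 } })
io.write("\n")

-- The same function can still build a string when one is needed.
local function serialize(value)
  local buf = {}
  serialize_cat(function(s) buf[#buf + 1] = s end, value)
  return table.concat(buf)
end
```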

Dec 30 '13 20:12 agladysh

"It makes little sense to me to configure a thing like this once per serialization call." vs. "This is a primary reason why in Lua Núcleo we have tpretty as well as tserialize..."

This is the same thing that serpent does with different methods and options it provides.

Using per-object configuration allows one to mix objects with different configurations in one serialization call, although I'm not sure how useful it is.

Human-readable serialization and visualization should be highly configurable to conform to usage modes (single-line logging, multi-line config files) and to coding guidelines (indentation rules, line width etc.)

Yes, serpent also supports all these combinations except limiting line width.

When I'm serializing to an output stream (say, stdout), I prefer to write there directly to be easier on GC. Compare Núcleo's tstr and tstr_cat.

This is a good point; it should be possible to write a wrapper around serpent that allows "pulling" the next token, so that the caller can do whatever is desired with it while memory use is limited to the token being returned.
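One way such a "pull" interface might look is a coroutine-based generator. The sketch below is standalone and hypothetical (the `tokens` function is not a Serpent wrapper and handles only array-like tables); it is meant only to show the shape of the idea.

```lua
-- Hypothetical sketch: a generator that yields one output token at a time.
local function tokens(value)
  local function walk(v)
    if type(v) == "table" then
      coroutine.yield("{")
      for i = 1, #v do
        if i > 1 then coroutine.yield(",") end
        walk(v[i])
      end
      coroutine.yield("}")
    elseif type(v) == "string" then
      coroutine.yield(string.format("%q", v))
    else
      coroutine.yield(tostring(v))
    end
  end
  return coroutine.wrap(function() walk(value) end)
end

-- The consumer decides what to do with each piece as it arrives.
for piece in tokens({ 1, { "a" } }) do
  io.write(piece)
end
io.write("\n")
```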

@fab13n, if you prefer not to add the rawstring option to the serialization call, then I can go back to the original proposal and use the __serialize call to provide literal serialization. In the current logic, the __serialize metamethod (if present) takes precedence over __tostring. I don't see a problem with the rawstring option, as it can be made a default, but @agladysh is right that in this case it will be applied to all __tostring/__serialize results.

Dec 30 '13 20:12 pkulchenko

Yes, serpent also supports all these combinations except limiting line width.

Is this hard to implement? tpretty even places short (sub-)entries on a single line, and uses multi-line serialization for longer ones (the implementation is rather messy though).
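The behaviour described here (short entries on one line, longer ones broken up) can be sketched in a few lines. This is a hypothetical illustration, not tpretty's or Serpent's implementation: `pretty` handles only array-like tables with string and number leaves, and the width check ignores trailing commas.

```lua
-- Hypothetical sketch, not tpretty's or Serpent's implementation:
-- array-like tables only, strings and numbers as leaves.
local function pretty(value, width, indent)
  indent = indent or ""
  if type(value) ~= "table" then
    return type(value) == "string" and string.format("%q", value) or tostring(value)
  end
  -- First try: everything on one line.
  local flat = {}
  for i = 1, #value do flat[i] = pretty(value[i], math.huge, "") end
  local one_line = "{ " .. table.concat(flat, ", ") .. " }"
  if #indent + #one_line <= width then return one_line end
  -- Too long: one entry per line; each entry may itself still fit on a line.
  local inner = indent .. "  "
  local lines = {}
  for i = 1, #value do
    lines[i] = inner .. pretty(value[i], width, inner)
  end
  return "{\n" .. table.concat(lines, ",\n") .. "\n" .. indent .. "}"
end

print(pretty({ { 1, 2 }, { 3, 4 } }, 80))  -- fits on one line
print(pretty({ { 1, 2 }, { 3, 4 } }, 10))  -- outer table is split
```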

Dec 30 '13 21:12 agladysh

Is this hard to implement?

No, but it hasn't been a popular feature so far. Also, the original goal for Serpent was to provide a "good enough" implementation (in the sense of "good" correctness and "enough" features for the majority of users), while still staying small. It's not so small anymore, but I'm hesitant to add features "just in case". I usually look for specific use cases or user requests.

Dec 30 '13 21:12 pkulchenko

BTW, there is yet another use-case "axis" for human-readable part — fast and good-enough vs. slow and really nice-looking.

When I'm writing data to a log file, I need it to be human-readable, but I need to write it really fast, even at the expense of certain features (tstr).

When I'm generating code (like a configuration file), I need to do it prettily and I do not care for the speed (as much): tpretty.

Interestingly enough, tserialize is actually almost useless — loading back Lua code is too slow when compared with other machine-readable solutions (like, say, luatexts or msgpack).

Dec 30 '13 21:12 agladysh

I usually look for specific use cases or user requests.

I'm considering dropping tpretty in favor of serpent someday. Lack of a line width limit is a blocker, since we need to generate code that is compliant with our coding guidelines. So, consider this a +1 for that feature. :)

Dec 30 '13 21:12 agladysh

Lack of line limit is a blocker, since we need to generate code that is compliant to our coding guidelines. So, consider this a +1 for that feature. :)

That's good as a feature request ;).

Good point on fast vs. pretty, although it's difficult to say ahead of time what features can be skipped to make things fast(er).

Dec 30 '13 21:12 pkulchenko

Using per-object configuration allows one to mix objects with different configurations in one serialization call, although I'm not sure how useful it is.

Well, to me __tostring in a serialization/visualization context seems to be useful only as a crutch for quick-and-dirty visualization. Given that, I would say that the per-call configuration is meaningless.

I see several major use-cases (examples in pseudocode, I've actually needed everything listed here at one time or another):

1. Mutate object on serialization.

__serpent = function(self)
  -- Rest of the fields are ephemeral
  return { x = self.x, y = self.y } 
end

2. Provide a constructor function for serialization.

__serpent = function(self, serpent)
  local data = tclone(self)
  data.tag, data.id = nil, nil
  return self.tag, " ", serpent.short_string(self.id), serpent.newline, data
end

--- For self = { tag = "foo:bar", id = "baz", quo = 42 }:
---
--> foo:bar "baz"
--> {
-->   quo = 42;
--> }

3. Control formatting for human-readable serialization.

__serpent = serpent:format_fields { type = serpent.long_string }

--- To be put on `data` above to get:
--> foo:bar "baz"
--> {
-->   type = [[mystring]];
-->   quo = 42;
--> }

4. Provide custom visualization.

__serpent = function(self, serpent)
  if not self.tag then
    return self
  end

  return "`", self.tag, " ", serpent:comma_list(self)
end

Not a single use-case that I see now requires non-raw pasting.

What am I missing?

Dec 30 '13 21:12 agladysh

Furthermore, __tostring is a potential source of bugs here, because it is so generic. If I'm using serpent for serialization, I must add __serialize to every object that might have __tostring in its metatable, even on an object from a foreign module. And there is no telling what metamethods foreign module objects would have, or for what reason.

(I also updated the post above a bit.)

Dec 30 '13 21:12 agladysh

BTW, item 3 above may also include a custom field order for an object. This is useful for DSLs, where field order may be determined by the purpose of the field (say, common fields first, then those specialized for this specific construct), and thus depend on the tag field value...

Dec 30 '13 21:12 agladysh

(Edited that big post with use-cases again.)

Dec 30 '13 21:12 agladysh

...And here is a problem with using a metamethod instead of a callback in serialization. Generally, you want to save the object in such a way that, when loaded back, it will be able to save itself again.

This means that this doesn't cut the mustard:

__serpent = function(self)
  return { x = self.x, y = self.y } 
end

You always need a constructor that will set a metatable:

__serpent = function(self)
  return "make_foo", " ", { x = self.x, y = self.y } 
end

This is not a problem for visualization — you don't need to load anything back. But for serialization you're better off using a callback function. This way you would not need to set any metatables on load, and can just emit good old table literals.
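To make the constructor requirement concrete, here is a minimal round-trip sketch. All names in it (`make_foo`, `foo_mt`, `save`, `load_with_env`) are illustrative assumptions rather than Serpent features; the point is only that emitting a constructor call restores the metatable, so the loaded object can save itself again.

```lua
-- Hypothetical sketch: `make_foo`, `foo_mt`, and `save` are illustrative
-- names, not part of Serpent.
local foo_mt = {}
foo_mt.__index = foo_mt

local function make_foo(fields)
  return setmetatable(fields, foo_mt)
end

-- Emits a constructor call so that the metatable is restored on load.
function foo_mt.save(obj)
  return string.format("return make_foo { x = %d, y = %d }", obj.x, obj.y)
end

-- Loads a chunk with an explicit environment (Lua 5.1 and 5.2+).
local function load_with_env(source, env)
  if setfenv then
    return setfenv(assert(loadstring(source)), env)
  end
  return assert(load(source, "=data", "t", env))
end

local original = make_foo { x = 1, y = 2 }
local source = original:save()
local restored = load_with_env(source, { make_foo = make_foo })()

assert(getmetatable(restored) == foo_mt)  -- the metatable survived
assert(restored:save() == source)         -- and it can save itself again
```

The cost is that the loading side must supply `make_foo` in the environment, which is the coupling a plain table literal emitted from a callback would avoid.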

Dec 30 '13 22:12 agladysh

Well, to me __tostring in a serialization/visualization context seems to be useful only as a crutch for quick-and-dirty visualization. Given that, I would say that the per-call configuration is meaningless.

It's actually something that many users find useful/helpful. Some use it for their own values that already provide __tostring (pkulchenko/ZeroBraneStudio#77) and some use it to serialize and print their userdata (for example, Marmalade Quick).

I see several major use-cases (examples in pseudocode, I've actually needed everything listed here at one time or another):

That's a lot to review and digest; thank you for the summary. What's __serpent in these examples?

Dec 30 '13 22:12 pkulchenko

__serpent is a fancier __serialize metamethod — I used a different name here so we'll not be bound by an existing contract while discussing the matter.

Dec 30 '13 22:12 agladysh

It's actually something that many users find useful/helpful. Some use it for their own values that already provide __tostring (pkulchenko/ZeroBraneStudio#77) and some use it to serialize and print their userdata (for example, Marmalade Quick).

In general, __tostring is not useful for serialization. Look at luasocket, for example: you can't load its output back.

True, in certain closed ecosystems it is possible to institute a stricter contract. But in general, IMO, serialization code should not look at __tostring at all, unless the user explicitly configures it to do that.

Visualization is debatable.

But, if I were writing a debugger, I would, first of all, provide my own callback, so users could specialize their code. Besides, what if __tostring for a given object has side-effects, for example? It is not forbidden by the general __tostring contract. So, if mobdebug uses __tostring for value visualization, IMHO, this is a dangerous misfeature that can cost a careless and unlucky user many hours of pain.
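The side-effect concern is easy to demonstrate with a contrived object. The sketch below is illustrative only; `visualize` is a stand-in for any tool that renders values through `tostring`, not code from mobdebug or Serpent.

```lua
-- Illustrative only: an object whose __tostring has a side effect.
local counter_mt = {}
counter_mt.__tostring = function(self)
  self.reads = self.reads + 1            -- state changes on every conversion
  return "counter(" .. self.reads .. ")"
end

local c = setmetatable({ reads = 0 }, counter_mt)

-- Stand-in for a debugger that renders a watched value on every step.
local function visualize(value) return tostring(value) end

visualize(c)
visualize(c)
assert(c.reads == 2)  -- merely displaying the value modified it
```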

Dec 30 '13 22:12 agladysh

__serpent is a fancier __serialize metamethod — I used a different name here so we'll not be bound by an existing contract.

That's what I suspected, but I thought you were arguing against adding metamethods to all the objects: "If I'm using serpent for serialization, I must add __serialize to every object that might have __tostring in its metatable — even on a object from foreign module."

It seems like there are three main options for implementing that:

1. Using __tostring; this should be good enough for 90% of the users, maybe more. The current issue (at least for Fabien) is that the output of __tostring is not literal.
2. Using a special metamethod, like __serpent (or the existing __serialize). This should cover all the special cases (and should use literal output), but needs to be installed for every object that requires that treatment.
3. Using a special callback, probably based on a type. This allows the callback to be specified only once (instead of attaching the metamethod) and can be hidden as one of the default options (in Fabien's case). This is similar to the current custom option, but it's only triggered for "plain" tables.
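The callback option could be sketched like this, assuming handlers keyed on the metatable as the "type". Everything here is hypothetical: `render` and the `handlers` table are illustrative and do not reflect Serpent's actual `custom` option or any other part of its API.

```lua
-- Hypothetical sketch: `render` and the `handlers` option are illustrative
-- and are not Serpent's API.
local function render(value, handlers)
  local handler = handlers[getmetatable(value) or false]
  if handler then return handler(value) end      -- pasted literally
  if type(value) == "table" then
    local out = {}
    for i = 1, #value do out[i] = render(value[i], handlers) end
    return "{" .. table.concat(out, ", ") .. "}"
  end
  if type(value) == "string" then return string.format("%q", value) end
  return tostring(value)
end

local token_mt = {}
local function token(tag, value)
  return setmetatable({ tag = tag, value = value }, token_mt)
end

-- Registered once per type; the objects themselves carry no extra metamethod.
local handlers = {
  [token_mt] = function(t)
    return string.format("token(%q, %q)", t.tag, t.value)
  end,
}

local stream = { token("Keyword", "local"), token("Id", "x") }
print(render(stream, handlers))
```

Because the handler quotes its own arguments, the output is both literal and loadable, provided a `token` function is in scope when it is loaded back.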

Dec 30 '13 22:12 pkulchenko

(BTW, please don't let my ramblings distract you from giving Fabien a working solution for his immediate problem :-) We can discuss theory forever, but it is practice that is important.)

Dec 30 '13 22:12 agladysh