Parser refactoring proposal

Open CasperN opened this issue 3 years ago • 12 comments

As I tend to do, I'll try to keep this top comment updated with the latest state. This is based off of https://github.com/google/flatbuffers/issues/6428#issuecomment-856272970.

Proposal

Why

flatc is really complicated and hard to follow. Particularly, idl_parser.cpp is kind of a huge mess. I actually think the right way out is to rewrite components, rather than trying to refactor it, but this can be decided at implementation time...

What

Proposed components

flatc
- Invoke subcomponents
fb-schema-parser, proto-schema-parser
- these take in .fbs or .proto files and parses them to reflection bfbs
bfbs2schema
- this is so we can output flatbuffers files translated from protobuffers
conform
- takes two bfbs schemas and verifies if the second is a valid evolution of the first.
code generators
- They take in reflection bfbs and generate files in their local languages
json2fbs, fbs2json, fbs2flexbuffers, flexbuffers2fbs
- Transpiles std::in, possibly given a reflection.bfbs if transpiling fbs.

We may want flexbuffers2json and json2flexbuffers, to be part of flatc for legacy purposes but imo, it should be a different binary since it doesn't touch any flatbuffers.

Rougher ideas

Code generators should be independent binaries, invoked by something like std::system (if that works).
- flatc should access some config file, e.g. .flatc, that enables users to add more code generators, e.g.
```
code_generators {
  flag: "mylanguage"
  executable_path: "~/my/language/main.o"
}
```
- there should be some way of passing flags to the code generator too, e.g.
```
flatc --mylanguage foo.fbs -- --mylanguage-features
# this acts like
flatc foo.fbs --schema | ~/my/language/main.o --mylanguage-features
```
the various transpilers should also be independent binaries if we want plugin system for that too
- Plausibly, we'd want that for translation to XML, YAML, other formats
We might want to defer all the flag logic to another library... perhaps Absiel, if we can get to C++11

Jun 17 '21 16:06 CasperN

Saying the current code is "kind of a huge mess" without a) being specific about what is so bad about the current code and b) what advantages this new structure would bring is not very productive.

Other than that the file is (too) big I don't see a huge mess myself.

Your "proposed components" simply list functionality that we already have. I presume you want these things to be more "separate", but the how of that is the most interesting. For example the fbs and proto schema parsers share a LOT of functions. Where do these functions go? Some kind of inheritance tree? Given that 95% of the work is about fbs, is fbs parsing well served by having the parsing split up over 2 files dictated by what happens to be shared with proto? etc etc?

We can certainly split up idl_parser.cpp and pull parts out (the conform and serialization stuff for example should be pretty easy to pull out), but most certainly this would take the form of a refactor instead of a rewrite, for two reasons:

refactor is WAY more code-review friendly, and can be done incrementally.
there's a TON of functionality in there, and we can't afford all sorts of subtle things breaking due to a rewrite. Yes, we have tests, and we should probably have a lot better test coverage, but for something as big as this, you're never going to be sure a "rewrite" doesn't break stuff, no matter how good you think your tests are.

While I welcome the idea of making independent binaries (and codegens in other languages) possible, I don't see why we should require it for all code generators, especially all the mainstream languages we already have. It's not like flatc takes long to build. And for many people, having a single executable that does everything and requires no external files is convenient.

I would not want to pull in Abseil or some other big library for the convenience of flag parsing. So far, FlatBuffers has been a "no dependencies" library, which makes it convenient to build on many platforms, and we'd need a pretty strong reason to break that.

@vglavnyy who has worked a lot with this code as well.

Jun 17 '21 20:06 aardappel

Saying the current code is "kind of a huge mess" without a) being specific about what is so bad about the current code and b) what advantages this new structure would bring is not very productive...

Sure. I didn't want to dive into all the details immediately, just lay out the high level structure. Basically, the main issue is that Parser does everything. I think it should be refactored into a few classes that have better separation of concerns.

However, what do you think of a plugin model + config file where people can attach their own code generators? (and perhaps transpilers, if we can flesh out those details)

most certainly this would take the form of a refactor instead of a rewrite, for two reasons...

I agree that this will be incremental. I think I was unclear in what I meant by 'rewrite' vs 'refactor'. I more or less meant "major overhaul where basically every line will be moved"

While I welcome the idea of making independent binaries (and codegens in other languages) possible, I don't see why we should require it for all code generators, especially all the mainstream languages we already have.

I agree, especially with the chicken/egg problem, as discussed elsewhere. Though, we should convert one code generator to test the feature.

I would not want to pull in Abseil or some other big library for the convenience of flag parsing. So far, FlatBuffers has been a "no dependencies" library, which makes it convenient to build on many platforms, and we'd need a pretty strong reason to break that.

I would also like absl::Status, so we can have error messages rather than storing them in Parser::error_ but I think we can move that into CheckedError, so I guess that's unnecessary convenience too.

Your "proposed components" simply list functionality that we already have. I presume you want these things to be more "separate", but the how of that is the most interesting. For example the fbs and proto schema parsers share a LOT of functions. Where do these functions go? Some kind of inheritance tree? Given that 95% of the work is about fbs, is fbs parsing well served by having the parsing split up over 2 files dictated by what happens to be shared with proto? etc etc?

Yep, these are the details I'd like to discuss in this issue. Here's my first pass on what can be taken out of Parser:

Tokenizer
StringPool (for symbol names, filenames, commonly shared strings)
Base parser (for whatever functionality is shared)
Fb schema parser
Proto schema parser
Json Parser
SymbolTableHolder
(De)Serializer for reflection.fbs
Advanced feature and conformance checker

BaseParser can be a parent of the other parsers, but I don't think anything else should be inherited. Maybe holding a BaseParser as a member would be better. I think such details should be determined during implementation.

Given that 95% of the work is about fbs, is fbs parsing well served by having the parsing split up over 2 files dictated by what happens to be shared with proto? etc etc?

I think we should move the Parser::Parse* methods like the following (again, this is a first pass, its up for debate):

BaseParser:

ParseHexNum
ParseVector
ParseType
ParseComma
ParseVectorDelimiters
ParseTableDelimiters
ParseFunction
ParseSingleValue

FlatbufferParser:

ParseEnum
ParseTable
ParseField
ParseNamespacing
ParseDecl
ParseMetadata
ParseEnum
ParseNamespace
ParseDecl
ParseMetadata
ParseService
ParseRoot
ParseAlignAttribute
ParseNestedFlatbuffer*
ParseEnumFromString*

ProtobufferParser:

ParseProtoFields
ParseProtoOption
ParseProtoKey
ParseProtoDecl
ParseProtoCurliesOrIdent
ParseTypeFromProtoType

JsonParser:

ParseTypeIdent
ParseString
ParseAnyValue
ParseHash

weird ones I'd have to think more about

ParseFlexBufferNumericConstant
ParseFlexBufferValue
ParseFlexBuffer

* These parse functions depend on symbol resolution. In my opinion, symbol resolution should happen in a second pass so its easier to reason about the actions of a Parser. The Json Parser is also a transpiler, which is slightly different from FlatbufferParser and ProtobufferParser which essentially fill SymbolTablesHolders.

Of course, there are 40 more methods to consider.

Jun 18 '21 03:06 CasperN

In the period 2020-2021 (18 months) the idl_parser.cpp was changed 64 times.

411 deletions, 827 insertions

Total lines: 3600. From my point of view, this file is pretty stable (only 10% of code changed during 18 months). It has some problems but all of them are non-critical. I don't see how refactoring might significantly improve future development. Parser's performance isn't critical. Code is stable.

Code generators are more unstable. Until now not all fbs-features are supported by languages. For example:

idl_gen_cpp (3000 lines in total): 437 deletions, 942 insertions
idl_gen_rust (1800 lines in total): 643 deletions, 1611 insertions

Changes in Parser for changes in Parser isn't a good idea until we migrate to C++11 (range-for, etc) or C++17 (string_view, optional, variant).

The best thing that we can do: extract and clarify Parser's public interface (IDLOptions as well). This should be the public interface for user code, not for idl_gen_<lang> units. The new public interface (free-standing functions or new classes) can use the existing Parser as a ParserImpl without any changes. This ParserImpl might be extracted without any drawback.

Jun 20 '21 14:06 vglavnyy

From my point of view, this file is pretty stable (only 10% of code changed during 18 months).

I strongly disagree with the notion that Parser should not be improved because it hasn't been modified much recently.

Reading code is more important than editing code, and I had to read a lot more than I actually changed when I interacted with Parser. Understanding what's happening is not trivial. I mostly relied on tests to validate my changes.
Modifications to Parser will be necessary for future cross language development, such as the bag of ideas in #5875. I can also argue that few changes were attempted because Parser is complicated.

Changes in Parser for changes in Parser isn't a good idea until we migrate to C++11 (range-for, etc) or C++17 (string_view, optional, variant).

Im also in favor of moving to C++11 or 17, though I wonder if it a goal of Flatbuffers to be compatible with old C++ for portability reasons. I don't understand why new language features would block a refactoring.

The best thing that we can do: extract and clarify Parser's public interface (IDLOptions as well). This should be the public interface for user code, not for idl_gen_ units.

I'm not sure if I understand what you mean here, can you clarify? I think you're suggesting getting the code generators out of holding a const Parser & parse_; field, and I agree with that.

The direction I prefer to push in is by providing the flatbuffer type system in a reflection fb, as per #6428, and for that to be the public interface to both code generators and users.

Jun 20 '21 17:06 CasperN

Modifications to Parser will be necessary for future cross language development, such as the bag of ideas in #5875.

Which ones are directly connected with the parsers (fbs, json, flex)?

I don't understand why new language features would block a refactoring.

Of course it can't block. It was only my opinion. The idl parser.cpp is good enough (high quality and stable) and there are other high-priority issues (bugs and not-implemented features).

I think you're suggesting getting the code generators out of holding a const Parser & parse_;

I don't mind if a generator uses Parser. There are several kinds of Parser's users:

C++ user application code needn't know about idl.h, it needs only:
- load schema (fbs, bfbs),
- parse json (and get a root) with schema or flex,
- generate json with schema or flex,
- check schema conformance.
C++ flatc:
- is like user app, plus generate bfbs (serialize)
C++ code based generators (invoked by flatc or by user code) have to know about types from idl.h:
- load bfbs (deserialize)
- call Lookup methods (if depends on Parser)
- call SupportsOptionalScalars() method
- access to internal (but declared as public) AST fields like enums_, namespases_, etc.
C++ unit-tests and integration testing code
non-C++ code generators: depend only on bfbs- (or json-) schema.

The current public API is idl.h file is focused on code generators. This API is too complicated if json parsing or printing is needed. The current Parser can be hidden from an application level API. A simplified API can be easily extracted without changes in the code (using pimpl idiom).

As for me, the IDLOptions looks like a set of unrelated flags. This set is like a global variable that is used for everything (parsing, code generation, json generation). It breaks the single responsibility principle (it is obvious on code review). Since 2019, 10 options were added, all of them connected with code generation. Probably the only no_warnings flag was connected with the parser. Optionaly, IDLOptions can be divided into several parts: public (or common), common-generate (common for all code generators), and language-specific parts (moved out from idl.h). A public part of these options also can be divided into common, json-, or flex- specific options. But this also a minor issue because it already works and doesn't block anything.

Jun 21 '21 02:06 vglavnyy

Which ones are directly connected with the parsers (fbs, json, flex)?

1, 2, 3, 4, 5, 6, 7, 8, 11, 12, and 15 could be implemented by adding new types or annotations which needs some adjustment to Parser as it populates idl.h. Any checks and adjustments to Conform would be implemented in Parser too. I don't think any of these are directly connected with parsing, per se, certainly not json/flexbuffer parsing, but all the fb schema type checking functionality lives in Parser.

There was a request for XML parsing in a different issue, #6033, that's definitely very challenging to implement with the current structure of Parser.

A simplified API can be easily extracted without changes in the code (using pimpl idiom).

Yep, that's probably how the refactoring would start. I hope it doesn't end there, because "single responsibility principle" and all that. Though I do recognize its a lot of effort.

As for me, the IDLOptions looks like a set of unrelated flags. This set is like a global variable that is used for everything (parsing, code generation, json generation). It breaks the single responsibility principle

Yea I agree, I think fixing that should be part of this refactoring effort.

Maybe we can agree on "clean up Parser API" as a "P1", and "refactor internals" as a "P2".

Jun 21 '21 16:06 CasperN

Agree with @vglavnyy that Parser is relatively stable, but it is indeed very big, and any functionality-preserving refactors that improve organization are still very welcome. Also agree there might be higher priority work to be found, but it also depends on what @CasperN is motivated to work towards, so if he's willing to refactor, we will benefit from that.
Moving the Parser code to a newer C++ is a separate discussion. The answer is always the same: so far we still build on VS2010 and on old Android SDKs with STLPort and other old setups. I would LOVE to remove that cruft, but we still have internal users for the latter, and not an overwhelmingly strong reason to cut that backwards compatibility. Actually, let me check on that internal user.
While we will be enabling external codegens to work on bfbs, and while maybe at some point in the future all of them work on bfbs, there will be a long time when codegens based on Parser must keep functioning. So focus on the enabling, we can focus on migration of existing code later, if that makes sense.
Of your suggested parts to factor out of Parser, I think most will be welcome, but splitting up the core parsing will hardest and possibly most contentious. Since we agree this will be iterative, I suggest simply starting right now, with whichever of those many parts is easiest to pull out. Make PR, repeat. We do the core parsing last, which also makes the effect easier to see.
Would definitely in favor of splitting up IDLOptions depending on subsets by user.
As for config files and how to deal with plugins.. I think a) there should be a super simple way to use plugins that they happen to supplied as executable files right next to where flatc resides (or in path) it "just works", and optionally have a config file (maybe one next to flatc, and one in user config location?) for plugins that need more elaborate setup, like ones run thru an interpreter for a language with flags. Config file probably in JSON since we can read that easily :)
Yes feel free to improve our error situation, but it will likely have to be a class we own, rather than a library.