
New file format

whitequark opened this issue 9 years ago · 64 comments

SolveSpace would benefit from a file format that satisfies the following required properties:

  • compact, with a binary representation;
  • has an alternate text representation, for tests;
  • fast to read and write;
  • easily extensible, while being backwards compatible;
  • easily supports deeply nested structures, such as for hierarchical sketches.

In addition, an optional but desirable property would be:

  • easily readable and perhaps even writable from 3rd party software.

The current file format fulfills none of the above (not even the last requirement, since there is no explicit schema). I propose to replace it with a format based on Google Protocol Buffers, which has the following properties:

  • the binary representation is naturally compact: it writes your data with essentially no special tricks, apart from a variable-length encoding for integers;
  • the text representation is available alongside the binary representation, and requires no manual work;
  • has a highly optimized reader and writer implementation in C++;
  • is designed specifically to handle protocol evolution well;
  • is naturally nested;
  • has a separate schema and very broad library support.

Such a format, with the initial version closely resembling the existing file format, seems like an ideal choice for us. The plan is to evolve it further by gradual extension, while remaining compatible with the old versions, as opposed to versioning; unknown entries are, generally, ignored, so with some care and testing, it is possible to retain an excellent degree of compatibility.
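
To make that concrete, here is a rough sketch of the binary/text duality as exposed by the C++ library; the `slvs::SaveFile` message and its generated header are hypothetical, since no schema has been designed yet:

```cpp
#include <fstream>
#include <string>
#include <google/protobuf/text_format.h>
#include "savefile.pb.h"  // hypothetical generated header for an slvs.SaveFile message

// Writes the same data twice: once in the compact binary wire format, once in
// the text format that the library derives from the schema automatically.
bool WriteBoth(const slvs::SaveFile &file) {
    std::ofstream bin("sketch.slvs", std::ios::binary);
    if(!file.SerializeToOstream(&bin)) return false;

    std::string text;
    if(!google::protobuf::TextFormat::PrintToString(file, &text)) return false;
    std::ofstream txt("sketch.slvs.txt");
    txt << text;
    return (bool)txt;
}
```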

whitequark avatar Oct 14 '16 01:10 whitequark

@jwesthues Any objections? Disregard my earlier thoughts about using LLVM bitcode; after an in-depth discussion with LLVM folks that turned out to be a rather poor choice.

whitequark avatar Oct 14 '16 01:10 whitequark

The new file format should also use 64-bit handles (as would the in-memory representation). The rationale is that, once we make large assemblies fast, remapping of entities can rapidly exhaust the 65535-entity limit for one group, if one imports many instances of a large sketch. This migration is an ideal opportunity to add the 32→64 bit expansion code robustly.
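
For illustration only, a widened handle might split its bits like this; the 32/32 split and the names are assumptions, not a decision:

```cpp
#include <cstdint>

// Hypothetical 64-bit handle layout. The current 32-bit handles run into the
// 65535-entity-per-group limit mentioned above; widening the handle leaves
// ample room for both the group and the remapped entity index.
struct hEntity64 {
    uint64_t v;
    uint32_t group()  const { return (uint32_t)(v >> 32); }
    uint32_t entity() const { return (uint32_t)(v & 0xffffffff); }
};
```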

whitequark avatar Oct 14 '16 02:10 whitequark

That seems generally reasonable. It would be fashionable to use JSON or some other text representation today; but especially if you optimize assemblies to make better use of the GPU, a binary format's speedup seems material.

jwesthues avatar Oct 14 '16 03:10 jwesthues

Actually, the version 3 of Protocol Buffers has a canonical JSON serialization, so we could just use that. (Google generally recommends version 2, for reasons that I do not fully understand; but 3 is probably fine too.)
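
For reference, a minimal sketch of the proto3 JSON mapping via the C++ utility API, again assuming a hypothetical slvs.SaveFile message:

```cpp
#include <string>
#include <google/protobuf/util/json_util.h>
#include "savefile.pb.h"  // hypothetical generated header

// Converts a message to its canonical JSON form; no hand-written serializer needed.
std::string ToJson(const slvs::SaveFile &file) {
    std::string json;
    google::protobuf::util::JsonPrintOptions opts;
    opts.add_whitespace = true;  // pretty-print, easier to diff in tests
    if(!google::protobuf::util::MessageToJsonString(file, &json, opts).ok())
        json.clear();
    return json;
}
```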

whitequark avatar Oct 14 '16 03:10 whitequark

It's also worth taking a look at Cap'n Proto, designed by the original author of Protocol Buffers while taking into account many of the downsides of the latter. It has some rather desirable properties, but a) completely lacks a human-readable format usable at runtime (in fact the author recommends the use of JSON in that case); b) lacks a bidirectional JSON parser and serializer; c) is not supported nearly as broadly. Overall I feel it's not the right choice, but it deserves a mention.

whitequark avatar Oct 14 '16 12:10 whitequark

The rationale is that, once we make large assemblies fast, remapping of entities can rapidly exhaust the 65535-entity limit

Actually, when I was implementing DXF import, I easily found a file which doesn't fit into 64k handles.

Evil-Spirit avatar Oct 15 '16 03:10 Evil-Spirit

The main problem is to build a format which doesn't contain implementation details like remap lists. This format must be human-readable (in text form). I think it is a good idea to have two sections:

  1. The basic compact representation: only requests, groups, and the main parameters. This is fully human-readable and logically structured. It must be sufficient to rebuild the whole sketch using only this part.
  2. The precomputed part. This is for fast loading while building assemblies. By default this is included in the file, but the user can choose to omit it (for more compact files).

For the new-version migration we can just use 1) and not bother with 2).

Evil-Spirit avatar Oct 15 '16 03:10 Evil-Spirit

The main problem is to build a format which doesn't contain implementation details like remap lists.

Without remapping, how do you propose to support nested assemblies?

jwesthues avatar Oct 15 '16 03:10 jwesthues

@jwesthues Use some effective way to represent a handle-chain path to entities, like the FullSubentPath used in AutoCAD.

Evil-Spirit avatar Oct 15 '16 03:10 Evil-Spirit

You mean by making the entity's unique identifier variable-length?

jwesthues avatar Oct 15 '16 03:10 jwesthues

@jwesthues, yes. Since we want to have nested sketches (assemblies being one case), we don't want one big heap of entities, so we can use variable-length identifiers.
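
To make the idea concrete, a FullSubentPath-style variable-length reference might look roughly like this; the names are hypothetical, not an actual SolveSpace or AutoCAD type:

```cpp
#include <cstdint>
#include <vector>

// One hop per nesting level (link group, assembly, etc.), ending at the entity
// inside the innermost sketch; the identifier grows with nesting depth instead
// of being a fixed-width handle.
struct EntityPath {
    std::vector<uint32_t> hops;   // group/link handle at each nesting level
    uint32_t              entity; // entity id within the innermost sketch
};
```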

Evil-Spirit avatar Oct 15 '16 03:10 Evil-Spirit

@Evil-Spirit So what used to be a fixed 64-bit field now must be allocated dynamically for every single entity, and you also break backwards compatibility. What's the offsetting benefit?

jwesthues avatar Oct 15 '16 03:10 jwesthues

@jwesthues, entities themselves can contain only the current sketch id (or group), as before. Only constraints need to contain the full path, since they can be applied between different groups/sketches.

benefits:

  1. This removes remap lists from files.
  2. This allows nested asm-in-asm structures without exponential growth in memory use.
  3. This makes generation faster.
  4. This makes it possible to store only one copy of a sketch in memory and insert it multiple times, using just an assembly group with a coordinate-system basis and a pointer to the sketch.

I know this is a dramatic change and not all the implementation aspects are clear. But this is just one way we could think about the new format. Maybe we can invent something more robust.

Evil-Spirit avatar Oct 15 '16 04:10 Evil-Spirit

@Evil-Spirit The thing that dominates regeneration time is constraint solution and surface Booleans, and this speeds up neither. The memory saving is not material. You've proposed this change before, and I still don't see the benefit.

jwesthues avatar Oct 15 '16 04:10 jwesthues

Selection can operate on the nested structures and generate paths for constraints. Actually, the existence of every entity in memory is not needed; entities can be generated on the fly from requests and stored parameters, for writing constraint equations or performing visualization. Requesting an entity by its id inside a group can use the request id stored in the entity id, the entity id inside the request, and the unique entity id inside the group to decide which type and which parameters to take...

Evil-Spirit avatar Oct 15 '16 04:10 Evil-Spirit

@jwesthues Constraint solution time is not the problem; I know what we can do about it. Booleans are a pain, but they can still be improved by using some external libraries or by improving the current solution. They can be abstracted (as is actually done for switching between mesh and surfaces).

But we are talking about the file format, which is the basis for all of these aspects. It is the kind of thing where solving one small problem now solves a couple of big problems in the future. I am not necessarily talking about making the change; I am just thinking about how it could be done. Maybe some of these ideas will help make a better file format or solve some problems.

Evil-Spirit avatar Oct 15 '16 04:10 Evil-Spirit

@Evil-Spirit

Actually, when I was implementing DXF import, I easily found a file which doesn't fit into 64k handles.

I don't consider importing such files useful (what are you going to do with them exactly?), so it doesn't matter. This is just an artificial case.

This format must be human-readable (in text form).

I agree that there should be a human-readable representation, for one reason only: we need to write tests. I see no reason to have the human-readable representation as our only or main one. There is no benefit (it is actually easier to read protobuf files, because you don't have to write any parsers), and there are drawbacks: the files are bloated, and the parsing is slow.

For the new-version migration we can just use 1) and don't bother about 2)

Nope, the new file format should be (at least) functionally equivalent to the old one.

whitequark avatar Oct 15 '16 05:10 whitequark

@jwesthues Can you please write a (more in-depth than the comment in sketch.h) explanation of the function that remapping serves? I see that it's used for both assemblies and any groups that produce entities from entities, but I don't follow the logic behind that.

whitequark avatar Oct 15 '16 05:10 whitequark

@whitequark To maintain an associative link between two entities (like an entity imported from a part, and the imported entity in the assembly), you can either identify entities by a complete path that specifies the associative link, or keep a table that records the association. Assuming arbitrarily deep import nesting, the former requires arbitrarily long identifiers. The latter is that remap table.
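
For illustration, here is a simplified sketch of what such a remap table records; the types are hypothetical and much reduced compared to the actual data structures:

```cpp
#include <cstdint>
#include <map>
#include <tuple>

struct hEntity { uint32_t v; };  // fixed-width handle

// Each group maps "entity in the input group + copy number" to the entity it
// produced. Handles stay fixed-width because every group maps only one level;
// following the chain group by group recovers the full associative link.
struct RemapKey {
    uint32_t input;      // handle of the entity in the input group
    uint32_t copyNumber; // which instance (e.g. which step-and-repeat copy)
    bool operator<(const RemapKey &o) const {
        return std::tie(input, copyNumber) < std::tie(o.input, o.copyNumber);
    }
};
using RemapTable = std::map<RemapKey, hEntity>; // -> entity in this group
```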

jwesthues avatar Oct 15 '16 06:10 jwesthues

I don't consider importing such files useful (what are you going to do with them exactly?), so it doesn't matter. This is just an artificial case.

Actually, this is not as big a file as you might think. And SolveSpace is good enough to be a good pure 2D editor (since we added DXF import/export). Look at https://github.com/whitequark/solvespace-testcases/tree/master/tests/import/Download

Evil-Spirit avatar Oct 15 '16 06:10 Evil-Spirit

@jwesthues, this sounds reasonable for assemblies (where we actually have the asm-in-asm hierarchy case and the case of an arbitrary number of instances of the same part), but why is this needed for every group? Can we painlessly avoid it?

Evil-Spirit avatar Oct 15 '16 06:10 Evil-Spirit

Speaking of Protocol Buffers... I'm looking further into it and I see some potential scalability problems. Specifically, Protocol Buffers are designed to work by deserializing the entire message at once. The problems here are twofold:

  • Parsing demands. In general the documentation recommends avoiding messages larger than 1 MB, has a 64 MB soft limit to guard against DoS, and has a 512 MB hard limit, at which point it breaks due to integer overflows. Additionally, parsing involves allocation (which is cheap, as it uses arenas, but it still expands the input in memory) as well as deserialization from the wire format.
  • Lack of a seek capability. When we are loading a sketch that we know we will regenerate, it is absolutely wasteful to read all the meshes and such. Similarly, with hierarchical sketches, we might not be interested in many of them at all.

Cap'n Proto solves both quite elegantly. It looks like I'll have to prototype some code with both...
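
For what it's worth, a crude way to approximate a seek capability within protobuf itself would be to length-prefix the heavyweight sub-messages and skip over them. This is only a mitigation sketch with a hypothetical layout, not how Cap'n Proto does it:

```cpp
#include <cstdint>
#include <string>
#include <google/protobuf/io/coded_stream.h>
#include <google/protobuf/io/zero_copy_stream_impl_lite.h>

// Assumed layout: [varint len][header bytes][varint len][mesh bytes].
// Reads only the header and never touches the (potentially huge) mesh blob.
bool ReadHeaderOnly(const void *data, int size, std::string *headerBytes) {
    google::protobuf::io::ArrayInputStream raw(data, size);
    google::protobuf::io::CodedInputStream in(&raw);

    uint32_t len;
    if(!in.ReadVarint32(&len)) return false;
    if(!in.ReadString(headerBytes, len)) return false; // parse this part separately

    if(!in.ReadVarint32(&len)) return false;
    return in.Skip(len); // skip the mesh section entirely
}
```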

whitequark avatar Oct 15 '16 06:10 whitequark

@Evil-Spirit The nesting of associative links can get pretty deep even without assemblies, like if you sketch a contour, step translating, then step translating again for a 2d array, then extrude, then ...

jwesthues avatar Oct 15 '16 06:10 jwesthues

@jwesthues, this seems like a case where we could just apply a remapping function instead of storing all of this inside a map... or not?

Evil-Spirit avatar Oct 15 '16 07:10 Evil-Spirit

@Evil-Spirit How do you remap in a way that (a) uses a fixed-length identifier, (b) maintains the associative link, even when entities in the input group are added and deleted, and (c) doesn't require that table?

jwesthues avatar Oct 15 '16 07:10 jwesthues

@jwesthues, yes, (b) is a serious point. Actually, this is solved if we use pointers instead of ids at runtime and use ids only for saving... but that is a different story.

Evil-Spirit avatar Oct 15 '16 07:10 Evil-Spirit

@Evil-Spirit The table (or equivalent additional state) is still necessary if pointers are used instead of handles.

jwesthues avatar Oct 15 '16 07:10 jwesthues

It looks like I'll have to prototype some code with both...

Ah, here we are! There are FlatBuffers, which are basically the best parts of Protocol Buffers and Cap'n Proto fused together. And, unlike capnp, they properly support JSON serialization as well.
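
The property that matters most for us is zero-copy access: there is no parse step at all, so the accessors read straight out of the buffer (which could be mmap'd). A minimal sketch, assuming a hypothetical slvs.SaveFile table generated by flatc:

```cpp
#include "flatbuffers/flatbuffers.h"
#include "savefile_generated.h"  // hypothetical output of flatc for an slvs.SaveFile table

// No deserialization: the returned pointer is a view over the buffer itself,
// so untouched sections (meshes, linked sketches) are never read.
const slvs::SaveFile *OpenSaveFile(const void *buf) {
    return flatbuffers::GetRoot<slvs::SaveFile>(buf);
}
```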

whitequark avatar Oct 17 '16 14:10 whitequark

@jwesthues One more point I'm thinking about... with true hierarchical sketches, STEP/DXF linking, etc., it seems useful to be able to use a single file for interchange, instead of having a directory mess. We could always just put everything into the .slvs file as a nested structure or an inclusion of raw data, but I had what is perhaps a better idea: make the SolveSpace 3 file format use a zip outer container, much like the new Microsoft Office (or Open Office) documents work.

This way, a nested directory tree with many sketches, STEP files, etc, would have the exact same abstract representation whether it resides on the filesystem, or in a .slvs file, with relative paths being used unmodified to refer to the linked parts. Even better, one could use any off-the-shelf zip archiver to pack, unpack, and browse such files.

If the zip container uses the "store" compression method then we can directly read the embedded sketches, using mmap/MapViewOfFile. (And, since the flatbuffers/capnp format has relatively high entropy, external archivers will usually do this on their own, and we can of course use whatever compression method we find suitable.) Conversely, "deflate" could be used for low-entropy stuff like large STEP files.
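
A hedged sketch of what reading one embedded sketch out of such a container could look like, assuming libzip were the container library; the library choice and the entry name are assumptions, not decisions:

```cpp
#include <vector>
#include <zip.h>

// Reads "parts/bracket.slvs" out of an .slvs container; libzip transparently
// handles both "store" and "deflate" entries.
bool ReadEmbeddedSketch(const char *container, std::vector<char> *out) {
    int err = 0;
    zip_t *z = zip_open(container, ZIP_RDONLY, &err);
    if(z == NULL) return false;

    zip_stat_t st;
    if(zip_stat(z, "parts/bracket.slvs", 0, &st) != 0) { zip_close(z); return false; }

    zip_file_t *f = zip_fopen(z, "parts/bracket.slvs", 0);
    if(f == NULL) { zip_close(z); return false; }

    out->resize((size_t)st.size);
    zip_int64_t n = zip_fread(f, out->data(), out->size());
    zip_fclose(f);
    zip_close(z);
    return n == (zip_int64_t)out->size();
}
```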

whitequark avatar Oct 22 '16 13:10 whitequark

@whitequark I see the benefit. The rules for when to e.g. use a part from within the packed assembly file vs. from the filesystem are perhaps the tricky part. Use the filesystem if available, otherwise prompt the user to extract?

jwesthues avatar Oct 22 '16 17:10 jwesthues