
More efficient file format for needs database than needs.json

Open arwedus opened this issue 1 year ago • 15 comments

Problem description

sphinx-needs stores all needs with all fields in needs.json. If you have a large meta-model with lots of custom attributes, every single need adds several KB worth of data. All fields get written to the need dictionary, even those with empty default values.

For a large project, this file ends up being huge, even more so with lots of needs from other sources that are imported as needs directives via a custom extension. We see 130 MB for a project that is still in its "incubation phase". This leads to a noticeable wait in the "writing needs.json" phase, and to impressive disk space usage if you think about caching different needs.json versions.

Please rewrite the needs export (and import) to use a more space-efficient storage format, without sacrificing much runtime in the read/write phase.

I have some ideas for open discussion, nothing I've yet investigated:

  • only store non-default values for each need (see the sketch after this list). This would already be a huge space saver, and the default values are easy to get back. All you would need is another entry with a meta-dictionary, or maybe a schema that you save along with the project's needs.json. [edit CJS: done, see https://sphinx-needs.readthedocs.io/en/latest/changelog.html#needs-json-improvements]
  • Use a different file format that supports optional fields, like Google Protobuf, or maybe msgspec
  • Internally change the way needs are kept in memory by organizing needs into containers per need type. I.e. if you have the need types "req-sys", "req-sw", "test", and "spec": 4 containers. Change the configuration of needs options to define the allowed extra need options per need type. When writing needs.json, store needs grouped by need type. This way, you optimize both RAM and disk usage. I understand that this may require a major rewrite of central parts of sphinx-needs, but I just wanted to put it into the discussion here for completeness, as this idea has also been discussed since 2020.
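To make the first idea concrete, here is a minimal sketch (with hypothetical helper names, not sphinx-needs API) of storing only non-default values and restoring them from a defaults schema:

import json

# Hypothetical defaults schema, saved once alongside the project's needs.json
DEFAULTS = {"status": None, "tags": [], "links": []}

def sparsify(need: dict) -> dict:
    # Keep only fields that differ from their default value.
    return {k: v for k, v in need.items() if k not in DEFAULTS or v != DEFAULTS[k]}

def expand(sparse: dict) -> dict:
    # Merge the defaults back in; stored values win.
    return {**DEFAULTS, **sparse}

need = {"id": "REQ_1", "status": None, "tags": [], "links": ["SPEC_1"]}
assert expand(sparsify(need)) == need
print(json.dumps(sparsify(need)))  # {"id": "REQ_1", "links": ["SPEC_1"]}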

arwedus avatar Dec 06 '23 12:12 arwedus

Thanks for the issue, and I totally understand the problem. JSON was selected as the format of choice because it is simple and supported by several languages out of the box. Also, the internal implementation was quite easy and short.

However, I'm totally open to replacing/extending the supported formats, but I would like to match this to use cases, as they may have different requirements for a format.

Use cases

I see these use cases:

  1. Network Transport
  2. Archiving
  3. Data exchange with other services/tools
  4. Customized handling (fast prototyping)

Network Transport

  • shall be small
  • content shall be use-case optimizable
  • Specific parts shall be selectable, without transferring everything at once

Archiving

  • shall be small
  • shall be complete
  • shall be usable also in X years
  • can be validated (may contain a hash)

Data exchange with other services/tools

  • shall contain needed data only
  • shall contain content and all used artifacts (images and other linked files)
  • can be validated (may contain a hash)
  • Shall be an accepted standard, maybe already supported by other tools (for instance ReqIF)

Customized handling (fast prototyping)

This use case is, for me, the main reason why needs.json was chosen.

  • shall be complete
  • shall be easy to understand and use
  • Shall work without any changes or configuration

Technical formats

As proposed by @arwedus:

  • Protobuf
  • msgspec

I also put on the table:

  • sqlite
  • bjson
  • ReqIF
  • zip + several json files (maybe this is more "packaging" and could be done with any format)

Additional requirements:

  • It shall be possible to internally transform the selected format to the needs.json format; this would make an implementation quite easy

Proposal

I haven't checked all of the formats in detail, but currently I would go with:

  • ReqIF for
    • Data exchange with other services/tools
    • Archiving (+ zipping the data with own scripts)
  • Shorten needs.json per need (so a different packaging)
    • Network Transport

For the other use cases, I would stay with the normal needs.json.

Reasons for staying with needs.json for Network Transport:

Nowadays, all data gets compressed with gzip by default before it is sent to the server, so selecting a binary format here may not bring the size benefit expected.
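A quick way to check the effect locally (plain Python, assuming a needs.json sits in the current directory):

import gzip
import pathlib

raw = pathlib.Path("needs.json").read_bytes()
packed = gzip.compress(raw)
print(f"raw: {len(raw) / 1e6:.1f} MB, gzipped: {len(packed) / 1e6:.1f} MB")

JSON with many repeated keys and empty default values typically compresses very well, which shrinks the transport-size gap to binary formats.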

Also, the problem is not the data size of a single need, but that we can only fetch all data at once. If we change the packaging and e.g. create one file per need, the permalink feature could request only the needed data.

But for "WebApps" a REST-API would be normally the way to go, and would include real-time filtering ;)

New features

So based on the proposal, these features would be needed:

  • Support ReqIF as im/export format
  • Allow need-export functions to define the data to export
  • Maybe support ZIP files, like needs.zip, containing multiple JSON files.

Open for any discussion :)

danwos avatar Dec 06 '23 13:12 danwos

@danwos: Well, you know our main use case (multi-project build and build caching), yet I don't find it represented in your list. For our use case, I'd definitely stay with one file, and I'd prefer to have it either human-readable or binary (like Parquet or Protobuf), but not first JSON, then a ZIP file. Also, for the "network transport" use case (maybe the closest one), multiple files aren't better than a single file. I mean, we have rclone for partial updates, and Azure Blob Storage performs poorly on many small files.

However, I feel that you have not addressed the obvious yet: it's simply unnecessary to store parts of the needs that do not add information, in any of those scenarios. And that's actually the optimization I'm requesting here. We would already get a factor of 5–10 size reduction in needs.json (depending on the complexity of the metamodel used in a project), while retaining compatibility.

arwedus avatar Dec 06 '23 13:12 arwedus

... it's simply unnecessary to store parts of the needs that do not add information

I can't entirely agree in all cases, but to stay backward-compatible we also can't touch the needs.json export.

However, there can be a specialized builder, which outputs only the needed data and by default no internal data.

An idea for an implementation (see the conf.py sketch after this list):

  • New builder needs_short
  • New config option needs_short_data: List of options to export. If empty, all of them are used.
  • New config option needs_short_empty: Whether to export empty options (Default: False).
  • New config option needs_short_filename: Default: needs.short.
  • New config option needs_short_format: Supported file format, maybe json, pickle, ... (Default: json)
  • New config option needs_short_structure: one of object, doc, project, which defines whether the needs data is stored in several files or one big file. (Default: project)
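If this were implemented, the configuration might look roughly as follows (a sketch only — none of these options exist yet, they just mirror the list above):

# conf.py
extensions = ["sphinx_needs"]

needs_short_data = ["id", "type", "title", "status", "links"]  # empty list = export all
needs_short_empty = False            # do not export empty options
needs_short_filename = "needs.short"
needs_short_format = "json"          # or "pickle", ...
needs_short_structure = "project"    # "object" | "doc" | "project"

The export would then be produced by the new builder, e.g. sphinx-build -b needs_short <srcdir> <outdir>.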

So if we touch the export/import mechanism, we should do it right and use an extensible architecture. Also, the exported file itself should contain some information about the export config used.

danwos avatar Dec 07 '23 06:12 danwos

@danwos @arwedus

I guess the discussion revolves around one fact:

How do we distinguish between an undefined value and an empty value? The topic above is mostly about undefined values, I guess.

Different serialization formats define this in different ways.

In the case of XML, you omit the element completely to mark it as undefined.

For example:

<name>hugo</name>
<age>99</age>

Below means the value of age is empty:

<name>hugo</name>
<age/>

and below means the value of age is undefined:

<name>hugo</name>

As XML does not have a standard way to represent boolean values, how to treat an undefined boolean versus an empty boolean is a bit tricky, but I will skip that case for now :)

I checked the JSON spec and found that JSON does not specify this as explicitly as XML does.

In JSON you can specify undefined as:

{
  "name": "hugo"
}

or

{
  "name": "hugo",
  "age": null
}

In the second form you explicitly state the value as undefined, but this again bloats up big JSONs.

I feel that, in the end, how an undefined element is represented in JSON depends on the use case, but it should be handled consistently in the whole structure for semantic clarity, and it should be documented to avoid any ambiguity.
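For illustration, a JSON consumer in Python can tell the two cases apart with a sentinel (a minimal sketch, not sphinx-needs code):

import json

doc = json.loads('{"name": "hugo", "age": null}')
MISSING = object()  # sentinel to distinguish "key absent" from "explicit null"

age = doc.get("age", MISSING)
if age is MISSING:
    print("age is undefined (key absent)")
elif age is None:
    print("age is explicitly marked undefined (null)")  # this branch runs here
else:
    print("age =", age)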

To summarize:

IMHO, we should not have different serialization formats or different builders. If Sphinx-Needs is meant to be used in large organizations with large projects, I guess we can "fix" it in the current needs.json with a good deprecation strategy.

twodrops avatar Dec 07 '23 07:12 twodrops

I feel it becomes even more important to solve this as early as possible, due to the unsolved topic in Sphinx-Needs that options and links are global.

https://github.com/useblocks/sphinx-needs/discussions/601

If you have hundreds of options and links, as in a large project like ours, it simply bloats the JSON with empty keys for all options and links, making the JSON quite unusable.

twodrops avatar Dec 07 '23 07:12 twodrops

... it's simply unnecessary to store parts of the needs that do not add information

I can't entirely agree in all cases, but to stay backward-compatible we also can't touch the needs.json export. ... New builder needs_short ... So if we touch the export/import mechanism, we should do it right and use an extensible architecture.

Yay, more options! ;-)

I think a "sparse_needs" builder could be 100% compatible with the current needs builder, but that may require forking python's own json export/import.

For a 100% compatible format, I guess we could add a "needs_schema" section to the needs.json (below "version"...) and let the parser construct need dictionary entries from the combination of the needs_schema and the non-default values in the need elements. If there is no "needs_schema" section, the current needs builder behavior is used.

Here's an example to demonstrate the idea:

{
    "current_version": "1.0",
    "project": "needs test docs",
    "versions": {
        "1.0": {
            "filters": {},
            "filters_amount": 0,
            "needs_schema": {
                "avatar": "",
                "closed_at": "",
                "completion": "",
                "created_at": "",
                "description": "",
                "docname": "",
                "duration": "",
                "external_css": "",
                "external_url": "",
                "full_title": "",
                "hidden": "",
                "id": "",
                "id_complete": "",
                "id_parent": "",
                "id_prefix": "",
                "is_external": false,
                "is_need": true,
                "is_part": false,
                "layout": null,
                "links": [],
                "max_amount": "",
                "max_content_lines": "",
                "parent_need": null,
                "parent_needs": [],
                "parent_needs_back": [],
                "parts": {},
                "post_template": null,
                "pre_template": null,
                "query": "",
                "section_name": "",
                "sections": [],
                "service": "",
                "signature": "",
                "specific": "",
                "status": null,
                "style": null,
                "tags": [],
                "template": null,
                "title": "",
                "type": "",
                "type_name": "",
                "updated_at": "",
                "url": "",
                "user": ""
            },
            "needs": {
                "TEST_01": {
                    "description": "TEST_01",
                    "docname": "index",
                    "external_css": "external_link",
                    "external_url": "file:///home/daniel/workspace/sphinx/sphinxcontrib-needs/tests/doc_test/external_doc/__error__#TEST_01",
                    "full_title": "TEST_01 DESCRIPTION",
                    "id": "TEST_01",
                    "id_complete": "TEST_01",
                    "id_parent": "TEST_01",
                    "is_external": true,
                    "links": ["SPEC_1"],
                    "title": "TEST_01 DESCRIPTION",
                    "type": "impl",
                    "type_name": "Implementation"
                }
            }
        }
    }
}
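A reader for such a format could then be as simple as this sketch (a hypothetical helper, not part of sphinx-needs):

import json

def load_needs(path: str) -> dict:
    # Merge the schema defaults back into each sparse need entry.
    with open(path) as f:
        data = json.load(f)
    version = data["versions"][data["current_version"]]
    schema = version.get("needs_schema", {})  # absent -> classic verbose format
    return {
        need_id: {**schema, **sparse}
        for need_id, sparse in version["needs"].items()
    }

needs = load_needs("needs.json")
print(needs["TEST_01"]["status"])  # None, restored from the schema defaults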

arwedus avatar Dec 07 '23 08:12 arwedus

Thanks for the input. @twodrops: There are no undefined options in Sphinx-Needs, as all undefined elements coming from Sphinx get an empty value from Sphinx-Needs: "". That's needed to support the filter strings and to not force the user to write perfect Python statements for filtering. For instance, :filter: author.startswith("Frank") would not work if unused author fields were set to None. Therefore, Sphinx-Needs can't distinguish between an unset value and an empty value during build time anymore. But I think this is not a big problem, as "" (empty string) can't be set as field information inside Sphinx/RST if the option is defined as text.
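A minimal sketch of why the empty-string default matters for filter strings (simplified; the real logic lives in filter_single_need):

need = {"id": "REQ_1", "author": ""}  # option not set by the user -> ""
print(eval('author.startswith("Frank")', {}, need))  # False, as intended

# If unset options were None instead, the same filter would raise:
# AttributeError: 'NoneType' object has no attribute 'startswith'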

@arwedus: The idea of providing a list of defaults for options not specified later is good. It is not a problem to implement something like this in Sphinx-Needs and make it work.

The problem is that this is still not backward-compatible, as the format is used by external scripts from different users all around the world. Any logic to identify which way to go would need to be implemented in all these scripts as well. Scripts may not work if a new option schema is used or data is missing from the need definition itself. Therefore, I would like to have a different name instead of needs.json, to make such a non-backward-compatible difference obvious.

The implementation and used format can then be anything.

danwos avatar Dec 07 '23 10:12 danwos

@danwos: I'd argue that adding a key to a file that was not in there before should not break any reasonably well-implemented scripts.

You just released sphinx-needs 2.0, and we had to update some of our extensions. Nobody complained. I guess when you release a sphinx-needs 3.0 that adds an option to needs.json to support, in addition to the established verbose needs format, a sparse needs format, there will be few people who have to adapt their scripts, if any. And those hopefully will also not complain, because such changes are backward-compatible and quick to adapt to.

If we never added a potentially breaking change, we would still be stuck with Sphinx 4.3 and sphinxcontrib.needs 0.7.

Of course, we would need a new configuration option needs_write_sparse_json or something like that, which defaults to False (the current behavior and needs.json output).

/edit: If it's about the name, you could let needs_write_sparse_json = False default to the file name needs.json, and needs_write_sparse_json = True to the file name sparse_needs.json, for example. This way we avoid builder proliferation. The sphinx-needs external_needs import should be able to deal with both versions in a future release.

arwedus avatar Dec 07 '23 10:12 arwedus

IMHO, the proposals of @arwedus go in the right direction. We will never have the one right format that fits all use cases. So the best would be to have self-contained data sets that specify their structure in a standard way. But we should also have an option in sphinx-needs to specify the schema or ontology that we expect as the input source (needs import / external needs). The data (in or out) can be validated against these ontologies to remove any ambiguity.

Also agree that we should not be afraid of breaking changes in sphinx-needs right now.

r-o-b-e-r-t-o avatar Dec 07 '23 11:12 r-o-b-e-r-t-o

Thanks everyone for the engaging discussion.

I like the idea of @arwedus and @r-o-b-e-r-t-o in general, because this is an easy way to solve the problem. However, I have some concerns as well, especially if the expectations for needs_schema in the needs.json grow over time.

There are pros and cons to packing a schema into the JSON in a self-contained way. ReqIF does this too, for example, by packing the schema within the XML of each document. Then every document created carries the schema with it and does not reference a centrally defined schema. This might result in schema duplication and, due to that, inconsistencies. Also, if we do it, shouldn't the schema have even more information than the list of attributes and links? For example, type, allowed values, etc. If we had such a schema for Sphinx-Needs, we would have solved the problem I mentioned above already :) https://github.com/useblocks/sphinx-needs/discussions/601

I just want to make sure that we scope the needs_schema correctly if we go with this.

Also agree that we should not be afraid of breaking changes in sphinx-needs right now.

I agree to that as well :)

twodrops avatar Dec 07 '23 13:12 twodrops

Heya, so I definitely have thoughts on this 😉

I wanted to start, though, by baselining how the needs data is actually generated/used internally within the extension.

As you know, in sphinx-needs 2.0 I spent a lot of time centralising access to needs data: https://github.com/useblocks/sphinx-needs/blob/4294f92f5db51abcbdea814e6631a01b1019656e/sphinx_needs/data.py#L400, and trying to start formalising its schema: https://github.com/useblocks/sphinx-needs/blob/4294f92f5db51abcbdea814e6631a01b1019656e/sphinx_needs/data.py#L59

I've also opened a few related issues #996, #997, #1014

Below is a summary of everywhere get_or_create_needs() is actually used:

Needs data creation / mutation

  • sphinx_needs.needs.prepare_env simply initialises the empty needs dict (ID -> need item)

  • sphinx_needs.data.merge_data merges the needs from the "worker" process into the "main" process (a simple dict.update)

  • sphinx_needs.data.purge_needs removes all needs from the needs dict, that have the specified docname

  • sphinx_needs.external_needs.load_external_needs, for each need data item dict, looks for an existing need of the same ID and deletes it, then passes the read need dict on to sphinx_needs.api.add_external_need

  • sphinx_needs.api.add_need adds the need to the needs dict

  • sphinx_needs.api.add_external_need calls sphinx_needs.api.add_need

  • sphinx_needs.api.del_need removes a need from the needs dict (by ID)

  • sphinx_needs.directives.analyse_need_locations adds fields to the need item, based on the specifying directive's location in the document tree (like parent need and section). Run after the full document has been parsed to a doctree.

  • sphinx_needs.directives.post_process_needs_data adds/modifies fields on the need items. It is run after the full project documentation has been parsed and all needs have been added to the needs dict.

    • extend_needs_data: Use data gathered from needextend directives to modify fields
    • resolve_dynamic_values: Find and replace text like [[ my_func(a, b, c) ]] in all fields
      • Hard-coded to bypass fields: docname, lineno, content, content_node, content_id
      • Calls sphinx_needs.functions.execute_func
    • resolve_variants_options: for fields specified by the user in the config needs_variant_options, the value is converted by match_variants; the context includes the destructured need item

Need data read-only

  • Processing of nodes generated in directives:

    • sphinx_needs.directives.needs.format_need_nodes
      • calls find_and_replace_node_content for each Need node, which looks through all Text and reference nodes and replaces text like [[ my_func(a, b, c) ]] by calling sphinx_needs.functions.execute_func.
      • calls sphinx_needs.layout.build_need, which takes some fields from the required need item, then passes it to LayoutHandler.__init__ where it is also used; also calls create_need
    • sphinx_needs.directives.needbar.process_needbar
    • sphinx_needs.directives.needextract.process_needextract
      • Calls sphinx_needs.layout.create_need, which takes some fields from the required need item
    • sphinx_needs.directives.needfilter.process_needfilters
    • sphinx_needs.directives.needflow.process_needflow
    • sphinx_needs.directives.needgantt.process_needgantt
    • sphinx_needs.directives.needlist.process_needlist
    • sphinx_needs.directives.needpie.process_needpie
    • sphinx_needs.directives.needsequence.process_needsequence
    • sphinx_needs.directives.needtable.process_needtables
    • sphinx_needs.directives.needuml.process_needuml
      • Passes the full needs dict as a Jinja context variable for rendering UML content
      • Unpacks the need item into local variables for use in the Jinja context when rendering the diagram template
    • sphinx_needs.directives.needreport
      • Calls analyse_needs_metrics, to count needs by type.
  • Processing of nodes generated in roles:

    • sphinx_needs.roles.need_count.process_need_count
    • sphinx_needs.roles.need_incoming.process_need_incoming
    • sphinx_needs.roles.need_outgoing.process_need_outgoing
    • sphinx_needs.roles.need_ref.process_need_ref
  • sphinx_needs.warnings.process_warnings: the list of (non-external) need items is passed to filter_needs

  • sphinx_needs.builder

    • NeedsBuilder.finish() passes the full needs list to filter_needs to generate the list to output
    • NeedsIdBuilder.finish() passes the full needs list to filter_needs to generate the list to output

Exposure of the need object(s) to the user

From this, I think it's also worth noting that there are three main ways the need data object (i.e. the internal representation of the needs data) is exposed to the user:

  • sphinx_needs.functions.functions.resolve_variants_options evaluates user-defined Python expressions, with the context including the unpacked need item

  • sphinx_needs.functions.execute_func calls user-defined functions, passing to the functions:

    1. The specific need item
    2. The full needs dict
  • filter_single_need uses eval, with the context being:

    1. Unpacking the need item into local variables
    2. setting a local variable of current_need which is the full need item
    3. (optional) setting a local variable of needs which is the list needs passed to filter_single_need

Obviously, changing the internal representation has implications for these. Personally, I feel that filter_single_need and resolve_variants_options should just use Jinja expressions; then undefined context keys can be specifically handled. Obviously, though, this is a major breaking change.
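For illustration, a minimal sketch of what Jinja-expression-based filtering could look like (not current sphinx-needs behaviour):

from jinja2 import Environment

env = Environment()
expr = env.compile_expression('status == "open"')

print(expr(status="open", id="REQ_1"))  # True
print(expr(id="REQ_2"))  # False: the undefined 'status' compares unequal instead of raising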

Also to note: needs already have a jinja_content option to turn the content into a Jinja template. This seems to some degree a duplicate of find_and_replace_node_content, which gets called for all rendered needs and where the [[ my_func(a, b, c) ]] syntax is essentially a bespoke subset of Jinja functionality. I would also suggest it is a bug that, if jinja_content is True, the Jinja is rendered immediately, rather than in a post-processing step after the need data item has been fully processed.


Lastly, to note: there are NeedsList.JSON_KEY_EXCLUSIONS_NEEDS and sphinx_needs.utils.INTERNALS, which determine some of the behaviour of how needs are represented/exported. I feel this information should be co-located on the need item "model" itself.


I agree that the handling of validation and defaults can/should be improved. Ideally, this would be compatible with the current code base.

One thing that I haven't seen mentioned here yet is https://github.com/pydantic/pydantic, which is widely used for data validation.

chrisjsewell avatar Feb 16 '24 16:02 chrisjsewell

Hi Chris, thanks for giving the insights into the code.

Regarding pydantic, there is already the extension by @ubmarco, which brings pydantic together with Sphinx-Needs: https://sphinx-modeling.useblocks.com/

danwos avatar Feb 19 '24 07:02 danwos

Heya, you may want to check out #1125 as an improvement to the current situation (one that doesn't require too much back-breaking)

chrisjsewell avatar Feb 22 '24 12:02 chrisjsewell

We want to have the possibility to specify the keys exported. E.g. currently, the back links are no longer exported. This leads to the issue that we would have to rebuild the links in later tooling, too, and such tools are not well prepared to do so. See #1157

PhilipPartsch avatar Mar 26 '24 10:03 PhilipPartsch