Questions on scalability of doorstop
I have some questions that go in the direction of a potential optimization of Doorstop, in order to support large numbers of items/requirements (hundreds of thousands to millions).
After a quick test with 2000 requirements in two documents, I assume Doorstop was designed for hundreds to thousands of requirements. Is this correct?
Could you elaborate on the design decision to use one file per item?
What is your opinion (pros/cons) on having several items in one yaml file?
Would the validate routine run faster if references (the ref property) were restricted to a full path plus a tag name, so that Doorstop would not need to scan the whole Git repository to find the file?
Could we parallelize the validation routine, e.g., spawning a new process per document?
Thanks for reading and looking forward to any answers or ideas!
It is beyond my endurance to see a good question with no answers :)
I am not an active maintainer of Doorstop and I don't know the history, but I can try to answer some of your questions because I have already spent some time looking at the source code.
After a quick test with 2000 requirements in two documents, I assume doorstop was designed for hundreds to thousands requirements. Is this correct?
I think Doorstop's code was simply not written with performance as a goal. There are quite a few places in the code with room for optimization, including combining multiple items into one document file.
Could you elaborate on the design decision to use one file per item?
Doorstop's paper says nothing about it, but my guess would be that the conceptual answer is atomicity: each requirement item is a standalone file, which keeps it free of the surrounding context it would have if it were part of a bigger file with multiple items.
What is your opinion (pros/cons) on having several items in one yaml file?
I have proposed this myself, but for one Markdown file, here. In that proposal the goal is not performance but making it much easier to review requirements in GitLab/GitHub: when you review one document, you see a single Markdown file with all of its items. That is much closer to a normal code review process, whereas looking at items in separate files breaks the code review flow because you cannot see the connections between the separate item files.
Would the validate routine run faster if one would only allow references (ref property) to full paths plus a tag name, so doorstop would not need to scan the whole git repository to find the file?
It could be implemented on top of my long-standing PR, which changes an item from having one ref to having multiple references: RFC: Item#references: initial support of many ref items (Take 3) #423. I see no reason why this could not be done.
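To illustrate why this matters for performance, here is a minimal sketch of resolving a ref that already carries a repository-relative path plus a tag, so no repository scan is needed. The function name `resolve_ref` and the `path:TAG` format are my own illustration, not Doorstop's actual API or syntax.

```python
from pathlib import Path


def resolve_ref(root: Path, ref: str) -> tuple[Path, int]:
    """Resolve 'path/to/file:TAG' to (file, line number) without scanning the repo."""
    path_part, _, tag = ref.partition(":")
    target = root / path_part
    if not target.is_file():
        raise FileNotFoundError(f"ref target missing: {target}")
    if not tag:
        return target, 0  # plain file reference, no tag lookup needed
    # Only this one file is opened; a scan-based lookup would walk the whole tree.
    for lineno, line in enumerate(target.read_text().splitlines(), start=1):
        if tag in line:
            return target, lineno
    raise ValueError(f"tag {tag!r} not found in {target}")
```

The lookup cost becomes proportional to one file instead of the whole working tree, which is exactly the gain the question is after.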
Could we parallelize the validation routine, e.g., spawning a new process per document?
Yes, I see no problem here, but check my note on the backends below.
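A per-document worker pool could look roughly like the sketch below. `validate_document` is a stand-in for the real per-document validation routine, not Doorstop's API; note that cross-document link checks would still need a final single-process pass.

```python
from concurrent.futures import ProcessPoolExecutor


def validate_document(path: str) -> list[str]:
    # Placeholder for the real per-document validation work.
    return [f"{path}: ok"]


def validate_all(doc_paths: list[str]) -> dict[str, list[str]]:
    # Documents are validated independently, one worker process each
    # (up to the pool size), then the results are collected.
    with ProcessPoolExecutor() as pool:
        results = pool.map(validate_document, doc_paths)
    return dict(zip(doc_paths, results))
```

Since items within one document reference each other far more often than across documents, the document boundary is a natural unit of parallelism.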
Having said all of the above, I think what Doorstop is currently missing is the concept of a backend. It has only one working backend, separate item files stored in YAML, and the code for all of that is not properly decoupled from the core data model in the Item class.
There is one rather easy change that could be made: all the backend code could be extracted to a separate module such as doorstop/backend/default (or a better name).
Given an isolated backend layer, Doorstop could have a very clean core Item model that is agnostic of YAML/Markdown or any other storage details. After that, it would be easy to implement other kinds of backends, including one that stores items in bigger files.
This is a minimal set of files for a backend:

```
doorstop/
    backends/
        default/
            reader
            writer
            validator
```
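The contract such a backend module would have to satisfy could be sketched as a small protocol. Everything here is hypothetical, and a toy JSON backend stands in for the real YAML one just to keep the example dependency-free:

```python
import json
from typing import Iterable, Protocol


class Backend(Protocol):
    """Hypothetical contract every storage backend would implement."""

    def read(self, path: str) -> dict: ...
    def write(self, path: str, data: dict) -> None: ...
    def validate(self, data: dict) -> Iterable[str]: ...


class JsonBackend:
    """Toy one-item-per-file backend; a YAML backend would look the same."""

    def read(self, path: str) -> dict:
        with open(path) as f:
            return json.load(f)

    def write(self, path: str, data: dict) -> None:
        with open(path, "w") as f:
            json.dump(data, f, indent=2)

    def validate(self, data: dict) -> Iterable[str]:
        # Yield human-readable issues instead of raising, so callers
        # can collect all problems in one pass.
        if not data.get("text"):
            yield "item has no 'text' attribute"
```

The core Item model would then talk only to the `Backend` protocol, never to file formats directly.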
From the RTEMS forums I am aware that @sebhub also has concerns about the performance of Doorstop; see here, where he notes that PyYAML is likely a bottleneck.
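As an aside, one commonly cited mitigation for the PyYAML bottleneck is to use the libyaml-backed C loader when it is available; `CSafeLoader` is part of PyYAML's public API, but it only exists if PyYAML was built against libyaml, hence the fallback:

```python
import yaml

try:
    from yaml import CSafeLoader as Loader  # fast C implementation (libyaml)
except ImportError:
    from yaml import SafeLoader as Loader   # pure-Python fallback


def load_item(text: str) -> dict:
    # An empty YAML document parses to None, so normalize it to a dict.
    return yaml.load(text, Loader=Loader) or {}
```

This is only a constant-factor improvement, of course; it does not remove the cost of opening hundreds of thousands of small files.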
Having a clear backend layer would allow switching to different solutions such as JSON or single files, and doing so would not affect the Doorstop core, provided we manage to achieve this Core-versus-Backend isolation.
I could implement all of this myself if Doorstop had a good contract of end-to-end tests. There are some integration-level tests, but they are not collected into a coherent set, so I find it hard to embark on such a refactoring without a good end-to-end test harness.
With end-to-end tests, a Rosetta Stone kind of test could be implemented: start with some document, read and write it through all available backends and exports, and at the end verify that you still have the same document. This would make all backends conform to the same contract, enforced by the end-to-end tests.
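The round-trip idea can be sketched in a few lines. The `MemoryBackend` here is a trivial stand-in, invented purely so the example is self-contained; the real test would chain the actual file-based backends:

```python
class MemoryBackend:
    """Trivial in-memory backend used only to illustrate the round trip."""

    def __init__(self):
        self._store = {}

    def write(self, key: str, data: dict) -> None:
        self._store[key] = dict(data)

    def read(self, key: str) -> dict:
        return dict(self._store[key])


def roundtrip_equal(document: dict, backends, key: str = "doc") -> bool:
    # Push the document through every backend's write/read pair in turn;
    # a lossy backend anywhere in the chain breaks the final equality.
    current = document
    for backend in backends:
        backend.write(key, current)
        current = backend.read(key)
    return current == document
```

One such test per sample document would pin down the shared contract for every backend at once.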
I have proposed using the LLVM Integrated Tester here: RFC: Initial LIT-based integration tests setup #431, but it needs some more tooling around it, so currently the whole effort is not progressing well or quickly :(
Hope this all makes sense.
I guess it would require a bit of work to make Doorstop fit for millions of items. The items are just a directed acyclic graph of key-value pairs, so the data structure is fine. The problem is the stateless command line tool, which has to load everything from the file system on each invocation. I would probably re-implement it in C++ for this job and use some sort of caching. With that many items you also have to consider the rate of change and how the development process is organized in such a big repository.
The integration tests would be the first step.
Caching would help a lot, especially if it were integrated with the change tracking Git already does. If you scan for refs the first time Doorstop is initialized in a checkout, you would only need to scan the diffs between invocations, both for items/documents and for ref searches.
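Asking Git which files changed since the last run is cheap; a sketch of the idea, assuming the last validated commit hash has been cached somewhere (the function name and cache layout are made up for illustration):

```python
import subprocess
from pathlib import Path


def changed_paths(repo: Path, since_commit: str) -> list[str]:
    """Return files touched since the cached commit, including staged changes."""
    out = subprocess.run(
        ["git", "diff", "--name-only", since_commit],
        cwd=repo, capture_output=True, text=True, check=True,
    ).stdout
    return [line for line in out.splitlines() if line]
```

Only the items and ref targets under those paths would need to be re-read and re-scanned; everything else could come straight from the cache.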
These are things I've been thinking about as I work on Doorframe to speed things up considerably.