sphinx-needs
Duplicate Needs IDs not detected when running in parallel
When running a parallel build, duplicate need IDs aren't always detected. This is because need IDs are only scanned for duplicates when the documents are read, not when the results of the parallel processes are merged. During the merge, one environment simply overwrites the needs of the other, ignoring potential duplicates that were parsed in parallel: https://github.com/useblocks/sphinx-needs/blob/47bd4c54cd45b751b035fb0871f98c2d2d34f232/sphinx_needs/needs.py#L713-L715 Because of this implementation, whether the duplicates are detected depends on how the documents are chunked for reading.
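A minimal illustration of why the merge hides duplicates (made-up dictionary names and data, not the actual sphinx-needs code): combining the per-process need dictionaries with a plain dict update lets an ID parsed in one chunk silently replace the same ID parsed in another chunk.

```python
# Hypothetical per-process need dictionaries that both contain the same need ID
needs_main = {"REQ_001": {"docname": "doc_a"}}   # parsed by worker 1
needs_other = {"REQ_001": {"docname": "doc_b"}}  # same ID parsed by worker 2

needs_main.update(needs_other)   # the need from doc_b overwrites the one from doc_a
assert len(needs_main) == 1      # no warning, no error - the duplicate is invisible
```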
Thanks for the report and good finding 👍
I think a fix should be easily possible by checking the keys for duplicates during the build.
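For illustration, a minimal sketch of such a key check, with a hypothetical helper name and made-up data (the real per-process dictionaries appear as needs and other_needs in the snippet further down):

```python
def find_duplicate_ids(needs: dict, other_needs: dict) -> list:
    # Hypothetical helper: need IDs present in both per-process dictionaries
    return sorted(set(needs) & set(other_needs))


# Made-up example data: the shared ID "REQ_001" is reported
print(find_duplicate_ids({"REQ_001": {}, "REQ_002": {}}, {"REQ_001": {}}))
```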
I thought so too, but if you raise an exception in a sub-process, the build gets stuck because the process cannot be joined. I have a hacky quick fix for this, but it relies on some modifications to Sphinx itself in order to handle the error:
```python
from sphinx_needs.api.exceptions import NeedsDuplicatedId


def is_identical_need(need, other_need):
    # Convert lists and dicts to strings for a literal comparison
    return len(need.keys()) == len(other_need.keys()) and all(
        str(need[key]) == str(other_need[key]) for key in need.keys()
    )


needs = env.needs_all_needs
other_needs = other.needs_all_needs
for key in other_needs.keys():
    # Skip external needs: they are added to the environment before reading
    # and therefore show up in every process.
    is_local = not other_needs[key]["is_external"]
    if is_local and key in needs and not is_identical_need(needs[key], other_needs[key]):
        raise NeedsDuplicatedId(
            f"Found duplicate need ID {key} in {other_needs[key]['docname']}. "
            f"Need already defined in {needs[key]['docname']}."
        )
```
The `is_identical_need` check is the hacky part. It is necessary because, for some reason, the results of exactly one process are merged twice in our build.
Also, we need to check for external needs specifically, since they are added to the environment before reading. This means they will be included in every process.
Another question is whether the duplicate-ID check is the only validation that has a global scope and cannot run inside the individual processes, or whether there are other such validations. Would it then make sense to move such validations to a later stage, i.e. after the process merges are completed?
@twodrops: A later check is, in my understanding, not possible, as a need with the same ID may have already overwritten the original one during the merge. So there will never be more than one need per ID.
@Rubyfi: Ohh, I thought `merge_data()` gets executed by the main process for each subprocess: https://github.com/useblocks/sphinx-needs/blob/master/sphinx_needs/needs.py#L704
Didn't know that this is part of the subprocess call itself, which indeed makes it more complex.
I guess we need to find a solution here. This issue is creating non-deterministic pipeline behaviour in one of our projects, and they have categorized it as critical. Going back to a single-threaded Sphinx build is unfortunately not an option for this project with more than 33,000 rst files :(
I can confirm this issue in my project; it is causing non-deterministic errors.
I have a new idea for a solution.
We can also check for ID duplication while merging the result of a worker into the "global" Sphinx env.
As this `merge()` is called by a subprocess, raising an exception is not an option. But we can set a new variable in the global env if something bad happens during our merge operations: `merge_errors`, a list of strings where each entry describes a found error.
If `len(merge_errors) >= 1`, we know something bad happened and we can output/log the list elements as warnings.
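A rough sketch of how this `merge_errors` idea could look. The handler shape follows Sphinx's env-merge-info and env-check-consistency events; the attribute name needs_merge_errors, the function bodies, and the choice of the reporting hook are assumptions for illustration, not the actual implementation.

```python
from sphinx.util import logging

logger = logging.getLogger(__name__)


def merge_data(app, env, docnames, other):
    """Sketch of an env-merge-info handler that records duplicates instead of raising."""
    needs = env.needs_all_needs
    other_needs = other.needs_all_needs
    merge_errors = getattr(env, "needs_merge_errors", [])  # hypothetical attribute
    for need_id, need in other_needs.items():
        # External needs exist in every process, so only local needs can be real duplicates.
        if not need["is_external"] and need_id in needs:
            merge_errors.append(
                f"Duplicate need ID {need_id}: defined in {need['docname']} "
                f"and in {needs[need_id]['docname']}"
            )
        needs[need_id] = need  # keep the existing overwrite behaviour
    env.needs_merge_errors = merge_errors


def report_merge_errors(app, env):
    """Sketch of a later step in the main process (e.g. env-check-consistency)
    that turns the collected errors into build warnings."""
    for message in getattr(env, "needs_merge_errors", []):
        logger.warning(message)
```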