jesterj
Handle non-linear FTI
A recent discussion on another project made me realize that the FTI solution provided thus far only works well when the DAG has a single output to either Solr or Elasticsearch, with no concern for output from other termini or intermediate processors. The current system needs to be re-designed to account for:
- Understanding which steps in the DAG are terminal steps
- Tracking FTI status for each terminus
- Accounting for splitting of documents
- Possibly making documents aware of Terminal Steps they have completed such that routers aware of downstream termini can avoid re-sending the document
- Possibly making intermediate processors aware of whether they've output a given document before or not.
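The per-terminus tracking in the first two bullets could look roughly like the following sketch. All names here (`PerTerminusStatus`, `record`, `alreadyCompleted`) are illustrative assumptions, not the actual JesterJ API:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: FTI status tracked per document, per terminal step,
// rather than a single status for the whole DAG.
public class PerTerminusStatus {
  enum Status { PROCESSING, INDEXED, ERRORED, DROPPED }

  // documentId -> (terminalStepName -> status)
  private final Map<String, Map<String, Status>> statusByDoc = new HashMap<>();

  void record(String docId, String terminalStep, Status s) {
    statusByDoc.computeIfAbsent(docId, k -> new HashMap<>()).put(terminalStep, s);
  }

  // A router aware of downstream termini can skip re-sending a document
  // to a terminus that has already completed (or dropped) it.
  boolean alreadyCompleted(String docId, String terminalStep) {
    Status s = statusByDoc.getOrDefault(docId, Map.of()).get(terminalStep);
    return s == Status.INDEXED || s == Status.DROPPED;
  }

  public static void main(String[] args) {
    PerTerminusStatus fti = new PerTerminusStatus();
    fti.record("doc1", "solrSender", Status.INDEXED);
    fti.record("doc1", "elasticSender", Status.ERRORED);
    System.out.println(fti.alreadyCompleted("doc1", "solrSender"));    // true
    System.out.println(fti.alreadyCompleted("doc1", "elasticSender")); // false
  }
}
```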
One limitation on the implementation for this will be that we will make best-effort attempts to not repeat work already done, but we will not (in this release) attempt to guarantee a distributed transaction with an external side effect.
Practically speaking, this means there will be a low risk of duplicate writes, but search engines are normally built to provide idempotent writes, and the inherently side-effecting processors we ship write to search engines. Fetch URL could also be side-effecting, but HTTP GET requests are defined to be "safe," which implies idempotence as well.
The primary reason for this limitation is that Cassandra is eventually consistent and provides neither ACID nor XA transactions. Support for use cases where a single write per document is critical will require pluggable persistence and a TransactionalDocumentProcessor class/interface.
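A toy sketch of what such a pluggable, transactional processor might look like, with an in-memory stand-in for an ACID-capable status store. `TransactionalSketch` and everything in it are hypothetical names, not the real interface:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the pluggable-persistence idea: the status update and
// the external side effect succeed or fail together, which eventually-consistent
// Cassandra alone cannot guarantee. None of these names are the real JesterJ API.
public class TransactionalSketch {
  interface TransactionalDocumentProcessor {
    void processInTransaction(String docId, Runnable sideEffect);
  }

  // Toy in-memory stand-in for an ACID-capable backend.
  static class InMemoryProcessor implements TransactionalDocumentProcessor {
    final Map<String, String> status = new HashMap<>();

    public void processInTransaction(String docId, Runnable sideEffect) {
      try {
        sideEffect.run();              // e.g. the write to the search engine
        status.put(docId, "INDEXED");  // record status only if the write succeeded
      } catch (RuntimeException e) {
        status.remove(docId);          // leave no partial state behind
        throw e;
      }
    }
  }
}
```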
Now that I've got a serious working test of this use case (not yet checked in; it expands the Non Linear FTI test added here), I've uncovered 2 deficiencies:
- The model for Document doesn't match what we need to do with persistence, and this is ballooning code complexity
- The current focus on downstream potent steps is almost, but not quite, right; we probably need to track all paths through the DAG, not just destinations. This should simply be a matter of renaming our potent-step-related methods and turning the step names into step path designators. This can be done after item 1 is corrected and the unit test is passing again.
Also in the back of my mind (not new) is that we still need to up-rate idempotent steps to potent if they are the final step on a branch in the DAG, and error the plan at startup if there are any branch endings that are not potent (or up-rated idempotent).
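The up-rating rule could be sketched like this. `PlanValidator`, `effectivePotency`, and the `Potency` enum are assumptions for illustration, not JesterJ code:

```java
import java.util.List;
import java.util.Map;

// Sketch of the plan-validation rule described above: any step that ends a
// branch of the DAG must be potent. Idempotent steps in that position are
// up-rated; anything else fails the plan at startup.
public class PlanValidator {
  enum Potency { POTENT, IDEMPOTENT, SAFE }

  // dag maps each step name to its successor steps; no successors = branch end.
  static Potency effectivePotency(String step, Potency declared,
                                  Map<String, List<String>> dag) {
    boolean endsBranch = dag.getOrDefault(step, List.of()).isEmpty();
    if (endsBranch && declared == Potency.IDEMPOTENT) {
      return Potency.POTENT;  // up-rate the branch-ending idempotent step
    }
    if (endsBranch && declared == Potency.SAFE) {
      throw new IllegalStateException("Branch ends in non-potent step: " + step);
    }
    return declared;
  }
}
```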
The model is updated, and the key case of handling an error at one tail of a plan that looks like this without re-sending to the other destinations is working (except on GitHub Actions, where only 2 cores are available and the test thread can take nearly forever to come back from a sleep). Remaining:
- Handle idempotent steps too. These are outputs, and if missed they need to be executed, but it's not critical to avoid executing them multiple times. Both mid-stream and terminal idempotent steps need to be thought about here.
- Handle diamond-shaped plans, where the division duplicates documents. In this case we need to consider the steps immediately after the duplication as modifiers to each possible destination.
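One way to picture the "step path designator" idea for diamond plans: identify each copy of a document by the full path from the split, not just the terminus it reaches, so two copies arriving at the same terminus via different branches are tracked separately. A hypothetical sketch (none of these names exist in JesterJ):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative sketch: enumerate every path through the DAG from a given step
// as a "/"-joined designator, so each duplicated copy of a document gets a
// distinct FTI identity even when paths reconverge on the same destination.
public class PathDesignators {
  static List<String> designators(String step, Map<String, List<String>> dag) {
    List<String> successors = dag.getOrDefault(step, List.of());
    List<String> result = new ArrayList<>();
    if (successors.isEmpty()) {
      result.add(step);  // branch end: the path terminates here
      return result;
    }
    for (String next : successors) {
      for (String tail : designators(next, dag)) {
        result.add(step + "/" + tail);
      }
    }
    return result;
  }
}
```

In a diamond where `split` fans out to `a` and `b` and both feed `solr`, this yields two distinct designators (`split/a/solr` and `split/b/solr`) rather than one ambiguous `solr`.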
This is now only waiting on the documentation in the wiki being updated to match the implementation.
Found an issue: processors can no longer update status properly from within processor code, because they don't have a way to know the correct name for the status. That needs to be fixed.
The version referenced above is now mostly working except for a few rare failures in the unit tests. There have been 3 types of failures:
- 35 documents recorded instead of 30, indicating that one of the pause steps unpaused before the plan could be deactivated. That's more or less a test design/timing issue, not likely to indicate a problem in the code, and the error went away when I lengthened the pause time for the test.
- A single case where, apparently by chance, only one of the two paths in the FTI diamond test produced errors (though all the other counts lined up). This could theoretically happen by chance, and I only saw it once in 100 runs, so I suspect that too is a test design issue.
- An off-by-1 where it seems one document gets indexed twice. That is of concern because it violates our intention to provide no-more-than-once delivery. In 100 runs of the 2-into-3 FTI test, this type of failure showed up three times.
With the last fix the test has now passed 429 times in a row (running overnight) on my desktop. Unfortunately it still seems to fail 2/2 times on crave.io build infrastructure, so I need to focus on that now.
In re-reading my wiki discussion of FTI I realized I've not written a test involving child documents yet, and I anticipate some changes are needed to properly account for that use case. Many use cases don't involve child documents, so I will break that off, list it as a known issue for the beta-3 and 1.0 releases, and target 1.1 for child-doc support. There is a conceivable workaround: serialize the child documents to disk in the processor (as a potent step), and add a file scanner that picks those children back up off of disk and processes them. That would provide all the necessary FTI at the child level, but care would be required to ensure they had the right attributes (parentId, processing status, etc.).
Eventually the framework should hide these details of our implementation from users, but I'd rather put out a release that correctly covers 90% of use cases sooner rather than waiting a long time to handle every use case I can think of.
So once the tests pass on crave.io build infrastructure and the wiki is updated to match the implementation, this issue will be resolved.
Only doc updates remain, so braver folks can feel free to build and test this now.
Actually, in updating the docs I realize I still haven't got tests verifying handling of non-deterministic routers, nor a test demonstrating heuristic detection of permanent failures, so those tests are also required for completion. I also suspect there might be something not implemented carefully with respect to non-deterministic routers.
The FTI doc is updated here such that it is no longer in direct conflict with the implementation. Hopefully it still makes sense too :).
After drawing out some cases, thinking a bit, and updating the wiki, I'm convinced that although non-deterministic routers could lead to complex results, the current design of the system is entirely reasonable: the moment new destinations are selected, the excluded destinations are marked DROPPED (prior to passing the document on). I don't think any changes are needed. This issue can be closed :tada:
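That invariant can be sketched in a few lines. `RouterSketch`, `route`, and the status strings are illustrative only, not the real router API:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of the invariant described above: when a (possibly non-deterministic)
// router selects a subset of the downstream destinations, every excluded
// destination is immediately marked DROPPED before the document moves on,
// so FTI never waits on a destination that will never be visited.
public class RouterSketch {
  static Map<String, String> route(List<String> allDestinations,
                                   List<String> selected) {
    Map<String, String> status = new HashMap<>();
    for (String dest : allDestinations) {
      status.put(dest, selected.contains(dest) ? "PROCESSING" : "DROPPED");
    }
    return status;
  }
}
```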