Parsing that depends on context conditions could be postponed
Parsing that depends on context conditions should be postponed. For example, a parser that uses information from other items needs those items to be processed first, so its own processing should be postponed somehow.
I thought of implementing a subclass of TikaException (ContextConditionNotMetException or ContextConditionPending) that the Parser implementation could throw and the ParsingTask could catch. If that exception is thrown, the parsing task would put the evidence item in the next priority queue to try parsing it again. This way we don't need to change the API. One problem remains: if the item has been postponed until it reaches the last position of the last queue, the parser should parse it anyway, even with the condition unmet, to extract any other information that does not depend on that condition. So we have to tell the parser that a different condition now holds: "last parsing try".
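A minimal sketch of that idea, under loud assumptions: the exception here extends plain Exception so the example is self-contained (in IPED it would extend TikaException), ConditionalParser and PostponingParsingTask are hypothetical stand-ins for the real Parser and ParsingTask, and "last try" is simplified to "the item is in the last queue".

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical exception; in IPED it would extend org.apache.tika.exception.TikaException.
class ContextConditionPendingException extends Exception {
    ContextConditionPendingException(String message) {
        super(message);
    }
}

// Hypothetical stand-in for a parser that may need a context condition.
interface ConditionalParser {
    void parse(String item, boolean lastTry) throws ContextConditionPendingException;
}

// Hypothetical stand-in for ParsingTask plus the priority queues.
class PostponingParsingTask {
    private final List<Deque<String>> queues = new ArrayList<>();
    private final ConditionalParser parser;

    PostponingParsingTask(int numQueues, ConditionalParser parser) {
        for (int i = 0; i < numQueues; i++) {
            queues.add(new ArrayDeque<>());
        }
        this.parser = parser;
    }

    void add(int queue, String item) {
        queues.get(queue).add(item);
    }

    // Processes the queues in priority order and returns the postponed items.
    List<String> run() {
        List<String> postponed = new ArrayList<>();
        for (int q = 0; q < queues.size(); q++) {
            boolean lastQueue = q == queues.size() - 1;
            Deque<String> queue = queues.get(q);
            while (!queue.isEmpty()) {
                String item = queue.poll();
                try {
                    // Simplification: being in the last queue means "last try".
                    parser.parse(item, lastQueue);
                } catch (ContextConditionPendingException e) {
                    if (!lastQueue) {
                        postponed.add(item);
                        queues.get(q + 1).add(item); // retry in the next priority queue
                    }
                    // On the last try there is nowhere left to postpone to.
                }
            }
        }
        return postponed;
    }
}
```

A parser written against this sketch would throw the exception only while lastTry is false, and fall back to condition-independent parsing otherwise.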
Well, it is not as simple as I thought. The condition is discovered only inside the ParsingTask, so the task should be paused and continued later, when the next priority queue is processed.
The logic that sends items to the next queues is inside the AbstractTask.sendToNextTask(evidence) method.
Line 211 is the main one.
Yes, I'm analyzing that part now.
Well, I managed to process my small test case. I created a hashmap of "deprioritized items", each associated with the task that deprioritized it, so that when it is time to reprocess the item, it restarts from where it stopped. This hashmap lives in the ProcessingQueues class (deprioritizedItems).
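Roughly, the bookkeeping described above could look like the sketch below. Everything except the field name deprioritizedItems is an assumption: the DeprioritizedRegistry class, its methods, and the use of ConcurrentHashMap (chosen here only because queue items may be handled by several worker threads) are hypothetical and do not reflect the real ProcessingQueues code.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical helper; the real ProcessingQueues class has a different shape.
class DeprioritizedRegistry<I, T> {
    // Item -> task that postponed it.
    private final Map<I, T> deprioritizedItems = new ConcurrentHashMap<>();

    // Remember which task postponed the item.
    void deprioritize(I item, T task) {
        deprioritizedItems.put(item, task);
    }

    // Returns the task to resume from, or firstTask if the item was never postponed.
    // The entry is removed so a later pass starts normally.
    T resumeTask(I item, T firstTask) {
        T task = deprioritizedItems.remove(item);
        return task != null ? task : firstTask;
    }

    // Number of items still waiting to be resumed.
    int pending() {
        return deprioritizedItems.size();
    }
}
```

Exposing a pending count is one possible way to keep progress counters consistent, since postponed items would otherwise be counted more than once.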
But the counters got a bit confused and now show negative estimated times to finish.
Tomorrow I will continue.
I think context variables must also be preserved somewhere so they can be re-established when the item starts being processed again.
What would happen if file A depends on condition C, and file B depends on file A processing, but condition C never happens?
PS: I think current approach handles this specific situation.
The method "maxDeprioritizeItem" returns a boolean indicating whether the item could be postponed. If it could not, the ParsingTask would set some "last try" metadata variable and call the parser one last time. The parser can detect this "last try" metadata, run any parsing that does not depend on the condition, and report that the parse was done with an unsatisfied condition.
In sequence:
1 - "A" will be postponed to the last queue, as C is not met.
2 - "B" will be postponed, as "A" hasn't finished.
3 - The next try to process "A" will be postponed again, now to the last position of the last queue. This will be the last postponement.
4 - The next try to process "B" will be postponed again, now to the last position of the last queue. This will be the last postponement.
5 - The next try to process "A" will not be postponed again, as there is no more reason to postpone. The parsing task will process "A" with the "last try" metadata set to true.
6 - The next try to process "B" will not be postponed again, as there is no more reason to postpone. The parsing task will process "B" with the "last try" metadata set to true.
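The sequence above can be replayed with a small single-threaded simulation. This is only a sketch under stated assumptions: one queue stands in for all priority queues, each item may be postponed at most twice (mirroring the two postponements per item in the steps), "B" counts as ready only if "A" was parsed with its condition met, and the signature of maxDeprioritizeItem is invented for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical simulation; not the real ParsingTask or ProcessingQueues logic.
class LastTrySimulation {
    static final int MAX_POSTPONES = 2; // assumption taken from the steps above

    // conditionC tells whether condition C is met from the start.
    static List<String> run(boolean conditionC) {
        Deque<String> queue = new ArrayDeque<>(List.of("A", "B"));
        Map<String, Integer> postpones = new HashMap<>();
        Set<String> complete = new HashSet<>(); // items parsed with their condition met
        List<String> log = new ArrayList<>();
        while (!queue.isEmpty()) {
            String item = queue.poll();
            // "A" needs C; "B" needs "A" to have been parsed completely.
            boolean ready = item.equals("A") ? conditionC : complete.contains("A");
            if (!ready && maxDeprioritizeItem(item, postpones, queue)) {
                log.add("postpone " + item);
                continue;
            }
            if (ready) {
                complete.add(item);
                log.add("parse " + item);
            } else {
                // No postponement left: parse with the "last try" flag set.
                log.add("parse " + item + " (last try)");
            }
        }
        return log;
    }

    // Returns true if the item could be postponed to the end of the queue.
    static boolean maxDeprioritizeItem(String item, Map<String, Integer> postpones, Deque<String> queue) {
        int count = postpones.getOrDefault(item, 0);
        if (count >= MAX_POSTPONES) {
            return false;
        }
        postpones.put(item, count + 1);
        queue.add(item);
        return true;
    }
}
```

Calling LastTrySimulation.run(false) reproduces steps 1 to 6: two postponements each for "A" and "B", then both parsed on their last try. The simulation is sequential, so it says nothing about items of the same queue being processed in parallel.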
Example where C is eventually met, with B processed before A:
In sequence:
1 - "B" will be postponed, as "A" hasn't finished.
2 - "A" will be postponed to the last queue, as C is not met.
3 - The next try to process "B" will be postponed again, now to the last position of the last queue. This will be the last postponement.
4 - Process "D" makes the changes that satisfy condition C.
5 - The next try to process "A" will finish successfully.
6 - The next try to process "B" will finish successfully.
I think context variables must also be preserved somewhere so they can be re-established when the item starts being processed again.
On second thought, context variables belong to the context and do not depend on the evidence item. Any changes will already be visible to subsequent calls.
What would happen if file A depends on condition C, and file B depends on file A processing, but condition C never happens?
PS: I think current approach handles this specific situation.
The current approach should be more independent from code. Maybe in a
3 - Next try to process "A" will be postponed again, now to the last position of the last queue. This will be the last postpone. 4 - Next try to process "B" will be postponed again, now to the last position of the last queue. This will be the last postpone.
I think B could be postponed before A if B is seen first. Also, all items in each processing queue (including the last queue) are currently processed in parallel.
"are currently processed in parallel." Good point. This can be a problem.
In the past I considered, instead of hard-coding which queue each item goes to, building the processing priorities (which item goes to which queue) automatically from annotations in parsers, e.g. Parser A depends on Parser B results. This would make it easy to plug in parsers as plugins. But as the number of parser dependencies grows, and if third-party plugin parsers outside our control are installed, I think this could eventually cause a cyclic dependency...
So I left the priority configuration as is. One simple improvement would be to move it to the iped-parsers module or to externalize it to a configuration file.
patrickdalla closed this as completed 19 minutes ago
Was this intentional?
Anyway, do you think this is needed to implement #281?
"So I left the priority configuration as is. One simple improvement would be to move to iped-parsers module or to externalize it to a configuration file" Yes, this can be a better option.
We can declare something like this:
With this kind of declaration we can build the sequence of queues so that the parsers being depended on run first.
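To make the idea of deriving queues from declared dependencies concrete, here is a purely illustrative sketch. It is not the declaration proposed in this comment: the @DependsOn annotation, the stub parser classes, and QueuePlanner are all hypothetical names. Each parser gets the queue number one above its deepest dependency, and a cyclic dependency is reported instead of looping forever.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical annotation: "this parser needs the results of these parsers".
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.TYPE)
@interface DependsOn {
    Class<?>[] value();
}

// Hypothetical stub parsers used only for illustration.
class RegistryParserStub {}

@DependsOn(RegistryParserStub.class)
class AresParserStub {}

@DependsOn(CyclicB.class)
class CyclicA {}

@DependsOn(CyclicA.class)
class CyclicB {}

class QueuePlanner {
    // Maps each parser (and its dependencies) to a queue number.
    static Map<Class<?>, Integer> plan(List<Class<?>> parsers) {
        Map<Class<?>, Integer> queueOf = new LinkedHashMap<>();
        for (Class<?> parser : parsers) {
            assign(parser, queueOf, new HashSet<>());
        }
        return queueOf;
    }

    private static int assign(Class<?> parser, Map<Class<?>, Integer> queueOf, Set<Class<?>> inProgress) {
        Integer known = queueOf.get(parser);
        if (known != null) {
            return known;
        }
        if (!inProgress.add(parser)) {
            // The parser is already on the current dependency path: a cycle.
            throw new IllegalStateException("Cyclic parser dependency at " + parser.getSimpleName());
        }
        int queue = 0;
        DependsOn deps = parser.getAnnotation(DependsOn.class);
        if (deps != null) {
            for (Class<?> dep : deps.value()) {
                queue = Math.max(queue, assign(dep, queueOf, inProgress) + 1);
            }
        }
        inProgress.remove(parser);
        queueOf.put(parser, queue);
        return queue;
    }
}
```

With these stubs, QueuePlanner.plan(List.of(AresParserStub.class, RegistryParserStub.class)) places the registry stub in queue 0 and the Ares stub in queue 1, while planning CyclicA throws IllegalStateException.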
Had you thought
patrickdalla closed this as completed 19 minutes ago
Was this intentional?
Anyway, do you think this is needed to implement #281?
No. Sorry.
"Anyway, do you think this is needed to implement https://github.com/sepinf-inc/IPED/issues/281?" Yes.
Since one of the goals is to extract dates, and ARES dates are LocalTimes without timezone information, the timezone must be obtained from somewhere else.
Maybe I can use the current fixed scheme.
Since one of the goals is to extract dates, and ARES dates are LocalTimes without timezone information, the timezone must be obtained from somewhere else.
Maybe I can use the current fixed scheme.
I think configuring to process Registry files in queue 1 and Ares in queue 2 would be enough.
I think configuring to process Registry files in queue 1 and Ares in queue 2 would be enough.
Actually, maybe no change is needed, because today Registry files are processed in queue 0 and Ares in queue 1.
Yes. You are right.
Example where C is eventually met:
In sequence:
1 - "A" will be postponed to the last queue, as C is not met.
2 - "B" will be postponed, as "A" hasn't finished.
3 - The next try to process "A" will be postponed again, now to the last position of the last queue. This will be the last postponement.
4 - Process "D" makes the changes that satisfy condition C.
5 - The next try to process "B" will be postponed again, now to the last position of the last queue. This will be the last postponement.
6 - The next try to process "A" will finish successfully.
7 - The next try to process "B" will finish successfully.
I still think this may not work if B is seen before A and because items in the same queue could be processed concurrently.
I suggest leaving this as a future improvement for when it is really needed and when current approach is not enough to implement something.
Right. I agree.
I'm closing this for now. We can reopen in the future if it is needed.