Parsing that depends on context conditions could be postponed

Open patrickdalla opened this issue 3 years ago • 21 comments

Parsing that depends on context conditions should be postponed. For example, a parser that uses information from other items should wait for those items to be processed first, so its processing should be postponed somehow.

I thought of implementing an Exception subclass of TikaException (ContextConditionNotMetException or ContextConditionPending) that could be thrown by the Parser implementation and caught by the ParsingTask. If that exception is thrown, the parsing task would put the evidence item in the next priority queue to try parsing it again. This way we don't need to change the API. But one problem remains: if the item has already been postponed to the last position of the last queue, the parser should parse it anyway, even without the condition met, to extract any other information that does not depend on the condition. So we have to signal that another condition is met: "last parsing try".
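A minimal sketch of this idea, under stated assumptions: `ItemParser`, `PostponeSketch` and the `lastTry` flag are hypothetical names, items are plain strings, and the exception extends `Exception` only to keep the example self-contained (the proposal above would extend TikaException instead). It is not IPED's ParsingTask code.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical: the proposal would make this a subclass of TikaException.
class ContextConditionPendingException extends Exception {
    ContextConditionPendingException(String message) {
        super(message);
    }
}

// Hypothetical stand-in for a parser that may depend on an external condition.
interface ItemParser {
    void parse(String item, boolean lastTry) throws ContextConditionPendingException;
}

class PostponeSketch {
    // Processes the items queue by queue. An item whose parser signals a pending
    // condition is moved to the next queue; on the final queue the parser is
    // called with lastTry=true so it can parse whatever it can.
    static List<String> run(List<String> items, ItemParser parser, int queueCount) {
        List<String> log = new ArrayList<>();
        Deque<String> current = new ArrayDeque<>(items);
        for (int q = 0; q < queueCount; q++) {
            boolean lastQueue = (q == queueCount - 1);
            Deque<String> next = new ArrayDeque<>();
            while (!current.isEmpty()) {
                String item = current.poll();
                try {
                    parser.parse(item, lastQueue);
                    log.add(item + " parsed in queue " + q);
                } catch (ContextConditionPendingException e) {
                    next.add(item); // postpone to the next priority queue
                    log.add(item + " postponed from queue " + q);
                }
            }
            current = next;
        }
        return log;
    }

    public static void main(String[] args) {
        // "dependent" needs a condition that is never met before the last try.
        ItemParser parser = (item, lastTry) -> {
            if (item.equals("dependent") && !lastTry) {
                throw new ContextConditionPendingException("condition not met");
            }
        };
        run(List.of("plain", "dependent"), parser, 3).forEach(System.out::println);
    }
}
```

In this sketch a parser that still throws on the last queue would simply lose its item, so a real implementation would need to treat that case as an ordinary parsing error.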

patrickdalla avatar Oct 05 '22 19:10 patrickdalla

Well, it is not as simple as I thought. The condition is discovered only in the ParsingTask, so the task should be paused and continued later, when the next priority queue is processed.

patrickdalla avatar Oct 05 '22 19:10 patrickdalla

The logic that sends items to the next queues is inside the AbstractTask.sendToNextTask(evidence) method.

lfcnassif avatar Oct 05 '22 20:10 lfcnassif

Line 211 is the main one.

lfcnassif avatar Oct 05 '22 20:10 lfcnassif

Yes. I'm there analyzing.

patrickdalla avatar Oct 05 '22 20:10 patrickdalla

Well, I could successfully process my small test case. I created a hashmap of "deprioritized items", each associated with the task that deprioritized it, so when it is time to reprocess the item, it restarts from where it stopped. This hashmap was created in the ProcessingQueues class (deprioritizedItems).
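The map described above could look roughly like the following sketch. It is only an illustration: the class name, the use of string ids and of a task index as the resume point are assumptions, not the actual ProcessingQueues code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical registry, loosely modelled on the "deprioritizedItems" map.
class DeprioritizedItems {
    // item id -> index of the task that postponed the item
    private final Map<String, Integer> resumePoint = new LinkedHashMap<>();

    // Records that the task at taskIndex postponed the item.
    synchronized void deprioritize(String itemId, int taskIndex) {
        resumePoint.put(itemId, taskIndex);
    }

    // Returns the task index to resume from and forgets the entry;
    // items that were never postponed start at task 0.
    synchronized int takeResumeIndex(String itemId) {
        Integer index = resumePoint.remove(itemId);
        return index == null ? 0 : index;
    }

    // Number of items currently waiting to be reprocessed.
    synchronized int pendingCount() {
        return resumePoint.size();
    }
}
```

Removing the entry when the item is resumed keeps the map from growing and lets the same item be postponed again later by a different task.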

But the counters became a little confused and now show negative times to finish.

Tomorrow I will continue.

patrickdalla avatar Oct 05 '22 20:10 patrickdalla

I think that context variables must also be preserved somewhere so they can be reestablished when the item starts being processed again.

patrickdalla avatar Oct 05 '22 21:10 patrickdalla

What would happen if file A depends on condition C, and file B depends on file A processing, but condition C never happens?

PS: I think the current approach handles this specific situation.

lfcnassif avatar Oct 06 '22 11:10 lfcnassif

The method "maxDeprioritizeItem" returns a boolean indicating whether the postponement could be done. If it could not, the ParsingTask would set a "last try" metadata variable and call the parser one last time. The parser can detect this "last try" metadata and run any parsing that does not depend on the condition, flagging that the parse was done with an unsatisfied condition.

In sequence:
1 - "A" will be postponed to the last queue, as C is not met.
2 - "B" will be postponed, as "A" hasn't finished.
3 - The next attempt to process "A" will be postponed again, now to the last position of the last queue. This will be the last postponement.
4 - The next attempt to process "B" will be postponed again, now to the last position of the last queue. This will be the last postponement.
5 - The next attempt to process "A" will not be postponed, as it cannot be postponed any further. The parsing task will process "A" with the "last try" metadata set to true.
6 - The next attempt to process "B" will not be postponed, as it cannot be postponed any further. The parsing task will process "B" with the "last try" metadata set to true.
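The six steps above can be reproduced with a small single-threaded model. This is a sketch under assumptions: the limit of two postponements, the single queue and the class name are illustrative, and only the method name `maxDeprioritizeItem` and its boolean contract come from the description above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of the "last try" rule; not IPED code.
class LastTrySketch {
    static final int MAX_POSTPONES = 2; // assumed limit, as in the steps above
    private final Map<String, Integer> postpones = new HashMap<>();

    // Returns true if the item could still be postponed (and counts the postponement).
    boolean maxDeprioritizeItem(String item) {
        int count = postpones.getOrDefault(item, 0);
        if (count >= MAX_POSTPONES) {
            return false;
        }
        postpones.put(item, count + 1);
        return true;
    }

    // "A" waits for condition C; "B" waits for "A" to be parsed with C met.
    List<String> run(boolean conditionC) {
        List<String> log = new ArrayList<>();
        Set<String> complete = new HashSet<>();
        Deque<String> queue = new ArrayDeque<>(List.of("A", "B"));
        while (!queue.isEmpty()) {
            String item = queue.poll();
            boolean ready = item.equals("A") ? conditionC : complete.contains("A");
            if (ready) {
                complete.add(item);
                log.add(item + " parsed");
            } else if (maxDeprioritizeItem(item)) {
                queue.add(item); // back to the last position
                log.add(item + " postponed");
            } else {
                log.add(item + " parsed on last try"); // condition still unmet
            }
        }
        return log;
    }
}
```

Calling `new LastTrySketch().run(false)` walks through the same six steps: two postponements for each item, then a last-try parse of "A" followed by a last-try parse of "B". Because the model is single-threaded, it says nothing about ordering when items in a queue are processed concurrently.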

patrickdalla avatar Oct 06 '22 12:10 patrickdalla

Example in which C is met, with B executing before A:

In sequence:
1 - "B" will be postponed, as "A" hasn't finished.
2 - "A" will be postponed to the last queue, as C is not met.
3 - The next attempt to process "B" will be postponed again, now to the last position of the last queue. This will be the last postponement.
4 - Process "D" makes the changes that satisfy condition C.
5 - The next attempt to process "A" will finish successfully.
6 - The next attempt to process "B" will finish successfully.

patrickdalla avatar Oct 06 '22 12:10 patrickdalla

I think that context variables must also be preserved somewhere so they can be reestablished when the item starts being processed again.

On second thought, context variables belong to the context and do not depend on the evidence item. Any changes will already be visible to subsequent calls.

patrickdalla avatar Oct 06 '22 12:10 patrickdalla

What would happen if file A depends on condition C, and file B depends on file A processing, but condition C never happens?

PS: I think the current approach handles this specific situation.

The current approach should be more independent from code. Maybe in a

patrickdalla avatar Oct 06 '22 13:10 patrickdalla

3 - The next attempt to process "A" will be postponed again, now to the last position of the last queue. This will be the last postponement.
4 - The next attempt to process "B" will be postponed again, now to the last position of the last queue. This will be the last postponement.

I think B could be postponed before A if B is seen first. Also, all items in each processing queue (including the last queue) currently are processed in parallel.

lfcnassif avatar Oct 06 '22 13:10 lfcnassif

"currently are processed in parallel." Well remembered. This can be a problem.

patrickdalla avatar Oct 06 '22 13:10 patrickdalla

In the past, instead of hard-coding which queue number each item goes to, I thought about automatically building the processing priorities (which items go to each queue) from annotations in parsers, e.g. "Parser A depends on Parser B results". This would make it easy to plug in parsers as plugins. But as the number of parser dependencies grows, and if third-party plugin parsers not under our control are installed, I think this could eventually cause a cyclic dependency...

So I left the priority configuration as is. One simple improvement would be to move it to the iped-parsers module or to externalize it to a configuration file.
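To make the cyclic-dependency concern concrete, here is a hedged sketch of how queue numbers could be derived from declared dependencies, rejecting cycles along the way. It uses Kahn's topological ordering; `QueuePlanner`, the map-based input and the parser names are assumptions for illustration, not an existing IPED API.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper: derives queue numbers from declared parser dependencies.
class QueuePlanner {
    // dependsOn maps a parser name to the parsers whose results it needs.
    // Returns parser -> queue number; throws if the declarations form a cycle.
    static Map<String, Integer> plan(Map<String, List<String>> dependsOn) {
        Set<String> parsers = new HashSet<>(dependsOn.keySet());
        dependsOn.values().forEach(parsers::addAll);

        Map<String, Integer> pending = new HashMap<>();        // unmet dependency count
        Map<String, Set<String>> dependents = new HashMap<>(); // reverse edges
        for (String p : parsers) {
            Set<String> deps = new HashSet<>(dependsOn.getOrDefault(p, List.of()));
            pending.put(p, deps.size());
            for (String d : deps) {
                dependents.computeIfAbsent(d, k -> new HashSet<>()).add(p);
            }
        }

        Map<String, Integer> queueOf = new HashMap<>();
        Deque<String> ready = new ArrayDeque<>();
        for (String p : parsers) {
            if (pending.get(p) == 0) {
                queueOf.put(p, 0); // no dependencies: first queue
                ready.add(p);
            }
        }
        int processed = 0;
        while (!ready.isEmpty()) {
            String done = ready.poll();
            processed++;
            for (String p : dependents.getOrDefault(done, Set.of())) {
                // A parser runs one queue after its latest dependency.
                queueOf.merge(p, queueOf.get(done) + 1, Math::max);
                if (pending.merge(p, -1, Integer::sum) == 0) {
                    ready.add(p);
                }
            }
        }
        // Parsers that never became ready are part of, or blocked by, a cycle.
        if (processed != parsers.size()) {
            throw new IllegalStateException("Cyclic parser dependency detected");
        }
        return queueOf;
    }
}
```

With the single declaration "ParserA depends on ParserB", `plan` puts ParserB in queue 0 and ParserA in queue 1. Two parsers that declare a dependency on each other make it throw instead of producing an ordering, which is the failure mode a third-party plugin could trigger.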

lfcnassif avatar Oct 06 '22 13:10 lfcnassif

patrickdalla closed this as completed 19 minutes ago

Was this intentional?

Anyway, do you think this is needed to implement #281?

lfcnassif avatar Oct 06 '22 13:10 lfcnassif

"So I left the priority configuration as is. One simple improvement would be to move to iped-parsers module or to externalize it to a configuration file" Yes, this can be a better option.

We can declare something like this:
<attendingParsers> RegistryParser </attendingParsers>
<dependableParsers> AresParser </dependableParsers>

With this kind of declaration we can build the sequence of queues so that the attending parsers execute first.

Had you thought

patrickdalla avatar Oct 06 '22 13:10 patrickdalla

patrickdalla closed this as completed 19 minutes ago

Was this intentional?

Anyway, do you think this is needed to implement #281?

No. Sorry.

patrickdalla avatar Oct 06 '22 13:10 patrickdalla

"Anyway, do you think this is needed to implement https://github.com/sepinf-inc/IPED/issues/281?" Yes.

Since one of the goals is to extract dates, and ARES dates are local times without timezone information, the timezone information has to be obtained from somewhere else.

Maybe I can use the current fixed scheme.
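As an illustration of why the timezone must come from elsewhere, the sketch below resolves a zone-less local date-time against a zone supplied by the caller, using the standard `java.time` API. The class name is hypothetical, and representing the ARES value as a `LocalDateTime` is an assumption about the data.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;

// Hypothetical helper: the zone is assumed to have been extracted earlier,
// for example from an item processed in a previous queue.
class LocalDateResolver {
    // Interprets a zone-less local date-time in the given zone.
    static Instant toInstant(LocalDateTime local, ZoneId zone) {
        return local.atZone(zone).toInstant();
    }

    public static void main(String[] args) {
        LocalDateTime local = LocalDateTime.of(2022, 10, 6, 9, 40);
        // The same local value maps to different instants in different zones.
        System.out.println(toInstant(local, ZoneId.of("UTC")));
        System.out.println(toInstant(local, ZoneId.of("-03:00")));
    }
}
```

The two printed instants differ by three hours, which is the ambiguity that remains if the parser has no timezone source available when it runs.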

patrickdalla avatar Oct 06 '22 13:10 patrickdalla

Since one of the goals is to extract dates, and ARES dates are local times without timezone information, the timezone information has to be obtained from somewhere else.

Maybe I can use the current fixed scheme.

I think configuring to process Registry files in queue 1 and Ares in queue 2 would be enough.

lfcnassif avatar Oct 06 '22 13:10 lfcnassif

I think configuring to process Registry files in queue 1 and Ares in queue 2 would be enough.

Actually, maybe no change is needed, because today Registry files are processed in queue 0 and Ares in queue 1.

lfcnassif avatar Oct 06 '22 13:10 lfcnassif

Yes. You are right.

patrickdalla avatar Oct 06 '22 13:10 patrickdalla

Example in which C is met:

In sequence:
1 - "A" will be postponed to the last queue, as C is not met.
2 - "B" will be postponed, as "A" hasn't finished.
3 - The next attempt to process "A" will be postponed again, now to the last position of the last queue. This will be the last postponement.
4 - Process "D" makes the changes that satisfy condition C.
5 - The next attempt to process "B" will be postponed again, now to the last position of the last queue. This will be the last postponement.
6 - The next attempt to process "A" will finish successfully.
7 - The next attempt to process "B" will finish successfully.

patrickdalla avatar Oct 11 '22 07:10 patrickdalla

Example in which C is met:
In sequence:
1 - "A" will be postponed to the last queue, as C is not met.
2 - "B" will be postponed, as "A" hasn't finished.
3 - The next attempt to process "A" will be postponed again, now to the last position of the last queue. This will be the last postponement.
4 - Process "D" makes the changes that satisfy condition C.
5 - The next attempt to process "B" will be postponed again, now to the last position of the last queue. This will be the last postponement.
6 - The next attempt to process "A" will finish successfully.
7 - The next attempt to process "B" will finish successfully.

I still think this may not work if B is seen before A and because items in the same queue could be processed concurrently.

lfcnassif avatar Oct 11 '22 13:10 lfcnassif

I suggest leaving this as a future improvement for when it is really needed and the current approach is not enough to implement something.

lfcnassif avatar Oct 11 '22 13:10 lfcnassif

Right. I agree.

patrickdalla avatar Oct 11 '22 13:10 patrickdalla

I'm closing this for now. We can reopen in the future if it is needed.

lfcnassif avatar Jan 24 '23 19:01 lfcnassif