
Reduce loops in input ingestion comparison

Open osi1880vr opened this issue 1 year ago • 1 comments

This change adjusts the flow a little to get rid of some really large loops. The original has 3 loops:

1. Loop 1 runs over the distinct result from the store to create an array with only file names and no paths.
2. Loop 2 runs file_list times, and each pass runs loop 3.
3. Loop 3 runs distinct_files times.

In my case loop 1 ran 700,000 times, loop 2 ran 5,000 times, and loop 3 ran 700,000 times, so loops 2 and 3 combined form one larger loop of 5,000 * 700,000 = 3,500,000,000 iterations. That's 3.5 billion.
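To make the cost concrete, here is a small sketch of the nested-loop shape described above, with an iteration counter added. The function name `nested_comparison` and the sample data are made up for illustration, and this is not the library's actual code. It assumes the inner loop stops at the first match, so the full product is only reached when the files are not in the store.

```python
def nested_comparison(file_list, distinct_files):
    """Hypothetical nested-loop comparison (loops 2 and 3 from the description)."""
    found_list, not_found_list = [], []
    iterations = 0
    for name in file_list:                # loop 2: once per requested file
        found = False
        for stored in distinct_files:     # loop 3: scan the stored file names
            iterations += 1
            if stored == name:
                found = True
                break                     # assumption: stop at the first match
        if found:
            found_list.append(name)
        else:
            not_found_list.append(name)
    return found_list, not_found_list, iterations


# Worst case: nothing is found, so every requested file scans the whole store.
_, _, worst = nested_comparison(["x.pdf", "y.pdf"], ["a.pdf", "b.pdf", "c.pdf"])
print(worst)              # 2 * 3 comparisons
print(5_000 * 700_000)    # the same product at the sizes reported above
```

The inner scan is what makes the cost grow with the product of the two sizes rather than their sum.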

What I did is reduce the output already in loop 1 to just the data needed for this iteration, which is the files that are both in the store and in file_list.

Then there is no loop 2 and no loop 3, as I now just subtract found_list from file_list, and that gives us not_found_list.
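A minimal sketch of that idea, not the actual patch: one pass over the stored paths keeps only the names that were asked for, and not_found_list is whatever is left of file_list. The function name `compare_with_store` and the parameter `stored_paths` are assumptions for illustration, and using a set for the membership test is my choice here rather than something taken from the change itself.

```python
import os


def compare_with_store(file_list, stored_paths):
    """Hypothetical single-pass comparison of requested files against the store."""
    wanted = set(file_list)               # constant-time membership checks
    found = set()
    for path in stored_paths:             # the only loop over the store
        name = os.path.basename(path)     # strip the path, keep the file name
        if name in wanted:
            found.add(name)
    # "Subtract" the found names from file_list, keeping the original order.
    found_list = [f for f in file_list if f in found]
    not_found_list = [f for f in file_list if f not in found]
    return found_list, not_found_list


found_list, not_found_list = compare_with_store(
    ["a.pdf", "x.pdf"],
    ["/docs/a.pdf", "/docs/b.pdf", "/docs/c.pdf"],
)
print(found_list, not_found_list)
```

With this shape the work grows with the size of the store plus the size of file_list, instead of their product.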

This is a big win in computation time.

Additionally, I added a shortcut at the beginning: if file_list is empty, we do nothing and return empty arrays, as that would be the result anyway.
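The shortcut could look roughly like the sketch below. `compare_with_shortcut` and `fetch_stored_names` are hypothetical names; the callable stands in for whatever query returns the stored file names, and the point is only that it never runs when there is nothing to compare.

```python
def compare_with_shortcut(file_list, fetch_stored_names):
    """Hypothetical comparison with an early return for an empty file_list."""
    if not file_list:
        # Nothing to look for: skip the store entirely.
        return [], []
    stored = set(fetch_stored_names())    # assumption: returns bare file names
    found_list = [f for f in file_list if f in stored]
    not_found_list = [f for f in file_list if f not in stored]
    return found_list, not_found_list


calls = []


def fake_fetch():
    """Stand-in for the store query; records each call."""
    calls.append(1)
    return ["a.pdf"]


print(compare_with_shortcut([], fake_fetch), len(calls))
```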

My guess is that the same would also work for input_ingestion_comparison_from_parser_state, but since I could not test that one (I have no use case for it yet), I left it untouched. If someone sees this and knows the bigger picture, please have a look at whether it could work the same way there.

osi1880vr avatar Feb 16 '24 16:02 osi1880vr

@osi1880vr - thanks for this contribution and focus on this issue - it is an important area of optimization. I will go through it this afternoon - please give me 1-2 days, as I will run it through a lot of tests as part of integrating it into the main code base. 😄

doberst avatar Feb 16 '24 16:02 doberst

👍

doberst avatar Feb 18 '24 14:02 doberst