
scoring pairs is much slower after training than after loading a settings file.

Open fgregg opened this issue 2 years ago • 4 comments

this is going to be a pain to debug, i think.


To reproduce:

Get the code for this linking project: https://github.com/labordata/fmcs-f7/tree/37e6e805ceb6ec8dee7844fbe7f45b71609066ad

    make update_raw
    rm link.csv
    make link.csv

this will train dedupe and then do scoring and clustering. the scoring and clustering will be very slow.

    rm link.csv
    make link.csv

this will use the settings file created in the previous run, and scoring and clustering will be much faster.
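For context, a minimal sketch of the two code paths this exercises, assuming the project's link script follows the standard dedupe 2.x pattern; the field definition, settings file name, and data below are placeholders, not the project's actual code:

    import os
    import dedupe

    # placeholder input data; the real project links two labor-data tables
    data_1 = {0: {"name": "acme corp"}}
    data_2 = {1: {"name": "acme corporation"}}

    settings_file = "learned_settings"  # hypothetical file name

    if os.path.exists(settings_file):
        # fast path: reuse the model trained in a previous run
        with open(settings_file, "rb") as sf:
            linker = dedupe.StaticRecordLink(sf)
    else:
        # slow path: train from scratch, then persist the settings
        fields = [{"field": "name", "type": "String"}]
        linker = dedupe.RecordLink(fields)
        linker.prepare_training(data_1, data_2)
        dedupe.console_label(linker)  # interactive labeling
        linker.train()
        with open(settings_file, "wb") as sf:
            linker.write_settings(sf)

    # scoring + clustering: reportedly much slower right after training
    # than after loading the settings file
    linked_records = linker.join(data_1, data_2, threshold=0.5)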

fgregg avatar Mar 01 '22 05:03 fgregg

I can confirm the same is happening with my own data. Once the settings file created in a previous run is loaded, scoring and clustering is much faster.

caligoig avatar Mar 03 '22 22:03 caligoig

I played with this a bit. It seems the difference in runtime starts in the fillQueue function; however, as far as I could tell, the inputs to that function were the same both times. Much more memory was in use from training onward (according to psutil.Process(os.getpid()).memory_info().rss), so that could have to do with the performance difference.
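The kind of instrumentation described above can be dropped around the scoring call to compare the two runs; log_rss is a hypothetical helper here, and linker is assumed from the earlier sketch:

    import os
    import psutil

    proc = psutil.Process(os.getpid())

    def log_rss(label):
        # resident set size in MiB, to compare the two runs
        print(f"{label}: {proc.memory_info().rss / 2**20:.0f} MiB")

    log_rss("before scoring")
    linked_records = linker.join(data_1, data_2, threshold=0.5)  # the slow step
    log_rss("after scoring")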

Changing the chunk_size parameter of fillQueue from 20,000 to 1,000 seemed to greatly improve performance in the run that includes training, and to slightly improve performance in the run that loads the settings file.
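To make the chunking concrete, here is a simplified illustration of the general pattern, not dedupe's actual fillQueue implementation; candidate_pairs is a placeholder for the real stream of record pairs:

    from itertools import islice

    def chunked(pairs, chunk_size):
        # yield successive lists of at most chunk_size items: the general
        # pattern used to feed record pairs to the scoring workers
        it = iter(pairs)
        while chunk := list(islice(it, chunk_size)):
            yield chunk

    # with large records, a 20,000-pair chunk holds a lot of data in flight
    # at once; a 1,000-pair chunk trades scheduling overhead for memory
    candidate_pairs = iter([])  # placeholder for the real candidate stream
    for chunk in chunked(candidate_pairs, 1000):
        pass  # score the chunk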

In order to get this to run on my computer, I reduced the data size by adding:

    data_d = readData(input_file)
    # keep only the first 3,000 records to shrink the test data
    data_d = {k: data_d[k] for k in list(data_d)[:3000]}

adamzev avatar Jun 02 '22 18:06 adamzev

thanks for this!

fgregg avatar Jun 02 '22 18:06 fgregg

this makes me think that the data model is not getting cleaned up (related to #956). I would have thought the fixes for that would have addressed this too, but maybe not.
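If that hypothesis holds, one quick way to test it might be to free the training state explicitly before scoring; a sketch, assuming dedupe 2.x's cleanup_training() method and the linker from the earlier sketch:

    # hypothetical test: drop retained training data before scoring
    linker.train()
    linker.cleanup_training()  # frees training pairs the model no longer needs

    # if the slowdown disappears here, retained training state is the culprit
    linked_records = linker.join(data_1, data_2, threshold=0.5)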

fgregg avatar Jun 02 '22 19:06 fgregg