dedupe
Scoring pairs is much slower after training than after loading a settings file.
This is going to be a pain to debug, I think.
To reproduce:
Get code for this linking project: https://github.com/labordata/fmcs-f7/tree/37e6e805ceb6ec8dee7844fbe7f45b71609066ad
make update_raw
rm link.csv
make link.csv
This will train dedupe and then do scoring and clustering. The scoring and clustering will be very slow.
rm link.csv
make link.csv
This will use the settings file created in the previous run, and scoring and clustering will be much faster.
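For reference, here is a minimal sketch of the two code paths being compared, using the dedupe 2.x Dedupe API (the fmcs-f7 project actually does record linkage, but the trained-object vs. settings-file distinction is the same). The field definition, file names, and threshold are placeholders, not taken from the project:

    import dedupe

    variables = [{'field': 'name', 'type': 'String'}]  # placeholder field definition

    # First run: train, write the settings file, then score and cluster
    # with the freshly trained object (slow in the reported case).
    deduper = dedupe.Dedupe(variables)
    deduper.prepare_training(data_d)
    dedupe.console_label(deduper)  # interactive labeling
    deduper.train()
    with open('learned_settings', 'wb') as f:
        deduper.write_settings(f)
    clusters = deduper.partition(data_d, 0.5)

    # Second run: load the settings file and score/cluster with a static
    # model (much faster in the reported case).
    with open('learned_settings', 'rb') as f:
        static_deduper = dedupe.StaticDedupe(f)
    clusters = static_deduper.partition(data_d, 0.5)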
I can confirm the same thing is happening with my own custom data. Once the settings file created in a previous run is loaded, scoring and clustering is much faster.
I played with this a bit. It seems the difference in runtime starts in the fillQueue function; however, as far as I could tell, the inputs to that function were the same both times. Much more memory was in use when doing the training and onward (according to psutil.Process(os.getpid()).memory_info().rss), so that could have to do with the performance difference.
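A quick sketch of how that measurement can be taken around the scoring step (psutil is the only extra dependency; the partition call here just stands in for whatever scoring/clustering call is being profiled):

    import os
    import psutil

    def rss_mb():
        # Resident set size of the current process, in megabytes.
        return psutil.Process(os.getpid()).memory_info().rss / 1024 ** 2

    print(f'before scoring: {rss_mb():.0f} MB')
    clusters = deduper.partition(data_d, 0.5)
    print(f'after scoring:  {rss_mb():.0f} MB')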
Changing the chunk_size parameter of fillQueue from 20,000 to 1,000 seemed to greatly improve the performance after training, and to slightly improve the performance when running from the settings file.
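In case it helps anyone repeat that experiment without editing the library, here is a rough sketch of forcing a smaller chunk size via a monkeypatch. It assumes (based only on this thread, not verified against the source) that dedupe.core.fillQueue accepts a chunk_size keyword argument defaulting to 20,000 and that its internal callers rely on that default rather than passing it explicitly:

    import functools

    import dedupe.core

    # Override the default chunk_size before any scoring happens.
    dedupe.core.fillQueue = functools.partial(dedupe.core.fillQueue, chunk_size=1000)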
In order to get this to run on my computer I reduced the data size by adding in:
data_d = readData(input_file)
data_d = {k: data_d[k] for k in list(data_d)[:3000]}
Thanks for this!
This makes me think that the data model is not getting cleaned up (related to #956). I would have thought the fixes for that would have addressed this too, but maybe not.