
"disk has reached capacity" issue with moderate record size and >500 GB of free disk space

Open zwarshavsky opened this issue 3 years ago • 15 comments

Related to #581, which I am unsure why it was closed, as a resolution does not seem to be available.

I have run into the same disk and memory issue as reported previously: on a large SageMaker instance with 1 TB of disk space on 1M records, and locally with 500 GB of disk space on 10M records.

Is there a way to set a size limit on this temp file, and where is this temp file located?

zwarshavsky avatar Feb 18 '22 16:02 zwarshavsky
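
ED: not part of the original thread; a minimal sketch of how to check and redirect the scratch directory, assuming dedupe places its scratch files (the SQLite blocking database and the memmapped arrays mentioned later in the thread) via Python's standard tempfile module. The path "/mnt/bigdisk/tmp" is a hypothetical mount with more room.

import tempfile

# Where scratch files land by default on this machine
print(tempfile.gettempdir())

# Redirect scratch files to a larger volume for this process; setting the
# TMPDIR (Linux/macOS) or TMP/TEMP (Windows) environment variable before
# Python starts has the same effect.
tempfile.tempdir = "/mnt/bigdisk/tmp"  # hypothetical path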

can you post a traceback and your data model?

fgregg avatar Feb 18 '22 18:02 fgregg

Data models: see next comment.

The "Step 2" model caused issues on SageMaker with a 1M-row input file.

Both times the failure showed up at the OS or kernel level. The last local run with 10M rows forced a restart of my machine when the remaining disk space got dangerously low (it started with 330 GB free).

On SageMaker the error was something like "disk or database has reached capacity", and it killed the IPython kernel.

ED: removed attached settings files

zwarshavsky avatar Feb 18 '22 18:02 zwarshavsky

do you have any sense of where in the program you were?

fgregg avatar Feb 18 '22 19:02 fgregg

could you just post the dictionary definition of the data model?

fgregg avatar Feb 18 '22 19:02 fgregg

model 1:

fields = [
    {'field': 'email', 'type': 'String'},
    {'field': 'phone_number', 'type': 'String'}
    ]

model 2:

fields = [
    {'field': 'full_name', 'type': 'Name'},
    {'field': 'full_address', 'type': 'Address' },
    {'field': 'phone_number', 'type': 'String'}
    ]

zwarshavsky avatar Feb 18 '22 19:02 zwarshavsky
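
ED: for context, a minimal sketch (not from the thread) of how field definitions like these are typically wired into a dedupe 2.x run; data_d is assumed to be the dictionary of records keyed by record ID.

import dedupe

deduper = dedupe.Dedupe(fields)
deduper.prepare_training(data_d)
dedupe.console_label(deduper)  # interactive labeling session
deduper.train()

# the step where the disk usage discussed in this issue shows up
clustered_dupes = deduper.partition(data_d, 0.5)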

On deduper.partition()

zwarshavsky avatar Feb 18 '22 19:02 zwarshavsky

from the information we have here, I think this is probably not a bug.

within the partition method, there are a lot of places where potentially very large objects get written to disk. historically, we have not really done anything to reduce disk usage.

here are the places where we write a lot to disk, and some possible mitigations:

  1. The blocking map. This is written to a SQLite database. Virtual compound predicates might help a little bit, but beyond that there is not a lot we can do.
  2. The join that produces the record pairs. If this query leads SQLite to produce a temporary materialization, it could be very big. There is potentially a lot that could be done here.
  3. The scored pairs, which are written to a memmapped numpy array. If we did some pre-filtering of the scores, as we have previously discussed, that would likely help significantly.

I'm open to all of these types of changes, but I would want to start with actually knowing where the bottleneck is. @zwarshavsky, could you put in some monitoring to see where in the pairs method you run out of disk space?

fgregg avatar Feb 19 '22 23:02 fgregg
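
ED: a rough sketch, not from the thread, of the kind of monitoring requested above: log the free space on the scratch volume from a background thread while partition() runs, then line the timestamps up with dedupe's own debug logging to see which stage exhausts the disk.

import logging
import shutil
import tempfile
import threading
import time

logging.basicConfig(level=logging.INFO)
logging.getLogger("dedupe").setLevel(logging.DEBUG)  # dedupe logs its progress under the "dedupe" logger

def watch_disk(path=tempfile.gettempdir(), interval=30):
    # Report remaining space on the volume holding the temp files every `interval` seconds.
    while True:
        free_gb = shutil.disk_usage(path).free / 1e9
        logging.info("free space on %s: %.1f GB", path, free_gb)
        time.sleep(interval)

threading.Thread(target=watch_disk, daemon=True).start()

# ...then call deduper.partition(...) as usual and compare timestamps.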

@fgregg Do you think it would be useful if dedupe had some profiling code built in? This sort of debugging/guesswork seems fairly common. I'm no expert in this, but following this example it would perhaps just require adding @profile to all the functions we care about. Then for debugging you just ask people to run mprof run myscript.py and post the output of mprof plot. We could add profile as an extra so that it isn't a required dependency of dedupe. Something similar could be done for disk space usage, though there doesn't seem to be quite as turnkey a solution.

NickCrews avatar Feb 20 '22 20:02 NickCrews

interesting, is @profile really a no-op when profiling isn't enabled?

fgregg avatar Feb 20 '22 22:02 fgregg

Good thought. I don't know for sure, but looking at the source code my impression is that it will always have some overhead. To get around this we could write our own decorator, something like:

import os

def dd_profile(func):
    # Maybe a better way to configure this? It would have to be decided at import time.
    if os.environ.get("DEDUPE_PROFILE"):
        # Only pull in memory_profiler when profiling is actually requested,
        # so it can stay an optional extra.
        from memory_profiler import profile
        return profile(func)
    else:
        # no-op: hand the function back unchanged
        return func

NickCrews avatar Feb 21 '22 02:02 NickCrews
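
ED: usage of the sketch above would then look roughly like this; dd_profile and DEDUPE_PROFILE are the hypothetical names from that sketch, and mprof comes from the memory_profiler package.

@dd_profile
def score_records(record_pairs):
    # stand-in for whichever dedupe internals would get decorated
    return [len(a) + len(b) for a, b in record_pairs]

# collected with, e.g.:
#   DEDUPE_PROFILE=1 mprof run myscript.py
#   mprof plot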

interesting idea, can you open another issue for that, @NickCrews ?

fgregg avatar Feb 21 '22 14:02 fgregg

I am having the same issue described by @zwarshavsky above. It happened when I ran the small-dataset dedupe example code on larger data with 1.7M rows, without the SQL implementation. A large temp file was written during

clustered_dupes = deduper.partition(data_d, 0.5)

The error was thrown when the temp file had reached 180 GB in the Windows AppData folder, although there was still about 120 GB of free disk space left. I was running the code on a Windows Server machine. Let me know if and how I can help to track this down further.

hlra avatar Sep 20 '22 10:09 hlra

what version of dedupe are you running?

fgregg avatar Sep 20 '22 16:09 fgregg

I am currently mostly using 2.0.13 because of issue #1077. But I ran it again with 2.0.18 now, and this is the error message that I get:

Traceback (most recent call last):
  File "E:\.conda\envs\Dissertation\lib\code.py", line 90, in runcode
    exec(code, self.locals)
  File "", line 1, in <module>
  File "C:\Program Files\JetBrains\PyCharm 2022.2.2\plugins\python\helpers\pydev\_pydev_bundle\pydev_umd.py", line 198, in runfile
    pydev_imports.execfile(filename, global_vars, local_vars)  # execute the script
  File "C:\Program Files\JetBrains\PyCharm 2022.2.2\plugins\python\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
    exec(compile(contents + "\n", file, 'exec'), glob, loc)
  File "C:\Users\...\2nd Step - Identify duplicate shareholders.py", line 256, in <module>
    clustered_dupes = deduper.partition(data_1, 0.5)
  File "E:\.conda\envs\..\lib\site-packages\dedupe\api.py", line 177, in partition
    clusters = list(clusters)
  File "E:\.conda\envs\..\lib\site-packages\dedupe\api.py", line 185, in _add_singletons
    for record_ids, score in clusters:
  File "E:\.conda\envs\..\lib\site-packages\dedupe\api.py", line 334, in cluster
    yield from clustering.cluster(scores, threshold)
  File "E:\.conda\envs\..\lib\site-packages\dedupe\clustering.py", line 238, in cluster
    for sub_graph in dupe_sub_graphs:
  File "E:\.conda\envs\..\lib\site-packages\dedupe\clustering.py", line 38, in connected_components
    edgelist = numpy.memmap(
  File "E:\.conda\envs\..\lib\site-packages\numpy\core\memmap.py", line 284, in __new__
    self.filename = None
OSError: [Errno 28] No space left on device

There is still about 105 GB of free space on the device, though.

hlra avatar Sep 21 '22 08:09 hlra
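
ED: not from the thread, but the traceback above fails while creating the memmapped edge list in connected_components, and a numpy memmap opened with mode="w+" gets a backing file of its full size (shape × itemsize) right away. A toy illustration of that arithmetic, with an assumed record layout; at several billion candidate pairs the same math reaches the 170-180 GB range reported here.

import numpy as np

n_edges = 10_000_000  # toy count; real runs can generate billions of candidate pairs
dtype = np.dtype([("pairs", "<u8", 2), ("score", "<f4")])  # assumed 20-byte record layout

# mode="w+" creates the backing file at its full size immediately
edgelist = np.memmap("edgelist.tmp", dtype=dtype, mode="w+", shape=(n_edges,))
print(edgelist.nbytes / 1e9, "GB reserved on disk")  # ~0.2 GB at this toy size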

In both runs the error seems to have been thrown when the temp file was at about 173 GB.

hlra avatar Sep 21 '22 08:09 hlra