Crashes on FileNotFoundError for .sample.csv.temp

yoid2000 opened this issue on Mar 14, 2023 · 4 comments

Environment Details

Please indicate the following details about the environment in which you found the bug:

  • SDV version: 0.18.0
  • Python version: Python 3.9.2 (default, Feb 28 2021, 17:03:44), [GCC 10.2.1 20210110] on linux
  • Operating System: 5.15.64.1.amd64-smp
  • IMPORTANT: Occurs when running on a SLURM cluster

Error Description

Crashes with the following:

Error: Sampling terminated. Partial results are stored in a temporary file: .sample.csv.temp. This file will be overridden the next time you sample. Please rename the file if you wish to save these results.
Traceback (most recent call last):
  File "/INS/syndiffix/nobackup/internal-strategy/playground/adaptive-buckets/tests/oneModel.py", line 138, in <module>
    fire.Fire(oneModel)
  File "/home/francis/.local/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
    component_trace = _Fire(component, args, parsed_flag_args, context, name)
  File "/home/francis/.local/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
    component, remaining_args = _CallAndUpdateTrace(
  File "/home/francis/.local/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
    component = fn(*varargs, **kwargs)
  File "/INS/syndiffix/nobackup/internal-strategy/playground/adaptive-buckets/tests/oneModel.py", line 112, in oneModel
    runTest(model,metaData['sdvMetaData'],df,colNames,outPath)
  File "/INS/syndiffix/nobackup/internal-strategy/playground/adaptive-buckets/tests/oneModel.py", line 37, in runTest
    synData = model.sample(num_rows=df.shape[0])
  File "/home/francis/.local/lib/python3.9/site-packages/sdv/lite/tabular.py", line 168, in sample
    sampled = self._model.sample(
  File "/home/francis/.local/lib/python3.9/site-packages/sdv/tabular/base.py", line 545, in sample
    return self._sample_with_progress_bar(
  File "/home/francis/.local/lib/python3.9/site-packages/sdv/tabular/base.py", line 508, in _sample_with_progress_bar
    handle_sampling_error(output_file_path == TMP_FILE_NAME, output_file_path, error)
  File "/home/francis/.local/lib/python3.9/site-packages/sdv/tabular/utils.py", line 165, in handle_sampling_error
    raise sampling_error
  File "/home/francis/.local/lib/python3.9/site-packages/sdv/tabular/base.py", line 499, in _sample_with_progress_bar
    sampled = self._sample_in_batches(
  File "/home/francis/.local/lib/python3.9/site-packages/sdv/tabular/base.py", line 378, in _sample_in_batches
    sampled_rows = self._sample_batch(
  File "/home/francis/.local/lib/python3.9/site-packages/sdv/tabular/base.py", line 327, in _sample_batch
    append_kwargs = {'mode': 'a', 'header': False} if os.path.getsize(
  File "/usr/lib/python3.9/genericpath.py", line 50, in getsize
    return os.stat(filename).st_size
FileNotFoundError: [Errno 2] No such file or directory: '.sample.csv.temp'

I believe this happens because there are multiple instances of SDV running on the same set of machines, all sharing a single file system. The instances create and delete the same file, so one process can find the file missing at the moment it expects it to exist.
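The failing call in the traceback (os.path.getsize inside _sample_batch) follows a check-then-use pattern, so a race of this kind would explain the crash. A minimal sketch of one plausible interleaving (illustrative only, not SDV's exact internals):

    import os

    # Instances A and B run in the same working directory.
    if os.path.exists('.sample.csv.temp'):       # A: the file exists...
        # ...B finishes its run here and deletes '.sample.csv.temp'...
        os.path.getsize('.sample.csv.temp')      # A: FileNotFoundError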

My work-around is to call model.sample() with an output_file_path whose file name is unique. This prevents the temporary file .sample.csv.temp from ever being created in the first place:

    import os
    import time

    # Use a per-run file name so parallel jobs never touch the same path.
    tempPath = f"temp.{dataSourceNum}.csv"
    start = time.time()
    model.fit(df)
    # Passing output_file_path skips the shared .sample.csv.temp entirely.
    synData = model.sample(num_rows=df.shape[0], output_file_path=tempPath)
    end = time.time()
    if os.path.exists(tempPath):
        os.remove(tempPath)

yoid2000 avatar Mar 14 '23 00:03 yoid2000

Hi @yoid2000, thanks for filing the issue with your details.

This is the current functionality:

  • If you specify an output_file_path, the sampling function will write to that file after every batch, ensuring that you do not lose previously sampled data if there is a crash
  • If you do not specify an output_file_path, the sampling function will still save batched results to .sample.csv.temp (we still want to ensure that you have access to some data in the event of a crash). Once the sampling completes, we delete this file. A minimal sketch of both modes follows this list.
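For illustration, assuming model is an already fitted SDV tabular model as in the traceback above:

    # With an explicit output_file_path, batches are appended to that
    # file as they are sampled, so a crash leaves partial results there.
    synthetic = model.sample(num_rows=1000, output_file_path='my_samples.csv')

    # Without it, batches are staged in the shared '.sample.csv.temp',
    # which is deleted once sampling completes successfully.
    synthetic = model.sample(num_rows=1000)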

In your case, the recommended solution is to specify a unique output_file_path (as you have noted).

We may have to think more carefully about how to support this natively. Some options:

  1. Append a random identifier to the temporary filename (e.g. .sample-4928819.csv.temp); see the sketch after this list
  2. Do not create a temporary file (in the event of a crash, the data will be lost)
  3. Natively support parallel sampling that takes this into account
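For option 1, a minimal sketch of how a collision-free temporary name could be generated (the naming scheme is illustrative, not current SDV behavior):

    import uuid

    # A per-process random suffix keeps parallel runs from sharing one temp file.
    tmp_file_name = f'.sample-{uuid.uuid4().hex[:8]}.csv.temp'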

npatki avatar Mar 14 '23 15:03 npatki

For me, the general problem is that you don't correctly support parallel sampling (running multiple instances of sampling in the same directory). I don't have a need to save previously sampled data.

Since I am running tens of instances in parallel, I would like to know whether there are other potential problems with parallel execution.

So I'd think that in any event you want to do option 3.

(Assuming you do, my personal preference would be to fix this particular problem with option 2. It seems cleaner, and in any event the user can avoid the problem by specifying an output_file_path.)

yoid2000 avatar Mar 15 '23 00:03 yoid2000

Hi @yoid2000, we appreciate the feedback. If you provide a unique output_file_path to every task, I don't foresee there being problems for now. There are no other shared files being used.

The SDV Ecosystem is the overall effort of our community -- so if you notice something else or do encounter a different bug, feel free to file another issue.

npatki avatar Mar 15 '23 13:03 npatki

Hello @npatki and @yoid2000, as I commented in #1437 as well, sample_from_conditions always creates a file, either temporary or persistent. I implemented a solution similar to this one by defining a unique identifier for each instance and then deleting the file. I think the user should decide whether they want to keep the samples in memory or on disk. I did not review the whole code, but I presume that when the sample is huge and multiple instances run in parallel, writing to disk will not be efficient (unless the user requires it, of course).

I would support option 2 from @npatki (https://github.com/sdv-dev/SDV/issues/1310#issuecomment-1468325937). I think that adding a new default argument like write_to_disk=False and using it here could solve this issue without breaking many things; a sketch of the idea follows at the end of this comment.

https://github.com/sdv-dev/SDV/blob/6994e64e9ffb681ea4a67766b1d3893dccf3dc85/sdv/single_table/utils.py#L145-L155

This is of course just an idea.
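To make the idea concrete, here is a minimal sketch of a batching loop that honors such a flag. The function name, the sample_batch callable, and the arguments are hypothetical, not the actual code behind the permalink above:

    import pandas as pd

    def sample_in_batches(sample_batch, num_rows, batch_size,
                          write_to_disk=False, output_file_path=None):
        # Keep every batch in memory; only touch the disk on request.
        batches = []
        remaining = num_rows
        while remaining > 0:
            batch = sample_batch(min(batch_size, remaining))
            batches.append(batch)
            remaining -= len(batch)
            if write_to_disk and output_file_path is not None:
                # Append each batch; write the CSV header only once.
                batch.to_csv(output_file_path, mode='a',
                             header=(len(batches) == 1), index=False)
        return pd.concat(batches, ignore_index=True)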

tyrael147 avatar May 23 '23 15:05 tyrael147

I've now created an explicit feature request for this in #2042 so I'm closing this one out as a duplicate. We will use #2042 to track the fix.

npatki avatar May 31 '24 15:05 npatki