SDV
Crashes on FileNotFoundError for .sample.csv.temp
Environment Details
Please indicate the following details about the environment in which you found the bug:
- SDV version: 0.18.0
- Python version: Python 3.9.2 (default, Feb 28 2021, 17:03:44) [GCC 10.2.1 20210110] on linux
- Operating System: 5.15.64.1.amd64-smp
- IMPORTANT: Occurs when running on a SLURM cluster
Error Description
Crashes with the following:
Error: Sampling terminated. Partial results are stored in a temporary file: .sample.csv.temp. This file will be overridden the next time you sample. Please rename the file if you wish to save these results.
Traceback (most recent call last):
File "/INS/syndiffix/nobackup/internal-strategy/playground/adaptive-buckets/tests/oneModel.py", line 138, in <module>
fire.Fire(oneModel)
File "/home/francis/.local/lib/python3.9/site-packages/fire/core.py", line 141, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
File "/home/francis/.local/lib/python3.9/site-packages/fire/core.py", line 475, in _Fire
component, remaining_args = _CallAndUpdateTrace(
File "/home/francis/.local/lib/python3.9/site-packages/fire/core.py", line 691, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
File "/INS/syndiffix/nobackup/internal-strategy/playground/adaptive-buckets/tests/oneModel.py", line 112, in oneModel
runTest(model,metaData['sdvMetaData'],df,colNames,outPath)
File "/INS/syndiffix/nobackup/internal-strategy/playground/adaptive-buckets/tests/oneModel.py", line 37, in runTest
synData = model.sample(num_rows=df.shape[0])
File "/home/francis/.local/lib/python3.9/site-packages/sdv/lite/tabular.py", line 168, in sample
sampled = self._model.sample(
File "/home/francis/.local/lib/python3.9/site-packages/sdv/tabular/base.py", line 545, in sample
return self._sample_with_progress_bar(
File "/home/francis/.local/lib/python3.9/site-packages/sdv/tabular/base.py", line 508, in _sample_with_progress_bar
handle_sampling_error(output_file_path == TMP_FILE_NAME, output_file_path, error)
File "/home/francis/.local/lib/python3.9/site-packages/sdv/tabular/utils.py", line 165, in handle_sampling_error
raise sampling_error
File "/home/francis/.local/lib/python3.9/site-packages/sdv/tabular/base.py", line 499, in _sample_with_progress_bar
sampled = self._sample_in_batches(
File "/home/francis/.local/lib/python3.9/site-packages/sdv/tabular/base.py", line 378, in _sample_in_batches
sampled_rows = self._sample_batch(
File "/home/francis/.local/lib/python3.9/site-packages/sdv/tabular/base.py", line 327, in _sample_batch
append_kwargs = {'mode': 'a', 'header': False} if os.path.getsize(
File "/usr/lib/python3.9/genericpath.py", line 50, in getsize
return os.stat(filename).st_size
FileNotFoundError: [Errno 2] No such file or directory: '.sample.csv.temp'
I believe that this happens because there are multiple instances of SDV running on the same set of machines, all of which share a file system. Multiple instances are creating and deleting the same file, leading to cases where the file does not exist as expected.
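The race can be reproduced without SDV at all. The following sketch (my own, standard library only) mimics one instance deleting the shared temp file just before another instance checks its size, which is exactly the `os.path.getsize` call that raised in the traceback above:

```python
import os

SHARED_TMP = ".sample.csv.temp"  # filename shared by every SDV instance

# Instance A creates the shared temporary file for its partial results.
with open(SHARED_TMP, "w") as f:
    f.write("id,value\n")

# Instance B finishes (or restarts) and deletes the very same file.
os.remove(SHARED_TMP)

# Instance A then checks the file size before appending its next batch,
# the same os.path.getsize call that appears in the traceback.
try:
    os.path.getsize(SHARED_TMP)
    crashed = False
except FileNotFoundError:
    crashed = True

print("reproduced FileNotFoundError:", crashed)
```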
My workaround is to call model.sample() with an output_file_path whose file name is unique. This prevents the temporary file .sample.csv.temp from ever being created in the first place:
import os
import time

# Use a path unique to this data source so parallel runs don't collide
tempPath = f"temp.{dataSourceNum}.csv"
start = time.time()
model.fit(df)
synData = model.sample(num_rows=df.shape[0], output_file_path=tempPath)
end = time.time()
if os.path.exists(tempPath):
    os.remove(tempPath)
Hi @yoid2000, thanks for filing the issue with your details.
This is the current functionality:
- If you specify an output_filepath, the sampling function will write to that filepath after every batch, ensuring that you do not lose previously sampled data if there is a crash.
- If you do not specify an output_filepath, the sampling function will still save batched results to .sample.csv.temp (we still want to ensure that you have access to some data in the event of a crash). Once the sampling completes, we delete this file.
In your case, the recommended solution is to specify a unique output_filepath (as you have noted).
We may have to think more carefully about how to support this natively. Some options:
- Append a random identifier to the temporary filename (e.g. .sample-4928819.csv.temp)
- Do not create a temporary file (in the event of a crash, the data will be lost)
- Natively support parallel sampling that takes this into account
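Option 1 can already be approximated on the user's side. A minimal sketch (the helper name `unique_sample_path` is my own, not part of SDV):

```python
import os
import uuid

def unique_sample_path(prefix=".sample", suffix=".csv.temp"):
    """Build a collision-free temporary filename by embedding the
    process id and a random identifier, so parallel SDV instances
    sharing a filesystem never fight over the same file."""
    return f"{prefix}-{os.getpid()}-{uuid.uuid4().hex}{suffix}"

path_a = unique_sample_path()
path_b = unique_sample_path()
# Every call yields a distinct name, even within a single process.
assert path_a != path_b
```

The resulting path could then be passed as output_file_path to model.sample(), as in the workaround above.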
For me, the general problem is that you don't correctly support parallel sampling (running multiple instances of sampling in the same directory). I don't have a need to save previously sampled data.
Since I am running 10s of instances in parallel, I would like to know if there are other potential problems with running parallel executions.
So I'd think that in any event you want to do option 3.
(Assuming you do, my personal preference would be to fix this particular problem with option 2. It seems cleaner, and in any event the user can avoid the problem by specifying an output_filepath.)
Hi @yoid2000, we appreciate the feedback. If you provide a unique output_filepath to every task, I don't foresee there being problems for now. There are no other shared files being used.
The SDV Ecosystem is the overall effort of our community -- so if you notice something else or do encounter a different bug, feel free to file another issue.
Hello @npatki and @yoid2000, as I also commented in #1437, sample_from_conditions always creates a file, either temporary or persistent. I implemented a similar workaround by defining a unique identifier for each instance and then deleting the file. I think the user should decide whether they want to keep the samples in memory or on disk. I did not review the whole code, but I presume that when the sample is huge and multiple instances run in parallel, writing to disk will not be efficient (unless the user requires it, of course).
I would support option 2 of @npatki (https://github.com/sdv-dev/SDV/issues/1310#issuecomment-1468325937). I think that adding a new default argument like write_to_disk=False and using it here could solve this issue without breaking many things.
https://github.com/sdv-dev/SDV/blob/6994e64e9ffb681ea4a67766b1d3893dccf3dc85/sdv/single_table/utils.py#L145-L155
This is of course just an idea.
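To make the proposal concrete, here is a rough sketch of the idea. All names here, including handle_sampled_batch and write_to_disk, are hypothetical illustrations of the suggestion, not SDV's actual API:

```python
import csv
import os

def handle_sampled_batch(rows, header, output_file_path=None,
                         write_to_disk=False):
    """Hypothetical helper: persist a sampled batch only when the
    caller opted in via write_to_disk; otherwise keep it in memory
    and never touch the (possibly shared) filesystem."""
    if write_to_disk and output_file_path is not None:
        is_new_file = not os.path.exists(output_file_path)
        with open(output_file_path, "a", newline="") as f:
            writer = csv.writer(f)
            if is_new_file:  # write the header only once
                writer.writerow(header)
            writer.writerows(rows)
    return rows

# Default behaviour: the batch stays in memory, no file is created.
batch = handle_sampled_batch([[1, "a"], [2, "b"]], ["id", "val"])
```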
I've now created an explicit feature request for this in #2042 so I'm closing this one out as a duplicate. We will use #2042 to track the fix.