
Error running style.py: Not found: ./bin/ptxas not found

Open moldach opened this issue 3 years ago • 0 comments

I'm getting an error when training checkpoints with style.py, and the log seems to point to "Not found: ./bin/ptxas not found" as the source of the error.

Do you have any idea what the issue here is?

Submission Script

#!/bin/bash
#$ -pwd

# bash fastTrainer.bash /images/ /outpath/ /testpath/
##
## An embarrassingly parallel script to train many style transfer networks on an HPC cluster.
## Access to the SLURM job scheduler and fast-style-transfer is required to run this program.
## The three mandatory paths must be given in the order shown above.

IMG=$(readlink -f "${1%/}")     # path_to_train_images
OUT_DIR=$(readlink -f "${2%/}")  # path_to_checkpoints
TEST=$(readlink -f "${3%/}")  # path_to_tests

mkdir -p "${OUT_DIR}/jobs"

JID=0   # counter used in the SLURM job name and output paths

for f in "${IMG}"/*; do

  JID=$((JID + 1))

  # Write one job script per style image; the unquoted EOT lets ${OUT_DIR}, ${JID} and $f expand now
  cat > "${OUT_DIR}/jobs/style_${JID}.bash" << EOT
#!/bin/bash
#SBATCH --gres=gpu:1        # request GPU
#SBATCH --account=def-mtarailo
#SBATCH --cpus-per-task=10   # maximum CPU cores per GPU request
#SBATCH --time=12:00:00     # request 12 hours of walltime
#SBATCH --mem=10G           # request 10G (or 1G per core)
#SBATCH --job-name="fst_${JID}"
#SBATCH --output=${OUT_DIR}/jobs/%N-%j.out  # %N for node name, %j for jobID
#SBATCH --error=${OUT_DIR}/jobs/%N-%j.err  # %N for node name, %j for jobID

### JOB SCRIPT BELOW ###

# Load Modules
source activate tf-gpu
module load cuda/10.1

mkdir -p "${OUT_DIR}/${JID}"
#mkdir ${TEST}/${JID}

python style.py --style "$f" \
  --checkpoint-dir ${OUT_DIR}/${JID} \
  --test examples/content/chicago.jpg \
  --test-dir ${OUT_DIR}/${JID} \
  --content-weight 1.5e1 \
  --checkpoint-iterations 1000 \
  --batch-size 20

EOT
  chmod 754 "${OUT_DIR}/jobs/style_${JID}.bash"
  sbatch "${OUT_DIR}/jobs/style_${JID}.bash"
done

Error

Due to MODULEPATH changes, the following have been reloaded:
  1) openmpi/3.1.2

2021-03-22 21:11:45.184962: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2021-03-22 21:11:45.240458: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: 
pciBusID: 0000:1d:00.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s
2021-03-22 21:11:45.283971: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2021-03-22 21:11:45.473571: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2021-03-22 21:11:45.634452: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2021-03-22 21:11:45.716070: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2021-03-22 21:11:45.890345: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2021-03-22 21:11:45.927906: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2021-03-22 21:11:46.114498: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-03-22 21:11:46.116365: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2021-03-22 21:11:46.116859: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 AVX512F FMA
2021-03-22 21:11:46.164009: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2400000000 Hz
2021-03-22 21:11:46.165725: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x564a7240ad60 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-03-22 21:11:46.165751: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2021-03-22 21:11:46.168706: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: 
pciBusID: 0000:1d:00.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s
2021-03-22 21:11:46.168744: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2021-03-22 21:11:46.168760: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2021-03-22 21:11:46.168775: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2021-03-22 21:11:46.168789: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2021-03-22 21:11:46.168803: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2021-03-22 21:11:46.168817: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2021-03-22 21:11:46.168831: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-03-22 21:11:46.170436: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2021-03-22 21:11:46.170479: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2021-03-22 21:11:46.305358: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-03-22 21:11:46.305405: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102]      0 
2021-03-22 21:11:46.305424: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0:   N 
2021-03-22 21:11:46.308686: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15059 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:1d:00.0, compute capability: 7.0)
2021-03-22 21:11:46.312111: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x564a72cf2f30 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2021-03-22 21:11:46.312137: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla V100-SXM2-16GB, Compute Capability 7.0
2021-03-22 21:11:49.670398: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1555] Found device 0 with properties: 
pciBusID: 0000:1d:00.0 name: Tesla V100-SXM2-16GB computeCapability: 7.0
coreClock: 1.53GHz coreCount: 80 deviceMemorySize: 15.78GiB deviceMemoryBandwidth: 836.37GiB/s
2021-03-22 21:11:49.670498: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2021-03-22 21:11:49.670522: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2021-03-22 21:11:49.670543: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2021-03-22 21:11:49.670561: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2021-03-22 21:11:49.670577: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2021-03-22 21:11:49.670594: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2021-03-22 21:11:49.670613: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-03-22 21:11:49.672271: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1697] Adding visible gpu devices: 0
2021-03-22 21:11:49.672324: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1096] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-03-22 21:11:49.672338: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102]      0 
2021-03-22 21:11:49.672349: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] 0:   N 
2021-03-22 21:11:49.674005: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1241] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 15059 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:1d:00.0, compute capability: 7.0)
WARNING:tensorflow:From /home/moldach/anaconda3/envs/tf-gpu/lib/python3.7/site-packages/tensorflow_core/python/ops/resource_variable_ops.py:1635: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
2021-03-22 21:12:02.532101: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2021-03-22 21:12:04.268532: W tensorflow/stream_executor/gpu/redzone_allocator.cc:312] Not found: ./bin/ptxas not found
Relying on driver to perform ptx compilation. This message will be only logged once.
2021-03-22 21:12:04.438014: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
Traceback (most recent call last):
  File "/home/moldach/anaconda3/envs/tf-gpu/lib/python3.7/site-packages/imageio/plugins/pillow.py", line 669, in pil_try_read
    im.getdata()[0]
  File "/home/moldach/anaconda3/envs/tf-gpu/lib/python3.7/site-packages/PIL/Image.py", line 1271, in getdata
    self.load()
  File "/home/moldach/anaconda3/envs/tf-gpu/lib/python3.7/site-packages/PIL/ImageFile.py", line 260, in load
    "image file is truncated "
OSError: image file is truncated (20 bytes not processed)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "style.py", line 167, in <module>
    main()
  File "style.py", line 147, in main
    for preds, losses, i, epoch in optimize(*args, **kwargs):
  File "src/optimize.py", line 105, in optimize
    X_batch[j] = get_img(img_p, (256,256,3)).astype(np.float32)
  File "src/utils.py", line 18, in get_img
    img = imageio.imread(src, pilmode='RGB') # misc.imresize(, (256, 256, 3))
  File "/home/moldach/anaconda3/envs/tf-gpu/lib/python3.7/site-packages/imageio/core/functions.py", line 265, in imread
    reader = read(uri, format, "i", **kwargs)
  File "/home/moldach/anaconda3/envs/tf-gpu/lib/python3.7/site-packages/imageio/core/functions.py", line 186, in get_reader
    return format.get_reader(request)
  File "/home/moldach/anaconda3/envs/tf-gpu/lib/python3.7/site-packages/imageio/core/format.py", line 170, in get_reader
    return self.Reader(self, request)
  File "/home/moldach/anaconda3/envs/tf-gpu/lib/python3.7/site-packages/imageio/core/format.py", line 221, in __init__
    self._open(**self.request.kwargs.copy())
  File "/home/moldach/anaconda3/envs/tf-gpu/lib/python3.7/site-packages/imageio/plugins/pillow.py", line 429, in _open
    return PillowFormat.Reader._open(self, pilmode=pilmode, as_gray=as_gray)
  File "/home/moldach/anaconda3/envs/tf-gpu/lib/python3.7/site-packages/imageio/plugins/pillow.py", line 135, in _open
    pil_try_read(self._im)
  File "/home/moldach/anaconda3/envs/tf-gpu/lib/python3.7/site-packages/imageio/plugins/pillow.py", line 680, in pil_try_read
    raise ValueError(error_message)
ValueError: Could not load "" 
Reason: "image file is truncated (20 bytes not processed)"
Please see documentation at: http://pillow.readthedocs.io/en/latest/installation.html#external-libraries
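
Reading the log again: the ptxas line is logged at warning level (W) and is followed by "Relying on driver to perform ptx compilation", while the traceback itself ends in "image file is truncated (20 bytes not processed)" raised from get_img. That suggests one of the training images may be incomplete. Below is a minimal sketch for locating such files, assuming the training images are JPEGs. It is a stdlib-only heuristic (it only checks the JPEG start and end markers), not the check Pillow performs, and the function names are hypothetical.

```python
import os
from pathlib import Path

JPEG_SOI = b"\xff\xd8"  # JPEG start-of-image marker
JPEG_EOI = b"\xff\xd9"  # JPEG end-of-image marker


def looks_incomplete_jpeg(path):
    """Heuristic: True if the file is too short or lacks the JPEG start/end markers.

    Files with extra bytes after the end marker are also flagged, so treat
    hits as candidates to re-check, not as proof of corruption.
    """
    with open(path, "rb") as fh:
        head = fh.read(2)
        fh.seek(0, os.SEEK_END)
        if fh.tell() < 4:          # too small to hold both markers
            return True
        fh.seek(-2, os.SEEK_END)   # jump to the last two bytes
        tail = fh.read(2)
    return head != JPEG_SOI or tail != JPEG_EOI


def find_suspect_images(train_dir):
    """Return sorted paths of .jpg/.jpeg files under train_dir that look incomplete."""
    suspects = []
    for p in sorted(Path(train_dir).rglob("*")):
        if p.is_file() and p.suffix.lower() in {".jpg", ".jpeg"}:
            if looks_incomplete_jpeg(p):
                suspects.append(p)
    return suspects
```

For example, calling find_suspect_images on the training image directory and printing each returned path gives a list of files to re-download or remove before resubmitting the jobs.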

moldach, Mar 23 '21 15:03