Where is output from training?

Open moldach opened this issue 3 years ago • 7 comments

I ran the following training script. There don't seem to be any obvious errors in the logs, so I think it ran successfully. I'm just having trouble finding the output to use for evaluation now:

Script:

#!/bin/bash
#SBATCH --job-name=train-pytorch
#SBATCH --mail-type=END,FAIL
#SBATCH [email protected]
#SBATCH --ntasks=1
#SBATCH --time=12:00:00
#SBATCH --mem=8000
#SBATCH --gres=gpu:p100:2
#SBATCH --cpus-per-task=6
#SBATCH --output=%x_%j.log
#SBATCH --error=%x_%j.err

source tensorflow/bin/activate

python main.py train \
  --style /scratch/moldach/PyTorch-Style-Transfer/experiments/images/matts-styles/birmingham.jpg \
  --dataset datasets/train2014 \
  --weights imagenet-vgg-verydeep-19.mat

I get the following logs:

.err

2021-03-29 21:55:54.026157: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-03-29 22:03:24.188858: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
2021-03-29 22:03:24.939154: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:03:00.0 name: Tesla P100-PCIE-12GB computeCapability: 6.0
coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 511.41GiB/s
2021-03-29 22:03:24.947032: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 1 with properties: 
pciBusID: 0000:04:00.0 name: Tesla P100-PCIE-12GB computeCapability: 6.0
coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 511.41GiB/s
2021-03-29 22:03:24.963393: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-03-29 22:03:25.011845: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2021-03-29 22:03:25.036597: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2021-03-29 22:03:25.051713: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2021-03-29 22:03:25.071691: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2021-03-29 22:03:25.076390: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2021-03-29 22:03:25.147072: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2021-03-29 22:03:25.206992: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0, 1
2021-03-29 22:03:25.438640: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN)to use the following CPU instructions in performance-critical operations:  AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2021-03-29 22:03:28.342003: I tensorflow/core/platform/profile_utils/cpu_utils.cc:104] CPU Frequency: 2200150000 Hz
2021-03-29 22:03:28.588715: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x59faef0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2021-03-29 22:03:28.588828: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2021-03-29 22:03:29.246878: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x5a880d0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2021-03-29 22:03:29.246987: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla P100-PCIE-12GB, Compute Capability 6.0
2021-03-29 22:03:29.247029: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (1): Tesla P100-PCIE-12GB, Compute Capability 6.0
2021-03-29 22:03:29.382411: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:03:00.0 name: Tesla P100-PCIE-12GB computeCapability: 6.0
coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 511.41GiB/s
2021-03-29 22:03:29.384001: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 1 with properties: 
pciBusID: 0000:04:00.0 name: Tesla P100-PCIE-12GB computeCapability: 6.0
coreClock: 1.3285GHz coreCount: 56 deviceMemorySize: 11.91GiB deviceMemoryBandwidth: 511.41GiB/s
2021-03-29 22:03:29.384077: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-03-29 22:03:29.384136: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2021-03-29 22:03:29.384225: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
2021-03-29 22:03:29.384278: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
2021-03-29 22:03:29.384323: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
2021-03-29 22:03:29.384367: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
2021-03-29 22:03:29.384412: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2021-03-29 22:03:29.390742: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1858] Adding visible gpu devices: 0, 1
2021-03-29 22:03:29.390834: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
2021-03-29 22:03:33.204879: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1257] Device interconnect StreamExecutor with strength 1 edge matrix:
2021-03-29 22:03:33.204990: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1263]      0 1 
2021-03-29 22:03:33.205025: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 0:   N Y 
2021-03-29 22:03:33.205043: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1276] 1:   Y N 
2021-03-29 22:03:33.361079: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 11121 MB memory) -> physical GPU (device: 0, name: Tesla P100-PCIE-12GB, pci bus id: 0000:03:00.0, compute capability: 6.0)
2021-03-29 22:03:33.424735: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1402] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 11121 MB memory) -> physical GPU (device: 1, name: Tesla P100-PCIE-12GB, pci bus id: 0000:04:00.0, compute capability: 6.0)
2021-03-29 22:03:47.740068: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudnn.so.7
2021-03-29 22:03:48.882829: W tensorflow/stream_executor/gpu/asm_compiler.cc:81] Running ptxas --version returned 256
2021-03-29 22:03:49.009004: W tensorflow/stream_executor/gpu/redzone_allocator.cc:314] Internal: ptxas exited with non-zero error code 256, output: 
Relying on driver to perform ptx compilation. 
Modify $PATH to customize ptxas location.
This message will be only logged once.
2021-03-29 22:03:50.058719: I tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
2021-03-29 22:17:30.953810: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 158 of 1024
2021-03-29 22:17:41.277361: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 333 of 1024
2021-03-29 22:17:51.193145: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 528 of 1024
2021-03-29 22:18:01.484417: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 661 of 1024
2021-03-29 22:18:11.074078: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 859 of 1024
2021-03-29 22:18:20.985531: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:221] Shuffle buffer filled.
2021-03-29 23:59:09.658659: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 251 of 1024
2021-03-29 23:59:19.551394: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 563 of 1024
2021-03-29 23:59:29.764089: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 806 of 1024
2021-03-29 23:59:39.467575: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:172] Filling up shuffle buffer (this may take a while): 1011 of 1024
2021-03-29 23:59:40.309649: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:221] Shuffle buffer filled.

.log

Epoch 0
=====================================
            Weights saved!           
=====================================

=====================================
            Weights saved!           
=====================================

=====================================
            Weights saved!           
=====================================

=====================================
            Weights saved!           
=====================================

=====================================
            Weights saved!           
=====================================

=====================================
            Weights saved!           
=====================================

=====================================
            Weights saved!           
=====================================

=====================================
            Weights saved!           
=====================================

=====================================
            Weights saved!           
=====================================

=====================================
            Weights saved!           
=====================================

=====================================
            Weights saved!           
=====================================

=====================================
            Weights saved!           
=====================================

=====================================
            Weights saved!           
=====================================

Epoch 1
=====================================
            Weights saved!           
=====================================

=====================================
            Weights saved!           
=====================================

=====================================
            Weights saved!           
=====================================

=====================================
            Weights saved!           
=====================================

=====================================
            Weights saved!           
=====================================

=====================================
            Weights saved!           
=====================================

=====================================
            Weights saved!           
=====================================

=====================================
            Weights saved!           
=====================================

=====================================
            Weights saved!           
=====================================

=====================================
            Weights saved!           
=====================================

=====================================
            Weights saved!           
=====================================

=====================================
            Weights saved!           
=====================================

Total time: 12052.2
=====================================
             All saved!              
=====================================
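
The log reports "Weights saved!" but never prints a path. One way to narrow down where the files went is to list the most recently modified files under the directory the job ran in. This is only a minimal sketch: it assumes the weights were written somewhere below that directory, and the helper name `recent_files` and its parameters are hypothetical, not part of the training script.

```python
import os
import time


def recent_files(root, since_hours=24.0, limit=20):
    """Return up to `limit` file paths under `root` that were modified
    within the last `since_hours` hours, newest first."""
    cutoff = time.time() - since_hours * 3600
    hits = []
    # os.walk skips directories it cannot read instead of raising.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            try:
                mtime = os.path.getmtime(full)
            except OSError:
                # File disappeared or is unreadable; ignore it.
                continue
            if mtime >= cutoff:
                hits.append((mtime, full))
    # Sort by modification time, most recent first.
    hits.sort(reverse=True)
    return [path for _mtime, path in hits[:limit]]
```

For a Slurm job, calling `recent_files(os.environ.get("SLURM_SUBMIT_DIR", "."))` from inside the job, or `recent_files(".")` from the directory where `sbatch` was run, would be a reasonable starting point, since Slurm uses the submission directory as the working directory by default. The `.log` and `.err` files themselves will also show up in the list.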

moldach, Mar 30 '21 14:03