Kernel Crash During Transfer Learning
I am attempting to perform transfer learning on the existing nobrainer model weights by synthesizing a training dataset compiled with manual edits to the brain masks.
My first attempt made it to epoch 4/5 before the kernel crashed. I've rerun the code multiple times with smaller datasets and different learning rates, but I keep getting the same error message:
Train for 1296 steps, validate for 80 steps
2019-12-06 10:47:11.520055: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:150] Filling up shuffle buffer (this may take a while): 1 of 10
2019-12-06 10:47:12.065779: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:199] Shuffle buffer filled.
Killed: 9
My code is below; any help or suggestions would be appreciated.
# Transfer Learning to ADS FreeSurfer Brain Masks
import nobrainer

# initialize
csv_of_filepaths = './nobrainer/code/nobrainer_fs-SkullStripped_trainingdata.csv'
filepaths = nobrainer.io.read_csv(csv_of_filepaths)

# split into train and test
train_paths = filepaths[:324]
evaluate_paths = filepaths[324:]

# convert images to tensorflow records
nobrainer.io.convert(
    train_paths,
    tfrecords_template='./nobrainer/processed/data-train_shard-{shard:03d}.tfrecords',
    volumes_per_shard=3,
    num_parallel_calls=24)
nobrainer.io.convert(
    evaluate_paths,
    tfrecords_template='./nobrainer/processed/data-evaluate_shard-{shard:03d}.tfrecords',
    volumes_per_shard=3,
    num_parallel_calls=24)

#### preallocation for train/evaluate
n_classes = 1
batch_size = 2
volume_shape = (256, 256, 256)
block_shape = (128, 128, 128)
n_epochs = None
augment = False
shuffle_buffer_size = 10
num_parallel_calls = 24

# train object
dataset_train = nobrainer.volume.get_dataset(
    file_pattern='./nobrainer/processed/data-train_shard-*.tfrecords',
    n_classes=n_classes,
    batch_size=batch_size,
    volume_shape=volume_shape,
    block_shape=block_shape,
    n_epochs=n_epochs,
    augment=augment,
    shuffle_buffer_size=shuffle_buffer_size,
    num_parallel_calls=num_parallel_calls,
)

# evaluate object
dataset_evaluate = nobrainer.volume.get_dataset(
    file_pattern='./nobrainer/processed/data-evaluate_shard-*.tfrecords',
    n_classes=n_classes,
    batch_size=batch_size,
    volume_shape=volume_shape,
    block_shape=block_shape,
    n_epochs=1,
    augment=False,
    shuffle_buffer_size=None,
    num_parallel_calls=1,
)

##################################################
# TRANSFER LEARNING
### get existing model for transfer learning
##################################################
import tensorflow as tf

model_path = tf.keras.utils.get_file(
    fname='brain-extraction-unet-128iso-model.h5',
    origin='https://github.com/neuronets/nobrainer-models/releases/download/0.1/brain-extraction-unet-128iso-model.h5')
model = tf.keras.models.load_model(model_path, compile=False)
model.summary()

# set L2 regularization for layers
for layer in model.layers:
    layer.kernel_regularizer = tf.keras.regularizers.l2(0.01)

# set learning rate
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-05)

# compile model
model.compile(
    optimizer=optimizer,
    loss=nobrainer.losses.jaccard,
    metrics=[nobrainer.metrics.dice],
)

# compute steps given sizes
steps_per_epoch = nobrainer.volume.get_steps_per_epoch(
    n_volumes=len(train_paths),
    volume_shape=volume_shape,
    block_shape=block_shape,
    batch_size=batch_size)
validation_steps = nobrainer.volume.get_steps_per_epoch(
    n_volumes=len(evaluate_paths),
    volume_shape=volume_shape,
    block_shape=block_shape,
    batch_size=batch_size)
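# sanity check on the step counts above (added for illustration, assuming
# get_steps_per_epoch computes n_volumes * blocks_per_volume / batch_size):
# each (256, 256, 256) volume splits into (256/128)**3 = 8 blocks of
# (128, 128, 128), so with 324 training volumes and batch_size=2,
# steps_per_epoch = 324 * 8 / 2 = 1296 -- matching the
# "Train for 1296 steps" line in the log above.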
## TRAIN MODEL!!!
model.fit(
    dataset_train,
    epochs=1,
    verbose=1,
    steps_per_epoch=steps_per_epoch,
    validation_data=dataset_evaluate,
    validation_steps=validation_steps,
    use_multiprocessing=True,
    workers=24)

model.save(
    './nobrainer/nobrainer-models/ads-transfer-learning_manual-edits_brain-extraction-unet-128iso-model.h5',
    save_format='h5')
model.save_weights(
    './nobrainer/nobrainer-models/ads-transfer-learning_manual-edits_brain-extraction-unet-128iso-weights.h5',
    save_format='h5')
Update: I'm not sure how this happened, but I get this error message when attempting to call nobrainer from the command line:
File "/Users/admin/Anvil/opt/miniconda3/lib/python3.7/site-packages/pkg_resources/__init__.py", line 786, in resolve
raise DistributionNotFound(req, requirers)
pkg_resources.DistributionNotFound: The 'cloudpickle==1.1.1' distribution was not found and is required by tensorflow-probability
Update: I performed a reinstall and this message mysteriously disappeared. The original reported behavior is still present.
hi @seldamat - sorry for the delay, and thanks for the report. can you update nobrainer and try the above code again?
pip install -U --no-cache-dir https://github.com/neuronets/nobrainer/tarball/master
that might update dependencies too, so it's best to do this in a virtual environment or conda environment.
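for example, with conda (the environment name here is arbitrary; python 3.7 matches the interpreter in your traceback):
conda create -n nobrainer-env python=3.7
conda activate nobrainer-env
pip install -U --no-cache-dir https://github.com/neuronets/nobrainer/tarball/master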
FYI i have enhanced the tfrecords writing and reading functionality in #79. can you please refer to https://github.com/neuronets/nobrainer/blob/master/guide/transfer_learning.ipynb for how to write and read tfrecords in the new format?
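a rough sketch of the newer pattern is below - the nobrainer.tfrecord.write and nobrainer.dataset.get_dataset names and arguments are assumed from that notebook, so please treat the notebook itself as authoritative:
import nobrainer

# write the (features, labels) filepath pairs out as sharded tfrecords
nobrainer.tfrecord.write(
    features_labels=train_paths,
    filename_template='./nobrainer/processed/data-train_shard-{shard:03d}.tfrec',
    examples_per_shard=3)

# read the shards back as a tf.data.Dataset of (128, 128, 128) blocks
dataset_train = nobrainer.dataset.get_dataset(
    file_pattern='./nobrainer/processed/data-train_shard-*.tfrec',
    n_classes=1,
    batch_size=2,
    volume_shape=(256, 256, 256),
    block_shape=(128, 128, 128),
    n_epochs=None,
    num_parallel_calls=24)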
@seldamat - let me know if you have any updates