TFRecordDatasetSampler does not seem to guarantee example_per_class
TensorFlow: 2.7.0, TensorFlow Similarity: 0.14.10
I have a large dataset in the form of TFRecords, so I am trying out this sampler. My image classification task has 5 classes, and I'd like to have 6 instances of each class per batch. I put each class into its own TFRecord file (train_class_0.tfrecords, train_class_1.tfrecords, etc.), so each class is definitely contiguous throughout. (I am admittedly confused by this contiguity requirement as stated in the code comment, but this is the best I could come up with.)
sampler = TFRecordDatasetSampler(
    shard_path='.',
    deserialization_fn=lambda x: decode_func(parse_tf_records_fn(x)),
    example_per_class=6,
    batch_size=30,
    shard_suffix='train_*.tfrecords'
)
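For context, my deserialization is roughly along these lines (simplified; the feature names and decode steps here are just illustrative):

# Illustrative sketch of my parse/decode functions; the real feature names
# and decoding differ, but the shape of the pipeline is the same.
feature_description = {
    "image": tf.io.FixedLenFeature([], tf.string),
    "label": tf.io.FixedLenFeature([], tf.int64),
}

def parse_tf_records_fn(serialized_example):
    # Parse one serialized tf.train.Example into a dict of tensors.
    return tf.io.parse_single_example(serialized_example, feature_description)

def decode_func(features):
    # Decode the image bytes and return an (image, label) pair.
    image = tf.io.decode_jpeg(features["image"], channels=3)
    image = tf.image.convert_image_dtype(image, tf.float32)
    return image, features["label"]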
Then I examine the labels of a batch:
x, y = next(iter(sampler))
y
And got:
<tf.Tensor: shape=(30,), dtype=int64, numpy= array([2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4])>
After iterating a few times, it seems to consistently over-sample the first label and under-sample the last. I tried to isolate and test the relevant code that constructs the dataset, and it seems that if you set deterministic=True, each class correctly gets 6 instances. I understand from the comments that non-determinism is needed to create random batches, so I am speculating that something about tf.data's interleave behavior may have changed, or that I have a misunderstanding somewhere.
I also went through the tf.data interleave documentation and ran this simple test case (modified slightly from the example in that doc):
import numpy as np
import tensorflow as tf

AUTO = tf.data.AUTOTUNE

dataset = tf.data.Dataset.range(1, 6)  # ==> [1, 2, 3, 4, 5]
dataset = dataset.interleave(
    lambda x: tf.data.Dataset.from_tensors(x).repeat(-1),
    cycle_length=5, block_length=6, num_parallel_calls=AUTO, deterministic=False)
dataset = dataset.map(lambda x: x, num_parallel_calls=AUTO)
dataset = dataset.repeat(count=-1)
dataset = dataset.batch(30)
dataset = dataset.prefetch(10)
# np.array(list(dataset.as_numpy_iterator()))
x = next(iter(dataset))
x
<tf.Tensor: shape=(30,), dtype=int64, numpy= array([1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3, 4, 4, 4, 4, 4, 4, 5, 5, 5, 5, 5, 5, 1])>
This also fails to yield 6 instances of each class (if you set deterministic to True, it does).
It seems to me that it generates 6 per class on average, but if you look at any single batch, there's no guarantee.
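To check, I counted the labels in a few batches of the sampler with roughly this:

from collections import Counter

# Count how many examples of each class appear in each of the first 10 batches.
it = iter(sampler)
for _ in range(10):
    _, y = next(it)
    print(dict(sorted(Counter(y.numpy().tolist()).items())))
# With 5 classes and example_per_class=6, every batch should ideally print
# {0: 6, 1: 6, 2: 6, 3: 6, 4: 6}, but in my runs some classes come up short
# while others are over-represented.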
For now it may still be OK with example_per_class=6, since I haven't seen any class with fewer than 2 instances in a batch, although I can't be sure. It could be a problem if example_per_class is only 2. If the similarity loss can gracefully handle the occasional "lone class", it may still work.
Alternatively, I wonder if you could use tf.data.Dataset.sample_from_datasets, although it doesn't guarantee an exact number of examples per class (only probabilistically, through the weights).
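Roughly, what I have in mind is something like this (just a sketch; the per-class files and parse/decode functions are the ones from above):

# Sketch: build one dataset per class and sample uniformly from them.
num_classes = 5

per_class_datasets = [
    tf.data.TFRecordDataset(f"train_class_{c}.tfrecords")
    .map(lambda x: decode_func(parse_tf_records_fn(x)),
         num_parallel_calls=tf.data.AUTOTUNE)
    .shuffle(1000)
    .repeat()
    for c in range(num_classes)
]

ds = tf.data.Dataset.sample_from_datasets(
    per_class_datasets,
    weights=[1.0 / num_classes] * num_classes,
)
ds = ds.batch(30).prefetch(tf.data.AUTOTUNE)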
I've run into similar issues with interleave and have also recently come to feel that deterministic needs to be set to True in order to ensure the correct number of classes per batch; I discussed some of these issues in #171 as well. The original idea was to create a bunch of TFRecords that each held contiguous blocks of a class, likely of size block_length. Each time, the interleave would load a record and consume it one block_length at a time. However, it seems that when deterministic is False, interleave is not guaranteed to consume the entire block_length from one iterator before switching over to another. Additionally, TFRecords can only be consumed sequentially, which means we will always have a fixed order within the block_length; the hope was that each individual file would be loaded randomly.
My current thought is that we can write many small TFRecords for each class, with each file of size block_length. Then the file list can be shuffled before the interleave to introduce the random sampling. This will allow us to set deterministic as False and guarantee the order.
Regarding tf.data.Dataset.sample_from_datasets, it seemed to be very slow, at least in the tests I ran a while back. I can take another look at it though.
I haven't tested this yet, but I think one potential solution could be something like the following:
# num_images_in_class, num_classes_per_batch, and num_examples_per_class are
# assumed to be defined already.
NUM_SHARDS_PER_CLASS = num_images_in_class // num_examples_per_class

def create_records():
    # Here we materialize a sharded TFRecord per class.
    # Each TFRecord contains NUM_SHARDS_PER_CLASS = num_images_in_class // num_examples_per_class
    pass

def tfrecord_sampler(class_id):
    # maxval is exclusive, so use NUM_SHARDS_PER_CLASS to cover every shard.
    random_shard_id = tf.random.uniform(
        shape=(),
        maxval=NUM_SHARDS_PER_CLASS,
        dtype=tf.int32,
    )
    # class_id and random_shard_id are tensors inside interleave, so the path
    # is built with tf.strings ops rather than a Python f-string.
    shard_path = tf.strings.join([
        "path/to/shard/class_",
        tf.strings.as_string(class_id),
        "_",
        tf.strings.as_string(random_shard_id),
    ])
    return tf.data.TFRecordDataset(filenames=[shard_path])

class_ids = list(range(200))

ds = tf.data.Dataset.from_tensor_slices(class_ids)
ds = ds.shuffle(len(class_ids))
ds = ds.interleave(
    tfrecord_sampler,  # map_func must return a tf.data.Dataset for each class id
    cycle_length=num_classes_per_batch,
    block_length=num_examples_per_class,
    num_parallel_calls=tf.data.AUTOTUNE,
    deterministic=True,
)
ds = ds.map(lambda x: x, num_parallel_calls=tf.data.AUTOTUNE)
ds = ds.repeat(count=-1)
ds = ds.batch(num_classes_per_batch * num_examples_per_class)
ds = ds.prefetch(tf.data.AUTOTUNE)
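One note on the sketch: I think the ds.map(lambda x: x, ...) line is just a placeholder for where the deserialization would plug in, e.g. your decode_func(parse_tf_records_fn(x)) from above:

# Replace the identity map with the actual deserialization.
ds = ds.map(
    lambda x: decode_func(parse_tf_records_fn(x)),
    num_parallel_calls=tf.data.AUTOTUNE,
)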
@owenvallis I will likely give your code/approach a try when I get a chance, if I've understood it correctly.
So your idea is to break up each single-class tfrecord into many small tfrecords (still all of the same class)? I am still a little confused by the variable names, because it looks like each shard (i.e. each little tfrecord) should contain num_examples_per_class examples (however, you said "Each TFRecord contains NUM_SHARDS_PER_CLASS…"), which should be a very small number? It would help a lot to have an extremely simple but concrete example, including the specific number of examples and their class within each broken-down tfrecord.
I actually went ahead and used the current sampler even knowing this imperfection. It seems to train OK (maybe I've been lucky not to get any lone class within a batch).
Aside: the primary reason I'm interested in a tfrecord-based sampler is to eventually run training on TPU. In the past, I have always run into issues on the first try, and many of them had to do with unimplemented ops. If some of those TPU "bugs" are still not fixed, I anticipate your notebook will fail on TPU due to the data augmentation layers inside the model. I have always worked around that by doing the data augmentation within the tf.data pipeline (on the CPU). If I really do hit that later, I will probably just document it as an issue.
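What I mean by that workaround is roughly the following (just a sketch; the specific augmentation ops are only examples):

# Sketch: augment inside the tf.data pipeline (runs on the host CPU) instead of
# using Keras preprocessing layers inside the model. The sampler yields
# batched (x, y), so these ops operate on whole batches.
def augment(x, y):
    x = tf.image.random_flip_left_right(x)
    x = tf.image.random_brightness(x, max_delta=0.1)
    return x, y

train_ds = sampler.map(augment, num_parallel_calls=tf.data.AUTOTUNE)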
Hi @kechan, I'll try and set up a working example later today, and apologies for any confusing var names above. To clarify, I'm hoping the above approach will enable us to:
- Use shuffle to load a random set of class ids into interleave().
- Use the class id to load a random shard from that class.
- Each shard should be of size num_examples_per_class.
This should make the interleave() function consume an entire shard and then load another random class id. While the examples within each shard are fixed, we should still get a good mix within the batches as the collection of shards will be fairly random.
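Here's a rough, untested sketch of what I mean by the sharding, with made-up numbers (5 classes, 30 images per class, num_examples_per_class = 6, so 5 shards per class); serialized_examples_for_class() is just a placeholder for however the serialized tf.train.Example protos are produced:

num_examples_per_class = 6

for class_id in range(5):
    # Placeholder: a list of serialized tf.train.Example protos for this class.
    examples = serialized_examples_for_class(class_id)
    num_shards = len(examples) // num_examples_per_class
    for shard_id in range(num_shards):
        start = shard_id * num_examples_per_class
        path = f"path/to/shard/class_{class_id}_{shard_id}"
        with tf.io.TFRecordWriter(path) as writer:
            for ex in examples[start:start + num_examples_per_class]:
                writer.write(ex)
# e.g. class_0_0 holds the first 6 examples of class 0, class_0_1 the next 6, etc.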
And thanks for bringing the TPU issue to our attention. I'll see if I can run some tests on TPUs as well.
@owenvallis Thanks a lot.
One concern I just have: won't this approach generate a large number of tfrecord files? I have never tried more than 50 before. Do you anticipate any other performance bottleneck? For TPU, the link to cloud storage seems to be very fast, so I hope there are no surprises there.
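For a rough sense of scale: with the 200 classes in your sketch and, say, 3,000 images per class at 6 examples per shard, that would be 500 shards per class, or around 100,000 small files.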
@owenvallis
Update on tf.data.Dataset.sample_from_datasets
I went ahead and implemented a solution using tf.data.Dataset.sample_from_datasets(...) and ran it on TPU. It is quite fast, at ~125 ms/step, in the range I expected for my dataset. I suspect the slowness you saw could be due to the prior bug I raised related to tf.convert_to_tensor(...)