
Question - TFRecords and contiguous examples

Open ksachdeva opened this issue 2 years ago • 9 comments

Hi,

From the documentation:

WARNING: This sampler assumes that class examples are contiguous, at least enough that you can get example_per_class of them consecutively. This requirement is needed to make the sampling efficient, and it often makes dataset construction easier as there is no need to worry about shuffling. Somewhat contiguous means it is fine to have the same class in multiple shards, as long as the examples for that class are contiguous within each shard.

Most likely I am confused with the definition of shards here.

For my given experiment, I have 12 classes. For every class, I have its corresponding tfrecord file. In other words, examples for a class are contiguous, since each tfrecord file contains examples of a single class.

I used TFRecordDatasetSampler with default parameters (i.e. example_per_class = 2, batch_size = 32) but I do not get representatives from all classes.
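
Roughly, the setup looks like this (a sketch; the folder name and deserialization function stand in for my actual ones):

import tensorflow_similarity as tfsim

# One tfrecord file per class, e.g. data/class_00.tfrecords ... data/class_11.tfrecords
sampler = tfsim.samplers.TFRecordDatasetSampler(
    shard_path="data",               # hypothetical folder holding the 12 per-class files
    deserialization_fn=decode_fn,    # my tf.io.parse_single_example wrapper
    shard_suffix="*.tfrecords",
    example_per_class=2,             # defaults
    batch_size=32,
)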

# example output
tf.Tensor(
[10 10 10 10 10 10 10 10 10  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1
 11 11 11 11 11 11 11 11], shape=(32,), dtype=int64)

Would appreciate it if you could help provide some insight into how one should create the tfrecord files to utilize this sampler.

Regards & thanks Kapil

ksachdeva commented on Oct 05 '21

Hi Kapil, I just pushed some updates to the TFRecordDatasetSampler and they should be available on pypi now. The sampler should now work as expected if each class is a separate tfrecord file. Please let me know if you still run into issues though.

owenvallis commented on Oct 19 '21

See #151 for the changes.

owenvallis commented on Oct 19 '21

Thanks @owenvallis, much appreciated that you're looking into this.

I tried the new package on my dataset but did not get the result I was expecting.

I then took a step back and created some fake data to reproduce the problem.

It does seem to be solved; however, I see 2 issues:

a) The first few iterations/fetches do not have equal class representation. Look at the first few iterations in the image below.

[image: label output of the first few batches]

b) The iteration is endless. I was expecting it to end once the data is exhausted. You will see in the script that I am exiting early based on a condition.

Below is a script to generate the tfrecords and then read them using similarity package.

import os
import tensorflow as tf
import tensorflow_similarity as tfsim

TFRECORDS_FOLDER_PATH = "tmp/tfsim-test"


def decode_fn(record_bytes):
    return tf.io.parse_single_example(
        # Data
        record_bytes,
        # Schema
        {
            "x": tf.io.FixedLenFeature([], dtype=tf.float32),
            "y": tf.io.FixedLenFeature([], dtype=tf.int64)
        })


def generate_tfrecords_file(class_num: int, file_path: str):
    # Write 100 dummy examples for a single class.
    x_tensors = tf.range(0, 100)

    with tf.io.TFRecordWriter(file_path) as file_writer:
        for x in x_tensors:
            record_bytes = tf.train.Example(features=tf.train.Features(
                feature={
                    # float() turns the scalar eager tensor into a plain Python float
                    "x": tf.train.Feature(
                        float_list=tf.train.FloatList(value=[float(x)])),
                    "y": tf.train.Feature(
                        int64_list=tf.train.Int64List(value=[class_num])),
                })).SerializeToString()
            file_writer.write(record_bytes)


def generate_tfrecords_dataset():
    os.makedirs(TFRECORDS_FOLDER_PATH, exist_ok=True)
    for i in range(10):
        generate_tfrecords_file(i, f"{TFRECORDS_FOLDER_PATH}/r{i}.tfrecords")


def read_tfrecord_file(file_path: str):
    tfdata = tf.data.TFRecordDataset([file_path]).map(decode_fn)

    for d in tfdata:
        print(f"x= {d['x']}, y= {d['y']}")


def read_using_tfsim():
    sampler = tfsim.samplers.TFRecordDatasetSampler(
        shard_path=TFRECORDS_FOLDER_PATH,
        deserialization_fn=decode_fn,
        shard_suffix="*.tfrecords",
        batch_size=20,
        example_per_class=2,
    )

    ITERATIONS = 10

    count = 0
    for d in sampler:
        # print(f"x= {d['x']}")
        print(f"y= {d['y']}")
        count = count + 1

        if count >= ITERATIONS:
            break


# generate_tfrecords_dataset()
# read_tfrecord_file(f"{TFRECORDS_FOLDER_PATH}/r1.tfrecords")
read_using_tfsim()

ksachdeva commented on Oct 19 '21

quick update to say I'm going to try and look at this later this week.

owenvallis commented on Nov 11 '21

So I've been working on this, and the TFRecordDatasetSampler is a little tricky. Ideally we would like to pass a set of TFRecord datasets to choose_from_datasets, but this turns out to be pretty slow. Instead, we use interleave to randomly choose cycle_length tf record files from disk and read block_length examples from each tf record iterator. If each tf record contains a single class, this makes it easy to ensure that a batch contains cycle_length classes with block_length examples per class.
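
To make that concrete, the pattern is roughly the following (a simplified sketch of the idea, not the actual sampler code; it reuses the decode_fn and the tmp/tfsim-test files from the reproduction script above):

import tensorflow as tf

cycle_length = 10   # classes per batch
block_length = 2    # examples per class per batch
batch_size = cycle_length * block_length

# One tfrecord file per class; interleave pulls block_length records at a time
# from cycle_length files, so each batch covers cycle_length classes.
shards = tf.data.Dataset.list_files("tmp/tfsim-test/*.tfrecords", shuffle=True)
ds = shards.interleave(
    lambda path: tf.data.TFRecordDataset(path),
    cycle_length=cycle_length,
    block_length=block_length,
)
ds = ds.map(decode_fn).batch(batch_size)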

However, this assumes that the number of examples in each tf record file is a multiple of block_length and that our batch_size is equal to cycle_length * block_length. If one of the iterators is shorter, it will run out of examples early and another tf record will be loaded. If that happens, then the subsequent batches will be offset. For example:

tfA = [0,0,0,0]
tfB = [1,1,1]  # <- this iterator is short one item
tfC = [2,2,2,2]
cycle_length = 2
block_length = 2

batch_1 = [0,0,1,1] # first batch is fine
batch_2 = [0,0,1,2] # tfB runs out so we load tfC
batch_3 = [2,2,2,1] # now all subsequent batches are offset

The other issue is that each tf record is loaded as an iterator, so there is no way to shuffle the items within the record and we will always consume the record examples in the same order. Additionally, if the iterator is longer than block_length, we will need to exhaust the iterator before we are able to load a new class.

The current solution that I'm using is to write multiple tf records per class, where I randomly sample the elements and make the size of each record equal to block_length. This ensures that each tf record is a random sample of examples from the class and that the tf record is exhausted after a single batch. The previous example would then look something like:

tfA_0 = [0,0]
tfA_1 = [0,0]
tfB_0 = [1,1]
tfB_1 = [1,1]
tfC_0 = [2,2]
tfC_1 = [2,2]
cycle_length = 2
block_length = 2

batch_1 = [0,0,1,1]  # first batch is fine tfA_0, tfB_1
batch_2 = [2,2,1,1] # now we don't need to finish the first iterator and can load new classes tfC_1, tfB_0
batch_3 = [2,2,0,0] # and the offsets are now gone tfC_0, tfA_1
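
A rough sketch of how those per-class shards can be written (the "x"/"y" schema follows the toy reproduction script above; the helper itself is hypothetical):

import random
import tensorflow as tf

def write_class_shards(class_id, values, block_length, out_dir):
    # Shuffle the class examples, then emit shards of exactly block_length
    # records so each shard is exhausted after contributing to one batch.
    values = list(values)
    random.shuffle(values)
    num_shards = len(values) // block_length  # drop any remainder
    for s in range(num_shards):
        block = values[s * block_length:(s + 1) * block_length]
        path = f"{out_dir}/class_{class_id}_shard_{s}.tfrecords"
        with tf.io.TFRecordWriter(path) as writer:
            for x in block:
                example = tf.train.Example(features=tf.train.Features(feature={
                    "x": tf.train.Feature(float_list=tf.train.FloatList(value=[float(x)])),
                    "y": tf.train.Feature(int64_list=tf.train.Int64List(value=[class_id])),
                }))
                writer.write(example.SerializeToString())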

Regarding the infinite dataset, we currently have a repeat added after the deserialization so the dataset will repeat forever. I'll look at adding a parameter to set the number of repeats as part of the next release.
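
In the meantime, one way to cap it from the caller side (a workaround sketch, not a library feature, assuming the object returned by the sampler is a plain tf.data.Dataset, which the repeat behavior suggests):

# Hypothetical workaround: take a fixed number of batches from the infinite sampler.
steps_per_epoch = 100
for batch in sampler.take(steps_per_epoch):
    ...  # train / inspect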

owenvallis commented on Nov 24 '21

Support for setting the number of repeats was just added in PR #196

owenvallis commented on Nov 30 '21

@owenvallis

Hello, regarding this:

WARNING: This sampler requires that each TF Record file contain contiguous blocks of classes where the size of each block is a multiple of example_per_class.

Here is an ImageNet dataset in tfrecord format. I'm not confident that these tfrecords contain contiguous blocks of classes. In this case, how should I approach this tfrecord dataset with the tf-similarity tfrecord sampler? Any tips?
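
For reference, a quick way to inspect a shard's label order might be something like this (a sketch; the shard name and the "image/class/label" feature key are assumptions for ImageNet-style records, so they would need adjusting to the actual schema):

import tensorflow as tf

def label_runs(path, label_key="image/class/label"):
    labels = []
    for raw in tf.data.TFRecordDataset([path]):
        ex = tf.train.Example.FromString(raw.numpy())
        labels.append(ex.features.feature[label_key].int64_list.value[0])
    # Contiguous blocks of classes => the label changes only a few times.
    changes = sum(1 for a, b in zip(labels, labels[1:]) if a != b)
    print(f"{path}: {len(labels)} examples, {len(set(labels))} classes, {changes} label changes")

label_runs("imagenet-train-00000-of-01024.tfrecord")  # hypothetical shard name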

innat commented on Jul 20 '22

Also, unlike TFDatasetMultiShotMemorySampler, the TFRecordDatasetSampler doesn't have any parameter for augmentation. Shouldn't it be included?
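
For now, a possible workaround (a sketch, not an existing option) seems to be folding the augmentation into the deserialization_fn that the sampler already accepts, since every record passes through it. decode_fn and augment_fn here are placeholders for user-supplied functions.

def deserialize_and_augment(record_bytes):
    example = decode_fn(record_bytes)          # parse the raw record
    example["x"] = augment_fn(example["x"])    # apply the augmentation per example
    return example

sampler = tfsim.samplers.TFRecordDatasetSampler(
    shard_path=TFRECORDS_FOLDER_PATH,
    deserialization_fn=deserialize_and_augment,
    shard_suffix="*.tfrecords",
    batch_size=20,
    example_per_class=2,
)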

innat commented on Jul 20 '22

I need to adopt the following approach, but the problem is that I don't have control over the number of occurrences of each class per batch.

import tensorflow as tf

def dataset_for_class(i):
    # Filter the label dataset down to a single class id.
    i = tf.cast(i, tf.int32)
    return dataset.filter(lambda label: label == i)

dataset = [1,2,3,1,2,3,1,1,4,5,1,1,5,1,5,4]
len_ds = len(dataset)

dataset = tf.data.Dataset.from_tensor_slices(dataset)
# Interleave the per-class datasets; with cycle_length=1 this simply groups
# all examples of each class together, one class after another.
dataset = tf.data.Dataset.range(len_ds).interleave(
    dataset_for_class,
    cycle_length=1,
)
dataset = dataset.prefetch(tf.data.AUTOTUNE)
list(dataset.as_numpy_iterator())
[1, 1, 1, 1, 1, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 5]

However, the expected tfrecord format described here is not always easy to achieve. For example, an existing tfrecord dataset that I may want to use for similarity modeling would first need to be deserialized, reordered by class, and saved again as needed.
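
For completeness, the rewrite step I have in mind would be roughly this (a sketch; the "y" label key follows the toy schema used earlier in this thread, so it would need adjusting for other records):

import collections
import tensorflow as tf

def regroup_by_class(src_path, out_dir):
    # Bucket the serialized examples by label, then write one shard per class
    # so the examples of each class become contiguous.
    buckets = collections.defaultdict(list)
    for raw in tf.data.TFRecordDataset([src_path]):
        ex = tf.train.Example.FromString(raw.numpy())
        label = ex.features.feature["y"].int64_list.value[0]
        buckets[label].append(raw.numpy())     # keep the original serialized bytes
    for label, records in buckets.items():
        with tf.io.TFRecordWriter(f"{out_dir}/class_{label}.tfrecords") as writer:
            for rec in records:
                writer.write(rec)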

innat commented on Jul 27 '22