Creating Own Dataset For 'Learning to simulate complex physics with Graph networks'

Open piba941 opened this issue 3 years ago • 10 comments

I am trying to evaluate the model using my own dataset, but I am facing issues while generating it and converting it into TFRecords. Could you please help?

piba941 avatar Apr 01 '21 03:04 piba941

I have the same purpose. What is your current progress?

zxy-ml84 avatar Jun 22 '21 06:06 zxy-ml84

Hi there, we are happy to help if we can.

Could you provide some more information on what your issues are?

jg8610 avatar Aug 16 '21 10:08 jg8610

Note that you can use your own datasets without writing them as TFRecords. Instead, tf.data.Dataset has a from_generator method that lets you load a dataset using arbitrary Python code, so you may replace these two lines here with from_generator, to load your dataset in whatever format you have it stored, and simply make the tensors look like those read from our records datasets :)
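
For illustration, here is a minimal sketch of that approach (my_trajectories and generator are placeholder names, not part of the repository; the expected shapes are described further down in this thread):

import numpy as np
import tensorflow.compat.v1 as tf

# Hypothetical in-memory data: one trajectory with 4 particles, 6 steps, 2 dimensions.
my_trajectories = [
    (np.array([5, 5, 5, 6], dtype=np.int64),       # particle_type, shape [num_particles]
     np.random.rand(6, 4, 2).astype(np.float32)),  # position, shape [num_steps, num_particles, num_dims]
]

def generator():
    for key, (particle_type, position) in enumerate(my_trajectories):
        yield ({'particle_type': particle_type, 'key': np.int64(key)},
               {'position': position})

# Yields (context, features) tuples shaped like those parsed from the tfrecords.
ds = tf.data.Dataset.from_generator(
    generator,
    output_types=({'particle_type': tf.int64, 'key': tf.int64},
                  {'position': tf.float32}),
    output_shapes=({'particle_type': [None], 'key': []},
                   {'position': [None, None, 2]}))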

alvarosg avatar Aug 18 '21 11:08 alvarosg

First, thanks to @alvarosg for the answer.

I have made some very simple arbitrary data that resembles the structure of the parsed version of the test.tfrecord, as follows:

particle_type = np.array([3, 3, 3, 3])
key = 0
position = np.array([[[0.42, 0.55],
                      [0.41, 0.55],
                      [0.40, 0.54],
                      [0.39, 0.53]],
                     [[0.56, 0.14],
                      [0.56, 0.13],
                      [0.54, 0.12],
                      [0.53, 0.11]]])

What is a good way to make this a compatible dataset? I first tried to make a generator and pass it to from_generator, but I am not sure how to make the resulting dataset compatible with the training data.

yjchoi1 avatar Aug 19 '21 04:08 yjchoi1

The TFRecord uses the SequenceExample format, which can be generated as shown below:

# Import modules (this file should live outside the learning_to_simulate code folder)
import functools
import os
import json

import tensorflow.compat.v1 as tf
import numpy as np

# The script iterates over tf.data datasets and calls .numpy(), which requires
# eager execution (a no-op if eager is already enabled, as in TF2).
tf.enable_eager_execution()

from learning_to_simulate import reading_utils

# Set the data path and the split to read
data_path = './datasets/WaterDropSample'
filename = 'valid.tfrecord'

# Read metadata
def _read_metadata(data_path):
    with open(os.path.join(data_path, 'metadata.json'), 'rt') as fp:
        return json.loads(fp.read())

# Fetch metadata
metadata = _read_metadata(data_path)

print(metadata)

# Read TFRecord
ds_org = tf.data.TFRecordDataset([os.path.join(data_path, filename)])
ds = ds_org.map(functools.partial(reading_utils.parse_serialized_simulation_example, metadata=metadata))

# Extract contexts and features from the parsed dataset
particle_types = []
keys = []
positions = []
for _ds in ds:
    context, features = _ds
    particle_types.append(context["particle_type"].numpy().astype(np.int64))
    keys.append(context["key"].numpy().astype(np.int64))
    positions.append(features["position"].numpy().astype(np.float32))
    
# The following functions can be used to convert a value to a type compatible
# with tf.train.Example.

def _bytes_feature(value):
    """Returns a bytes_list from a string / byte."""
    if isinstance(value, type(tf.constant(0))):
        value = value.numpy() # BytesList won't unpack a string from an EagerTensor.
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value):
    """Returns a float_list from a float / double."""
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_feature(value):
    """Returns an int64_list from a bool / enum / int / uint."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))


# Write the new TFRecord using the SequenceExample format
with tf.python_io.TFRecordWriter('test.tfrecord') as writer:
    
    for step, (particle_type, key, position) in enumerate(zip(particle_types, keys, positions)):
        seq = tf.train.SequenceExample(
                context=tf.train.Features(feature={
                    "particle_type": _bytes_feature(particle_type.tobytes()),
                    "key": _int64_feature(key)
                }),
                feature_lists=tf.train.FeatureLists(feature_list={
                    'position': tf.train.FeatureList(
                        feature=[_bytes_feature(position.flatten().tobytes())],
                    ),
                    # 'step_context' is only needed for datasets with global
                    # context features (see metadata); it can be omitted otherwise.
                    'step_context': tf.train.FeatureList(
                        feature=[_bytes_feature(np.float32(step).tobytes())]
                    ),
                })
            )

        writer.write(seq.SerializeToString())


dt = tf.data.TFRecordDataset(['test.tfrecord'])
dt = dt.map(functools.partial(reading_utils.parse_serialized_simulation_example, metadata=metadata))

# Check if the original TFRecord and the newly generated TFRecord are the same
for ((_ds_context, _ds_feature), (_dt_context, _dt_feature)) in zip(ds, dt):
    if not np.allclose(_ds_context["key"].numpy(), _dt_context["key"].numpy()):
        break

    if not np.allclose(_ds_context["particle_type"].numpy(), _dt_context["particle_type"].numpy()):
        break

    if not np.allclose(_ds_feature["position"].numpy(), _dt_feature["position"].numpy()):
        break

else:
    print("TFRecords are similar!")

kks32 avatar Nov 23 '21 16:11 kks32

Note that you can use your own datasets without writing them as TFRecords. Instead, tf.data.Dataset has a from_generator method that lets you load a dataset using arbitrary Python code, so you may replace these two lines here with from_generator, to load your dataset in whatever format you have it stored, and simply make the tensors look like those read from our records datasets :)

Thanks. Can you provide an explanation about the format of your datasets?

yq60523 avatar Jan 25 '22 07:01 yq60523

Thanks. Can you provide an explanation about the format of your datasets?

This varies a bit on a per-dataset basis, but in the simplest case (a dataset with only positions), it is mostly two dicts with the following fields:

context["particle_type"]  # shape [num_particles]
features["position"] # shape [num_steps, num particles, num_dimensions]

If you want to connect your own data to this, I would recommend running these two lines:

    ds = tf.data.TFRecordDataset([os.path.join(data_path, f'{split}.tfrecord')])
    ds = ds.map(functools.partial(
        reading_utils.parse_serialized_simulation_example, metadata=metadata))

inspect the spec of the output, and then write your own tf.Dataset that outputs your data in the same format.
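
For example, a self-contained way to inspect that spec (placeholder path; just a sketch, assuming the WaterDropSample layout from earlier in the thread):

import functools
import json
import os

import tensorflow.compat.v1 as tf

from learning_to_simulate import reading_utils

tf.enable_eager_execution()

data_path = './datasets/WaterDropSample'  # placeholder
with open(os.path.join(data_path, 'metadata.json'), 'rt') as fp:
    metadata = json.loads(fp.read())

ds = tf.data.TFRecordDataset([os.path.join(data_path, 'valid.tfrecord')])
ds = ds.map(functools.partial(
    reading_utils.parse_serialized_simulation_example, metadata=metadata))

# Prints the nested (context, features) structure with dtypes and shapes.
print(tf.data.get_output_types(ds))
print(tf.data.get_output_shapes(ds))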

alvarosg avatar Jan 25 '22 14:01 alvarosg

Many thanks to Alvaro. That's exactly what I wanted.

yq60523 avatar Jan 26 '22 01:01 yq60523

The GNS paper mentions generating the Water-3D dataset using SPlisHSPlasH. How did you do it? I want to make some test cases myself. Thanks a lot!

Vesuvius6 avatar Mar 18 '23 12:03 Vesuvius6

What is the step_context argument in the data for? Is it required when creating the dataset?

When I open the tfrecord for debugging, I see that in each element the features dict only contains position information; step_context is not included.

BoyuanTang331 avatar Nov 15 '23 09:11 BoyuanTang331