
Adding support for reading and writing to multiple tfrecords in `nsl.tools.pack_nbrs`

srihari-humbarwadi opened this issue 2 years ago • 8 comments

The current implementation of nsl.tools.pack_nbrs does not support reading from and writing to multiple TFRecord files. Given the extensive optimizations the tf.data API makes available when working with multiple TFRecord files, supporting this would yield significant performance gains in distributed training. I would be willing to contribute this.

Relevant parts of the code

  • for reading https://github.com/tensorflow/neural-structured-learning/blob/c21dad4feff187cdec041a564193ea7b619b8906/neural_structured_learning/tools/pack_nbrs.py#L63-L71
  • for writing https://github.com/tensorflow/neural-structured-learning/blob/c21dad4feff187cdec041a564193ea7b619b8906/neural_structured_learning/tools/pack_nbrs.py#L264-L270
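As context for the performance claim, here is a minimal sketch (assuming TensorFlow 2.x; the file names and record contents are invented for illustration) of how tf.data.TFRecordDataset consumes several shard files as a single stream:

```python
import os
import tempfile

import tensorflow as tf

# Write two small, hypothetical shard files.
tmpdir = tempfile.mkdtemp()
filenames = [
    os.path.join(tmpdir, 'train-{}-of-2.tfrecord'.format(i)) for i in range(2)
]
for i, fname in enumerate(filenames):
  with tf.io.TFRecordWriter(fname) as writer:
    for j in range(3):
      writer.write('record-{}-{}'.format(i, j).encode())

# A list of filenames yields one dataset spanning all shards.
records = [r.numpy() for r in tf.data.TFRecordDataset(filenames)]
print(len(records))
```

With a list of filenames, the same dataset can also be read with `num_parallel_reads` to interleave shards, which is where the distributed-training speedup comes from.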

srihari-humbarwadi avatar Jul 11 '21 14:07 srihari-humbarwadi

Thanks for your interest in contributing, @srihari-humbarwadi! :) To begin with, could you share more specifics on what APIs you plan to change and how you plan to change them? Once we come to an agreement there, you can go ahead with the implementation. I've assigned one of my colleagues, @aheydon-google, to this thread, who will be able to work with you on this.

arjung avatar Jul 12 '21 18:07 arjung

Here is the signature of the current implementation of pack_nbrs:

def pack_nbrs(labeled_examples_path,
              unlabeled_examples_path,
              graph_path,
              output_training_data_path,
              add_undirected_edges=False,
              max_nbrs=None,
              id_feature_name='id'):

labeled_examples_path and unlabeled_examples_path are each a path to a single TFRecord file. The proposed modification changes this to additionally support a list of paths. This would enable users to load examples that are split across multiple TFRecord shards, which is often the case when training on multiple accelerators.

labeled_examples_path = 'train.tfrecord'  # current implementation supports this.

# proposed modification adds support for the following as well
labeled_examples_path = [
    'train-0001-of-0004.tfrecord',
    'train-0002-of-0004.tfrecord',
    'train-0003-of-0004.tfrecord',
    'train-0004-of-0004.tfrecord'
    ]

The _read_tfrecord_examples function (defined here), which currently reads examples from a single file, would require minimal changes to support reading from multiple files.

https://github.com/tensorflow/neural-structured-learning/blob/c21dad4feff187cdec041a564193ea7b619b8906/neural_structured_learning/tools/pack_nbrs.py#L66-L68

The modified code would look something like this:

for raw_record in tf.data.TFRecordDataset(filenames):  # filenames is list of tfrecord paths
  tf_example = parse_example(raw_record)
  result[get_id(tf_example)] = tf_example

For writing the newly generated examples, the current pack_nbrs implementation writes them all into a single TFRecord file at the path given by the output_training_data_path argument. The proposed modification adds optional support for splitting the newly generated examples across multiple TFRecord shards: num_shards, a new argument in pack_nbrs, would control the number of shards generated. This again requires only a small code addition, changing https://github.com/tensorflow/neural-structured-learning/blob/c21dad4feff187cdec041a564193ea7b619b8906/neural_structured_learning/tools/pack_nbrs.py#L264-L266 to

writers = []
for i in range(num_shards):
  # there could be a better way to generate output TFRecord names
  output_path = '{}-{}-of-{}'.format(output_training_data_path, i, num_shards)
  writers.append(tf.io.TFRecordWriter(output_path))

for i, merged_ex in enumerate(_join_examples(seed_exs, nbr_exs, graph, max_nbrs)):
  writers[i % num_shards].write(merged_ex.SerializeToString())

# close all writers
for writer in writers:
  writer.close()
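On the naming question flagged in the comment above, one option would be to zero-pad the shard index and count, matching the `train-00000-of-00004` convention commonly produced by TensorFlow tooling. A sketch, where shard_output_paths is a hypothetical helper rather than part of the proposal:

```python
def shard_output_paths(base_path, num_shards):
  # Hypothetical helper: generates zero-padded shard names in the
  # common <base>-NNNNN-of-NNNNN style.
  return ['{}-{:05d}-of-{:05d}'.format(base_path, i, num_shards)
          for i in range(num_shards)]

paths = shard_output_paths('nsl_train.tfr', 2)
```

Zero-padding keeps the shard files lexicographically sorted, so a simple glob returns them in order when reading the packed output back.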

srihari-humbarwadi avatar Jul 13 '21 17:07 srihari-humbarwadi

Hi, Srihari.

Thanks for supplying those details and for offering to contribute to NSL! What you propose sounds like a nice improvement. If you send me a pull request with your proposed changes, I can review it.

Thanks!

  • Allan

aheydon-google avatar Jul 29 '21 01:07 aheydon-google

Thank you @aheydon-google, I will start working on this!

srihari-humbarwadi avatar Aug 05 '21 01:08 srihari-humbarwadi

Looking forward to this feature.

For bigger datasets, packing all the examples into a single TFRecord file introduces a substantial bottleneck in the overall data pipeline.

sayakpaul avatar Aug 20 '21 04:08 sayakpaul

Hi, Srihari. Do you have any updates to report on this issue? I think it could be quite useful!

Thanks,

  • Allan

aheydon-google avatar Sep 22 '21 18:09 aheydon-google

@aheydon-google I'll push some changes in a couple of days!

srihari-humbarwadi avatar Sep 23 '21 14:09 srihari-humbarwadi

Hi again, Srihari. Are you still planning to work on this issue? I think it would be a great contribution if you can do it!

Thanks,

  • Allan

aheydon-google avatar Jan 12 '22 19:01 aheydon-google

Since this issue has been dormant for quite some time, I'm going to close it. Feel free to send a pull request if you want to implement the improvement!

aheydon-google avatar Aug 09 '23 18:08 aheydon-google

Closing the issue for now.

csferng avatar Aug 09 '23 18:08 csferng