How to use TfRecordDataset, DatasetToTfRecord, and tf.io.tfRecordReader

Open mullerhai opened this issue 2 years ago • 10 comments

tensorflow-java 0.4, Spark 3.1, Java 11

Hi,
I am using tensorflow-java to read a TFRecord file, but I cannot get the data out, and there is no example for this. The Java classes TfRecordDataset, DatasetToTfRecord, and tf.io.tfRecordReader do not have the same API as in Python. Could you provide some examples of how to use them? Thanks.


    import org.tensorflow.{EagerSession, Operand, Session}
    import org.tensorflow.op.Ops
    import org.tensorflow.op.data.{DatasetToTfRecord, TfRecordDataset}

    // Build the ops in eager mode.
    val session = EagerSession.create()
    val tf = Ops.create(session)
    val scope = tf.scope()

    // val fileName = tf.constant("/Users/zhanghaining/Downloads/tfrecord-kk2-test/")
    val fileName = tf.constant("/Users/zhanghaining/Downloads/BigDL/spark/dl/src/test/resources/tf/mnist_train.tfrecord")
    val compress = tf.constant("") // empty string: no compression
    val bufferSize = tf.constant(0L)
    val recordDataSet = TfRecordDataset.create(scope, fileName, compress, bufferSize)

    val record = DatasetToTfRecord.create(scope, recordDataSet, fileName, compress)

    val reader = tf.io.tfRecordReader()

    println(record.op().name())
    println(record.op().`type`())
    println(recordDataSet.op().numOutputs())
    println(recordDataSet.asOutput().dataType())

mullerhai avatar May 31 '22 07:05 mullerhai

C++ API demo:

    std::unique_ptr<tensorflow::RandomAccessFile> file;
    auto tf_status = tensorflow::Env::Default()->NewRandomAccessFile(
        cc->InputSidePackets().Tag(kTFRecordPath).Get<std::string>(), &file);
    RET_CHECK(tf_status.ok())
        << "Failed to open tfrecord file: " << tf_status.ToString();
    tensorflow::io::RecordReader reader(file.get(),
                                        tensorflow::io::RecordReaderOptions());

mullerhai avatar May 31 '22 08:05 mullerhai

Hi @mullerhai ,

Is your goal to iterate through that dataset? If so, you need to create an iterator (e.g. by calling tf.data.makeIterator). Also, in your example the DatasetToTfRecord op writes to the same file that the dataset was loaded from, so I'm not sure what the expected behavior is. You should try writing to a different file.

If you don't mind adding org.tensorflow:tensorflow-framework to your dependencies, we do have utilities that simplify the usage of datasets, take a look at this one. You can then iterate through the elements of the dataset in eager mode like this:

    Dataset dataset = Dataset.tfRecordDataset(tf, "yourfile.tfrecord", "", 0L).batch(10);
    for (List<Operand<?>> components : dataset) {
        Operand<?> featureBatch = components.get(0);
        Operand<?> labelBatch = components.get(1);

        // ... operate on the batches directly
    }

Eager mode tends to be slow, though, so if you can provide more details about your specific use case, maybe we can give you better examples of how to do it.

karllessard avatar May 31 '22 12:05 karllessard

Great, thanks. I would also like to know how to convert a Dataset to a ByteNdArray, or a TFRecord file to a ByteNdArray, or how to convert a Dataset to examples (org.tensorflow.example.example.{Example, SequenceExample}), because I need code in this style

    NdArrays.wrap(Shape.of(dimSizes: _*), DataBuffers.of(bytes, true, false))

to build tensors for model training.

mullerhai avatar May 31 '22 13:05 mullerhai

Maybe you can do this via parseExampleDataset? There are also a bunch of utilities for parsing examples in the IO package, like this one.
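
If the goal is only to get the raw serialized bytes of each record (for example, to wrap them in a ByteNdArray or to parse them as an Example), the file can also be decoded in plain Java. The following is a minimal sketch, not a tensorflow-java API: the class TfRecordBytes and its methods are hypothetical names, and it assumes an uncompressed file that fits in memory. It relies on the TFRecord framing (little-endian 8-byte length, 4-byte masked CRC32C of the length, payload, 4-byte masked CRC32C of the payload) and on java.util.zip.CRC32C, which is available since Java 9.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.CRC32C;

// Hypothetical helper for illustration; not part of tensorflow-java.
final class TfRecordBytes {

    // TFRecord stores a masked CRC32C: rotate right by 15 bits, then add a constant.
    static int maskedCrc32c(byte[] data, int offset, int length) {
        CRC32C crc = new CRC32C();
        crc.update(data, offset, length);
        int value = (int) crc.getValue();
        return ((value >>> 15) | (value << 17)) + 0xa282ead8;
    }

    // Frames one payload: length (8 bytes), CRC of length (4), payload, CRC of payload (4).
    static byte[] frame(byte[] payload) {
        ByteBuffer out = ByteBuffer.allocate(16 + payload.length).order(ByteOrder.LITTLE_ENDIAN);
        out.putLong(payload.length);
        out.putInt(maskedCrc32c(out.array(), 0, 8));
        out.put(payload);
        out.putInt(maskedCrc32c(payload, 0, payload.length));
        return out.array();
    }

    // Splits the content of an uncompressed TFRecord file into raw record payloads.
    static List<byte[]> readRecords(byte[] file) {
        ByteBuffer in = ByteBuffer.wrap(file).order(ByteOrder.LITTLE_ENDIAN);
        List<byte[]> records = new ArrayList<>();
        while (in.hasRemaining()) {
            if (in.remaining() < 12) {
                throw new IllegalStateException("truncated record header");
            }
            int start = in.position();
            long length = in.getLong();
            if (in.getInt() != maskedCrc32c(file, start, 8)) {
                throw new IllegalStateException("corrupted record length");
            }
            // The payload must be followed by its 4-byte checksum.
            if (length < 0 || length > in.remaining() - 4) {
                throw new IllegalStateException("truncated record payload");
            }
            byte[] payload = new byte[(int) length];
            in.get(payload);
            if (in.getInt() != maskedCrc32c(payload, 0, payload.length)) {
                throw new IllegalStateException("corrupted record payload");
            }
            records.add(payload);
        }
        return records;
    }

    public static void main(String[] args) throws IOException {
        // With a path argument, read that file; otherwise decode one in-memory record.
        byte[] file = args.length > 0
                ? Files.readAllBytes(Paths.get(args[0]))
                : frame("hello".getBytes(StandardCharsets.UTF_8));
        for (byte[] record : readRecords(file)) {
            System.out.println(record.length + " bytes");
        }
    }
}
```

Each returned byte[] is one serialized record, so it can be handed to whatever builds the tensor or parses the Example. Passing a file path as the first argument prints the size of every record in that file.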

karllessard avatar May 31 '22 20:05 karllessard

With tensorflow-java 0.5.0-SNAPSHOT in EagerSession mode, when I iterate over the elements of the dataset, each element's class is OptionalGetValue or a similar type. I want to print the actual values, but I could not.

mullerhai avatar Jun 01 '22 09:06 mullerhai

parseExampleDataset

    val fp = tf.constant("/Volumes/Pink4T/transfer/code/github/stanford-tensorflow-tutorials/2017/data/friday.tfrecord")
    val compress = tf.constant("")
    val bufferSize = tf.constant(0L)
    val datazs = tf.data.tfRecordDataset(fp, compress, bufferSize)
    println(datazs.asTensor())

I get the error: No tensor type has been registered for data type DT_VARIANT

mullerhai avatar Jun 01 '22 09:06 mullerhai

We don't map (yet) DT_VARIANT tensors in the Java space. Can you please provide the full stacktrace? I want to see where such tensor is being accessed from the JVM.

karllessard avatar Jun 02 '22 11:06 karllessard

Hello, is there any way in which you could run this code outside eager mode? I need to access the binary representation of the example to hit a ParseExample node within a graph.

thanks!

albertoandreottiATgmail avatar Jul 25 '22 15:07 albertoandreottiATgmail

No, I have not managed to make it work.

mullerhai avatar Jul 27 '22 06:07 mullerhai

Sure, that will work in graph mode as well. You just need to make sure that the tf instance you pass to Dataset.tfRecordDataset is executing in a graph environment, i.e. var tf = Ops.create(graph);

You won't be able to use a Java for loop, though, so you'll need to rely on other TF ops and on the methods exposed by the datasets/iterators to iterate through the examples within your graph.

karllessard avatar Jul 27 '22 16:07 karllessard