ecosystem BytesList with length 0 or 1 is inferred to have StringType instead of ArrayType

Open jukujala opened this issue 4 years ago • 1 comment

If a BytesList feature in TFRecords always has a length of 0 or 1, the feature is inferred to have StringType instead of ArrayType. Is there a reason for this behavior? Because of it, you can write a DataFrame as TFRecords, but you can't read those TFRecords back into a DataFrame. A zero-length BytesList is valid in TensorFlow.

Below is the implementation of parseBytesList from https://github.com/tensorflow/ecosystem/blob/master/spark/spark-tensorflow-connector/src/main/scala/org/tensorflow/spark/datasources/tfrecords/TensorFlowInferSchema.scala#L144:

  private def parseBytesList(feature: Feature): DataType = {
    val length = feature.getBytesList.getValueCount

    if (length == 0) {
      null
    }
    else if (length > 1) {
      ArrayType(StringType)
    }
    else {
      StringType
    }
  }
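
To make the branching above concrete, here is a minimal sketch that reproduces it without Spark on the classpath. The `DataType`, `StringType`, and `ArrayType` definitions are stand-ins written for this example (they are not Spark's classes), the function takes the value count directly instead of a `Feature`, and `inferAlwaysArray` is a hypothetical alternative, not something the connector provides. How the connector combines per-record results into a final schema is not shown in this issue, so the sketch only looks at individual records.

```scala
// Stand-in types so the sketch runs without Spark (hypothetical, not Spark's classes).
sealed trait DataType
case object StringType extends DataType
final case class ArrayType(elementType: DataType) extends DataType

object BytesListInference {
  // Mirrors the branching of the quoted parseBytesList, using the value count directly.
  // None stands in for the null returned by the quoted code.
  def inferQuoted(valueCount: Int): Option[DataType] =
    if (valueCount == 0) None
    else if (valueCount > 1) Some(ArrayType(StringType))
    else Some(StringType)

  // Hypothetical alternative: treat every BytesList as an array, whatever its length.
  def inferAlwaysArray(valueCount: Int): DataType = ArrayType(StringType)

  def main(args: Array[String]): Unit = {
    // Value counts of one array-of-strings column across three records.
    val counts = Seq(0, 1, 1)
    println(counts.map(inferQuoted))      // per-record result of the quoted logic
    println(counts.map(inferAlwaysArray)) // per-record result of the alternative
  }
}
```

With value counts of only 0 and 1, no record yields an array type under the quoted logic, which matches the behavior described in this issue. Whether always returning an array type would be acceptable for columns that really are scalar strings is a design question the sketch does not answer.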

May 29 '20 14:05 jukujala

I also hit this problem. Do you have any solutions?

Jun 11 '22 12:06 liusulizzu
