
Write an image dataframe

Open mlwkshops opened this issue 5 years ago • 13 comments

Is it possible to write an image dataframe? My use case is simple: I read a lot of images, apply some image processing to them, and save them to a different location. But I get an error that write is not supported for the image data type. Here is a simplified version of what I am doing.

from mmlspark import ImageTransformer

df_image = spark.read.format("image").load("car-images")
img_transformer = ImageTransformer().setOutputCol("flipped").flip()
flipped_images = img_transformer.transform(df_image).select("flipped")
flipped_images.write.format("image").save("car-images-flipped")

resulting in the following error:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-4-7a8ced5adfe6> in <module>()
----> 1 flipped_images.write.format("image").save("car-images-flipped3")

~/lib/spark/python/pyspark/sql/readwriter.py in save(self, path, format, mode, partitionBy, **options)
    734             self._jwrite.save()
    735         else:
--> 736             self._jwrite.save(path)
    737 
    738     @since(1.4)

~/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258 
   1259         for temp_arg in temp_args:

~/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

~/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling o104.save.
: java.lang.UnsupportedOperationException: Write is not supported for image data source
    at org.apache.spark.ml.source.image.ImageFileFormat.prepareWrite(ImageFileFormat.scala:48)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:103)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
    at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

Am I doing it wrong or is it simply not yet supported?

mlwkshops avatar Aug 21 '19 06:08 mlwkshops

👋 Thanks for opening your first issue here! If you're reporting a 🐞 bug, please make sure you include steps to reproduce it.

welcome[bot] avatar Aug 21 '19 06:08 welcome[bot]

@mlwkshops sorry about the trouble you are having. Yes, this looks like a bug; I'm not sure whether it is in Spark or in ImageTransformer, though. I will take a look.

imatiach-msft avatar Aug 26 '19 03:08 imatiach-msft

Hey @mlwkshops, thanks for reaching out. We added the ability to write images a while ago; check this test for usage:

https://github.com/Azure/mmlspark/blob/9805996143d4cf174895ff2e08bb61fd2c99c4f1/src/test/scala/com/microsoft/ml/spark/io/image/ImageReaderSuite.scala#L149

Feel free to reopen if you need more help

mhamilton723 avatar Aug 26 '19 14:08 mhamilton723

@mhamilton723 Thanks for the quick response. Can you please provide the corresponding code for saving image data frames in Python as I am using PySpark?

mlwkshops avatar Sep 09 '19 13:09 mlwkshops

@mlwkshops use the following write format:

....
.format("org.apache.spark.ml.source.image.PatchedImageFileFormat")
....
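For reference, a minimal sketch of what that write call might look like in PySpark (the function name and path handling here are my own illustration, not from the thread; it assumes the mmlspark package is on the Spark classpath):

```python
# The built-in "image" source is read-only; MMLSpark ships a patched
# format that also supports writing.
PATCHED_IMAGE_FORMAT = "org.apache.spark.ml.source.image.PatchedImageFileFormat"

def save_images(df, path):
    # Hypothetical helper: df is an image DataFrame with the schema
    # the patched writer expects.
    df.write.format(PATCHED_IMAGE_FORMAT).mode("overwrite").save(path)
```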

mhamilton723 avatar Sep 11 '19 16:09 mhamilton723

This helped. However, there were additional steps I had to do. This is how I finally made it work: https://gist.github.com/mlwkshops/dd4f0b3f9888a07741be2fce8319ee86

Although it works with .format("org.apache.spark.ml.source.image.PatchedImageFileFormat"), the expected behavior is that it should also work with .format("image").


mlwkshops avatar Sep 19 '19 07:09 mlwkshops


@mlwkshops this gives me an error: .format("org.apache.spark.ml.source.image.PatchedImageFileFormat") fails with Failed to find data source: org.apache.spark.ml.source.image.PatchedImageFileFormat

jeevann1 avatar Jan 14 '20 08:01 jeevann1

@jeevann1 I think the issue is with the way you're creating the Spark session:

import pyspark

spark = pyspark.sql.SparkSession.builder.appName("MyApp") \
    .config("spark.jars.packages", "com.microsoft.ml.spark:mmlspark_2.11:1.0.0-rc1") \
    .config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven") \
    .getOrCreate()

You also have to change the schema to match the one mmlspark needs.

Sumit2889 avatar Apr 13 '20 09:04 Sumit2889

Can we write images with mmlspark? (using Azure Databricks)

  • We have 1 folder of 10k images that has to be split to 10 folders of 1k images.

  • We have to resize the images and write them.

from mmlspark import ImageTransformer

# Read images
image_df = spark.read.format("image").load("/mnt/images/*", inferschema=True)

# Resize
tr = ImageTransformer().setOutputCol("image").resize(height=200, width=200)
smallImgs = tr.transform(image_df)

# Write images
smallImgs.write.format("org.apache.spark.ml.source.image.PatchedImageFileFormat").save("/mnt/smallimgs")

Error:

IllegalArgumentException: 'Image data source supports:
	StructType(StructField(image,StructType(StructField(origin,StringType,true), StructField(height,IntegerType,false), StructField(width,IntegerType,false), StructField(nChannels,IntegerType,false), StructField(mode,IntegerType,false), StructField(data,BinaryType,false)),true), StructField(filenames,StringType,true))
	you have:
	StructType(StructField(image,StructType(StructField(origin,StringType,true), StructField(height,IntegerType,false), StructField(width,IntegerType,false), StructField(nChannels,IntegerType,false), StructField(mode,IntegerType,false), StructField(data,BinaryType,false)),true)).'

This looks like a schema mismatch; could you please help here?

@Sumit2889, how do I specify the schema you mentioned? Thanks.

SriramAvatar avatar Apr 17 '20 08:04 SriramAvatar


Hi, you have to change the schema when using mmlspark. The schema required by the image data source has an additional field, StructField(filenames,StringType,true), so you have to add an extra column to the dataframe. For an existing dataframe df:

df_new = df.withColumn('filenames', F.col("image.origin"))

Then write:

df_new.write.mode('overwrite').format("org.apache.spark.ml.source.image.PatchedImageFileFormat").save("images")

Everything runs and the data gets saved, but there is data loss: only the schema gets stored.
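Put together, the schema fix described above might be sketched as follows (a hypothetical helper, assuming a SparkSession configured with the mmlspark package and an image DataFrame `df` read via the "image" source):

```python
def add_filenames_and_write(df, path):
    # The patched writer expects an extra top-level column
    # StructField(filenames, StringType, true); derive it from the
    # image struct's origin field (the file path the image was read from).
    from pyspark.sql import functions as F

    df_out = df.withColumn("filenames", F.col("image.origin"))
    (df_out.write
        .mode("overwrite")
        .format("org.apache.spark.ml.source.image.PatchedImageFileFormat")
        .save(path))
```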

Sumit2889 avatar Apr 17 '20 08:04 Sumit2889


@Sumit2889, thank you. Yes, I am now able to save, but as you say there are no images :|

SriramAvatar avatar Apr 17 '20 09:04 SriramAvatar


I ran the code successfully too, but there are no images. Was this ever solved?

mxq-151 avatar Feb 13 '23 07:02 mxq-151