
Write an image dataframe

Open mlwkshops opened this issue 5 years ago • 13 comments

Is it possible to write an image dataframe? My use case is simple: I read a lot of images, apply some image processing to them, and save them to a different location. But I get an error that write is not supported for the image data type. Here is a simplified version of what I am doing.

from mmlspark import ImageTransformer

df_image = spark.read.format("image").load("car-images")
img_transformer = ImageTransformer().setOutputCol("flipped").flip()
flipped_images = img_transformer.transform(df_image).select("flipped")
flipped_images.write.format("image").save("car-images-flipped")

resulting in the following error:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-4-7a8ced5adfe6> in <module>()
----> 1 flipped_images.write.format("image").save("car-images-flipped3")

~/lib/spark/python/pyspark/sql/readwriter.py in save(self, path, format, mode, partitionBy, **options)
    734             self._jwrite.save()
    735         else:
--> 736             self._jwrite.save(path)
    737 
    738     @since(1.4)

~/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1255         answer = self.gateway_client.send_command(command)
   1256         return_value = get_return_value(
-> 1257             answer, self.gateway_client, self.target_id, self.name)
   1258 
   1259         for temp_arg in temp_args:

~/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

~/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling o104.save.
: java.lang.UnsupportedOperationException: Write is not supported for image data source
    at org.apache.spark.ml.source.image.ImageFileFormat.prepareWrite(ImageFileFormat.scala:48)
    at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:103)
    at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
    at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
    at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
    at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
    at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
    at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
    at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
    at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
    at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
    at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
    at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)

Am I doing it wrong or is it simply not yet supported?

mlwkshops avatar Aug 21 '19 06:08 mlwkshops

👋 Thanks for opening your first issue here! If you're reporting a 🐞 bug, please make sure you include steps to reproduce it.

welcome[bot] avatar Aug 21 '19 06:08 welcome[bot]

@mlwkshops sorry about the trouble you are having. Yes, this looks like a bug; I'm not sure whether it is in Spark or in ImageTransformer, though. I will take a look.

imatiach-msft avatar Aug 26 '19 03:08 imatiach-msft

Hey @mlwkshops, thanks for reaching out. We added the ability to write images a while ago; check this test for usage:

https://github.com/Azure/mmlspark/blob/9805996143d4cf174895ff2e08bb61fd2c99c4f1/src/test/scala/com/microsoft/ml/spark/io/image/ImageReaderSuite.scala#L149

Feel free to reopen if you need more help

mhamilton723 avatar Aug 26 '19 14:08 mhamilton723

@mhamilton723 Thanks for the quick response. Can you please provide the corresponding code for saving image data frames in Python as I am using PySpark?

mlwkshops avatar Sep 09 '19 13:09 mlwkshops

@mlwkshops use the following write format:

....
.format("org.apache.spark.ml.source.image.PatchedImageFileFormat")
....
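For reference, a minimal sketch of what that write call might look like in PySpark (the function name and path handling here are my own illustration, not from the thread; it assumes the mmlspark package is on the Spark classpath):

```python
# The built-in "image" source is read-only; MMLSpark ships a patched
# format that also supports writing.
PATCHED_IMAGE_FORMAT = "org.apache.spark.ml.source.image.PatchedImageFileFormat"

def save_images(df, path):
    # Hypothetical helper: df is an image DataFrame with the schema
    # the patched writer expects.
    df.write.format(PATCHED_IMAGE_FORMAT).mode("overwrite").save(path)
```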

mhamilton723 avatar Sep 11 '19 16:09 mhamilton723

This helped. However, there were additional steps I had to do. This is how I finally made it work: https://gist.github.com/mlwkshops/dd4f0b3f9888a07741be2fce8319ee86

Although it works with .format("org.apache.spark.ml.source.image.PatchedImageFileFormat"), the expected behavior is that it should also work with .format("image").


mlwkshops avatar Sep 19 '19 07:09 mlwkshops


@mlwkshops this gives me an error: .format("org.apache.spark.ml.source.image.PatchedImageFileFormat") fails with Failed to find data source: org.apache.spark.ml.source.image.PatchedImageFileFormat

jeevann1 avatar Jan 14 '20 08:01 jeevann1

@jeevann1 I think the issue is with the way you're creating the Spark session:

import pyspark

spark = pyspark.sql.SparkSession.builder.appName("MyApp") \
    .config("spark.jars.packages", "com.microsoft.ml.spark:mmlspark_2.11:1.0.0-rc1") \
    .config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven") \
    .getOrCreate()

You also have to change the schema to match the one mmlspark needs.

Sumit2889 avatar Apr 13 '20 09:04 Sumit2889

Can we write images with mmlspark? (using Azure Databricks)

  • We have 1 folder of 10k images that has to be split to 10 folders of 1k images.

  • We have to resize the images and write them.

from mmlspark import ImageTransformer

# Read images
image_df = spark.read.format("image").load("/mnt/images/*", inferschema=True)

# Resize
tr = ImageTransformer().setOutputCol("image").resize(height=200, width=200)
smallImgs = tr.transform(image_df)

# Write images
smallImgs.write.format("org.apache.spark.ml.source.image.PatchedImageFileFormat").save("/mnt/smallimgs")

Error:

IllegalArgumentException: 'Image data source supports:
	StructType(StructField(image,StructType(StructField(origin,StringType,true), StructField(height,IntegerType,false), StructField(width,IntegerType,false), StructField(nChannels,IntegerType,false), StructField(mode,IntegerType,false), StructField(data,BinaryType,false)),true), StructField(filenames,StringType,true))
	you have:
	StructType(StructField(image,StructType(StructField(origin,StringType,true), StructField(height,IntegerType,false), StructField(width,IntegerType,false), StructField(nChannels,IntegerType,false), StructField(mode,IntegerType,false), StructField(data,BinaryType,false)),true)).'

This looks like a schema mismatch; could you please help here?

@Sumit2889, how do I specify the schema you mentioned? Thanks.

SriramAvatar avatar Apr 17 '20 08:04 SriramAvatar


Hi, you have to change the schema when using mmlspark. The schema required by the image data source has an additional field, StructField(filenames,StringType,true), so you have to add an extra column to the dataframe. For an existing dataframe df:

df_new = df.withColumn('filenames', F.col("image.origin"))

Then write:

df_new.write.mode('overwrite').format("org.apache.spark.ml.source.image.PatchedImageFileFormat").save("images")

Everything runs and the data gets saved, but there is data loss: only the schema gets stored.
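Put together, the schema fix described above might be sketched as follows (a hypothetical helper, assuming a SparkSession configured with the mmlspark package and an image DataFrame `df` read via the "image" source):

```python
def add_filenames_and_write(df, path):
    # The patched writer expects an extra top-level column
    # StructField(filenames, StringType, true); derive it from the
    # image struct's origin field (the file path the image was read from).
    from pyspark.sql import functions as F

    df_out = df.withColumn("filenames", F.col("image.origin"))
    (df_out.write
        .mode("overwrite")
        .format("org.apache.spark.ml.source.image.PatchedImageFileFormat")
        .save(path))
```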

Sumit2889 avatar Apr 17 '20 08:04 Sumit2889


@Sumit2889, thank you. Yes, I am now able to save, but as you say there are no images :|

SriramAvatar avatar Apr 17 '20 09:04 SriramAvatar


I ran the code successfully too, but there are no images. Was this ever solved?

mxq-151 avatar Feb 13 '23 07:02 mxq-151