SynapseML
Write an image dataframe
Is it possible to write an image dataframe? My use case is simple: I read a lot of images, apply some image processing to them, and save them to a different location. But I get an error that write is not supported for the image data type. Here is a simplified version of what I am doing:
```python
from mmlspark import ImageTransformer

df_image = spark.read.format("image").load("car-images")
img_transformer = ImageTransformer().setOutputCol("flipped").flip()
flipped_images = img_transformer.transform(df_image).select("flipped")
flipped_images.write.format("image").save("car-images-flipped")
```
resulting in the following error:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-4-7a8ced5adfe6> in <module>()
----> 1 flipped_images.write.format("image").save("car-images-flipped3")
~/lib/spark/python/pyspark/sql/readwriter.py in save(self, path, format, mode, partitionBy, **options)
734 self._jwrite.save()
735 else:
--> 736 self._jwrite.save(path)
737
738 @since(1.4)
~/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py in __call__(self, *args)
1255 answer = self.gateway_client.send_command(command)
1256 return_value = get_return_value(
-> 1257 answer, self.gateway_client, self.target_id, self.name)
1258
1259 for temp_arg in temp_args:
~/lib/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
61 def deco(*a, **kw):
62 try:
---> 63 return f(*a, **kw)
64 except py4j.protocol.Py4JJavaError as e:
65 s = e.java_exception.toString()
~/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling o104.save.
: java.lang.UnsupportedOperationException: Write is not supported for image data source
at org.apache.spark.ml.source.image.ImageFileFormat.prepareWrite(ImageFileFormat.scala:48)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:103)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:159)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:104)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult(commands.scala:102)
at org.apache.spark.sql.execution.command.DataWritingCommandExec.doExecute(commands.scala:122)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:131)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$1.apply(SparkPlan.scala:127)
at org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:155)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.sql.execution.SparkPlan.executeQuery(SparkPlan.scala:152)
at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:127)
at org.apache.spark.sql.execution.QueryExecution.toRdd$lzycompute(QueryExecution.scala:80)
at org.apache.spark.sql.execution.QueryExecution.toRdd(QueryExecution.scala:80)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
at org.apache.spark.sql.DataFrameWriter$$anonfun$runCommand$1.apply(DataFrameWriter.scala:668)
at org.apache.spark.sql.execution.SQLExecution$$anonfun$withNewExecutionId$1.apply(SQLExecution.scala:78)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:125)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:73)
at org.apache.spark.sql.DataFrameWriter.runCommand(DataFrameWriter.scala:668)
at org.apache.spark.sql.DataFrameWriter.saveToV1Source(DataFrameWriter.scala:276)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:270)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:228)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748)
Am I doing it wrong or is it simply not yet supported?
@mlwkshops sorry about the trouble you're having. Yes, this looks like a bug; I'm not sure whether it's in Spark or ImageTransformer, though. Will take a look.
Hey @mlwkshops, thanks for reaching out. We added the ability to write images a while ago; check this test for usage:
https://github.com/Azure/mmlspark/blob/9805996143d4cf174895ff2e08bb61fd2c99c4f1/src/test/scala/com/microsoft/ml/spark/io/image/ImageReaderSuite.scala#L149
Feel free to reopen if you need more help
@mhamilton723 Thanks for the quick response. Can you please provide the corresponding code for saving image dataframes in Python, as I am using PySpark?
@mlwkshops use the following write format:

```python
...
.format("org.apache.spark.ml.source.image.PatchedImageFileFormat")
...
```
This helped. However, there were additional steps I had to take. This is how I finally made it work: https://gist.github.com/mlwkshops/dd4f0b3f9888a07741be2fce8319ee86

Although it works with `.format("org.apache.spark.ml.source.image.PatchedImageFileFormat")`, the expected behavior is that it should work with `.format("image")`.
@mlwkshops this gives me an error with `.format("org.apache.spark.ml.source.image.PatchedImageFileFormat")`:

```
Failed to find data source: org.apache.spark.ml.source.image.PatchedImageFileFormat
```

@jeevann1 I think the issue is with the way you're creating the Spark session:
```python
import pyspark

spark = pyspark.sql.SparkSession.builder.appName("MyApp") \
    .config("spark.jars.packages", "com.microsoft.ml.spark:mmlspark_2.11:1.0.0-rc1") \
    .config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven") \
    .getOrCreate()
```
You also have to change the schema to match the schema mmlspark expects.
Can we write images with mmlspark? (Using Azure Databricks.)

- We have 1 folder of 10k images that has to be split into 10 folders of 1k images.
- We have to resize the images and write them.

```python
from mmlspark import ImageTransformer

# Read images
image_df = spark.read.format("image").load("/mnt/images/*", inferschema=True)

# Resize
tr = ImageTransformer().setOutputCol("image").resize(height=200, width=200)
smallImgs = tr.transform(image_df)

# Write images
smallImgs.write.format("org.apache.spark.ml.source.image.PatchedImageFileFormat").save("/mnt/smallimgs")
```

Error:

```
IllegalArgumentException: Image data source supports:
	StructType(StructField(image,StructType(StructField(origin,StringType,true), StructField(height,IntegerType,false), StructField(width,IntegerType,false), StructField(nChannels,IntegerType,false), StructField(mode,IntegerType,false), StructField(data,BinaryType,false)),true), StructField(filenames,StringType,true))
	you have:
	StructType(StructField(image,StructType(StructField(origin,StringType,true), StructField(height,IntegerType,false), StructField(width,IntegerType,false), StructField(nChannels,IntegerType,false), StructField(mode,IntegerType,false), StructField(data,BinaryType,false)),true))
```

This looks like a schema mismatch; could you please help here?
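For what it's worth, the folder-split part of the requirement is plain bucketing logic, independent of Spark. Here is an illustrative sketch (the paths are made up; in practice they would come from the dataframe's `image.origin` column) of slicing 10k paths into 10 groups of 1k, which could then drive 10 separate writes:

```python
# Made-up paths standing in for the 10k images.
paths = [f"/mnt/images/img_{i:05d}.jpg" for i in range(10_000)]

# Slice into consecutive chunks of 1k each.
chunk_size = 1_000
chunks = [paths[i:i + chunk_size] for i in range(0, len(paths), chunk_size)]

print(len(chunks), len(chunks[0]))  # 10 1000
```

Each chunk could then be used to filter the dataframe (e.g. with `isin` on `image.origin`) before writing to its own output folder.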
@Sumit2889, how do I specify the schema you mentioned? Thanks.
Hi, you have to change the schema when using mmlspark.
You can see that the schema required by the image data source has an additional field, `StructField(filenames,StringType,true)`, so you have to add an extra column to the dataframe.
To add the field to an existing dataframe `df`:

```python
from pyspark.sql import functions as F

df_new = df.withColumn('filenames', F.col("image.origin").alias('filenames'))
```

Then write:

```python
df_new.write.mode('overwrite').format("org.apache.spark.ml.source.image.PatchedImageFileFormat").save("images")
```

The write runs without errors and the save completes, but there is data loss: only the schema gets stored, not the image data.
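A small aside: the `filenames` column above is just the full origin URI. If you would rather store a bare file name, the string handling can be sketched in plain Python (hypothetical origin values; in Spark this logic would live in a UDF feeding the `filenames` column):

```python
import os

# Hypothetical origin values in the shape Spark's image reader produces.
origins = [
    "file:/mnt/images/car1.jpg",
    "file:/mnt/images/car2.jpg",
]

# Strip the directory part, keeping only the base file name.
filenames = [os.path.basename(o) for o in origins]
print(filenames)  # ['car1.jpg', 'car2.jpg']
```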
@Sumit2889, thank you. Yes, I'm now able to save, but as you say there are no images :|
I ran the code successfully too, but there are no images. Has this been solved?