
ADLS Support

alexdesroches opened this issue on Mar 29, 2021 · 5 comments

Hi team,

Does cobrix support reading from Azure Blob storage? I'm having some issues doing the following:

copybook_path = "abfss://[email protected]/example_data/companies_copybook.cpy"
file_path = "abfss://[email protected]/example_data/companies_data/COMP.DETAILS.SEP30.DATA.dat"

df = (
    spark.read.format("za.co.absa.cobrix.spark.cobol.source")
    .option("copybook", copybook_path)
    .load(file_path)
)

alexdesroches · Mar 29 '21 18:03

Hi,

Cobrix uses the Hadoop FileSystem abstraction, so as long as the filesystem is supported by Hadoop, Cobrix should support it too.

Do other data sources (spark-csv, for instance) work for you with abfss?

yruslan · Mar 29 '21 19:03

Thanks for the quick response Ruslan.

Yes, if I do something like this:

file_path = "abfss://[email protected]/example_data/test.csv"
df2 = (
    spark.read.format("csv")
    .load(file_path)
)

I can read the CSV into a DataFrame directly with abfss.

Here is the full trace for the error I'm getting:

Py4JJavaError: An error occurred while calling o509.load.
: Failure to initialize configuration
	at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.services.SimpleKeyProvider.getStorageAccountKey(SimpleKeyProvider.java:51)
	at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AbfsConfiguration.getStorageAccountKey(AbfsConfiguration.java:529)
	at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.initializeClient(AzureBlobFileSystemStore.java:1528)
	at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystemStore.<init>(AzureBlobFileSystemStore.java:207)
	at shaded.databricks.azurebfs.org.apache.hadoop.fs.azurebfs.AzureBlobFileSystem.initialize(AzureBlobFileSystem.java:116)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2669)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:370)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:295)
	at za.co.absa.cobrix.spark.cobol.source.parameters.CobolParametersValidator$.validatePath$1(CobolParametersValidator.scala:72)
	at za.co.absa.cobrix.spark.cobol.source.parameters.CobolParametersValidator$.validateOrThrow(CobolParametersValidator.scala:95)
	at za.co.absa.cobrix.spark.cobol.source.DefaultSource.createRelation(DefaultSource.scala:51)
	at za.co.absa.cobrix.spark.cobol.source.DefaultSource.createRelation(DefaultSource.scala:47)
	at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:385)
	at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:424)
	at org.apache.spark.sql.DataFrameReader.$anonfun$load$3(DataFrameReader.scala:391)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:391)
	at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:278)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:380)
	at py4j.Gateway.invoke(Gateway.java:295)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:251)
	at java.lang.Thread.run(Thread.java:748)


alexdesroches · Mar 29 '21 19:03

I see. It seems it tries to use the filesystem, but for some reason fails to initialize it.

Sorry, we can't help you with this at the moment: we don't have access to an Azure account, and no one on our team has experience with it.

If you figure out what is wrong and what can be done on Cobrix's side to make it work, we'll gladly include the fix. Or you could contribute a PR; that would be fantastic and most welcome.

Again, I'm really sorry. I want to help, but just don't know how.

yruslan · Mar 29 '21 19:03

You can mount ADLS on Databricks and use it as local storage. That is the right way to use ADLS in Databricks.
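For reference, a Databricks mount along those lines would be configured roughly like this. This is a hypothetical sketch: the secret scope, key names, tenant ID, container, and account are placeholders, not values from this thread, and dbutils is only available on a Databricks cluster.

```python
# Hypothetical sketch of mounting an ADLS Gen2 container on Databricks
# via OAuth with a service principal. All names below are placeholders.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id":
        dbutils.secrets.get(scope="my-scope", key="client-id"),
    "fs.azure.account.oauth2.client.secret":
        dbutils.secrets.get(scope="my-scope", key="client-secret"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

# After mounting, the data is reachable under /mnt/example as if it
# were local storage, so Cobrix can read it through an ordinary path.
dbutils.fs.mount(
    source="abfss://<container>@<account>.dfs.core.windows.net/",
    mount_point="/mnt/example",
    extra_configs=configs,
)
```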

vijayinani · Apr 13 '21 11:04

> You can mount ADLS on Databricks and use it as local storage. That is the right way to use ADLS in Databricks.

@vijayinani Not necessarily true; it depends on your security guidelines. Mounting a storage account exposes the data to all users of the workspace, which is a problem if some of those users are not authorized to see it.

@alexdesroches I think the stack trace gives you the clue. You're using Python, but Cobrix is written in Scala, and it most likely cannot see the authentication settings you configured in PySpark with spark.conf.set.

We can access the underlying Hadoop configuration directly: sc._jsc.hadoopConfiguration().set('my.mapreduce.setting', 'someVal')

Can you try changing spark.conf.set to sc._jsc.hadoopConfiguration().set and report back the results?
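Concretely, the idea is to move the account-key setting from the Spark session config onto the Hadoop configuration, which is what JVM-side filesystem code (such as Cobrix's path validation) actually reads. A sketch, assuming a Databricks notebook; the account name and the secret lookup below are placeholders, not values from this thread:

```python
# Placeholder: fetch the storage account key from a Databricks secret scope.
key = dbutils.secrets.get(scope="my-scope", key="storage-account-key")

# Instead of setting the key on the Spark session config:
#   spark.conf.set("fs.azure.account.key.<account>.dfs.core.windows.net", key)
# set it on the underlying Hadoop configuration, which the JVM-side
# AzureBlobFileSystem initialization reads:
sc._jsc.hadoopConfiguration().set(
    "fs.azure.account.key.<account>.dfs.core.windows.net", key
)
```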

kennydataml · Jul 12 '21 20:07