spark-excel

How to read the hyperlink from excel files

Open vamsikt opened this issue 2 years ago • 8 comments

Hello,

How to read hyperlinks from excel files?

vamsikt avatar Feb 15 '22 15:02 vamsikt

Do they not get returned if you read them as text?

nightscape avatar Feb 15 '22 18:02 nightscape

@nightscape hi, what is the option we should use to read them as text in version 1?

vamsikt avatar Feb 15 '22 19:02 vamsikt

Basically, I would expect it to work right off the bat. If it doesn't, try providing a schema manually (see README.md).

nightscape avatar Feb 17 '22 08:02 nightscape

@nightscape can you provide an example of reading them as text? I specified a manual schema with Spark's StringType, but I still don't see the hyperlink; the column shows 0.0.

vamsikt avatar Feb 23 '22 22:02 vamsikt

@vamsikt let's do it the other way round 😉 Please provide an example Excel file and the code you use for reading it. In the best case, you would fill out the Issue Template (which should actually have been inserted into the issue when you created it).

nightscape avatar Feb 24 '22 11:02 nightscape

Hi @nightscape, here is the Excel file from which we want to extract the URL (https://google.com) and NOT just the 'View Link' text.

We're wondering if we can load the entire hyperlink cell content as text first and then extract the URL from the string.

[image]

Our code:

import org.apache.spark.sql._
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._

object ExcelHyperlink {
  def main(args: Array[String]): Unit = {

    val customSchema = StructType(Array(
      StructField("A", StringType, nullable = false),
      StructField("B", IntegerType, nullable = false)))

    val spark = SparkSession
      .builder()
      .appName("MySparkApp")
      .config("spark.master", "local")
      .getOrCreate()

    val df = spark.read
      .format(
        "com.crealytics.spark.excel"
      )
      .option("dataAddress", "A1")
      .option("header", "true")
      .option("treatEmptyValuesAsNulls", "false")
      .option("addColorColumns", "false")
      .option("usePlainNumberFormat", "false")
      .schema(customSchema)
      .load(".../src/main/resources/SampleHyperlinkFile.xlsx")
      .na
      .drop(how = "all")

    df.printSchema()
    df.show


  }

}

Output: [image] [image]

We also tried specifying Column A to be loaded as BinaryType and then tried converting it to StringType but without success.

Please let us know if this can be accomplished with the current library state or if we need to fill out the Issue Template. Thanks in advance!

kuzmicni avatar Apr 28 '22 15:04 kuzmicni

@kuzmicni reading a hyperlink will probably not work with the current implementation. One would need to implement something like this in spark-excel here. I don't have time to do this unfortunately, but we're open to PRs 😃
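
Until something like that exists in spark-excel, one possible workaround is to read the link targets directly with Apache POI, the library spark-excel is built on. The following is only a minimal sketch, not part of spark-excel: it assumes `poi-ooxml` is available on the classpath as a regular dependency, and `LinkedCell`, `HyperlinkSketch`, `extractLinks` and `extractLinksFromFile` are hypothetical names chosen for illustration. It relies on POI's `Cell.getHyperlink`, which returns `null` for cells without a hyperlink, and on `Hyperlink.getAddress` for the target URL.

```scala
import java.io.File
import org.apache.poi.ss.usermodel.{DataFormatter, Workbook, WorkbookFactory}
import scala.collection.JavaConverters._

// Hypothetical holder for one hyperlinked cell; not part of spark-excel.
// row and col are POI's zero-based indices (a header row counts as row 0).
final case class LinkedCell(row: Int, col: Int, text: String, url: String)

object HyperlinkSketch {

  // Collect the display text and link target of every hyperlinked cell in one sheet.
  def extractLinks(workbook: Workbook, sheetIndex: Int = 0): Seq[LinkedCell] = {
    val formatter = new DataFormatter()
    val sheet = workbook.getSheetAt(sheetIndex)
    for {
      row  <- sheet.rowIterator().asScala.toList // toList forces eager evaluation
      cell <- row.cellIterator().asScala
      link <- Option(cell.getHyperlink)          // None when the cell has no hyperlink
    } yield LinkedCell(
      cell.getRowIndex,
      cell.getColumnIndex,
      formatter.formatCellValue(cell),           // the visible text, e.g. "View Link"
      link.getAddress                            // the link target
    )
  }

  // Open the file read-only so that closing the workbook cannot write back to it.
  def extractLinksFromFile(path: String): Seq[LinkedCell] = {
    val workbook = WorkbookFactory.create(new File(path), null, true)
    try extractLinks(workbook)
    finally workbook.close()
  }
}
```

Since `LinkedCell` is a case class, the result could then be turned into a DataFrame (for example with `spark.createDataFrame(links)`) and joined with the data read through spark-excel, assuming you can map the row and column indices onto your DataFrame rows. This runs on the driver only, so it is suitable for files small enough to open there.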

nightscape avatar May 02 '22 08:05 nightscape

Hello, has any feature been implemented in the com.crealytics.spark.excel library that would let me read hyperlinks, as in the question above? I'm having the same problem now. Is there any plan to develop something along these lines to resolve the issue? Thank you and best regards.

HeronCarlos avatar Apr 16 '24 20:04 HeronCarlos