spark-scala3

Coursera Scala Course's Capstone Uses Your Library, but it may not work in that condition.

Open codeaperature opened this issue 1 year ago • 6 comments

Hi Vincenzo,

To me, it's unclear how to use your library, and it's possible that the Coursera Scala course's capstone (in its build file) points to information in the README that is no longer valid. I posted this on Stack Overflow. The course is hard when you can't get the simple things working, so it would be nice if you updated the README to help work out this TypeTag issue. Note that the code I posted on Stack Overflow tries to follow Spark's advice; I also tried to follow the README, but didn't post that attempt. In the Coursera project, I don't think we can change the build file.

Stefan

codeaperature, Oct 07 '23 23:10

Hi @codeaperature thank you for opening the issue!

To use our encoders, all you need is the import scala3encoders.given. The encoders are then available in the implicit scope, and you can obtain a reference to one with summon.

I can adapt your Stack Overflow snippets as follows:

import scala3encoders.given
import org.apache.spark.sql.Encoder

case class StationX(stnId: Int, wbanId: Int, lat: Double, lon: Double)

object Station extends App:
  val ss = summon[Encoder[StationX]]
  println(ss.schema)

and

package observatory
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.Encoder
import scala.reflect.ClassTag
import scala.deriving.Mirror
import scala3udf.{Udf => udf}
import scala3encoders.given

case class CC(i: Int)
object SparkInstance extends App {
  val spark = SparkSession
    .builder()
    .appName("Spark SQL UDF scalar example")
    .getOrCreate()

  def getSchema[T: Mirror.ProductOf: ClassTag] = summon[Encoder[T]].schema
  val random = udf(() => Math.random())
  val plusOne = udf((x: Int) => x + 1)
  val ss = getSchema[CC]
}

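You should not need to write a function such as getSchema: once the import is in place, summon[Encoder[CC]].schema can be called directly, as in the first snippet.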

vincenzobaz, Oct 09 '23 08:10

I'm a little flustered and worried that an actual course uses Spark together with Scala 3. I would consider this combination experimental and not suited for beginners (although Scala 3 is, IMHO, much better than Scala 2).

michael72, Oct 09 '23 08:10

@michael72 IIRC the course is offered in both Scala 2 and Scala 3. The assignments were tested in Scala 3 and many students have completed them successfully.

But it has been out for a while, so maybe the course manager should investigate whether the Scala 3 version has caused more problems...

vincenzobaz, Oct 09 '23 08:10

I finally got back to this (I have a regular Data Eng job too) ... I do not believe the rules of the project allow me to add extra libraries, and it seems that this part does not compile in the project:

.../observatory/src/main/scala/observatory/SparkInstance.scala:8:8
Not found: scala3udf
import scala3udf.{Udf => udf}

Maybe I made some other changes. BTW - Did you download the project or just check this in another way?

There is no requirement to use Spark, and the assignment actually reads its data from a resource bundled in the jar. Following the course suggestion, the data would have to be stream-loaded into memory and then pushed into a Spark DataFrame/Dataset to be processed. I think that is just unnecessary overhead in terms of memory, code, and socket open/close time. I can simply use parallel collections to do a simple join.
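A minimal sketch of such an in-memory join, using only the standard library. The record types and field names here are hypothetical (loosely modeled on the StationX case class above) and are not taken from the course code; this is one possible shape, not the assignment's solution.

```scala
// Hypothetical record types; names are illustrative only.
final case class StationKey(stnId: Int, wbanId: Int)
final case class Station(key: StationKey, lat: Double, lon: Double)
final case class Reading(key: StationKey, celsius: Double)
final case class Located(lat: Double, lon: Double, celsius: Double)

object InMemoryJoin:
  /** Inner join of readings with stations on the station key (a simple hash join). */
  def join(stations: Seq[Station], readings: Seq[Reading]): Seq[Located] =
    // Build the lookup table once from the (usually smaller) stations side.
    val byKey: Map[StationKey, Station] = stations.map(s => s.key -> s).toMap
    // Probe the table for every reading; readings without a matching station are dropped.
    readings.flatMap { r =>
      byKey.get(r.key).map(s => Located(s.lat, s.lon, r.celsius))
    }
```

The sketch is sequential. If the scala-parallel-collections module happens to be on the classpath, the probe step could be run on readings.par instead, but that depends on what the build provides.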

I'm going to drop this issue as I am taking a different path, but I am still curious if Coursera provided a bunk suggestion to use your library without supplying the proper tooling in the build.sbt.

Thanks for taking the time to look into this item.

codeaperature, Oct 14 '23 21:10

I think I understand the issue better now. The assignment does not involve UDFs; @michael72 implemented the UDF module long after the course was released. I could reach out to the new person in charge of the courses and ask them to include the udf dependency.

I will also ask if other people reported this issue. I am sorry for the frustration this has caused you. I collaborated with the course authors so I know it is not easy to maintain a large codebase and still make it extensible.
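For reference, including the udf dependency would presumably amount to one extra line in build.sbt. The fragment below is a sketch only: the coordinates are an assumption about how this project is published and should be checked against the README, and the version is deliberately left as a placeholder.

```scala
// build.sbt fragment (sketch). The organization and artifact names are assumptions;
// verify them against the project's README before use.
val sparkScala3Version = "<version>" // placeholder, not a real version number

libraryDependencies ++= Seq(
  "io.github.vincenzobaz" %% "spark-scala3-encoders" % sparkScala3Version,
  // The module that provides the scala3udf package used in the snippet above.
  "io.github.vincenzobaz" %% "spark-scala3-udf" % sparkScala3Version
)
```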

vincenzobaz, Oct 16 '23 09:10

Yeah - I tried to do some things differently ... for example a UDF to convert deg C -> F, but this could be done in another way. Also, I wanted to use datasets with StructTypes automatically derived from case classes.
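One such "other way" for the temperature conversion is a plain function that needs neither a UDF nor any extra dependency. A minimal sketch, with a hypothetical object and method name:

```scala
object Temperature:
  /** Converts degrees Celsius to degrees Fahrenheit: F = C * 9 / 5 + 32. */
  def celsiusToFahrenheit(celsius: Double): Double =
    celsius * 9.0 / 5.0 + 32.0

@main def temperatureDemo(): Unit =
  // Apply the conversion to an ordinary collection.
  val fahrenheit = Seq(0.0, 100.0).map(Temperature.celsiusToFahrenheit)
  println(fahrenheit) // prints List(32.0, 212.0)
```

The same function can be passed to map on an ordinary collection or on a typed Dataset. On a DataFrame, the arithmetic could instead be written directly on a column, for example col("celsius") * 9 / 5 + 32, where the column name is an assumption.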

Thanks for looking into this item for me.

codeaperature, Oct 16 '23 14:10