
ArityException with cookbook sample 4

Open jriekhof opened this issue 3 years ago • 5 comments

- [x] I have read through the quick start and installation sections of the README.

Info

| Info | Value |
| --- | --- |
| Operating System | macOS Big Sur 11.2.1 |
| Geni Version | 0.0.38 |
| Spark Java lib Version | org.apache.spark/spark-XXX_2.12 "3.1.0" |
| JDK | openjdk version "1.8.0_282" |
| Spark Version | 3.0.2 |

Problem / Steps to reproduce

Standard `lein new geni …`, the bitnami/spark 3.0.2 Docker image, then ran the code from Geni cookbook chapter 4.

The following code from cookbook chapter 4 fails:

```clojure
(def null-counts
  (-> raw-weather-mar-2012
      (g/agg (->> (g/column-names raw-weather-mar-2012)
                  (map #(vector % (g/null-count %)))
                  (into {})))
      (g/first)))
```

The exception is:

```
Execution error (AnalysisException) at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt/failAnalysis (package.scala:42).
cannot resolve '`Precip. Amount (mm)`' given input columns: [Climate ID, Date/Time (LST), Day, Dew Point Temp (°C), Dew Point Temp Flag, Hmdx, Hmdx Flag, Latitude (y), Longitude (x), Month, Precip. Amount (mm), Precip. Amount Flag, Rel Hum (%), Rel Hum Flag, Station Name, Stn Press (kPa), Stn Press Flag, Temp (°C), Temp Flag, Time (LST), Visibility (km), Visibility Flag, Weather, Wind Chill, Wind Chill Flag, Wind Dir (10s deg), Wind Dir Flag, Wind Spd (km/h), Wind Spd Flag, Year];
'Aggregate [sum(cast(cast(isnull(Latitude (y)#13272) as int) as bigint)) AS Latitude (y)#16363L, sum(cast(cast(isnull(Stn Press Flag#13295) as int) as bigint)) AS Stn Press Flag#16364L, sum(cast(cast(isnull(Temp (°C)#13280) as int) as bigint)) AS Temp (°C)#16365L, sum(cast(cast(isnull(Wind Spd (km/h)#13290) as int) as bigint)) AS Wind Spd (km/h)#16366L, sum(cast(cast(isnull(Rel Hum Flag#13285) as int) as bigint)) AS Rel Hum Flag#16367L, sum(cast(cast(isnull(Date/Time (LST)#13275) as int) as bigint)) AS Date/Time (LST)#16368L, sum(cast(cast(isnull(Visibility Flag#13293) as int) as bigint)) AS Visibility Flag#16369L, sum(cast(cast(isnull(Visibility (km)#13292) as int) as bigint)) AS Visibility (km)#16370L, sum(cast(isnull('Precip. Amount (mm)) as int)) AS Precip. Amount (mm)#16371, sum(cast(cast(isnull(Dew Point Temp Flag#13283) as int) as bigint)) AS Dew Point Temp Flag#16372L, sum(cast(isnull('Precip. Amount Flag) as int)) AS Precip. Amount Flag#16373, sum(cast(cast(isnull(Station Name#13273) as int) as bigint)) AS Station Name#16374L, sum(cast(cast(isnull(Wind Chill Flag#13299) as int) as bigint)) AS Wind Chill Flag#16375L, sum(cast(cast(isnull(Longitude (x)#13271) as int) as bigint)) AS Longitude (x)#16376L, sum(cast(cast(isnull(Time (LST)#13279) as int) as bigint)) AS Time (LST)#16377L, sum(cast(cast(isnull(Dew Point Temp (°C)#13282) as int) as bigint)) AS Dew Point Temp (°C)#16378L, sum(cast(cast(isnull(Rel Hum (%)#13284) as int) as bigint)) AS Rel Hum (%)#16379L, sum(cast(cast(isnull(Wind Dir Flag#13289) as int) as bigint)) AS Wind Dir Flag#16380L, sum(cast(cast(isnull(Climate ID#13274) as int) as bigint)) AS Climate ID#16381L, sum(cast(cast(isnull(Wind Dir (10s deg)#13288) as int) as bigint)) AS Wind Dir (10s deg)#16382L, sum(cast(cast(isnull(Stn Press (kPa)#13294) as int) as bigint)) AS Stn Press (kPa)#16383L, sum(cast(cast(isnull(Year#13276) as int) as bigint)) AS Year#16384L, sum(cast(cast(isnull(Temp Flag#13281) as int) as bigint)) AS Temp Flag#16385L, sum(cast(cast(isnull(Hmdx#13296) as int) as bigint)) AS Hmdx#16386L, ... 6 more fields]
+- Project [Longitude (x)#13211 AS Longitude (x)#13271, Latitude (y)#13212 AS Latitude (y)#13272, Station Name#13213 AS Station Name#13273, Climate ID#13214 AS Climate ID#13274, Date/Time (LST)#13215 AS Date/Time (LST)#13275, Year#13216 AS Year#13276, Month#13217 AS Month#13277, Day#13218 AS Day#13278, Time (LST)#13219 AS Time (LST)#13279, Temp (°C)#13220 AS Temp (°C)#13280, Temp Flag#13221 AS Temp Flag#13281, Dew Point Temp (°C)#13222 AS Dew Point Temp (°C)#13282, Dew Point Temp Flag#13223 AS Dew Point Temp Flag#13283, Rel Hum (%)#13224 AS Rel Hum (%)#13284, Rel Hum Flag#13225 AS Rel Hum Flag#13285, Precip. Amount (mm)#13226 AS Precip. Amount (mm)#13286, Precip. Amount Flag#13227 AS Precip. Amount Flag#13287, Wind Dir (10s deg)#13228 AS Wind Dir (10s deg)#13288, Wind Dir Flag#13229 AS Wind Dir Flag#13289, Wind Spd (km/h)#13230 AS Wind Spd (km/h)#13290, Wind Spd Flag#13231 AS Wind Spd Flag#13291, Visibility (km)#13232 AS Visibility (km)#13292, Visibility Flag#13233 AS Visibility Flag#13293, Stn Press (kPa)#13234 AS Stn Press (kPa)#13294, ... 6 more fields]
   +- Relation[Longitude (x)#13211,Latitude (y)#13212,Station Name#13213,Climate ID#13214,Date/Time (LST)#13215,Year#13216,Month#13217,Day#13218,Time (LST)#13219,Temp (°C)#13220,Temp Flag#13221,Dew Point Temp (°C)#13222,Dew Point Temp Flag#13223,Rel Hum (%)#13224,Rel Hum Flag#13225,Precip. Amount (mm)#13226,Precip. Amount Flag#13227,Wind Dir (10s deg)#13228,Wind Dir Flag#13229,Wind Spd (km/h)#13230,Wind Spd Flag#13231,Visibility (km)#13232,Visibility Flag#13233,Stn Press (kPa)#13234,... 6 more fields] csv

```

A manual select of the column named "Precip. Amount (mm)" also fails. It seems the names have backticks around them internally.

I tried to rename all columns with:

```clojure
(g/to-df weather-data
         "Longitude (x)" "Latitude (y)" "Station Name" "Climate ID" "Date/Time (LST)" "Year" "Month" "Day"
         "Time (LST)" "Temp (°C)" "Temp Flag" "Dew Point Temp (°C)" "Dew Point Temp Flag" "Rel Hum (%)"
         "Rel Hum Flag" "Precip. Amount (mm)" "Precip. Amount Flag" "Wind Dir (10s deg)" "Wind Dir Flag"
         "Wind Spd (km/h)" "Wind Spd Flag" "Visibility (km)" "Visibility Flag" "Stn Press (kPa)" "Stn Press Flag"
         "Hmdx" "Hmdx Flag" "Wind Chill" "Wind Chill Flag" "Weather")
```

but the problem persists; the backticks are still required.

Crashes: `(g/select raw-weather-mar-2012 "Precip. Amount (mm)")`

Works: `` (g/select raw-weather-mar-2012 "`Precip. Amount (mm)`") ``

`` (g/column-names (g/select raw-weather-mar-2012 "`Precip. Amount (mm)`")) `` yields "Precip. Amount (mm)" without backticks.

This led me to believe that there is some issue in Geni or Spark with these column names.
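In the meantime, those observations suggest a workaround: backtick-escape any name that contains a dot before selecting. A minimal sketch (`escape-dots` is just my ad-hoc helper name, not a Geni function):

```clojure
(require '[clojure.string :as str])

;; Wrap names containing dots in backticks so Spark's analyzer
;; treats them as quoted identifiers instead of nested fields.
(defn escape-dots [col-name]
  (if (str/includes? col-name ".")
    (str "`" col-name "`")
    col-name))

;; crashes without the escaping, works with it:
(g/select raw-weather-mar-2012 (escape-dots "Precip. Amount (mm)"))
```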

jriekhof avatar Feb 24 '21 15:02 jriekhof

Huh... That's very odd... Thank you for flagging this up! I've added this to my TODO list. I think I'll have time to look into it next week. I hope that's okay!

anthony-khong avatar Feb 25 '21 22:02 anthony-khong

Hi Anthony,

thank you for your quick reply! I figured out that I ran into this because of a small glitch in Cookbook chapter 4, where the initial CSV is read without `{:kebab-columns true}`. Just adding that option "fixed" it and is also consistent with the chapter's sample output. Still, it is interesting that it does not work without the kebab special-character mangling; I believe it should.
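For anyone hitting the same thing, the fix amounts to passing the option at read time. A sketch, with the file path as used in the cookbook chapter and the kebab-cased column name inferred from the mangling:

```clojure
;; :kebab-columns normalizes the headers on read, so the dotted
;; names never reach Spark's analyzer.
(def raw-weather-mar-2012
  (g/read-csv! "data/cookbook/weather/weather-2012-3.csv"
               {:kebab-columns true}))

;; "Precip. Amount (mm)" becomes a plain kebab-case column:
(g/select raw-weather-mar-2012 :precip-amount-mm)
```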

I just started with Spark and Geni the other day, and I can already say it is one of the best APIs I have used in years, congrats! Such a lot of fun! I will definitely continue to use it. It also seems to be quite fast, even on my local machine.

Hope you find the cause of this quickly, have a good day!

Ciao

...Jochen


jriekhof avatar Feb 26 '21 07:02 jriekhof

Hi Jochen, thank you for the kind words!! Glad you’re enjoying it. Please let me know if you have any feature requests.

On this issue, I think this is related: https://mungingdata.com/pyspark/avoid-dots-periods-column-names/

Basically, Spark doesn’t like column names with dots. One thing we could do is auto-escape them, but I’m not sure that’s the best solution, because you lose the correspondence with Spark’s own names. I’m leaning towards a ‘safe-mode’, on by default in the read functions, that scans the column names for dots and replaces them with underscores 🤔
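To make the idea concrete, a minimal sketch of that safe-mode rename, reusing the `.toDF`/`->scala-seq` interop Geni already uses internally (`->safe-columns` is a hypothetical name, nothing implemented yet):

```clojure
(require '[clojure.string :as str]
         '[zero-one.geni.interop :as interop])

;; Hypothetical safe-mode pass: after a read, replace every dot in
;; the column names with an underscore.
(defn ->safe-columns [dataset]
  (let [new-columns (map #(str/replace % "." "_") (.columns dataset))]
    (.toDF dataset (interop/->scala-seq new-columns))))
```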

anthony-khong avatar Feb 26 '21 10:02 anthony-khong

Hi Anthony...

thank you for digging into this! Yes, I think you are right; it looks like it is indeed the issue described there.

My feeling is to just document this in the Geni docs and recommend kebab-case for these cases; it fixes the Cookbook example very well :-).

Ciao

...Jochen


jriekhof avatar Feb 26 '21 13:02 jriekhof

Hi Anthony...

just a quick follow-up on our recent discussion regarding the dots/backticks column-name issue.

I read up a little on this in the Spark SQL docs and found Spark to be really restrictive: identifiers allow just letters (a-zA-Z), digits, and underscores. https://spark.apache.org/docs/latest/sql-ref-identifier.html

So your `:kebab-columns` handles it perfectly for Clojure, I think. An issue could remain for people exporting Parquet files for use in other languages that like underscores better.

Instead of `:kebab-columns true`, one could use something like `:snake-columns true`. Using `{:convert-column-names :kebab | :snake}` might be a bit more elegant, but, well, that would change the API :-).

I appended my REPL code (reusing your data-sources code) in case you would like to play with it.

Have a good day Anthony!

Ciao

...Jochen

```clojure
(require '[zero-one.geni.core.data-sources :as ds]
         '[zero-one.geni.interop :as interop]
         'camel-snake-kebab.core)

(defn ->normalized-columns
  "Returns a new Dataset with all columns renamed using the passed rename-fn."
  [dataset rename-fn]
  (let [remove-punctuations #'ds/remove-punctuations ; access private vars
        deaccent            #'ds/deaccent
        new-columns         (->> dataset
                                 .columns
                                 (map remove-punctuations)
                                 (map deaccent)
                                 (map rename-fn))]
    (.toDF dataset (interop/->scala-seq new-columns))))

(comment
  ;; plain
  (-> (g/read-csv! "data/cookbook/weather/weather-2012-3.csv")
      (g/select "Precip. Amount (mm)"))

  ;; kebab-columns
  (-> (g/read-csv! "data/cookbook/weather/weather-2012-3.csv")
      (->normalized-columns camel-snake-kebab.core/->kebab-case)
      (g/select "precip-amount-mm"))

  ;; snake-columns
  (-> (g/read-csv! "data/cookbook/weather/weather-2012-3.csv")
      (->normalized-columns camel-snake-kebab.core/->snake_case)
      (g/select "precip_amount_mm")))
```


jriekhof avatar Mar 08 '21 08:03 jriekhof