Spark SAS Data Source (sas7bdat)
Splittable SAS (.sas7bdat) Input Format for Hadoop and Spark SQL
A library for reading SAS data (.sas7bdat) with Spark.
Requirements:
A Spark and Scala version matching one of the package builds in the table below.
Download:
The latest jar can be downloaded from spark-packages.
| Version | Scala Version | Spark Version |
|---|---|---|
| 3.0.0-s_2.11 | 2.11.x | 2.4.x |
| 3.0.0-s_2.12 | 2.12.x | 3.0.x |
Features:
- This package allows reading SAS files from local and distributed filesystems into Spark DataFrames.
- Schema is automatically inferred from metadata embedded in the SAS file. (Behaviour can be customised; see the parameters below.)
- The SAS format is splittable when not file-system compressed; for example, a 200 GB (1.5 billion row) .sas7bdat file can be converted to .csv files using 2000 executors in under 2 minutes.
- This library uses parso for parsing, as it is the only publicly available parser that handles both forms of SAS compression (CHAR and BINARY).

NOTE: this package does not support writing sas7bdat files.
Docs:
Parameters:
- `extractLabel` (default: `false`) - Boolean: extract column labels as column comments for Parquet/Hive.
- `forceLowercaseNames` (default: `false`) - Boolean: force column names to lower case.
- `inferDecimal` (default: `false`) - Boolean: infer numeric columns with format width > 0 and format precision > 0 as Decimal(Width, Precision).
- `inferDecimalScale` (default: each column's format width) - Int: scale of inferred decimals.
- `inferFloat` (default: `false`) - Boolean: infer numeric columns with <= 4 bytes as Float.
- `inferInt` (default: `false`) - Boolean: infer numeric columns with <= 4 bytes, format width > 0, and format precision = 0 as Int.
- `inferLong` (default: `false`) - Boolean: infer numeric columns with <= 8 bytes, format width > 0, and format precision = 0 as Long.
- `inferShort` (default: `false`) - Boolean: infer numeric columns with <= 2 bytes, format width > 0, and format precision = 0 as Short.
- `metadataTimeout` (default: `60`) - Int: number of seconds allowed for reading file metadata (stops corrupt files from hanging).
- `minSplitSize` (default: `mapred.min.split.size`) - Long: minimum byte length of input splits (splits are always at least 1 MB, to ensure correct reads).
- `maxSplitSize` (default: `mapred.max.split.size`) - Long: maximum byte length of input splits (can be decreased to force higher parallelism).
NOTE:
- The order of precedence for numeric type inference is: Long -> Int -> Short -> Decimal -> Float -> Double.
- SAS doesn't have a concept of Long/Int/Short; instead, people typically use column formats with 0 precision. A combined example is sketched below.
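For illustration, here is a minimal sketch (not part of the original docs) combining several of the inference options above; the resulting Spark types depend on each column's byte length and column format:

```scala
// Minimal sketch: combine several inference options in one read.
// "cars.sas7bdat" is the sample file reused in the API examples below.
val df = spark.read
  .format("com.github.saurfang.sas.spark")
  .option("inferLong", true)      // 0-precision formatted numerics (<= 8 bytes) -> Long
  .option("inferDecimal", true)   // formatted numerics with precision > 0 -> Decimal
  .option("metadataTimeout", 120) // allow extra time for metadata on slow storage
  .load("cars.sas7bdat")

// Inspect the inferred schema before writing anything out.
df.printSchema()
```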
Scala API
val df = {
spark.read
.format("com.github.saurfang.sas.spark")
.option("forceLowercaseNames", true)
.option("inferLong", true)
.load("cars.sas7bdat")
}
df.write.format("csv").option("header", "true").save("newcars.csv")
You can also use the implicit readers:
```scala
import com.github.saurfang.sas.spark._

// DataFrameReader
val df = spark.read.sas("cars.sas7bdat")
df.write.format("csv").option("header", "true").save("newcars.csv")

// SQLContext
val df2 = sqlContext.sasFile("cars.sas7bdat")
df2.write.format("csv").option("header", "true").save("newcars.csv")
```
(Note: you cannot use parameters like inferLong with the implicit readers.)
Python API
```python
df = spark.read.format("com.github.saurfang.sas.spark").load("cars.sas7bdat", forceLowercaseNames=True, inferLong=True)
df.write.csv("newcars.csv", header=True)
```
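As with the Spark Shell example below, the package must be on your session's classpath; one way (mirroring the shell invocation) is to pass the spark-packages coordinate when launching PySpark:

```bash
pyspark --packages saurfang:spark-sas7bdat:3.0.0-s_2.12
```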
R API
```r
df <- read.df("cars.sas7bdat", source = "com.github.saurfang.sas.spark", forceLowercaseNames = TRUE, inferLong = TRUE)
write.df(df, path = "newcars.csv", source = "csv", header = TRUE)
```
SQL API
SAS data can be queried in pure SQL by registering the data as a (temporary) table.
```sql
CREATE TEMPORARY VIEW cars
USING com.github.saurfang.sas.spark
OPTIONS (path="cars.sas7bdat")
```
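Once registered, the view can be queried like any other table, for example:

```sql
SELECT * FROM cars LIMIT 10
```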
SAS Export Runner
We included a simple SasExport Spark program that converts .sas7bdat to .csv or .parquet files:
sbt "run input.sas7bdat output.csv"
sbt "run input.sas7bdat output.parquet"
To achieve more parallelism, use the spark-submit script to run it on a Spark cluster. If you don't have a Spark cluster, you can always run it in local mode and take advantage of multiple cores.
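As a rough sketch of such a cluster run (the assembly jar and fully qualified main class names below are assumptions, not taken from the project docs; check the build output for the exact names):

```bash
# Sketch only: jar and class names are assumptions; substitute the
# actual artifacts produced by the project's build.
spark-submit \
  --master yarn \
  --class com.github.saurfang.sas.SasExport \
  spark-sas7bdat-assembly.jar \
  input.sas7bdat output.parquet
```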
Spark Shell
```bash
spark-shell --master local[4] --packages saurfang:spark-sas7bdat:3.0.0-s_2.12
```
Caveats
spark-csv writes out `null` as "null" in CSV text output. This means that if you read it back into a string column, you might actually get the string "null" instead of `null`. The safest option is to export in Parquet format, where nulls are properly recorded. See https://github.com/databricks/spark-csv/pull/147 for an alternative solution.
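For example, writing Parquet instead of CSV sidesteps the ambiguity:

```scala
// Parquet records nulls exactly, so a round-trip cannot confuse
// the string "null" with a true null value.
df.write.parquet("newcars.parquet")
```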
Related Work
Acknowledgements
This project would not be possible without parso's continued improvements and the generous contributions of @mulya, @thesuperzapper, and many others. We are honored to be a recipient of the 2020 WiseWithData ELEVATE Awards and appreciate their generous donations.