Spark - Reading from Excel files, converting to parquet

Post date: Dec 19, 2019 1:37:56 PM

I am running my examples using a Jupyter Docker container:

docker run -d -p 8001:8888 --name notebook -e SPARK_OPTS="--driver-memory 4g --executor-memory 4g" jupyter/all-spark-notebook
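Once the container is up, the notebook's login token is printed in the container logs (docker logs notebook), and the UI is then reachable at http://localhost:8001 thanks to the port mapping above.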


Using the Apache Toree kernel, my first cell downloads the dependencies:

%ShowTypes on 

%AddDeps com.crealytics spark-excel_2.11 0.12.2 --transitive 
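Note that the _2.11 suffix of the artifact has to match the Scala version of the kernel; the all-spark-notebook image used here ships Spark built against Scala 2.11.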



My second cell configures the SparkSession:

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .master("local[*]")
  .config("spark.executor.memory", "4g")
  .config("spark.driver.memory", "8g")
  .config("spark.memory.offHeap.enabled", true)
  .config("spark.memory.offHeap.size", "8g")
  .appName("Peter Henell Spark Snowflake Example")
  .getOrCreate()
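As a quick sanity check (a minimal sketch, not part of the original cells), we can confirm the session is alive before loading any data:

println(spark.version)  // the Spark version bundled with the image
spark.range(5).count()  // trivial job to confirm the local executors respond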

We also create helper functions to load data from Excel files.

(Documentation for the Excel reader can be found at https://github.com/crealytics/spark-excel.)

import org.apache.spark.sql.DataFrame

def readExcel(file: String, sheet: String): DataFrame =
  spark.read
    .format("com.crealytics.spark.excel")
    .option("dataAddress", sheet) // Optional, default: "A1"
    .option("useHeader", "true") // Required
    //.option("treatEmptyValuesAsNulls", "false") // Optional, default: true
    .option("inferSchema", "true") // Optional, default: false
    //.option("addColorColumns", "true") // Optional, default: false
    //.option("timestampFormat", "MM-dd-yyyy HH:mm:ss") // Optional, default: yyyy-mm-dd hh:mm:ss[.fffffffff]
    //.option("maxRowsInMemory", 20) // Optional, default None. If set, uses a streaming reader which can help with big files
    .option("excerptSize", 10) // Optional, default: 10. If set and schema is inferred, number of rows to infer schema from
    //.option("workbookPassword", "pass") // Optional, default None. Requires unlimited strength JCE for older JVMs
    //.schema(myCustomSchema) // Optional, default: either inferred schema, or all columns as Strings
    .load(file)
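As a usage sketch (assuming the resultat.xlsx file used later in the post), we can inspect what the reader inferred before writing anything:

val preview = readExcel("resultat.xlsx", "'TestResults'!A1")
preview.printSchema()   // column names and inferred types
preview.show(5, false)  // first five rows, untruncated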

def excelAsTable(file: String, sheet: String, tableName: String): Unit =
  readExcel(file, sheet).createOrReplaceTempView(tableName)

def parquetAsTable(file: String, tableName: String): Unit =
  spark.read.format("parquet").load(file).createOrReplaceTempView(tableName)
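With these helpers an Excel sheet can also be queried directly through Spark SQL. A hypothetical example, using raw_results as the view name:

excelAsTable("resultat.xlsx", "'TestResults'!A1", "raw_results")
spark.sql("SELECT * FROM raw_results LIMIT 5").show(false)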

To be able to store the results of any analysis of the Excel data, we need to rename columns whose names are invalid in Parquet (Parquet rejects names containing spaces and certain other characters).
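Below we rename the two offending columns by hand; as a more generic sketch (my own addition, under the assumption that an underscore is an acceptable substitute), every character Parquet rejects could be replaced automatically:

import org.apache.spark.sql.DataFrame

def sanitizeColumns(df: DataFrame): DataFrame =
  df.columns.foldLeft(df) { (clean, name) =>
    clean.withColumnRenamed(name, name.replaceAll("[ ,;{}()\\n\\t=]", "_"))
  }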

From our Excel file resultat.xlsx we take the data from the sheet "TestResults", starting at cell A1. The result is stored in testresult.parquet:

import org.apache.spark.sql.SaveMode

readExcel("resultat.xlsx", "'TestResults'!A1")
  .withColumnRenamed("Duration Seconds", "DurationSeconds")
  .withColumnRenamed("Cost In EURO", "CostInEUR")
  .write.mode(SaveMode.Ignore).parquet("testresult.parquet")
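Note that SaveMode.Ignore silently skips the write if testresult.parquet already exists; use SaveMode.Overwrite to regenerate the file on every run.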

We can now consume the Parquet file and register it as a temporary view for future queries:

parquetAsTable("testresult.parquet", "testresults")

Finally, we run a first simple check of the data, computing the average duration and the number of tests conducted:

spark.sql("""  SELECT avg(DurationSeconds), count(1) as testsConducted   FROM testresults  order by 1 asc""").show(false)