Spark Gotchas. A subjective compilation of Apache Spark tips and tricks

Spark Gotchas

Table of Contents

  • Introduction
    • Version compatibility
  • Understanding Spark Architecture
    • Lack of Global Shared State and Spark Closures
    • Distributed Processing and its Scope
    • Feature Parity and Architecture of the Guest Languages
  • Spark Application Deployment
    • Dynamic allocation
      • Yet Another Dynamic Resource Allocation YA(D)RN
  • Spark Application Building
  • RDD Actions and Transformations by Example
    • Be Smart About groupByKey
      • What Exactly Is Wrong With groupByKey
      • How Not to Optimize
      • Not All groupBy Methods Are Equal
      • When to Use groupByKey and When to Avoid It
      • Hidden groupByKey
    • Immutability of a Data Structure Does Not Imply Immutability of the Data
  • Spark SQL and Dataset API
    • Non-lazy Evaluation in DataFrameReader.load
      • Explicit Schema
      • Sampling for Schema Inference
      • PySpark Specific Considerations
    • DataFrame Schema Nullability
      • Nullability by Reflection
      • Marking StructFields Explicitly as Nullable
      • Impact of Nullable
    • Reading Data Using JDBC Source
      • Parallelizing Reads
      • MySQL: Dates, Timestamps and Lies
    • Window Functions
      • Understanding Window Functions
      • Window Definitions
      • Example Usage
      • Requirements and Performance Considerations
  • Data Preparation
    • DataFrame Metadata
      • Metadata in ML pipelines
      • Setting custom column metadata
  • Iterative Algorithms
    • Iterative Applications and Lineage
      • Checkpointing
      • "Flat Transformations"
      • Truncating Lineage in Dataset API
    • Controlling Number of Partitions in Iterative Applications
  • PySpark Applications
  • Serialization
    • JVM
      • Java Serialization
      • Kryo Serialization
    • PySpark Serialization
      • Python Serializers Characteristics
      • PySpark Serialization Strategies
      • Configuring PySpark Serialization
      • PySpark and Kryo
    • SerDe During JVM - Guest Communication

License

This work, excluding code examples, is licensed under the Creative Commons Attribution-ShareAlike 4.0 International license.

Accompanying code and code snippets are licensed under the MIT license.