Spark Gotchas
A subjective compilation of Apache Spark tips and tricks.
Table of Contents
- Introduction
- Version compatibility
- Understanding Spark Architecture
- Lack of Global Shared State and Spark Closures
- Distributed Processing and its Scope
- Feature Parity and Architecture of the Guest Languages
- Spark Application Deployment
- Dynamic allocation
- Yet Another Dynamic Resource Allocation YA(D)RN
- Spark Application Building
- RDD actions and Transformations by Example
- Be Smart About groupByKey
- What Exactly Is Wrong With groupByKey
- How Not to Optimize
- Not All groupBy Methods Are Equal
- When to Use groupByKey and When to Avoid It
- Hidden groupByKey
- Immutability of a Data Structure Does Not Imply Immutability of the Data
- Spark SQL and Dataset API
- Non-lazy Evaluation in DataFrameReader.load
- Explicit Schema
- Sampling for Schema Inference
- PySpark Specific Considerations
- DataFrame Schema Nullability
- Nullability by Reflection
- Marking StructFields Explicitly as Nullable
- Impact of Nullable
- Reading Data Using JDBC Source
- Parallelizing Reads
- MySQL: Dates, Timestamps and Lies
- Window Functions
- Understanding Window Functions
- Window Definitions
- Example Usage
- Requirements and Performance Considerations
- Data Preparation
- DataFrame Metadata
- Metadata in ML pipelines
- Setting custom column metadata
- Iterative Algorithms
- Iterative Applications and Lineage
- Checkpointing
- "Flat Transformations"
- Truncating Lineage in Dataset API
- Controlling Number of Partitions in Iterative Applications
- PySpark Applications
- Serialization
- JVM
- Java Serialization
- Kryo Serialization
- PySpark Serialization
- Python Serializers Characteristics
- PySpark Serialization Strategies
- Configuring PySpark Serialization
- PySpark and Kryo
- SerDe During JVM - Guest Communication
License
This work, excluding code examples, is licensed under the Creative Commons Attribution-ShareAlike 4.0 International license.
Accompanying code and code snippets are licensed under the MIT license.