Mastering Apache Spark

A professional developer must strike a balance between deep expertise in a narrow field and staying up to date with the changing challenges of our industry. Big Data is one of those areas that a professional software engineer simply cannot ignore. Environments where infrastructure failure is part of the everyday routine, collected data that cannot fit on a single machine, or data that needs to be processed quickly (in seconds, not hours) – these are just a few of the challenges that Big Data projects face. Apache Spark is a fast and general engine for large-scale data processing, designed from the start to cope with exactly these issues. Whether or not you deal with Big Data right now, with the exponential growth of our industry (both in complexity and in user base) and emerging fields such as data science, chances are that in-depth knowledge of Apache Spark will prove essential.

Concepts and skills

Among the many concepts that participants will become familiar with are:

Big Data, MapReduce, partitioning, HA & resilience in a cluster environment, HDFS, Apache Spark architecture (logical & physical), driver program, master node, worker node, executor, RDD (Resilient Distributed Dataset), transformations & actions, partitions, tasks, pipelining, shuffling, DAG (Directed Acyclic Graph), data locality, the Spark execution model, the Partitioner, caching, checkpointing, broadcasts, accumulators, the Spark memory model, DataFrames, Datasets, working with semi-structured & structured data (JSON, Parquet, Avro), the Catalyst optimizer, predicate push-down, Project Tungsten, streaming, DStreams, Structured Streaming, back pressure & elastic scaling.

Course programme:

  1. Introduction
    • What is Apache Spark?
    • What came before?
    • Challenges and issues with MapReduce
    • The Big Picture
  2. Spark Core
    • RDD
    • Transformations vs Actions (sketched after the programme)
    • Partitions & Tasks
    • RDDs for key-value data
    • Pipelining & Shuffling
    • DAG & Stages
    • Resilience
    • Performance issues and how to handle them
      1. Common pitfalls (groupBy)
      2. Deeper knowledge of how the cluster works
      3. Classpath & Serialization issues
      4. Spark configuration
      5. Spark UI & Spark history server
    • Caching & Checkpointing
    • Broadcasts & Accumulators (sketched after the programme)
    • Joins
  3. Spark Core – internals
    • The five properties that define an RDD
    • Shuffling algorithms
    • Spark Memory Model
  4. Spark SQL
    • Advantages of semi-structured data
    • SQL
      • Introduction
      • Joins
      • Hive
    • DataFrames (sketched after the programme)
      • Introduction
      • Joins
      • JSON
      • Parquet
      • Avro
    • Datasets
      • Problems with DataFrames
      • Datasets to the rescue!
    • What makes Spark SQL run faster
      • Structure vs Expression
      • Catalyst Optimizer
      • Predicate Push Down
      • Project Tungsten
  5. Spark SQL – internals
    • How DataFrames & Datasets are implemented
    • How the Catalyst Optimizer works
  6. Spark Streaming
    • Why Streaming?
    • General overview
    • Receiver – a long-running task
    • Resilience
    • Transformations
      • Stateless
      • Stateful (window & slide duration, .reduceByWindow, .updateStateByKey vs .mapWithState – windowing sketched after the programme)
    • Ultimate Apache Kafka Example
    • Structured Streaming (sketched after the programme)
  7. Other topics
    • Back pressure & Elastic Scaling
    • Running Spark on Mesos
    • Q&A
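
Code sketches

The sketches below give a first taste of the APIs covered above. They are minimal, illustrative examples in Scala – Spark's native language – not the course material itself; every file path, host name and class name in them is made up for illustration.

The first sketch shows the Spark Core basics from section 2: transformations such as filter are lazy and only describe a computation, actions such as count and take actually run it, and cache() keeps a reused RDD in memory.

    import org.apache.spark.{SparkConf, SparkContext}

    object RddBasics {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("rdd-basics").setMaster("local[*]")
        val sc = new SparkContext(conf)

        // Transformations are lazy: no job runs on the cluster yet.
        val lines  = sc.textFile("data/events.log")            // made-up input path
        val errors = lines.filter(_.contains("ERROR")).cache() // cached: reused twice below

        // Actions trigger the actual computation (one job each).
        println(s"error lines: ${errors.count()}")
        errors.take(5).foreach(println) // the second job reads from the cache

        sc.stop()
      }
    }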
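
Broadcasts ship a read-only value (for example a lookup table) to every executor once, instead of once per task, while accumulators let tasks report counters back to the driver. A minimal sketch with a made-up lookup table:

    import org.apache.spark.{SparkConf, SparkContext}

    object SharedVariables {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("shared").setMaster("local[*]"))

        // Broadcast: one copy of the lookup table per executor, not per task.
        val countries = sc.broadcast(Map("PL" -> "Poland", "DE" -> "Germany"))

        // Accumulator: tasks add to it, the driver reads it after an action.
        // (Updates made inside transformations can be re-applied on task
        // retries – one of the pitfalls discussed in the course.)
        val unknown = sc.longAccumulator("unknown country codes")

        val resolved = sc.parallelize(Seq("PL", "DE", "XX", "PL")).map { code =>
          countries.value.getOrElse(code, { unknown.add(1); "unknown" })
        }

        resolved.collect().foreach(println) // the action that runs the job
        println(s"unknown codes seen: ${unknown.value}")

        sc.stop()
      }
    }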
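
Section 4 moves from RDDs to the structured APIs. A DataFrame is a distributed table with a schema (here inferred from semi-structured JSON), a Dataset adds compile-time types on top of it, and explain() reveals the plan produced by the Catalyst optimizer. The paths and the User case class are made up:

    import org.apache.spark.sql.SparkSession

    object SqlBasics {
      case class User(name: String, age: Long) // made-up schema for the example

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("sql-basics").master("local[*]").getOrCreate()
        import spark.implicits._

        // DataFrame: untyped rows with a schema inferred from the JSON input.
        val df = spark.read.json("data/users.json") // made-up input path
        df.printSchema()

        // Dataset: the same data with compile-time types.
        val adults = df.as[User].filter(_.age >= 18)

        // Catalyst's optimized plan; on columnar sources such as Parquet
        // it also shows predicates being pushed down to the data source.
        adults.explain(true)
        adults.write.mode("overwrite").parquet("data/adults.parquet")

        spark.stop()
      }
    }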
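
The windowing sketch for section 6 counts words over a sliding window of DStream micro-batches: the window duration says how much history each result covers, the slide duration how often it is recomputed. Host, port and the checkpoint directory are made up:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object DStreamWindow {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("dstream-window").setMaster("local[2]")
        val ssc = new StreamingContext(conf, Seconds(5)) // 5-second micro-batches
        ssc.checkpoint("checkpoint") // needed by stateful ops such as updateStateByKey

        val pairs = ssc.socketTextStream("localhost", 9999)
          .flatMap(_.split("\\s+"))
          .map((_, 1))

        // Counts over the last 30 seconds, recomputed every 10 seconds.
        val windowed = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10))
        windowed.print()

        ssc.start()
        ssc.awaitTermination()
      }
    }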
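
Structured Streaming, also from section 6, applies the same DataFrame/Dataset API to an unbounded stream; the engine maintains the aggregation state across micro-batches. This sketch counts words arriving on a local socket (feed it with `nc -lk 9999`):

    import org.apache.spark.sql.SparkSession

    object StreamingWordCount {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("structured-streaming").master("local[*]").getOrCreate()
        import spark.implicits._

        // An unbounded DataFrame over a socket source.
        val lines = spark.readStream
          .format("socket")
          .option("host", "localhost")
          .option("port", 9999)
          .load()

        // Batch-style code; the running counts are updated every micro-batch.
        val counts = lines.as[String].flatMap(_.split("\\s+")).groupBy("value").count()

        val query = counts.writeStream.outputMode("complete").format("console").start()
        query.awaitTermination()
      }
    }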

Ask about this course