Understanding joins with Apache Spark

Workshop date & duration: June 20, 2020, 9:30 – 14:00, 30 min break included
Trainers: Valentina Crisan, Maria Catana
Location: Online
Price: 150 RON (including VAT)
Number of places: 10
Languages: Scala & SQL


For a (mainly) in-memory processing platform like Spark, getting the best performance is most of the time about:

  1. Optimizing the amount of data needed to perform a certain action
  2. Having a partitioning strategy that distributes the data optimally across the Spark cluster executors (the initial distribution is often tied to how the data is laid out in the underlying storage, but it also depends on how the data is partitioned during the join itself, given that in a sort merge join, for example, the partitions are first sorted before the actual join operation runs)
  3. And, in case your action is a join, choosing the right strategy for your joins


This workshop will mainly focus on two of the above-mentioned aspects, partitioning and join strategy, making them clearer through exercises and hands-on sessions. Understanding what happens in the background in the different join cases and how to partition the data optimally across the Spark executors is key to working efficiently with joins. This workshop is addressed to those who are already a bit familiar with the capabilities of the Spark framework: since these are considered known, we will not focus on them. The workshop will briefly touch upon some architecture characteristics and then focus on the partitioning and joining aspects.


Spark architecture overview – theory

Understand the stages of a Spark job

Using the Spark UI to monitor Spark jobs

Partitioning of Data

  • Hash partitioning vs. range partitioning
  • Skewed data and shuffle blocks
  • Repartitioning of data
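
As a rough illustration of the first bullet, the sketch below contrasts the two partitioning schemes in plain Scala, without Spark on the classpath. The object and function names (PartitioningSketch, hashPartition, rangePartition) and the sample keys and bounds are invented for this example. Spark's own partitioners use different hash functions and sample the data to pick range bounds, so this only shows the idea, not Spark's implementation.

```scala
object PartitioningSketch {
  // Hash partitioning: the partition index is derived from the key's hash code.
  // Math.floorMod keeps the result non-negative even for negative hash codes.
  def hashPartition(key: Any, numPartitions: Int): Int =
    Math.floorMod(key.hashCode, numPartitions)

  // Range partitioning: sorted upper bounds split the key space into
  // contiguous ranges; a key goes to the first range whose bound is >= key,
  // and keys above every bound go to the last partition.
  def rangePartition(key: Int, upperBounds: Seq[Int]): Int = {
    val idx = upperBounds.indexWhere(key <= _)
    if (idx >= 0) idx else upperBounds.length
  }

  def main(args: Array[String]): Unit = {
    val keys = Seq(1, 5, 12, 18, 25, 31, 47, 50) // hypothetical sample keys
    val byHash = keys.groupBy(hashPartition(_, 3))
    val byRange = keys.groupBy(rangePartition(_, Seq(15, 30)))
    println(s"hash:  ${byHash.toSeq.sortBy(_._1)}")
    println(s"range: ${byRange.toSeq.sortBy(_._1)}")
  }
}
```

In the DataFrame API the corresponding operations are repartition (hash-based when given column expressions) and repartitionByRange. Hash partitioning scatters neighbouring keys, while range partitioning keeps them together, which matters when the data has to be sorted afterwards.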

Types of Joins and possible use cases: 

  • Sort Merge Joins
  • Shuffle Hash Joins
  • Broadcast Joins 
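
To give a feel for how these strategies differ, here is a toy, single-machine sketch in plain Scala. It is not Spark code: the names JoinSketch, hashJoin and sortMergeJoin are made up for this example, and the shuffle step that Spark performs across executors is left out. The hash variant mirrors what happens once the small side is available on every executor (broadcast join) or inside each shuffled partition (shuffle hash join). The merge variant mirrors the per-partition work of a sort merge join.

```scala
object JoinSketch {
  type Row = (Int, String) // (join key, payload)

  // Hash-style join: build a lookup table from the small side once,
  // then stream the large side and probe the table. Neither side is sorted.
  def hashJoin(large: Seq[Row], small: Seq[Row]): Seq[(Int, String, String)] = {
    val lookup: Map[Int, Seq[String]] =
      small.groupBy(_._1).map { case (k, rows) => k -> rows.map(_._2) }
    for {
      (k, l) <- large
      r <- lookup.getOrElse(k, Seq.empty)
    } yield (k, l, r)
  }

  // Sort-merge-style join: sort both sides by key, then advance two cursors.
  def sortMergeJoin(left: Seq[Row], right: Seq[Row]): Seq[(Int, String, String)] = {
    val l = left.sortBy(_._1).toVector
    val r = right.sortBy(_._1).toVector
    val out = Vector.newBuilder[(Int, String, String)]
    var i = 0
    var j = 0
    while (i < l.length && j < r.length) {
      if (l(i)._1 < r(j)._1) i += 1
      else if (l(i)._1 > r(j)._1) j += 1
      else {
        val key = l(i)._1
        // Find the end of the run of equal keys on the right side.
        val runEnd = r.indexWhere(_._1 != key, j) match {
          case -1 => r.length
          case n  => n
        }
        // Pair every left row carrying this key with the whole right run.
        while (i < l.length && l(i)._1 == key) {
          for (k <- j until runEnd) out += ((key, l(i)._2, r(k)._2))
          i += 1
        }
        j = runEnd
      }
    }
    out.result()
  }
}
```

Both functions return the same inner-join rows. The difference is the cost profile: the hash variant needs the whole small side in memory, while the merge variant pays for sorting but only keeps a run of equal keys at hand. In Spark itself, the broadcast function in org.apache.spark.sql.functions and the spark.sql.autoBroadcastJoinThreshold setting influence whether a broadcast join is chosen.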

Issues during joins and possible solutions:

  • Data skewness
  • Executor memory
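
One commonly used mitigation for skewed join keys is key salting. The announcement does not say which solutions the workshop covers, so the plain-Scala sketch below is only an assumption about the kind of technique meant. The names SaltingSketch, saltedKey and partitionOf, the "hot" key and the bucket count are all hypothetical.

```scala
object SaltingSketch {
  // Append a salt in [0, saltBuckets) so that rows sharing one hot key
  // are spread over several distinct keys.
  def saltedKey(key: String, rowIndex: Int, saltBuckets: Int): String =
    s"${key}_${rowIndex % saltBuckets}"

  // Toy hash partitioner, used here only to show where rows would land.
  def partitionOf(key: String, numPartitions: Int): Int =
    Math.floorMod(key.hashCode, numPartitions)

  def main(args: Array[String]): Unit = {
    val rows = Seq.fill(8)("hot") // eight rows that all share the same key
    val plain = rows.map(partitionOf(_, 4)).distinct
    val salted = rows.zipWithIndex
      .map { case (k, i) => partitionOf(saltedKey(k, i, 4), 4) }
      .distinct
    println(s"partitions used without salt: ${plain.size}")
    println(s"partitions used with salt:    ${salted.size}")
  }
}
```

For the join to stay correct, the other side has to be expanded so that each of its rows appears once per salt value, which trades extra data volume for a more even load across executors.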


The price for the workshop is 150 RON (including VAT). To register for this workshop, you need to complete the registration form and the payment step.

1. Complete the registration form:

2. Payment: https://mpy.ro/6dgzhpev