Understanding joins with Apache Spark

Workshop date & duration: June 20, 2020, 9:30 – 14:00, 30 min break included
Trainers: Valentina Crisan, Maria Catana
Location: Online
Price: 150 RON (including VAT)
Number of places: 10
Languages: Scala & SQL


For a (mainly) in-memory processing platform like Spark, getting the best performance is, most of the time, about:

  1. Minimizing the amount of data needed to perform a certain action
  2. Having a partitioning strategy that distributes the data optimally across the Spark cluster executors (the initial distribution is often tied to how the data is laid out in the underlying storage, but it also depends on how the data is partitioned during the join itself, given that the partitions are first sorted before the actual join operation runs)
  3. If your action is a join, choosing the right strategy for it

This workshop will focus mainly on two of the points above, partitioning and join strategy, and will make them clearer through exercises and hands-on sessions.
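
As a rough illustration of the three points listed earlier, here is a minimal Scala sketch. It is not taken from the workshop material: the `orders` and `customers` tables, their columns, and the `JoinSketch` object are hypothetical, and the code assumes Spark SQL is on the classpath and runs in local mode. It trims the data before joining, repartitions both sides on the join key, and then contrasts a join left to the planner (with automatic broadcasting switched off) against one that carries an explicit broadcast hint.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

object JoinSketch {
  // Builds two variants of the same join from hypothetical toy data:
  // (join on repartitioned inputs, join with a broadcast hint).
  def build(spark: SparkSession): (DataFrame, DataFrame) = {
    import spark.implicits._

    // Hypothetical sample tables, for illustration only.
    val orders = Seq((1, 101, 250.0), (2, 102, 40.0), (3, 101, 900.0), (4, 103, 120.0))
      .toDF("order_id", "customer_id", "amount")
    val customers = Seq((101, "RO"), (102, "DE"), (103, "RO"))
      .toDF("customer_id", "country")

    // 1. Reduce the data early: keep only the rows and columns the join needs.
    val bigOrders = orders.filter($"amount" > 100).select("order_id", "customer_id")

    // 2. Turn off automatic broadcasting (otherwise tiny tables like these would
    //    be broadcast anyway), then repartition both sides on the join key.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")
    val repartitionedJoin = bigOrders.repartition($"customer_id")
      .join(customers.repartition($"customer_id"), Seq("customer_id"))

    // 3. Pick a strategy explicitly: hint that the small side should be broadcast.
    val broadcastJoin = bigOrders.join(broadcast(customers), Seq("customer_id"))

    (repartitionedJoin, broadcastJoin)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("join-sketch").master("local[*]").getOrCreate()
    val (repartitionedJoin, broadcastJoin) = build(spark)
    // Print the physical plans to compare the strategy chosen for each variant.
    repartitionedJoin.explain()
    broadcastJoin.explain()
    spark.stop()
  }
}
```

Both variants return the same rows; what differs is the physical plan printed by `explain()`. With automatic broadcasting disabled, the first join should typically be planned as a sort-merge join, while the hinted one should use a broadcast hash join. The exact plan text depends on the Spark version and settings.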

You can check out the agenda and register here.