Understanding joins with Apache Spark
Workshop date & duration: June 20, 2020, 9:30 – 14:00, 30 min break included
Trainers: Valentina Crisan, Maria Catana
Location: Online
Price: 150 RON (including VAT)
Number of places: 10
Languages: Scala & SQL
DESCRIPTION:
For a (mainly) in-memory processing platform like Spark, getting the best performance is most of the time about:
- Optimizing the amount of data needed to perform a certain action
- Having a partitioning strategy that distributes the data optimally across the Spark cluster executors (this often depends on how the data is laid out in the underlying storage for the initial distribution, but also on how the data is repartitioned during the join itself, given that before the actual join operation runs, the partitions are first sorted)
- And, in case your action is a join, choosing the right strategy for your joins
This workshop will mainly focus on two of the above-mentioned steps: partitioning and join strategy, making these aspects clearer through exercises and hands-on sessions (a small taste of the kind of code we will work with is sketched below).
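As a minimal Scala sketch of the two topics, the snippet below repartitions both sides of a join on the join key (partitioning strategy) and then hints a broadcast join for a small table (join strategy). The dataset names, column names, and local session are purely illustrative assumptions, not part of the workshop material; explain() lets you see which physical join strategy Spark actually chose.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.broadcast

object JoinStrategiesSketch {
  def main(args: Array[String]): Unit = {
    // Hypothetical local session, just for illustration.
    val spark = SparkSession.builder()
      .appName("join-strategies-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical data: a larger fact-like table and a small dimension-like table.
    val orders = Seq((1, "A", 100.0), (2, "B", 250.0), (3, "A", 75.0))
      .toDF("order_id", "customer_id", "amount")
    val customers = Seq(("A", "Alice"), ("B", "Bob"))
      .toDF("customer_id", "name")

    // Partitioning strategy: co-partition both sides on the join key so that
    // matching rows land in the same partitions before the join runs.
    val ordersByKey    = orders.repartition($"customer_id")
    val customersByKey = customers.repartition($"customer_id")
    val shuffledJoin   = ordersByKey.join(customersByKey, Seq("customer_id"))

    // Join strategy: when one side is small enough to fit in executor memory,
    // hint a broadcast join and avoid shuffling the large side at all.
    val broadcastJoin = orders.join(broadcast(customers), Seq("customer_id"))

    // explain() prints the physical plan, including the join strategy Spark picked.
    shuffledJoin.explain()
    broadcastJoin.explain()

    spark.stop()
  }
}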
You can check out the agenda and register here.