ETL with Apache Spark
Workshop date & duration: March 28th, 2020, 9:30 – 14:00, 30 min break included
Trainer: Valentina Crisan, Maria Catana
Venue: eSolutions Academy, Budişteanu Office Building, strada General Constantin Budişteanu No. 28C, 1st floor, Sector 1, Bucharest. Note: this workshop will take place online.
Price: 150 RON (including VAT)
Number of places: 10 (no more places left)
Languages: Scala & SQL
One of the many uses of Apache Spark is transforming data between different formats and sources, for both batch and streaming workloads. This workshop will be mainly hands-on, and we will focus on exactly that: how to read, write, transform, and join data in different formats, how to manage schemas, and how best to handle such data in Apache Spark. So if you know a bit about Spark but have not had the chance to explore its ETL capabilities, or even if you don't know much yet but would like to find out, this workshop might be of interest. We will do most of the exercises in the Spark shell, so you need an SSH client installed on your computer. In case we decide to use Apache Zeppelin as well, we recommend having Google Chrome installed.

Topics covered:
- Reading/writing data (with/without a user-defined schema, schema inference) – both batch and streaming data
- How we handle schema: nested structures
- Structure of Parquet files, data organization
- Storage optimization: Parquet vs CSV, Compression
- Predicate pushdown, column pruning
- Nested structures in Parquet schema
- Schema handling
- What happens when the data does not follow the declared schema (dealing with corrupt files or corrupt records)
- Working with null values in Spark
- Column operations in Spark
- Split, explode (lateral view)
- Join between a batch and a stream
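To give a flavour of the kind of exercises above: reading data with a user-defined schema (instead of schema inference) also lets us decide what happens to rows that do not match it. A minimal spark-shell sketch, assuming a hypothetical CSV file `people.csv` with name and age columns (`spark` is provided by the shell):

```scala
import org.apache.spark.sql.types._

// A user-defined schema avoids the extra pass over the data that
// inferSchema needs, and lets us capture non-conforming rows in a
// dedicated _corrupt_record column.
val schema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  StructField("age", IntegerType, nullable = true),
  StructField("_corrupt_record", StringType, nullable = true)
))

val people = spark.read
  .option("header", "true")
  .option("mode", "PERMISSIVE")                          // keep bad rows, null out bad fields
  .option("columnNameOfCorruptRecord", "_corrupt_record")
  .schema(schema)
  .csv("people.csv")                                     // hypothetical input path
```

The other two parse modes behave differently: `DROPMALFORMED` silently drops non-conforming rows, while `FAILFAST` throws on the first one.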
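For the storage-optimization topics, a sketch of writing Parquet with compression and then reading it back with a filter that benefits from column pruning and predicate pushdown (spark-shell; the `$` interpolator is available there, and the path is illustrative):

```scala
// Parquet is columnar and typed, so Spark can read only the columns
// it needs and skip whole row groups using their min/max statistics.
val people = Seq(("alice", 34), ("bob", 15)).toDF("name", "age")

people.write
  .mode("overwrite")
  .option("compression", "snappy")   // gzip and other codecs are also supported
  .parquet("people.parquet")

val adults = spark.read.parquet("people.parquet")
  .select("name", "age")             // column pruning: only referenced columns are read
  .where($"age" >= 18)               // predicate pushdown: filter applied at the file scan

adults.explain(true)                 // look for PushedFilters in the physical plan
```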
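Split and explode (the DataFrame equivalent of SQL's `LATERAL VIEW explode`) are the usual way to flatten multi-valued columns. A sketch with a made-up `tags` column holding comma-separated values:

```scala
import org.apache.spark.sql.functions._

val tagged = Seq(("alice", "spark,sql"), ("bob", "etl"))
  .toDF("name", "tags")

// split turns the string into an array; explode produces one row per element.
val exploded = tagged
  .withColumn("tag", explode(split($"tags", ",")))
  .drop("tags")

exploded.show()
// For nested structs, dot notation reaches inside: df.select($"address.city")
```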
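Null handling and column operations will come up throughout the exercises; the `na` functions and `withColumn` cover the common cases. A short sketch (column names are illustrative):

```scala
import org.apache.spark.sql.functions._

// None becomes null in the resulting DataFrame.
val df = Seq(("alice", Some(34)), ("bob", None))
  .toDF("name", "age")

val cleaned = df
  .na.fill(Map("age" -> 0))              // replace nulls in age with 0
  .withColumn("adult", $"age" >= 18)     // derived boolean column
  .withColumn("name_up", upper($"name")) // simple column transformation

val nonNull = df.na.drop(Seq("age"))     // alternative: drop rows where age is null
```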
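Finally, the batch/stream join: Structured Streaming supports joining a streaming DataFrame with a static one. A sketch under assumed names and paths (streaming file sources require an explicit schema):

```scala
import org.apache.spark.sql.types._

val users = spark.read.parquet("users.parquet")   // static (batch) side

val eventSchema = StructType(Seq(
  StructField("userId", StringType),
  StructField("action", StringType)
))

val events = spark.readStream
  .schema(eventSchema)                            // explicit schema is mandatory here
  .json("events/")                                // directory watched for new JSON files

// Stream-static joins enrich each streaming micro-batch with the batch data.
val enriched = events.join(users, Seq("userId"), "left_outer")

enriched.writeStream
  .format("console")
  .outputMode("append")
  .start()
```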
The price for the workshop is 150 RON (including VAT). To register for this workshop, please complete the registration form and the payment step.
1. Complete registration form: