Understanding Big Data Architecture

Open Course: Understanding Big Data Architecture E2E (Use case including Cassandra + Kafka + Spark + Zeppelin)  
Timeline & Duration: July 27th – August 14th, six 4-hour online sessions over 3 weeks (2 sessions/week, Monday + Thursday). An online setup will be available for exercises/hands-on sessions for the duration of the course.
Main trainer: Valentina Crisan
Location: Online (Zoom)
Price: 250 EUR 
Pre-requisites: knowledge of distributed systems and the Hadoop ecosystem (HDFS, MapReduce), plus some basic SQL.

There are a few concepts that solutions architects should be aware of when evaluating or building a big data solution: what data partitioning means, how to model your data to get the best performance from a distributed system, which data format to choose, and what the best storage and analytics options are. Solutions like HDFS, Hive, Cassandra, HBase, Spark, Kafka, and YARN should be known – not necessarily because you will work specifically with them, but because understanding their concepts will help you understand other similar solutions in the big data space. This course is designed to make sure participants understand the usage and applicability of big data technologies like Spark, Cassandra, and Kafka, and which aspects to consider when starting to build a big data architecture.
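To make the partitioning concept concrete, here is a minimal Python sketch of how distributed stores such as Kafka and Cassandra map a record's partition key to a partition via hashing; the function name and the use of MD5 are illustrative assumptions, not the exact algorithm of either system:

```python
import hashlib

def partition_for(key: str, num_partitions: int) -> int:
    """Map a record key to a partition index by hashing it,
    the way distributed systems route data by partition key.
    (Illustrative only: real systems use their own hash functions.)"""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_partitions

# Records sharing a key always land on the same partition, which is
# why the choice of partition key drives data locality and read speed.
assert partition_for("user-42", 8) == partition_for("user-42", 8)
```

This is why data modeling matters so much in these systems: a well-chosen partition key keeps related data together, while a poorly chosen one scatters it or overloads a single partition.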

In this course participants will build an end-to-end big data use case: ingesting data through Kafka (and processing streaming data with ksqlDB), processing data with Spark and Spark Structured Streaming, storing data in Cassandra, and then applying Spark ML/NLP and visualizing the results. The course is structured as six 4-hour sessions – covering theory plus some hands-on exercises – but some exercises and reading/studying will be needed between sessions (materials will be provided online). The online setup will be available throughout the course period for completing the exercises and hands-on work.
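The ingest → process → store flow above can be sketched conceptually in plain Python; each function here is a stand-in (the names and toy data are assumptions for illustration), where in the course Kafka plays the ingest role, Spark Structured Streaming the processing role, and Cassandra the storage role:

```python
# Conceptual sketch of the end-to-end flow: ingest -> process -> store.
# Plain-Python stand-ins only; the course uses Kafka, Spark and Cassandra.

def ingest(events):
    # Stand-in for a Kafka topic: yields events in arrival order.
    yield from events

def process(stream):
    # Stand-in for a Spark transformation: parse each event, keep clicks.
    for raw in stream:
        user, action = raw.split(",")
        if action == "click":
            yield {"user": user, "action": action}

def store(rows, table):
    # Stand-in for a Cassandra write, grouped by user (the partition key).
    for row in rows:
        table.setdefault(row["user"], []).append(row["action"])

table = {}
store(process(ingest(["u1,click", "u2,view", "u1,click"])), table)
# table now groups clicks per user: {"u1": ["click", "click"]}
```

The point of the sketch is the shape of the pipeline – a stream flowing through a transformation into keyed storage – which is the same shape the course builds with the real components.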

Although this course includes some theory, it leans toward the practical side. While learning about Cassandra, Spark, and Kafka from a theoretical perspective, we will also come to understand these solutions through hands-on exercises: we will work with the Cassandra Query Language (CQL), and mainly with SQL in the Spark SQL and KSQL hands-on sessions. For Spark we will also use a bit of Scala – but the focus will be on understanding the behavior of the system, not on how to program it.

So, if you are a solutions architect, a product manager, or just someone who would like to understand how a big data architecture could fit into your path/solution, as well as how to build one, this course will answer both theoretical and practical questions. The trainer is experienced in both training and real-life projects and will help you understand the components of big data solutions, the underlying concepts, and the way these components can be used in real-life examples.

Main topics for the course: 

Session 1

  • Big Data Architecture overview: components and roles in an architecture
  • Data ingestion: Apache Kafka and ksqlDB

Session 2

  • Data processing: Apache Spark (+ Apache Zeppelin)
  • Spark Structured Streaming with Apache Kafka 

Session 3

  • Data storage: Apache Cassandra

Session 4

  • Data analytics: Apache Spark + Cassandra ( + Apache Zeppelin) 

Session 5

  • Data visualization & ML: Apache Spark (+ Zeppelin + Vegas/Helium as libraries for visualization)

Session 6

  • End-to-end use case
  • Big Data Architecture Review

If you are interested in participating, please complete the form here: