Machine Learning with Decision Trees and Random Forest

Following our bigdata.ro Spark working group (completed) and Druid+Superset working group (ongoing), we are now opening registration for a new working group with the topic: Machine Learning with Decision Trees and Random Forest.

Timeline: September 27th – December 4th (one online meeting every 2 weeks, 5 meetings in total)
Predefined topic: Solving a regression/classification problem using decision trees and random forest. 
We will use: Python, Jupyter notebooks, sklearn, dtreeviz for tree visualization, matplotlib/seaborn for data visualization


The group’s purpose is to go through the entire ML flow while learning about decision trees and random forest.
You can check out the group’s description and register here.
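As a taste of the tooling, here is a minimal sklearn sketch (synthetic data, purely illustrative – the working group will use a real dataset) fitting both a single decision tree and a random forest on a small regression problem:

```python
# Minimal sketch: fit a decision tree and a random forest on a tiny
# synthetic regression problem, then compare held-out R^2 scores.
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train, y_train)
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)

# R^2 on held-out data: the ensemble usually generalizes better
# than a single (shallow) tree.
print(f"tree R^2:   {tree.score(X_test, y_test):.2f}")
print(f"forest R^2: {forest.score(X_test, y_test):.2f}")
```

In the group we will go beyond this toy setup: data preparation, tuning, and inspecting the fitted trees with dtreeviz.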

Data ingest using Apache NiFi


Course date & duration: August 8th, 2020, 9:30 – 14:00, 30 min break included
Trainer: Lucian Neghina
Location: Online (using Zoom)
Price: 150 RON (including VAT)
Number of places: 10 (3 places left)
When you design a data solution, one of the earliest questions is where your data comes from and how you will make it available to the solutions that process or store it. This is especially true for IoT data, which arrives from various sources and is then processed and stored by several components of your solution. And now that we work mainly with streams rather than static data, a solution that can design and run the flow of events from the sources to the processing/storage stage is extremely important. Apache NiFi was built to automate that data flow from one system to another: it is a data flow management system with a web UI for building data flows in real time, and it supports flow-based programming.
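NiFi flows are assembled in its web UI rather than written as code, but the flow-based idea – independent processors connected by queues – can be sketched in plain Python (the processor functions below are made-up illustrations, not NiFi APIs):

```python
# Toy sketch of flow-based processing in the spirit of NiFi:
# processors are independent steps connected by queues, so each
# stage can be modified or rerouted without touching the others.
from queue import Queue

def generate(out_q):
    """Source processor: emits raw events (here, fake IoT readings)."""
    for reading in ("21.5", "22.0", "bad-data", "19.8"):
        out_q.put(reading)
    out_q.put(None)  # end-of-stream marker

def validate(in_q, out_q):
    """Filter processor: drops events that fail parsing."""
    while (item := in_q.get()) is not None:
        try:
            out_q.put(float(item))
        except ValueError:
            pass  # a real flow would route failures to an error relationship
    out_q.put(None)

def store(in_q, sink):
    """Sink processor: delivers clean events to storage."""
    while (item := in_q.get()) is not None:
        sink.append(item)

raw, clean, db = Queue(), Queue(), []
generate(raw)
validate(raw, clean)
store(clean, db)
print(db)  # only the valid readings reach the sink
```

In NiFi each of these stages would be a processor box on the canvas, with queues and back-pressure handled for you.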

You can check out the agenda and register here.

Understanding Big Data Architecture E2E (Use case including Cassandra + Kafka + Spark + Zeppelin)  

Timeline & Duration: July 27th – August 14th, 6 × 4-hour online sessions over 3 weeks (2 sessions/week, Monday + Thursday). An online setup will be available for exercises/hands-on sessions for the duration of the course.
Main trainer: Valentina Crisan
Location: Online (Zoom)
Price: 250 EUR 
Pre-requisites: knowledge of distributed systems and the Hadoop ecosystem (HDFS, MapReduce), plus some SQL.

More details and registration here.

Big Data Learning – Druid working group

Learning a new solution or building an architecture for a specific use case is never easy, especially when you are trying to embark alone on such an endeavour – thus in 2020 bigdata.ro started a new way of learning specific big data solutions/use cases: working groups. With the first working group (centered around Spark Structured Streaming + NLP) on its way to completion in July, we are now opening registration for a new working group, this time centered around Apache Druid: Building live dashboards with Apache Druid + Superset. The working group aims to take place from the end of July through October and will bring together a team of 5-6 participants who will define the scope, select the data (open data), install the needed components, and implement the needed flow. Alongside the participants we will have a team of advisors (with experience in Druid and big data in general) who will help the participants solve the different issues that arise in the project.

Find more details of the working group here.

Understanding joins with Apache Spark


Workshop date & duration: June 20, 2020, 9:30 – 14:00, 30 min break included
Trainers: Valentina Crisan, Maria Catana
Location: Online
Price: 150 RON (including VAT)
Number of places: 10
Languages: Scala & SQL

DESCRIPTION:

For a (mainly) in-memory processing platform like Spark, getting the best performance is most of the time about:

  1. Optimizing the amount of data needed in order to perform a certain action   
  2. Having a partitioning strategy that distributes the data optimally to the Spark cluster executors (this is often correlated with the underlying storage's data distribution for the initial data, but is also related to how data is partitioned during the join itself, given that before the actual join operation runs, the partitions are first sorted)
  3. And in case your action is a join – choosing the right strategy for it

This workshop will focus mainly on two of the above steps – partitioning and join strategy – making these aspects clearer through exercises and hands-on sessions.
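The partitioning idea behind a shuffle join can be sketched in plain Python (this is not Spark code, just the principle: rows sharing a join key hash to the same partition, so each executor can join its partition locally):

```python
# Sketch of why partitioning matters for joins: if both datasets are
# hash-partitioned on the join key, every key lands in the same
# partition index on both sides, so each partition can be joined
# locally – this mirrors the shuffle Spark performs before a
# sort-merge join.
NUM_PARTITIONS = 4

def hash_partition(rows, num_partitions):
    parts = [[] for _ in range(num_partitions)]
    for key, value in rows:
        parts[hash(key) % num_partitions].append((key, value))
    return parts

orders = [("alice", "order-1"), ("bob", "order-2"), ("alice", "order-3")]
users = [("alice", "RO"), ("bob", "FR"), ("carol", "DE")]

order_parts = hash_partition(orders, NUM_PARTITIONS)
user_parts = hash_partition(users, NUM_PARTITIONS)

# Local join per partition: no partition ever needs another
# partition's data, which is what makes the join parallelizable.
joined = []
for op, up in zip(order_parts, user_parts):
    lookup = dict(up)
    for key, order in op:
        if key in lookup:
            joined.append((key, order, lookup[key]))

print(sorted(joined))
```

In Spark the other main strategy, a broadcast join, skips this shuffle entirely by shipping the small side to every executor – one of the trade-offs the workshop explores.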

You can check out the agenda and register here.

ETL with Apache Spark


Workshop date & duration: March 28th, 2020, 9:30 – 14:00, 30 min break included
Trainers: Valentina Crisan, Maria Catana
Location:  eSolutions Academy, Budişteanu Office Building, strada General Constantin Budişteanu Nr. 28C, etaj 1, Sector 1, Bucureşti
Price: 150 RON (including VAT)
Number of places: 10 (no more places left)
Languages: Scala & SQL

DESCRIPTION:

One of the many uses of Apache Spark is to transform data between different formats and sources, for both batch and streaming data. In this mainly hands-on workshop we will focus on just that: understanding how to read/write/transform/join different formats of data and manage schemas, and how best to handle such data with Apache Spark. So, if you know a bit about Spark but haven't had the chance to play with its ETL capabilities – or even if you don't know much but would like to find out – this workshop might be of interest.
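The read/transform/write pattern the workshop covers can be sketched outside Spark with just the standard library (CSV in, JSON out; the column names are made up):

```python
# Minimal ETL sketch: extract from CSV, transform (type casting and
# filtering), load as JSON. Spark does the same conceptually, but
# distributed and with schema management built in.
import csv
import io
import json

raw_csv = """name,age,city
ana,34,Bucharest
ion,not-a-number,Cluj
maria,28,Iasi
"""

# Extract: read rows with a header-aware reader.
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Transform: cast types, drop rows that don't fit the schema.
clean = []
for row in rows:
    try:
        clean.append({"name": row["name"], "age": int(row["age"]), "city": row["city"]})
    except ValueError:
        pass  # a real pipeline would route bad records to a quarantine sink

# Load: serialize to the target format.
output = json.dumps(clean, indent=2)
print(output)
```

With Spark the same flow becomes a DataFrame read, a set of transformations, and a write – and the schema handling and bad-record policies are first-class options rather than hand-rolled try/except blocks.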

You can check out the agenda and register here.

Fast and Scalable: SpringBoot + SOLR


Workshop date & duration: March 21st, 2020, 9:30 – 14:00, 30 min break included
Trainer: Oana Brezai
Location:  eSolutions Academy, Budişteanu Office Building, strada General Constantin Budişteanu Nr. 28C, etaj 1, Sector 1, Bucureşti
Price: 150 RON (including VAT)
Number of places: 10
Languages: Java

Description:

You have probably heard of SOLR by now – the open-source search platform that powers the search and navigation features of many of the world’s largest internet sites (e.g. AOL, Apple, Netflix). During this workshop, you will build (under the trainer’s guidance) a basic web shop (catalog + search) using SOLR and SpringBoot.
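SOLR is queried over a plain HTTP/JSON API; here is a sketch of building a select query with the standard library (the core name "products" and the field names are hypothetical examples, not from the workshop material):

```python
# Sketch of a SOLR select query built with the standard library.
# SOLR answers GET requests of the form /solr/<core>/select?q=...;
# the core name "products" and the fields below are made-up examples.
from urllib.parse import urlencode

base = "http://localhost:8983/solr/products/select"
params = {
    "q": "name:laptop",        # full-text query on the name field
    "fq": "price:[0 TO 500]",  # filter query: restrict by price range
    "rows": 10,                # page size
    "wt": "json",              # response format
}
query_url = f"{base}?{urlencode(params)}"
print(query_url)
```

A SpringBoot application would typically issue such queries through the SolrJ client rather than raw HTTP, but the underlying request looks like this.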


You can check out the agenda and register here.

Big Data learning – Working groups 2020

Learning a new solution or building an architecture for a specific use case is never easy, especially when you are trying to work alone on such an endeavour – thus this year we will debut a new way of learning specific big data solutions/use cases: working groups.

What will these working groups mean:

  • A predefined topic (see below the topics for 2020) that will be either understanding a big data solution or building a use case;
  • A group of 5 participants and one predefined driver per group – the driver’s role (besides being part of the group) is to organize the group and provide the meeting locations and the cloud infrastructure needed for installing the studied solution;
  • 5 physical meetings, one every 2 weeks (thus a 10-week window for each working group). The meetings will take place either during the week (5PM – 9PM) or on Saturday mornings (10AM – 2PM);
  • Active participation/contribution from each participant – for example, each participant will have to present to the rest of the group in 2 of the meetings;
  • Some study @ home between the sessions.

More details and registration here.

Intro to Spark Structured Streaming using Scala and Apache Kafka


Workshop date & duration: February 1st, 2020, 9:30 – 14:00, 30 min break included
Trainer: Valentina Crisan, Maria Catana
Location:  eSolutions Academy, Budişteanu Office Building, strada General Constantin Budişteanu Nr. 28C, etaj 1, Sector 1, Bucureşti
Price: 150 RON (including VAT)
Number of places: 10 (no more places left)
Languages: Scala & SQL

Description:

Structured streaming was introduced in Spark 2.0, modeling the stream as an unbounded/infinite table – a big architectural change compared with the micro-batch model (DStream) that existed before. The workshop will introduce you to how Spark can read, process & analyze streams of data – we will use stream data from Apache Kafka, and Scala & SQL for reading/processing/analyzing the data. We will also discuss stateless vs stateful queries and how Spark handles out-of-order data in aggregation queries.
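The unbounded-table idea, and the stateless/stateful distinction, can be sketched in plain Python (word counts over hand-written micro-batches stand in for Kafka records here):

```python
# Sketch of stateless vs stateful stream processing over micro-batches.
# Stateless: each batch is handled on its own. Stateful: a running
# aggregation (the "unbounded table") is updated by every batch and
# is never "finished".
from collections import Counter

micro_batches = [
    ["spark", "kafka", "spark"],
    ["kafka", "scala"],
    ["spark"],
]

# Stateless query: per-batch result, no memory between batches.
per_batch = [Counter(batch) for batch in micro_batches]

# Stateful query: state carried across batches, updated incrementally.
running = Counter()
for batch in micro_batches:
    running.update(batch)

print(per_batch[2])  # only sees the last batch
print(running)       # reflects the whole (unbounded) stream so far
```

Spark adds the hard parts on top of this picture: distributed state storage, fault tolerance, and watermarks for deciding when out-of-order data can still update an aggregate.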


You can check out the agenda and register here.

ML on Spark workshop

WORKSHOP MACHINE LEARNING WITH SPARK

Workshop date & duration: November 5th (Tuesday), 14:00 – 18:00
Trainer: Sorin Peste
Supporting students: Alexandru Petrescu, Laurentiu Partas
Location: TechHub Bucuresti
Price: Free (upon approval by the organizer & trainer)
Number of places: 20 (no more places left)
Languages: Python

Description:
We are coming back in November with a new workshop on Machine Learning – this time on building a model using Spark ML logistic regression and gradient boosting.
So, come join us for an afternoon in which we will explore Apache Spark’s Machine Learning capabilities. We’ll be looking at using Spark to build a credit scoring model that estimates the probability of default for customers.
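The workshop builds the model with Spark ML, but the core idea can be sketched with sklearn on synthetic data (the feature names and the data-generating rule below are made up for illustration): logistic regression outputs a probability of default per customer.

```python
# Sketch of a credit-scoring model: logistic regression outputs a
# probability of default. Synthetic data and made-up features; the
# workshop itself builds this with Spark ML rather than sklearn.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
income_k = rng.normal(5.0, 1.5, n)   # hypothetical monthly income (thousands)
debt_ratio = rng.uniform(0, 1, n)    # hypothetical debt-to-income ratio

# Synthetic labels: default is more likely with low income and high
# debt ratio, plus some noise.
default = (debt_ratio * 4 - income_k * 0.8 + rng.normal(0, 0.5, n) > 0.5).astype(int)

X = np.column_stack([income_k, debt_ratio])
model = LogisticRegression().fit(X, default)

# predict_proba returns per-class probabilities; column 1 is the
# estimated probability of default.
p_default = model.predict_proba([[3.0, 0.9], [8.0, 0.1]])[:, 1]
print(p_default)  # low income + high debt should score riskier
```

Gradient boosting, the workshop's second technique, replaces the single linear model with an ensemble of trees but exposes the same probability-of-default interface.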

You can check out the agenda and register here.