Machine Learning with Decision Trees and Random Forest

Following our Spark (completed) and Druid+Superset (ongoing) working groups, we are now opening registration for a new working group on the topic: Machine Learning with Decision Trees and Random Forest.

Timeline: September 27th – December 4th (one online meeting every 2 weeks, 5 meetings in total)

What this working group involves:

A predefined topic: Solving a regression/classification problem using decision trees and random forest. 


The group’s purpose is to go through the entire ML flow while learning about decision trees and random forest. Knowing your data and preparing it is an important part of the ML flow, so we will learn how to do:

  • Data analysis using different types of graph visualisations
  • Data cleanup and feature transformation
  • Model training and evaluation
  • Model prediction on new data
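
As a rough illustration of the steps listed above, here is a minimal sketch of the flow with sklearn. It is not the group's actual material. The iris dataset bundled with sklearn is only a stand-in (an assumption) for the datasets the group will choose, the hyperparameters are arbitrary, and the analysis step prints summary statistics instead of drawing plots.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.tree import DecisionTreeClassifier

# Stand-in dataset (assumption): iris, which ships with sklearn.
X, y = load_iris(return_X_y=True)

# 1. Data analysis: basic summary statistics (plots omitted in this sketch).
print("shape:", X.shape)
print("feature means:", X.mean(axis=0).round(2))
print("class counts:", np.bincount(y))

# 2. Hold out a test set so the evaluation uses data the models never saw.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# 3. Data cleanup + model training. The median imputer fills missing values;
#    iris has none, so here it only shows where a cleanup step would go.
models = {
    "decision tree": make_pipeline(
        SimpleImputer(strategy="median"),
        DecisionTreeClassifier(max_depth=3, random_state=42),
    ),
    "random forest": make_pipeline(
        SimpleImputer(strategy="median"),
        RandomForestClassifier(n_estimators=100, random_state=42),
    ),
}

# 4. Evaluation on the held-out test set.
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: test accuracy = {scores[name]:.3f}")

# 5. Prediction on new data: one sample with the same four features.
new_sample = np.array([[5.1, 3.5, 1.4, 0.2]])
prediction = models["random forest"].predict(new_sample)
print("predicted class:", prediction[0])
```

Wrapping the cleanup step and the model in one pipeline means the same transformation is applied at training and at prediction time, which is one common way to avoid leaking test data into the preprocessing.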

Also, for this group we will use smaller datasets, and our main focus will be to understand all the steps required for creating the ML flow and for using decision trees/random forest. Please note that this working group is not related to big data technologies, so do not expect Spark ML – we will stick to sklearn and Python, plus a visualization library co-developed by a friend of ours, Tudor Lapusan. The driver of this group is Maria Catana.

We will use: Python, Jupyter notebooks, sklearn, dtreeviz for tree visualisation, matplotlib/seaborn for data visualization
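
To give a feel for what inspecting a trained tree looks like, here is a small sketch that prints the learned rules as text. It uses sklearn's built-in `export_text` as a plain-text stand-in, not dtreeviz, and the diabetes regression dataset bundled with sklearn is an assumption chosen only because it needs no download.

```python
from sklearn.datasets import load_diabetes
from sklearn.tree import DecisionTreeRegressor, export_text

# Stand-in regression dataset (assumption): diabetes, bundled with sklearn.
data = load_diabetes()

# A shallow tree keeps the printed rules short enough to read.
tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(data.data, data.target)

# Render the split conditions and leaf values as indented text.
rules = export_text(tree, feature_names=list(data.feature_names))
print(rules)
```

Each indented line is one split condition on a feature, and each leaf shows the value the tree predicts for samples that reach it.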

Datasets (we will choose one regression and one classification dataset out of these):

  • Regression: 

Predict auction prices:
Predict sale prices:
Predict price Airbnb Berlin:

  • Classification:

Credit card fraud:
Customer credit rating –
Census income dataset –


Prerequisites:

  • Introductory-level Python, or experience with a similar programming language. Note: we will not focus on teaching Python, so basic programming experience in Python or another language will be useful.
  • We plan to work with local installations of Python and Jupyter (including sklearn and other Python libraries), although we are looking into setting up a Jupyter notebook in the cloud to allow better collaboration. In any case, please make sure your system allows these local installations.
  • Google account – we will use Google Hangouts for communication: a group will be created, and the online meetings will take place on Hangouts as well.

Mandatory resources to read/watch before the first meeting:

Decision trees:

Random forest:
First lesson from:

Classification project example:

A group of 5-6 participants and one predefined driver per group (driver of this group: Maria Catana) – besides being part of the group, the driver's role is to organize and manage the group and its timeline so that the group achieves its goal, and to find people who can help if the group reaches a deadlock;

5 online meetings, one every 2 weeks (thus a 10-week time window for each working group), held on Google Hangouts/Zoom. The meetings will take place Monday-Friday, between 6PM and 9PM;

Active participation/contribution from each participant: expect about 2 hours/week of study in addition to the meeting time. Each participant will also have to present to the rest of the group in at least 2 of the meetings;

Some study at home between the sessions (~2 hours/week);

The fee for participating in these working groups is 100 Euro/participant (you will receive confirmation of participation after payment and after agreeing to the rules of the group) – please register using the steps below, and you will receive the payment link and the working group participation rules by email.