Machine Learning with Decision Trees and Random Forest

Following our Spark (completed) and Druid + Superset (ongoing) working groups, we are now opening registration for a new working group on the topic: Machine Learning with Decision Trees and Random Forest.

Timeline: September 27th – December 4th (5 online meetings, one every 2 weeks)
Predefined topic: Solving a regression/classification problem using decision trees and random forest. 
We will use: Python, Jupyter notebooks, sklearn, dtreeviz for tree visualisation, matplotlib/seaborn for data visualization


The group’s purpose is to go through the entire ML flow while learning about decision trees and random forest.
You can check out the group’s description and register here.

Starting up with big data

Authors: Cosmin Chauciuc, Valentina Crisan

Throughout our training journeys, one of the most frequently asked questions at the end of a course (which usually runs in the cloud on an already installed and configured setup) is: “How do I now set up my own small big data system?” Most of the time the question refers to a local installation, not something in the cloud: something that can be used for testing small use cases and getting started with big data exploration.

Now, getting started with big data can take 2 completely different approaches, depending on the intended use case:

  • if you need HDFS – a distributed file storage system – in order to properly test a Hadoop-like ecosystem, I recommend going for the small Cloudera / Hortonworks installations. But even the smallest installations require 16GB RAM, and a lot of laptops/PCs cannot afford to allocate this amount of RAM; in addition, this minimal configuration runs quite slowly. Thus, in this case a dedicated server (e.g. a Hetzner server) might be a better choice than a local installation.
  • the second option (local storage, no HDFS) – when you just want a minimal installation to understand what you could do with it – is to install a minimal combination of Confluent Kafka + Cassandra + Zeppelin. Zeppelin comes with an embedded local Spark, so you will basically get something like the setup in the picture below. This is the setup that this post focuses on.

small big data

The blog post below has been drafted and revised by me (Valentina Crisan) but actually all steps were described and detailed by one of my former students in the Big Data Architecture and Technologies open courses: Cosmin Chauciuc. I am so proud to see someone taking forward the basis laid by my course and wanting to expand and become proficient in the big data field.

The blog post will have 3 parts:

  • how to install a minimum big data solution on your own computer (current post)
  • building an end to end use case 
  • visualizing data in Spark with Helium

We assume the reader already knows something about the solutions mentioned in this post (Kafka, Cassandra, Spark, Zeppelin) – thus we won’t go into the basics of what these are. 

Build your own small big data solution

We will use the Ubuntu Linux distribution for this installation, so please follow Step 0 for one of the ways to install Ubuntu on your system. Please note that for a complete setup you will need a minimum of 8GB RAM; see the RAM requirements for each service in the next section.

0. Install Ubuntu

Ubuntu is an open source Linux distribution based on Debian. It is very easy to set up and gives you a full Linux operating system. There are various ways to have Ubuntu on your laptop alongside your current Windows installation.

  1. WSL – for Windows 10
    Install a complete Ubuntu terminal environment in minutes on Windows 10 with Windows Subsystem for Linux (WSL).
  2. Docker
    Docker is a set of platform as a service (PaaS) products that use OS-level virtualization to deliver software in packages called containers.
  3. Using VirtualBox
    VirtualBox can run different operating systems using virtualization. See below the steps to install:

    1. Software to download :
    2. Ubuntu ISO to download :
    3. Set up a new machine in VirtualBox
      – Name: Ubuntu (the type will be auto-filled as Linux)
      – Allocate RAM for the virtual machine: minimum recommended 8GB

Now, why 8GB RAM? See below the estimated RAM usage for each of the services:

      • Apache Cassandra 2.3GB
      • Confluent Kafka 3.8GB
      • Apache Zeppelin 1.5GB
      • Ubuntu OS 1.1GB
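As a quick sanity check, the estimates above can be added up. The following is a minimal Python sketch; the dictionary name is only illustrative and the figures are the estimates listed above.

```python
# Estimated RAM per service, in GB (figures from the list above).
ram_estimates_gb = {
    "Apache Cassandra": 2.3,
    "Confluent Kafka": 3.8,
    "Apache Zeppelin": 1.5,
    "Ubuntu OS": 1.1,
}

# Round to one decimal to avoid floating point noise in the printed total.
total_gb = round(sum(ram_estimates_gb.values()), 1)
print(f"Estimated total: {total_gb} GB")
```

The estimates add up to roughly 8.7GB, so an 8GB allocation should be treated as a lower bound when all services run at the same time.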

4. Create a virtual hard disk, VDI, with dynamically allocated size. Depending on what you plan to install, allocate 20-30 GB.

5. The last step is to attach the ISO file to the newly created virtual machine: Settings, Storage, Live CD. This will let you boot from the ISO file.

Ubuntu can be used as a Live system or it can be installed on the VDI hard drive. The difference is that the live OS lets you use the system but does not persist changes, files or newly installed software across reboots. Since we want to proceed with installing big data tools, we have to install the system.

After you have the Ubuntu OS installed, you need to proceed with the next installation steps:
1. Confluent Kafka
2. Apache Cassandra
3. Apache Zeppelin – with embedded Apache Spark local
4. Apache Zeppelin configuration for Spark interworking with Kafka and Cassandra

Note: you might observe that all the solutions above are Apache ones except Kafka, for which we chose a Confluent installation. The reasoning is simple: Confluent has a one-node installation available that bundles all the Kafka services in one node: Apache Kafka (Zookeeper + Kafka broker), Schema Registry, Connect and KSQL, to name a few. This one-node installation is meant for testing purposes (not for commercial/production ones). In production you will need to install these services from scratch (unless you choose a commercially licensed installation from Confluent), but this is the setup we found best to get you started and to give you a good glimpse of the functionality the Kafka ecosystem offers. Nonetheless, if you don’t want to go for the Confluent one-node installation, at the end of this post you will also find the Apache Kafka version; in that case you will have only Kafka, without Connect, Schema Registry and KSQL.

1. Prerequisites for services installations

  • Install curl
sudo apt install curl
  • Install Java 8
sudo apt install openjdk-8-jdk openjdk-8-jre
  • Java version check
$ java -version
openjdk version "1.8.0_265"
OpenJDK Runtime Environment (build 1.8.0_265-8u265-b01-0ubuntu2~20.04-b01)
OpenJDK 64-Bit Server VM (build 25.265-b01, mixed mode)

2. Kafka installation

  • We will download Confluent Kafka – one node installation (no commercial license needed) – it is one node that has all the Kafka components available: Zookeeper, Broker, Schema Registry, KSQL, Connect
curl -O
  • Extract the contents of the archive
cd confluent-5.5.1/

We’ll use the default configuration files.

  • Starting the services: this Confluent script starts all the services, including KSQL.
~/confluent-5.5.1/bin$ ./confluent local start

    The local commands are intended for a single-node development environment
    only, NOT for production usage.
Using CONFLUENT_CURRENT: /tmp/confluent.kCNzjS0a
Starting zookeeper
zookeeper is [UP]
Starting kafka
kafka is [UP]
Starting schema-registry
schema-registry is [UP]
Starting kafka-rest
kafka-rest is [UP]
Starting connect
connect is [UP]
Starting ksql-server
ksql-server is [UP]
Starting control-center
control-center is [UP]
  • Another way to install Kafka – not using the Confluent script – is to start Zookeeper first and then add each Kafka broker. See the details in Section 7.

3. Apache Cassandra

  • Add the Apache repository of Cassandra to the file cassandra.sources.list
echo "deb 311x main" | sudo tee -a /etc/apt/sources.list.d/cassandra.sources.list
  • Add the Apache Cassandra repository keys to the list of trusted keys on the server
curl | sudo apt-key add -
  • Update
sudo apt-get update
  • Install Cassandra
sudo apt-get install cassandra
  • Start Cassandra service
sudo service cassandra start
sudo service cassandra stop # only if you want to stop the service
  • Check the status of Cassandra
nodetool status

Datacenter: datacenter1

|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  70.03 KiB  256          100.0%            1c169827-bf4c-487f-b79a-38c00855b144  rack1

  • Test CQLSH
$ cqlsh
Connected to Test Cluster at
[cqlsh 5.0.1 | Cassandra 3.11.7 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.

4. Apache Zeppelin

$ wget
  • Extract archived files
$ tar xzvf zeppelin-0.9.0-preview2-bin-all.tgz
  • Start Zeppelin service
$ cd zeppelin-0.9.0-preview2-bin-all
$ ./bin/ start
Log dir doesn't exist, create /home/kosmin/zeppelin-0.9.0-preview2-bin-all/logs
Pid dir doesn't exist, create /home/kosmin/zeppelin-0.9.0-preview2-bin-all/run
Zeppelin start                                             [  OK  ]

  • Configuring users for Zeppelin

The default login is with an anonymous user. The user configuration is found in the conf folder under <zeppelin_path>.

~/zeppelin-0.9.0-preview2-bin-all$ cd conf/
$ mv shiro.ini.template shiro.ini
$ vi shiro.ini
# uncomment the line with admin = password1, admin
# then restart the Zeppelin service
  • Log in and test the Spark interpreter
sc.version
res3: String = 2.4.5

5. Configuring Zeppelin for connection with Kafka and Cassandra – through the embedded Apache Spark

An Apache Zeppelin interpreter is a plugin that enables you to access processing engines and data sources from the Zeppelin UI. In order to connect our installed Zeppelin to Kafka, we need to add some artifacts to the Spark interpreter in the Zeppelin configuration. More info on Zeppelin interpreters can be found here:

Configurations for Kafka in Zeppelin

  • Finding the spark version
sc.version - in the zeppelin notebook
res1: String = 2.4.5
  • Open the interpreter config page 
    • Under the USERNAME, open the menu and click Interpreter
    • Search for SPARK and click edit
    • In the Dependencies section, add this under Artifact and click save

Configurations for Cassandra in Zeppelin

  • Finding the spark version
sc.version - in the zeppelin notebook
res1: String = 2.4.5
  • Open the interpreter config page
    • Under the USERNAME, open the menu and click Interpreter
    • Search for SPARK, click edit and, in the Dependencies section, add this under Artifact and click save:

Version compatibility for the Spark-Cassandra connector can be found here.

6. End to end exercise to test setup

Using the Kafka console producer: create a topic (bigdata in the commands below, with 3 partitions) and start typing data into it.

kosmin@bigdata:~/confluent-5.5.1/bin$ ./kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 3 --topic bigdata
Created topic bigdata.
kosmin@bigdata:~/confluent-5.5.1/bin$ ./kafka-console-producer --broker-list localhost:9092 --topic bigdata
enter some words here :) 
# press CTRL+C when you want to stop producing events to the topic

Using Zeppelin, open a notebook: create a streaming DataFrame in Spark that points to the bigdata topic in Kafka. Then create a query that reads all the fields of each record, such as the key (null in our case).

val kafkaDF = spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "bigdata") // the topic created above
      .load()

val query_s0 = kafkaDF.writeStream.outputMode("append").format("console").start()

Test for Cassandra

import org.apache.spark.sql.functions._
val tables = spark.read.format("org.apache.spark.sql.cassandra").options(Map( "table" -> "tables", "keyspace" -> "system_schema" )).load()

7. Another way to install Kafka – not using the Confluent script – is to start first Zookeeper and then add each Kafka broker

  • Zookeeper
kosmin@bigdata:~/confluent-5.5.1$ ./bin/zookeeper-server-start  ./etc/kafka/

In another terminal window, we can check whether Zookeeper started. By default, it listens on port 2181.

$ ss -tulp | grep 2181
tcp    LISTEN   0        50                     *:2181                  *:*      users:(("java",pid=6139,fd=403))
  • Kafka

To start the Kafka service, run this in another terminal:

kosmin@bigdata:~/confluent-5.5.1$ ./bin/kafka-server-start ./etc/kafka/

Check if Kafka started in another terminal window. Default Kafka broker port is 9092.

kosmin@bigdata:~/confluent-5.5.1$ ss -tulp | grep 9092
tcp    LISTEN   0        50                     *:9092                  *:*      users:(("java",pid=6403,fd=408))
  • Test Kafka topics
kosmin@bigdata:~/confluent-5.5.1/bin$ ./kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic Hello
Created topic Hello.
kosmin@bigdata:~/confluent-5.5.1/bin$ ./kafka-topics --list --zookeeper localhost:2181
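
The port checks done above with ss can also be scripted. Below is a minimal Python sketch, assuming the default ports 2181 and 9092 mentioned in this section; the helper name is_port_open is ours and is not part of the Kafka or Confluent tooling.

```python
import socket


def is_port_open(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Connection refused, timed out, or host unreachable.
        return False


if __name__ == "__main__":
    # Default ports mentioned in this post.
    for name, port in [("Zookeeper", 2181), ("Kafka broker", 9092)]:
        state = "listening" if is_port_open("localhost", port) else "not reachable"
        print(f"{name} on port {port}: {state}")
```

An open port only shows that something is listening there; it does not prove that the service behind it is healthy.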

At this point you should have your own small private cluster, ready for learning some of the solutions used in Big Data architectures. Although such a setup cannot show you the wonders of distributed systems, it can help you understand the purpose of Kafka/Cassandra/Spark/Zeppelin in a big data architecture.

The next post in this series will be about building an end-to-end use case with this setup.

Spark working group

A Big Data Analysis of Meetup Events using Spark NLP, Kafka and Vegas Visualization

Finding trending Meetup topics using Streaming Data, Named Entity Recognition and Zeppelin Notebooks – a tale of a super enthusiastic working group during the pandemic times. 

Author : Andrei Deusteanu 

Project Team: Valentina Crisan, Ovidiu Podariu, Maria Catana, Cristian Stanciulescu, Edwin Brinza, Andrei Deusteanu

We started out as a working group. Our main purpose was to learn and practice Spark Structured Streaming, Machine Learning and Kafka. We built the entire use case and then the architecture from scratch.

This is a learning case story. We did not really know from the beginning what would be possible or not. For sure, looking back, some of the steps could have been optimized. But, hey, that’s how life works in general.  

Since Meetup provides data through a real-time API, we used it as our main data source. We did not use the data for commercial purposes, just for testing.

The problems we tried to solve:

  • Allow meetup organizers to identify trending topics related to their meetup. We computed Trending Topics based on the description of the events matching the tags of interest to us. We did this using the John Snow Labs Spark NLP library for extracting entities. 
  • See which Meetup events attract the most responses within our region. To do this, we monitored the RSVPs for meetups using certain tags related to our domain of interest, Big Data.

For these we developed 2 sets of visualizations:

  • Trending Keywords
  • RSVPs Distribution


Trending Keywords

Project Documentation

  1. The Stream Reader script fetches data on Yes RSVPs filtered by certain tags from the Meetup Stream API. It then selects the relevant columns that we need. After that it saves this data into the rsvps_filtered_stream Kafka topic. 
  2. For each RSVP, the Stream Reader script then fetches event data for it, only if the event_id does not exist in the events.idx file. This way we make sure that we read event data only once. The setup for the Stream Reader script can be found -> Install Kafka and fetch RSVPs 
  3. The Spark ML – NER Annotator reads data from the Kafka topic events and then applies a Named Entity Recognition Pipeline with Spark NLP. Finally it saves the annotated data in the Kafka topic TOPIC_KEYWORDS. The Notebook with the code can be found here
  4. Using KSQL we create 2 subsequent streams to transform the data and finally 1 table that will be used by Spark for the visualization. In Big Data architectures, SQL engines only build a logical object that assigns metadata to the physical layer objects. In our case these were the streams we built on top of the topics. To detail a bit: we link data from TOPIC_KEYWORDS to a new stream via KSQL, called KEYWORDS. Then, using a Create as Select, we create a new stream, EXPLODED_KEYWORDS, for exploding the data, since all of the keywords were in an array. Now we have 1 row for each keyword. Next, we count the occurrences of each keyword and save the result into a table, KEYWORDS_COUNTED. The steps to set up the streams and the tables with the KSQL code can be found here: Kafka – Detailed Architecture.
  5. Finally, we use the Vegas library to produce the visualizations on Trending Keywords. The Notebook describing all steps can be found here
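
To make the explode-and-count logic of step 4 more concrete, here is a minimal sketch in plain Python. It is not the KSQL code used in the project; the sample events and the field names event_id and keywords are hypothetical and only mimic the shape of the data.

```python
from collections import Counter

# Hypothetical annotated events, shaped like messages in TOPIC_KEYWORDS:
# each event carries an array of extracted keywords.
annotated_events = [
    {"event_id": "e1", "keywords": ["Spark", "Kafka"]},
    {"event_id": "e2", "keywords": ["Kafka", "Zoom"]},
    {"event_id": "e3", "keywords": ["Zoom", "Kafka"]},
]

# Like EXPLODED_KEYWORDS: turn each keyword array into one row per keyword.
exploded = [
    {"event_id": event["event_id"], "keyword": keyword}
    for event in annotated_events
    for keyword in event["keywords"]
]

# Like KEYWORDS_COUNTED: count the occurrences of each keyword.
keyword_counts = Counter(row["keyword"] for row in exploded)

for keyword, count in keyword_counts.most_common():
    print(keyword, count)
```

The main difference from the real pipeline is that KSQL keeps the counts continuously updated as new messages arrive, while this sketch counts a fixed list once.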

Detailed Explanation of the NER Pipeline

In order to annotate the data, we need to transform it into a certain format, from text to numbers, and then back to text.

  1. We first use a DocumentAssembler to turn the text into a Document type.
  2. Then, we break the document into sentences using a SentenceDetector.
  3. After this we separate the text into smaller units by finding the boundaries of words using a Tokenizer.
  4. Next we remove HTML tags and numerical tokens from the text using a Normalizer.
  5. After the preparation and cleaning of the text we need to transform it into a numerical format, vectors. We use an English pre-trained WordEmbeddingsModel.
  6. Next comes the actual keyword extraction using an English NerDLModel Annotator. NerDL stands for Named Entity Recognition Deep Learning.
  7. Further on we need to transform the numbers back into a human readable format, a text. For this we use a NerConverter and save the results in a new column called entities.
  8. Before applying the model to our data, we need to run an empty training step. We use the fit method on an empty dataframe because the model is pretrained.
  9. Then we apply the pipeline to our data and select only the fields that we’re interested in.
  10. Finally we write the data to the Kafka topic TOPIC_KEYWORDS.
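
The text preparation in steps 2 to 4 can be illustrated with a toy version in plain Python. This is only a sketch of the idea and does not use the Spark NLP API; the function names and the sample description are hypothetical, and the embeddings and NER model from steps 5 to 7 are not reproduced.

```python
import re


def detect_sentences(text):
    """Naive sentence splitter: break after ., ! or ? followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Split a sentence into HTML tags, words, and single punctuation marks."""
    return re.findall(r"<[^>]+>|\w+|[^\w\s]", sentence)


def normalize(tokens):
    """Drop HTML tags, purely numeric tokens, and punctuation."""
    kept = []
    for token in tokens:
        if len(token) > 1 and token.startswith("<") and token.endswith(">"):
            continue  # HTML tag
        if token.isdigit():
            continue  # numerical token
        if not any(ch.isalnum() for ch in token):
            continue  # punctuation
        kept.append(token)
    return kept


description = "Join our <b>Kafka</b> meetup in Bucharest! We start at 18 sharp."
for sentence in detect_sentences(description):
    print(normalize(tokenize(sentence)))
```

Unlike the step described above, this toy normalizer also drops punctuation, which keeps the output limited to word tokens.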

RSVPs Distribution

Project Documentation (1)

  1. The Stream Reader script fetches data on Yes RSVPs filtered by certain tags from the Meetup Stream API. It then selects the relevant columns that we need. After that it saves this data into the rsvps_filtered_stream Kafka topic.
  2. For each RSVP, the Stream Reader then reads event data for it, only if the event_id does not exist in the events.idx file. This way we make sure that we read event data only once. The setup for the Stream Reader  script can be found here: Install Kafka and fetch RSVPs
  3. Using KSQL we aggregate and join data from the 2 topics to create 1 Stream, RSVPS_JOINED_DATA, and subsequently 1 Table, RSVPS_FINAL_TABLE containing all RSVPs counts. The KSQL operations and their code can be found here: Kafka – Detailed Architecture
  4. Finally, we use the Vegas library to produce visualizations on the distribution of RSVPs around the world and in Romania. The Zeppelin notebook can be found here.
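
The "fetch event data only once" check from step 2 can be sketched as follows. This is a minimal Python illustration, not the actual Stream Reader script: it assumes that events.idx holds one event_id per line, and the function names and sample ids are hypothetical.

```python
import os
import tempfile


def load_seen_ids(index_path):
    """Read already-fetched event ids (assumed format: one id per line)."""
    if not os.path.exists(index_path):
        return set()
    with open(index_path, encoding="utf-8") as index_file:
        return {line.strip() for line in index_file if line.strip()}


def should_fetch_event(event_id, seen_ids, index_path):
    """Return True the first time an event_id is seen, and record it."""
    if event_id in seen_ids:
        return False
    seen_ids.add(event_id)
    with open(index_path, "a", encoding="utf-8") as index_file:
        index_file.write(event_id + "\n")
    return True


# Demo with a temporary index file and hypothetical RSVP event ids.
index_path = os.path.join(tempfile.mkdtemp(), "events.idx")
seen_ids = load_seen_ids(index_path)
rsvp_event_ids = ["e1", "e2", "e1", "e3", "e2"]
fetched = [eid for eid in rsvp_event_ids if should_fetch_event(eid, seen_ids, index_path)]
print(fetched)
```

Keeping the ids in a file means the check survives a restart of the script, at the cost of one small write per new event.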


We used a machine from Hetzner Cloud with the following specs:

    • CPU: Intel Xeon E3-1275v5 (4 cores/8 threads)
    • Storage: 2×480 GB SSD (RAID 0)
    • RAM: 64GB


RSVPs Distribution

These visualizations are done on data between 8th of May 22:15 UTC and 4th of June 11:23 UTC. 

Worldwide – Top Countries by Number of RSVPs

Top Countries by Number of RSVPs

Worldwide – Top Cities by Number of RSVPs

Top Cities by Number of RSVPs

As you can see, most of the RSVPs occur in the United States, but the city with the highest number of RSVPs is London. 

Worldwide – Top Events by Number of RSVPs

Top Events by Number of RSVPs

Romania – Top Cities in Romania by Number of RSVPs

Top Cities in Romania by Number of RSVPs

As you can see, most of the RSVPs are in the largest cities of the country. This is probably due to the fact that companies tend to establish their offices here and therefore attract talent to these places.

Romania – Top Meetup Events

top meetups

Romania – RSVPs Distribution

RSVPs Distribution in Romania

* This was produced with Grafana using RSVP data processed in Spark and saved locally.

Europe – RSVPs Distribution

RSVPs Distribution in Europe

* This was produced with Grafana using RSVP data processed in Spark and saved locally.

Trending Keywords


This visualization is done on data from July.

  Trending Keywords - Worldwide


This visualization is done on almost 1 week of data from the start of August. See the “Issues discovered along the way” section, point 5. 

Trending Keywords - Romania

Issues discovered along the way

All of these are mentioned in the published Notebooks as well.

    1. Visualizing data using Helium Zeppelin add-on and Vegas library directly from the stream did not work. We had to spill the data to disk, then build Dataframes on top of the files and finally do the visualizations.
    2. Spark NLP did not work for us in a Spark standalone local cluster installation (with local file system). Standalone local cluster means that the whole cluster (Spark Cluster Manager & Workers) runs on the same physical machine. Such a setup does not need distributed storage such as HDFS. The workaround for us was to configure Zeppelin to use local Spark, local[*], meaning a non-distributed single-JVM deployment mode available in Zeppelin.
    3. The Vegas plug-in could not be enabled initially. Following the GitHub recommendation – %dep z.load(“{vegas-version}”) – always raised an error. The workaround was to add all the dependencies manually in /opt/spark/jars. These dependencies can be found when launching spark-shell with the Vegas library – /opt/spark/bin/spark-shell --packages
    4. The Helium Zeppelin add-on did not work and could not be enabled. This too raised an error when enabling it from the Zeppelin GUI in our configuration. We did not manage to solve this issue. That’s why we used only Vegas, although it does not support Map visualizations. In the end we got a bit creative: we exported the data and loaded it into Grafana for Map visualizations.
    5. The default retention policy for Kafka is 7 days. This means that data that is older than 1 week is deleted. For some of the topics we changed this setting, but for some we forgot to do this and therefore we lost the data. This affected our visualization for the Trending Keywords in Romania.
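
Kafka expresses topic-level retention in milliseconds (the retention.ms setting), so it helps to convert days explicitly before changing it. Below is a minimal Python sketch; the helper name and the 90-day value are just illustrative assumptions.

```python
from datetime import timedelta


def retention_ms(days):
    """Convert a retention period in days to milliseconds."""
    return int(timedelta(days=days).total_seconds() * 1000)


default_retention = retention_ms(7)   # the default of 7 days
longer_retention = retention_ms(90)   # hypothetical longer period

print("7 days  =", default_retention, "ms")
print("90 days =", longer_retention, "ms")
```

The value has to be set for every topic whose history matters, which is exactly the step we missed for some of ours.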

Conclusions & Learning Points

  • In the world of Big Data you need clarity around the questions you’re trying to answer before building the Data Architecture, and then you need to follow the plan through to make sure you’re still working according to those questions. Otherwise, you might end up with something that can’t do what you actually need. It sounds like a pretty general statement, and a pretty “DOH, OBVIOUSLY” one. But once we had seen the visualizations, we realized that we did not create the Kafka objects according to our initial per-country keyword distribution visualization: e.g. we created the count aggregation across all countries in the KEYWORDS_COUNTED table. Combine this with the mistake of forgetting to change the Kafka retention period from the default 7 days, and by the time we realized the mistake we had lost the historical data as well. Major learning point.
  • Data should be filtered in advance of the ML/NLP process – we should have removed some keywords that don’t exactly make sense such as “de”, “da”. In order to get more relevant insights maybe several rounds of data cleaning and extracting the keywords might be needed. 
  • After seeing the final visualizations, we think we should probably have filtered out more of the obvious words. For example, Zoom was of course the highest-scoring keyword, since by June everybody was running only online meetups, mainly on Zoom.
  • Staying focused in yet another online meeting after a long day of remote work is hard. When we meet in person we use a lot of non-verbal cues to express ourselves and understand others. In online calls we need to devote an extra level of attention as these elements are non-existent. With time this gets really tiring. But, hey, this is what working groups look like in COVID times.
  • Working groups are a great way of learning :-). Just see below the feedback from one of our project members after we – finally – managed to end the project.

Ovidiu -> Double Combo : Meet great people and learn about Big Data

This study group was a great way for me to learn about an end-to-end solution that uses Kafka to ingest streaming data, Spark to process it and Zeppelin to build visualizations on it.

It was great that we all had different backgrounds and we had interesting debates about how to approach problems and how to solve them. Besides working with and getting to know these nice people, I got to learn the fundamentals of Kafka, how to use Spark Streaming to consume Kafka events, how a Named Entity Recognition system works, and how to use Spark NLP to train a NER model and make predictions with it. I’ve especially enjoyed one of our last meetings, where we worked together for a couple of hours and managed to build some great visualizations for meetup keywords and RSVPs using Zeppelin and Grafana.

On the downside, the project took almost twice as long to complete as we originally planned – 18 weeks instead of 10. Towards the end this made it hard for me to work on it as much as I would have wanted to, because it overlapped with another project that I had already planned for the summer.

All in all, I would recommend this experience for anyone interested in learning Big Data technologies together with other passionate people, in a casual and friendly environment.

Understanding Big Data Architecture E2E (Use case including Cassandra + Kafka + Spark + Zeppelin)  

Open Course: Understanding Big Data Architecture E2E (Use case including Cassandra + Kafka + Spark + Zeppelin)  
Timeline & Duration: July 27th – August 14th, 6 x 4-hour online sessions over 3 weeks (2 sessions/week, Monday + Thursday). An online setup will be available for exercises/hands-on sessions for the duration of the course. 
Main trainer: Valentina Crisan
Location: Online (Zoom)
Price: 250 EUR 
Pre-requisites: knowledge of distributed systems, Hadoop ecosystem (HDFS, MapReduce), know a bit of SQL.

More details and registration here.

Big Data Learning – Druid working group

Learning a new solution or building an architecture for a specific use case is never easy, especially when you are trying to embark alone on such an endeavour – thus in 2020 we started a new way of learning specific big data solutions/use cases: working groups. With the first working group (centered around Spark Structured Streaming + NLP) on its way to completion in July, we are now opening registration for a new working group, this time centered around Apache Druid: Building live dashboards with Apache Druid + Superset. The working group is planned to take place from the end of July to October and will bring together a team of 5-6 participants who will define the scope, select the data (open data), install the needed components and implement the needed flow. Besides the participants, this group will have a team of advisors (with experience in Druid and big data in general) who will advise the participants on how to solve the different issues that arise in the project.

Find more details of the working group here.

Big Data learning – Working groups 2020

Learning a new solution or building an architecture for a specific use case is never easy, especially when you are trying to work alone on such an endeavour – thus this year we will debut a new way of learning specific big data solutions/use cases: working groups.

What will these working groups mean:

  • A predefined topic (see below the topics for 2020) that will be either understanding a big data solution or building a use case;
  • A group of 5 participants and one predefined driver per group – the role of the driver is (besides being part of the group) to organize the group, provide the meeting locations and the cloud infrastructure needed for installing the studied solution;
  • 5 physical meetings, one every 2 weeks (thus a 10-week time window for each working group). The meetings will take place either during the week (5PM – 9PM) or on Saturday mornings (10AM – 2PM);
  • Active participation/contribution from each participant, for example each participant will have to present in 2 of the meetings to the rest of the group;
  • Some study @ home between the sessions;

More details and registration here.

Open course Big Data, September 25-28, 2019

Open course big data

Open Course: Big Data Architecture and Technology Concepts
Course duration: 3.5 days, September 25-28 (Wednesday-Friday 9:00 – 17:00, Saturday 9:30-13:00)
Trainers: Valentina Crisan, Felix Crisan
Location: Bucharest, TBD (location will be communicated to participants)
Price: 450 EUR, 10% discount early bird if registration is confirmed until 2nd of September – 405 EUR
Number of places: 10
Pre-requisites: knowledge of distributed systems, Hadoop ecosystem (HDFS, MapReduce), know a bit of SQL.


There are a few concepts and solutions that solutions architects should be aware of when evaluating or building a big data solution: what data partitioning means, how to model your data in order to get the best performance from your distributed system, what the best format for your data is, and what the best storage or the best way to analyze your data is. Solutions like HDFS, Hive, Cassandra, Hbase, Spark, Kafka and YARN should be known – not necessarily because you will work specifically with them – but mainly because knowing the concepts of these solutions will help you understand other similar solutions in the big data space. This course is designed to make sure the participants understand the usage and applicability of big data technologies like HDFS, Spark, Cassandra, Hbase and Kafka, and which aspects to consider when starting to build a Big Data architecture.

Please see details for the course and registration here:

Spark Structured Streaming vs Kafka Streams

workshop Spark Structured Streaming vs Kafka Streams

Date: TBD
Trainers: Felix Crisan, Valentina Crisan, Maria Catana
Location: TBD
Number of places: 20
Price: 150 RON (including VAT)

Stream processing can be solved at application level or cluster level (stream processing framework), and two of the existing solutions in these areas are Kafka Streams and Spark Structured Streaming, the former choosing a microservices approach by exposing an API and the latter extending the well-known Spark processing capabilities to structured streaming processing.

This workshop aims to discuss the major differences between the Kafka and Spark approach when it comes to streams processing: starting from the architecture, the functionalities, the limitations in both solutions, the possible use cases for both and some of the implementation details.

You can check out the agenda and register here.

Workshop Kafka Streams

workshop kafka streams

Date: 18 May, 9:00 – 13:30
Trainers: Felix Crisan, Valentina Crisan
Location: Adobe Romania , Anchor Plaza, Bulevardul Timișoara 26Z, București 061331
Number of places: 20 (no more places left)
Price: 150 RON (including VAT)

Stream processing is one of the most active topics in big data architecture discussions nowadays, with many open and proprietary solutions available on the market (Apache Spark Streaming, Apache Storm, Apache Flink, Google DataFlow...). But Apache Kafka has also introduced the capability to process the streams of data that flow through Kafka, so understanding what you can do with Kafka Streams and how it differs from other solutions on the market is key to knowing what to choose for your particular use case.

This workshop aims to cover the most important parts of Kafka Streams: the concepts (streams, tables, handling state, interactive queries...), the practicality (what you can do with it and what the difference is between the API and the KSQL server) and what building an application that uses Kafka Streams involves. We will focus on the stream processing part of Kafka, assuming that participants are already familiar with the basic concepts of Apache Kafka – the distributed messaging bus.


You can check out the agenda and register here.

About Neo4j…

In March we will restart our sessions: workshops mainly aimed at helping participants navigate big data architectures more easily and get a basic understanding of some of the possible components of such architectures. In the past we have discussed Cassandra, HDFS, Hive, Impala, Elasticsearch, Solr, Spark & Spark SQL and generic big data architectures, and on March 16th we will continue our journey with one of the unusual children of noSQL: the graph database Neo4j. Unlike its noSQL siblings, this database is not derived from the likes of DynamoDB or BigTable; instead it addresses the relationships between data, not just the data itself. The result is amazing, the use cases are incredible and Calin Constantinov will guide us through the basics of this interesting solution.

See below a few questions and answers in advance of the workshop, hopefully these will increase your curiosity towards Neo4j.

Valentina Crisan –

Calin Constantinov – trainer “Intro to Neo4j” workshop, March 16th


What is a graph database and which are the possible use cases that favour such a database?

Calin: They say “everything is a graph”. Indeed, even the good old entity-relationship diagram is no exception to this rule. And graphs come with a great “feature” which we humans tend to value very much: they are visual! Graphs can easily be represented on a whiteboard and immediately understood by a wide audience.

Moreover, in a traditional database, explicit relationships are destroyed the very moment we store data and need to be recreated on-demand using JOIN operations. A native graph database has preferential treatment for relationships meaning that there are actual pointers linking an entity to all its neighbors.
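The contrast between recreating relationships through JOINs and following direct pointers can be sketched in a few lines of Python. This is only an illustration of the idea, not of how Neo4j is implemented internally: the data, the `Node` class and both function names are invented for this example.

```python
# Toy data, invented for illustration only.
people = {1: "Ana", 2: "Bogdan", 3: "Carmen"}

# Relational style: relationships live in a separate table of (from_id, to_id)
# rows, so finding neighbours means searching that table (what a JOIN does).
friendships = [(1, 2), (1, 3), (2, 3)]


def neighbours_via_join(person_id):
    """Scan the relationship table and look up every matching row."""
    return [people[b] for (a, b) in friendships if a == person_id]


# Graph style: each node keeps direct references to its neighbours.
class Node:
    def __init__(self, name):
        self.name = name
        self.friends = []  # pointers to other Node objects


nodes = {pid: Node(name) for pid, name in people.items()}
for a, b in friendships:
    nodes[a].friends.append(nodes[b])


def neighbours_via_pointers(person_id):
    """Follow the stored references; no relationship table is searched."""
    return [friend.name for friend in nodes[person_id].friends]
```

Both functions return the same neighbours, but the first one does work proportional to the size of the relationship table, while the second only touches the neighbours of the starting node.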

I remember the first time I needed to implement a rather simple Access Control List solution that needed to support various inheritable groups, permissions and exceptions. Writing this in SQL can quickly become a nightmare.

But of course, the most popular example is social data similar to what Facebook generates. For some wild reason, imagine you need to figure out the year with the most events attended by at least 5 of your 2nd-degree connections (friends-of-friends), with the additional restriction that none of these 5 are friends with each other. I wouldn’t really enjoy implementing that with anything other than Neo4j!
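To make the two building blocks of that query concrete, here is a minimal Python sketch over an in-memory dictionary. It only covers "friends-of-friends" and "none of them are friends with each other", not the events or the yearly aggregation. The sample graph and the function names are assumptions made up for this illustration; in Neo4j the same question would be expressed as a Cypher pattern rather than hand-written traversal code.

```python
from itertools import combinations

# Hypothetical, symmetric friendship graph (person -> set of direct friends).
friends = {
    "me": {"ana", "bogdan"},
    "ana": {"me", "carmen", "dan"},
    "bogdan": {"me", "dan", "elena"},
    "carmen": {"ana"},
    "dan": {"ana", "bogdan"},
    "elena": {"bogdan"},
}


def second_degree(graph, person):
    """Friends of friends, excluding the person and their direct friends."""
    direct = graph[person]
    reachable = set()
    for friend in direct:
        reachable |= graph[friend]
    return reachable - direct - {person}


def none_are_friends(graph, group):
    """True if no two members of the group are directly connected."""
    return all(b not in graph[a] for a, b in combinations(sorted(group), 2))
```

Even this simplified version needs explicit set arithmetic and pairwise checks, which hints at how quickly the full query would grow in SQL or application code.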

However, not all graphs are meant to be stored in a graph database. For instance, while a set of unrelated documents can be represented as a graph with no edges, please don’t rush to using Neo4j for this use-case. I think a Document store is a better persistence choice.

In terms of adoption, 75% of the Fortune 100 companies are already using Neo4j. As for concrete use-case examples, Neo4j is behind eBay’s ShopBot for Graph-Powered Conversational Commerce while NBC News used it for uncovering 200K tweets tied to Russian trolls. My personal favourite is the “Panama Papers” where 2.6TB of spaghetti data, made up of 11.5M heterogeneous documents, was fed to Neo4j. And I think we all know the story that led the investigative team to win the Pulitzer Prize.

What graph databases exist out there and how is Neo4j different from those?

Calin: Full disclosure, given the wonderful community around it (and partially because it happened to be the top result of my Google search), it was love at first sight with me and Neo4j. So, while I’m only using Neo4j in my work, I do closely follow what’s happening in the graph world.

JanusGraph, “a graph database that carries forward the legacy of TitanDB”, is one of the most well-known alternatives. A major difference is that JanusGraph is more of a “graph abstraction layer”, meaning that it requires a storage backend instead of being a native graph database.

OrientDB is also popular due to its Multi-Model, Polyglot Persistence implementation. This means that it’s capable of storing graph, document and key/value data, while maintaining direct connections between records. The only drawback is that it might not yet have reached the maturity and stability required by the most data-intensive tasks out there.

More recently, TigerGraph showed impressive preliminary results, so I might need to check that out soon.

Is the Neo4j architecture a distributed one? Does it scale horizontally like other noSQL databases?

Calin: The short answer is that Neo4j can store graphs of any size in an ACID-compliant, distributed, Highly-Available, Causal Clustering architecture, where data replication is based on the state-of-the-art Raft protocol.

In order to achieve the best performance, we would probably need to partition the graph in some way. Unfortunately, this is typically an NP-hard problem and, more often than not, our graphs are densely connected, which can make any form of clustering quite challenging. To make matters worse, coming back to the Facebook example, we need to understand that this graph is constantly changing, with each tap of the “Like” button. This means that our graph database can easily end up spending more time finding a (sub-)optimal partition than actually responding to queries. Moreover, when combining a complex query with a bad partitioning of the data, you wind up requiring a lot of network transfers within the cluster, which will most likely cost more than a cache miss. In turn, this could also have a negative effect on query predictability. Sorry to disappoint you, but this is the reason why Neo4j doesn’t yet support data distribution. And it’s a good reason too!

So, a limitation in the way Neo4j scales is that every database instance has a complete replica of the graph. Ideally, for best performance, all instances need to have enough resources to keep the whole graph in memory. If this is not the case, in some scenarios, we can at least attempt to achieve cache sharding by identifying all queries hitting a given region of the graph and always routing them to the same instance. As a starting point, there is a built-in load-balancer which can potentially be extended to do that. Additionally, we can easily direct I/O requests intelligently in a heterogeneous environment, designating some Read Replicas for handling read queries while only writing to instances packing the most power. This is a good thing for read operations which can easily scale horizontally. Write operations are however the bottleneck. Nevertheless, the guys over at Neo4j are always coming up with clever ways to significantly improve write performance with every new release.
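The cache sharding idea (always sending queries about the same region of the graph to the same instance) can be illustrated with a small routing function. This is a generic sketch and does not use or represent Neo4j's built-in load balancer: the instance names, the `region_key` notion and the `route` function are all hypothetical.

```python
import hashlib

# Hypothetical instance names; every instance holds a full replica of the graph.
INSTANCES = ["replica-a", "replica-b", "replica-c"]


def route(region_key, instances=INSTANCES):
    """Map a graph region to one instance, deterministically.

    Because the same key always lands on the same instance, that instance's
    cache stays warm for that region even though it stores the whole graph.
    """
    digest = hashlib.sha256(region_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(instances)
    return instances[index]
```

A call such as `route("users:bucharest")` returns the same instance every time, so repeated queries over that region hit an already populated cache. Note that a plain modulo scheme like this reshuffles most keys when the number of instances changes; a real deployment would likely need something more careful.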

Does Neo4j work with unstructured/flexible structured data?

Calin: A graph is composed of nodes and relationships. We are able to group similar nodes together by attaching a label, such as “User” and “Post”. Similarly, a relationship can have a type, such as “LIKED” and “TAGGED”. Neo4j is a property graph meaning that multiple name-value pairs can be added both to relationships and nodes. While it is mandatory in Neo4j for relationships to have exactly one type, labels and properties are optional. New ones can be defined on-the-fly and nodes and relationships of the same type don’t all necessarily need to have the same properties. If needed, Neo4j does support indexes and constraints, which can, for instance, improve query performance, but this is as close as you get to an actual schema.
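As a rough mental model of the property graph described above, here is a minimal Python sketch using dataclasses. It is not Neo4j's storage format or API, and the class and field names are chosen for this illustration only. It encodes the rules from the paragraph: labels and properties are optional, while a relationship must have exactly one type.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    labels: set = field(default_factory=set)         # optional, e.g. {"User"}
    properties: dict = field(default_factory=dict)   # optional name-value pairs


@dataclass
class Relationship:
    start: Node
    end: Node
    type: str                                        # mandatory, exactly one
    properties: dict = field(default_factory=dict)   # optional name-value pairs

    def __post_init__(self):
        if not self.type:
            raise ValueError("a relationship needs exactly one type")


# Two nodes with the same label do not need the same properties.
user = Node(labels={"User"}, properties={"name": "Ana"})
other_user = Node(labels={"User"}, properties={"name": "Dan", "city": "Cluj"})
post = Node(labels={"Post"})
liked = Relationship(start=user, end=post, type="LIKED", properties={"year": 2018})
```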

With regard to the question, the whole point of a graph database is to structure your data in some way. Keep in mind that the true value of your data lies in uncovering relationships between entities. If you feel like this doesn’t fit your use-case, coming back to the example of having a number of unrelated free-text documents or even some form of semi-structured data, while Neo4j now supports full text indexing and search, there are clearly better alternatives out there, such as key-value and document stores.

How is it best and easiest to get started with Neo4j?

Calin: Apart from attending my workshop, right? I think the best way to get up to speed with Neo4j is to use the Neo4j Sandbox. It’s a cloud-based trial environment which only requires a browser to work and which comes preloaded with a bunch of interesting datasets. If you’re into a more academic approach, I highly recommend grabbing a copy of “Graph Databases” or “Neo4j in Action”.

Can you detail how users can interact with Neo4j? What about developers, are there pre-built drivers or interfaces?

Calin: Neo4j packs a very nice Browser that enables users to quite easily query and visualize the graph. This comes with syntax highlighting and autocompletion for Cypher, the query language used by Neo4j. It also features a very handy way to interact with query execution plans.

Developers can “talk” with the database using a REST API or, better yet, the proprietary binary protocol called Bolt, which is already uniformly encapsulated by a number of official drivers, covering the most popular programming languages out there.

However, as I don’t want to spoil the fun, that’s all you’re getting out of me today. But do come and join us on the 16th of March. Please.