ETL with Apache Spark


Workshop date & duration: March 28th, 2020, 9:30 – 14:00, 30 min break included
Trainers: Valentina Crisan, Maria Catana
Location:  eSolutions Academy, Budişteanu Office Building, strada General Constantin Budişteanu Nr. 28C, etaj 1, Sector 1, Bucureşti
Price: 150 RON (including VAT)
Number of places: 10 (no places left)
Languages: Scala & SQL

Description:

One of the many uses of Apache Spark is to transform data from different formats and sources, both batch and streaming. In this mainly hands-on workshop we will focus on just that: understanding how we can read, write, transform, join and manage the schema of different formats of data, and how best to handle that data in Apache Spark. So, if you know a bit about Spark but haven’t managed to play much with its ETL capabilities – or even if you don’t know much but would like to find out – this workshop might be of interest.
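For a taste of the kind of flow the workshop covers, below is a minimal sketch of reading, transforming, joining and writing data with the Spark DataFrame API. The workshop itself uses Scala & SQL; this sketch uses the equivalent PySpark API, and the file paths, schemas and column names are invented purely for illustration.

    # A minimal sketch of a typical Spark ETL flow (read -> transform -> join -> write).
    # File paths, column names and the "orders"/"customers" datasets are made up.
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StructField, StringType, DoubleType, DateType

    spark = SparkSession.builder.appName("etl-sketch").getOrCreate()

    # Managing the schema explicitly instead of relying on inference
    orders_schema = StructType([
        StructField("order_id", StringType()),
        StructField("customer_id", StringType()),
        StructField("amount", DoubleType()),
        StructField("order_date", DateType()),
    ])

    # Read a CSV source...
    orders = (spark.read
              .option("header", "true")
              .schema(orders_schema)
              .csv("/data/in/orders.csv"))

    # ...transform it and join it with a JSON source...
    customers = spark.read.json("/data/in/customers.json")
    enriched = (orders
                .withColumn("year", F.year("order_date"))
                .join(customers, "customer_id", "left"))

    # ...and write the result as Parquet, partitioned by year
    enriched.write.mode("overwrite").partitionBy("year").parquet("/data/out/orders_enriched")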

You can check out the agenda and register here.

Fast and Scalable: SpringBoot + SOLR


Workshop date & duration: March 21st, 2020, 9:30 – 14:00, 30 min break included
Trainer: Oana Brezai
Location:  eSolutions Academy, Budişteanu Office Building, strada General Constantin Budişteanu Nr. 28C, etaj 1, Sector 1, Bucureşti
Price: 150 RON (including VAT)
Number of places: 10
Languages: Java

Description:

You have probably heard of SOLR by now – it’s only the open source search platform that powers the search and navigation features of many of the world’s largest internet sites (e.g. AOL, Apple, Netflix). During this workshop, you will build (under my guidance) a basic web shop (catalog + search part) using SOLR and SpringBoot.

 

You can check out the agenda and register here.

Big Data learning – Working groups 2020

Learning a new solution or building an architecture for a specific use case is never easy, especially when you are trying to work alone on such an endeavour – thus this year we will debut a new way of learning specific big data solutions/use cases: working groups.

What these working groups will mean:

  • A predefined topic (see below the topics for 2020) that will be either understanding a big data solution or building a use case;
  • A group of 5 participants and one predefined driver per group – the role of the driver (besides being part of the group) is to organize the group, provide the meeting location and the cloud infrastructure needed for installing the studied solution;
  • 5 physical meetings, one every 2 weeks (thus a 10-week window for each working group). The meetings will take place either during the week (5PM – 9PM) or on Saturday mornings (10AM – 2PM);
  • Active participation/contribution from each participant – for example, each participant will have to present to the rest of the group in 2 of the meetings;
  • Some study at home between the sessions.

More details and registration here.

Intro to Spark Structured Streaming using Scala and Apache Kafka


Workshop date & duration: February 1st, 2020, 9:30 – 14:00, 30 min break included
Trainers: Valentina Crisan, Maria Catana
Location:  eSolutions Academy, Budişteanu Office Building, strada General Constantin Budişteanu Nr. 28C, etaj 1, Sector 1, Bucureşti
Price: 150 RON (including VAT)
Number of places: 10 (no places left)
Languages: Scala & SQL

Description:

Structured streaming processing was introduced starting with Spark 2.0, modeling the stream as an unbounded/infinite table – a big architectural change compared with the DStream model (Spark Streaming) that existed prior to Spark 2.0. The workshop will introduce you to how Spark can read, process & analyze streams of data – we will use stream data from Apache Kafka, and Scala & SQL for reading/processing/analyzing the data. We will also discuss stateless vs stateful queries and how Spark handles out-of-order data in aggregation queries.
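For a taste of the hands-on part, below is a minimal sketch of a stateful Structured Streaming query reading from Kafka. The workshop itself uses Scala & SQL; this sketch uses the equivalent PySpark API, and the broker address, topic name, JSON fields and window/watermark sizes are invented (running it also needs the spark-sql-kafka connector package on the classpath).

    # A minimal sketch: read a Kafka topic with Structured Streaming and run a
    # stateful windowed aggregation; broker, topic and JSON fields are invented.
    from pyspark.sql import SparkSession, functions as F
    from pyspark.sql.types import StructType, StructField, StringType, TimestampType

    spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

    event_schema = StructType([
        StructField("user_id", StringType()),
        StructField("event_time", TimestampType()),
    ])

    # Each Kafka record arrives as key/value bytes; parse the value as JSON
    events = (spark.readStream
              .format("kafka")
              .option("kafka.bootstrap.servers", "localhost:9092")
              .option("subscribe", "events")
              .load()
              .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
              .select("e.*"))

    # A stateful query: count events per user in 10-minute windows;
    # the watermark tells Spark how long to wait for out-of-order data
    counts = (events
              .withWatermark("event_time", "15 minutes")
              .groupBy(F.window("event_time", "10 minutes"), "user_id")
              .count())

    query = (counts.writeStream
             .outputMode("update")
             .format("console")
             .start())
    query.awaitTermination()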

 

You can check out the agenda and register here.

ML on Spark workshop

WORKSHOP MACHINE LEARNING WITH SPARK

Workshop date & duration:  November 5th – Tuesday, 14:00 – 18:00
Trainer: Sorin Peste
Supporting students: Alexandru Petrescu, Laurentiu Partas
Location: TechHub Bucuresti
Price: Free (upon approval by the organizer & trainer)
Number of places: 20 (no places left)
Languages: Python

Description:
We are coming back in November with a new workshop on Machine Learning – this time on how to build a model using Spark ML logistic regression and gradient boosting.
So, come join us for an afternoon in which we will explore Apache Spark’s Machine Learning capabilities. We’ll be looking at using Spark to build a Credit Scoring model which estimates the probability of default for current and existing customers.
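For a rough idea of what such a model looks like in code, below is a minimal Spark ML sketch of a logistic regression credit-scoring pipeline; the dataset path, feature columns and label are invented for illustration. Swapping LogisticRegression for GBTClassifier would give the gradient boosting variant mentioned above.

    # A minimal sketch of a Spark ML credit-scoring pipeline with logistic regression;
    # the input path, feature columns and label column are invented.
    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    spark = SparkSession.builder.appName("credit-scoring-sketch").getOrCreate()

    # Hypothetical dataset: one row per customer, "defaulted" is the 0/1 label
    data = spark.read.parquet("/data/credit_applications.parquet")
    train, test = data.randomSplit([0.8, 0.2], seed=42)

    assembler = VectorAssembler(
        inputCols=["age", "income", "debt_ratio", "num_late_payments"],
        outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol="defaulted")

    model = Pipeline(stages=[assembler, lr]).fit(train)

    # The model outputs a probability of default per customer; evaluate with AUC
    predictions = model.transform(test)
    auc = BinaryClassificationEvaluator(labelCol="defaulted").evaluate(predictions)
    print(f"Test AUC: {auc:.3f}")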

You can check out the agenda and register here.

Open course Big Data, September 25-28, 2019


Open Course: Big Data Architecture and Technology Concepts
Course duration: 3.5 days, September 25-28 (Wednesday-Friday 9:00 – 17:00, Saturday 9:30-13:00)
Trainers: Valentina Crisan, Felix Crisan
Location: Bucharest, TBD (location will be communicated to participants)
Price: 450 EUR; 10% early bird discount (405 EUR) if registration is confirmed by the 2nd of September
Number of places: 10
Prerequisites: knowledge of distributed systems and the Hadoop ecosystem (HDFS, MapReduce), plus a bit of SQL.

Description:

There are a few concepts and solutions that a solutions architect should be aware of when evaluating or building a big data solution: what data partitioning means, how to model your data in order to get the best performance from a distributed system, what the best format for your data is, and what the best storage is or the best way to analyze your data. Solutions like HDFS, Hive, Cassandra, HBase, Spark, Kafka and YARN should be known – not necessarily because you will work specifically with them, but mainly because knowing the concepts behind these solutions will help you understand other similar solutions in the big data space. This course is designed to make sure the participants understand the usage and applicability of big data technologies like HDFS, Spark, Cassandra, HBase and Kafka, and which aspects to consider when starting to build a Big Data architecture.

Please see details for the course and registration here: https://bigdata.ro/open-course-big-data-september-25-28-2019/

ML Intro using Decision Trees and Random Forest


Using Python, Jupyter and scikit-learn

Course date & duration: July 13th, 9:30 – 14:00, 30 min break included
Trainer: Tudor Lapusan
Location: Impact Hub Bucuresti, Timpuri Noi area
Price: 200 RON (including VAT)
Number of places: 20 (no places left)
Languages: Python

Getting started with Machine Learning can seem a pretty hefty task to some people: understanding the algorithms, learning a bit of programming, deciding which libraries to use, getting some data to learn on, etc. But in reality, if you set your expectations right and are willing to start small and learn step by step, learning the basics of ML is actually quite doable. After that, it’s up to you to take your knowledge into the real world and apply and expand what you have learned.

This workshop aims to introduce you to the ML world and to teach you how to solve classification and regression problems using decision tree and random forest algorithms. We will go from theory to hands-on in just a couple of hours, aiming mostly to help you understand the main pipeline of an ML project, while of course learning a bit of ML:

    • Software and hardware requirements for a ML project
    • Common Python libraries for data analysis
    • Feature encoding and feature preprocessing
    • Exploratory Data Analysis (EDA)
    • Model validation
    • Model hyperparameter optimization
    • Tree-based models for classification and regression:
      • Decision Tree
      • Random Forest
    • Iterative model improvement
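To make the pipeline above a bit more concrete, below is a minimal scikit-learn sketch touching a few of the agenda points (a train/test split, a decision tree baseline, a random forest with a small hyperparameter search). It uses a built-in toy dataset rather than the workshop’s data, so treat it purely as an illustration.

    # A minimal scikit-learn sketch of the classification pipeline outlined above,
    # using a built-in toy dataset rather than the workshop's data.
    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split, GridSearchCV
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

    # Baseline model: a single decision tree
    tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_train, y_train)
    print("Decision tree accuracy:", accuracy_score(y_test, tree.predict(X_test)))

    # Random forest with a small hyperparameter search (model validation + optimization)
    grid = GridSearchCV(
        RandomForestClassifier(random_state=0),
        param_grid={"n_estimators": [100, 300], "max_depth": [4, 8, None]},
        cv=5)
    grid.fit(X_train, y_train)
    print("Best params:", grid.best_params_)
    print("Random forest accuracy:", accuracy_score(y_test, grid.best_estimator_.predict(X_test)))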

You can check out the agenda and register here.

Spark Structured Streaming vs Kafka Streams


Date: TBD
Trainers: Felix Crisan, Valentina Crisan, Maria Catana
Location: TBD
Number of places: 20
Price: 150 RON (including VAT)

Stream processing can be solved at the application level or at the cluster level (with a stream processing framework), and two of the existing solutions in these areas are Kafka Streams and Spark Structured Streaming – the former choosing a microservices approach by exposing an API, the latter extending the well-known Spark processing capabilities to structured streaming.

This workshop aims to discuss the major differences between the Kafka and Spark approaches to stream processing: the architecture, the functionalities, the limitations of both solutions, the possible use cases for each, and some of the implementation details.

You can check out the agenda and register here.

Workshop Kafka Streams


Date: 18 May, 9:00 – 13:30
Trainers: Felix Crisan, Valentina Crisan
Location: Adobe Romania , Anchor Plaza, Bulevardul Timișoara 26Z, București 061331
Number of places: 20 (no places left)
Price: 150 RON (including VAT)

Stream processing is one of the most active topics in big data architecture discussions nowadays, with many open source and proprietary solutions available on the market (Apache Spark Streaming, Apache Storm, Apache Flink, Google Dataflow, ...). But starting with release 0.10.0.0, Apache Kafka also introduced the capability to process the streams of data that flow through Kafka – thus understanding what you can do with Kafka Streams and how it differs from other solutions on the market is key to knowing what to choose for your particular use case.

This workshop aims to cover the most important parts of Kafka Streams: the concepts (streams, tables, handling state, interactive queries, ...), the practicality (what you can do with it and what the difference is between the API and the KSQL server), and what it means to build an application that uses Kafka Streams. We will be focusing on the stream processing part of Kafka, assuming that participants are already familiar with the basic concepts of Apache Kafka – the distributed messaging bus.

 

You can check out the agenda and register here.

About Neo4j…

In March we will restart our bigdata.ro sessions – workshops mainly aimed at helping participants navigate more easily through big data architectures and get a basic understanding of some of the possible components of such architectures. In the past we have discussed Cassandra, HDFS, Hive, Impala, Elasticsearch, Solr, Spark & Spark SQL and generic big data architectures, and on March 16th we will continue our journey with one of the unusual children of noSQL: the graph database Neo4j. Not quite like its other noSQL siblings, this database is not derived from the likes of DynamoDB or BigTable; instead it addresses the relationships between data, not just the data itself. The result is amazing, the use cases are incredible, and Calin Constantinov will guide us through the basics of this interesting solution.

See below a few questions and answers ahead of the workshop – hopefully they will increase your curiosity about Neo4j.

Valentina Crisan – bigdata.ro

Calin Constantinov – trainer “Intro to Neo4j” workshop, March 16th

—————————————————————————————————————————————

What is a graph database and which are the possible use cases that favour such a database?

Calin: They say “everything is a graph”. Indeed, even the good old entity-relationship diagram is no exception to this rule. And graphs come with a great “feature” which we humans tend to value very much: they are visual! Graphs can easily be represented on a whiteboard and immediately understood by a wide audience.

Moreover, in a traditional database, explicit relationships are destroyed the very moment we store data and need to be recreated on demand using JOIN operations. A native graph database gives preferential treatment to relationships, meaning that there are actual pointers linking an entity to all its neighbors.

I remember the first time I needed to implement a rather simple Access Control List solution that needed to support various inheritable groups, permissions and exceptions. Writing this in SQL can quickly become a nightmare.

But of course, the most popular example is social data similar to what Facebook generates. For some wild reason, imagine you need to figure out the year with the most events attended by at least 5 of your 2nd degree connections (friends-of-friends), with the additional restriction that none of these 5 are friends with each other. I wouldn’t really enjoy implementing that with anything other than Neo4j!

However, not all graphs are meant to be stored in a graph database. For instance, while a set of unrelated documents can be represented as a graph with no edges, please don’t rush to using Neo4j for this use-case. I think a Document store is a better persistence choice.

In terms of adoption, 75% of the Fortune 100 companies are already using Neo4j. As for concrete use-case examples, Neo4j is behind eBay’s ShopBot for Graph-Powered Conversational Commerce while NBC News used it for uncovering 200K tweets tied to Russian trolls. My personal favourite is the “Panama Papers” where 2.6TB of spaghetti data, made up of 11.5M heterogeneous documents, was fed to Neo4j. And I think we all know the story that led the investigative team to win the Pulitzer Prize.
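As a rough illustration of the friends-of-friends example above, here is a hedged Cypher sketch wrapped in the official Python driver. The labels (:Person, :Event), relationship types (:FRIEND, :ATTENDED), properties and connection details are all invented, and the restriction that none of the 5 are friends with each other is left out to keep it short; the same query could just be pasted into the Neo4j Browser.

    # A sketch of the friends-of-friends query described above, with invented labels,
    # relationship types and properties; the "no two of them are friends with each
    # other" restriction is omitted for brevity.
    from neo4j import GraphDatabase

    query = """
    MATCH (me:Person {name: $name})-[:FRIEND]-()-[:FRIEND]-(fof:Person)
    WHERE fof <> me AND NOT (me)-[:FRIEND]-(fof)
    MATCH (fof)-[:ATTENDED]->(e:Event)
    WITH e, count(DISTINCT fof) AS fofs
    WHERE fofs >= 5
    RETURN e.year AS year, count(e) AS events
    ORDER BY events DESC
    LIMIT 1
    """

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
    with driver.session() as session:
        for record in session.run(query, name="Calin"):
            print(record["year"], record["events"])
    driver.close()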

What graph databases exist out there and how is Neo4j different from those?

Calin: Full disclosure, given the wonderful community around it (and partially because it happened to be the top result of my Google search), it was love at first sight with me and Neo4j. So, while I’m only using Neo4j in my work, I do closely follow what’s happening in the graph world.

JanusGraph, “a graph database that carries forward the legacy of TitanDB”, is one of the most well-known alternatives. A major difference is that JanusGraph is more of a “graph abstraction layer”, meaning that it requires a storage backend instead of being a native graph database.

OrientDB is also popular due to its Multi-Model, Polyglot Persistence implementation. This means that it’s capable of storing graph, document and key/value data, while maintaining direct connections between records. The only drawback is that it might not yet have reached the maturity and stability required by the most data-intensive tasks out there.

More recently, TigerGraph showed impressive preliminary results, so I might need to check that out soon.

Is the Neo4j architecture a distributed one? Does it scale horizontally like other noSQL databases?

Calin: The short answer is that Neo4j can store graphs of any size in an ACID-compliant, distributed, Highly-Available, Causal Clustering architecture, where data replication is based on the state-of-the-art Raft protocol.

In order to achieve the best performance, we would probably need to partition the graph in some way. Unfortunately, this is typically an NP-hard problem and, more often than not, our graphs are densely connected, which can really make any form of clustering quite challenging. To make matters worse, coming back to the Facebook example, we need to understand that this graph is constantly changing, with each tap of the “Like” button. This means that our graph database can easily end up spending more time finding a (sub-)optimal partition than actually responding to queries. Moreover, when combining a complex query with a bad partitioning of the data, you wind up requiring a lot of network transfers within the cluster, which will most likely cost more than a cache miss. In turn, this could also have a negative effect on query predictability. Sorry to disappoint you, but this is the reason why Neo4j doesn’t yet support data distribution. And it’s a good reason too!

So, a limitation in the way Neo4j scales is that every database instance has a complete replica of the graph. Ideally, for best performance, all instances need to have enough resources to keep the whole graph in memory. If this is not the case, in some scenarios, we can at least attempt to achieve cache sharding by identifying all queries hitting a given region of the graph and always routing them to the same instance. As a starting point, there is a built-in load-balancer which can potentially be extended to do that. Additionally, we can easily direct I/O requests intelligently in a heterogeneous environment, designating some Read Replicas for handling read queries while only writing to instances packing the most power. This is a good thing for read operations which can easily scale horizontally. Write operations are however the bottleneck. Nevertheless, the guys over at Neo4j are always coming up with clever ways to significantly improve write performance with every new release.
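To make the read/write split concrete, here is a hedged sketch using the official Python driver against a Causal Cluster: with a routing URI the driver sends write transactions to the leader and can serve read transactions from followers or Read Replicas. The host name, credentials and the tiny User/Post model are assumptions made for illustration.

    # A sketch of splitting reads and writes against a Neo4j Causal Cluster; the
    # "neo4j://" routing scheme (driver 4.x; "bolt+routing://" in 1.x) lets the driver
    # send writes to the cluster leader and route reads to followers / Read Replicas.
    # Host name, credentials and queries are invented.
    from neo4j import GraphDatabase

    driver = GraphDatabase.driver("neo4j://cluster.example.com:7687",
                                  auth=("neo4j", "password"))

    def add_like(tx, user, post):
        tx.run("MERGE (u:User {name: $user}) "
               "MERGE (p:Post {id: $post}) "
               "MERGE (u)-[:LIKED]->(p)", user=user, post=post)

    def count_likes(tx, post):
        return tx.run("MATCH (:User)-[:LIKED]->(p:Post {id: $post}) "
                      "RETURN count(*) AS likes", post=post).single()["likes"]

    with driver.session() as session:
        session.write_transaction(add_like, "ana", 42)      # routed to the leader
        print(session.read_transaction(count_likes, 42))    # can be served by a replica
    driver.close()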

Does Neo4j work with unstructured/flexible structured data?

Calin: A graph is composed of nodes and relationships. We are able to group similar nodes together by attaching a label, such as “User” and “Post”. Similarly, a relationship can have a type, such as “LIKED” and “TAGGED”. Neo4j is a property graph meaning that multiple name-value pairs can be added both to relationships and nodes. While it is mandatory in Neo4j for relationships to have exactly one type, labels and properties are optional. New ones can be defined on-the-fly and nodes and relationships of the same type don’t all necessarily need to have the same properties. If needed, Neo4j does support indexes and constraints, which can, for instance, improve query performance, but this is as close as you get to an actual schema.
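A small hedged Cypher sketch of the model just described, run through the official Python driver – the constraint, labels, properties and connection details are invented:

    # A sketch of the property graph model: labels, optional properties, a typed
    # relationship with its own properties, and a uniqueness constraint (about as
    # close as Neo4j gets to a schema). Everything here is made up for illustration.
    from neo4j import GraphDatabase

    statements = [
        "CREATE CONSTRAINT ON (u:User) ASSERT u.name IS UNIQUE",
        # Two :User nodes that do not share the same set of properties
        "CREATE (:User {name: 'Ana', city: 'Bucharest'})",
        "CREATE (:User {name: 'Calin'})",
        # A relationship with a type and its own properties
        "MATCH (a:User {name: 'Ana'}), (c:User {name: 'Calin'}) "
        "CREATE (a)-[:FRIEND {since: 2019}]->(c)",
    ]

    driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
    with driver.session() as session:
        for statement in statements:
            session.run(statement)
    driver.close()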

Regarding the question, the whole point of a graph database is to structure your data in some way. Keep in mind that the true value of your data lies in uncovering relationships between entities. If you feel this doesn’t fit your use-case – coming back to the example of a number of unrelated free-text documents or even some form of semi-structured data – then, while Neo4j now supports full text indexing and search, there are clearly better alternatives out there, such as key-value and document stores.

How is it best and easiest to get started with Neo4j?

Calin: Apart from attending my workshop, right? I think the best way to get up to speed with Neo4j is to use the Neo4j Sandbox. It’s a cloud-based trial environment which only requires a browser to work and which comes preloaded with a bunch of interesting datasets. If you’re into a more academic approach, I highly recommend grabbing a copy of “Graph Databases” or “Neo4j in Action”.

Can you detail how users can interact with Neo4j? What about developers – are there pre-built drivers or interfaces?

Calin: Neo4j packs a very nice Browser that enables users to quite easily query and visualize the graph. This comes with syntax highlighting and autocompletion for Cypher, the query language used by Neo4j. It also features a very handy way to interact with query execution plans.

Developers can “talk” with the database using a REST API or, better yet, the proprietary binary protocol called Bolt, which is already uniformly encapsulated by a number of official drivers, covering the most popular programming languages out there.

However, as I don’t want to spoil the fun, that’s all you’re getting out of me today. But do come and join us on the 16th of March. Please.