Workshop Kafka Streams


Date: 18 May, 9:00 – 13:30
Trainers: Felix Crisan, Valentina Crisan
Location: Adobe Romania, Anchor Plaza, Bulevardul Timișoara 26Z, București 061331
Number of places: 20
Price: 150 RON (including VAT)

Stream processing is one of the most active topics in big data architecture discussions nowadays, with many open source and proprietary solutions available on the market (Apache Spark Streaming, Apache Storm, Apache Flink, Google Dataflow, etc.). But starting with release 0.11.0.0, Apache Kafka also introduced the capability to process the streams of data flowing through Kafka – so understanding what you can do with Kafka Streams, and how it differs from other solutions on the market, is key to choosing the right tool for your particular use case.

This workshop aims to cover the most important parts of Kafka Streams: the concepts (streams, tables, handling state, interactive queries, ...), the practicalities (what you can do with it and what the difference is between the API and the KSQL server) and what it means to build an application that uses Kafka Streams. We will focus on the stream processing part of Kafka, assuming that participants are already familiar with the basic concepts of Apache Kafka – the distributed messaging bus.
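
To give a feel for the API ahead of the workshop, here is a minimal word-count sketch in Java using the Kafka Streams DSL (Kafka 1.0+ API): it reads a stream of text lines, derives a continuously updated count table from it, and writes the changelog back to Kafka. The topic names and the local broker address are illustrative assumptions, not part of the workshop material.

```java
import java.util.Arrays;
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class WordCountSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "wordcount-sketch");   // also used as consumer group / state store prefix
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");  // assumed local broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // A KStream is an unbounded sequence of records read from a topic.
        KStream<String, String> lines = builder.stream("text-input");

        // A KTable is the continuously updated result of an aggregation: the stream/table duality.
        KTable<String, Long> counts = lines
                .flatMapValues(line -> Arrays.asList(line.toLowerCase().split("\\W+")))
                .groupBy((key, word) -> word)
                .count();

        // Emit every update of the table as a changelog stream to an output topic.
        counts.toStream().to("word-counts", Produced.with(Serdes.String(), Serdes.Long()));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }
}
```

Roughly the same aggregation can be expressed declaratively in KSQL as a CREATE TABLE ... AS SELECT with a GROUP BY, which is one of the API-versus-KSQL comparisons the workshop touches on.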

 

You can check out the agenda and register here.

About Neo4j…

In March we will restart our bigdata.ro sessions, workshops mainly aimed at helping participants navigate big data architectures more easily and gain a basic understanding of some of the possible components of such architectures. We have discussed in the past Cassandra, HDFS, Hive, Impala, Elasticsearch, Solr, Spark & Spark SQL and generic big data architectures, and on March 16th we will continue our journey with one of the unusual children of noSQL: the graph database Neo4j. Not quite like its noSQL siblings, this database is not derived from the likes of DynamoDB or BigTable as many others are; instead it addresses the relationships between data, not just the data itself. The result is remarkable, the use cases are compelling, and Calin Constantinov will guide us through the basics of this interesting solution.

Below are a few questions and answers ahead of the workshop; hopefully they will increase your curiosity about Neo4j.

Valentina Crisan – bigdata.ro

Calin Constantinov – trainer “Intro to Neo4j” workshop, March 16th

—————————————————————————————————————————————

What is a graph database and what are the possible use cases that favour such a database?

Calin: They say “everything is a graph”. Indeed, even the good old entity-relationship diagram is no exception to this rule. And graphs come with a great “feature” which we humans tend to value very much: they are visual! Graphs can easily be represented on a whiteboard and immediately understood by a wide audience.

Moreover, in a traditional database, explicit relationships are destroyed the very moment we store data and need to be recreated on demand using JOIN operations. A native graph database gives preferential treatment to relationships, meaning that there are actual pointers linking an entity to all its neighbours.

I remember the first time I needed to implement a rather simple Access Control List solution that needed to support various inheritable groups, permissions and exceptions. Writing this in SQL can quickly become a nightmare.

But of course, the most popular example is social data similar to what Facebook generates. For some wild reason, imagine you need to figure out the year with the most events attended by at least 5 of your 2nd-degree connections (friends-of-friends), with the additional restriction that none of these 5 are friends with each other. I wouldn’t really enjoy implementing that with anything other than Neo4j!
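
For a taste of what such a query looks like, here is a deliberately simplified sketch in Cypher, run through the official Java driver: it only finds second-degree connections that are not already direct friends, leaving out the event-attendance and mutual-friendship constraints from the example above. The :User label, the :FRIEND relationship type, the property names and the credentials are hypothetical.

```java
import org.neo4j.driver.v1.AuthTokens;
import org.neo4j.driver.v1.Driver;
import org.neo4j.driver.v1.GraphDatabase;
import org.neo4j.driver.v1.Record;
import org.neo4j.driver.v1.Session;
import org.neo4j.driver.v1.StatementResult;
import org.neo4j.driver.v1.Values;

public class FriendsOfFriendsSketch {
    public static void main(String[] args) {
        // Assumed local instance reachable over Bolt with basic auth.
        Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "secret"));

        // Second-degree connections that are not already direct friends.
        String cypher =
                "MATCH (me:User {name: $name})-[:FRIEND]-(:User)-[:FRIEND]-(fof:User) " +
                "WHERE fof <> me AND NOT (me)-[:FRIEND]-(fof) " +
                "RETURN DISTINCT fof.name AS name";

        try (Session session = driver.session()) {
            StatementResult result = session.run(cypher, Values.parameters("name", "Alice"));
            while (result.hasNext()) {
                Record record = result.next();
                System.out.println(record.get("name").asString());
            }
        }
        driver.close();
    }
}
```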

However, not all graphs are meant to be stored in a graph database. For instance, while a set of unrelated documents can be represented as a graph with no edges, please don’t rush to use Neo4j for this use case. I think a document store is a better persistence choice.

In terms of adoption, 75% of the Fortune 100 companies are already using Neo4j. As for concrete use-case examples, Neo4j is behind eBay’s ShopBot for Graph-Powered Conversational Commerce while NBC News used it for uncovering 200K tweets tied to Russian trolls. My personal favourite is the “Panama Papers” where 2.6TB of spaghetti data, made up of 11.5M heterogeneous documents, was fed to Neo4j. And I think we all know the story that led the investigative team to win the Pulitzer Prize.

What graph databases exist out there and how is Neo4j different from those?

Calin: Full disclosure, given the wonderful community around it (and partially because it happened to be the top result of my Google search), it was love at first sight with me and Neo4j. So, while I’m only using Neo4j in my work, I do closely follow what’s happening in the graph world.

JanusGraph, “a graph database that carries forward the legacy of TitanDB” is one of the most well-known alternatives. A major difference is that JanusGraph is more of a “graph abstraction layer” meaning that it requires a storage backend instead of it being a native graph.

OrientDB is also popular due to its Multi-Model, Polyglot Persistence implementation. This means that it’s capable of storing graph, document and key/value data, while maintaining direct connections between records. The only drawback is that it might not yet have reached the maturity and stability required by the most data-intensive tasks out there.

More recently, TigerGraph showed impressive preliminary results, so I might need to check that out soon.

Is the Neo4j architecture a distributed one? Does it scale horizontally like other noSQL databases?

Calin: The short answer is that Neo4j can store graphs of any size in an ACID-compliant, distributed, Highly-Available, Causal Clustering architecture, where data replication is based on the state-of-the-art Raft protocol.

In order to achieve the best performance, we would probably need to partition the graph in some way. Unfortunately, this is typically an NP-hard problem and, more often than not, our graphs are densely connected, which can really make some form of clustering quite challenging. To make matters worse, coming back to the Facebook example, we need to understand that this graph is constantly changing, with each tap of the “Like” button. This means that our graph database can easily end up spending more time finding a (sub-)optimal partition than actually responding to queries. Moreover, when combining a complex query with a bad partitioning of the data, you wind up requiring a lot of network transfers within the cluster, which will most likely cost more than a cache miss. In turn, this could also have a negative effect on query predictability. Sorry to disappoint you, but this is the reason why Neo4j doesn’t yet support data distribution. And it’s a good reason too!

So, a limitation in the way Neo4j scales is that every database instance has a complete replica of the graph. Ideally, for best performance, all instances need to have enough resources to keep the whole graph in memory. If this is not the case, in some scenarios, we can at least attempt to achieve cache sharding by identifying all queries hitting a given region of the graph and always routing them to the same instance. As a starting point, there is a built-in load-balancer which can potentially be extended to do that. Additionally, we can easily direct I/O requests intelligently in a heterogeneous environment, designating some Read Replicas for handling read queries while only writing to instances packing the most power. This is a good thing for read operations which can easily scale horizontally. Write operations are however the bottleneck. Nevertheless, the guys over at Neo4j are always coming up with clever ways to significantly improve write performance with every new release.

Does Neo4j work with unstructured/flexible structured data?

Calin: A graph is composed of nodes and relationships. We are able to group similar nodes together by attaching a label, such as “User” or “Post”. Similarly, a relationship can have a type, such as “LIKED” or “TAGGED”. Neo4j is a property graph, meaning that multiple name-value pairs can be added to both nodes and relationships. While it is mandatory in Neo4j for relationships to have exactly one type, labels and properties are optional. New ones can be defined on the fly, and nodes and relationships of the same type don’t all necessarily need to have the same properties. If needed, Neo4j does support indexes and constraints, which can, for instance, improve query performance, but this is as close as you get to an actual schema.
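
A minimal sketch of what that flexibility looks like in practice (all labels, relationship types, property names and credentials below are invented for illustration): two :User nodes with different property sets, typed relationships with and without properties, and an optional uniqueness constraint, created through the Java driver.

```java
import org.neo4j.driver.v1.AuthTokens;
import org.neo4j.driver.v1.Driver;
import org.neo4j.driver.v1.GraphDatabase;
import org.neo4j.driver.v1.Session;

public class PropertyGraphSketch {
    public static void main(String[] args) {
        Driver driver = GraphDatabase.driver("bolt://localhost:7687",
                AuthTokens.basic("neo4j", "secret"));  // hypothetical credentials
        try (Session session = driver.session()) {
            // Optional schema: a uniqueness constraint on one property (Neo4j 3.x syntax).
            session.run("CREATE CONSTRAINT ON (u:User) ASSERT u.email IS UNIQUE");

            // Two :User nodes with different property sets (no fixed schema required),
            // plus typed relationships, one of which carries its own property.
            session.run(
                "CREATE (a:User {name: 'Ana', email: 'ana@example.com'}) " +
                "CREATE (b:User {name: 'Bogdan'}) " +
                "CREATE (p:Post {title: 'Hello graphs'}) " +
                "CREATE (a)-[:LIKED {at: '2019-03-16'}]->(p) " +
                "CREATE (b)-[:TAGGED]->(p)");
        }
        driver.close();
    }
}
```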

Coming back to the question, the whole point of a graph database is to structure your data in some way. Keep in mind that the true value of your data lies in uncovering relationships between entities. If you feel this doesn’t fit your use case – coming back to the example of having a number of unrelated free-text documents, or even some form of semi-structured data – then, while Neo4j now supports full-text indexing and search, there are clearly better alternatives out there, such as key-value and document stores.

How is it best and easiest to get started with Neo4j?

Calin: Apart from attending my workshop, right? I think the best way to get up to speed with Neo4j is to use the Neo4j Sandbox. It’s a cloud-based trial environment which only requires a browser to work and which comes preloaded with a bunch of interesting datasets. If you’re into a more academic approach, I highly recommend grabbing a copy of “Graph Databases” or “Neo4j in Action”.

Can you detail how users can interact with Neo4j? What about developers – are there pre-built drivers or interfaces?

Calin: Neo4j packs a very nice Browser that enables users to quite easily query and visualize the graph. This comes with syntax highlighting and autocompletion for Cypher, the query language used by Neo4j. It also features a very handy way to interact with query execution plans.

Developers can “talk” with the database using a REST API or, better yet, the proprietary binary protocol called Bolt, which is already uniformly encapsulated by a number of official drivers, covering the most popular programming languages out there.

However, as I don’t want to spoil the fun, that’s all you’re getting out of me today. But do come and join us on the 16th of March. Please.

Introduction to Neo4j

Description:

This workshop aims to cover various introductory topics in the area of graph databases, with a main focus on Neo4j. We will tackle subjects relating to data modelling, performance and scalability. We will then have a look at how this technology can be used to highlight valuable patterns within our data.

The workshop will be divided into three main parts: a presentation covering the theoretical aspects surrounding graph databases, a demo showcasing typical Neo4j usage and a hands-on lab activity.

Introduction to Neo4j

Date: March 16th, 2019, 9:30 – 13:30
Trainer:  Calin Constantinov
Location: eSolutions Academy, Budişteanu Office Building, strada General Constantin Budişteanu Nr. 28C, etaj 1, Sector 1, Bucureşti.
Number of places: 15 (no more places left)
Price: 150 RON (including VAT)

You can check out the agenda and register here.

Introduction to Apache Kafka

Apache Kafka has positioned itself strongly lately as “Kafka as a Platform”, quite an evolution from the messaging bus built by LinkedIn in 2011. But what earns Apache Kafka such a strong position in the big data architecture landscape: highly distributed, (theoretically at least) infinite storage of data, the streaming features and API, KSQL? In this workshop we will go through the main features of Apache Kafka and discuss its evolved position in a big data architecture, through use cases and through a hands-on session in which we will store data through the Producer API, retrieve data through the Consumer API, see how data is partitioned and replicated, and process data stored in Kafka with Kafka Streams and KSQL. This workshop is entry level and addresses anyone interested in understanding how to get started with Apache Kafka and the role this solution can play in a big data architecture.
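
As a preview of the hands-on part, here is a hedged Java sketch of the two client APIs we start from (recent Kafka client versions): a producer writing a keyed record – the key determines the partition it lands in – and a consumer in a group reading it back. The topic name, broker address and group id are illustrative assumptions.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ProducerConsumerSketch {
    public static void main(String[] args) {
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");  // assumed local broker
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        // Records with the same key always land in the same partition.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>("events", "sensor-1", "temperature=21.5"));
        }

        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "workshop-readers");         // consumers in a group share the partitions
        consumerProps.put("auto.offset.reset", "earliest");
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(Collections.singletonList("events"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(5));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("partition=%d key=%s value=%s%n",
                        record.partition(), record.key(), record.value());
            }
        }
    }
}
```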

Date: October 20, 2018, 9:30-13:30
Trainers: Valentina Crisan, Felix Crisan
Location: eSolutions Academy, Budişteanu Office Building, strada General Constantin Budişteanu Nr. 28C, etaj 1, Sector 1, Bucureşti.
Number of places: 15 (no more places left)
Price: 150 RON (including VAT)

You can check out the agenda and register for future sessions here.

Intro to Big Data Hadoop Architecture

Open course: Intro to Big Data Hadoop Architecture

Dates: 18-20 May, 2018

Location: eSolutions Academy, Budişteanu Office Building, strada General Constantin Budişteanu Nr. 28C, etaj 1, Sector 1, Bucureşti.
Number of places: 15

We have decided, together with our partners from eSolutions Academy, to organize an open course for Big Data introduction. Together with Felix, we have delivered many and varied Big Data courses in the past 2 years, so we believe we have the general understanding needed to approach the overall architecture of a Big Data solution for a class of people coming from different backgrounds, with different expectations and perhaps different use cases in mind. We asked eSolutions to help with the organization and logistics of the course, so the entire registration process goes through them. We look forward to seeing you at the course.

Please see details for the course and registration here:

Introduction to Apache Solr

This workshop addresses anyone interested in search solutions; its aim is to be a light introduction to search engines, and especially to Apache Solr. Apache Solr is one of the two main open source search engines in existence today, and it is also the base for the search functionality implemented in several big data platforms (e.g. DataStax, Cloudera). Thus, understanding Solr will help you not only in working with the Apache version but also give you a starting point for the several platforms that use Solr as the base for their search functionality.
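
To give a feel for how Solr is used from code, here is a hedged Java sketch using SolrJ, the official Java client: it indexes one document and runs a simple full-text query against it. The Solr URL, the “articles” collection and the field names are assumptions for illustration, not part of the workshop material.

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.SolrInputDocument;

public class SolrSketch {
    public static void main(String[] args) throws Exception {
        // Assumed local Solr instance with an existing "articles" collection.
        SolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/articles").build();

        // Index a single document; fields can be defined in the schema or added dynamically.
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "1");
        doc.addField("title", "Introduction to Apache Solr");
        solr.add(doc);
        solr.commit();

        // Full-text query on the title field.
        QueryResponse response = solr.query(new SolrQuery("title:solr"));
        for (SolrDocument hit : response.getResults()) {
            System.out.println(hit.getFieldValue("id") + " -> " + hit.getFieldValue("title"));
        }
        solr.close();
    }
}
```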

Date: 30 June, 2018, 9:30-13:30
Trainers: Radu Gheorghe
Location: eSolutions Academy, Budişteanu Office Building, strada General Constantin Budişteanu Nr. 28C, etaj 1, Sector 1, Bucureşti.
Number of places: 15 (10 places left)
Price: 150 RON (including VAT)

You can check out the agenda and register for this session here.

Big Data Architecture intro workshop

This workshop is addressed to anyone interested in Big Data and the overall architectural components required to build a data solution. We will use Apache Zeppelin for some data exploration, but otherwise the workshop will be mostly a theoretical one – allowing enough time to understand the possible components and their roles in a Big Data architecture. We will not go in depth into the individual components/solutions; the aim is to understand the overall role each possible component plays in architecting a big data solution.

The scope of this workshop is to make participants familiar with the Big Data architecture components; it has as a prerequisite an overall understanding of IT architectures.

Date: February 24th, 2018, 9:00 – 13:00
Trainers: Felix Crisan, Valentina Crisan
Location: eSolutions Academy, Budişteanu Office Building, strada General Constantin Budişteanu Nr. 28C, etaj 1, Sector 1, Bucureşti.
Number of places: 15 (no more places left)
Price: 150 RON (including VAT)

Check out the agenda and register for this session here.

SQL on Hadoop hands on session, part 1: Hive and Impala intro

They say SQL is the English language of the Big Data world, since almost everybody understands/knows its syntax. The aim of this workshop is to explain what kinds of SQL queries can be run on HDFS (the storage component of the Hadoop environment) – for batch and interactive workloads – and, out of the several solutions available, to run hands-on exercises on Apache Hive and Apache Impala and discuss the general performance that can be obtained. We will also discuss the different file formats that can be used in order to get the best performance out of Hive and Impala, as well as the types of operations/analytics that can be performed on HDFS data.
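
To make this concrete, here is a hedged Java sketch of the kind of thing the hands-on exercises cover, using the Hive JDBC driver against an assumed local, unsecured HiveServer2: it creates a Parquet-backed table (a columnar format that typically gives the best scan performance) and runs an aggregation over it. The table, data and connection details are illustrative; Impala exposes a very similar JDBC endpoint, typically on port 21050.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQuerySketch {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC endpoint; assumes the Hive JDBC driver jar is on the classpath.
        String url = "jdbc:hive2://localhost:10000/default";

        // Credentials for an unsecured sandbox; adjust for Kerberos/LDAP setups.
        try (Connection conn = DriverManager.getConnection(url, "hive", "");
             Statement stmt = conn.createStatement()) {

            // Columnar file formats such as Parquet or ORC usually give the best scan performance.
            stmt.execute("CREATE TABLE IF NOT EXISTS page_views (" +
                         "  user_id STRING, url STRING, view_time TIMESTAMP)" +
                         " STORED AS PARQUET");

            // A typical batch/interactive aggregation over data sitting in HDFS.
            ResultSet rs = stmt.executeQuery(
                "SELECT url, COUNT(*) AS views FROM page_views " +
                "GROUP BY url ORDER BY views DESC LIMIT 10");
            while (rs.next()) {
                System.out.println(rs.getString("url") + " -> " + rs.getLong("views"));
            }
        }
    }
}
```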

Date: November 11, 2017, 9:30 – 13:30
Trainers: Felix Crisan, Valentina Crisan
Location: eSolutions Academy, Budişteanu Office Building, strada General Constantin Budişteanu Nr. 28C, etaj 1, Sector 1, Bucureşti.
Number of places: 15 (3 places left)
Price: 150 RON (including VAT)

You can check out the agenda and register for this session here 

About big data lakes & architectures

In preparation for the October 10th Bucharest Big Data Meetup, here’s a short interview with Cristina Grosu, Product Manager @ Bigstep, a company that has built a platform of integrated big data technologies such as Hadoop, NoSQL and MPP databases, as well as containers, bare-metal compute and storage, offered as a service. Cristina will be talking about “Datalake architectures seen in production” during the meetup, and we discussed a bit about the data lake concept and the possible solutions that are part of its architecture.

 
[Valentina] What is a data lake and how is it different from the traditional data warehouse present today in majority of bigger companies?

[Cristina] Think about the data lake as a huge data repository where data from very different and disparate sources can be stored in its raw, original format. One of the most important advantages of using this concept when storing data comes from the lack of a predefined schema, specific data structure, format or type which encourages the user to reinterpret the data and find new actionable insights.

Real-time or batch, structured or unstructured – files, sensor data, clickstream data, logs, etc. – can all be stored in the data lake. As users start a data lake project, they can explore more complex use cases, since all data sources and formats can be blended together using modern, distributed technologies that can easily scale at a predictable cost and performance.

[Valentina] What (Big Data) solutions usually take part in a data lake architecture? Tell us a little bit regarding the generic architecture of such a solution.

[Cristina] The most common stack of technologies that I see in a data lake architecture is built around the Hadoop Distributed File System.

Since this is an open source project, it allows customers to build an open data lake that is extensible and well integrated with other applications running on-premises or in the cloud. Depending on the type of workload, the technology stack around the data lake might differ, but the Hadoop ecosystem is so versatile in terms of data processing (Apache Spark), data ingestion (StreamSets, Apache NiFi), data visualization (Kibana, Superset), data exploration (Zeppelin, Jupyter) and streaming solutions (Storm, Flink, Spark Streaming) that these pieces can be fairly easily interconnected to create a full-fledged data pipeline.
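
As a small illustration of the schema-on-read idea behind the data lake, here is a hedged Spark sketch in Java: it infers structure from raw JSON files sitting in HDFS at query time rather than at ingestion time. The paths, field names and view name are hypothetical.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DataLakeReadSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("datalake-exploration")
                .getOrCreate();

        // Raw clickstream events dumped into the lake as-is; the schema is inferred on read,
        // not imposed at ingestion time.
        Dataset<Row> clicks = spark.read().json("hdfs:///datalake/raw/clickstream/2017/*.json");

        // Expose the raw data to SQL for ad-hoc exploration.
        clicks.createOrReplaceTempView("clicks");
        spark.sql("SELECT page, COUNT(*) AS hits FROM clicks GROUP BY page ORDER BY hits DESC")
             .show(20);

        spark.stop();
    }
}
```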

What I have seen in production with a lot of our customers are “continuously evolving data lakes”, where users are building their solutions incrementally with a long-term vision in mind, expanding their data architecture when needed and leveraging new technologies when a new twist on their initial approach appears.

Other architectures for the data lake are based on NoSQL databases (Cassandra) or are using Hadoop as an extension of existing data warehouses in conjunction with other BI specific applications.

Regardless of the technology stack, it’s important to design a “forward-looking architecture” that enables data democratization, data lineage tracking, data audit capabilities, real-time data delivery and analytics, metadata management and innovation.

[Valentina] What are the main requirements of a company that would like to build a data lake architecture?  

[Cristina] The truth is that in big data projects, the main requirement is to have a driver in the company, someone with enough influence and experience that can sell the data lake concept in the company and can work together with the team to overcome all obstacles.

At the beginning of the project, building up the team’s big data skills – through training, hiring or bringing in consultants – might be necessary to make sure everyone is up to speed and to create momentum within the organization.

Oftentimes, this leader will be challenged with implementing shifts in mindsets and behaviors regarding how the data is used, who’s involved in the analytics process and how everyone is interacting with data. As a result, new data collaboration pipelines will be developed to encourage cross-departmental communication.

To ensure the success of a data lake project and turn it into a strategic move, besides the right solution, you need a committed driver in your organization and the right people in place.

[Valentina] From the projects you’ve run so far, which do you think are the trickiest points to tackle when building a data lake solution?

[Cristina] Defining strategies for metadata management is quite difficult especially for organizations that have a large variety of data. Dumping everything in the data lake and trying to make sense of it 6 months into the project can become a real challenge if there is no data dictionary or data catalog that can be used to browse through data. I would say, data governance is an important thing to consider when developing the initial plan, even though organizations sometimes overlook it.

In addition, it’s important to have the right skills in the organization, understand how to extract value from data and of course figure out how to integrate the data lake into existing IT applications if that is important for the use case.

[Valentina] Any particular type of business that would benefit more from building a data lake?

[Cristina] To be honest the data lake is such a universal solution that it can be applied to all business verticals with relatively small differences in the design of the solution. The specificity of the data lake comes from every industry’s business processes and the business function it serves.

The greatest data lake adoption is observed among the usual suspects in retail, marketing, logistics, advertising where the volume and velocity of data grow day by day. Very interesting data lake projects are also seen in banking, energy, manufacturing, government departments and even Formula 1.

[Valentina] Will data lake concept change the way companies work with data in your opinion?    

[Cristina] Broader access to data will widen the audience for the organization’s data and will drive the shift towards self-service analytics, powering real-time decision making and automation.

Ideally, each person within a company would have access to a powerful dashboard that can be customized and adapted to run data exploration experiments to find new correlations. Once you find better ways to work with data, you can better understand how trends are going to affect your business, what processes can be optimized or you can simulate different scenarios and understand how your business is going to perform.

The data lake represents the start of this mindset change but everything will materialize once the adoption increases and technologies mature.

Interview by: Valentina Crisan

Modeling your data for analytics with Apache Cassandra and Spark SQL

This session is intended for those looking to better understand how to model data for queries in Apache Cassandra and in Apache Cassandra + Spark SQL. The session will help you understand the concepts of secondary indexes and materialized views in Cassandra, and the way Spark SQL can be used in conjunction with Cassandra in order to run complex analytical queries. We assume you are familiar with Cassandra & Spark SQL (but it’s not mandatory, since we will explain the basic concepts behind data modeling in Cassandra and Spark SQL). The whole workshop will be run in Cassandra Query Language and SQL, and we will use Zeppelin as the interface towards Cassandra + Spark SQL.
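
As a small preview of the session, here is a hedged Java sketch (the keyspace, tables and columns are invented for illustration): a query-driven Cassandra table, a materialized view serving a second query pattern, and Spark SQL reading the same data through the spark-cassandra-connector for an analytical query the table itself was not modelled for.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CassandraSparkSketch {
    public static void main(String[] args) {
        // --- Cassandra data modeling: the table is designed around one query (orders per customer). ---
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        try (Session session = cluster.connect()) {
            session.execute("CREATE KEYSPACE IF NOT EXISTS shop " +
                    "WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}");
            session.execute("CREATE TABLE IF NOT EXISTS shop.orders_by_customer (" +
                    "customer_id text, order_id timeuuid, status text, total decimal, " +
                    "PRIMARY KEY ((customer_id), order_id))");

            // A materialized view answers a second query pattern (orders by status)
            // without a second, manually maintained table.
            session.execute("CREATE MATERIALIZED VIEW IF NOT EXISTS shop.orders_by_status AS " +
                    "SELECT * FROM shop.orders_by_customer " +
                    "WHERE status IS NOT NULL AND customer_id IS NOT NULL AND order_id IS NOT NULL " +
                    "PRIMARY KEY (status, customer_id, order_id)");
        }
        cluster.close();

        // --- Spark SQL over the same Cassandra table, for analytics the model was not built for. ---
        SparkSession spark = SparkSession.builder()
                .appName("cassandra-analytics")
                .config("spark.cassandra.connection.host", "127.0.0.1")
                .getOrCreate();

        Dataset<Row> orders = spark.read()
                .format("org.apache.spark.sql.cassandra")
                .option("keyspace", "shop")
                .option("table", "orders_by_customer")
                .load();
        orders.createOrReplaceTempView("orders");

        spark.sql("SELECT customer_id, SUM(total) AS spent " +
                  "FROM orders GROUP BY customer_id ORDER BY spent DESC")
             .show(10);
        spark.stop();
    }
}
```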

Date: 19 August, 9:00 – 13:30
Trainers: Felix Crisan, Valentina Crisan
Location: eSolutions Academy, Budişteanu Office Building, strada General Constantin Budişteanu Nr. 28C, etaj 1, Sector 1, Bucureşti.
Number of places: 15 (8 places left)
Price: 150 RON (including VAT)

Check out the agenda and register for future sessions here.