Workshops in April: getting started with Python for data science and building your first ML model

We are launching two online workshops in April that should get you started with both Python and Machine Learning in the easiest possible way. Instead of local installations (needed in order to have a Python environment such as Jupyter), we will use Google Colab – an online notebook environment that allows writing and executing Python code. As for the Machine Learning library, we are going to use scikit-learn – probably the best-known and most widely used open-source ML library. The workshops are built independently, so you can participate in either one or both of them, but if you want to get started in this field we recommend first getting familiar with Python and the data science libraries and then learning how to build and test an ML model.

More info on the workshops can be found below:

Workshop 1, April 3rd: An Easy Slide into Python for Data Science, trainer Ionut Oprea

Workshop 2, April 17th: Start learning ML with Sklearn and Google Colab, trainer Maria Catana

Working group: Stream processing with Apache Flink

Learning a new solution or building an architecture for a specific use case is never easy, especially when you are trying to embark alone on such an endeavour – thus in 2020 bigdata.ro started a new way of learning specific big data solutions/use cases: working groups. In 2020 we started 3 working groups:

  • Spark Structured Streaming + NLP
  • Building live dashboards with Druid + Superset
  • Understanding Decision Trees (running until December)

And with 2 of the groups completed and the Decision Trees one to be completed soon, we are now opening registration for a new working group – this time focused on Apache Flink: how to process streaming data with Apache Flink and Apache Pulsar/Apache Kafka. The working group aims to run from December to February and will bring together a team of 5-6 participants that will define the scope (Kafka/Pulsar and the exact use case), select the data (open data), install the needed components and implement the needed flow.

More details and registration here.

Open course: Understanding Apache Kafka

Timeline & Duration: November 20th – December 15th, 4 x 5-hour online sessions over 4 weeks (1 session/week, on November 20 & 27 and December 4 & 15; the exact hours are to be decided). An online setup will be available for exercises/hands-on sessions for the duration of the course, and guided local installations will also be provided for completing the course project.

Main trainer: Valentina Crisan

Location: Online (Zoom)

Price: 350 EUR (early bird 300 EUR if payment is made by November 15th)

Prerequisites: knowledge of distributed systems; SQL syntax for the KSQL part of the course

More details and registration here.

Starting up with big data

Authors: Cosmin Chauciuc, Valentina Crisan

Throughout our training journeys with bigdata.ro, one of the most frequently asked questions at the end of a course (which usually runs in the cloud on an already installed/configured setup) is: “how do I now set up my own small big data system?” Most of the time the question refers to a local installation – not something in the cloud – something that can be used for testing different small use cases and getting started with one’s big data curiosity/exploration.

Now, getting started with big data in general can take 2 completely different approaches, depending on the intended use case:

  • the first option: you need HDFS – a distributed file storage system – in order to properly test a Hadoop-like ecosystem. In this case I recommend going for the Cloudera/Hortonworks small installations. But even the smallest installations require 16GB of RAM, which many laptops/PCs cannot afford to allocate – plus this minimal configuration runs pretty slowly. Thus, in this case a dedicated server – e.g. a Hetzner server – might be a better choice versus a local installation.
  • the second option (local storage, no HDFS) – when you just want a minimal installation and want to understand what you could do with it – is to install a minimal combination of Confluent Kafka + Cassandra + Zeppelin. Zeppelin comes with an embedded local Spark, so you will basically get something like in the picture below. This is the setup that will be the focus of this post.

[Image: the small big data setup – Confluent Kafka + Cassandra + Zeppelin (with embedded Spark)]

The blog post below has been drafted and revised by me (Valentina Crisan) but actually all steps were described and detailed by one of my former students in the Big Data Architecture and Technologies open courses: Cosmin Chauciuc. I am so proud to see someone taking forward the basis laid by my course and wanting to expand and become proficient in the big data field.

The blog post will have 3 parts:

  • how to install a minimum big data solution on your own computer (current post)
  • building an end-to-end use case
  • visualizing data in Spark with Helium

We assume the reader already knows something about the solutions mentioned in this post (Kafka, Cassandra, Spark, Zeppelin) – thus we won’t go into the basics of what these are. 

Build your own small big data solution

We will use the Ubuntu Linux distribution for this installation – please follow Step 0 for one of the alternative ways of installing Ubuntu on your system. Please note that for a complete setup you will need a minimum of 8GB of RAM – see in the next section the RAM requirements for each service.

0. Install Ubuntu

Ubuntu is an open-source Linux distribution, based on Debian, that is very easy to set up as a full Linux operating system. There are various ways to run Ubuntu on your laptop alongside your current Windows installation.

  1. WSL – for Windows 10: https://ubuntu.com/wsl
    Install a complete Ubuntu terminal environment in minutes on Windows 10 with Windows Subsystem for Linux (WSL).
  2. Docker
    Docker is a set of platform as a service (PaaS) products that use OS-level virtualization to deliver software in packages called containers.
    https://ubuntu.com/tutorials/windows-ubuntu-hyperv-containers#1-overview
  3. Using VirtualBox
    VirtualBox can run different operating systems using virtualization. See below the steps to install:

    1. Software to download : https://www.virtualbox.org/
    2. Ubuntu ISO to download : https://ubuntu.com/#download
    3. Set up a new machine in VirtualBox
      – Name: Ubuntu – it will auto-fill Type: Linux
      – Allocate RAM for the virtual machine: minimum recommended 8GB

Now, why 8GB of RAM? See below the estimated RAM usage for each of the services:

      • Apache Cassandra 2.3GB
      • Confluent Kafka 3.8GB
      • Apache Zeppelin 1.5GB
      • Ubuntu OS 1.1GB

4. Create a virtual hard disk, VDI, with dynamically allocated size. Depending on what you plan to install, allocate 20-30 GB.

5. The last step is to attach the ISO file to the newly created virtual machine: Settings, Storage: Live CD. This will let you boot from the ISO file.

Ubuntu can be used as a live system or it can be installed on the VDI hard drive. The difference is that the live OS lets you use the system but does not persist changes/files/newly installed software across reboots. If we want to proceed with installing the big data tools, we have to install the system.

After you have Ubuntu installed you need to proceed with the next installation steps:
1. Confluent Kafka
2. Apache Cassandra
3. Apache Zeppelin – with embedded Apache Spark local
4. Apache Zeppelin configurations for Spark connectivity with Kafka and Cassandra

Note: you might observe that all of the solutions above are Apache ones, except for Kafka, for which we chose a Confluent installation. The reasoning is simple – Confluent has a one-node installation available that has all the Kafka services in one node: Apache Kafka (Zookeeper + Kafka broker), Schema Registry, Connect and KSQL, to name a few. This one-node installation is meant for testing purposes (not for commercial/production ones). In production you will need to install these services from scratch (unless you choose a commercially licensed installation from Confluent), but this is the setup we found best to get you started and to give you a good glimpse into the functionality the Kafka ecosystem can offer. Nonetheless, if you don’t want to go for the Confluent Kafka one-node installation, at the end of this post you also have the plain Apache Kafka way of starting the services – in this case you will have only Kafka, without Connect, Schema Registry and KSQL.

1. Prerequisites for services installations

  • Install curl
sudo apt install curl
  • Install Java 8
sudo apt install openjdk-8-jdk openjdk-8-jre
  • Java version check
$ java -version
openjdk version "1.8.0_265"
OpenJDK Runtime Environment (build 1.8.0_265-8u265-b01-0ubuntu2~20.04-b01)
OpenJDK 64-Bit Server VM (build 25.265-b01, mixed mode)

2. Kafka installation

  • We will download Confluent Kafka – the one-node installation (no commercial license needed) – a single node that has all the Kafka components available: Zookeeper, Broker, Schema Registry, KSQL, Connect
curl -O http://packages.confluent.io/archive/5.5/confluent-5.5.1-2.12.zip
  • Extract the contents of the archive
unzip confluent-5.5.1-2.12.zip
cd confluent-5.5.1/

We’ll use the default configuration files.

<path-to-confluent>/etc/
  • Starting the services: this Confluent script starts all the services, including KSQL.
~/confluent-5.5.1/bin$ ./confluent local start

    The local commands are intended for a single-node development environment
    only, NOT for production usage. https://docs.confluent.io/current/cli/index.html
 
 
Using CONFLUENT_CURRENT: /tmp/confluent.kCNzjS0a
Starting zookeeper
zookeeper is [UP]
Starting kafka
kafka is [UP]
Starting schema-registry
schema-registry is [UP]
Starting kafka-rest
kafka-rest is [UP]
Starting connect
connect is [UP]
Starting ksql-server
ksql-server is [UP]
Starting control-center
control-center is [UP]
  • Another way to install Kafka – not using the Confluent script – is to start Zookeeper first and then each Kafka broker. See the details in Section 7.

3. Apache Cassandra

  • Add the Apache repository of Cassandra to the file cassandra.sources.list
echo "deb https://downloads.apache.org/cassandra/debian 311x main" | sudo tee -a /etc/apt/sources.list.d/cassandra.sources.list
  • Add the Apache Cassandra repository keys to the list of trusted keys on the server
curl https://downloads.apache.org/cassandra/KEYS | sudo apt-key add -
  • Update
sudo apt-get update
  • Install Cassandra
sudo apt-get install cassandra
  • Start Cassandra service
sudo service cassandra start
sudo service cassandra stop // only if you want to stop the service
  • Check the status of Cassandra
nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  127.0.0.1  70.03 KiB  256          100.0%            1c169827-bf4c-487f-b79a-38c00855b144  rack1

  • Test CQLSH
$ cqlsh
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.7 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh>

4. Apache Zeppelin

$ wget http://mirrors.m247.ro/apache/zeppelin/zeppelin-0.9.0-preview2/zeppelin-0.9.0-preview2-bin-all.tgz
  • Extract archived files
$ tar xzvf zeppelin-0.9.0-preview2-bin-all.tgz
  • Start Zeppelin service
$ cd zeppelin-0.9.0-preview2-bin-all
$ ./bin/zeppelin-daemon.sh start
Log dir doesn't exist, create /home/kosmin/zeppelin-0.9.0-preview2-bin-all/logs
Pid dir doesn't exist, create /home/kosmin/zeppelin-0.9.0-preview2-bin-all/run
Zeppelin start                                             [  OK  ]

  • Configuring users for Zeppelin

The default login is with an anonymous user. The configuration for users is found in the conf folder under <zeppelin_path>.

$ cd ~/zeppelin-0.9.0-preview2-bin-all/conf/
$ mv shiro.ini.template shiro.ini
$ vi shiro.ini
# uncomment the line with admin = password1, admin
# save the file and restart the Zeppelin service
  • Login and test the spark interpreter
sc.version
res3: String = 2.4.5

5. Configuring Zeppelin for connection with Kafka and Cassandra – through the embedded Apache Spark

An Apache Zeppelin interpreter is a plugin that enables you to access processing engines and data sources from the Zeppelin UI. In order to connect our installed Zeppelin to Kafka, we will need to add some artifacts to the Spark interpreter in the Zeppelin configuration. More info on Zeppelin interpreters can be found here.

Configurations for Kafka in Zeppelin

  • Finding the spark version
sc.version - in the zeppelin notebook
res1: String = 2.4.5
  • Open the interpreter config page
  • Under the username menu, click Interpreter
    • Search for the Spark interpreter and click edit
    • In the Dependencies section, add the artifact below and click Save
org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5

Configurations for Cassandra in Zeppelin

  • Finding the spark version
sc.version - in the zeppelin notebook
res1: String = 2.4.5
  • Open the interpreter config page
    • Under the username menu, click Interpreter
    • Search for the Spark interpreter, click edit, and in the Dependencies section add the artifact below, then click Save:
com.datastax.spark:spark-cassandra-connector_2.11:2.5.1

Version compatibility for the Spark-Cassandra connector can be found here.

6. End to end exercise to test setup

Using the Kafka console producer: create a topic (named bigdata below) and start typing data into it (make sure you set 3 partitions for this topic).

kosmin@bigdata:~/confluent-5.5.1/bin$ ./kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 3 --topic bigdata
Created topic bigdata.
kosmin@bigdata:~/confluent-5.5.1/bin$ ./kafka-console-producer --broker-list localhost:9092 --topic bigdata
enter some words here :) 
CTRL+C // if you want to close producing events in the topic

Using Zeppelin, open a notebook and create a streaming DataFrame in Spark that points to the bigdata topic in Kafka. Then create a query that reads all the parameters of each message: key (null in our case), value, topic, partition, offset and timestamp.

val kafkaDF = spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe","bigdata")
      .option("startingOffsets","earliest")
      .load().select($"key".cast("STRING").as("key"), 
                     $"value".cast("STRING").as("value"), 
                     $"topic",
                     $"partition",
                     $"offset",
                     $"timestamp")

val query_s0 = kafkaDF.writeStream.outputMode("append").format("console").start()
query_s0.awaitTermination(30000)
query_s0.stop()

Test for Cassandra

import org.apache.spark.sql.functions._
val tables = spark.read.format("org.apache.spark.sql.cassandra").options(Map( "table" -> "tables", "keyspace" -> "system_schema" )).load()
tables.show(10)
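
The exercise above only prints the stream to the console and reads a system table from Cassandra. If you also want to check the write path towards Cassandra from the same notebook, below is a minimal sketch (ours, not part of the original steps): it assumes a hypothetical test.kafka_msgs table created beforehand in cqlsh and reuses the kafkaDF defined above, writing each micro-batch through the Spark-Cassandra connector artifact added in section 5.

// hypothetical table, created beforehand in cqlsh:
//   CREATE KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
//   CREATE TABLE test.kafka_msgs (topic text, partition int, offset bigint, value text, ts timestamp,
//                                 PRIMARY KEY ((topic, partition), offset));
import org.apache.spark.sql.DataFrame

val query_c = kafkaDF
  .select($"topic", $"partition", $"offset", $"value", $"timestamp".as("ts"))
  .writeStream
  .option("checkpointLocation", "/tmp/kafka_to_cassandra_chk")
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // write the current micro-batch into the (hypothetical) Cassandra table via the connector
    batchDF.write
      .format("org.apache.spark.sql.cassandra")
      .options(Map("keyspace" -> "test", "table" -> "kafka_msgs"))
      .mode("append")
      .save()
  }
  .start()

query_c.awaitTermination(30000)
query_c.stop()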

7. Another way to install Kafka – not using the Confluent script – is to start Zookeeper first and then each Kafka broker

  • Zookeeper
kosmin@bigdata:~/confluent-5.5.1$ ./bin/zookeeper-server-start  ./etc/kafka/zookeeper.properties

In another terminal window, we can check if Zookeeper started. By default, it listens on port 2181.

$ ss -tulp | grep 2181
tcp    LISTEN   0        50                     *:2181                  *:*      users:(("java",pid=6139,fd=403))
  • Kafka

For starting Kafka service, run this in another terminal :

kosmin@bigdata:~/confluent-5.5.1$ ./bin/kafka-server-start ./etc/kafka/server.properties

Check in another terminal window whether Kafka started. The default Kafka broker port is 9092.

kosmin@bigdata:~/confluent-5.5.1$ ss -tulp | grep 9092
tcp    LISTEN   0        50                     *:9092                  *:*      users:(("java",pid=6403,fd=408))
  • Test Kafka topics
kosmin@bigdata:~/confluent-5.5.1/bin$ ./kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic Hello
Created topic Hello.
kosmin@bigdata:~/confluent-5.5.1/bin$ ./kafka-topics --list --zookeeper localhost:2181
Hello
__confluent.support.metrics
_confluent-license
kosmin@bigdata:~/confluent-5.5.1/bin$

At this moment you should have your own private small cluster, ready for learning some of the solutions used in big data architectures. Although with such a setup you cannot see the wonders of distributed systems, you can nevertheless understand the purpose of Kafka/Cassandra/Spark/Zeppelin in a big data architecture.

The next post in this series will be about building an end-to-end use case with this setup.

About Neo4j…

In March we will restart our bigdata.ro sessions, workshops mainly aimed at helping participants navigate big data architectures more easily and get a basic understanding of some of the possible components of such architectures. We have discussed in the past Cassandra, HDFS, Hive, Impala, Elasticsearch, Solr, Spark & Spark SQL and generic big data architectures, and on March 16th we will continue our journey with one of the unusual children of noSQL: the graph database Neo4j. Not quite similar to its other noSQL siblings, this database is not derived from the likes of DynamoDB or BigTable as others are, but instead addresses the relationships between data, not just the data itself. The result is amazing, the use cases are incredible, and Calin Constantinov will guide us through the basics of this interesting solution.

See below a few questions and answers ahead of the workshop; hopefully these will increase your curiosity about Neo4j.

Valentina Crisan – bigdata.ro

Calin Constantinov – trainer “Intro to Neo4j” workshop, March 16th

—————————————————————————————————————————————

What is a graph database and which are the possible use cases that favour such a database?

Calin: They say “everything is a graph”. Indeed, even the good old entity-relationship diagram is no exception to this rule. And graphs come with a great “feature” which we humans tend to value very much: they are visual! Graphs can easily be represented on a whiteboard and immediately understood by a wide audience.

Moreover, in a traditional database, explicit relationships are destroyed the very moment we store data and need to be recreated on-demand using JOIN operations. A native graph database has preferential treatment for relationships meaning that there are actual pointers linking an entity to all its neighbors.

I remember the first time I needed to implement a rather simple Access Control List solution that needed to support various inheritable groups, permissions and exceptions. Writing this in SQL can quickly become a nightmare.

But of course, the most popular example is social data similar to what Facebook generates. For some wild reason, imagine you need to figure out the year with the most events attended by at least 5 of your 2nd-degree connections (friends-of-friends), with the additional restriction that none of these 5 are friends with each other. I wouldn’t really enjoy implementing that with anything other than Neo4j!

However, not all graphs are meant to be stored in a graph database. For instance, while a set of unrelated documents can be represented as a graph with no edges, please don’t rush to using Neo4j for this use-case. I think a Document store is a better persistence choice.

In terms of adoption, 75% of the Fortune 100 companies are already using Neo4j. As for concrete use-case examples, Neo4j is behind eBay’s ShopBot for Graph-Powered Conversational Commerce while NBC News used it for uncovering 200K tweets tied to Russian trolls. My personal favourite is the “Panama Papers” where 2.6TB of spaghetti data, made up of 11.5M heterogeneous documents, was fed to Neo4j. And I think we all know the story that led the investigative team to win the Pulitzer Prize.

What graph databases exist out there and how is Neo4j different from those?

Calin: Full disclosure, given the wonderful community around it (and partially because it happened to be the top result of my Google search), it was love at first sight with me and Neo4j. So, while I’m only using Neo4j in my work, I do closely follow what’s happening in the graph world.

JanusGraph, “a graph database that carries forward the legacy of TitanDB” is one of the most well-known alternatives. A major difference is that JanusGraph is more of a “graph abstraction layer” meaning that it requires a storage backend instead of it being a native graph.

OrientDB is also popular due to its multi-model, polyglot persistence implementation. This means that it’s capable of storing graph, document and key/value data, while maintaining direct connections between records. The only drawback is that it might not yet have reached the maturity and stability required by the most data-intensive tasks out there.

More recently, TigerGraph showed impressive preliminary results, so I might need to check that out soon.

Is the Neo4j architecture a distributed one? Does it scale horizontally like other noSQL databases?

Calin: The short answer is that Neo4j can store graphs of any sizes in an ACID-compliant, distributed, Highly-Available, Causal Clustering architecture, where data replication is based on the state-of-the-art Raft protocol.

In order to achieve the best performance, we would probably need to partition the graph in some way. Unfortunately, this is typically an NP-hard problem and, more often than not, our graphs are densely connected, which can make any form of clustering quite challenging. To make matters worse, coming back to the Facebook example, we need to understand that this graph is constantly changing, with each tap of the “Like” button. This means that our graph database can easily end up spending more time finding a (sub-)optimal partition than actually responding to queries. Moreover, when combining a complex query with a bad partitioning of the data, you wind up requiring a lot of network transfers within the cluster, which will most likely cost more than a cache miss. In turn, this could also have a negative effect on query predictability. Sorry to disappoint you, but this is the reason why Neo4j doesn’t yet support data distribution. And it’s a good reason too!

So, a limitation in the way Neo4j scales is that every database instance has a complete replica of the graph. Ideally, for best performance, all instances need to have enough resources to keep the whole graph in memory. If this is not the case, in some scenarios, we can at least attempt to achieve cache sharding by identifying all queries hitting a given region of the graph and always routing them to the same instance. As a starting point, there is a built-in load-balancer which can potentially be extended to do that. Additionally, we can easily direct I/O requests intelligently in a heterogeneous environment, designating some Read Replicas for handling read queries while only writing to instances packing the most power. This is a good thing for read operations which can easily scale horizontally. Write operations are however the bottleneck. Nevertheless, the guys over at Neo4j are always coming up with clever ways to significantly improve write performance with every new release.

Does Neo4j work with unstructured/flexible structured data?

Calin: A graph is composed of nodes and relationships. We are able to group similar nodes together by attaching a label, such as “User” and “Post”. Similarly, a relationship can have a type, such as “LIKED” and “TAGGED”. Neo4j is a property graph meaning that multiple name-value pairs can be added both to relationships and nodes. While it is mandatory in Neo4j for relationships to have exactly one type, labels and properties are optional. New ones can be defined on-the-fly and nodes and relationships of the same type don’t all necessarily need to have the same properties. If needed, Neo4j does support indexes and constraints, which can, for instance, improve query performance, but this is as close as you get to an actual schema.

In regard to the question, the whole point of a graph database is to structure your data in some way. Keep in mind that the true value of your data lies in uncovering relationships between entities. If you feel like this doesn’t fit your use-case – coming back to the example of having a number of unrelated free-text documents or even some form of semi-structured data – while Neo4j now supports full text indexing and search, there are clearly better alternatives out there, such as key-value and document stores.

How is it best and easiest to get started with Neo4j?

Calin: Apart from attending my workshop, right? I think the best way to get up to speed with Neo4j is to use the Neo4j Sandbox. It’s a cloud-based trial environment which only requires a browser to work and which comes preloaded with a bunch of interesting datasets. If you’re into a more academic approach, I highly recommend grabbing a copy of “Graph Databases” or “Neo4j in Action“.

Can you detail how users can interact with Neo4j? What about developers – are there pre-built drivers or interfaces?

Calin: Neo4j packs a very nice Browser that enables users to quite easily query and visualize the graph. This comes with syntax highlighting and autocompletion for Cypher, the query language used by Neo4j. It also features a very handy way to interact with query execution plans.

Developers can “talk” with the database using a REST API or, better yet, the proprietary binary protocol called Bolt, which is already uniformly encapsulated by a number of official drivers, covering the most popular programming languages out there.

However, as I don’t want to spoil the fun, that’s all you’re getting out of me today. But do come and join us on the 16th of March. Please.

About big data lakes & architectures

In preparation for the October 10th Bucharest Big Data Meetup, here’s a short interview with Cristina Grosu, Product Manager @ Bigstep, a company that has built a platform of integrated big data technologies such as Hadoop, NoSQL and MPP databases, as well as containers, bare-metal compute and storage, offered as a service. Cristina will be talking about “Datalake architectures seen in production” during the meetup, and we discussed a bit about the data lake concept and the possible solutions that are part of its architecture.

 
[Valentina] What is a data lake and how is it different from the traditional data warehouse present today in the majority of bigger companies?

[Cristina] Think about the data lake as a huge data repository where data from very different and disparate sources can be stored in its raw, original format. One of the most important advantages of using this concept when storing data comes from the lack of a predefined schema, specific data structure, format or type which encourages the user to reinterpret the data and find new actionable insights.

Real-time or batch, structured or non-structured, files, sensor data, clickstream data, logs, etc can all be stored in the data lake. As users start a data lake project, they can explore more complex use cases since all data sources and formats can be blended together using modern, distributed technologies that can easily scale at a predictable cost and performance.

[Valentina] What (Big Data) solutions usually take part in a data lake architecture? Tell us a little bit regarding the generic architecture of such a solution.

[Cristina] The most common stack of technologies that I see in a data lake architecture is built around the Hadoop Distributed File System.

Since this is an open-source project, it allows customers to build an open data lake that is extensible as well as well integrated with other applications that are running on-premises or in the cloud. Depending on the type of workload, the technology stack around the data lake might differ, but the Hadoop ecosystem is so versatile in terms of data processing (Apache Spark), data ingestion (StreamSets, Apache NiFi), data visualization (Kibana, Apache Superset), data exploration (Zeppelin, Jupyter) and streaming solutions (Storm, Flink, Spark Streaming) that these can be fairly easily interconnected to create a full-fledged data pipeline.

What I have seen in production with a lot of our customers are “continuously evolving data lakes”, where users are building their solutions incrementally with a long-term vision in mind, expanding their data architecture when needed and leveraging new technologies when a new twist on their initial approach appears.

Other architectures for the data lake are based on NoSQL databases (Cassandra) or are using Hadoop as an extension of existing data warehouses in conjunction with other BI specific applications.

Regardless of the technology stack, it’s important to design a “forward-looking architecture” that enables data democratization, data lineage tracking, data audit capabilities, real-time data delivery and analytics, metadata management and innovation.

[Valentina] What are the main requirements of a company that would like to build a data lake architecture?  

[Cristina] The truth is that in big data projects the main requirement is to have a driver in the company, someone with enough influence and experience who can sell the data lake concept within the company and can work together with the team to overcome all obstacles.

At the beginning of the project building the team’s big data skills might be necessary to make sure everyone is up to speed and create momentum within the organization by training, hiring or bringing in consultants.

Oftentimes, this leader will be challenged with implementing shifts in mindsets and behaviors regarding how the data is used, who’s involved in the analytics process and how everyone is interacting with data. As a result, new data collaboration pipelines will be developed to encourage cross-departmental communication.

To ensure the success of a data lake project and turn it into a strategic move, besides the right solution, you need a committed driver in your organization and the right people in place.

[Valentina] From the projects you’ve run so far, which do you think are the trickiest points to tackle when building a data lake solution?

[Cristina] Defining strategies for metadata management is quite difficult especially for organizations that have a large variety of data. Dumping everything in the data lake and trying to make sense of it 6 months into the project can become a real challenge if there is no data dictionary or data catalog that can be used to browse through data. I would say, data governance is an important thing to consider when developing the initial plan, even though organizations sometimes overlook it.

In addition, it’s important to have the right skills in the organization, understand how to extract value from data and of course figure out how to integrate the data lake into existing IT applications if that is important for the use case.

[Valentina] Any particular type of business that would benefit more from building a data lake?

[Cristina] To be honest the data lake is such a universal solution that it can be applied to all business verticals with relatively small differences in the design of the solution. The specificity of the data lake comes from every industry’s business processes and the business function it serves.

The greatest data lake adoption is observed among the usual suspects in retail, marketing, logistics, advertising where the volume and velocity of data grow day by day. Very interesting data lake projects are also seen in banking, energy, manufacturing, government departments and even Formula 1.

[Valentina] Will data lake concept change the way companies work with data in your opinion?    

[Cristina] Access to data will broaden the audience of the organization’s data and will drive the shift towards self-service analytics to power real-time decision making and automation.

Ideally, each person within a company would have access to a powerful dashboard that can be customized and adapted to run data exploration experiments to find new correlations. Once you find better ways to work with data, you can better understand how trends are going to affect your business, what processes can be optimized or you can simulate different scenarios and understand how your business is going to perform.

The data lake represents the start of this mindset change but everything will materialize once the adoption increases and technologies mature.

Interview by: Valentina Crisan

Big Data Projects and Architectures (Part 1)

Two of the biggest struggles for a Big Data beginner are – most of the time – the practicality of data, meaning understanding the applicability of data in different businesses (what kind of data and what kind of results those businesses obtain from it), and where to start when thinking of a new idea.

Today we are kicking off a couple of posts that are meant to help with understanding the applicability of data in business and the architectures that need to be built. We will also gather suggestions/recommendations on how to start when thinking of a new project.

Our first interview is with Tudor Lapusan, Big Data/Hadoop DevOps @ Telenav, who also runs the Big Data Meetup in Cluj and is part of the team developing Zimilar – the startup that aims to recognize and match any piece of clothing & accessories out there.

[Valentina] What kind of Big Data / Data are you analyzing at Telenav (at least in the area you’re working on)? Please mention type, volume, batch/real time – if possible.

[Tudor] I have been working for Telenav and playing around with big data technologies for more than 5 years. Telenav is a car navigation company and, as you may already guess, most of our datasets are geospatial (e.g. anonymous lat/lon coordinates indicating where our clients are driving). Sharing the exact amount of data we have is confidential, but what I can disclose is that we store tens of terabytes of data and the amount of data is growing continuously.

Also, my view is that information about the size of one's data is relative, because it depends on how it's stored. You can store it in raw/serialized format or you can store it compressed. It also depends on what schema/solution you choose to store the data: you can use HDFS or a NoSQL database. All of these choices have a big impact on the size of your data.

At Telenav we started our Big Data project using batch processing. Five-six years ago MapReduce and HDFS were the key components of the Hadoop ecosystem and – at that moment – adoption among small/medium companies was weak. I believe that the majority of early adopters were doing batch processing at that time. Currently, although our main analysis processes are still batch, we are also working on prototypes for real-time projects.

[Valentina] What kind of analytics are you performing on this data?

[Tudor] Some of Telenav's projects are using OpenStreetMap (OSM) as the map behind the car navigation. Because OSM is a free map that anyone in the world can edit, there are still some things to improve. We are analyzing our dataset to predict the real speed of the streets in order to improve the driving experience. Also, based on our data, we improve OSM by finding streets which are missing from the original map, and we detect wrong direction information for streets: one-ways, double-ways and turn restrictions which are mapped incorrectly. These are some of the main analyses we are doing, and they all have the goal of improving OSM and the car navigation experience.

We even built a web application to expose all of these results and to help the OSM community: https://improveosm.org/. Also, we are doing a lot of internal analyses to help us understand where the majority of our clients are coming from and what their behavior is.

[Valentina] Can you describe the architecture used (the solutions inside and their role)?

[Tudor] As I said before, in the beginning we started with the classical architecture, MapReduce and HDFS (we stored the data in HDFS and processed it with MapReduce). After that we also needed random access to our data, so after some research we chose HBase as the random data access solution.

Our current architecture is quite different nowadays from the first one we built. Now we use Spark to analyze the data and, because we don't need random access to the entire dataset, we don't use HBase anymore. All the data is stored in HDFS as Parquet files, which is a perfect match for Spark and batch processing. We use Elasticsearch to import small/medium datasets when we need random access, mostly useful for debugging and internal products.

[Valentina] How did you choose the different components? What were the main requirements to be fulfilled?

[Tudor] I read and researched a lot at the time about what was going on in the Big Data world. Also, I'm part of a passionate team and this helps a lot with our decisions.

The world of Big Data is more stable now than in past years. So, if you are a beginner in this space and you have a Big Data use case for your project, after a few days of research you will see that all tutorials go through one or two technologies to choose from. Of course, if you need to set up an entire pipeline, things get complicated and maybe you will need some advice from someone with more experience.

Other important aspects are the community around the framework, the resources (tutorials, books, Stack Overflow) and its maturity. The majority of big data technologies work well when you analyze a few gigabytes. The real challenge is when you give them tens of terabytes of data or even more to analyze. Until now I haven't found the perfect technology that works the same on small and big data. It depends on each of us how deeply we want to explore and understand a technology.

[Valentina] When building an architecture which are – in your experience – the main things to consider when choosing different solutions. For example, choosing different Hadoop distributions: like Cloudera or Map-R?

[Tudor] When we started to work with Hadoop, if I remember correctly, only Cloudera offered solutions to deploy and monitor your cluster, and it wasn't free (as it is now). So we initially decided to manually deploy and manage our cluster. And all was fine while we had only nine servers in our cluster and only a few services to manage. The main advantage of the manual approach was that it forced us to really understand how each service works; the big disadvantage was that it took time, effort and sometimes a dedicated team to manage it.

Jumping to nowadays, deploying a Hadoop cluster has become a really easy process. You can do it even after a few days of research, all thanks to the big players which entered the big data world, like Amazon, Cloudera, Hortonworks, MapR, Microsoft and many others. At Telenav we are using Cloudera as the solution to deploy and monitor our cluster and we are pretty happy with it.

If you plan to start working with big data technologies, I strongly recommend using the solutions of one of the above-mentioned companies. I can't comment on whether one is better than the others, since I have the most experience with Cloudera. But I have friends at other companies who are happy with Hortonworks or AWS.

[Valentina] What is the best use case that you see for HDFS currently?

[Tudor] HDFS is a distributed filesystem designed to store large volumes of data. The default block size is, if I remember correctly, 128 MB in the newest versions, and this size is customizable. These characteristics make HDFS the perfect solution for storing your data at a minimum cost and, in the end, doing some big analysis on it (batch processing). At Telenav we store all our historical data in HDFS as Parquet files and use Spark to process it. The book “Hadoop: The Definitive Guide” (the HDFS chapter) includes one of the best explanations of HDFS characteristics. I strongly recommend our readers read it if they want to find out more about HDFS and other use cases for it.

[Valentina] You are also the main driver of the Cluj big data community; can you tell me how you see the take-off or adoption of Big Data solutions/architectures in Cluj, and at the Romania level as well?

[Tudor] I remember that when I initiated the Big Data community in Cluj, three years ago, only 2-3 companies were playing with Hadoop. I used to use LinkedIn to look for other people from Cluj with ‘hadoop’ skills and found very few people at the time. Now, there are definitely more companies in Cluj working with Big Data solutions, some of them also supporting our community nowadays. Through meetups and workshops I have met a lot of people who are working with big data technologies and even more people who are interested in this field. Plus, more and more companies have started to work on machine learning as well. I don't know the level for Romania, but I assume that it's a good one. I have heard about companies working with big data technologies in Bucharest, Timisoara, Iasi and Brasov.

Interview by: Valentina Crisan @bigdata.ro


Distributed search with Apache Cassandra and/or Apache Solr?

One is a noSQL solution, the other is primarily a search engine, each also adding search or scaling functions, thus increasing their potential use cases. Talking about the integration of Cassandra and Solr and how best to perform a distributed search, we (Valentina Crisan & Radu Gheorghe) decided that this might be a good topic for one of the Bucharest Big Data meetup presentations, so on June 7th we will debate the use cases in which these 2 powerful solutions might be best on their own or integrated.

The interview below is the basis of the presentation and discussion at the upcoming June 7th Bucharest Big Data meetup: Distributed search with Apache Cassandra and/or Apache Solr.

Best use cases for Cassandra + Solr/Lucene or Solr + Cassandra/noSQL?

[Valentina] From the Cassandra perspective it is pretty simple: for any use case that – besides Cassandra's main use cases – also needs queries with full-text search capabilities (facets, highlights, geo-location…), while requiring a single interface (CQL) for launching all queries, a Solr integration or any Lucene index will be the solution. The main data store would be Cassandra and the indexes would be handled by the Lucene/Solr engine. Please also see the answer below about Cassandra's search capabilities.

[Radu] Solr is great for either full text search (think books, blog posts) or structured search (products, logs, events) and for doing analytics on top of those search results (facets on an online shop, log/metric analytics, etc). Backups aside, you would benefit from a dependable datastore like Cassandra to be the source of truth. Also, some use-cases have big field values that don’t need to be shown with results (e.g. Email attachments), so they might as well be stored in Cassandra and only pulled when the user selects that result – leaving Solr with only the index to deal with for that field.

What are the optimal use cases for Apache Cassandra?

[Valentina] When your requirement is a very heavy-write, distributed and highly scalable system, and you also want a quite responsive reporting system on top of that stored data, then Cassandra is your answer. Consider the use case of web analytics, where log data is stored for each request and you want to build an analytical platform around it to count hits by hour, by browser, by IP, etc. in a real-time manner. The rule of thumb for real-time analytics is that your data should already be calculated and persisted in the database. If you know the reports you want to show in real time, you can define your schema accordingly and generate your data in real time. Cassandra is not an OLAP system, but it can handle simple queries with a pre-defined data model and pre-aggregation of data, and it supports integration with Spark for more complex analytics: joins, group by, window functions.

Most Cassandra use cases are around storing data from the following domains: product catalogs/playlists, recommendation/personalization engines, sensor data/Internet of Things, messaging, fraud detection.
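
To make the Spark integration mentioned above more concrete, here is a minimal sketch (Scala, assuming a Spark session with the spark-cassandra-connector on the classpath, for example the Zeppelin setup described earlier on this page; the weblogs.hits table and its columns are hypothetical names used only for illustration):

import org.apache.spark.sql.functions._
import spark.implicits._

// read a (hypothetical) Cassandra table of raw web hits: weblogs.hits(ip text, browser text, url text, ts timestamp)
val hits = spark.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("keyspace" -> "weblogs", "table" -> "hits"))
  .load()

// the kind of pre-aggregation you would compute in Spark and persist back into a reporting table:
// hits per hour and per browser
val hitsPerHour = hits
  .groupBy(window($"ts", "1 hour"), $"browser")
  .agg(count("*").as("hits"))

hitsPerHour.show(10)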

What are Cassandra's search capabilities?

[Valentina] Traditionally Cassandra was not designed to support search use cases (it can only do matches on the partition key), but since v3.4 there has been some improvement, through the introduction of the SASI feature (see below).

In general, the standard option in Cassandra when dealing with search-type queries is to use an index. Cassandra supports a pluggable interface for defining indexes, and thus there are multiple implementations available:

  • The native solution: SASI (SSTable-Attached Secondary Index), implemented on top of the native secondary indexes in Cassandra. This feature is not only an extension of the native secondary indexes, but an improved one, in the sense that it shortens access time to data versus the standard index implementation. There are still some things to consider with SASI when compared with a search engine (see Ref 1):
    • It is not possible to ask for total ordering of the result (even when LIMIT clause is used), SASI returns result in token range order;
    • SASI allows full text search, but there is no scoring applied to matched terms;
    • being a secondary index, SASI requires 2 passes on disk to fetch data: first to get the index data and another read in order to fetch the rest of the data;
    • native secondary indexes in Cassandra are locally stored in each node, thus all searches will need to hit all cluster nodes in order to retrieve the final data – even if sometimes this means only a node will have the relevant information;

  Through SASI, Apache Cassandra supports the following (a short sketch of creating and querying a SASI index follows after this list of index options):

  – search queries including more than 1 column (multiple-column indexes)
  – for text data types (text, varchar & ascii), queries on:
      – prefix, using the LIKE ‘prefix%’ syntax
      – suffix, using the LIKE ‘%suffix’ syntax
      – substring, using the LIKE ‘%substring%’ syntax
      – exact match, using equality (=)
  – for other data types (int, date, uuid …), queries on:
      – equality (=)
      – range (<, ≤, >, ≥)

  • Stratio’s Cassandra Lucene Index: an open-source plugin for Apache Cassandra that extends its index functionality to provide near real-time search like Elasticsearch or Solr, including full-text search capabilities and free multivariable and geospatial search. It is achieved through an Apache Lucene based implementation of Cassandra secondary indexes, where each node of the cluster indexes its own data. Through the Cassandra Lucene index the following queries are supported (see Ref 2):
    • Full text search (language-aware analysis, wildcard, fuzzy, regexp)
    • Boolean search (and, or, not)
    • Sorting by relevance, column value, and distance
    • Geospatial indexing (points, lines, polygons and their multiparts)
    • Geospatial transformations (bounding box, buffer, centroid, convex hull, union, difference, intersection)
    • Geospatial operations (intersects, contains, is within)
    • Bitemporal search (valid and transaction time durations)
  • DSE (DataStax Enterprise Search – commercial edition): a DataStax proprietary implementation (you can get access to it only through DSE) which uses Apache Solr and Lucene on top of Cassandra. It maintains search cores co-located with each Cassandra node, updating them as part of Cassandra’s write path so that data is indexed in real time. In addition to supporting the text and geospatial search features, DSE Search also supports advanced capabilities like search faceting.
  • Other solutions: 
    • Of course, you can integrate Cassandra and Solr separately (application-layer integration) by using an analytical tool that joins the data from both systems and a solution like Kafka to synchronize the data. This has nothing to do with Cassandra's search capabilities – we are just pointing out the option. Spark could also be used as an aggregation and analytics layer for the data.
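
As promised above, here is a minimal sketch of the SASI option – Scala using the DataStax Java driver (3.x), wrapping plain CQL. The demo keyspace, users table and index name are hypothetical, used only to illustrate the CREATE CUSTOM INDEX syntax and a LIKE query; the same CQL statements can be run directly in cqlsh.

import com.datastax.driver.core.Cluster
import scala.collection.JavaConverters._

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect()

// hypothetical keyspace and table
session.execute("CREATE KEYSPACE IF NOT EXISTS demo WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}")
session.execute("CREATE TABLE IF NOT EXISTS demo.users (id uuid PRIMARY KEY, name text, country text)")

// SASI index in CONTAINS mode, which enables LIKE '%substring%' searches on a regular column
session.execute(
  "CREATE CUSTOM INDEX IF NOT EXISTS users_name_sasi ON demo.users (name) " +
  "USING 'org.apache.cassandra.index.sasi.SASIIndex' WITH OPTIONS = {'mode': 'CONTAINS'}")

// substring search - results come back in token range order, without any scoring
val rows = session.execute("SELECT id, name FROM demo.users WHERE name LIKE '%ann%'").asScala
rows.foreach(row => println(row.getUUID("id") + " " + row.getString("name")))

cluster.close()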

Why would anyone, in your opinion, use Cassandra for search instead of just using Solr?

[Valentina] The best implementation always depends on the use case and its history. If the primary use case is building an OLTP system – storing the data reliably, writing it really fast in a distributed environment, with simple queries that usually rely on the partition key – and you only sporadically need more complex queries (text search, joins, group by, moving average), then having Cassandra as the base system and using its SASI or Lucene index, or even the DSE Solr implementation, plus Spark for full analytics might be a solution. Depending on the search complexity you can decide on one of the 3 alternatives provided by Cassandra.

If your use case is a heavy analytics one that also includes evolved search capabilities, then having an analytics tool that combines the fast retrieval of data from Cassandra (queries by partition) with the indexes in Solr might be a solution.

So, it depends on the primary use case of the solution you're trying to build. Comparing Cassandra and Solr as noSQL stores or search engines is certainly not optimal, since Cassandra was not meant to be a search engine and Solr was not meant to be a noSQL system (although since SolrCloud there have been use cases for Solr as a data storage system); thus your main requirements should be the ones that drive the solution design, not just a bare comparison between the two.

How can Solr be used as a data storage system (besides a search engine)? What features are available for data redundancy and consistency?

[Radu] Solr supports CRUD operations, though updates tend to be expensive (see below) and a get by ID is in fact a search by ID: a very cheap search, but usually slower than getting documents by primary key from a more traditional storage system – especially when you want N documents back.

For redundancy, replicas can be created on demand and, if the leader goes down, a replica is automatically promoted. Searches are round-robined between replicas and, because replication is based on the transaction log (in SolrCloud at least), we're talking about the same load whether a shard is a leader or a replica. Lastly, active-passive cross-DC replication is supported.

As for consistency, Zookeeper holds the cluster state, and writes are acknowledged when all replicas write to transaction log: the leader writes a new document to its transaction log and replicates it to all the replicas, which index the same document in parallel.

What are the preferred types of input data for Solr use cases (e.g. documents)? Which type of data is the most optimal for its functionality?  

[Radu]  Structured documents (e.g. products) and/or free text (books, events) are ideal use-cases. Solr doesn’t work well with huge documents (GB) or huge number of fields (tens of thousands).

Some solution owners do have a separate storage system connected to Solr. How does the integration look in this case (is all data replicated twice, or do only the indexes reside in Solr)?

[Radu] Usually Solr indexes data (for searching) and some fields have docValues (for sorting and facets). Most also store the original value for fields where docValues aren't enabled (so one could get the whole document without leaving Solr). Storing a field is also required for highlighting. If you need all these features, you end up keeping the data several times: in the separate storage, in docValues and in stored fields, plus the inverted index. If you don't, you can skip some of those and rely on the separate storage to get the original data. That said, fetching data from a different storage is often slower because it requires another [network] call.

What does a data update look like in Solr? Is the data updated in place or rather added (with different versions)?

[Radu]  There are three main options:

  • Index a new document over the old one. Versioning can be used here for concurrency control.
  • Update a value of a document. This will trigger Solr to fetch the document internally, apply the change and index the resulting document over the old one. You can still use versions here, either assigned by Solr or provided by you (e.g. a timestamp). See the sketch after this list.
  • For fields that only have docValues (not indexed, not stored), an in-place update of that value is possible.
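
As an illustration of the second option above (the partial/atomic update), here is a minimal SolrJ sketch in Scala; the Solr URL, the products collection and the price field are assumptions made for this example only:

import java.util.Collections
import org.apache.solr.client.solrj.impl.HttpSolrClient
import org.apache.solr.common.SolrInputDocument

val client = new HttpSolrClient.Builder("http://localhost:8983/solr").build()

// atomic update: only the "price" field is changed; Solr fetches the stored document internally,
// applies the change and indexes the resulting document over the old one
val doc = new SolrInputDocument()
doc.addField("id", "product-42")
doc.addField("price", Collections.singletonMap("set", Double.box(19.99)))

client.add("products", doc)
client.commit("products")
client.close()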

When might it be OK to use Solr as an analytics tool?

[Radu]  When you filter (search) on data before analyzing, because that’s where Solr shines. For example, live dashboards on logs, events, metric data. Or faceting for product search (e.g. Amazon).

What are Solr’s capabilities when it comes to full text search?

[Radu] I can’t be exhaustive here, but let me go over the main categories:

  • Fast retrieval of top N documents matching a search. It can also go over all matches of a search if a cursor is used.
  • Capable and pluggable text analysis. For example, language detection, stemming, synonyms, ngrams and so on. This is all meant to break text into tokens so that “analysis” in your query matches “analyze” in your document, for example.
  • Sorting by a field works, of course, but Solr is also good at ranking results by a number of criteria. For example, per-field boosting (including boosting more exact matches over more lenient ones, like those that are stemmed), TF/IDF or BM25, the value of a field, potentially with a multiplier. Multiple criteria can be combined as well, so you can really tweak your relevancy scores (a small query sketch follows after this list).
  • Highlighting matches. It doesn’t have to go over the text again, which is useful for large fields like books.
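
As a small illustration of the ranking and highlighting points above, here is a minimal SolrJ query sketch in Scala; the books collection and the title/body fields are hypothetical:

import org.apache.solr.client.solrj.SolrQuery
import org.apache.solr.client.solrj.impl.HttpSolrClient
import scala.collection.JavaConverters._

val client = new HttpSolrClient.Builder("http://localhost:8983/solr").build()

val q = new SolrQuery("graph databases")
q.set("defType", "edismax")    // pluggable query parser
q.set("qf", "title^3 body")    // per-field boosting: matches in title count 3x more than in body
q.setHighlight(true)           // return highlighted snippets
q.addHighlightField("body")
q.setRows(10)                  // top N documents

val response = client.query("books", q)
response.getResults.asScala.foreach(doc => println(doc.getFieldValue("title")))
client.close()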

 

This discussion will be continued on June 7th, at the Bucharest Big Data Meetup. See you there.

Ref 1: http://www.doanduyhai.com/blog/?p=2058#sasi_perf_benchmarks

Ref 2: https://github.com/Stratio/cassandra-lucene-index 

Big Data in Tg Mures

It looks like the Big Data topic is getting more and more attention and more cities are starting events that cover it, so we decided to create a new bigdata.ro page where we will present these events. The first event we will feature takes place in Tg Mures on June 2nd, where Kodok Marton will talk about “Complex Realtime Event Analytics Using BigQuery”.

Join the event if you are in the Tg Mures area, you can see the event details here.

Looking forward to seeing a Big Data community forming in Tg Mures as well.