Starting up with big data

Authors: Cosmin Chauciuc, Valentina Crisan

Throughout our training journeys with bigdata.ro, one of the most frequent questions at the end of a course (which usually runs in the cloud on an already installed/configured setup) is: “how do I now set up my own small big data system?” Most of the time the question refers to a local installation – not something in the cloud – something that can be used for testing different small use cases and getting started with big data exploration.

Now, getting started with big data in general can have 2 completely different approaches, depending on the intended use case: 

  • if you need HDFS – a distributed file storage system – in order to properly test something like a Hadoop ecosystem. In this case we recommend going for the Cloudera / Hortonworks small installations. But even the smallest of these installations requires 16GB RAM, which many laptops/PCs cannot afford to allocate – plus this minimal configuration works pretty slowly. Thus, in this case a dedicated server – e.g. a Hetzner server – might be a better choice than a local installation. 
  • the second option (local storage, no HDFS) – when you just want a minimal installation and to understand what you could do with it – is to install a minimal combination of Confluent Kafka + Cassandra + Zeppelin. Zeppelin comes with an embedded local Spark, so you will basically get something like the picture below. This is the setup that will be the focus of this post.    

small big data

The blog post below has been drafted and revised by me (Valentina Crisan), but all the steps were actually described and detailed by one of my former students from the Big Data Architecture and Technologies open courses: Cosmin Chauciuc. I am so proud to see someone taking forward the basis laid by my course and wanting to expand it and become proficient in the big data field.

The blog post will have 3 parts:

  • how to install a minimum big data solution on your own computer (current post)
  • building an end-to-end use case 
  • visualizing data in Spark with Helium

We assume the reader already knows something about the solutions mentioned in this post (Kafka, Cassandra, Spark, Zeppelin) – thus we won’t go into the basics of what these are. 

Build your own small big data solution

We will use the Ubuntu Linux distribution for this installation – please follow Step 0 for one of the alternative ways of installing Ubuntu on your system. Please note that for a complete setup you will need a minimum of 8GB RAM – see in the next section the RAM requirements for each service.

0. Install Ubuntu

Ubuntu is an open source Linux distribution, based on Debian, that is very easy to set up in order to have a full Linux operating system. There are various ways to have Ubuntu on your laptop alongside your current Windows installation.

  1. WSL – for windows 10 https://ubuntu.com/wsl
    Install a complete Ubuntu terminal environment in minutes on Windows 10 with Windows Subsystem for Linux (WSL).
  2. Docker
    Docker is a set of platform as a service (PaaS) products that use OS-level virtualization to deliver software in packages called containers.
    https://ubuntu.com/tutorials/windows-ubuntu-hyperv-containers#1-overview
  3. Using VirtualBox
    VirtualBox can run different operating systems using virtualization. See below the steps to install:

    1. Software to download : https://www.virtualbox.org/
    2. Ubuntu ISO to download : https://ubuntu.com/#download
    3. Set up a new machine in Virtualbox
      – Name: Ubuntu – it will auto-fill Type: Linux
      – Allocate RAM for the virtual machine: minimum recommended 8GB

Now, why 8GB of RAM? See below the estimated RAM usage for each of the services:

      • Apache Cassandra 2.3GB
      • Confluent Kafka 3.8GB
      • Apache Zeppelin 1.5GB
      • Ubuntu OS 1.1GB

4. Create a virtual hard disk (VDI) with dynamically allocated size. Depending on what you plan to install, allocate 20-30 GB.

5. The last step is to attach the ISO file to the newly created virtual machine: Settings > Storage > Live CD/DVD. This will let you boot from the ISO file.

Ubuntu can be used as a live system or it can be installed on the VDI hard drive. The difference is that the live OS allows you to use it but does not persist changes/files/newly installed software across reboots. If we want to proceed with the installation of big data tools, we have to install the system.

After you have the Ubuntu OS installed you need to proceed with the next installation steps:
1. Confluent Kafka
2. Apache Cassandra
3. Apache Zeppelin – with embedded Apache Spark local
4. Apache Zeppelin configuration for Spark connectivity with Kafka and Cassandra

Note: you might observe that all the solutions above are Apache ones, except Kafka, which we chose to be a Confluent installation. The reasoning is simple – Confluent has a one-node installation available that has all the Kafka services in one node: Apache Kafka (Zookeeper + Kafka broker), Schema Registry, Connect and KSQL, to name a few. This one-node installation is meant for testing purposes (not for commercial/production ones). In production you will need to install these services from scratch (unless you choose a commercially licensed installation from Confluent), but this is the setup we found best to get you started and give you a good glimpse of the functionality the Kafka ecosystem can offer. Nonetheless, if you don’t want to go for the Confluent Kafka one-node installation, at the end of this post you also have the Apache Kafka version – in this case you will have only Kafka, without Connect, Schema Registry and KSQL.

1. Prerequisites for services installations

  • Install curl
sudo apt install curl
  • Install Java 8
sudo apt install openjdk-8-jdk openjdk-8-jre
  • Java version check
$ java -version
openjdk version "1.8.0_265"
OpenJDK Runtime Environment (build 1.8.0_265-8u265-b01-0ubuntu2~20.04-b01)
OpenJDK 64-Bit Server VM (build 25.265-b01, mixed mode)
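If you happen to have more than one Java version installed, make sure Java 8 is the one actually in use – a quick sketch below (the JVM path is the one we assume for the Ubuntu openjdk-8 package):

sudo update-alternatives --config java               # select the java-8-openjdk entry if another version is the default
export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64   # assumed install path for the openjdk-8-jdk package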

2. Kafka installation

  • We will download Confluent Kafka – the one-node installation (no commercial license needed) – one node that has all the Kafka components available: Zookeeper, Broker, Schema Registry, KSQL, Connect
curl -O http://packages.confluent.io/archive/5.5/confluent-5.5.1-2.12.zip
  • Extract the contents of the archive
unzip confluent-5.5.1-2.12.zip
cd confluent-5.5.1/

We’ll use the default configuration files from <path-to-confluent>/etc/.
  • Starting the services: this Confluent script starts all the services, including KSQL.
~/confluent-5.5.1/bin$ ./confluent local start

    The local commands are intended for a single-node development environment
    only, NOT for production usage. https://docs.confluent.io/current/cli/index.html
 
 
Using CONFLUENT_CURRENT: /tmp/confluent.kCNzjS0a
Starting zookeeper
zookeeper is [UP]
Starting kafka
kafka is [UP]
Starting schema-registry
schema-registry is [UP]
Starting kafka-rest
kafka-rest is [UP]
Starting connect
connect is [UP]
Starting ksql-server
ksql-server is [UP]
Starting control-center
control-center is [UP]
  • Another way to install Kafka – not using the Confluent script – is to start Zookeeper first and then each Kafka broker. See the details in paragraph 7.
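Once all the services report [UP], you can check their status at any time – or stop them when you are done – with the same local CLI; a quick sketch, run from the Confluent bin folder:

~/confluent-5.5.1/bin$ ./confluent local status    # prints [UP]/[DOWN] for each service
~/confluent-5.5.1/bin$ ./confluent local stop      # stops all the locally started services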

3. Apache Cassandra

  • Add the Apache repository of Cassandra to the file cassandra.sources.list
echo "deb https://downloads.apache.org/cassandra/debian 311x main" | sudo tee -a /etc/apt/sources.list.d/cassandra.sources.list
  • Add the Apache Cassandra repository keys to the list of trusted keys on the server
curl https://downloads.apache.org/cassandra/KEYS | sudo apt-key add -
  • Update
sudo apt-get update
  • Install Cassandra
sudo apt-get install cassandra
  • Start Cassandra service
sudo service cassandra start
sudo service cassandra stop   # only if you want to stop the service
  • Check the status of Cassandra
nodetool status

Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load       Tokens       Owns (effective)  Host ID                               Rack
UN  127.0.0.1  70.03 KiB  256          100.0%            1c169827-bf4c-487f-b79a-38c00855b144  rack1

  • Test CQLSH
$ cqlsh
Connected to Test Cluster at 127.0.0.1:9042.
[cqlsh 5.0.1 | Cassandra 3.11.7 | CQL spec 3.4.4 | Native protocol v4]
Use HELP for help.
cqlsh>
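Optionally, you can run a small smoke test directly from cqlsh – the keyspace and table below are just hypothetical examples used to verify that writes and reads work on the single node:

cqlsh> CREATE KEYSPACE IF NOT EXISTS test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
cqlsh> CREATE TABLE IF NOT EXISTS test.words (word text PRIMARY KEY, count int);
cqlsh> INSERT INTO test.words (word, count) VALUES ('hello', 1);
cqlsh> SELECT * FROM test.words;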

4. Apache Zeppelin

  • Download the Zeppelin binary archive
$ wget http://mirrors.m247.ro/apache/zeppelin/zeppelin-0.9.0-preview2/zeppelin-0.9.0-preview2-bin-all.tgz
  • Extract archived files
$ tar xzvf zeppelin-0.9.0-preview2-bin-all.tgz
  • Start Zeppelin service
$ cd zeppelin-0.9.0-preview2-bin-all
$ ./bin/zeppelin-daemon.sh start
Log dir doesn't exist, create /home/kosmin/zeppelin-0.9.0-preview2-bin-all/logs
Pid dir doesn't exist, create /home/kosmin/zeppelin-0.9.0-preview2-bin-all/run
Zeppelin start                                             [  OK  ]
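Zeppelin’s web UI listens by default on port 8080, so once the daemon reports OK you can open http://localhost:8080 in a browser – or quickly check from a terminal that the port answers:

ss -tulp | grep 8080
curl -s -o /dev/null -w "%{http_code}\n" http://localhost:8080    # should print 200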

  • Configuring users for Zeppelin

The default login is with an anonymous user. The configuration for users is found in the conf folder under <zeppelin_path>.

$ cd ~/zeppelin-0.9.0-preview2-bin-all/conf/
$ mv shiro.ini.template shiro.ini
$ vi shiro.ini
# uncomment the line with admin = password1, admin and save the file
$ cd .. && ./bin/zeppelin-daemon.sh restart
  • Log in and test the Spark interpreter
sc.version
res3: String = 2.4.5

5. Configuring Zeppelin for connection with Kafka and Cassandra – through the embedded Apache Spark

An Apache Zeppelin interpreter is a plugin that enables you to access processing engines and data sources from the Zeppelin UI. In order to connect our installed Zeppelin to Kafka and Cassandra, we will need to add some artifacts to the Spark interpreter in the Zeppelin configuration. More info on Zeppelin interpreters can be found in the Zeppelin documentation.

Configurations for Kafka in Zeppelin

  • Finding the Spark version
sc.version   // run in a Zeppelin notebook
res1: String = 2.4.5
  • Open the interpreter config page
    • under your username (top right), open the menu and click Interpreter
    • search for the spark interpreter and click edit
    • in the Dependencies section, add the artifact below under Artifact and click Save
org.apache.spark:spark-sql-kafka-0-10_2.11:2.4.5

Configurations for Cassandra in Zeppelin

  • Finding the Spark version
sc.version   // run in a Zeppelin notebook
res1: String = 2.4.5
  • Open the interpreter config page
    • under your username (top right), open the menu and click Interpreter
    • search for the spark interpreter, click edit and, in the Dependencies section, add the artifact below under Artifact and click Save:
com.datastax.spark:spark-cassandra-connector_2.11:2.5.1

Version compatibility for the Spark-Cassandra connector can be found here.

6. End to end exercise to test setup

Using the Kafka console producer: create a topic called bigdata and start typing data into it (make sure you set 3 partitions for this topic).

kosmin@bigdata:~/confluent-5.5.1/bin$ ./kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 3 --topic bigdata
Created topic bigdata.
kosmin@bigdata:~/confluent-5.5.1/bin$ ./kafka-console-producer --broker-list localhost:9092 --topic bigdata
enter some words here :) 
CTRL+C   # when you want to stop producing events to the topic
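If you want to double-check that the messages actually landed in the topic, you can read them back with the console consumer (a quick sanity check, run from the same bin folder):

kosmin@bigdata:~/confluent-5.5.1/bin$ ./kafka-console-consumer --bootstrap-server localhost:9092 --topic bigdata --from-beginning
# the words typed above should be printed back; stop the consumer with CTRL+C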

Using Zeppelin, open a notebook: create a streaming DataFrame in Spark that points to the bigdata topic in Kafka, then create a query that reads, for each message, the key (null in our case), value, topic, partition, offset and timestamp.

val kafkaDF = spark.readStream.format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe","bigdata")
      .option("startingOffsets","earliest")
      .load().select($"key".cast("STRING").as("key"), 
                     $"value".cast("STRING").as("value"), 
                     $"topic",
                     $"partition",
                     $"offset",
                     $"timestamp")

val query_s0 = kafkaDF.writeStream.outputMode("append").format("console").start()
query_s0.awaitTermination(30000)
query_s0.stop()

Test for Cassandra

import org.apache.spark.sql.functions._
val tables = spark.read.format("org.apache.spark.sql.cassandra").options(Map( "table" -> "tables", "keyspace" -> "system_schema" )).load()
tables.show(10)

7. Another way to install Kafka – not using the Confluent script – is to start Zookeeper first and then each Kafka broker

  • Zookeeper
kosmin@bigdata:~/confluent-5.5.1$ ./bin/zookeeper-server-start  ./etc/kafka/zookeeper.properties

In another terminal window, we can check if Zookeeper started. By default, it listens on port 2181.

$ ss -tulp | grep 2181
tcp    LISTEN   0        50                     *:2181                  *:*      users:(("java",pid=6139,fd=403))
  • Kafka

To start the Kafka service, run this in another terminal:

kosmin@bigdata:~/confluent-5.5.1$ ./bin/kafka-server-start ./etc/kafka/server.properties

Check in another terminal window if Kafka started. The default Kafka broker port is 9092.

kosmin@bigdata:~/confluent-5.5.1$ ss -tulp | grep 9092
tcp    LISTEN   0        50                     *:9092                  *:*      users:(("java",pid=6403,fd=408))
  • Test Kafka topics
kosmin@bigdata:~/confluent-5.5.1/bin$ ./kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic Hello
Created topic Hello.
kosmin@bigdata:~/confluent-5.5.1/bin$ ./kafka-topics --list --zookeeper localhost:2181
Hello
__confluent.support.metrics
_confluent-license
kosmin@bigdata:~/confluent-5.5.1/bin$
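You can also inspect how a topic was created – partition count, replication factor, configuration overrides – with the --describe option of the same tool:

kosmin@bigdata:~/confluent-5.5.1/bin$ ./kafka-topics --describe --zookeeper localhost:2181 --topic Hello
# prints the partition count, replication factor and the leader/replica brokers for each partition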

At this moment you should have your own private small cluster ready for learning some of the solutions used in big data architectures. Although with such a setup you cannot see the wonders of distributed systems, you can nevertheless understand the purpose of Kafka/Cassandra/Spark/Zeppelin in a big data architecture.

The next post in this series will be about building an end-to-end use case with this setup.

Open course Big Data, September 25-28, 2019


Open Course: Big Data Architecture and Technology Concepts
Course duration: 3.5 days, September 25-28 (Wednesday-Friday 9:00 – 17:00, Saturday 9:30-13:00)
Trainers: Valentina Crisan, Felix Crisan
Location: Bucharest, TBD (location will be communicated to participants)
Price: 450 EUR; 10% early bird discount (405 EUR) if registration is confirmed by the 2nd of September
Number of places: 10
Pre-requisites: knowledge of distributed systems and the Hadoop ecosystem (HDFS, MapReduce), plus a bit of SQL.

Description:

There are a few concepts and solutions that solutions architects should be aware of when evaluating or building a big data solution: what data partitioning means, how to model your data in order to get the best performance from your distributed system, what the best format for your data is, what the best storage is, or the best way to analyze your data. Solutions like HDFS, Hive, Cassandra, HBase, Spark, Kafka and YARN should be known – not necessarily because you will work specifically with them – but mainly because knowing the concepts behind these solutions will help you understand other similar solutions in the big data space. This course is designed to make sure the participants understand the usage and applicability of big data technologies like HDFS, Spark, Cassandra, HBase, Kafka, etc. and which aspects to consider when starting to build a big data architecture.

Please see details for the course and registration here: https://bigdata.ro/open-course-big-data-september-25-28-2019/

Introduction to Apache Solr

This workshop addresses anyone interested in search solutions; its aim is to be a light intro to search engines and especially Apache Solr. Apache Solr is one of the two main open source search engines existing today and it is also the base for the search functionalities implemented in several big data platforms (e.g. Datastax, Cloudera). Thus, understanding Solr will help you not only in working with the Apache version but also give you a starting point for several platforms that use Solr as the base for their search functionalities.

Date: 30 June, 2018, 9:30-13:30
Trainers: Radu Gheorghe
Location: eSolutions Academy, Budişteanu Office Building, strada General Constantin Budişteanu Nr. 28C, etaj 1, Sector 1, Bucureşti.
Number of places: 15 (10 places left)
Price: 150 RON (including VAT)

You can check out the agenda and register for this session here.

Big Data Architecture intro workshop

This workshop is addressed to anyone interested in big data and the overall architectural components required to build a data solution. We will use Apache Zeppelin for some data exploration, but otherwise the workshop will be more a theoretical one – allowing enough time to understand the possible components and their roles in a big data architecture. We will not go in depth into the components/solutions; the aim is to understand the overall role of each possible component in architecting a big data solution.

The scope of this workshop is to make the participants familiar with the Big Data architecture components and has as prerequisite the overall understanding of IT architectures.

Date: February 24th, 2018, 9:00 – 13:00
Trainers: Felix Crisan, Valentina Crisan
Location: eSolutions Academy, Budişteanu Office Building, strada General Constantin Budişteanu Nr. 28C, etaj 1, Sector 1, Bucureşti.
Number of places: 15 (no more places left)
Price: 150 RON (including VAT)

Check out the agenda and register for this session here.

Modeling your data for analytics with Apache Cassandra and Spark SQL

This session is intended for those looking to understand better how to model data for queries in Apache Cassandra and Apache Cassandra + Spark SQL. The session will help you understand the concept of secondary indexes and materialized views in Cassandra and the way Spark SQL can be used in conjunction with Cassandra in order to be able to run complex analytical queries. We assume you are familiar with Cassandra & Spark SQL (but it’s not mandatory since we will explain the basic concepts behind data modeling in Cassandra and Spark SQL). The whole workshop will be run in Cassandra Query Language and SQL and we will use Zeppelin as the interface towards Cassandra + Spark SQL.

Date: 19 August, 9:00 – 13:30
Trainers: Felix Crisan, Valentina Crisan
Location: eSolutions Academy, Budişteanu Office Building, strada General Constantin Budişteanu Nr. 28C, etaj 1, Sector 1, Bucureşti.
Number of places: 15 (8 places left)
Price: 150 RON (including VAT)

Check out the agenda and register for a future session here.

Modeling your data for analytics with Apache Cassandra and Spark SQL

This session is intended for those looking to understand better how to model data for queries in Apache Cassandra and Apache Cassandra + Spark SQL. The session will help you understand the concept of secondary indexes and materialized views in Cassandra and the way Spark SQL can be used in conjunction with Cassandra in order to be able to run complex analytical queries. We assume you are familiar with Cassandra & Spark SQL (but it’s not mandatory since we will explain the basic concepts behind data modeling in Cassandra and Spark SQL). The whole workshop will be run in Cassandra Query Language and SQL and we will use Zeppelin as the interface towards Cassandra + Spark SQL.

Date: 10 June, 9:00 – 13:30 – this workshop will be rescheduled
Trainers: Felix Crisan, Valentina Crisan
Location: eSolutions Academy, Budişteanu Office Building, strada General Constantin Budişteanu Nr. 28C, etaj 1, Sector 1, Bucureşti.
Number of places:  15
Price: 150 RON (including VAT)

Check out the agenda and register for a future session here.

Analytics with Cassandra and Spark SQL Workshop

We continue the series of Spark SQL and Cassandra workshops with more hands-on exercises on the integration between the 2 solutions, working on the open MovieLens data. This workshop addresses those who know the basics of Cassandra & CQL and have SQL knowledge. Spark is not mandatory, although it would be good to know its basic concepts (RDD, transformations, actions), since we will not address these concepts in the workshop but will mention them on several occasions. Without the Spark basic concepts you will still understand the aggregations that can be done at the Spark SQL level, but you will not fully understand how Spark SQL integrates into the whole Spark system.
In this workshop you will understand the optimal way of making queries in a solution composed of Apache Cassandra and Apache Spark.
Prerequisites: Cassandra Concepts knowledge, SQL knowledge

Trainers: Felix Crisan, Valentina Crisan
When: 22 April
Time: 9:30-14:00
Location: eSolutions Academy, Budişteanu Office Building, strada General Constantin Budişteanu Nr. 28C, etaj 1, Sector 1, Bucureşti.
Number of places: 15 (5 places left)
Price: 125 RON (including VAT)

Check out the agenda and register here.