Modeling your data for analytics with Apache Cassandra and Spark SQL

This session is intended for those looking to understand better how to model data for queries in Apache Cassandra and Apache Cassandra + Spark SQL. The session will help you understand the concept of secondary indexes and materialized views in Cassandra and the way Spark SQL can be used in conjunction with Cassandra in order to be able to run complex analytical queries. We assume you are familiar with Cassandra & Spark SQL (but it’s not mandatory since we will explain the basic concepts behind data modeling in Cassandra and Spark SQL). The whole workshop will be run in Cassandra Query Language and SQL and we will use Zeppelin as the interface towards Cassandra + Spark SQL.

Date: 19 August, 9:00 – 13:30
Trainers: Felix Crisan, Valentina Crisan
Location: eSolutions Academy, Budişteanu Office Building, strada General Constantin Budişteanu Nr. 28C, etaj 1, Sector 1, Bucureşti.
Number of places:  15  8 left
Price: 150 RON (including VAT)

Check out the agenda and register for future session here.

Big Data Projects and Architectures (Part 1)

Two of the biggest struggles for a Big Data beginner are – most of the time – the practicality of data, meaning the understanding the applicability of the data in different businesses ( what kind of data and what kind of results those businesses obtain for it) and where to start from when thinking of a new idea.   

We are kicking off today a couple of posts that are meant to help understanding the applicability of data in business and the architectures that need to be built. Also, we will gather suggestions/recommendations on how to start when thinking of a new project.

Our first interview is with Tudor Lapusan, Big Data/ Hadoop Devops @ Telenav, who also runs Big Data Meetup in Cluj and is part of the team developing Zimilar – the startup that aims to recognize and match any piece of clothing & accessories out there.   

[Valentina] What kind of Big Data / Data are you analyzing at Telenav (at least in the area you’re working on)? Please mention type, volume, batch/real time – if possible.

[Tudor] I’m working for Telenav and playing around with big data technologies for more that 5 years. Telenav is a car navigation company and as you already may guess, our majority datasets are geospatial related (ex. anonymous lat, lon coordinates indicating where our clients are driving). Sharing the exact amount of data we have it’s confidential, but what I can disclose is that we store tens of terabytes of data and the amount of data is growing continuously.

Also my view is that the info on the size of one’s data on is relative because it depends on how it’s stored. You can store it in raw/serialized format or you can store it compressed. Also depends what schema/solution you choose to store the data, you can used HDFS or a NoSql database. All of these choices have a big impact in the size of your data.

In Telenav we started our Big Data project using batch processing. Five-six years ago was that moment when MapReduce and HDFS were the key components from Hadoop ecosystem and – at that momentum – with a weak adoption for small/medium companies. I believe that majority of early adopters were doing batch processing in that time. Currently although our main analysis processes are still batch processing, we are also working on prototypes for real time projects.

[Valentina] What kind of analytics are you performing on this data?

[Tudor]  Some of Telenav projects are using Open Street Map (OSM) as the map behind the car navigation. Because OSM is a free map, where anyone from this world can edit it, there are still some things to improve. We are analyzing our dataset to predict the real speed from the streets to improve the car driving experience. Also, based on our data, we improve the OSM by finding streets which are missing from the original map, we detect wrong direction information from the streets, one ways, double ways, turn-restriction which are mapped incorrect. These are one of the main analyses we are doing and all have the goal to improve the OSM and the car navigation experience.

We built even an web application to expose all of these results and to help OSM community : https://improveosm.org/.  Also, we are doing a lot of internal analyses to help us understand where the majority of our clients are coming from and their behavior.

[Valentina] Can you describe the architecture used (the solutions inside and their role)?

[Tudor] As I said before, in the beginning we started with classical architecture, MapReduce and HDFS (stored the data in HDFS and processed it with MapReduce). After that we needed also random access to our data, so after some research we choose HBase as a random data access solution.

Our current architecture is different though nowadays from the first one we built. Now we use Spark to analyze the data and because we don’t need random access on entire data, we don’t use HBase anymore. All the data is stored in HDFS as parquet files which is a perfect match for Spark and batch processing. We use ElasticSearch to import small/medium datasets when we need random access, mostly useful for debugging and internal products.

[Valentina] How did you choose the different components? What were the main requirements to be fulfilled?

[Tudor] I read and researched at lot at the times about what it’s going into Big Data world. Also I’m part of a  passionate team and this helps a lot in our decisions.

The world of Big Data is more stable now than in the past years. So, if you are a beginner in this space and you have a Big Data use-case for your project, after few days of research you will see that all tutorials go through one or two technologies to choose from. Of course, if you need to set up an entire pipeline, things get complicated and maybe you would need few advices from someone with more experience.

Others important aspects are the community around the frameworks, resources (tutorials, books, stack-overflow) and its maturity. Majority of the big data technologies are working well when you analyze few gigabytes. The real challenge is when you give them to analyze tens of terabytes of data or even more. Until now I didn’t find the perfect technology to work the same on small and big data. Depends on everyone how deep you want to search and understand that technology.

[Valentina] When building an architecture which are – in your experience – the main things to consider when choosing different solutions. For example, choosing different Hadoop distributions: like Cloudera or Map-R?

[Tudor] When we started to work with Hadoop, if I remember correctly, there was only Cloudera which offered solutions to deploy and monitor your cluster and it wasn’t free (as is it now). So we initially decided to manually deploy and manage our cluster. And all was fine while we had only nine servers in our cluster and only few services to manage. The main advantage of the free solution was that it forced us to really understand how each service is working, the big disadvantage was that it took time, efforts and sometimes a dedicated team to manage it.

Jumping to nowadays, to deploy a Hadoop cluster become a really easy process. You can do it even after a few days of research. All of these thanks to the big players which entered into the big data world like Amazon, Cloudera, Hortonworks, Map-R, Microsoft and many others. At Telenav we are using Cloudera as a solution to deploy and monitor our cluster and we are pretty happy with it.

If you plan to start working with big data technologies, I strongly recommend to use the solutions of one of the above mentioned companies. I can’t comment if one is better than the others, since I have more experience with Cloudera. But I have friends from other companies which are happy with Hortonworks or AWS.

[Valentina] What is the best use case that you see for HDFS currently?

[Tudor] HDFS is a distributed filesystem thought to store large volume of data. The default block size is, if I remember correctly, 128 MB in the newest versions and this size is customizable. These characteristics make HDFS the perfect solution to store your data at a minimum cost and in the end to do some big analysis on it (batch processing). At Telenav we store as parquet files all our historical data into HDFS and use Spark to process it.  The book “Hadoop, The definitive guide” (HDFS chapter) , includes one of the best explanations regarding HDFS characteristics. I strongly recommend our readers to read it if they want to find out more about HDFS and others use-cases for it.

[Valentina] You are as well the main driver of Cluj big data community, can you tell me how do you see the take off or adoption of Big Data solutions/architectures in Cluj and as well Romania level?

[Tudor] I remember when I initiated the Big Data community in Cluj, three years ago, that only 2-3 companies were then playing with Hadoop. I used to use Linkedin to look for other people from Cluj with ‘hadoop’ skills and found very few people at the time. Now, there are definitely more companies in Cluj working with Big Data solutions, some of them also supporting our community nowadays. From meetups and workshops, I met a lot of people which are working with big data technologies and even more people which are interested in this field. Plus more and more companies started to work on machine learning also. I don’t know the level for Romania, but I assume that it’s a good one. I heard about companies working with big data technologies in Bucharest, Timisoara, Iasi, Brasov.

Interview by: Valentina Crisan @bigdata.ro