In preparation for the October 10th Bucharest Big Data Meetup, here’s a short interview with Cristina Grosu, Product Manager @ Bigstep, a company that has built a platform of integrated big data technologies such as Hadoop, NoSQL and MPP databases, as well as containers, bare-metal compute and storage, offered as a service. Cristina will be talking about “Datalake architectures seen in production” during the meetup, so we discussed the data lake concept and the solutions that can be part of its architecture.
[Valentina] What is a data lake, and how is it different from the traditional data warehouse present today in the majority of bigger companies?
[Cristina] Think of the data lake as a huge repository where data from very different and disparate sources can be stored in its raw, original format. One of the most important advantages of this concept comes from the lack of a predefined schema, data structure, format or type, which encourages the user to reinterpret the data and find new actionable insights.
Real-time or batch, structured or unstructured, files, sensor data, clickstream data, logs, etc. can all be stored in the data lake. As users start a data lake project, they can explore more complex use cases, since all data sources and formats can be blended together using modern, distributed technologies that scale easily at a predictable cost and performance.
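As a minimal illustration of that schema-on-read idea (the paths and the assumption of JSON-lines input are hypothetical), a PySpark job could land events untouched and only apply an interpretation when they are read back:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("schema-on-read").getOrCreate()

# Land raw clickstream events exactly as they arrive; no schema is imposed at write time.
raw = spark.read.text("hdfs:///landing/clickstream/2017-10-10/")
raw.write.mode("append").text("hdfs:///datalake/raw/clickstream/")

# Later, reinterpret the same bytes with a schema chosen at read time
# (this assumes the events were JSON lines).
events = spark.read.json("hdfs:///datalake/raw/clickstream/")
events.printSchema()  # schema inferred from the data, not fixed up front
```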
[Valentina] What (Big Data) solutions usually take part in a data lake architecture? Tell us a little bit about the generic architecture of such a solution.
[Cristina] The most common stack of technologies that I see in a data lake architecture is built around the Hadoop Distributed File System.
Since this is an open source project, it allows customers to build an open data lake that is extensible and well integrated with other applications running on-premises or in the cloud. Depending on the type of workload, the technology stack around the data lake might differ, but the Hadoop ecosystem is very versatile in terms of data processing (Apache Spark), data ingestion (StreamSets, Apache NiFi), data visualization (Kibana, Superset), data exploration (Zeppelin, Jupyter) and streaming solutions (Storm, Flink, Spark Streaming), all of which can be fairly easily interconnected to create a full-fledged data pipeline.
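A rough sketch of how those layers fit together, again with hypothetical paths and field names (`timestamp`, `page`): an ingestion tool such as NiFi lands raw JSON, Spark curates it, and notebooks or BI tools read the curated output.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lake-pipeline").getOrCreate()

# Ingest: raw events previously landed in HDFS (e.g. by NiFi or StreamSets).
events = spark.read.json("hdfs:///datalake/raw/clickstream/")

# Process: a simple daily aggregation, with Spark as the processing layer.
daily = (events
         .groupBy(F.to_date("timestamp").alias("day"), "page")
         .agg(F.count("*").alias("views")))

# Serve: write a columnar copy that notebooks (Zeppelin, Jupyter) and
# dashboards (Superset) can query efficiently.
daily.write.mode("overwrite").parquet("hdfs:///datalake/curated/daily_page_views/")
```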
What I have seen in production with a lot of our customers is the “continuously evolving data lake”, where users build their solution incrementally with a long-term vision in mind, expanding their data architecture when needed and leveraging new technologies when a new twist on the initial approach appears.
Other data lake architectures are based on NoSQL databases (Cassandra), or use Hadoop as an extension of existing data warehouses in conjunction with other BI-specific applications.
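For the NoSQL variant, a minimal sketch with the Python cassandra-driver might look like this (the cluster address, keyspace and table are all illustrative):

```python
from datetime import datetime
from cassandra.cluster import Cluster

# Connect to a hypothetical Cassandra cluster backing the lake.
cluster = Cluster(["10.0.0.1"])
session = cluster.connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS lake
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 3}
""")

# Raw sensor readings, keyed for time-series access patterns.
session.execute("""
    CREATE TABLE IF NOT EXISTS lake.sensor_events (
        sensor_id text,
        ts timestamp,
        payload text,
        PRIMARY KEY (sensor_id, ts)
    )
""")

# Payloads are stored as-is (here, a JSON string), in data lake fashion.
session.execute(
    "INSERT INTO lake.sensor_events (sensor_id, ts, payload) VALUES (%s, %s, %s)",
    ("sensor-42", datetime.utcnow(), '{"temp": 21.5}'),
)
```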
Regardless of the technology stack, it’s important to design a “forward-looking architecture” that enables data democratization, data lineage tracking, data audit capabilities, real-time data delivery and analytics, metadata management and innovation.
[Valentina] What are the main requirements of a company that would like to build a data lake architecture?
[Cristina] The truth is that in big data projects, the main requirement is to have a driver in the company: someone with enough influence and experience who can sell the data lake concept internally and work together with the team to overcome all obstacles.
At the beginning of the project, building up the team’s big data skills might be necessary to make sure everyone is up to speed and to create momentum within the organization, whether through training, hiring or bringing in consultants.
Oftentimes, this leader will be challenged with implementing shifts in mindsets and behaviors regarding how the data is used, who’s involved in the analytics process and how everyone is interacting with data. As a result, new data collaboration pipelines will be developed to encourage cross-departmental communication.
To ensure the success of a data lake project and turn it into a strategic move, besides the right solution, you need a committed driver in your organization and the right people in place.
[Valentina] From the projects you’ve run so far, which do you think are the trickiest points to tackle when building a data lake solution?
[Cristina] Defining strategies for metadata management is quite difficult, especially for organizations that have a large variety of data. Dumping everything into the data lake and trying to make sense of it six months into the project can become a real challenge if there is no data dictionary or data catalog that can be used to browse the data. I would say data governance is an important thing to consider when developing the initial plan, even though organizations sometimes overlook it.
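One lightweight way to start on such a catalog, sketched here with Spark and the Hive metastore (the database, table and location are hypothetical, continuing the earlier example):

```python
from pyspark.sql import SparkSession

# enableHiveSupport() lets Spark register tables in the Hive metastore,
# which then serves as a minimal data catalog over files in the lake.
spark = (SparkSession.builder
         .appName("lake-catalog")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("CREATE DATABASE IF NOT EXISTS curated")

# Register already-landed Parquet files as an external table, so analysts
# can discover and query the data by name instead of by HDFS path.
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS curated.daily_page_views (
        day DATE,
        page STRING,
        views BIGINT
    )
    STORED AS PARQUET
    LOCATION 'hdfs:///datalake/curated/daily_page_views/'
""")

# The catalog can now be browsed rather than guessed at.
spark.sql("SHOW TABLES IN curated").show()
```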
In addition, it’s important to have the right skills in the organization, understand how to extract value from data and of course figure out how to integrate the data lake into existing IT applications if that is important for the use case.
[Valentina] Any particular type of business that would benefit more from building a data lake?
[Cristina] To be honest, the data lake is such a universal solution that it can be applied to all business verticals with relatively small differences in the design of the solution. The specificity of each data lake comes from the industry’s business processes and the business function it serves.
The greatest data lake adoption is observed among the usual suspects: retail, marketing, logistics and advertising, where the volume and velocity of data grow day by day. Very interesting data lake projects can also be seen in banking, energy, manufacturing, government and even Formula 1.
[Valentina] In your opinion, will the data lake concept change the way companies work with data?
[Cristina] Easier access will broaden the audience for the organization’s data and drive the shift towards self-service analytics, powering real-time decision making and automation.
Ideally, each person within a company would have access to a powerful dashboard that can be customized and adapted to run data exploration experiments and find new correlations. Once you find better ways to work with data, you can better understand how trends are going to affect your business and which processes can be optimized, or you can simulate different scenarios and understand how your business is going to perform.
The data lake represents the start of this mindset change, but everything will materialize once adoption increases and the technologies mature.
Interview by: Valentina Crisan