When talking about data science, two recurring questions arise: what does data science actually encompass, and how should someone get started? In April we will start a series of events on bigdata.ro that aims to get you acquainted with the usual steps of a data science project, as well as with one of the programming languages that can be used in such projects: Python. We will debut on April 3rd with an easy introduction to Python and its libraries for data science, and continue on April 17th with an introduction to machine learning. Both workshops break the learning and the hands-on work down into the steps you should take when starting such a process, and I asked the trainers of the respective sessions, Maria and Ionut, a few questions about data science and their sessions.
From your point of view, where does the data science process start?
- Our purpose is to solve business problems, so data science starts with a deep and complete understanding of those problems, continues with filtering out the ones we can actually solve, and then makes sure we’re finding the right solutions to the right problems. We also need to test these solutions continuously to validate our work and to predict and prevent any drift once our solutions are in production (and there was a survey that said that around 80% of DS/ML projects never even make it into production).
- And since it is a multidisciplinary field, combining statistics, computer science, coding, data engineering, and domain knowledge for whichever field it is applied in, it should always start with a team that can put together an optimal mix of all these disciplines based on each team member’s own abilities.
- It’s very rare, if not impossible, for just one person to know it all and do it all, and we shouldn’t aim for that, but rather for a T-shaped knowledge model for each data scientist/engineer and the above mix for every data science team.
- Data science is about learning from data. It starts with a question that cannot be answered with fixed if-else rules, but whose answers can be found by looking through your business data and discovering patterns.
- In order to discover those patterns and gain knowledge from data, you need to apply a collection of tools and skills from business analysis to statistics, machine learning and software engineering… and this is data science.
What are the main steps in a data science project?
- The first step must always be to completely understand what we want to achieve, what the deliverables of the project are, what data we have available or need (and in that case, how do we get or create it), and to validate this understanding with the stakeholders.
- Then getting, extracting or creating the data, and exploring it: understanding/evaluating the structure/schema and its components in detail, checking it for errors and missing elements, and seeing if it has the necessary features for the full scope of our data science project (that is, can we draw our conclusions from the existing features? Can we create other, more useful features out of them? Can we join this dataset with another one so that we have what we need?).
- Then we formulate hypotheses and we validate them with statistical tests.
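As a minimal sketch of that hypothesis-testing step, here is how a two-sample t-test might look with SciPy. The scenario and all numbers are invented for illustration; real data would replace the simulated groups:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Hypothetical example: weekly spend of two user groups (simulated data)
group_a = rng.normal(loc=50.0, scale=5.0, size=200)  # e.g. control group
group_b = rng.normal(loc=52.0, scale=5.0, size=200)  # e.g. treatment group

# Two-sample t-test: the null hypothesis is "the group means are equal"
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t={t_stat:.2f}, p={p_value:.4f}")
if p_value < 0.05:
    print("Reject the null hypothesis: the group means likely differ")
```

The choice of test (t-test, chi-squared, and so on) depends on the hypothesis and the distribution of the data; this is just one common case.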
- Understand the business needs
- asking questions about the problem to be solved and translating it into machine learning tasks. For example: do we want to predict a continuous value like the weekly stock price? Or do we want to mark an email as spam or not? For the first question we translate our problem into a regression task (predicting a continuous value), while for the second we need to look at classification tasks.
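The regression-versus-classification framing maps directly onto estimator families in a library like scikit-learn. A toy sketch with invented numbers (not the workshop material itself):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

# Tiny invented dataset: one feature, four observations
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y_continuous = np.array([10.0, 20.0, 30.0, 40.0])  # e.g. a price
y_labels = np.array([0, 0, 1, 1])                  # e.g. spam / not spam

# Regression: predict a continuous value
regressor = LinearRegression().fit(X, y_continuous)
print(regressor.predict([[5.0]]))   # a continuous number

# Classification: predict a discrete label
classifier = LogisticRegression().fit(X, y_labels)
print(classifier.predict([[5.0]]))  # a class label
```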
- Data exploration
- Once the problem is defined, it’s important to look at the data: what data is available, and what is the cost of getting the necessary data? At this step we create data visualisations, and as we progress in our data exploration, we might realise we need to go back to the business understanding stage and redefine the problem to be solved.
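A first look at a dataset often starts with a few pandas one-liners. The column names and values below are hypothetical, chosen only to show missing values surfacing during exploration:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset, invented for illustration
df = pd.DataFrame({
    "weekly_price": [10.5, 11.2, np.nan, 12.8, 13.1],
    "volume": [100.0, 150.0, 120.0, np.nan, 180.0],
})

print(df.describe())    # summary statistics per column
print(df.isna().sum())  # count of missing values per column
print(df.dtypes)        # column types / schema check
```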
- Data preparation
- transformations like removing or filling missing values, scaling, and turning text values into numerical ones. Also splitting the data into train and test datasets.
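Those preparation steps could look roughly like this in scikit-learn, on a tiny invented dataset. Note that the imputer and scaler are fitted on the training split only, then applied to both splits:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy numeric feature matrix with a missing value (invented data)
X = np.array([[1.0, 200.0], [2.0, np.nan], [3.0, 240.0], [4.0, 260.0]])
y = np.array([0, 0, 1, 1])

# Split the data into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

imputer = SimpleImputer(strategy="mean")  # fill missing values
scaler = StandardScaler()                 # scale features

# Fit on the training data only, then transform both splits
X_train = scaler.fit_transform(imputer.fit_transform(X_train))
X_test = scaler.transform(imputer.transform(X_test))
```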
- Model creation
- after the data is prepared, we can use it to train our model.
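Training itself is often the shortest piece of code in the project. A sketch using a synthetic dataset as a stand-in for prepared data (the data and model choice are illustrative, not prescribed):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic dataset standing in for the prepared training data
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Fit (train) the model on the prepared data
model = LogisticRegression().fit(X, y)
print(model.predict(X[:5]))  # predictions for the first five rows
```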
- Model evaluation
- at this stage we need to determine whether the model can make good predictions in the real world. So we test our model on a dataset it has never seen before and check the results.
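Evaluating on unseen data might be sketched like this, again with a synthetic dataset in place of real project data; accuracy is just one possible metric, picked here for simplicity:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=6, random_state=1)

# Hold out data the model never sees during training
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

model = LogisticRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)  # score on the held-out set only
print(f"held-out accuracy: {accuracy:.2f}")
```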
- Creating the deployment pipeline and deploying the model to production
- when we reach this step, we have a model that is ready to go into production, and we need to integrate it into a deployment pipeline that applies the same data preparation steps as when the model was trained, generates predictions on new data, and then compares the predictions with the real values when these become available.
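One common way to guarantee that serving applies exactly the training-time preparation is to bundle preprocessing and model into a single scikit-learn Pipeline. A minimal sketch on invented data:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Preprocessing and model packaged together, so the same steps
# run at training time and at prediction time
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("model", LogisticRegression()),
])

# Invented training data with a missing value
X_train = np.array([[1.0, 10.0], [2.0, np.nan], [3.0, 30.0], [4.0, 40.0]])
y_train = np.array([0, 0, 1, 1])
pipeline.fit(X_train, y_train)

# At serving time, new data goes through the identical preparation
X_new = np.array([[3.5, np.nan]])
print(pipeline.predict(X_new))
```

In practice such a pipeline would be serialized and deployed as one artifact, which avoids training/serving skew in the preparation steps.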
Ionut: what will you focus on in the “Intro to Python for Data Science” workshop? What are the points you would like everybody to leave the Python workshop with?
I have built it upon the main advantages of Python: it’s a high-level programming language, with lots of incredible and useful abstractions at its core, easy to learn, flexible, with a focus on simplicity and making code instantly understandable by both machines and humans, including your future self, who will thank you for thinking of her/him. If you keep that as a guiding philosophy while learning, the process will be faster and more satisfying.
Also, it’s not a general Python course, but one focused at every step on using Python from a data scientist’s perspective and in DS projects. We won’t go into other possible uses of Python outside DS, nor into the details of machine learning, which is anyway the next workshop in the series.
Maria: What is the most important part in building a ML model? What about the one that takes most time?
It’s very important to understand the business requirements and define the question(s) that need to be answered, then structure the problem in terms of ML tasks. This is where it’s essential to communicate with the stakeholders and ask the right questions about the use cases, then understand the limitations of the problem. Sometimes it’s necessary to go back and forth and refine our understanding of the business requirements, even change the initial formulation of the problem.
Most of the time is usually spent on the first three steps: understanding the business, exploring the data and preparing it for the model. This is not a one-way process, so it might take a few iterations before arriving at a good problem formulation and obtaining the necessary data.