What is Data Science?
(As of 22 July, 2016) I’ve just left an interview where they asked me the same question. After reading the other 41 answers, I’ll try to give a simpler and more accurate one.
WHAT IS DATA SCIENCE?
The first time I heard the term “Data Science” was in Toronto, in 2015. Back then, I really thought that Machine Learning and Data Science were some kind of universal panacea for the world’s biggest problems. OMG, I was so fuc&^*%% wrong.
Our endeavor in this post will be to define and understand Data Science, so let’s get some perspective. I have a background in mechatronics and mechanical engineering. Mechatronics is basically an attempt to understand every piece of engineering, from electronics to robotics, from mechanics to computing and so on. You can imagine that I’m familiar with the struggles of Data Science. However, with Mechatronics I have an older and more intimate story. Everything started with my technical course in 2007. I thought that it was the universal panacea too. I’ll leave it up to you to guess whether I was right or not.
DS vs Mecha (Nicknames = s2) Data Science versus Mechatronics
Ref.: Google Trends
From this graph you can see at a glance that Data Science (red line) was not that famous in 2007, but surprisingly: “It was there! Wow!”. Yeah, I know.
First conclusion: Data Science on its own is older than what they are trying to make you believe.
Stay with my train of thought towards our second conclusion.
Now you know that the term “Data Science” was coined before 2004. (Unfortunately, I only have Google Trends data from 2004 onward, and finding out when the first mention or a meaningful number of searches happened would be a post on its own.)
Now you may be wondering:
How could people live 30 years ago without Data Science?
The answer is simple: they lived.
And here it goes:
Second conclusion: Data Science was always there, but without all the buzz, the officially coined term and the rebranding. (I say always because we’re not talking about cavemen here, ok?)
After these two main conclusions, let’s get to something more dynamic. A picture is worth a thousand words (think of data visualizations).
What about a video? Take one minute of your life and watch the video below (I promise it will be worth it).
Yann LeCun was recognizing handwritten digits in 1993 (I was 2 years old back then lol). This was not only about Machine Learning: it also required a specific database (MNIST) to train the model (not going to be technical here, more at: LeCun’s Demos).
Nowadays, if you needed to create MNIST again from the NIST Special Database, build a model to recognize digits, prototype the whole process into a product and present it to C-suite executives, you would probably consider hiring a Data Scientist.
Data Scientists work with data. That’s why we have Data Science. Simple enough.
What is Data Science?
It is a little bit of a misnomer and a buzzword that the media uses to describe everything. However, it’s good to have this discussion to come to an agreement;
The question is about Data Science, so I will not talk about Data Scientists. Go to What is a data scientist? if you are interested in that answer;
The biggest errors that I found in most of the answers were along the lines of:
“Data Science is when you are dealing with Big Data, large amounts of data”.
That is not true. Data Science can be applied to a data set with one thousand rows; there is no problem with this.
“Data Science only exists coupled with Machine Learning”.
That is not true. What about decades of statistical and mathematical modeling? It’s true that Machine Learning empowers and automates Data Science, but it’s not 100% necessary in 100% of use cases.
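To make both points concrete, here is a minimal sketch of classic statistical modeling on a tiny data set: an ordinary least squares line fitted by hand, with no Machine Learning library and no Big Data. The numbers and the names `fit_line`, `ad_spend` and `sales` are made up for this illustration; they are not from any real project.

```python
# Illustrative only: the data below is invented for this sketch.
def fit_line(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of squared deviations of x, and of co-deviations of x and y.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Five hypothetical observations are enough to fit a line.
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.1, 4.9, 7.2, 8.8, 11.0]

slope, intercept = fit_line(ad_spend, sales)
print(f"sales ~ {slope:.2f} * ad_spend + {intercept:.2f}")
```

The same arithmetic works on one thousand rows as well as on five, which is the whole point: the model comes from statistics, and the computer only makes it fast.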
If we are going to call it “science”, we need to consider the definitions of science and of the scientific method. According to these, Data Science is not only about practical or empirical methods; it needs scientific foundations;
No one talked about the difference between Data and Information.
Data is a raw, unorganized set of things that needs to be processed to have meaning;
That’s why I agree with the article Data Is Not The New Oil
Raw data is like raw intelligence: useless.
Information is when data is processed, organized, structured or presented in a given context so as to make it useful;
Based on this, we would have Data Science and Information Science. Right now, people tend to talk about Data Science as if it included Information Science.
“Information Science” seems to be the more appropriate term, but it’s too late to go back.
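A small sketch of the data versus information distinction above, assuming a hypothetical list of raw sales records. The store names, the amounts and the `summarize` helper are invented for illustration only.

```python
from collections import defaultdict

# Hypothetical raw data: unorganized (store, amount) records.
raw_sales = [
    ("north", 120.0),
    ("south", 80.0),
    ("north", 60.0),
    ("south", 40.0),
    ("east", 100.0),
]


def summarize(records):
    """Organize raw records into the total and average sale per store."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for store, amount in records:
        totals[store] += amount
        counts[store] += 1
    return {
        store: {"total": totals[store], "average": totals[store] / counts[store]}
        for store in totals
    }


# Information: the same data, processed and structured in a context.
information = summarize(raw_sales)
for store, stats in sorted(information.items()):
    print(store, stats)
```

The list of tuples on its own tells you very little. The per-store summary is what someone can actually act on.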
Data Science was clearly already in use in a lot of fields in past years:
Business intelligence - Market intelligence
The craziest part is that you see professionals in these areas updating their resumes with something like “I worked with Data Science back then in 199X”
The creation of Data Science in simple words: two sides that were not totally connected had to merge in the new fast-paced, technological world.
Statistics/mathematics: formulate proper models to generate insights;
Computer science: bridge the models and the data so that results arrive in a feasible time;
Only two sides because Machine Learning is all based on math and stats;
Theoretical computer science could be considered a branch of mathematics;
Topics/tools that a person needs to understand or have some knowledge of when working with Data Science:
Non-linear systems, dynamic systems
Statistics and probability
Software: Excel, IBM SPSS, SAS Enterprise Miner
General DS & MLaaS platforms:
IBM Watson Studio & Analytics
Azure Machine Learning
Google Cloud Machine Learning
RapidMiner and KNIME
Data visualizations: Power BI, Tableau, R/Python using plotly/ggplot/highcharts
Machine Learning (supervised, unsupervised and reinforcement learning)
Big Data (MapR, Redshift, Snowflake, BigQuery, Cassandra, Hadoop, Spark)
Hardware (CPU, GPU, TPU, FPGA, ASIC)
Drew Conway’s Data Science Venn Diagram. Substantive expertise (or domain expertise) is the specific knowledge of the area in which you are applying Data Science. To know more about the lack of substantive expertise in data science: What's Missing in Data Science Talks - As Risky As It Gets
[2018, Update] I used to believe in the Danger Zone, but I don’t think that it makes sense now. Think of a business analyst who creates all the SQL queries to get simple KPIs and updates a company-wide dashboard. He would use very little math knowledge, and most of the work would be business logic + computer science (aka Hacking Skills on the diagram). I’ll create my own diagram in the next update to reflect this new mindset.
WHAT IT IS NOT
Machine Learning is not a branch of Data Science. Machine Learning originated from Artificial Intelligence. Data Science only uses ML as a tool, because it produces amazing and autonomous results for specific tasks;
It’s not the salvation of companies that never measured anything and now want to get insights from their data. “Garbage in, garbage out”: Data Science will only be as good as the data generated in the years following the initial Data Science efforts. This can be mitigated by a legacy data migration;
It’s not just presenting data in some Excel charts without any insight about the data. That would be descriptive analytics;
I’ll finish my answer with all the types of analytics that, together, come closer to encompassing the definition of Data Science:
Have a safe journey!
[Update: 2018–02–17] I’ll be going through these 101+ answers (of which 18 are collapsed) to update my answer someday. There are some really good answers on this question, but I personally don’t recommend taking advice from people who aren’t researchers, professors or professionals. These people are also known as aficionados, enthusiasts, 190+ IQ etc.
[Curiosity] If you analyse the most famous diagram that defines mechatronics, you are probably going to see some similarities. Humans, when faced with complex problems, tend to be predictable (e.g. they create diagrams to explain things to others).