What is data science? (And what is it not?)
I just left an interview where they asked me the same question. After reading the other 41 answers, I will try to give a simpler and more correct one:
It is a bit of a misnomer and a buzzword that the media uses to describe everything. However, it’s good to have this discussion so we can come to an agreement.
The question is about Data science, so I will not talk about Data Scientists. Go to What is a data scientist? if you are interested.
The biggest error that I found in most of the answers was some version of “Data Science is when you are dealing with Big Data, large amounts of data”. That is not true. Data Science can be applied to a data set with one thousand lines; there is no problem with this.
If we are going to call it “science”, we need to consider the definitions of Science and the Scientific Method. According to these, Data Science is not only about practical or empirical methods; it needs scientific foundations.
No one talked about the difference between Data and Information.
Data is a raw, unorganized set of things that needs to be processed to have meaning.
Information is data that has been processed, organized, structured or presented in a given context so as to make it useful.
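To make the distinction concrete, here is a minimal sketch in Python. The city names, the temperature readings and the `summarize` helper are all made up for illustration; the point is only that the raw list is data, and the grouped averages are information.

```python
from collections import defaultdict
from statistics import mean

# Raw data: hypothetical, unorganized temperature readings (city, degrees Celsius).
raw_data = [
    ("Lisbon", 21.0),
    ("Oslo", 9.5),
    ("Lisbon", 23.0),
    ("Oslo", 10.5),
    ("Lisbon", 22.0),
]

def summarize(readings):
    """Group readings by city and return the mean temperature per city."""
    by_city = defaultdict(list)
    for city, temperature in readings:
        by_city[city].append(temperature)
    return {city: mean(values) for city, values in by_city.items()}

# Information: the same data, processed and organized so that it is useful.
information = summarize(raw_data)
print(information)
```

Nothing new was measured here. The same five readings simply became something a person can act on.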
Based on this, we would have Data science and Information science. Right now, people tend to use Data science to cover Information science as well.
It has clearly been used in many fields for years:
The craziest part is that you see professionals in these areas updating their resumes with something like “I worked with Data Science…”
How data science came about, put simply: two sides that were not fully connected had to merge in the new fast-paced, technological world:
Statistics/mathematics: formulate proper models to generate insights.
Computer science: bridge the models and the data so that results come in a feasible time.
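As a rough sketch of how the two sides meet, the example below fits a straight line by ordinary least squares. The model (a line that minimizes squared error) comes from statistics; the few lines of Python that compute it are the computer science side. The data set, the variable names and the `fit_line` helper are assumptions invented for this illustration.

```python
from statistics import mean

# Hypothetical small data set: hours studied vs. exam score.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [52.0, 54.0, 56.0, 58.0, 60.0]

def fit_line(xs, ys):
    """Return (slope, intercept) of the ordinary least squares line through the points."""
    x_bar, y_bar = mean(xs), mean(ys)
    # Sum of cross-deviations of x and y, and sum of squared deviations of x.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

slope, intercept = fit_line(hours, scores)
print(f"score = {slope:.1f} * hours + {intercept:.1f}")
```

With five rows the computation is trivial. With millions of rows the model stays the same, and the engineering needed to compute it in a feasible time becomes the hard part.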
Topics/tools that a person needs to understand or have some knowledge of when working with Data Science:
Programming languages (R, Python, SAS)
Software: Excel, SPSS by IBM
General platforms: Watson Analytics by IBM, Azure Machine Learning, Google Cloud machine learning
Data visualizations: Power BI, Tableau, R/Python using plotly/ggplot
Machine Learning (supervised, unsupervised and reinforcement learning)
Big Data Frameworks (Hadoop and Spark)
Hardware (CPU, GPU, TPU, FPGA, ASIC)
One picture is worth ten thousand words: Drew Conway’s Data Science Venn Diagram. Substantive expertise (or domain expertise) is the specific knowledge of the area in which you are applying Data Science. To learn more about the lack of substantive expertise in data science: What's Missing in Data Science Talks - As Risky As It Gets
WHAT IT IS NOT
Machine Learning is not a branch of Data science. Machine Learning originated from Artificial Intelligence. Data science only uses ML as a tool, because it produces amazing and autonomous results for specific tasks.
It’s not the salvation of companies that never measured anything and now want to get insights from their data. “Garbage in, garbage out”: Data science will only be as good as the data generated in the following years.
It’s not just presenting data in some Excel charts without any insight about the data.