Distinguishing Data Science from Other Fields

Business Intelligence

BI systems are associated with two concepts. One is the data warehouse; the other is a set of dashboards or reports that consume data from the warehouse and are used to answer particular questions. Both components require a lot of upfront effort to design and build, and are therefore not very adaptable when requirements change. A software stack designed for business intelligence may or may not be appropriate for a given data science problem, where changing requirements are considered the norm. This is part of why a new term is warranted: business intelligence became associated with a particular approach to a particular set of problems, and data science is in some sense broader. In addition, BI engineers are not typically expected to consume their own data products, perform their own analysis, and make the business decisions themselves. Usually they build tools for others to make decisions with. As a data scientist, you will be doing both.

Statistics

Statistical methods are at the heart of what a data scientist does day to day. But a statistician will typically be comfortable assuming that any data set they encounter will fit in main memory on a single machine. This makes sense, because the whole field was born out of the need to extract the most information possible from a very sparse, very expensive to collect, and therefore very small data set. That is not always the problem anymore. As we shift from a data-poor regime to a data-rich regime, the challenge moves from needing new mathematics to squeeze information out of a small data set to simply being able to handle and process very large data sets.
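
To make the memory constraint concrete, here is a minimal sketch of computing a mean and sample variance in a single pass, without ever holding the whole data set in memory. It uses Welford's online algorithm, which is a standard technique chosen here for illustration and is not named in the text above. The function name and the sample values are hypothetical.

```python
from typing import Iterable, Tuple


def streaming_mean_variance(values: Iterable[float]) -> Tuple[int, float, float]:
    """Return (count, mean, sample variance) using one pass and O(1) memory.

    Illustrative sketch of Welford's online algorithm; `values` can be any
    iterable, including a generator that reads records from disk.
    """
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the current mean
    for x in values:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)
    variance = m2 / (count - 1) if count > 1 else 0.0
    return count, mean, variance


# Hypothetical usage: the generator yields one value at a time, so the
# full sequence is never materialized as a list.
count, mean, variance = streaming_mean_variance(float(i) for i in range(1, 6))
```

Because the function only consumes an iterable, the same code would work whether the values come from a small list or from a file far larger than main memory.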

Data/Database Management

Database experts, programmers, and administrators bring a lot of skills to the table that make them appropriate for data science tasks, but their focus is on a particular data model, usually the relational data model: rows and columns. When data comes from sources such as video, audio, text, or to some extent even graphs of nodes and edges, a relational database may or may not be the right tool, and even the concepts that transcend any particular database system may or may not be appropriate.
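
As one illustration of the graph case, the sketch below starts from edges stored the relational way, as rows of (source, destination), and answers a reachability question by first reshaping them into an adjacency structure and then traversing it. The edge data and the function name are hypothetical, and this is only one of several ways to handle graph data.

```python
from collections import defaultdict, deque

# Hypothetical edge table: each row is (source, destination),
# the way a graph would typically be stored in rows and columns.
edge_rows = [("a", "b"), ("b", "c"), ("c", "d"), ("x", "y")]


def reachable(rows, start):
    """Return the set of nodes reachable from `start` (including itself)."""
    # Reshape the rows into an adjacency list keyed by source node.
    adjacency = defaultdict(list)
    for src, dst in rows:
        adjacency[src].append(dst)

    # Breadth-first traversal over the adjacency list.
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen


nodes_from_a = reachable(edge_rows, "a")
```

The traversal follows paths of arbitrary length, a question that is more natural to express over nodes and edges than over rows and columns.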

Visualization

Visualization experts also bring a lot of skills to the table. But, like statisticians, they have historically been less concerned with massive-scale data.

Machine Learning

Machine learning is perhaps the closest field to data science. But as a proportion of the time you spend on a data science problem, actually choosing the right model or algorithm, applying it, and running it is a fairly small fraction. You will spend much more time preparing, manipulating, cleaning, and wrangling the data. And for this, machine learning techniques are not particularly relevant.
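
The kind of preparation work described above might look like the following minimal sketch, which normalizes inconsistent text, parses numbers stored as strings, and drops records that cannot be repaired. The records, field names, and cleaning rules are all hypothetical; real data sets need rules chosen for their own quirks.

```python
from typing import Dict, List, Optional

# Hypothetical raw records with typical problems: stray whitespace,
# inconsistent capitalization, missing values, and non-numeric entries.
raw_rows = [
    {"city": " Seattle ", "temp_f": "61.5"},
    {"city": "seattle", "temp_f": "n/a"},
    {"city": "PORTLAND", "temp_f": " 58 "},
    {"city": "", "temp_f": "70"},
]


def parse_float(text: str) -> Optional[float]:
    """Return the number in `text`, or None if it cannot be parsed."""
    try:
        return float(text.strip())
    except ValueError:
        return None


def clean(rows: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Normalize city names, parse temperatures, and drop unusable rows."""
    cleaned = []
    for row in rows:
        city = row["city"].strip().title()
        temp = parse_float(row["temp_f"])
        # Keep only rows that have both a city and a usable temperature.
        if city and temp is not None:
            cleaned.append({"city": city, "temp_f": temp})
    return cleaned


cleaned_rows = clean(raw_rows)
```

None of this involves a model or a learning algorithm, yet a model trained on the raw records would see "Seattle" and "seattle" as different cities and could not use the temperature strings at all.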

What "data science" tells us:

  • If you are a DBA, you need to learn to deal with unstructured data.
  • If you are a statistician, you need to learn to deal with data that does not fit in memory.
  • If you are a software engineer, you need to learn statistical modeling and how to communicate results.
  • If you are a business analyst, you need to learn about algorithms and trade-offs at scale.