What is Data Science?

Some quotes from around the web:

In Fortune Magazine

  • - "Data Science being the Hot New Gig in Tech"

Hal Varian, Google's Chief Economist, New York Times, 2009

  • – “Statistics being the next sexy job”
  • – “The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill.”

Mike Driscoll, CEO of Metamarkets:

  • – “Data science, as it's practiced, is a blend of Red-Bull-fueled hacking and espresso-inspired statistics.”
  • – “Data science is the civil engineering of data. Its acolytes possess a practical knowledge of tools & materials, coupled with a theoretical understanding of what's possible.”

Drew Conway's Data Science Venn Diagram

Another perspective on Data Science that you should be familiar with is this Venn diagram, made several years ago by Drew Conway.

Drew Conway's Data Science Venn Diagram

His point was that Data Science is a mix of three different areas.

  • One is the hacking skills - programming expertise.
  • Another is the academic view - Math & Statistics knowledge.
  • The third, which he added, is the notion of substantive expertise. By this he meant a deep involvement with the data: you participate in tool building, but you also dive deeply into the data and do the analysis.

What do data scientists do?

Some more quotes here:

“They need to find nuggets of truth in data and then explain it to the business leaders”

Richard Snee, EMC

Data scientists “tend to be “hard scientists”, particularly physicists, rather than computer science majors. Physicists have a strong mathematical background, computing skills, and come from a discipline in which survival depends on getting the most from the data. They have to think about the big picture, the big problem.”

DJ Patil, Chief Scientist at LinkedIn

Three sexy skills of data geeks

Statistics

  • - traditional analysis

Data Munging

  • - parsing, scraping, and formatting data. This means parsing and scraping data from the web, converting between file formats efficiently, and not getting hung up on the kind of friction that you deal with when working with large and heterogeneous data sets. A data scientist is someone who is very comfortable in that environment and can work nimbly even when things are not very clean.
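
As a small illustration of this kind of munging, the sketch below parses a deliberately messy, made-up CSV snippet using only the Python standard library, normalizes it, and converts it to JSON. The sample data, the column names, and the helper name clean_rows are hypothetical and chosen only for illustration.

```python
import csv
import io
import json

# Hypothetical messy input: stray whitespace, inconsistent case,
# a missing value, and an empty line.
RAW = """name, city ,visits
 Alice ,new york, 12
BOB,Boston,

carol , CHICAGO ,7
"""


def clean_rows(text):
    """Parse CSV text and return a list of normalized dict records."""
    reader = csv.reader(io.StringIO(text))
    header = [col.strip().lower() for col in next(reader)]
    records = []
    for row in reader:
        cells = [cell.strip() for cell in row]
        if not any(cells):
            continue  # skip empty or whitespace-only lines
        record = dict(zip(header, cells))
        record["name"] = record["name"].title()
        record["city"] = record["city"].title()
        # Keep a missing count as None instead of guessing a value.
        record["visits"] = int(record["visits"]) if record["visits"] else None
        records.append(record)
    return records


records = clean_rows(RAW)
print(json.dumps(records, indent=2))
```

Real data sets are far larger and messier than this, but the pattern is the same: read whatever format you are given, normalize it, and hand it on in a format the next tool can use.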

Visualization

  • - graphs, tools, etc. The ability to communicate the results through visualization.

Data Science

Here is another quote, from Jeffrey Stanton, who teaches a course in data science at Syracuse and has been involved in one of the earlier programs in data science.

“Data Science refers to an emerging area of work concerned with the collection, preparation, analysis, visualization, management and preservation of large collections of information.”

Jeffrey Stanton

One interesting thing about his perspective is that it includes the word preservation. Preparation, analytics, and visualization are the three tenets that you see quite often, but Stanton goes one step further and talks about preservation even after you're done communicating the results: what do you do with the data long term?

Another quote comes from Hilary Mason, a thinker in Data Science and the Chief Scientist at bit.ly.

“A data scientist is someone who can obtain, scrub, explore, model and interpret data, blending hacking, statistics and machine learning. Data scientists not only are adept at working with data, but appreciate data itself as a first-class product.”

Hilary Mason, chief scientist at bit.ly

The first sentence of this quote echoes Drew Conway's Venn diagram. The second, that data scientists are not only adept at working with data but also appreciate data itself as a first-class product, is about being able to organize the data and actually produce something that is usable by other people.

Three types of tasks

To summarize, there are perhaps three overarching tasks involved in Data Science:

  • 1) Preparing to run a model
    Gathering, cleaning, integrating, restructuring, transforming, loading, filtering, deleting, combining, merging, verifying, extracting, shaping, massaging
  • 2) Running the model
  • 3) Communicating the results
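
The three tasks can be sketched end to end in a few lines. The example below is a toy: it assumes made-up (x, y) observations and uses an ordinary least-squares line as a stand-in for "the model". The function names prepare, run_model, and communicate are hypothetical and only mirror the three steps.

```python
# Hypothetical raw observations as (x, y) string pairs; some are unusable.
RAW = [("1", "2.1"), ("2", "3.9"), ("3", ""), ("4", "8.2"), ("x", "5.0"), ("5", "9.8")]


def prepare(raw):
    """Step 1: clean the data by dropping pairs that do not parse as numbers."""
    clean = []
    for x, y in raw:
        try:
            clean.append((float(x), float(y)))
        except ValueError:
            continue
    return clean


def run_model(points):
    """Step 2: fit y = slope * x + intercept by ordinary least squares."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def communicate(slope, intercept, n):
    """Step 3: turn the fitted numbers into a sentence a reader can use."""
    return (
        f"Based on {n} clean observations, y rises by about "
        f"{slope:.2f} per unit of x (intercept {intercept:.2f})."
    )


points = prepare(RAW)
slope, intercept = run_model(points)
print(communicate(slope, intercept, len(points)))
```

Even in this toy, the preparation step has to decide what to do with unusable records before any modeling can happen, which is why the list of verbs under the first task is so long.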

Data Science is about Data Products

Data science is about building data products, not just answering questions.

Data products empower others to use the data.

They may help communicate your results (e.g., Nate Silver’s maps).

They may empower others to do their own analysis (e.g., Global Burden of Disease).

Data Product Examples:
  • “Data-driven apps”
    • – Spellchecker
    • – Machine Translator
  • Interactive visualizations
    • – Google flu application
    • – Global Burden of Disease
  • Online Databases
    • – Enterprise data warehouse
    • – Sloan Digital Sky Survey