The Multiple Aspects of Data Science

Earlier this month, Nathan Yau at FlowingData posted Mike Loukides’ analysis of data science from O’Reilly Radar. I finally found some time to read it.

I really enjoyed the post. The author titled it “What is data science?” and covered the various aspects of this new field, primarily from a commercial point of view. He examined what data science is, where data comes from, working with data at scale, making data tell its story, and data scientists. His conclusion is that the “future belongs to the companies and people that turn data into products”.

I thought he did an excellent job of discussing that particular field. The one aspect he did not mention as being part of data science has to do with my field: managing the data over the indefinite long-term. The long-term storage and accessibility of the data is just as important as how you use it and what you find in it. But then again, I’m biased. I would like data scientists to examine the following questions as part of their job.

  • What do you keep?
  • For how long?
  • What do you throw out?
  • Can you legally throw it out?
  • If not, how do you provide access to it for the indefinite long-term?

My point is that “keep everything”, the usual response to storage being cheap and getting cheaper, isn’t cost-effective. Think, for a moment, about a company that has yottabytes of data, of which only 10% is being used. Should it pay to store and migrate the rest? What if all companies are paying to store and provide access to data, 90% of which isn’t used, and which they are not legally (or morally, for that matter) required to keep? Should they? What about the cost of generating the electricity itself, plus the cost of purchasing it, plus the cost of buying new machines and the human time involved in migrating the data every few years?
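
To make the cost question a little more concrete, here is a rough back-of-the-envelope sketch in Python. The function name, its parameters, and every figure in the example call are hypothetical placeholders of my own, not numbers from Loukides’ article or from any real price list; plug in your own.

```python
def annual_cost_of_unused_data(
    total_tb,
    unused_fraction,
    cost_per_tb_year,
    migration_cost_per_tb=0.0,
    years_between_migrations=1,
):
    """Estimate the yearly cost of keeping data that nobody uses.

    All inputs are assumptions supplied by the caller:
      total_tb                 -- total data held, in terabytes
      unused_fraction          -- share of that data never accessed (0.0 to 1.0)
      cost_per_tb_year         -- storage plus electricity cost per TB per year
      migration_cost_per_tb    -- hardware and staff cost to migrate one TB
      years_between_migrations -- how often a migration happens
    """
    if not 0.0 <= unused_fraction <= 1.0:
        raise ValueError("unused_fraction must be between 0 and 1")
    if years_between_migrations <= 0:
        raise ValueError("years_between_migrations must be positive")

    unused_tb = total_tb * unused_fraction
    storage_cost = unused_tb * cost_per_tb_year
    # Spread each migration's cost evenly over the years between migrations.
    migration_cost = unused_tb * migration_cost_per_tb / years_between_migrations
    return storage_cost + migration_cost


# Hypothetical example: 1,000 TB held, 90% of it unused.
estimate = annual_cost_of_unused_data(
    total_tb=1000,
    unused_fraction=0.9,
    cost_per_tb_year=100.0,
    migration_cost_per_tb=50.0,
    years_between_migrations=5,
)
print(f"Estimated yearly cost of unused data: {estimate:,.2f}")
```

Even a crude estimate like this turns “storage is cheap” into a number you can weigh against the legal and practical value of keeping the data.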

The long-term archiving of data, in my opinion, needs to be as much a part of data science as the analysis and creative use of the data.

Now that I’ve gotten off my data archive soapbox, I’m going to throw out some of my favorite quotes from the article.

We’ve all heard a lot about “big data,” but “big” is really a red herring. Oil companies, telecommunications companies, and other data-centric industries have had huge datasets for a long time. And as storage capacity continues to expand, today’s “big” is certainly tomorrow’s “medium” and next week’s “small.” The most meaningful definition I’ve heard: “big data” is when the size of the data itself becomes part of the problem. We’re discussing data problems ranging from gigabytes to petabytes of data. At some point, traditional techniques for working with data run out of steam.

Which is why you need Information Scientists. :-)

Data science isn’t just about the existence of data, or making guesses about what that data might mean; it’s about testing hypotheses and making sure that the conclusions you’re drawing from the data are valid.

Data science requires skills ranging from traditional computer science to mathematics to art.

According to DJ Patil, chief scientist at LinkedIn (@dpatil), the best data scientists tend to be “hard scientists,” particularly physicists, rather than computer science majors. Physicists have a strong mathematical background, computing skills, and come from a discipline in which survival depends on getting the most from the data. They have to think about the big picture, the big problem. When you’ve just spent a lot of grant money generating data, you can’t just throw the data out if it isn’t as clean as you’d like. You have to make it tell its story. You need some creativity for when the story the data is telling isn’t what you think it’s telling.

Entrepreneurship is another piece of the puzzle. Patil’s first flippant answer to “what kind of person are you looking for when you hire a data scientist?” was “someone you would start a company with.” That’s an important insight: we’re entering the era of products that are built on data. We don’t yet know what those products are, but we do know that the winners will be the people, and the companies, that find those products.

Data scientists combine entrepreneurship with patience, the willingness to build data products incrementally, the ability to explore, and the ability to iterate over a solution. They are inherently interdisciplinary. They can tackle all aspects of a problem, from initial data collection and data conditioning to drawing conclusions. They can think outside the box to come up with new ways to view the problem, or to work with very broadly defined problems: “here’s a lot of data, what can you make from it?”

The part of Hal Varian’s quote that nobody remembers says it all:
The ability to take data — to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it — that’s going to be a hugely important skill in the next decades.

What skills and knowledge do you think make a good data scientist? Do you think the ability to manage data for the long-term should also be a required skill?

Please let me know what you think....