While there is justifiable excitement in the technology industry (and other industries) these days about the widespread availability of data, and of algorithms to process and make sense of it, I sincerely think (like many others) that the hype behind Big Data is somewhat unfounded.
For many decades, “small data” have been studied in science and industry with the intent of constructing mathematical models, i.e., approximate, error-prone mathematical representations of phenomena. In some ways, the scientific method is all about such data analysis. We often hear in the news about the amplification of effects, the “truth inflation” observed when conclusions drawn from small data sets are stretched into broader generalizations. We hear about a lack of data impeding the progress of research; we also hear about fabricated data and spurious research results. Many scientific findings have come under scrutiny for these reasons, and perhaps analysis of population data (as Big Data promises to do) may help this situation. However, the key difference between past decades of statistics, the era of legends such as Fisher and George Box, and the present day of stalwarts in applied statistics and machine learning like Nate Silver, Sebastian Thrun and Andrew Ng, is the ability to leverage computing to analyse large data sets.
Much of the discussion around Big Data centres on the so-called four Vs: volume, velocity, variety and, increasingly, veracity. The first three refer to the increasing amount, speed and range of data generated in the information age. However, what is often forgotten is that below the hype, below the machine learning algorithms and below the databases and technologies, we still have the same underlying principles.
The types of data, the mathematical methods we use to evaluate them, and the fundamental concepts behind them are unchanged. Understanding this is often the key to knowing whether and when to sample from your big data set. This is more important than we realize, because sampling is not obsolete. Often, a well-collected sample may be more than sufficient for establishing or testing a hypothesis we may have.
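As a rough illustration of the point about sampling, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the synthetic population, the sample size of 2,000, the helper name `sample_mean_ci`, and the normal approximation used for the interval. It is not a recipe for any particular data set; it only suggests how a modest simple random sample can estimate a mean about as well as one usually needs.

```python
import math
import random
import statistics


def sample_mean_ci(population, n, seed=0, z=1.96):
    """Estimate the population mean from a simple random sample of size n.

    Returns (sample_mean, lower, upper), where lower and upper bound an
    approximate 95% confidence interval (normal approximation, z = 1.96).
    """
    rng = random.Random(seed)
    sample = rng.sample(population, n)  # sampling without replacement
    mean = statistics.fmean(sample)
    std_err = statistics.stdev(sample) / math.sqrt(n)
    return mean, mean - z * std_err, mean + z * std_err


# Synthetic stand-in for a "big" data set (hypothetical values).
rng = random.Random(42)
population = [rng.gauss(100.0, 15.0) for _ in range(200_000)]

full_mean = statistics.fmean(population)
est, lo, hi = sample_mean_ci(population, n=2_000)

print(f"full-data mean:  {full_mean:.2f}")
print(f"sample estimate: {est:.2f} (approx. 95% CI {lo:.2f} to {hi:.2f})")
```

With a standard deviation of around 15, the standard error at n = 2,000 is roughly 15/√2000 ≈ 0.34, so the interval is only about 1.3 units wide, even though the sample is 1% of the data. The caveat is that the sample must be drawn well: a biased sample stays biased no matter how large the data set behind it is.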
In my view, newcomers to the data science and big data revolutions ought to consider a course in statistics, statistical thinking and statistical reasoning first. This lays the foundation for everything else that follows. The internet and most developed and even developing countries are awash with resources that can enable individuals to learn programming and computer-based problem solving, but critical thinking and statistical thinking seem to be harder skills to learn.
Statistical thinking requires not only a level of mathematical rigour but also an ability to embrace notions of uncertainty, probabilistic thinking and a fundamental change in one’s notions of cause and effect. Perhaps this is a big step for many. The relative certainty of the logic of programming languages may actually be welcoming to many, which is probably also why we see more discussions about Hadoop and Spark than about statistical hypothesis testing or time series autocorrelation models.
So, if you want to cut through the hype, see data science for what it is by breaking it up into its elements: the data (which may be coming in from ever more diverse sources), the tools (algorithms, computers) and the science (which is, in this case, statistics). Contrary to what some articles on the web have begun to claim, not everyone is a data scientist, but neither is it one specific set of skills that makes someone a “data scientist”. Some say that data scientists are glorified statisticians, others that they’re statistically competent programmers well versed in machine learning; the truth is probably somewhere in between.
Furthermore, data visualization, another aspect of the data science hype, is both an art and a science, which perhaps implies that you can be both enlightened and misled by charts and graphs. In my view, skill in visualization alone doesn’t make you a data scientist (nor does, for instance, knowledge of machine learning methods alone, or skill in programming R for ETL purposes alone). When you cut through the hype here, the pragmatic course is to acquire a wide array of skills, and depth in some. Like many engineers, who may have a wide swath of knowledge but expertise in only a few areas, this is the role most data scientists are likely to have.
There’s definitely more that can be said about specific aspects of the data science “movement”, but what is certain is that the value and relevance of a knowledge of statistics, the science underlying most of data science, should not be underestimated in the present day. Statistics, hopefully, will become as important as learning a language, developing an ability to hold conversations, or writing a well-argued paragraph.