Big Data: Size and Velocity

One of the shifts under way in the big data space is a growing need for data that is big in relevance rather than merely big in volume. This is perhaps a crucial distinction to make. Here, we examine how relevant data, as opposed to just large volumes of data, shows up in business settings.

What Managers Want From Data

It is easier to put the question “what do you want from your data?” to managers and executives than to answer it as one. As someone who has worked in Fortune 500 companies with teams that use data to make decisions, I’d like to share some insights:

  1. Managers don’t necessarily want to see data, even if they talk about wanting to use data for decision making. They instead want to see interpretations of data that help them make up their minds and take decisions.
  2. Decision making is neither routine nor based on a single variable of interest.
  3. Decision making involves more than operational data descriptors (which are most often the ones instrumented for collection from data sources).
  4. Decisions can be taken based on uncertain estimates in some cases, but many situations do require accurate estimates of results to drive decision making.

From Data To Insight

The process of getting from data to insight isn’t linear. It involves exploration, which means collecting more data and iterating on one’s results and hypotheses. Broadly, data preparation and analysis sit as intermediate stages between data collection and the generation of insight. The data scientist’s job isn’t done once the insights are generated, however. There is still a need to collect more data, refine the models we’ve built, and construct better views of the problem landscape.

Data Quality

A large percentage of the data analyst’s or data scientist’s problems have to do with data quality and the data’s readiness for analysis. With respect to data quality specifically, some things stand out:

  1. Measurement aspects – whether the measured data really represents the state of the variable that was measured. This in turn depends on properties of the measurement system such as linearity, stability, bias, range and sensitivity.
  2. Latency aspects – whether time-sequenced measurements are recorded and logged in the correct sequence and at the right intervals.
  3. Missing and anomalous values – missing or anomalous readings or data records, as opposed to anomalous behaviour, which is a whole other subject.
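To make the last two items concrete, here is a minimal sketch of how such checks might look for a log of sensor readings. The record layout (a timestamp paired with a reading, with None marking a missing value), the function name `quality_report` and the valid range are all assumptions made for illustration, not a standard API. Measurement-system properties such as bias or linearity need calibration studies and aren't covered here.

```python
from datetime import datetime, timedelta


def quality_report(records, expected_interval, valid_range):
    """Count basic quality problems in a log of (timestamp, reading) pairs.

    A reading of None is treated as missing (an assumed convention).
    """
    low, high = valid_range
    report = {
        "missing": 0,
        "out_of_range": 0,
        "out_of_sequence": 0,
        "irregular_interval": 0,
    }
    previous_time = None
    for timestamp, value in records:
        # Missing and anomalous values
        if value is None:
            report["missing"] += 1
        elif not (low <= value <= high):
            report["out_of_range"] += 1
        # Latency aspects: ordering and spacing of consecutive records
        if previous_time is not None:
            gap = timestamp - previous_time
            if gap <= timedelta(0):
                report["out_of_sequence"] += 1
            elif gap != expected_interval:
                report["irregular_interval"] += 1
        previous_time = timestamp
    return report


# Made-up readings from a hypothetical temperature sensor, logged once a minute.
start = datetime(2024, 1, 1)
minute = timedelta(minutes=1)
records = [
    (start, 20.1),
    (start + minute, 20.3),
    (start + 2 * minute, None),   # missing reading
    (start + 3 * minute, 250.0),  # outside the plausible range
    (start + 5 * minute, 20.2),   # one interval was skipped
    (start + 4 * minute, 20.4),   # logged after a later record
]
report = quality_report(records, minute, valid_range=(-40.0, 60.0))
print(report)
```

A range check like this only catches gross anomalies. In practice the thresholds would come from knowledge of the sensor and the process being measured.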

Fast Data Processing

Speed is an essential element in the data scientist’s effectiveness. The speed of decisions is the speed of the slowest link in the chain. Traditionally, this slowest link has been the collection of the data itself. However, this is changing, with sensors in IoT settings delivering huge data sets and massive streams of data. Data processing frameworks have also improved by leaps and bounds in the recent past, with frameworks like Apache Spark leading the charge. In such a situation, the bottleneck is no longer the acquisition of data. Indeed, the availability of massive data lakes holding lots of data itself signals the need for more and more data scientists who can analyse this data and arrive at insights from it. It is in this context that the rapid creation of models, the rapid analysis of data for insights, and the creation of meta-algorithms that do such work are valuable.

Where Size Does Matter

There are some problems which do require very large data sets, and there are many examples of systems that gain effectiveness only with scale. One such example is the commonly found collaborative filtering recommendation engine, used everywhere in e-commerce and related industries. Size does matter for these data sets. Small data sets are prone to truth inflation (where the effects that do reach statistical significance tend to be overestimates) and to poor analysis results that stem from poor data collection. In such cases, there is no alternative but to ensure we collect, store and analyze large data sets.
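Truth inflation is easy to see in a simulation. The sketch below is an illustration under stated assumptions, not an analysis of real data: every simulated study draws observations from a normal distribution with a known standard deviation of 1 and a true effect of 0.2, applies a two-sided z-test at the 5% level, and we average the estimates only from the studies that came out significant. The function name, effect size and sample sizes are all made up for the example.

```python
import random
import statistics


def mean_significant_estimate(true_effect, sample_size, studies=500, seed=0):
    """Average estimated effect among simulated studies that reach significance.

    Assumes Normal(true_effect, 1) observations and a two-sided z-test
    at the 5% level with a known standard deviation of 1.
    """
    rng = random.Random(seed)
    # A sample mean is significant when it lies beyond 1.96 standard errors of zero.
    threshold = 1.96 / sample_size ** 0.5
    significant = []
    for _ in range(studies):
        sample = [rng.gauss(true_effect, 1.0) for _ in range(sample_size)]
        estimate = statistics.fmean(sample)
        if abs(estimate) > threshold:
            significant.append(estimate)
    # Guard against the (unlikely) case where no study is significant.
    return statistics.fmean(significant) if significant else float("nan")


TRUE_EFFECT = 0.2
small = mean_significant_estimate(TRUE_EFFECT, sample_size=20)
large = mean_significant_estimate(TRUE_EFFECT, sample_size=1000)
print(f"true effect: {TRUE_EFFECT}")
print(f"mean significant estimate, n=20:   {small:.3f}")
print(f"mean significant estimate, n=1000: {large:.3f}")
```

With 20 observations per study, a sample mean has to be more than twice the true effect to pass the significance threshold, so the studies that "find something" overstate it. With 1,000 observations nearly every study is significant and the average sits close to the true value.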

Volume or Relevance in Data

Now we come to the dichotomy we set out to resolve: whether volume is more important in data sets, or whether relevance is. Relevant data simply meets a number of the criteria listed above, whereas data that is measured purely by volume (petabytes or exabytes) gives us no idea of its quality or of its usefulness for real data analysis.

Volume and Relevance in Data

We now look at whether volume itself may become part of what makes the data relevant for analysis. Unsurprisingly, for some applications, such as neural network training or data science on high-frequency time series data sets, data volume is undeniably useful. More data in these cases means that more can be done with the model, that more partitions or subsets of the data can be taken, and that more theories can be tested on different representative sample sets of the data.

The Big-Three Vs

So, where does this leave us with respect to our understanding of what makes Big Data, Big Data? We’ve seen the popular trope that Big Data is data that exhibits volume, velocity and variety. Some discuss a fourth characteristic, the veracity of the data. Overall, the availability of relevant data in sufficient volumes should be able to address the needs of the data scientist for model building and for data exploration. The question of variety still remains, and as data profiling approaches mature, data wrangling will advance to a point where this variety is no longer a source of trouble but a genuine asset. The volume and velocity requirements, however, are definitely open to a trade-off in a large percentage of cases. For most model building activities, such as linear regression models, or classification models where we know the characteristics or behavior of the data set, so-called “small data” is sufficient, as long as the data set is representative and relevant.
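As a rough illustration of the “small data” point, the sketch below fits a straight line to a synthetic data set of 100,000 points and then to a random sample of 200 of them. Everything here is assumed for the example: the data is generated from y = 3x + 2 with Gaussian noise, the sample is drawn uniformly at random (which is what makes it representative), and `fit_line` is a hand-written least squares helper, not a library call.

```python
import random


def fit_line(points):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


rng = random.Random(42)

# Synthetic "big" data set: y = 3x + 2 plus noise (all values are made up).
population = []
for _ in range(100_000):
    x = rng.uniform(0.0, 10.0)
    population.append((x, 3.0 * x + 2.0 + rng.gauss(0.0, 1.0)))

# A small sample drawn uniformly at random, so it is representative by construction.
sample = rng.sample(population, 200)

full_slope, full_intercept = fit_line(population)
small_slope, small_intercept = fit_line(sample)
print(f"full data (n=100000): slope={full_slope:.3f}, intercept={full_intercept:.3f}")
print(f"sample    (n=200):    slope={small_slope:.3f}, intercept={small_intercept:.3f}")
```

The two fits should land close to each other because the relationship is simple and the sample is random. The same would not hold if the sample were drawn from only one corner of the data, or if the model had far more parameters to estimate.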


