Effective measurement is as important to the data science revolution as effective analysis. Without correctly measured data, we fly blind into data analysis, and we can hardly expect to extract reliable insight from the data we possess. In this post, I discuss some challenges facing effective measurement in the context of data science and the Internet of Things. Rather than address specific technical aspects, this is a reflective post that engages the key questions arising from data that comes from diverse sources, from unprocessed and processed data, and from the importance of measurement systems analysis to data science and Internet-of-Things builder and integrator teams.
Data Before Software and Algorithms
While a lot of discussion and debate rages on about which algorithm to use, which language is better for data science, or indeed which distributed computing framework to use for data processing and machine learning, the effective and accurate measurement of data itself is in some cases an unsolved problem. The Internet of Things (IoT) revolution will bring with it the need to integrate hundreds of sensors into the devices around us, and the resulting sensor data complexity will necessitate systematic methods of processing measurements and storing them at large volumes. While databases and messaging engines for transferring data have kept up with needs in this space (and continue to innovate), there is a need to better integrate measurement system analysis routines, error modeling methods, and calibration methods into sensors themselves. Perhaps this will be accomplished in the sensor architectures of the future.
The Value of Measurement Error Models
Measurement error models are a key outstanding problem for calibration activities, measurement system design, embedded measurement system integration, and measurement management. While contemporary approaches such as variance models (a la Type A and Type B measurement estimation methods) exist, architectures that combine many different kinds of measurement sensors are not often found. Such architectures would have to combine correlation and cross-correlation analyses, integrate distribution models of errors, and provide for sensor state memory (metadata about sensor measurements) alongside the collection and processing of the actual measurements.
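As a concrete illustration of the Type A approach mentioned above: a Type A evaluation estimates uncertainty statistically from repeated readings. The sketch below is a minimal example with made-up temperature readings from a hypothetical sensor; the function name is my own.

```python
import statistics

def type_a_uncertainty(readings):
    """Type A evaluation: standard uncertainty of the mean,
    i.e. sample standard deviation divided by sqrt(n)."""
    n = len(readings)
    s = statistics.stdev(readings)   # sample standard deviation
    return s / n ** 0.5              # standard uncertainty of the mean

# Repeated readings of a hypothetical temperature sensor, in deg C
readings = [20.1, 20.3, 19.9, 20.2, 20.0]
u_a = type_a_uncertainty(readings)
```

A Type B evaluation, by contrast, draws on other information (datasheets, calibration certificates, physical limits) rather than repeated observations, which is why combining the two systematically across heterogeneous sensor arrays is nontrivial.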
While direct measurement sensors address singular operational definitions at massive scale, the trend is definitely towards sensor architectures that consider sensor fusion approaches. Static and dynamic characteristics of these sensor systems then become of paramount importance, since understanding and modeling them (and new phenomena in this space) will be central to accomplishing more from both direct-measurement sensors and sensor fusion arrays. Derived measures, their meaning in the context of complex signal interactions, and related effects are sure to play a role in the definition of sensor architectures.
Addressing Measurement Uncertainty
The more we peer into the source of our data and the means we have used to collect it, the more we have to examine its sources of error. Traditionally acknowledged sources of static measurement uncertainty are:
- Measurement bias uncertainty
- Precision uncertainty
- Resolution uncertainty
- Environmental or noise factor induced uncertainty
Additionally, there are sources of dynamic uncertainty, some of which may be:
- Stability and absolute sensitivity
- Dynamic range and dynamic sensitivity
- Hysteresis behavior of sensors and measurement systems
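For the static sources listed above, the standard way to roll individual components into one figure is the root-sum-square combination used for uncorrelated inputs. The snippet below is a minimal sketch; the component values are purely illustrative, not real sensor data.

```python
import math

def combined_standard_uncertainty(components):
    """Combine independent standard-uncertainty components in
    root-sum-square fashion (valid only for uncorrelated inputs)."""
    return math.sqrt(sum(u ** 2 for u in components))

# Illustrative standard uncertainties for one sensor channel:
u_bias = 0.05        # measurement bias uncertainty
u_precision = 0.02   # precision (repeatability) uncertainty
u_resolution = 0.01  # resolution uncertainty
u_noise = 0.03       # environmental / noise-factor uncertainty

u_c = combined_standard_uncertainty(
    [u_bias, u_precision, u_resolution, u_noise]
)
```

The dynamic sources (drift, hysteresis, dynamic sensitivity) are harder to fold in this way, because they are correlated with the operating state of the sensor rather than independent of it.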
Without quantifying and effectively addressing such sources of error and measurement uncertainty, it is probably unwise to apply advanced algorithms that paint problems in broad brush strokes. There is also a class of computationally induced uncertainty to add to this, one step removed from the measurement layer of our architectures, such as:
- Database updates and null values
- Time lags in updates, mismatches between sensor time and computer time
- Memory latency and the associated lags
- Interpolation, computation and round-off errors
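Two of the computational sources above are easy to demonstrate directly. The sketch below shows accumulated floating-point round-off, and then the separate model error introduced by linearly interpolating a curved signal between two samples; the data and function are illustrative only.

```python
# Accumulated round-off: 0.1 has no exact binary representation,
# so repeated addition drifts away from the intended total.
total = sum(0.1 for _ in range(10))
roundoff = abs(total - 1.0)   # small but nonzero

# Linear interpolation between two samples of a curved signal
# adds a model error on top of any float error.
def lerp(x0, y0, x1, y1, x):
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

true_value = 2.5 ** 2                      # underlying quadratic signal
estimate = lerp(2.0, 4.0, 3.0, 9.0, 2.5)   # interpolating from samples at x=2, x=3
interp_error = estimate - true_value       # interpolation overshoots here
```

Individually negligible, such errors compound across millions of stored and reprocessed measurements, which is the point made below about decisions made over time.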
Interestingly enough, many of these errors cannot be addressed directly, since we use digital devices built to a certain architecture. It is perhaps impossible to do away with memory latency, or with the computational latencies that cause data to be processed or stored with a delay. As processor and memory chips improve in computational capability (despite warnings that Moore's law is slowing down), these lags are bound to become less and less significant. While in small data sets such latencies, biases and errors may be individually insignificant, over time the cost of decisions made with poorly measured and poorly processed data is high.
Measurement Error and Uncertainty Characterization
Characterization of measurement error and uncertainty is the process of analyzing measurements from a process or sensor to understand:
- The measurement system’s contribution to errors or variations
- The contribution of the object being measured, to observed variations
The ISO GUM (Guide to the Expression of Uncertainty in Measurement) specifies a set of approaches that are widely followed by organizations whose best interests lie in effective measurement error characterization (think NASA and the many aerospace corporations likely to use fine-grained measurements in their design, testing and manufacturing processes). Historically, approaches like Gage Analysis (ANOVA-based Gage R&R) have been recommended by manufacturing process and quality practitioners whose job frequently was to deal with data and measurements. However, a more comprehensive, systems-based approach may be due, since the process for such analysis is sometimes based on archaic process definitions and expectations and is out of touch with the reality of increased manufacturing automation.
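The core idea behind a Gage study is variance decomposition: how much of the observed variation comes from the measurement system, and how much from the parts themselves. The sketch below is a deliberately simplified version (pooled within-part variance versus part-to-part variance, not a full crossed ANOVA Gage R&R with operators), with made-up data.

```python
import statistics

# Each part measured three times by the same gauge (hypothetical data).
measurements = {
    "part_1": [10.02, 10.05, 10.03],
    "part_2": [10.21, 10.19, 10.22],
    "part_3": [9.84, 9.86, 9.85],
}

# Repeatability: pooled within-part variance (the gauge's contribution).
within = statistics.mean(
    statistics.variance(reads) for reads in measurements.values()
)

# Part-to-part: variance of the per-part means (the product's contribution).
part_means = [statistics.mean(reads) for reads in measurements.values()]
between = statistics.variance(part_means)

# Share of total variation attributable to the measurement system.
pct_gauge = 100 * within / (within + between)
```

In this toy data the gauge contributes under one percent of the total variation, which would indicate an acceptable measurement system; a full Gage R&R would additionally separate reproducibility across operators.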
The Importance of Time
Finally, it is impossible to have effective measurement devices of any kind (let alone massive sensor arrays or sensor fusion arrays) without a system of analysis that considers the time element. Time series views of the data have to prevail over the i.i.d. view of data analysis that is often practiced in industry and academia. This means moving away from some frequentist views of the world and embracing Bayesian approaches, allowing us to reason with the data itself rather than with theories about the data.
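One way Bayesian reasoning fits naturally with time-ordered sensor data is sequential updating: each new reading refines the current belief rather than being pooled into an i.i.d. batch. The sketch below uses the standard conjugate normal update with a hypothetical reading stream; names and values are illustrative.

```python
def bayes_update(prior_mean, prior_var, reading, meas_var):
    """Conjugate normal update: fuse a prior belief with one noisy
    reading. The posterior variance shrinks with every update."""
    k = prior_var / (prior_var + meas_var)   # gain: how much to trust the reading
    post_mean = prior_mean + k * (reading - prior_mean)
    post_var = (1 - k) * prior_var
    return post_mean, post_var

# Vague prior belief, then fold in readings in arrival order.
mean, var = 20.0, 4.0
for reading in [20.8, 21.1, 20.9]:           # hypothetical sensor stream
    mean, var = bayes_update(mean, var, reading, meas_var=0.25)
```

The same recursion, extended with a process model, is the scalar form of a Kalman filter, which is one reason this style of analysis dovetails with sensor fusion.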
The Importance of Domain
I’m a fan of domain knowledge in data science and make no secret of the fact – primarily because domain knowledge and experience helps us make sense of data analysis. If data analysis is the process of reasoning with data effectively, domain knowledge is what sustains sane interpretations of such analysis. Domain knowledge is important in measurement, because setting up measurement systems is a highly domain specific task. Managing process and product measurement is a whole different and complex topic, and perhaps warrants a separate post.
Measurement is an oft-overlooked area of data analysis, because data scientists and analysts often like to get right to the analysis and insights. However, it would do business analysts, machine learning engineers and serious data scientists a world of good to take a close look at measurement systems and how they measure, capture and process data from the sources as measurements. Not only are measurement system analysis, measurement error characterization and measurement uncertainty characterization central to the process of collecting data, but they also indirectly shape the results we deliver to our organizations and stakeholders as data scientists.