I recently came across this Hortonworks Data Flow presentation, which discusses the concept of the jagged edge of real-time data analytics. To me, the most useful framing for that discussion is the prioritization of what comes in through sensors (and other big data gathered from these “jagged edge” sources). This is a pondering rather than a post with a specific agenda; perhaps I will add to it as responses come in, or as I learn more.
One of the key challenges for many data centres in the future, in the world of the Internet of Things, will be to provide relevant data analytics regardless of the data's size, point of origin, or time of generation. To do this well, I foresee not only technologies like NiFi providing low-latency updates as close to real time as possible, but also a need for technologies with sufficient intelligence built into the data sampling and data collection process.
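To make "intelligence in the sampling process" concrete, here is a minimal sketch of one possible approach: forward every high-priority sensor reading immediately, while keeping only a bounded uniform sample (via reservoir sampling) of the low-priority rest. The `priority` field and the thresholds are illustrative assumptions, not any particular NiFi API.

```python
import random

def sample_readings(readings, reservoir_size=100, min_priority=5):
    """Split sensor readings into those forwarded immediately and a
    bounded uniform random sample of the rest (reservoir sampling)."""
    forwarded = []   # high-priority readings: always kept
    reservoir = []   # low-priority readings: bounded random sample
    seen_low = 0     # count of low-priority readings seen so far
    for reading in readings:
        if reading.get("priority", 0) >= min_priority:
            forwarded.append(reading)
        else:
            seen_low += 1
            if len(reservoir) < reservoir_size:
                reservoir.append(reading)
            else:
                # Replace an existing sample with probability
                # reservoir_size / seen_low, keeping the sample uniform.
                j = random.randrange(seen_low)
                if j < reservoir_size:
                    reservoir[j] = reading
    return forwarded, reservoir
```

The point of the sketch is that the decision about what to keep happens at collection time, before anything lands in the lake, rather than after the fact.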
The central philosophy is surprisingly old – W. Edwards Deming said that data are not collected for museum purposes, but for decision making. And it is in this context that the data lakes of today risk becoming the data seas of tomorrow, if we do not use intelligent prioritization to determine what data should be streamed, and what data should be stored. The reality of decision making in such real-time analytics situations is that there is far more data available than any one decision requires – which reminds me of Barry Schwartz’s paradox of choice: too much choice can actually impede and delay decision making, rather than aid it.
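The stream-versus-store decision could be sketched as a simple per-record routing rule. This is purely illustrative – the freshness and priority thresholds are assumptions of mine, not a real NiFi or Spark interface:

```python
import time

def route(record, now=None, max_age_seconds=60, min_priority=5):
    """Return 'stream' for fresh, high-priority records, so they go to
    real-time analysis; return 'store' for everything else, so it is
    archived for later batch processing."""
    now = time.time() if now is None else now
    fresh = (now - record["timestamp"]) <= max_age_seconds
    important = record.get("priority", 0) >= min_priority
    return "stream" if fresh and important else "store"
```

In practice such a rule would live at the ingestion layer, so that the lake only ever receives what has deliberately been routed to storage.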
How do we ensure that approaches like data flow prioritization allow us to address these issues? How do we move away from static data lakes to the more useful data streams from the jagged edge that streaming technologies like Spark promise on established frameworks like Hadoop, without the risk of turning our lakes into seas we lack the insight to fully benefit from?
Some of the answers lie in the implementations of specific use cases, of course – there is no silver bullet, if you will. That said, what kinds of technologies can we foresee being developed for Hadoop, Spark and other platforms that will heavily influence the Internet of Things revolution, to solve this prioritization conundrum?