Hypothesis Generation: A Key Data Science Challenge

Data scientists are new-age explorers. Their field of exploration is rife with data from various sources. Their methods are mathematics, linear algebra, computational sciences, statistics and data visualisation. Their tools are programming languages, frameworks, libraries and statistical analysis tools. And their rewards are stepping stones: better understanding and insights.

The data science process for many teams starts with data summaries, visualisation and data analysis, and ends with the interpretation of analysis results. However, in today’s world of rapid data science cycles, it is possible to do much more if we take a hypothesis-centred approach to data science.

Theories for New Age Raconteurs

Data scientists work with data sets small and large, and are tellers of stories. These stories have entities, properties and relationships, all described by data. Their apparatus and methods give data scientists opportunities to identify, consolidate and validate hypotheses with data, and to use these hypotheses as starting points for their data narratives. Hypothesis generation is a key challenge for data scientists. Hypothesis generation and, by extension, hypothesis refinement constitute the very purpose of data analysis and data science.

Hypothesis generation for a data scientist can take numerous forms, such as:

  1. They may be interested in the properties of a certain stream of data or a certain measurement. These properties and their default or exceptional values may form a certain hypothesis.
  2. They may be keen on understanding how a certain measure has evolved over time. In trying to understand this evolution of a system’s metric, or a person’s behaviour, they could rely on a mathematical model as a hypothesis.
  3. They could consider the impact of some properties on the states of systems, interactions and people. In trying to understand such relationships between different measures and properties, they could construct machine learning models of different kinds.

Ultimately, the purpose of such hypothesis generation is to simplify some aspect of system behaviour and to represent that behaviour in a tangible, tractable form based on simple, explicable rules. This makes story-telling easier for data scientists when they become new-age raconteurs, straddling data visualisations, dashboards with data summaries and machine learning models.
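To make the second form above concrete, here is a minimal sketch in Python that treats "this metric grows roughly linearly over time" as a hypothesis and checks how well the data support it. The function names, the choice of a straight-line model and the sample readings are all assumptions made for illustration; they are not drawn from any particular project, and a real analysis would likely consider several competing models.

```python
from statistics import mean


def fit_linear_trend(times, values):
    """Least-squares fit of the hypothesis: value = slope * time + intercept."""
    t_bar, v_bar = mean(times), mean(values)
    sxx = sum((t - t_bar) ** 2 for t in times)
    sxy = sum((t - t_bar) * (v - v_bar) for t, v in zip(times, values))
    slope = sxy / sxx
    intercept = v_bar - slope * t_bar
    return slope, intercept


def r_squared(times, values, slope, intercept):
    """Share of the variance in values explained by the fitted line."""
    v_bar = mean(values)
    ss_res = sum((v - (slope * t + intercept)) ** 2
                 for t, v in zip(times, values))
    ss_tot = sum((v - v_bar) ** 2 for v in values)
    return 1 - ss_res / ss_tot


# Hypothetical daily readings of some metric (illustrative values only).
days = [0, 1, 2, 3, 4, 5]
metric = [10.0, 12.1, 13.9, 16.2, 18.0, 19.8]

slope, intercept = fit_linear_trend(days, metric)
fit_quality = r_squared(days, metric, slope, intercept)
print(f"slope={slope:.2f}, intercept={intercept:.2f}, R^2={fit_quality:.3f}")
```

A high R² here would lend some support to the simple linear hypothesis, while a low one would suggest refining it, for instance by trying a different functional form. Either outcome gives the data narrative a clear starting point.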

Developing Nuanced Understanding

The importance of hypothesis generation in data science teams is manifold:

  1. Hypothesis generation allows the team to experiment with theories about the data.
  2. Hypothesis generation allows the team to take a systems-thinking approach to the problem to be solved.
  3. Hypothesis generation allows the team to build more sophisticated models based on prior hypotheses and understanding.

When data science teams approach complex projects, some of them may be wont to dive right into building complex systems based on available resources, libraries and software. By taking a hypothesis-centred view of the data science problem, they could build up complexity and nuanced understanding in a very natural way, developing hypotheses and ideas in the process.

