This is a really interesting question for me, because I really enjoy discussing data science and data analysis. Some reasons I love data science:
- Discovering and uncovering patterns in the data through data visualization
- Finding and exploring unusual relationships between factors in a system using statistical measures
- Asking questions about systems in a data context – this is why data science is so hands-on, so iterative, and so full of throw-away models
Let me expand on each of these with an example, so that you get an idea.
Uncovering Patterns in Data
On a few projects, I’ve found data visualization to be a great way to identify hypotheses about my data set. Having a starting point such as a visualization for the hypothesis generation process makes us go into the process of building models a little more confidently. There’s the specific example of a time series analysis technique I used for energy system data, where using aggregate statistical measures and distribution fitting led to arbitrary and complex patterns in the data. Using time ordered visualizations helped me formulate the hypothesis in the correct way, and allowed me to build an explanatory model of the system.
Exploring Unusual Relationships in Data
In data science work, you begin to observe broad patterns and exceptions to these rules. Simple examples may be found in the analysis of anomalous behaviour in various kinds of systems. Some time back, I worked with a log data set that captured different kinds of customer transaction data between a customer and a client. These log data revealed unusual patterns that those steeped in the process could tell, but which couldn’t be quantified. By finding typical patterns across customers using session-specific metrics, I helped identify the anomalous customers. The construction of these variables, known as “feature engineering” in data science and machine learning, was a key insight. Such insights can only come when we’re informed about domain considerations, and when we understand the business context of the data analysis well.
Asking Questions about Systems in a Data Context
When you’re exploring the behaviour of systems using data, you start from some hypothesis (as I’ve described above) and then continue to improve your hypothesis to a point where it is able to help your business answer key questions. In each data science project, I’ve observed how considerations external to the immediate data set often come in, and present interesting possibilities to us during the data analysis. Sometimes, we answer these questions by finding and including the additional data, and at other times, the questions remain on the table. Either way, you get to ask a question on top of an answer you know, and you get to do an analysis on top of another analysis – with the result that you’ve composited different models together after a while, that give you completely new insights that you’ve not seen before.
All three patterns are exhilarating and interesting to observe, for data scientists, especially those who are deeply involved in reasoning about the data. A good indication of whether you’ve done well in data analysis is when you’re more curious and better educated about the nuances of a system or process than you were before – and this is definitely true in my case. What seemed like a simple system at the outset can reveal so much to you when you study its data – and as a long-time design, engineering and quality professional, this is what interests me a great deal about data science.