Achieving Explainability and Simplicity in Data Science Work

This post stems from a few tweets I authored recently (over at @rexplorations) on deep learning, data science, and the other skills that data scientists ought to learn. Naturally, this is by no means a short list of skills, given the increasingly pivotal role that data scientists play in organizations.

Here’s a summary of the tweet-stream I’d put out, with some additional ponderings.

  1. Ignoring domain knowledge is the data scientist’s road to perdition. Doing data analysis, or building models from data, without understanding the domain and the relevance of the data and factors used in those models is akin to “data science suicide” – a sure road to perdition as a data scientist. Domain knowledge is also hard to acquire, especially for data scientists working as consultants and applying their skills in short-term, consultative settings. For instance, I have more than a decade of experience in the manufacturing industry, and I still find myself learning new things whenever I encounter a new engineering setup or a new firm. A data scientist is nobody if not capable of learning new things – and domain knowledge is something they need to constantly skill up on, in addition to their analytical skills.
  2. Get coached on your communication skills, if needed. When interacting with domain experts and subject matter experts, communication skills are extremely important for data scientists. I have frequently seen data scientists suffer from the “impostor syndrome” – not only in the context of data analysis methods and techniques, but also in the context of domain understanding.
  3. Empathise, and take notes, when speaking to subject matter experts. For this reason, the following are extremely important for new data scientists interacting with subject matter experts:
    1. Humility about one’s own knowledge of a specific industry area,
    2. An ability to empathise with the problems of different stakeholders, and
    3. The ability to take notes, including but not limited to mind maps, to organize ideas and thoughts in data science projects.
  4. Strive for useful models, not for more complex models. Data scientists ignore hypotheses that come from such discussions at their own peril; hypotheses form the lifeblood of useful data science and analysis. As George E. P. Box said, “All models are wrong, but some are useful” – and this couldn’t be more true than of models built from hypotheses. It is such models that become really useful.
  5. Simpler models are easier to manage in a data ethics context. In product companies that use machine learning and data science to add value to customers, there is a constant debate about the effective and ethical use of customer data. While having more data at one’s disposal is helpful for building lots of features, careless use of customer data presents a huge risk. Simpler models are easier to explain – and we arrive at them when we accumulate sufficient domain knowledge and test enough hypotheses. With simpler models, it is easier to explain what data needs to be collected, which can also help win the customer’s trust.
  6. Careful, human-supervised feature engineering may be more effective and scrupulous than automated feature engineering. We live in a world where AutoML and robotic data science are often discussed in the context of machine intelligence and of speeding up the generation of insight from data. However, for some applications, it may be a better idea in the short term to ensure that feature engineering happens through human hands. Such careful feature engineering, erring on the side of caution, may give organizations that use sensitive data a leg up as a longer-term strategy (see the sketch after this list).
  7. Deep learning isn’t the end of the road for data scientists. Deep learning has (justifiably) seen a great deal of hype in the recent past. However, it cannot be seen as a panacea for all data analysis. The end goal of working with data is the generation of value – be it for a customer, or for society at large. There are many ways to achieve this, and deep learning is just one approach.
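To make points 5 and 6 a little more concrete, here is a minimal sketch of a domain-informed, explainable model in Python. Everything in it – the dataset, the column names, the 950-degree spec limit and the percentile cut-off – is a hypothetical placeholder for what a domain expert on a real project would actually supply.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical manufacturing dataset, for illustration only.
df = pd.read_csv("production_batches.csv")

# Hand-crafted features encoding domain hypotheses, e.g. "defects rise when
# the furnace runs hot and the cycle is rushed". The 950-degree limit and the
# 10th-percentile cut-off are assumed values a domain expert would provide.
df["temp_excess"] = (df["furnace_temp"] - 950).clip(lower=0)
df["rushed_cycle"] = (df["cycle_time"] < df["cycle_time"].quantile(0.1)).astype(int)

X = df[["temp_excess", "rushed_cycle"]]
y = df["defect"]  # 1 if the batch was defective, 0 otherwise

model = LogisticRegression().fit(X, y)

# With two named, domain-motivated features, the coefficients themselves are
# the explanation offered to a subject matter expert or a customer.
for name, coef in zip(X.columns, model.coef_[0]):
    print(f"{name}: {coef:+.3f}")
```

The point is not the specific model; it is that two named, domain-motivated features and a linear model are something a subject matter expert can read and challenge, which a large, automatically generated feature set often is not.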

I’m not discussing the many technical aspects of building explainable models here. For one, those aspects are contextual and depend on the situation; for another, the tone of this post and of the tweets is deliberately lighter, to encourage a discussion and to welcome beginner data scientists into it. Hence my omission of these (important) topics.

If you liked something in this post, or want to share any other related insights, do drop a comment, tweet to me at @rexplorations, or message me on LinkedIn.


Domain: The Missing Element in Data Science

As a data science consultant who routinely deals with large companies and their data analysis, data science and machine learning challenges, I have come to appreciate one key element of the data scientist’s skill set that isn’t often discussed in data science circles online. In this post I hope to elucidate the importance of domain knowledge.

Over the last several years, there has (rightly) been significant debate on the skill set of the data scientist, and on the relative importance of business, statistics, programming and other skills. Interesting sub-classifications such as “data hacker” and “data nerd” have been used to describe the various combinations or intersections of these skills.

The Importance of Domain Knowledge

In all of these discussions, however, one key element has been left out. And that is the domain.

[Image: Domain_DS – domain knowledge alongside the data science skill set]

Domain knowledge is an important part of the data scientist’s work. Although the perfect data scientist is a bit of a unicorn, the domain should be an important consideration.

Domain knowledge is distinct from statistics, data analysis, programming and the purely technical areas, and it is easy to see why. Business knowledge, however, is often conflated with domain knowledge, perhaps understandably, because both are vague and interdisciplinary areas. Business knowledge entails some amount of financial knowledge, unit economics, strategy, people management, and a range of other skills taught in business schools and, more commonly, learned in organisations on the job. Domain knowledge, by contrast, is closer to being a kind of human expert system. Wikipedia defines an expert system without really defining expertise itself – so what role does expertise play in data science?

Domain knowledge is a result of the system exploration that humans, as system builders, naturally do. To formulate intelligent hypotheses, one has to study and understand the unique cause-and-effect chains that are relevant to a specific system. Do humans learn about systems in ways that differ from how machines might explore them, given infinite data and computational capability? That is a hard question to answer in this context, and perhaps a red herring of sorts. What is useful to note, however, is that machine learning models still rely on human-formulated hypotheses. There is the odd example of an expert system that has formulated hypotheses and proved them (as is happening in medicine these days), but such examples are hardly possible without human intervention.

Now that we have established that human intervention remains necessary in machine learning systems, data science can be seen as a field that relies uniquely on human-formulated hypotheses. Computational power and statistical models help us explore data and construct hypotheses, but the decisions made along the way – defining the hypotheses, modeling the data to test them, constructing mathematical or statistical models of those data, and evaluating the results of those tests – all take place with human intervention.
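As a small illustration of that loop, here is a hedged sketch in Python: the hypothesis comes from a conversation with a domain expert, and the statistical machinery only evaluates it. The file and column names are hypothetical.

```python
import pandas as pd
from scipy import stats

# Hypothetical data: per-batch yields recorded for two production lines.
df = pd.read_csv("line_yields.csv")

# Human-formulated hypothesis (from a domain discussion):
# "Line B yields less than Line A because of its older tooling."
yield_a = df.loc[df["line"] == "A", "yield_pct"]
yield_b = df.loc[df["line"] == "B", "yield_pct"]

# Welch's t-test evaluates the claim; it did not generate it.
t_stat, p_value = stats.ttest_ind(yield_a, yield_b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# What happens next (inspect the tooling, collect more data, discard the
# hypothesis) is again a human, domain-informed decision.
```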

So where does domain fit in? Domain experts are those who have spent significant time learning about one or a few interconnected systems in intimate ways. Their gut feel for a system’s performance and characteristics helps them leapfrog ahead in the formulation of hypotheses, and this is their biggest advantage over domain-agnostic data scientists, who merely have the programming, statistics, business and communication skills required to make serious analysis happen.

Domain Expertise and Analysis Paralysis

Domain expertise is probably one fine way to fight off the analysis-paralysis problem that plagues many data science teams. Some teams spend significant time and resources experimenting broadly with ideas, and the availability of high-performance computing power on tap makes them take hypothesis formulation less seriously. Adversity is truly the mother of inventiveness: it was when computing power was at a premium, for example, that some of the most efficient sorting algorithms were devised. Similarly, the availability of computing power and statistical modeling capabilities on a massive scale de-incentivizes asking pertinent questions.

Pertinent questions and specific answers lead to tangible decisions and related business improvements, and without the benefit of domain knowledge this is not possible. Analysis paralysis is a very real phenomenon, and data scientists are especially susceptible to it in organizations that value domain expertise but don’t value analytical solutions. Even where analytical solutions and problem solving are valued, data scientists who fly blind, toting algorithms and machine learning, won’t come out on top either – they are more likely to hurt the credibility of the data science exercise than help it when they use complex algorithms (which may not give sufficient insight into their own workings, even when they work well) to solve simple problems that already have domain formulations.

Challenge or Channel Domain Expertise?

Machine learning work in medicine (such as cancer cell detection) points to a future where human-learned skills are replicated by deep learning or reinforcement learning systems. At the same time, many real data science programs at diverse companies exhibit an analysis paralysis that can be addressed by involving specific kinds of domain experts more deeply in hypothesis formulation, analysis and the interpretation of results. The latter is more representative of the real world than the former, in which an expert system independently learns about a hard problem and solves it.

Doing Data Science Better

Doing data science better isn’t merely a matter of developing data scientists along the lines described by Drew Conway or Stephan Kolassa. It is also important to groom analytically capable people from within domains. This means distributing the skill set required for serious analysis out of the mainstream data science practice and into functional teams. Sometimes this may mean reaching into leadership teams that work in functional capacities; at other times it may mean addressing the needs of small teams directly, by grooming functional and technical talent to do data science.

Doing data science better doesn’t merely involve leveraging algorithms and their strengths better; it also means asking the right questions. Pay attention to your domain experts, and build your team’s analytical capabilities around them. For many companies, success doesn’t look like all-conquering deep learning algorithms; it looks like specific problems solved in a targeted manner, using well-defined problem statements and the right algorithms and frameworks.