This page is an addition to my blog that’s intended to address a specific purpose. There’s a lot I’m interested in within the data science and big data space, and this blog doesn’t fully reflect those interests. In the past several months, my professional life has taken me down some very interesting data analysis paths, and I’ve had a chance to learn from some of the industry pioneers in big data. As a result, I’ve become more open to the choices and preferences of data scientists and analysts, and have come away wiser about what’s required for serious, large-scale data science projects.
Dabbling with Databases
From structured, tabular schemas to distributed NoSQL databases, there’s a wide range of options for data scientists who want to store data, especially massive data sets. The free and open source software landscape has equipped organizations large and small with a number of free database distributions – PostgreSQL and MySQL stand out on the relational side, where business logic can be encoded into table schemas. However, the availability of large amounts of unstructured data has made distributed NoSQL databases a necessity. Many common NoSQL databases, such as MongoDB, serve specific purposes, but the data analysis world tends to prefer the fault tolerance of stores such as HBase, which runs on HDFS. Here’s an image from Hortonworks that explains the role of HDFS in the overall architecture of a typical big data implementation, combining data acquisition, storage, governance, security and operations.
This Apache project provides a reliable, replication-based, fault-tolerant data store. Over the last several months, I’ve gained an appreciation for the speed and efficiency of these databases and of platform-specific alternatives such as MapR’s MapR-FS (which exposes an HDFS-compatible API but is implemented independently, using container abstractions). What excites me about distributed databases is their use in Big Compute jobs (as distinct from Big Data workloads, which are essentially characterized by Data Lake architectures). The cheap and easy availability of cloud clusters (on AWS, Azure or Google Cloud) capable of large-scale data storage and computation makes the value proposition very attractive for those getting into data analytics.
My experiences with the Hadoop ecosystem’s various flavours of implementations have taught me to like structured data and relational databases even more than I used to. For one thing, refining unstructured data is a hard problem, even given the considerable weight that companies like Trifacta are throwing behind it. Data wrangling is the rate-determining step in the chemical reaction that is data science: it is the slowest stage, and it shapes everything downstream. Effective data wrangling is what exposes features to machine learning models and statistical analysis algorithms, and this is hard to do in NoSQL databases by default. Extract, Transform and Load (ETL) processes therefore take a position of importance in the data architecture of any serious business.
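To make the ETL idea concrete, here is a minimal sketch in plain Python, using only the standard library. The column names and sample values are illustrative assumptions, not taken from any particular pipeline:

```python
import csv
import io

# Extract: read raw CSV text (an in-memory sample standing in for a real file).
raw = io.StringIO("user_id,amount\n1,10.5\n2,not_a_number\n3,7.25\n")
rows = list(csv.DictReader(raw))

# Transform: coerce types and drop malformed records.
clean = []
for row in rows:
    try:
        clean.append({"user_id": int(row["user_id"]),
                      "amount": float(row["amount"])})
    except ValueError:
        continue  # a real pipeline would log or quarantine bad rows

# Load: here we just aggregate into a summary instead of writing to a table.
total = sum(r["amount"] for r in clean)
print(len(clean), total)  # the malformed record is dropped
```

A production ETL job would swap the in-memory extract and the print for connectors to real sources and sinks, but the shape – extract, validate and transform, then load – is the same.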
R, Python and Spark
The languages I have primarily associated myself with over the past several years are R and Python. I have used the former mainly for fast data analysis, model building and data visualization, while the latter has become my go-to for quick analysis of data frames and visualization of medium-sized data sets, ranging from a few hundred megabytes to a gigabyte. While a small sample of my R work is available on this blog, I don’t maintain a similar record of my Python work, even though my Python experience is more extensive – ranging from optimization algorithms to big data analysis and machine learning at scale.
What has captivated me in the last few months, of course, is Apache Spark. Spark describes itself as a general-purpose, large-scale, distributed data processing engine, and it lives up to each of those claims. While the analyses I have done on smaller data sets were easy enough in Python (and R, for that matter), Spark was revelatory in its ability to process large data sets on a cluster with ease. Spark’s Scala and Python implementations of Resilient Distributed Datasets (RDDs) made short work of loading data files into Spark as a DataFrame (a familiar abstraction in R, and common enough in Python data analysis). In its latest versions, Spark performs equally well across the Python (PySpark), R (SparkR) and Scala (native) APIs, which is a pleasant surprise for data scientists used to Python and R.
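What makes RDDs so convenient is that transformations like `map` and `filter` are lazy – nothing runs until an action such as `collect` forces evaluation. The toy class below illustrates that idea in plain Python; it is a deliberately simplified stand-in, not Spark’s actual API:

```python
class TinyRDD:
    """A toy illustration of Spark's RDD idea: transformations are
    recorded lazily, and an action (collect) forces evaluation."""

    def __init__(self, data, ops=None):
        self._data = data
        self._ops = ops or []

    def map(self, fn):
        # Returns a new dataset; nothing is computed yet.
        return TinyRDD(self._data, self._ops + [("map", fn)])

    def filter(self, pred):
        return TinyRDD(self._data, self._ops + [("filter", pred)])

    def collect(self):
        # The action: replay the recorded transformations in order.
        out = list(self._data)
        for kind, fn in self._ops:
            out = [fn(x) for x in out] if kind == "map" else [x for x in out if fn(x)]
        return out

odd_squares = TinyRDD(range(1, 6)).map(lambda x: x * x).filter(lambda x: x % 2 == 1)
print(odd_squares.collect())  # [1, 9, 25]
```

In real Spark, the same chain (`sc.parallelize(...).map(...).filter(...).collect()`) is additionally partitioned across the cluster and recoverable from lineage if a node fails – that is where the "resilient" and "distributed" in RDD come from.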
However, while Apache Spark is great for data processing, it is not without its disadvantages. It can be complex, and its memory management on a cluster is error-prone, causing occasional headaches for even experienced data scientists. Spark also suffers from poor integration with YARN, the native resource manager within Hadoop architectures, and it does not support visualization natively. Despite the availability of Apache Zeppelin – which I would call a work in progress as of October 2016 – there is no truly excellent user interface for coding in Spark that takes the data scientist’s workflow into account. In other words, if you like RStudio and Jupyter Notebook, you’ll be disappointed to learn that Zeppelin isn’t as mature as either.
Despite these drawbacks (and I believe the community will address them in time), Spark occupies a central place in data science at the moment. In many ways, I think Spark is ahead of its time, and the memory errors and other issues developers face will fade as the framework and hardware mature. The RDD abstraction is a very handy one for the problems data scientists face, and the core ideas behind Spark are sound, making it a platform of choice for data science teams working on large data sets.
The Challenges of Data Science
The challenges of data science are not only related to platforms, software and hardware. Data science is fundamentally the conversion of hypotheses and models of the data into verified insights through analysis. When you put the data scientist’s workflow into this context, you understand the value of sound statistical foundations for data analysis.
I’ve written earlier on this blog about the importance of domain knowledge in the data scientist’s job description. There’s plenty of discussion on forums and data science blogs about what the data science Venn diagram should look like, and for good reason. In my experience, all of the skills in that diagram have played a role in my own effectiveness, and being informed about a domain makes data scientists better prepared to take on analyses and projects.
Fortunately, the tools at this stage enable such analysis very well. R and Python have built up libraries of data analysis tools over the years that can perform a wide range of statistical analyses, both descriptive and inferential. The challenges of big data are being addressed by Apache Spark’s libraries for statistics and data analysis, which are diverse and effective. Spark’s statistics libraries are not as extensive as Python’s or R’s – the latter especially shines, given its sheer range of functions and packages – but Spark’s growing body of code lets it accomplish much of what R and Python do, in a compact and reasonably elegant way.
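Even without any of these libraries, the descriptive-and-inferential pairing is easy to sketch. Here is a minimal example using only Python’s standard library; the sample values and the hypothesised mean are illustrative assumptions:

```python
import math
import statistics

# A small, illustrative sample of measurements.
sample = [12.1, 11.8, 12.5, 12.0, 11.6, 12.3]

# Descriptive statistics: summarise the sample.
mean = statistics.mean(sample)
sd = statistics.stdev(sample)  # sample standard deviation

# Inferential statistics: a one-sample t statistic against a
# hypothesised mean of 12.0 -- the kind of check that R, Python
# and Spark's statistics libraries all provide in richer form.
t = (mean - 12.0) / (sd / math.sqrt(len(sample)))
print(round(mean, 3), round(sd, 3), round(t, 3))
```

In practice you would reach for R’s `t.test`, SciPy’s `stats.ttest_1samp` or Spark’s MLlib statistics rather than hand-rolling this, but the underlying computation is the same.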
In my view, data science cannot be effective without exploratory data analysis. The tools for reasoning with data are not only numerical but also visual, and I believe visual data analysis helps us build simple, widely applicable machine learning models. Rather than simply amping up model complexity, as novice data scientists are wont to do, there is value in stating misclassification and error rates up front, and in striving for better data quality or better modeling techniques to reduce the remaining error in a model.
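Stating the misclassification rate up front is a one-liner, which makes it all the more reasonable to demand it. A minimal sketch, with toy labels and predictions standing in for a real model’s output:

```python
# Toy ground-truth labels and model predictions (illustrative only).
actual    = [1, 0, 1, 1, 0, 1, 0, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# Misclassification rate: fraction of predictions that disagree with truth.
errors = sum(a != p for a, p in zip(actual, predicted))
misclassification_rate = errors / len(actual)
print(misclassification_rate)  # 0.25
```

Reporting this number alongside every model – before any tuning – keeps the conversation honest about whether added complexity is actually buying accuracy.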
Feature Engineering is your Friend
Feature engineering techniques have been gaining prominence, especially since NoSQL databases and pipelines for derived tabular data give data engineers and analysts an opportunity to define, engineer and summarise their data into the same column stores that hold the raw data. For data scientists, this is a force multiplier. In the past, constructing a polynomial regression model when the toolset consisted only of linear models, for instance, meant creating engineered features. With the advent of many classification and regression algorithms, such engineered features carry real value. Defining new features lets us address one of machine learning’s most important stumbling blocks for data scientists – real-world model quality. While model quality is, for most data scientists, informed by context, feature engineering approaches make it possible to experiment with many different models.
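The polynomial regression case is the classic example: by engineering powers of a predictor as new columns, a purely linear model can fit a curved relationship. A minimal sketch, with the input value and degree chosen purely for illustration:

```python
def polynomial_features(x, degree):
    """Expand a single predictor x into the engineered features
    [x, x**2, ..., x**degree], ready to feed to a linear model."""
    return [x ** d for d in range(1, degree + 1)]

# One raw value becomes three engineered features.
row = 3.0
features = polynomial_features(row, 3)
print(features)  # [3.0, 9.0, 27.0]
```

A linear model fit on these columns is, in effect, a cubic model in the original variable – the model stays linear in its parameters, while the features carry the nonlinearity.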
The availability of data abstractions such as DataFrames and RDDs makes it possible to construct such features on the fly; as samples and data sets change, the same rule-based or functional transformations update the features. Feature engineering is a very important aspect of data science today: it is not only a way of building better models, but a way of getting from thought to thing faster, using fewer resources and producing repeatable models.
Deep Learning as “Just Another Tool”
Deep Learning users fall into two camps: one sees Deep Learning as one more tool in the suite that converts data into insights, while the other intends to use Deep Learning in ever newer ways, as a replacement for special-purpose algorithms designed for specific problem formulations. The former are more pragmatic and, in my view, likely to have more success doing data science in general.
I have come across data scientists who are happy to use ML and DL implementations from established libraries without real knowledge of how those implementations work. They also do so without due diligence on the nature of the data, which makes them less successful at interpreting their results. Such behaviour is, in my view, dangerous – a ready recipe for data analysis disaster. I see many data scientists use Deep Learning in the same spirit: because the tools are available and free, they are used to replace tools suited to other situations. I would advise caution here, because a general-purpose method such as a deep learning algorithm is very likely to be more computationally expensive (and wasteful) than an algorithm developed from a clear mathematical formulation to solve a specific problem. Too often the model is treated as a black box, or a magic wand, rather than as an arguable, negotiable “data arranger and processor” that produces specific results.
The advent of AI, and Deep Learning’s role in it, is a whole other topic. However, those doing data science, rather than building intelligent systems, should seek the best, most efficient method rather than reaching for a general-purpose model out of convenience; this parsimony pays off in the long run, when solutions are integrated as services for specific problems.
Don’t Ignore Time
By “time” here I mean the independent variable in a time series analysis, of course. Too many data scientists fail to test the assumptions a modeling technique makes about the data, such as the normality assumption in ordinary least squares regression. They also neglect an important element of data collection – the time variable. Time series analysis can give crucial insights, since many quantities reveal their true patterns of variation only along a time axis, and only seldom in conjunction with other variables.
All real-world quantities change with time (with the possible exception of the constants of nature and mathematics), and this time-based variation should be considered when analyzing data sets. Data scientists should not only be aware of time series analysis methods, but should also be willing to understand process shift, drift and heteroscedasticity. Autocorrelation, autoregression and the models built on them can be powerful allies for the data scientist dealing with streams of data, especially sensor data.
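Autocorrelation is simple enough to compute by hand, and doing so once demystifies it. The sketch below computes the lag-k sample autocorrelation with only the standard library; the alternating signal is a deliberately artificial example chosen to make the result obvious:

```python
def autocorrelation(series, lag):
    """Sample autocorrelation of a series at the given lag:
    covariance of the series with its lagged self, scaled by variance."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series)
    cov = sum((series[t] - mean) * (series[t + lag] - mean)
              for t in range(n - lag))
    return cov / var

# A perfectly alternating signal: each value is the negation of the
# previous one, so the lag-1 autocorrelation is strongly negative.
signal = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]
print(autocorrelation(signal, 1))
```

Plotting this quantity across many lags gives the autocorrelation function (ACF), the standard first diagnostic before fitting an autoregressive model to sensor streams.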
From Thought to Thing, Fast
Gone are the days when data and business analysts would take days to run through models and ideas. With advanced data structuring and storage tools, and the availability of the tools described above, there is real value in fast iteration cycles, and real benefits to iterative model building. Iterative statistical model building and iterative hypothesis testing are essential for the modern data scientist, and the tools and platforms that enable them – distributed computing paradigms, large-scale data analysis methods and machine learning techniques – all play a role here.
Business leaders around the world realize the potential of these tools and see them as strategic enablers when used well. Naturally, demand for data science skills (and, as a result, for good data scientists) is high. More and more business leaders are now interested in data science done right, since data literacy is increasingly common among seasoned managers. They are not satisfied with the initial results of an analysis; they want platforms and business solutions built around these tools. This journey through the data science universe has definitely given me perspective on the challenges managers face, in addition to the challenges and skills of data scientists themselves.