The dangers of Cortana Analytics and poor data
Microsoft has recently introduced Cortana Analytics — an Azure-based service that allow users to perform predictive analysis through natural language, by just asking a question either via text, or by speech. While I believe Cortana Analytics to be amazing and I look forward not only to use it, but to implement it to our customers, I am concerned on the dangers imposed in relying on this technology. And no, Microsoft is not the one at fault this time. It is the users who I am worried about.
So what is Cortana Analytics? Cortana Analytics is the next step in natural language query for Microsoft Power BI. Currently in Power BI, users can write a question in the Power BI search box, and the system will automatically generate the most appropriate report to answer that question. The following is an example of a question I asked Power BI against the Retail Analysis sample data set:
In simple terms, this Q&A feature of Power BI means that users no longer need to design reports or ask for reports to be designed. They can simply ask the question they want, fine-tune the output to suit their needs (e.g. show a map rather than horizontal bars), and then save the query themselves to be used later, or to share it with their peers. This is great news not only to users, but also to BI/DW specialists who do not need to be fidgeting with report design anymore and can focus their efforts in data modeling instead.
So what does Cortana Analytics adds to the mix? First, it allow users to ask questions straight from the Windows 10 desktop using Cortana. Searches can be performed not only through typing, but also through speech. Second and most importantly, Cortana Analytics allow users to perform predictive analysis, and ask forward-looking questions like what customers are most likely to churn in the next quarter or what is the expected waiting time for patients during the holidays.
The problem with this scenario is that users might ask questions which the analytics engine is not able to answer correctly, and by no means due to a fault on the product, but due to a poor data set. Let me elaborate.
Suppose that we have a data set (picture it as an Excel sheet) from a clinical trial containing sick and healthy patients. In this sheet we have 30 variables (i.e. columns) containing details such as age, gender, blood pressure, and so forth. The last column is a TRUE/FALSE (i.e. binary) column that determines whether the patient is sick (TRUE) or not (FALSE). Now let’s suppose that we want to find out which one of the 29 columns in this data set can influence the outcome of the 30th column. In other words, which variables determine whether a person can get sick or not.
For the example above, we would normally conduct prescriptive techniques such as regression analysis or decision trees in order to answer our question. With Cortana Analytics this is no different. Cortana will forward this question to a system that is apt to answer this question (i.e. Azure Machine Learning) in order to obtain the answer.
Now here is the problem. Let’s suppose that we run a regression analysis against the clinical trial example given above. If we have thousands or even hundreds of thousands of observations (i.e. rows) in this data set, then it is likely that we will be able to get a regression model that tell us with a acceptable degree of confidence which variables are most likely to make someone sick. Now if we have fewer observations (say, less than 100), then it is likely that the regression model will provide no concrete results. This is not a limitation of a vendor or its product. It is a fundamental fact of statistical analysis. And even if Cortana hides all the technicalities of such analysis under the hood, it does not mean that such technicalities are no longer relevant.
There are also other factors that can negatively affect the outcome of an analytical model. For example when dealing with numbers, the system must know which variables are discrete and which are continuous, as well was their measurement scale (i.e. interval or ratio). Also variables that have an ordinal value such as cold, warm and hot must be identified accordingly.
So the point is that if the data set is not appropriately defined, and if it does not contain enough observations, any sort of predictive or prescriptive analysis is likely to provide a poor model that is not meant to be trusted. Now imagine Cortana Analytics on the hands of a decision maker who has no idea about those issues, and is attempting to execute an analysis against an ill-suited data set. A recipe for certain disaster.
Now there are some people out there who thinks that with Cortana Analytics there is no need for data scientists. Now I urge those people to reconsider based on what I have outlined in this post. Thinking that Cortana Analytics is a replacement for data scientists is like thinking that auto-pilot is a replacement for an airplane flight crew.
This is the classic case of garbage in, garbage out. Now imagine when users do not have the training to identify what is garbage and what is not.