Using DAX for predictions with linear regression and correlation
DAX is a powerful programming language used by Excel and Power BI to manipulate data to gain the most insights. I have been learning both of these software’s in extreme detail for over a year now and I have found that one MAJOR drawback is the lack of ability to make predictions using linear regression and Pearson’s correlation coefficient. To obtain these statistics, people generally use R, SAS, or some other powerful statistical software but NOT DAX. Inspired by Rob Collie’s blog “Movers and Fakers: Spotting Outlier Performance in Power Pivot / Power BI”, I realized that, with a little statistics, regression is not a drawback of the software but a limitation on the user’s knowledge of statistics. This post is dedicated to teaching Excel and Power BI users how to utilize DAX in order to achieve optimal, accurate predictions and data driven decisions based off the statistical findings within a dataset.
Watch the video here- https://youtu.be/yDyidzBaFuQ
Before we get to the DAX, let’s update our statistics knowledge. Simple linear regression is used to predict the values of one variable (the criterion variable, Y) based on the values of another variable (the predictor variable, X) using the relationship between the two variables. It is referred to as “simple” linear regression because we only have one predictor variable for the criterion variable, but of course, it is possible to have many predictor variables and the more predictor variables, the more variance is explained in Y by X. The strength of the relationship between X and Y is measured by Pearson’s correlation coefficient r, or for more accurate predictions- the coefficient of determination, r squared. If r squared is -1, there is a strong negative relationship between these two variables. If r squared is 0, there is absolutely no relation between these two variables. If r squared is 1, then there is a perfect relationship between these variables and there is no variance that exists between these variables (which would be impossible unless both variables are in fact, identical). There is much, MUCH more to be said about these and other statistics but I will try to keep things as short and simple as possible. Like in most math/statistics classes, I will show you the most computationally extensive way of achieving results first, then the easier way, so if statistics doesn’t excite you like it does me, skip to the end of this post 🙂
Let’s take a look at the data we will be using for this example (shown below). This is fake data I created in Excel with three columns: Customer ID, Age (X), and Sales (Y). The ultimate goal here is to predict the sales amount, Y, based on the age, X, of customers that bought a hypothetical product.
The formula for Pearson’s correlation coefficient is as follows (where n is the number of observations):
It looks like we will need to create a few calculated columns and measures to make this formula work. The first step is loading our data in Power BI Desktop. After launching Power BI Desktop, we select the “Get Data” icon, then choose Excel and navigate to the location in which the data is stored, then load the data.
The formula for Pearson’s correlation coefficient needs information from:
As a side note, feel free to copy/paste my formulas and use them on your own data and substitute the names of my columns for yours. We create these columns in Power BI Desktop using the following formulas:
XY = [Age (X)]*[Sales (Y)]
Xsquared = [Age (X)]*[Age (X)]
Ysquared = [Sales (Y)]*[Sales (Y)]
The result should look like the screenshot below.
Now we need to create some measures to show the sum of: X, Xsquared, Y, Ysquared, and XY using the formulas below.
Xsum = SUM(DemoTable[Age (X)])
Xsquaredsum = SUM(DemoTable[Xsquared])
Ysum = SUM(DemoTable[Sales (Y)])
Ysquaredsum = SUM(DemoTable[Ysquared])
XYsum = SUM(DemoTable[XY])
The result of our calculations is shown below:
Now we can create Pearson’s correlation coefficient to see if our data has a strong enough relationship for prediction. Remember, we are trying to achieve a result for r that is close to 1 where:
I split up the formula for r in 4 different parts to make computations easier to read and follow.
r numerator = 10*[XYsum]-[Xsum]*[Ysum]
r denominator 1 = (10*[Xsquaredsum]-[Xsum]^2)*(10*[Ysquaredsum]-[Ysum]^2)
r denominator 2 = SQRT([r denominator 1])
r = DIVIDE([r numerator],[r denominator 2])
The correlation coefficient for our data is:
We simply square this number to achieve the coefficient of determination, r squared:
A r squared value of 0.8516 is a great indicator that X and Y have a strong linear relationship. Now lets find the equation of the linear regression line that we will use to make predictions with. The formula for a line is usually in the format: Y = mx + b (where m is the slope of the line, X is the predictor variable and b is the Y intercept). The formula for finding m and b is as follows:
The DAX measures I used to find the equation of the line is as follows:
Slope = (10*[XYsum] – [Xsum]*[Ysum])/(10*[Xsquaredsum] – [Xsum]^2)
Y Intercept = DIVIDE(([Ysum]*[Xsquaredsum] – [Xsum]*[XYsum]),(10*[Xsquaredsum] – [Xsum]^2))
Putting the results of these two measures together produces the formula for the line:
FINALLY, we are able to start making predictions about the sales amount, Y, based on the values of age, X.
Let’s see how much sales is predicted for the ages 10, 30, and 50. We create the last measures needed for this analysis by substituting these ages into the formula of our regression line.
10 Sales = 0.9675*10 – 5.0895
30 Sales = 0.9675*30 – 5.0895
50 Sales = 0.9675*50 – 5.0895
The results of these measures are as follows:
We can see that as age increases, the sales amount increases so if we were advertising our hypothetical product, we should target individuals around 50 years old.
Neat huh?
Now let’s see how to do this the easy way using Excel. If we highlight the X and Y data ranges back in our Excel workbook, we can insert a regular scatter chart as shown below.
Once we insert the scatter chart, right click on any data point and select “Add Trendline…”.
Once the Format Trendline dialog box appears on the right, scroll to the bottom of the box and select “Display Equation on chart” and “Display R-squared value on chart” as shown below.
After you conclude that r squared is high enough for accurate predictions and obtain the formula for the line, use Power BI to make predictions for sales amount by age as shown earlier in the post.
You can also add trendlines to scatterplots in Power BI Desktop.