Regression Analysis | Minitab

Can Regression and Statistical Software Help You Find a Great Deal on a Used Car?


You need to consider many factors when you’re buying a used car. Once you narrow your choice down to a particular car model, you can get a wealth of information about individual cars on the market through the Internet. How do you navigate through it all to find the best deal?  By analyzing the data you have available.  

Let's look at how this works using the Assistant in Minitab 17. With the Assistant, you can use regression analysis to calculate the expected price of a vehicle based on variables such as year, mileage, whether or not the technology package is included, and whether or not a free Carfax report is included.

And it's probably a lot easier than you think. 

A search of a leading Internet auto sales site yielded data about 988 vehicles of a specific make and model. After putting the data into Minitab, we choose Assistant > Regression…

At this point, if you aren’t very comfortable with regression, the Assistant makes it easy to select the right option for your analysis.

A Decision Tree for Selecting the Right Analysis

We want to explore the relationships between the price of the vehicle and four factors, or X variables. Since we have more than one X variable, and since we're not looking to optimize a response, we want to choose Multiple Regression.

This data set includes five columns: mileage, the age of the car in years, whether or not it has a technology package, whether or not it includes a free CARFAX report, and, finally, the price of the car.

We don’t know which of these factors may have a significant relationship to the cost of the vehicle, and we don’t know whether there are significant two-way interactions between them, or if there are quadratic (nonlinear) terms we should include—but we don’t need to. Just fill out the dialog box as shown.

Press OK and the Assistant assesses each potential model and selects the best-fitting one. It also provides a comprehensive set of reports, including a Model Building Report that details how the final model was selected and a Report Card that alerts you to potential problems with the analysis, if there are any.
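If you want a rough sense of what a comparable model looks like outside Minitab, here is a minimal sketch using Python's statsmodels. The file name and column names are hypothetical placeholders, and the sketch does not reproduce the Assistant's model-selection procedure.

```python
# A rough sketch of a multiple regression with two-way interactions and a
# quadratic term, similar in spirit to the models the Assistant considers.
# "used_cars.csv" and its column names are hypothetical placeholders.
import pandas as pd
import statsmodels.formula.api as smf

cars = pd.read_csv("used_cars.csv")   # columns: Price, Mileage, Age, TechPackage, FreeCarfax

model = smf.ols(
    "Price ~ (Mileage + Age + C(TechPackage) + C(FreeCarfax))**2 + I(Mileage**2)",
    data=cars,
).fit()

print(model.summary())    # coefficients, p-values, and R-squared
```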

Interpreting Regression Results in Plain Language

The Summary Report tells us in plain language that there is a significant relationship between the Y and X variables in this analysis, and that the factors in the final model explain 91 percent of the observed variation in price. It confirms that all of the variables we looked at are significant, and that there are significant interactions between them. 

The Model Equations Report contains the final regression models, which can be used to predict the price of a used vehicle. The Assistant provides two equations: one for vehicles that include a free CARFAX report, and one for vehicles that do not.

We can tell several interesting things about the price of this vehicle model by reading the equations. First, the average cost for vehicles with a free CARFAX report is about $200 more than the average for vehicles with a paid report ($30,546 vs. $30,354).  This could be because these cars probably have a clean report (if not, the sellers probably wouldn’t provide it for free).

Second, each additional mile on the car decreases its expected price by roughly 8 cents, while each year added to the car's age decreases the expected price by $2,357.

The technology package adds, on average, $1,105 to the price of vehicles that have a free CARFAX report, but the package adds $2,774 to vehicles with a paid CARFAX report. Perhaps the sellers of these vehicles hope to use the appeal of the technology package to compensate for some other influence on the asking price. 

Residuals versus Fitted Values

While these findings are interesting, our goal is to find the car that offers the best value. In other words, we want to find the car that has the largest difference between the asking price and the expected asking price predicted by the regression analysis.

For that, we can look at the Assistant’s Diagnostic Report. The report presents a chart of Residuals vs. Fitted Values.  If we see obvious patterns in this chart, it can indicate problems with the analysis. In that respect, this chart of Residuals vs. Fitted Values looks fine, but now we’re going to use the chart to identify the best value on the market.

In this analysis, the “Fitted Values” are the prices predicted by the regression model. “Residuals” are what you get when you subtract the predicted asking price from the actual asking price—exactly the information you’re looking for! The Assistant marks large residuals in red, making them very easy to find. And three of those points—which appear in light blue above because we’ve selected them—are priced very far below the asking price predicted by the regression analysis.

Selecting these data points on the graph reveals that these are vehicles whose data appears in rows 357, 359, and 934 of the data sheet. Now we can revisit those vehicles online to see if one of them is the right vehicle to purchase, or if there’s something undesirable that explains the low asking price. 
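For readers who prefer to do the bargain hunting in code, the same idea is just a sort on the residuals (actual price minus predicted price). This sketch continues the hypothetical model fit shown earlier.

```python
# Residual = actual asking price minus the price the model predicts.
# The most negative residuals are the candidate bargains.
cars["Predicted"] = model.fittedvalues
cars["Residual"] = cars["Price"] - cars["Predicted"]

# The three cars priced furthest below their predicted value.
print(cars.nsmallest(3, "Residual")[["Price", "Predicted", "Residual"]])
```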

Sure enough, the records for those vehicles reveal that two of them have severe collision damage.

But the remaining vehicle appears to be in pristine condition, and is several thousand dollars less than the price you’d expect to pay, based on this analysis!

With the power of regression analysis and the Assistant, we’ve found a great used car—at a price you know is a real bargain.

 


What the Heck Are Sums of Squares in Regression?


In regression, "sums of squares" are used to represent variation. In this post, we’ll use some sample data to walk through these calculations.

The sample data used in this post is available within Minitab by choosing Help > Sample Data, or File > Open Worksheet > Look in Minitab Sample Data folder (depending on your version of Minitab). The dataset is called ResearcherSalary.MTW and contains data on salaries for researchers at a pharmaceutical company.

For this example, we will use the data in C1 (salary) as Y, the response variable, and C4 (years of experience) as X, the predictor variable.

First, we can run our data through Minitab to see the results: Stat> Regression> Fitted Line Plot.  The salary is the Y variable, and the years of experience is our X variable. The regression output will tell us about the relationship between years of experience and salary after we complete the dialog box as shown below, and then click OK:

fitted line plot dialog

In the window above, I’ve also clicked the Storage button and checked the box next to Coefficients to store the coefficients from the regression equation in the worksheet. When we click OK in the window above, Minitab gives us two pieces of output:

fitted line plot and output

On the left side above we see the regression equation and the ANOVA (Analysis of Variance) table, and on the right side we see a graph that shows us the relationship between years of experience on the horizontal axis and salary on the vertical axis. Both the right and left side of the output above are conveying the same information. We can clearly see from the graph that as the years of experience increase, the salary increases, too (so years of experience and salary are positively correlated).  For this post, we’ll focus on the SS (Sums of Squares) column in the Analysis of Variance table.

Calculating the Regression Sum of Squares

We see a SS value of 5086.02 in the Regression line of the ANOVA table above. That value represents the amount of variation in the salary that is attributable to the number of years of experience, based on this sample. Here's where that number comes from. 

  1. Calculate the average response value (the salary). In Minitab, I’m using Stat> Basic Statistics> Store Descriptive Statistics:

dialog boxes

In addition to entering the Salary as the variable, I’ve clicked Statistics to make sure only Mean is selected, and I’ve also clicked Options and checked the box next to Store a row of output for each row of input. As a result, Minitab will store the value 82.9514 (the average salary) in C5, repeated 35 times (once for each row):

data

  2. Next, we will use the regression equation that Minitab gave us to calculate the fitted values. The fitted values are the salaries that our regression equation would predict, given the number of years of experience. 

Our regression equation is Salary = 60.70 + 2.169*Years, so for every year of experience, we expect the salary to increase by 2.169. 

The first row in the Years column in our sample data is 11, so if we use 11 in our equation we get 60.70 + 2.169*11 = 84.559. So with 11 years of experience, our regression equation tells us the expected salary is roughly $84,600. 

Rather than calculating this for every row in our worksheet manually, we can use Minitab’s calculator: Calc> Calculator (I used the stored coefficients in the worksheet to include more decimals in the regression equation that I’ve typed into the calculator):

calculator

After clicking OK in the window above, Minitab will store the predicted salary value for every row in column C6. NOTE: In the regression graph we obtained, the red regression line represents the values we’ve just calculated in C6.

  3. Now that we have the average salary in C5 and the predicted values from our equation in C6, we can calculate the Sums of Squares for the Regression (the 5086.02). We’ll use Calc> Calculator again, and this time we will subtract the average salary from the predicted values, square those differences, and then add all of those squared differences together:

calculator

We square the differences because some of the predicted values from our equation are lower than the average, so those differences would be negative. If we simply summed the positive and negative differences, they would cancel each other out. Squaring ensures that every observation contributes to the total.

We have just calculated the Sum of Squares for the Regression by summing the squared differences. Our results should match what we saw in the regression output previously:

output
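If you prefer to check the arithmetic outside Minitab's calculator, the regression sum of squares takes only a few lines. This sketch uses made-up numbers; substitute the Years (C4) and Salary (C1) columns from ResearcherSalary.MTW to reproduce the 5086.02.

```python
# Regression sum of squares: squared differences between the fitted values
# and the mean response, summed over all rows.
import numpy as np

# Hypothetical demo values; replace with the worksheet's Years and Salary columns.
years = np.array([11, 5, 8, 14, 3, 20, 7, 10])
salary = np.array([85, 70, 78, 92, 66, 105, 75, 83])

slope, intercept = np.polyfit(years, salary, 1)   # simple linear regression fit
fitted = intercept + slope * years                # predicted salaries
ss_regression = np.sum((fitted - salary.mean()) ** 2)
print(ss_regression)
```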

Calculating the Error Sum of Squares

The Error Sum of Squares is the variation in the salary that is not explained by number of years of experience. For example, the additional variation in the salary could be due to the person’s gender, number of publications, or other variables that are not part of this model. Any variation that is not explained by the predictors in the model becomes part of the error term.

  1. To calculate the error sum of squares we will use the calculator (Calc > Calculator) again to subtract the fitted values (the salaries predicted by our regression equation) from the observed response (the actual salaries):                

calculator

In C9, Minitab will store the differences between the actual salaries and what our equation predicted.

  2. Because we’re calculating sums of squares again, we’re going to square all the values we stored in C9, and then add them up to come up with the sum of squares for error:

calculator

When we click OK in the calculator window above, we see that our calculated sum of squares for error matches Minitab’s output:

output

Finally, the Total Sum of Squares is calculated by adding the Regression and Error SS together: 5086.02 + 1022.61 = 6108.63.
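Continuing the sketch above, the error and total sums of squares follow the same pattern, and the two pieces should add up to the total.

```python
# Error SS: squared differences between the actual and fitted values.
ss_error = np.sum((salary - fitted) ** 2)

# Total SS: squared differences between the actual values and their mean.
ss_total = np.sum((salary - salary.mean()) ** 2)

print(ss_error, ss_total)
print(np.isclose(ss_total, ss_regression + ss_error))   # True: SST = SSR + SSE
```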

I hope you’ve enjoyed this post, and that it helps demystify what sums of squares are.  If you’d like to read more about regression, you may like some of Jim Frost’s regression tutorials.

Creating Value from Your Data


There may be huge potential benefits waiting in the data in your servers. These data may be used for many different purposes. Better data allows better decisions, of course. Banks, insurance firms, and telecom companies already own a large amount of data about their customers. These resources are useful for building a more personal relationship with each customer.

Some organizations already use data from agricultural fields to build complex and customized models based on a very extensive number of input variables (soil characteristics, weather, plant types, etc.) in order to improve crop yields. Airline companies and large hotel chains use dynamic pricing models to improve their yield management. Data is increasingly being referred to as the new “gold mine” of the 21st century.

Several factors underlie the rising prominence of data (and, therefore, data analysis):


Huge volumes of data

Data acquisition has never been easier (sensors in manufacturing plants and in connected objects, data from internet usage and web clicks, credit cards, loyalty cards, Customer Relationship Management databases, satellite images, etc.), and the data can be stored at costs that are lower than ever before, thanks to the huge storage capacity now available in the cloud and elsewhere. The amount of data being collected is not only huge, it is growing exponentially.

Unprecedented velocity

Connected devices, like our smartphones, provide data in almost real time, and that data can be processed very quickly. It is now possible to react to changes almost immediately.

Incredible variety

The data collected is not restricted to billing information; every source of data is potentially valuable to a business. Massive amounts of numeric data are being collected, but so is unstructured data such as videos and pictures, across a wide variety of situations.

But the explosion of data available to us is prompting every business to wrestle with an extremely complicated problem:

How can we create value from these resources?

Very simple methods, such as counting the words used in queries submitted to company web sites, provide good insight into the general mood of your customers and how it evolves. Web vendors often use simple statistical correlations to suggest a follow-up purchase just after a customer buys a product online. Even basic descriptive statistics are useful.

Just imagine what could be achieved with advanced regression models or powerful multivariate statistical techniques, which can be applied easily with statistical software packages like Minitab.

A simple example of the benefits of analyzing an enormous database

Let's consider an example of how one company benefited from analyzing a very large database.

Many steps are needed (security and safety checks, cleaning the cabin, etc.) before a plane can depart. Since delays negatively impact customer perceptions and also affect productivity, airline companies routinely collect a very large amount of data related to flight delays and the times required to perform tasks before departure. Some times are collected automatically; others are recorded manually.

A major worldwide airline intended to use this data to identify, among a very large number of preparation steps, the crucial milestones that most often triggered delays in departure times. The company used Minitab's stepwise regression analysis to quickly focus on the few variables that played a major role among a large number of potential inputs. Many variables turned out to be statistically significant, but two of them clearly made a major contribution (X6 and X10).

Analysis of Variance

Source    DF    Seq SS    Contribution    Adj SS     Adj MS    F-Value    P-Value
X6         1    337394          53.54%      2512     2512.2      29.21      0.000
X10        1    112911          17.92%     66357    66357.1     771.46      0.000

When huge databases are used, statistical analyses may become overly sensitive and detect even very small differences (due to the large sample and power of the analysis). P values often tend to be quite small (p < 0.05) for a large number of predictors.

However, in Minitab, if you click Results in the regression dialog box and select Expanded tables, the contribution from each variable is displayed. Taken together, X6 and X10 contributed more than 70% of the overall variability (and had by far the largest F-values), while the contributions from the remaining factors were much smaller. The airline then ran a residual analysis to cross-validate the final model. 
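The Contribution column is essentially each term's sequential sum of squares expressed as a percentage of the total sum of squares. For readers working outside Minitab, here is a rough sketch of the same calculation with statsmodels; the file and column names are hypothetical, not the airline's actual data.

```python
# Sketch: sequential (Type I) sums of squares as a share of the total SS,
# analogous to the Contribution column in Minitab's expanded ANOVA table.
# "departure_delays.csv" and its columns are hypothetical placeholders.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

delays = pd.read_csv("departure_delays.csv")
fit = smf.ols("Delay ~ X6 + X10 + X7 + X8", data=delays).fit()

seq = anova_lm(fit, typ=1)                              # sequential sums of squares
contribution = 100 * seq["sum_sq"] / seq["sum_sq"].sum()
print(contribution.round(2))                            # percent of total SS (last row is the residual)
```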

In addition, a Principal Component Analysis (PCA, a multivariate technique) was performed in Minitab to describe the relations between the most important predictors and the response. Milestones were expected to be strongly correlated to the subsequent steps.

The graph above is a loading plot from the principal component analysis. Variables are grouped visually according to their statistical correlations: lines that point in the same direction and lie close to one another indicate variables that may be grouped together.

A group of nine variables turned out to be strongly correlated to the most important inputs (X6 and X10) and to the final delay times (Y). Delays at the X6 stage obviously affected the X7 and X8 stages (subsequent operations), and delays from X10 affected the subsequent X11 and X12 operations.
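A loading plot like the one described can be produced with any PCA routine. Here is a minimal sketch with scikit-learn; the file and column set are hypothetical stand-ins for the airline's variables.

```python
# Sketch: PCA loadings for the delay-related variables, to see which
# predictors group together on a loading plot.
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = pd.read_csv("departure_delays.csv")        # hypothetical file with Y, X1, X2, ...
Z = StandardScaler().fit_transform(X)          # PCA on standardized columns

pca = PCA(n_components=2).fit(Z)
loadings = pd.DataFrame(pca.components_.T, index=X.columns, columns=["PC1", "PC2"])
print(loadings)   # variables with similar loadings point in the same direction on the plot
```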

Conclusion

This analysis provided simple rules that this airline's crews can follow in order to avoid delays, making passengers' next flight more pleasant. 

The airline can repeat this analysis periodically to search for the next most important causes of delays. Such an approach can propel innovation and help organizations replace traditional and intuitive decision-making methods with data-driven ones.

What's more, the use of data to make things better is not restricted to the corporate world. More and more public administrations and non-governmental organizations are making large, open databases easily accessible to communities and to virtually anyone. 

How to Identify the Most Important Predictor Variables in Regression Models


You’ve performed multiple linear regression and have settled on a model that contains several predictor variables that are statistically significant. At this point, it’s common to ask, “Which variable is most important?”

This question is more complicated than it first appears. For one thing, how you define “most important” often depends on your subject area and goals. For another, how you collect and measure your sample data can influence the apparent importance of each variable.

With these issues in mind, I’ll help you answer this question. I’ll start by showing you statistics that don’t answer the question about importance, which may surprise you. Then, I’ll move on to both statistical and non-statistical methods for determining which variables are the most important in regression models.

Don’t Compare Regular Regression Coefficients to Determine Variable Importance

Regular regression coefficients describe the relationship between each predictor variable and the response. The coefficient value represents the mean change in the response given a one-unit increase in the predictor. Consequently, it’s easy to think that variables with larger coefficients are more important because they represent a larger change in the response.

However, the units vary between the different types of variables, which makes it impossible to compare them directly. For example, the meaning of a one-unit change is very different if you’re talking about temperature, weight, or chemical concentration.

This problem is further complicated by the fact that there are different units within each type of measurement. For example, weight can be measured in grams and kilograms. If you fit models for the same data set using grams in one model and kilograms in another, the coefficient for weight changes by a factor of a thousand even though the underlying fit of the model remains unchanged. The coefficient value changes greatly while the importance of the variable remains constant.

Takeaway: Larger coefficients don’t necessarily identify more important predictor variables.
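A quick way to convince yourself is to refit the same model with the predictor rescaled from grams to kilograms: the coefficient changes by a factor of 1,000 while the fit statistics stay identical. A minimal sketch with made-up data:

```python
# Sketch: rescaling a predictor changes its coefficient but not the model fit.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(7)
grams = rng.uniform(100, 5000, 60)
y = 3.2 + 0.004 * grams + rng.normal(0, 1, 60)        # made-up response

fit_g = sm.OLS(y, sm.add_constant(grams)).fit()
fit_kg = sm.OLS(y, sm.add_constant(grams / 1000)).fit()

print(fit_g.params[1], fit_kg.params[1])   # coefficients differ by a factor of 1,000
print(fit_g.rsquared, fit_kg.rsquared)     # identical R-squared
```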

Don’t Compare P-values to Determine Variable Importance

The coefficient value doesn’t indicate the importance of a variable, but what about the variable’s p-value? After all, we look for low p-values to help determine whether the variable should be included in the model in the first place.

P-value calculations incorporate a variety of properties, but a measure of importance is not among them. A very low p-value can reflect properties other than importance, such as a very precise estimate and a large sample size.

Effects that are trivial in the real world can have very low p-values. A statistically significant result may not be practically significant.

Takeaway: Low p-values don’t necessarily identify predictor variables that are practically important.

Do Compare These Statistics To Help Determine Variable Importance

We ruled out a couple of the more obvious statistics that can’t assess the importance of variables. Fortunately, there are several statistics that can help us determine which predictor variables are most important in regression models. These statistics might not agree because the manner in which each one defines "most important" is a bit different.

Standardized regression coefficients

I explained how regular regression coefficients use different scales and you can’t compare them directly. However, if you standardize the regression coefficients so they’re based on the same scale, you can compare them.

To obtain standardized coefficients, standardize the values for all of your continuous predictors. In Minitab 17, you can do this easily by clicking the Coding button in the main Regression dialog. Under Standardize continuous predictors, choose Subtract the mean, then divide by the standard deviation.

After you fit the regression model using your standardized predictors, look at the coded coefficients, which are the standardized coefficients. This coding puts the different predictors on the same scale and allows you to compare their coefficients directly. Standardized coefficients represent the mean change in the response given a one standard deviation change in the predictor.

Takeaway: Look for the predictor variable with the largest absolute value for the standardized coefficient.
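Outside of Minitab's Coding dialog, you can get the same comparison by standardizing the continuous predictors yourself before fitting. This sketch uses hypothetical file and column names:

```python
# Sketch: standardize the continuous predictors, then compare coefficient magnitudes.
# "example.csv" with columns Y, North, South, East is a hypothetical placeholder.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("example.csv")
predictors = ["North", "South", "East"]
Z = (df[predictors] - df[predictors].mean()) / df[predictors].std()   # z-score each predictor

fit = sm.OLS(df["Y"], sm.add_constant(Z)).fit()
print(fit.params.drop("const").abs().sort_values(ascending=False))    # largest = most important
```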

Change in R-squared when the variable is added to the model last

Multiple regression in Minitab's Assistant menu includes a neat analysis. It calculates the increase in R-squared that each variable produces when it is added to a model that already contains all of the other variables.

Because the change in R-squared analysis treats each variable as the last one entered into the model, the change represents the percentage of the variance a variable explains that the other variables in the model cannot explain. In other words, this change in R-squared represents the amount of unique variance that each variable explains above and beyond the other variables in the model.

Takeaway: Look for the predictor variable that is associated with the greatest increase in R-squared.
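The "added last" increase in R-squared can also be computed directly: fit the full model, then refit it with each predictor dropped in turn and take the difference. A sketch under the same hypothetical column names:

```python
# Sketch: increase in R-squared when each predictor is added to the model last,
# i.e., full-model R-squared minus the R-squared of the model without that predictor.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("example.csv")            # hypothetical: Y, North, South, East
predictors = ["North", "South", "East"]

full = smf.ols("Y ~ " + " + ".join(predictors), data=df).fit()
for p in predictors:
    others = [x for x in predictors if x != p]
    reduced = smf.ols("Y ~ " + " + ".join(others), data=df).fit()
    print(p, round(full.rsquared - reduced.rsquared, 4))
```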

An Example of Using Statistics to Identify the Most Important Variables in a Regression Model

The example output below shows a regression model that has three predictors. The text output is produced by the regular regression analysis in Minitab. I’ve standardized the continuous predictors using the Coding dialog so we can see the standardized coefficients, which are labeled as coded coefficients. You can find this analysis in the Minitab menu: Stat > Regression > Regression > Fit Regression Model.

The report with the graphs is produced by Multiple Regression in the Assistant menu. You can find this analysis in the Minitab menu: Assistant > Regression > Multiple Regression.

 Coded coefficient table

Minitab's Assistant menu output that displays the incremental impact of the variables

The standardized coefficients show that North has the standardized coefficient with the largest absolute value, followed by South and East. The Incremental Impact graph shows that North explains the greatest amount of the unique variance, followed by South and East. For our example, both statistics suggest that North is the most important variable in the regression model.

Caveats for Using Statistics to Identify Important Variables

Statistical measures can show the relative importance of the different predictor variables. However, these measures can't determine whether the variables are important in a practical sense. To determine practical importance, you'll need to use your subject area knowledge.

How you collect and measure your sample can bias the apparent importance of the variables in your sample compared to their true importance in the population.

If you randomly sample your observations, the variability of the predictor values in your sample likely reflects the variability in the population. In this case, the standardized coefficients and the change in R-squared values are likely to reflect their population values.

However, if you select a restricted range of predictor values for your sample, both statistics tend to underestimate the importance of that predictor. Conversely, if the sample variability for a predictor is greater than the variability in the population, the statistics tend to overestimate the importance of that predictor.

Also, consider the accuracy and precision of the measurements for your predictors because this can affect their apparent importance. For example, lower-quality measurements can cause a variable to appear less predictive than it truly is.

If your goal is to change the response mean, you should be confident that causal relationships exist between the predictors and the response, rather than just a correlation. If there is an observed correlation but no causation, intentional changes in the predictor values won’t necessarily produce the desired change in the response, regardless of the statistical measures of importance.

To determine that there is a causal relationship, you typically need to perform a designed experiment rather than an observational study.

Non-Statistical Considerations for Identifying Important Variables

How you define “most important” often depends on your goals and subject area. While statistics can help you identify the most important variables in a regression model, applying subject area expertise to all aspects of statistical analysis is crucial. Real world issues are likely to influence which variable you identify as the most important in a regression model.

For example, if your goal is to change predictor values in order to change the response, use your expertise to determine which variables are the most feasible to change. There may be variables that are harder, or more expensive, to change. Some variables may be impossible to change. Sometimes a large change in one variable may be more practical than a small change in another variable.

“Most important” is a subjective, context sensitive characteristic. You can use statistics to help identify candidates for the most important variable in a regression model, but you’ll likely need to use your subject area expertise as well.

If you're just learning about regression, read my regression tutorial!

Problems Using Data Mining to Build Regression Models


Data mining uses algorithms to explore correlations in data sets. An automated procedure sorts through large numbers of variables and includes them in the model based on statistical significance alone. No thought is given to whether the variables and the signs and magnitudes of their coefficients make theoretical sense.

We tend to think of data mining in the context of big data, with its huge databases and servers stuffed with information. However, it can also occur on the smaller scale of a research study.

The comment below is a real one that illustrates this point.

“Then, I moved to the Regression menu and there I could add all the terms I wanted and more. Just for fun, I added many terms and performed backward elimination. Surprisingly, some terms appeared significant and my R-squared Predicted shot up. To me, your concerns are all taken care of with R-squared Predicted. If the model can still predict without the data point, then that's good.”

Comments like this are common and emphasize the temptation to select regression models by trying as many different combinations of variables as possible and seeing which model produces the best-looking statistics. The overall gist of this type of comment is, "What could possibly be wrong with using data mining to build a regression model if the end results are that all the p-values are significant and the various types of R-squared values are all high?"

In this blog post, I’ll illustrate the problems associated with using data mining to build a regression model in the context of a smaller-scale analysis.

An Example of Using Data Mining to Build a Regression Model

My first order of business is to prove to you that data mining can have severe problems. I really want to bring the problems to life so you'll be leery of using this approach. Fortunately, this is simple to accomplish because I can use data mining to make it appear that a set of randomly generated predictor variables explains most of the changes in a randomly generated response variable!

To do this, I’ll create a worksheet in Minitab statistical software that has 100 columns, each of which contains 30 rows of entirely random data. In Minitab, you can use Calc > Random Data > Normal to create your own worksheet with random data, or you can use this worksheet that I created for the data mining example below. (If you don’t have Minitab and want to try this out, get the free 30 day trial!)

Next, I’ll perform stepwise regression using column 1 as the response variable and the other 99 columns as the potential predictor variables. This scenario produces a situation where stepwise regression is forced to dredge through 99 variables to see what sticks, which is a key characteristic of data mining.

When I perform stepwise regression, the procedure adds 28 variables that explain 100% of the variance! Because we only have 30 observations, we’re clearly overfitting the model. Overfitting the model is a different problem that also inflates R-squared, which you can read about in my post about the dangers of overfitting models.

I’m specifically addressing the problems of data mining in this post, so I don’t want a model that is also overfit. To avoid an overfit model, a good rule of thumb is to include no more than one term for each 10 observations. We have 30 observations, so I’ll include only the first three variables that the stepwise procedure adds to the model: C7, C77, and C95. The output for the first three steps is below.

Stepwise regression output

Under step 3, we can see that all of the coefficient p-values are statistically significant. The R-squared value of 67.54% can either be good or mediocre depending on your field of study. In a real study, there are likely to be some real effects mixed in that would boost the R-squared even higher. We can also look at the adjusted and predicted R-squared values and neither one suggests a problem.

If we look at the model building process of steps 1 - 3, we see that at each step all of the R-squared values increase. That’s what we like to see. For good measure, let’s graph the relationship between the predictor (C7) and the response (C1). After all, seeing is believing, right?

Scatterplot of two variables in regression model

This graph looks good too! It sure appears that as C7 increases, C1 tends to increase, which agrees with the positive regression coefficient in the output. If we didn’t know better, we’d think that we have a good model!

This example answers the question posed at the beginning: what could possibly be wrong with this approach? Data mining can produce deceptive results. The statistics and graph all look good but these results are based on entirely random data with absolutely no real effects. Our regression model suggests that random data explain other random data even though that's impossible. Everything looks great but we have a lousy model.
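You can reproduce this phenomenon without Minitab. The sketch below generates pure noise and lets a greedy forward-selection loop pick predictors by p-value; it approximates stepwise regression rather than matching Minitab's exact algorithm, but the spurious "significance" shows up just the same.

```python
# Sketch: forward selection on pure noise. With 99 candidate predictors and only
# 30 rows, some columns will look "significant" purely by chance.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 99))     # 99 random candidate predictors
y = rng.normal(size=30)           # random response: no real relationships exist

selected = []
for _ in range(3):                # keep at most 3 terms (about 1 per 10 observations)
    pvals = {}
    for j in range(99):
        if j in selected:
            continue
        fit = sm.OLS(y, sm.add_constant(X[:, selected + [j]])).fit()
        pvals[j] = fit.pvalues[-1]          # p-value of the newly added column
    best = min(pvals, key=pvals.get)
    if pvals[best] > 0.15:                  # a common alpha-to-enter threshold
        break
    selected.append(best)

final = sm.OLS(y, sm.add_constant(X[:, selected])).fit()
print(selected, round(final.rsquared, 3), final.pvalues.round(3))
```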

The problems associated with using data mining are real, but how the heck do they happen? And, how do you avoid them? Read my next post to learn the answers to these questions!

How to Save a Failing Regression with PLS


Face it, you love regression analysis as much as I do. Regression is one of the most satisfying analyses in Minitab: get some predictors that should have a relationship to a response, go through a model selection process, interpret fit statistics like adjusted R2 and predicted R2, and make predictions. Yes, regression really is quite wonderful.

Except when it’s not. Dark, seedy corners of the data world exist, lying in wait to make regression confusing or impossible. Good old ordinary least squares regression, to be specific.

For instance, sometimes you have a lot of detail in your data, but not a lot of data. Want to see what I mean?

  1. In Minitab, choose Help > Sample Data...
  2. Open Soybean.mtw.

The data set has 88 variables about soybeans, the results of near-infrared (NIR) spectroscopy at different wavelengths. But it contains only 60 measurements, and the data are arranged to reserve 6 of those measurements for validation runs.

A Limit on Coefficients

With ordinary least squares regression, you can estimate at most as many coefficients as the data have samples. Thus, the traditional method that’s satisfactory in most cases would only let you estimate 53 coefficients for variables plus a constant coefficient.

This could leave you wondering about whether any of the other possible terms might have information that you need.

Multicollinearity

The NIR measurements are also highly collinear with each other. This multicollinearity complicates using statistical significance to choose among the variables to include in the model.

When the data have more variables than samples, especially when the predictor variables are highly collinear, it’s a good time to consider partial least squares regression.

How to Perform Partial Least Squares Regression

Try these steps if you want to follow along in Minitab Statistical Software using the soybean data:

  1. Choose Stat > Regression > Partial Least Squares.
  2. In Responses, enter Fat.
  3. In Model, enter ‘1’-‘88’.
  4. Click Options.
  5. Under Cross-Validation, select Leave-one-out. Click OK.
  6. Click Results.
  7. Check Coefficients. Click OK twice.

One of the great things about partial least squares regression is that it forms components and then does ordinary least squares regression with them. Thus the results include statistics that are familiar. For example, predicted R2 is the criterion that Minitab uses to choose the number of components.


Minitab selects the model with the highest predicted R-squared.

Each of the 9 components in the model that maximizes the predicted R2 value is a complex linear combination of all 88 of the variables. So although the ANOVA table shows that you’re using only 9 degrees of freedom for the regression, the analysis uses information from all of the data.

The regression uses 9 degrees of freedom.

 The full list of standardized coefficients shows the relative importance of each predictor in the model. (I’m only showing a portion here because the table is 88 rows long.)


Each variable has a standardized coefficient.
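For readers working outside Minitab, scikit-learn's PLSRegression plus leave-one-out cross-validation follows the same logic: pick the number of components that best predicts held-out rows. The file and column layout below are hypothetical stand-ins for the soybean worksheet.

```python
# Sketch: choose the number of PLS components by leave-one-out cross-validation.
# "soybean.csv" with 88 spectral columns and a Fat column is a hypothetical stand-in.
import pandas as pd
from sklearn.cross_decomposition import PLSRegression
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.metrics import r2_score

data = pd.read_csv("soybean.csv")
X, y = data.drop(columns="Fat"), data["Fat"]

def loo_r2(k):
    pred = cross_val_predict(PLSRegression(n_components=k), X, y, cv=LeaveOneOut())
    return r2_score(y, pred)           # analogous to predicted R-squared

best = max(range(1, 16), key=loo_r2)
pls = PLSRegression(n_components=best).fit(X, y)
print(best, pls.coef_.ravel()[:5])     # chosen component count and a few coefficients
```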

Ordinary least squares regression is a great tool that’s allowed people to make lots of good decisions over the years. But there are times when it’s not satisfying. Got too much detail in your data? Partial least squares regression could be the answer.

Want more partial least squares regression now? Check out how Unifi used partial least squares to improve their processes faster.

The image of the soybeans is by Tammy Green and is licensed for reuse under this Creative Commons License.

Problems Using Data Mining to Build Regression Models, Part Two


Data mining can be helpful in the exploratory phase of an analysis. If you're in the early stages and you're just figuring out which predictors are potentially correlated with your response variable, data mining can help you identify candidates. However, there are problems associated with using data mining to select variables.

In my previous post, we used data mining to settle on the following model and graphed one of the relationships between the response (C1) and a predictor (C7). It all looks great! The only problem is that all of these data are randomly generated! No true relationships are present. 

Regression output for data mining example

Scatter plot for data mining example

If you didn't already know there was no true relationship between these variables, these results could lead you to a very inaccurate conclusion.

Let's explore how these problems happen, and how to avoid them.

Why Do These Problems Occur with Data Mining?

The problem with data mining is that you fit many different models, trying lots of different variables, and you pick your final model based mainly on statistical significance, rather than being guided by theory.

What's wrong with that approach? The problem is that every statistical test you perform has a chance of a false positive. A false positive in this context means that the p-value is statistically significant but there really is no relationship between the variables at the population level. If you set the significance level at 0.05, you can expect that in 5% of the cases where the null hypothesis is true, you'll have a false positive.

Because of this false positive rate, if you analyze many different models with many different variables you will inevitably find false positives. And if you're guided mainly by statistical significance, you'll leave the false positives in your model. If you keep going with this approach, you'll fill your model with these false positives. That’s exactly what happened in our example. We had 100 candidate predictor variables and the stepwise procedure literally dredged through hundreds and hundreds of potential models to arrive at our final model.
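A quick calculation shows why this is almost inevitable. Treating the 99 single-predictor tests as roughly independent (an approximation), the chance that at least one purely random column clears a 0.05 threshold in even a single pass is about

\[ 1 - (1 - 0.05)^{99} \approx 0.994, \]

and the expected number of false positives is about 99 × 0.05 ≈ 5, before accounting for the hundreds of candidate models the stepwise search actually evaluates.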

As we’ve seen, data mining problems can be hard to detect. The numeric results and graph all look great. However, these results don’t represent true relationships but instead are chance correlations that are bound to occur with enough opportunities.

If I had to name my favorite R-squared, it would be predicted R-squared, without a doubt. However, even predicted R-squared can't detect all problems. Ultimately, even though the predicted R-squared is moderate for our model, the ability of this model to predict accurately for an entirely new data set is practically zero.

Theory, the Alternative to Data Mining

Data mining can have a role in the exploratory stages of an analysis. However, for all variables that you identify through data mining, you should perform a confirmation study using newly collected data to verify the relationships in the new sample. Failure to do so can be very costly. Just imagine if we had made decisions based on the model above!

An alternative to data mining is to use theory as a guide in terms of both the models you fit and the evaluation of your results. Look at what others have done and incorporate those findings when building your model. Before beginning the regression analysis, develop an idea of what the important variables are, along with their expected relationships, coefficient signs, and effect magnitudes.

Building on the results of others makes it easier both to collect the correct data and to specify the best regression model without the need for data mining. The difference is the process by which you fit and evaluate the models. When you’re guided by theory, you reduce the number of models you fit and you assess properties beyond just statistical significance.

Theoretical considerations should not be discarded based solely on statistical measures.

  • Compare the coefficient signs to theory. If any of the signs contradict theory, investigate and either change your model or explain the inconsistency.
  • Use Minitab statistical software to create factorial plots based on your model to see if all the effects match theory.
  • Compare the R-squared for your study to those of similar studies. If your R-squared is very different than those in similar studies, it's a sign that your model may have a problem.

If you’re interested in learning more about these issues, read my post about how using too many phantom degrees of freedom is related to data mining problems.

 

Simulating the U.S. Presidential Election of 2016


Regardless of who you support in the upcoming U.S. election, we can all agree that it’s been a very bumpy ride! It’s been a particularly chaotic election cycle. Wouldn’t it be nice if we could peek into the future and see potential election results right now? That’s what we'll do in this post!

In 2012, I used binary logistic regression to predict that President Obama would be reelected for a second term. That model requires that an incumbent is running for reelection. With no incumbent this time, I’ll need another approach. I’ve decided to use a Monte Carlo simulation.

By simulating the election 100,000 times, we can examine the distribution of outcomes to determine probabilities for the election winner and to determine which states are the most important to win.

Using Monte Carlo Simulation for the Election

Monte Carlo simulations use a mathematical model to create simulated data for a system or a process in order to evaluate outcomes. I’ll simulate the upcoming election 100,000 times so we can determine which outcomes are more common or rare.

Imagine if we flip 50 coins. Basic probability tells us we should expect 25 heads and 25 tails, but while that is the most likely outcome, it happens only 11% of the time. There is a distribution of other outcomes around the most likely outcome.
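That 11% figure comes straight from the binomial distribution: the probability of getting exactly 25 heads in 50 fair flips is

\[ P(X = 25) = \binom{50}{25}\left(\tfrac{1}{2}\right)^{50} \approx 0.112. \]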

The Monte Carlo simulation essentially treats the election as if we were flipping 51 coins (the states plus the District of Columbia). However, we’re using funny coins. For one thing, they have Donald Trump on one side and Hillary Clinton on the other! Also, these coins don’t necessarily have a 50/50 probability, and the probability changes over time. Currently, the Texas coin has a 93% chance of showing Trump, while the Wisconsin coin has an 80% chance of showing Clinton. The Florida coin, which is very important in our simulation, happens to be very balanced: it has a 51.1% chance of showing Clinton and a 48.9% chance of showing Trump.

The U.S. Presidential election awards electoral votes to the winner of each state and the District of Columbia. The winner of a state gets all of the electoral votes for that state, which varies by population. When a candidate obtains 270 or more electoral votes, he or she wins the election.

I’ll have each state and Washington, D.C., flip their coin 100,000 times using the probabilities that Nate Silver calculated on November 2, 2016. The transfer equation for this simulation awards the electoral votes to the winner of each state.
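A stripped-down version of this simulation takes only a few lines of Python. The probabilities and electoral-vote counts below are illustrative placeholders (only a few states are shown), not Nate Silver's actual November 2 numbers; the real simulation needs all 51 jurisdictions.

```python
# Sketch of the election Monte Carlo: each state is a weighted coin flip,
# and the winner of a state takes all of its electoral votes.
import numpy as np

rng = np.random.default_rng(2016)
p_clinton = np.array([0.511, 0.80, 0.07])    # placeholder win probabilities (extend to all 51)
votes = np.array([29, 10, 38])               # electoral votes for those same states

n_sims = 100_000
wins = rng.random((n_sims, p_clinton.size)) < p_clinton   # True where Clinton wins a state
clinton_ev = wins.astype(int) @ votes                     # Clinton's electoral votes per run

print(np.percentile(clinton_ev, [2.5, 50, 97.5]))   # spread of simulated outcomes
print((clinton_ev >= 270).mean())                   # share of runs where Clinton reaches 270
```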

Simulation Results for the Presidential Election

Distribution of simulated electoral votes for Hillary Clinton

The simulation results show that Hillary Clinton currently has the advantage. Over the 100,000 simulated elections, Clinton’s electoral votes range from 149 to 412, with the most likely value of 301. In 95% of the simulated results, Clinton’s electoral votes fall within the range of 247 to 355. Clinton obtains at least 270 electoral votes in 87% of the simulated elections.

While the simulation gives Clinton an overall 87% chance of winning, the probabilities change as candidates win specific states. For example, Florida is a crucial state in this election because it has the largest single state impact on a candidate’s probability of winning the election.

Probability of winning the election based on the winner of Florida

The pie chart shows the probabilities of winning based on the winner of Florida. If Trump doesn’t win Florida, he is essentially out of the race. In simulated elections where Clinton wins Florida, Trump wins the election only 2.5% of the time.

Using Binary Logistic Regression to Dig Deeper into the Simulation

We can also use binary logistic regression to probe our simulated results. Binary logistic regression produces odds ratios that help us identify the states which have the greatest impact on a candidate's probability of winning the election.

Here, an odds ratio represents the odds of winning the election if a candidate wins a given state divided by the odds of winning the election if a candidate loses that state. The larger the odds ratio, the more important the state is to win. Among the battleground states, there is quite a large range of odds ratios—from Florida at 137.3 to Iowa at 2.7. The list includes the top 10 battleground states. 

State              Odds ratio
Florida                 137.3
Pennsylvania             29.7
Ohio                     23.3
Georgia                  15.8
Michigan                 15.3
North Carolina           13.4
Virginia                  9.2
Arizona                   6.8
Wisconsin                 5.5
Colorado                  4.6

The list is pretty cool because it quantifies the importance of each state, and the top states match those you hear about most frequently in the news media.
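The odds ratios come from regressing the simulated election outcome on the simulated state outcomes. Here is a hedged sketch that continues the simulation code above; it requires the full 51-state arrays so that both election outcomes actually occur.

```python
# Sketch: binary logistic regression of "Clinton reaches 270" on the 0/1 state
# outcomes across the simulated elections; exponentiated coefficients give
# odds ratios analogous to those in the table above.
import numpy as np
import statsmodels.api as sm

y = (clinton_ev >= 270).astype(int)      # 1 if Clinton wins the simulated election
X = sm.add_constant(wins.astype(int))    # one 0/1 column per state

logit = sm.Logit(y, X).fit(disp=0)
print(np.round(np.exp(logit.params[1:]), 1))   # odds ratio for each state
```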

What to Watch for on Election Night

This simulation indicates that Hillary Clinton is favored to win the election. Consequently, I’m going to focus on what it will take for Donald Trump to win. The five most important states can indicate the direction that the entire election is headed. As an added benefit, these states are mostly in the Eastern time zone, so you can use them to gain an earlier idea of who will ultimately win and how close the election is likely to be.

Here’s how to read the table below. I start out with the assumption that Trump wins Florida because otherwise he has only a 2.5% chance of winning. For each subsequent row in the table, I add in the next state from the top 5 in which he has the greatest probability of winning and indicate both the chance of winning that state and the election. For example, the second row shows that Trump has an 83.9% chance of winning Georgia and, if he wins both Florida and Georgia, he has a 26.9% chance of winning the election.

Each additional row after Georgia represents a state that is harder for Trump to win. Trump has to win at least four of these states to have a greater than 50% chance of winning the election.

Trump States                      Chance of Trump Winning    Most Likely Electoral Votes
Florida (48.9%)                   23.9%                      285 Clinton
FL + GA (83.9%)                   26.9%                      283 Clinton
FL + GA + OH (61.2%)              37.2%                      276 Clinton
FL + GA + OH + PA (22%)           70.5%                      278 Trump
FL + GA + OH + PA + MI (21.2%)    91.9%                      291 Trump

The table gets tough for Trump starting in the fourth row, where he needs to win Pennsylvania. However, if he wins Florida, Georgia, and Ohio—which is not an extremely unlikely combination—he'll have a 37% chance of winning the election. In this specific scenario, the electoral vote is likely to be closer than many might expect because Clinton's most likely number of electoral votes is 276. Of course, there is a margin of error around this expected value, which is why Trump has a chance to win.

In short, right now it is difficult for Trump to win, but it is entirely possible that the election will be a squeaker! Watching these key states will give you a forecast of where the race is headed.

There are a few caveats for these results. The probabilities for winning the election are based on simulated results. The underlying state probabilities are based on the status of the race on November 2 and these can change by Election Day. Additionally, early voting has already commenced in a number of states in which the state probabilities were different than they are now.

Despite these caveats, this Monte Carlo simulation shows the overall state of the race and which states are most important for a candidate’s chances of winning.


Five Reasons Why Your R-squared Can Be Too High


I’ve written about R-squared before and I’ve concluded that it’s not as intuitive as it seems at first glance. It can be a misleading statistic because a high R-squared is not always good and a low R-squared is not always bad. I’ve even said that R-squared is overrated and that the standard error of the estimate (S) can be more useful.

Even though I haven’t always been enthusiastic about R-squared, that’s not to say it isn’t useful at all. For instance, if you perform a study and notice that similar studies generally obtain a notably higher or lower R-squared, you should investigate why yours is different because there might be a problem.

In this blog post, I look at five reasons why your R-squared can be too high. This isn’t a comprehensive list, but it covers some of the more common reasons.

Is A High R-squared Value a Problem?

A very high R-squared value is not necessarily a problem. Some processes can have R-squared values that are in the high 90s. These are often physical processes where you can obtain precise measurements and there's low process noise.

You'll have to use your subject area knowledge to determine whether a high R-squared is problematic. Are you modeling something that is inherently predictable? Or, not so much? If you're measuring a physical process, an R-squared of 0.9 might not be surprising. However, if you're predicting human behavior, that's way too high!

Compare your study to similar studies to determine whether your R-squared is in the right ballpark. If your R-squared is too high, consider the following possibilities. To determine whether any apply to your model specifically, you'll have to use your subject area knowledge, information about how you fit the model, and data specific details.

Reason 1: R-squared is a biased estimate

The R-squared in your regression output is a biased estimate based on your sample—it tends to be too high. This bias is a reason why some practitioners don’t use R-squared at all but use adjusted R-squared instead.

R-squared is like a broken bathroom scale that tends to read too high. No one wants that! Researchers have long recognized that regression’s optimization process takes advantage of chance correlations in the sample data and inflates the R-squared.

Adjusted R-squared does what you’d do with that broken bathroom scale. If you knew the scale was consistently too high, you’d reduce it by an appropriate amount to produce a weight that is correct on average.

Adjusted R-squared does this by comparing the sample size to the number of terms in your regression model. Regression models that have many samples per term produce a better R-squared estimate and require less shrinkage. Conversely, models that have few samples per term require more shrinkage to correct the bias.
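A common form of the adjustment, with n observations and k model terms (not counting the constant), is

\[ R^2_{\text{adj}} = 1 - (1 - R^2)\,\frac{n - 1}{n - k - 1}, \]

so the correction grows as the number of terms approaches the number of observations.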

For more information, read my posts about Adjusted R-squared and R-squared shrinkage.

Reason 2: You might be overfitting your model

An overfit model is one that is too complicated for your data set. You’ve included too many terms in your model compared to the number of observations. When this happens, the regression model becomes tailored to fit the quirks and random noise in your specific sample rather than reflecting the overall population. If you drew another sample, it would have its own quirks, and your original overfit model would not likely fit the new data.

Adjusted R-squared doesn't always catch this, but predicted R-squared often does. Read my post about the dangers of overfitting your model.

Reason 3: Data mining and chance correlations

If you fit many models, you will find variables that appear to be significant but they are correlated only by chance. While your final model might not be too complex for the number of observations (Reason 2), problems occur when you fit many different models to arrive at the final model. Data mining can produce high R-squared values even with entirely random data!

Before performing regression analysis, you should already have an idea of what the important variables are along with their relationships, coefficient signs, and effect magnitudes based on previous research. Unfortunately, recent trends have moved away from this approach thanks to large, readily available databases and automated procedures that build regression models.

For more information, read my post about using too many phantom degrees of freedom.

Reason 4: Trends in Panel (Time Series) Data

If you have time series data and your response variable and a predictor variable both have significant trends over time, this can produce very high R-squared values. You might try a time series analysis, or include time-related variables in your regression model, such as lagged and/or differenced variables. Conveniently, these analyses and functions are all available in Minitab statistical software.

Reason 5: Form of a Variable

It's possible that you're including different forms of the same variable for both the response variable and a predictor variable. For example, if the response variable is temperature in Celsius and you include a predictor variable of temperature in some other scale, you'd get an R-squared of nearly 100%! That's an obvious example, but the same thing can happen more subtly.
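Here is a tiny sketch of that Celsius/Fahrenheit situation; the numbers are made up, but the near-perfect R-squared is the point.

```python
# Sketch: regressing one form of a variable on another form of itself
# pushes R-squared toward 100%, even with some measurement noise.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
celsius = rng.uniform(0, 40, 50)
fahrenheit = celsius * 9 / 5 + 32 + rng.normal(0, 0.5, 50)   # same quantity, different scale

fit = sm.OLS(celsius, sm.add_constant(fahrenheit)).fit()
print(fit.rsquared)   # roughly 0.999
```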

For more information about regression models, read my post about How to Choose the Best Regression Model.

How to Identify the Most Important Predictor Variables in Regression Models

$
0
0

Most important variable You’ve performed multiple linear regression and have settled on a model which contains several predictor variables that are statistically significant. At this point, it’s common to ask, “Which variable is most important?”

This question is more complicated than it first appears. For one thing, how you define “most important” often depends on your subject area and goals. For another, how you collect and measure your sample data can influence the apparent importance of each variable.

With these issues in mind, I’ll help you answer this question. I’ll start by showing you statistics that don’t answer the question about importance, which may surprise you. Then, I’ll move on to both statistical and non-statistical methods for determining which variables are the most important in regression models.

Don’t Compare Regular Regression Coefficients to Determine Variable Importance

Regular regression coefficients describe the relationship between each predictor variable and the response. The coefficient value represents the mean change in the response given a one-unit increase in the predictor. Consequently, it’s easy to think that variables with larger coefficients are more important because they represent a larger change in the response.

However, the units vary between the different types of variables, which makes it impossible to compare them directly. For example, the meaning of a one-unit change is very different if you’re talking about temperature, weight, or chemical concentration.

This problem is further complicated by the fact that there are different units within each type of measurement. For example, weight can be measured in grams and kilograms. If you fit models for the same data set using grams in one model and kilograms in another, the coefficient for weight changes by a factor of a thousand even though the underlying fit of the model remains unchanged. The coefficient value changes greatly while the importance of the variable remains constant.

Takeaway: Larger coefficients don’t necessarily identify more important predictor variables.

Don’t Compare P-values to Determine Variable Importance

The coefficient value doesn’t indicate the importance a variable, but what about the variable’s p-value? After all, we look for low p-values to help determine whether the variable should be included in the model in the first place.

P-value calculations incorporate a variety of properties, but a measure of importance is not among them. A very low p-value can reflect properties other than importance, such as a very precise estimate and a large sample size.

Effects that are trivial in the real world can have very low p-values. A statistically significant result may not be practically significant.

Takeaway: Low p-values don’t necessarily identify predictor variables that are practically important.

Do Compare These Statistics To Help Determine Variable Importance

We ruled out a couple of the more obvious statistics that can’t assess the importance of variables. Fortunately, there are several statistics that can help us determine which predictor variables are most important in regression models. These statistics might not agree because the manner in which each one defines "most important" is a bit different.

Standardized regression coefficients

I explained how regular regression coefficients use different scales, so you can’t compare them directly. However, if you standardize the regression coefficients so they’re based on the same scale, you can compare them.

To obtain standardized coefficients, standardize the values for all of your continuous predictors. In Minitab 17, you can do this easily by clicking the Coding button in the main Regression dialog. Under Standardize continuous predictors, choose Subtract the mean, then divide by the standard deviation.

After you fit the regression model using your standardized predictors, look at the coded coefficients, which are the standardized coefficients. This coding puts the different predictors on the same scale and allows you to compare their coefficients directly. Standardized coefficients represent the mean change in the response given a one standard deviation change in the predictor.

Takeaway: Look for the predictor variable with the largest absolute value for the standardized coefficient.
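Outside of Minitab, you can get the same kind of comparison by standardizing the predictors yourself before fitting. Here is a minimal sketch in Python with statsmodels; the file name and column names (Response, North, South, East) are placeholders for your own data:

```python
import pandas as pd
import statsmodels.api as sm

# Hypothetical data set; replace with your own worksheet export
df = pd.read_csv("my_data.csv")          # columns: Response, North, South, East
y = df["Response"]
X = df[["North", "South", "East"]]

# Subtract the mean and divide by the standard deviation for each predictor
X_std = (X - X.mean()) / X.std()

model = sm.OLS(y, sm.add_constant(X_std)).fit()

# The coefficients are now on a common scale: mean change in the response
# per one standard deviation change in each predictor
print(model.params.drop("const").abs().sort_values(ascending=False))
```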

Change in R-squared when the variable is added to the model last

Multiple regression in Minitab's Assistant menu includes a neat analysis. It calculates the increase in R-squared that each variable produces when it is added to a model that already contains all of the other variables.

Because the change in R-squared analysis treats each variable as the last one entered into the model, the change represents the percentage of the variance a variable explains that the other variables in the model cannot explain. In other words, this change in R-squared represents the amount of unique variance that each variable explains above and beyond the other variables in the model.

Takeaway: Look for the predictor variable that is associated with the greatest increase in R-squared.
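The same "added last" calculation is easy to reproduce by hand: fit the full model, then refit it with each predictor removed in turn and record how much R-squared drops. A rough sketch, again using hypothetical column names:

```python
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("my_data.csv")              # hypothetical worksheet export
predictors = ["North", "South", "East"]
y = df["Response"]

def r2(cols):
    X = sm.add_constant(df[cols])
    return sm.OLS(y, X).fit().rsquared

full_r2 = r2(predictors)
for p in predictors:
    reduced = [c for c in predictors if c != p]
    # Increase in R-squared when p is added to a model containing everything else
    print(f"{p}: {full_r2 - r2(reduced):.4f}")
```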

An Example of Using Statistics to Identify the Most Important Variables in a Regression Model

The example output below shows a regression model that has three predictors. The text output is produced by the regular regression analysis in Minitab. I’ve standardized the continuous predictors using the Coding dialog so we can see the standardized coefficients, which are labeled as coded coefficients. You can find this analysis in the Minitab menu: Stat > Regression > Regression > Fit Regression Model.

The report with the graphs is produced by Multiple Regression in the Assistant menu. You can find this analysis in the Minitab menu: Assistant > Regression > Multiple Regression.

 Coded coefficient table

Minitab's Assistant menu output that displays the incremental impact of the variables

The standardized coefficients show that North has the standardized coefficient with the largest absolute value, followed by South and East. The Incremental Impact graph shows that North explains the greatest amount of the unique variance, followed by South and East. For our example, both statistics suggest that North is the most important variable in the regression model.

Caveats for Using Statistics to Identify Important Variables

Statistical measures can show the relative importance of the different predictor variables. However, these measures can't determine whether the variables are important in a practical sense. To determine practical importance, you'll need to use your subject area knowledge.

How you collect and measure your sample can bias the apparent importance of the variables in your sample compared to their true importance in the population.

If you randomly sample your observations, the variability of the predictor values in your sample likely reflects the variability in the population. In this case, the standardized coefficients and the change in R-squared values are likely to reflect their population values.

However, if you select a restricted range of predictor values for your sample, both statistics tend to underestimate the importance of that predictor. Conversely, if the sample variability for a predictor is greater than the variability in the population, the statistics tend to overestimate the importance of that predictor.

Also, consider the accuracy and precision of the measurements for your predictors because this can affect their apparent importance. For example, lower-quality measurements can cause a variable to appear less predictive than it truly is.

If your goal is to change the response mean, you should be confident that causal relationships exist between the predictors and the response, rather than just a correlation. If there is an observed correlation but no causation, intentional changes in the predictor values won’t necessarily produce the desired change in the response, regardless of the statistical measures of importance.

To determine that there is a causal relationship, you typically need to perform a designed experiment rather than an observational study.

Non-Statistical Considerations for Identifying Important Variables

How you define “most important” often depends on your goals and subject area. While statistics can help you identify the most important variables in a regression model, applying subject area expertise to all aspects of statistical analysis is crucial. Real world issues are likely to influence which variable you identify as the most important in a regression model.

For example, if your goal is to change predictor values in order to change the response, use your expertise to determine which variables are the most feasible to change. There may be variables that are harder, or more expensive, to change. Some variables may be impossible to change. Sometimes a large change in one variable may be more practical than a small change in another variable.

“Most important” is a subjective, context-sensitive characteristic. You can use statistics to help identify candidates for the most important variable in a regression model, but you’ll likely need to use your subject area expertise as well.

If you're just learning about regression, read my regression tutorial!

Problems Using Data Mining to Build Regression Models


Data mining uses algorithms to explore correlations in data sets. An automated procedure sorts through large numbers of variables and includes them in the model based on statistical significance alone. No thought is given to whether the variables and the signs and magnitudes of their coefficients make theoretical sense.

We tend to think of data mining in the context of big data, with its huge databases and servers stuffed with information. However, it can also occur on the smaller scale of a research study.

The comment below is a real one that illustrates this point.

“Then, I moved to the Regression menu and there I could add all the terms I wanted and more. Just for fun, I added many terms and performed backward elimination. Surprisingly, some terms appeared significant and my R-squared Predicted shot up. To me, your concerns are all taken care of with R-squared Predicted. If the model can still predict without the data point, then that's good.”

Comments like this are common and emphasize the temptation to select regression models by trying as many different combinations of variables as possible and seeing which model produces the best-looking statistics. The overall gist of this type of comment is, "What could possibly be wrong with using data mining to build a regression model if the end results are that all the p-values are significant and the various types of R-squared values are all high?"

In this blog post, I’ll illustrate the problems associated with using data mining to build a regression model in the context of a smaller-scale analysis.

An Example of Using Data Mining to Build a Regression Model

My first order of business is to prove to you that data mining can have severe problems. I really want to bring the problems to life so you'll be leery of using this approach. Fortunately, this is simple to accomplish because I can use data mining to make it appear that a set of randomly generated predictor variables explains most of the changes in a randomly generated response variable!

To do this, I’ll create a worksheet in Minitab statistical software that has 100 columns, each of which contains 30 rows of entirely random data. In Minitab, you can use Calc > Random Data > Normal to create your own worksheet with random data, or you can use this worksheet that I created for the data mining example below. (If you don’t have Minitab and want to try this out, get the free 30 day trial!)

Next, I’ll perform stepwise regression using column 1 as the response variable and the other 99 columns as the potential predictor variables. This scenario produces a situation where stepwise regression is forced to dredge through 99 variables to see what sticks, which is a key characteristic of data mining.
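You don't need Minitab to see how easily this happens. The sketch below is a rough stand-in for the stepwise procedure: it generates purely random data and greedily adds the three "best" predictors, and the p-values it reports will often look respectable even though no real relationships exist.

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(7)
data = pd.DataFrame(rng.normal(size=(30, 100)),
                    columns=[f"C{i}" for i in range(1, 101)])
y = data["C1"]
candidates = list(data.columns[1:])   # C2..C100 as potential predictors

selected = []
for _ in range(3):                    # simple forward selection, three steps
    best = max(candidates,
               key=lambda c: sm.OLS(y, sm.add_constant(data[selected + [c]])).fit().rsquared)
    selected.append(best)
    candidates.remove(best)

final = sm.OLS(y, sm.add_constant(data[selected])).fit()
print(final.summary())                # random predictors, yet "significant" terms
```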

When I perform stepwise regression, the procedure adds 28 variables that explain 100% of the variance! Because we only have 30 observations, we’re clearly overfitting the model. Overfitting the model is a different problem that also inflates R-squared, which you can read about in my post about the dangers of overfitting models.

I’m specifically addressing the problems of data mining in this post, so I don’t want a model that is also overfit. To avoid an overfit model, a good rule of thumb is to include no more than one term for each 10 observations. We have 30 observations, so I’ll include only the first three variables that the stepwise procedure adds to the model: C7, C77, and C95. The output for the first three steps is below.

Stepwise regression output

Under step 3, we can see that all of the coefficient p-values are statistically significant. The R-squared value of 67.54% can be either good or mediocre, depending on your field of study. In a real study, there are likely to be some real effects mixed in that would boost the R-squared even higher. We can also look at the adjusted and predicted R-squared values, and neither one suggests a problem.

If we look at the model building process of steps 1 - 3, we see that at each step all of the R-squared values increase. That’s what we like to see. For good measure, let’s graph the relationship between the predictor (C7) and the response (C1). After all, seeing is believing, right?

Scatterplot of two variables in regression model

This graph looks good too! It sure appears that as C7 increases, C1 tends to increase, which agrees with the positive regression coefficient in the output. If we didn’t know better, we’d think that we have a good model!

This example answers the question posed at the beginning: what could possibly be wrong with this approach? Data mining can produce deceptive results. The statistics and graph all look good but these results are based on entirely random data with absolutely no real effects. Our regression model suggests that random data explain other random data even though that's impossible. Everything looks great but we have a lousy model.

The problems associated with using data mining are real, but how the heck do they happen? And, how do you avoid them? Read my next post to learn the answers to these questions!

Problems Using Data Mining to Build Regression Models, Part Two


Data mining can be helpful in the exploratory phase of an analysis. If you're in the early stages and you're just figuring out which predictors are potentially correlated with your response variable, data mining can help you identify candidates. However, there are problems associated with using data mining to select variables.

In my previous post, we used data mining to settle on the following model and graphed one of the relationships between the response (C1) and a predictor (C7). It all looks great! The only problem is that all of these data are randomly generated! No true relationships are present. 

Regression output for data mining example

Scatter plot for data mining example

If you didn't already know there was no true relationship between these variables, these results could lead you to a very inaccurate conclusion.

Let's explore how these problems happen, and how to avoid them.

Why Do These Problems Occur with Data Mining?

The problem with data mining is that you fit many different models, trying lots of different variables, and you pick your final model based mainly on statistical significance, rather than being guided by theory.

What's wrong with that approach? The problem is that every statistical test you perform has a chance of a false positive. A false positive in this context means that the p-value is statistically significant but there really is no relationship between the variables at the population level. If you set the significance level at 0.05, you can expect a false positive in about 5% of the cases where the null hypothesis is true. With 99 unrelated candidate predictors, that works out to roughly 99 × 0.05 ≈ 5 spurious "significant" variables in a single pass, and a stepwise procedure examines far more models than that.

Because of this false positive rate, if you analyze many different models with many different variables you will inevitably find false positives. And if you're guided mainly by statistical significance, you'll leave the false positives in your model. If you keep going with this approach, you'll fill your model with these false positives. That’s exactly what happened in our example. We had 100 candidate predictor variables and the stepwise procedure literally dredged through hundreds and hundreds of potential models to arrive at our final model.

As we’ve seen, data mining problems can be hard to detect. The numeric results and graph all look great. However, these results don’t represent true relationships but instead are chance correlations that are bound to occur with enough opportunities.

If I had to name my favorite R-squared, it would be predicted R-squared, without a doubt. However, even predicted R-squared can't detect all problems. Ultimately, even though the predicted R-squared is moderate for our model, the ability of this model to predict accurately for an entirely new data set is practically zero.

Theory, the Alternative to Data Mining

Data mining can have a role in the exploratory stages of an analysis. However, for all variables that you identify through data mining, you should perform a confirmation study using newly collected data to verify the relationships in the new sample. Failure to do so can be very costly. Just imagine if we had made decisions based on the model above!

An alternative to data mining is to use theory as a guide in terms of both the models you fit and the evaluation of your results. Look at what others have done and incorporate those findings when building your model. Before beginning the regression analysis, develop an idea of what the important variables are, along with their expected relationships, coefficient signs, and effect magnitudes.

Building on the results of others makes it easier both to collect the correct data and to specify the best regression model without the need for data mining. The difference is the process by which you fit and evaluate the models. When you’re guided by theory, you reduce the number of models you fit and you assess properties beyond just statistical significance.

Theoretical considerations should not be discarded based solely on statistical measures.

  • Compare the coefficient signs to theory. If any of the signs contradict theory, investigate and either change your model or explain the inconsistency.
  • Use Minitab statistical software to create factorial plots based on your model to see if all the effects match theory.
  • Compare the R-squared for your study to those of similar studies. If your R-squared is very different than those in similar studies, it's a sign that your model may have a problem.

If you’re interested in learning more about these issues, read my post about how using too many phantom degrees of freedom is related to data mining problems.

 

Simulating the U.S. Presidential Election of 2016


Regardless of whom you support in the upcoming U.S. election, we can all agree that it’s been a very bumpy ride and a particularly chaotic election cycle. Wouldn’t it be nice if we could peek into the future and see potential election results right now? That’s what we'll do in this post!

In 2012, I used binary logistic regression to predict that President Obama would be reelected for a second term. That model requires that an incumbent is running for reelection. With no incumbent this time, I’ll need another approach. I’ve decided to use a Monte Carlo simulation.

By simulating the election 100,000 times, we can examine the distribution of outcomes to determine probabilities for the election winner and to determine which states are the most important to win.

Using Monte Carlo Simulation for the Election

Monte Carlo simulations use a mathematical model to create simulated data for a system or a process in order to evaluate outcomes. I’ll simulate the upcoming election 100,000 times so we can determine which outcomes are more common or rare.

Imagine if we flip 50 coins. Basic probability tells us we should expect 25 heads and 25 tails, but while that is the most likely outcome, it happens only 11% of the time. There is a distribution of other outcomes around the most likely outcome.

The Monte Carlo simulation essentially treats the election as if we were flipping 51 coins (the states plus the District of Columbia). However, we’re using funny coins. For one thing, they have Donald Trump on one side and Hillary Clinton on the other! Also, these coins don’t necessarily have a 50/50 probability, and the probability changes over time. Currently, the Texas coin has a 93% chance of showing Trump, while the Wisconsin coin has an 80% chance of showing Clinton. The Florida coin, which is very important in our simulation, happens to be very balanced: it has a 51.1% chance of showing Clinton and a 48.9% chance of showing Trump.

The U.S. Presidential election awards electoral votes to the winner of each state and the District of Columbia. The winner of a state gets all of that state’s electoral votes, and the number of electoral votes varies with the state’s population. When a candidate obtains 270 or more electoral votes, he or she wins the election.

I’ll have each state and Washington, D.C., flip their coin 100,000 times using the probabilities that Nate Silver calculated on November 2, 2016. The transfer equation for this simulation awards the electoral votes to the winner of each state.
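The mechanics of the simulation are straightforward. The sketch below shows the idea in Python for a small, illustrative subset of states; a full run would use all 51 win probabilities and electoral-vote counts, and the only probabilities shown here are the three quoted above.

```python
import numpy as np

rng = np.random.default_rng(2016)

# Illustrative subset of the 51 "coins": (probability Clinton wins, electoral votes).
# A full run would list all 50 states plus the District of Columbia.
states = {
    "Texas":     (0.07, 38),
    "Wisconsin": (0.80, 10),
    "Florida":   (0.511, 29),
}

n_sims = 100_000
clinton_ev = np.zeros(n_sims, dtype=int)
for prob, votes in states.values():
    wins = rng.random(n_sims) < prob          # flip this state's coin 100,000 times
    clinton_ev += wins * votes                # winner takes all of the state's votes

# Distribution of Clinton's electoral votes across the simulated elections
print(np.percentile(clinton_ev, [2.5, 50, 97.5]))

# With the full 51-entry map, the probability of a Clinton win would be:
# np.mean(clinton_ev >= 270)
```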

Simulation Results for the Presidential Election

Distribution of simulated electoral votes for Hillary Clinton

The simulation results show that Hillary Clinton currently has the advantage. Over the 100,000 simulated elections, Clinton’s electoral votes range from 149 to 412, with the most likely value of 301. In 95% of the simulated results, Clinton’s electoral votes fall within the range of 247 to 355. Clinton obtains at least 270 electoral votes in 87% of the simulated elections.

While the simulation gives Clinton an overall 87% chance of winning, the probabilities change as candidates win specific states. For example, Florida is a crucial state in this election because it has the largest single state impact on a candidate’s probability of winning the election.

Probability of winning the election based on the winner of Florida

The pie chart shows the probabilities of winning based on the winner of Florida. If Trump doesn’t win Florida, he is essentially out of the race. In simulated elections where Clinton wins Florida, Trump wins the election only 2.5% of the time.

Using Binary Logistic Regression to Dig Deeper into the Simulation

We can also use binary logistic regression to probe our simulated results. Binary logistic regression produces odds ratios that help us identify the states which have the greatest impact on a candidate's probability of winning the election.

Here, an odds ratio represents the odds of winning the election if a candidate wins a given state divided by the odds of winning the election if a candidate loses that state. The larger the odds ratio, the more important the state is to win. Among the battleground states, there is quite a large range of odds ratios—from Florida at 137.3 to Iowa at 2.7. The list below includes the top 10 battleground states.

State             Odds Ratio
Florida               137.3
Pennsylvania           29.7
Ohio                   23.3
Georgia                15.8
Michigan               15.3
North Carolina         13.4
Virginia                9.2
Arizona                 6.8
Wisconsin               5.5
Colorado                4.6

The list is pretty cool because it quantifies the importance of each state, and the top states match those you hear about most frequently in the news media.
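If you store the simulation output as one row per simulated election, with a 0/1 column for the overall winner and a 0/1 column for each state, the odds ratios fall out of a binary logistic regression. A minimal sketch, assuming a hypothetical file laid out that way:

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm

# Hypothetical layout: 'clinton_wins' plus one 0/1 column per battleground state
sims = pd.read_csv("simulated_elections.csv")
y = sims["clinton_wins"]
X = sm.add_constant(sims[["Florida", "Pennsylvania", "Ohio", "Georgia", "Michigan"]])

fit = sm.Logit(y, X).fit(disp=False)

# Exponentiating a coefficient gives the odds ratio for winning the election
# when the candidate wins that state versus losing it
odds_ratios = np.exp(fit.params.drop("const"))
print(odds_ratios.sort_values(ascending=False))
```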

What to Watch for on Election Night

This simulation indicates that Hillary Clinton is favored to win the election. Consequently, I’m going to focus on what it will take for Donald Trump to win. The five most important states can indicate the direction that the entire election is headed. As an added benefit, these states are mostly in the Eastern time zone, so you can use them to gain an earlier idea of who will ultimately win and how close the election is likely to be.

Here’s how to read the table below. I start out with the assumption that Trump wins Florida because otherwise he has only a 2.5% chance of winning. For each subsequent row in the table, I add in the next state from the top 5 in which he has the greatest probability of winning and indicate both the chance of winning that state and the election. For example, the second row shows that Trump has an 83.9% chance of winning Georgia and, if he wins both Florida and Georgia, he has a 26.9% chance of winning the election.

Each additional row after Georgia represents a state that is harder for Trump to win. Trump has to win at least four of these states to have a greater than 50% chance of winning the election.

Trump States                        Chance of Trump Winning    Most Likely Electoral Votes
Florida (48.9%)                     23.9%                      285 Clinton
FL + GA (83.9%)                     26.9%                      283 Clinton
FL + GA + OH (61.2%)                37.2%                      276 Clinton
FL + GA + OH + PA (22%)             70.5%                      278 Trump
FL + GA + OH + PA + MI (21.2%)      91.9%                      291 Trump

The table gets tough for Trump starting in the fourth row, where he needs to win Pennsylvania. However, if he wins Florida, Georgia, and Ohio—which is not an extremely unlikely combination—he'll have a 37% chance of winning the election. In this specific scenario, the electoral vote is likely to be closer than many might expect because Clinton's most likely number of electoral votes is 276. Of course, there is a margin of error around this expected value, which is why Trump has a chance to win.

In short, right now it is difficult for Trump to win, but it is entirely possible that the election will be a squeaker! Watching these key states will give you a forecast of where the race is headed.

There are a few caveats for these results. The probabilities for winning the election are based on simulated results. The underlying state probabilities are based on the status of the race on November 2, and these can change by Election Day. Additionally, early voting has already commenced in a number of states, at a time when the state probabilities were different than they are now.

Despite these caveats, this Monte Carlo simulation shows the overall state of the race and which states are most important for a candidate’s chances of winning.

So Why Is It Called "Regression," Anyway?


Did you ever wonder why statistical analyses and concepts often have such weird, cryptic names?

One conspiracy theory points to the workings of a secret committee called the ICSSNN. The International Committee for Sadistic Statistical Nomenclature and Numerophobia was formed solely to befuddle and subjugate the masses. Its mission: To select the most awkward, obscure, and confusing name possible for each statistical concept.

A whistle-blower recently released the following transcript of a secretly recorded ICSSNN meeting:

"This statistical analysis seems pretty straightforward…"

“What does it do?”

“It describes the relationship between one or more 'input' variables and an 'output' variable. It gives you an equation to predict values for the 'output' variable, by plugging in values for the input variables."

“Oh dear. That sounds disturbingly transparent.”

“Yes. We need to fix that—call it something grey and nebulous. What do you think of 'regression'?”

“What’s 'regressive' about it?”

“Nothing at all. That’s the point!”

Re-gres-sion. It does sound intimidating. I’d be afraid to try that alone.”

“Are you sure it’s completely unrelated to anything?  Sounds a lot like 'digression.' Maybe it’s what happens when you add up umpteen sums of squares…you forget what you were talking about.”

“Maybe it makes you regress and relive your traumatic memories of high school math…until you  revert to a fetal position?”

“No, no. It’s not connected with anything concrete at all.”

“Then it’s perfect!”

 “I don’t know...it only has 3 syllables. I’d feel better if it were at least 7 syllables and hyphenated.”

“I agree. Phonetically, it’s too easy…people are even likely to pronounce it correctly. Could we add an uvular fricative, or an interdental retroflex followed by a sustained turbulent trill?”

The Real Story: How Regression Got Its Name

Conspiracy theories aside, the term “regression” in statistics was probably not a result of the workings of the ICSSNN. Instead, the term is usually attributed to Sir Francis Galton.

Galton was a 19th century English Victorian who wore many hats: explorer, inventor, meteorologist, anthropologist, and—most important for the field of statistics—an inveterate measurement nut. You might call him a statistician’s statistician. Galton just couldn’t stop measuring anything and everything around him.

During a meeting of the Royal Geographical Society, Galton devised a way to roughly quantify boredom: he counted the number of fidgets of the audience in relation to the number of breaths he took (he didn’t want to attract attention using a timepiece). Galton then converted the results on a time scale to obtain a mean rate of 1 fidget per minute per person. Decreases or increases in the rate could then be used to gauge audience interest levels. (That mean fidget rate was calculated in 1885. I’d guess the mean fidget rate is astronomically higher today—especially if glancing at an electronic device counts as a fidget.)

Galton also noted the importance of considering sampling bias in his fidget experiment:

“These observations should be confined to persons of middle age. Children are rarely still, while elderly philosophers will sometimes remain rigid for minutes.”

But I regress…

Galton was also keenly interested in heredity. In one experiment, he collected data on the heights of 205 sets of parents with adult children. To make male and female heights directly comparable, he rescaled the female heights, multiplying them by a factor of 1.08. Then he calculated the average of the two parents' heights (which he called the “mid-parent height”) and divided them into groups based on the range of their heights. The results are shown below, replicated on a Minitab graph.

For each group of parents, Galton then measured the heights of their adult children and plotted their median heights on the same graph.

Galton fit a line to each set of heights, and added a reference line to show the average adult height (68.25 inches).

Like most statisticians, Galton was all about deviance. So he represented his results in terms of deviance from the average adult height.

Based on these results, Galton concluded that as the heights of the parents deviated from the average height (that is, as they became taller or shorter than the average adult), their children tended to be less extreme in height. That is, the heights of the children regressed to the average height of an adult.

He calculated the rate of regression as 2/3 of the deviance value. So if the average height of the two parents was, say, 3 inches taller than the average adult height, their children would tend to be (on average) approximately 2/3*3 = 2 inches taller than the average adult height.

Galton published his results in a paper called “Regression towards Mediocrity in Hereditary Stature.”

So here’s the irony: The term regression, as Galton used it, didn't refer to the statistical procedure he used to determine the fit lines for the plotted data points. In fact, Galton didn’t even use the least-squares method that we now most commonly associate with the term “regression.” (The least-squares method had already been developed some 80 years previously by Gauss and Legendre, but wasn’t called “regression” yet.) In his study, Galton just "eyeballed" the data values to draw the fit line.

For Galton, “regression” referred only to the tendency of extreme data values to "revert" to the overall mean value. In a biological sense, this meant a tendency for offspring to revert to average size ("mediocrity") as their parentage became more extreme in size. In a statistical sense, it meant that, with repeated sampling, a variable that is measured to have an extreme value the first time tends to be closer to the mean when you measure it a second time. 

Later, as he and other statisticians built on the methodology to quantify correlation relationships and to fit lines to data values, the term “regression” became associated with the statistical analysis that we now call regression. But it was just by chance that Galton's original results using a fit line happened to show a regression of heights. If his study had shown increasing deviance of children's heights from the average compared to their parents, perhaps we'd be calling it "progression" instead.

So, you see, there’s nothing particularly “regressive” about a regression analysis.

And that makes the ICSSNN very happy.

Don't Regress....Progress

Never let intimidating terminology deter you from using a statistical analysis. The sign on the door is often much scarier than what's behind it. Regression is an intuitive, practical statistical tool with broad and powerful applications.

If you’ve never performed a regression analysis before, a good place to start is the Minitab Assistant. See Jim Frost’s post on using the Assistant to perform a multiple regression analysis. Jim has also compiled a helpful compendium of blog posts on regression.

And don’t forget Minitab Help. In Minitab, choose Help > Help. Then click Tutorials > Regression, or  Stat Menu >  Regression.

Sources

Bulmer, M. Francis Galton: Pioneer of Heredity and Biometry. Johns Hopkins University Press, 2003.

Davis, L. J. Obsession: A History. University of Chicago Press, 2008.

Galton, F. “Regression towards Mediocrity in Hereditary Stature.”  http://galton.org/essays/1880-1889/galton-1886-jaigi-regression-stature.pdf

Gillham, N. W. A Life of Sir Francis Galton. Oxford University Press, 2001.

Gould, S. J. The Mismeasure of Man. W. W. Norton, 1996.

R-Squared: Sometimes, a Square is just a Square


If you regularly perform regression analysis, you know that R2 is a statistic used to evaluate the fit of your model. You may even know the standard definition of R2: the percentage of variation in the response that is explained by the model.

Fair enough. With Minitab Statistical Software doing all the heavy lifting to calculate your R2 values, that may be all you ever need to know.

But if you’re like me, you like to crack things open to see what’s inside. Understanding the essential nature of a statistic helps you demystify it and interpret it more accurately.

R-squared: Where Geometry Meets Statistics

So where does  this mysterious R-squared value come from? To find the formula in Minitab, choose Help > Methods and Formulas. Click General statistics > Regression > Regression > R-sq.

R2 = 1 − [ Σ(yi − ŷi)² / Σ(yi − ȳ)² ]

Some spooky, wacky-looking symbols in there. Statisticians use those to make your knees knock together.

But all the formula really says is: “R-squared is a bunch of squares added together, divided by another bunch of squares added together, subtracted from 1.”

rsquare annotation

What bunch of squares, you ask?

square dance guys

No, not them.

SS Total: Total Sum of Squares

First consider the "bunch of squares" on the bottom of the fraction. Suppose your data is shown on the scatterplot below:

original data

(Only 4 data values are shown to keep the example simple. Hopefully you have more data than this for your actual regression analysis! )

Now suppose you add a line to show the mean (average) of all your data points:

scatterplot with line

The line y = mean of Y is sometimes referred to as the “trivial model” because it doesn’t contain any predictor (X) variables, just a constant. How well would this line model your data points?

One way to quantify this is to measure the vertical distance from the line to each data point. That tells you how much the line “misses” each data point. This distance can be used to construct the sides of a square on each data point.

pinksquares

If you add up the pink areas of all those squares for all your data points you get the total sum of squares (SS Total), the bottom of the fraction.

SS Total = Σ(yi − ȳ)²

SS Error: Error Sum of Squares

Now consider the model you obtain using regression analysis.

regression model

Again, quantify the "errors" of this model by measuring the vertical distance of each data value from the regression line and squaring it.

ss error graph

If you add the green areas of these squares you get the SS Error, the top of the fraction.

SS Error = Σ(yi − ŷi)²

So R2 basically just compares the errors of your regression model to the errors you’d have if you just used the mean of Y to model your data.
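If you like to check the arithmetic yourself, the whole calculation fits in a few lines. A small sketch with made-up numbers:

```python
import numpy as np

# Four made-up data points, echoing the simple example above
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])

# Fit a simple regression line and get the fitted values
slope, intercept = np.polyfit(x, y, 1)
fitted = slope * x + intercept

ss_total = np.sum((y - y.mean()) ** 2)   # pink squares: misses of the mean line
ss_error = np.sum((y - fitted) ** 2)     # green squares: misses of the regression line

r_squared = 1 - ss_error / ss_total
print(f"R-squared = {r_squared:.3f}")
```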

R-Squared for Visual Thinkers

 

rsquare final

The smaller the errors in your regression model (the green squares) in relation to the errors in the model based on only the mean (pink squares), the closer the fraction is to 0, and the closer R2 is to 1 (100%).

That’s the case shown here. The green squares are much smaller than the pink squares. So the R2 for the regression line is 91.4%.

But if the errors in your regression model are about the same size as the errors in the trivial model that uses only the mean, the areas of the pink squares and the green squares will be similar, making the fraction close to 1, and the R2 close to 0.

That means your model isn't producing a "tight fit" for your data, generally speaking. You’re getting about the same size errors you’d get if you simply used the mean to describe all your data points!

R-squared in Practice

Now you know exactly what R2 is. People have different opinions about how critical the R-squared value is in regression analysis.  My view?  No single statistic ever tells the whole story about your data. But that doesn't invalidate the statistic. It's always a good idea to evaluate your data using a variety of statistics. Then interpret the composite results based on the context and objectives of your specific application. If you understand how a statistic is actually calculated, you'll better understand its strengths and limitations.

Related link

Want to see how another commonly used analysis, the t-test, really works? Read this post to learn how the t-test measures the "signal" to the "noise" in your data.


Gleaning Insights from Election Data with Basic Statistical Tools


One of the biggest pieces of international news last year was the so-called "Brexit" referendum, in which a majority of voters in the United Kingdom cast their ballots to leave the European Union (EU).

That outcome shocked the world. Follow-up media coverage has asserted that the younger generation prefers to remain in the EU, since that means more opportunities on the continent. The older generation, on the other hand, prefers to leave the EU.

As a statistician, I wanted to look at the data to see what I could find out about the Brexit vote, and recently the BBC published an article that included some detailed data.

In this post, I'll use Minitab Statistical Software to explore the data from the BBC site along with the data from the Electoral Commission website. I hope this exploration will give you some ideas about how you might use publicly available data to get insights about your customers or other aspects of your business.

The electoral commission data contains the voting details of all 382 regions in the United Kingdom. It includes information on voter turnout, the percent who voted to leave the EU, and the percent who voted to remain. (If you'd like to follow along, open the BrexitData1 and BrexitData2 worksheets in Minitab 18. If you don't already have Minitab, you can download the 30-day trial.)

I began by creating scatterplots (in Minitab, go to Graph > Scatterplot...) of the percentage of voter turnout against the percentage of the population that voted to leave for each region, as shown below.

Scatterplot of Brexit Voter Data1

Scatterplot of Brexit Voter Data, #2

According to commentators, areas with high voter turnout had a tendency to vote to leave, as the elderly were more likely to turn up to vote. There is also a perceptible difference between the plots for the different areas.

To make this easier to analyze, I created an indicator variable called “decided to leave” in my Minitab worksheet. This variable takes the value of 1 if the area voted to leave the EU, and takes the value 0 otherwise. Tallying the number of areas in each region that voted to leave or remain (Stat > Tables > Tally Individual Variables...) yields the following:

Tabulated Brexit Statistics: Region, Decided to Leave

There are indeed regional differences. For example, London and Scotland voted strongly to remain while North East and North West voted strongly to leave. So, do we see greater voter turnout in the regions that voted to leave? Looking at the average turnout in each region (using Stat > Display Descriptive Statistics...), we have the following:

Brexit Data - Descriptive Statistics

Surprisingly, the average turnout of regions that voted strongly to leave is not very different from the turnout of regions that voted strongly to remain. For example, the average turnout was 69.817% in London, compared to 70.739% in the North West.

The data set analyzed in the BBC article contains localised voting data supplied to the BBC by the councils that counted the EU referendum votes. This data is more detailed than the regional data from the Electoral Commission, and it includes a detailed breakdown of how the people in individual electoral wards voted.

The BBC asked all the counting areas for these figures. Three councils did not reply. The remaining missing data could be due to any of the following reasons:

  • The council refused to give the information to the BBC.
  • No geographical information was available because all ballot boxes were mixed before counting.
  • The council conducted a number of mini-counts that combined ballot boxes in a way that does not correspond to individual wards.

For those wards that have voting data, I also gathered the following information from the last census for each area.

  • Percent of population in an area with level 4 qualification or higher. This includes individuals with a higher certificate/diploma, foundation degree, undergraduate degree, or master’s degree up to a doctorate. I will call this variable “degree” to represent individuals holding degrees or equivalent qualification.
  • Percentage of young people (age 18-29) in an area.
  • Percentage of middle-aged (age 30-59) in an area.
  • Percentage of elderly (age 65 or above) in an area.

Some wards are defined differently in this data set than in the data from the last census, perhaps due to changes in ward boundaries. Thus, for some wards, it was not possible to match the corresponding percentages of different age groups and degree holders. Therefore, some areas had to be omitted from my analysis, leaving me with data from a total of 1,069 wards.

With the exception of Scotland, Northern Ireland, and Wales, I have data from wards in all regions of the UK. The number of measurements from each region appears below.

Brexit Data, Descriptive Statistics N

As with the Electoral Commission data, let’s begin by looking at some graphs. Below is a scatterplot of the percentage voting to leave against the percent of the population with a degree in an area.

Scatterplot of Brexit Data:  Leave % vs. Degree

As you can see, the higher the percentage of people in an area who had a degree, the lower the percentage of the population that voted to leave. However, there are exceptions. For example, in Osterley and Spring Grove in Hounslow, 63.41% voted to leave even though the area has a relatively high percentage of degree holders, at 37.5566%. The area does, however, have a small proportion of young adults, at 19.3538%.

Let's look at the voting behaviour for different age groups. I created scatterplots of the percentage that voted to leave against different age groups.

The next plot shows percentage that voted to leave against the percentage of young people (age 18-29) in an area:

Scatterplot of Brexit Data: Leave% vs Young

Areas with a higher percentage of young people appear to have a smaller percentage of people who voted to leave.

The following plot shows the percentage of the population that voted to leave against the percentage of elderly residents:

Scatterplot of Brexit Data: Leave% vs. Elderly

This plot shows the opposite situation shown in the previous one: areas with a higher proportion of elderly residents voted more strongly to leave.

These scatterplots support what’s being said in pieces such as the article on the BBC's website. However, in statistics, we like to verify that the relationship is significant. Let’s look at the correlation coefficients (Stat > Basic Statistics > Correlation...).

Brexit Data: Correlation - Leave%, Degree, Young, Elderly

The correlation output in Minitab includes a p-value. If the p-value is less than the chosen significance level, it tells you the correlation coefficient is significantly different from 0—in other words, a correlation exists. Since we selected an alpha value (or significance level) of 0.05, we can say that all the coefficients calculated above are significant and that there are correlations between these factors.

Thus, the proportion of degree holders in an area has a strong negative association with voting to leave. On the other hand, the proportion of elderly residents in an area has a strong positive association with voting to leave.
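For readers following along in another package, Pearson correlations and their p-values take only a line or two per pair. A rough sketch, with a hypothetical file and column names standing in for the ward-level worksheet:

```python
import pandas as pd
from scipy.stats import pearsonr

wards = pd.read_csv("brexit_wards.csv")   # hypothetical export of the ward data

for column in ["Degree", "Young", "Elderly"]:
    r, p = pearsonr(wards["LeavePct"], wards[column])
    print(f"LeavePct vs {column}: r = {r:.3f}, p = {p:.4f}")
```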

Going a step further, I fit a regression model (Stat > Regression > Regression > Fit Regression Model...) that links the percent voting to leave with the proportion of degree holders and different age groups.

Brexit Data Regression: Leave% vs Degree, Young, Middle-age, Elderly

While there is no need to use the equation to make a prediction, we can still get some interesting information from the results.

The different age groups and the proportion of degree holders all have an impact on the percentage voting to leave. The coefficient for the “degree” term is negative, which implies that for each one-unit increase in the percentage of degree holders, the percentage voting to leave drops by 1.4095 points. On the other hand, for a one-unit increase in the percentage of elderly residents, the percentage voting to leave increases by 1.2732 points. In addition, there is a significant interaction between the percentage of degree holders and young people: every unit increase in this interaction term increases the percentage voting to leave by only 0.00641 points.

The results I obtained when I analyzed the data with Minitab support the commonly held view that younger voters preferred to remain in the EU, while older voters preferred to leave. The analysis also underscores the complicated politics surrounding Brexit, a reality that became apparent in the recent general election. One thing seems certain now that Brexit talks are imminent: balancing the needs and desires of the people from different age groups and backgrounds will be a tremendous task.

What Is the Difference between Linear and Nonlinear Equations in Regression Analysis?


Previously, I’ve written about when to choose nonlinear regression and how to model curvature with both linear and nonlinear regression. Since then, I’ve received several comments expressing confusion about what differentiates nonlinear equations from linear equations. This confusion is understandable because both types can model curves.

So, if it’s not the ability to model a curve, what is the difference between a linear and nonlinear regression equation?

Linear Regression Equations

Linear regression requires a linear model. No surprise, right? But what does that really mean?

A model is linear when each term is either a constant or the product of a parameter and a predictor variable. A linear equation is constructed by adding the results for each term. This constrains the equation to just one basic form:

Response = constant + parameter * predictor + ... + parameter * predictor

Y = b0 + b1X1 + b2X2 + ... + bkXk

In statistics, a regression equation (or function) is linear when it is linear in the parameters. While the equation must be linear in the parameters, you can transform the predictor variables in ways that produce curvature. For instance, you can include a squared variable to produce a U-shaped curve.

Y = b0 + b1X1 + b2X1²

This model is still linear in the parameters even though the predictor variable is squared. You can also use log and inverse functional forms that are linear in the parameters to produce different types of curves.

Here is an example of a linear regression model that uses a squared term to fit the curved relationship between BMI and body fat percentage.

Linear model with squared term
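A fit like the one above is estimated with ordinary least squares just like any other linear regression. Here is a minimal sketch, with a hypothetical file and column names standing in for the BMI data:

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("bmi_bodyfat.csv")       # hypothetical columns: BodyFat, BMI

# Still a *linear* model: it is linear in b0, b1, b2 even though BMI is squared
model = smf.ols("BodyFat ~ BMI + I(BMI**2)", data=df).fit()
print(model.params)
```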

Nonlinear Regression Equations

While a linear equation has one basic form, nonlinear equations can take many different forms. The easiest way to determine whether an equation is nonlinear is to focus on the term “nonlinear” itself. Literally, it’s not linear. If the equation doesn’t meet the criteria above for a linear equation, it’s nonlinear.

That covers many different forms, which is why nonlinear regression provides the most flexible curve-fitting functionality. Here are several examples from Minitab’s nonlinear function catalog. Thetas represent the parameters and X represents the predictor in the nonlinear functions. Unlike linear regression, these functions can have more than one parameter per predictor variable.

  • Power (convex): Theta1 * X^Theta2
  • Weibull growth: Theta1 + (Theta2 - Theta1) * exp(-Theta3 * X^Theta4)
  • Fourier: Theta1 * cos(X + Theta4) + Theta2 * cos(2*X + Theta4) + Theta3
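Fitting one of these forms means estimating the thetas with an iterative algorithm rather than ordinary least squares. A minimal sketch using the convex power function from the catalog above; the data here are simulated purely for illustration:

```python
import numpy as np
from scipy.optimize import curve_fit

def power_model(x, theta1, theta2):
    # Power (convex): Theta1 * X^Theta2 -- two parameters for a single predictor
    return theta1 * np.power(x, theta2)

rng = np.random.default_rng(3)
x = np.linspace(1, 10, 50)
y = 2.5 * x**1.7 + rng.normal(scale=2, size=x.size)   # simulated data

theta, _ = curve_fit(power_model, x, y, p0=[1.0, 1.0])
print(f"Theta1 = {theta[0]:.3f}, Theta2 = {theta[1]:.3f}")
```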

Here is an example of a nonlinear regression model of the relationship between density and electron mobility.

Nonlinear regression model for electron mobility

The nonlinear equation is so long that it doesn't fit on the graph:

Mobility = (1288.14 + 1491.08 * Density Ln + 583.238 * Density Ln^2 + 75.4167 * Density Ln^3) / (1 + 0.966295 * Density Ln + 0.397973 * Density Ln^2 + 0.0497273 * Density Ln^3)

Linear and nonlinear regression are actually named after the functional form of the models that each analysis accepts. I hope the distinction between linear and nonlinear equations is clearer and that you understand how it’s possible for linear regression to model curves! It also explains why you’ll see R-squared displayed for some curvilinear models even though R-squared is not a valid measure for nonlinear regression.

If you're learning about regression, read my regression tutorial!

How to Estimate the Probability of a No-Show using Binary Logistic Regression


In April 2017, overbooking of flight seats hit the headlines when a United Airlines customer was dragged off a flight. A TED talk by Nina Klietsch gives a good, but simplistic explanation of why overbooking is so attractive to airlines.

Overbooking is not new to the airlines; these strategies were officially sanctioned by The American Civil Aeronautics Board in 1965, and since that time complex statistical models have been researched and developed to set the ticket pricing and overbooking strategies to deliver maximum revenue to the airlines.  

airline travel

In this blog, I would like to look at one aspect of this: the probability of a no-show. In Klietsch’s talk, she assumed that the probability of a no-show (a customer not turning up for a flight) is identical for all customers. In reality, this is not true—factors such as time of day, price, time since booking, and whether a traveler is alone or in a group will impact the probability of a no-show.

By using this information about our customers, we can predict the probability of a no-show using binary logistic regression. This type of modeling is common to many services and industries. Some of the applications, in addition to predicting no-shows, include:
  • Credit scores: What is the probability of default? 
  • Marketing offers: What are the chances you'll buy a product based on a specific offer?
  • Quality: What is the probability of a part failing?
  • Human resources: What is the sickness absence rate likely to be? 

In all cases, your outcome (the event you are predicting) is discrete and can be split into two separate groups; for example, purchase/no purchase, pass/fail, or show/no show. Using the characteristics of your customers or parts as predictors you can use this modeling technique to predict the outcome.

Let’s look at an example. I was unable to find any airline data, so I am illustrating this with one of our Minitab sample data sets, Cerealpurchase.mtw.

In this example, a food company surveys consumers to measure the effectiveness of their television ad in getting viewers to buy their cereal. The Bought column has the value 1 if the respondent purchased the cereal, and the value 0 if not. In addition to asking if respondents have seen the ad, the survey also gathers data on the household income and the number of children, which the company also believes might influence the purchase of this cereal.

Using Stat > Regression > Binary Logistic Regression, I entered the details of the response I wanted to predict, Bought, and the value in Response Event that indicates a purchase. I then entered the Continuous predictor, Income, and the Categorical predictors, Children and ViewAd. My completed dialog box looks like this:

binary logistic regression dialog

After pressing OK, Minitab performs the analysis and displays the results in the Session window. From this table at the top of the output I can see that the researchers surveyed a sample of 71 customers, of which 22 purchased the cereal.

response information

With Logistic regression, the output features a Deviance Table instead of an Analysis of Variance Table. The calculations and test statistics used with this type of data are different, but we still use the P-value on the far right to determine which factors have an effect on our response.

deviance table

As we would when using other regression methods, we are going to reduce the model by eliminating non-significant terms one at a time. In this case, as highlighted above, Income is not significant. We can simply press Ctrl-E to recall the last dialog box, remove the Income term from the model, and rerun the analysis. Minitab returns the following results: 

deviance table

After removing Income, we can see that both Children and ViewAd are significant at the 0.05 significance level. This could be good news for the Marketing Department, as it clearly indicates that viewing the ad did influence the decision to buy. However, from this table it is not possible to see whether this effect is positive or negative.

To understand this, we need to look at another part of the output. In Binary Logistic Regression, we are trying to estimate the probability of an event. To do this we use the Odds Ratio, which compares the odds of two events by dividing the odds of success under condition A by the odds of success under condition B.  

Odds Ratio

In this example, the Odds Ratio for Children tells us that respondents who reported having children have 5.1628 times the odds of purchasing the cereal compared with those who did not report having children. The good news for the Marketing Department is that customers who viewed the ad had 3.0219 times the odds of purchasing the cereal. If the Odds Ratio were less than 1, we would conclude that seeing the advert reduces sales!
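For readers who prefer code, the same kind of model and odds ratios can be reproduced in Python; this sketch assumes the Cerealpurchase data have been exported to a hypothetical CSV with the Bought, Children, and ViewAd columns:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

cereal = pd.read_csv("cereal_purchase.csv")   # hypothetical export of Cerealpurchase.mtw

# Reduced model after dropping Income: Bought (0/1) vs. Children and ViewAd
fit = smf.logit("Bought ~ C(Children) + C(ViewAd)", data=cereal).fit(disp=False)

# Exponentiated coefficients are the odds ratios
print(np.exp(fit.params))
```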

The other way to look at these results is to calculate the probability of purchase and analyse this.

It is easy to calculate the probability of a sale by clicking on the Storage button in the Binary Logistic Regression dialog box and checking the box labeled Fits (event probabilities). This will store the probability of purchase in the worksheet.

data with stored fits

Using the fits data, we can produce a table summarizing the Probability of Purchase for all the combinations of Children and ViewAd, as follows:

tabulated statistics

In the rows we have the Children indicator, and in the columns we have the ViewAd indicator.  In each cell the top number is the probability of cereal purchase, and the bottom number is the count of customers observed in each of the groups. 

Based on this table, customers with children who have seen the ad have a 51% chance of purchase, whereas customers without children who have not seen the ad have a 6% chance of purchase.

Now let's bring this back to our airline example. Using the information about their customers' demographics and flight preferences, an airline can use binary logistic regression to estimate the probability of a “no-show” for each passenger on a flight and then determine by how much they should overbook seats. Of course, no model is perfect, and as we saw with United, getting it wrong can have severe consequences.

 

The Easiest Way to Do Multiple Regression Analysis


Maybe you're just getting started with analyzing data. Maybe you're reasonably knowledgeable about statistics, but it's been a long time since you did a particular analysis and you feel a little bit rusty. In either case, the Assistant menu in Minitab Statistical Software gives you an interactive guide from start to finish. It will help you choose the right tool quickly, analyze your data properly, and even interpret the results appropriately. 

One type of analysis many practitioners struggle with is multiple regression analysis, particularly an analysis that aims to optimize a response by finding the best levels for different variables. In this post, we'll use the Assistant to complete a multiple regression analysis and optimize the response.

Identifying the Right Type of Regression 

In our example, we'll use a data set based on some solar energy research. Scientists found that the position of focal points could be used to predict total heat flux. The goal of our analysis will be to use the Assistant to find the ideal position for these focal points.

When you select Assistant > Regression in Minitab, the software presents you with an interactive decision tree. If you need more explanation about a decision point, just click on the diamonds to see detailed information and examples.

Minitab's Assistant menu interactive decision tree

This data set has three X variables, or predictors, and we're looking to fit a model and optimize the response. For this goal, the tree leads to the Optimize Response button located at the bottom right. Clicking that button brings up a simple dialog box to complete.

HeatFlux is the response variable. The X variables are the focal points located in each direction: East, South, and North. Based on previous knowledge, we'll use 234 as the target heat flux value, but we could also ask the Assistant to maximize or minimize the response. Because we checked the box labeled "Fit 2-way interactions and quadratic terms," the Assistant also will check for curvature and interactions.

Minitab's Assistant menu dialog box

When we press "OK," the Assistant quickly generates a regression model for the X variables using stepwise regression. It presents the results in a series of reports written in plain, easy-to-follow language. 

Summary Report

Multiple regression summary report for Minitab's Assistant

This Summary Report delivers the "big picture" about the analysis and its results. With a p-value less than 0.001, this report shows that the regression model is statistically significant, with an R-squared value of 96.15%! The comments window shows which X variables the model includes: East, South, and North, as well as interaction terms. To model curvature, the model also includes several polynomial terms.

Effects Report

Effects report for Minitab's Assistant menu

The Effects Report shows all of the interaction and main effects included in the model. The presence of curved lines indicates the Assistant used a polynomial term to fit a curve.

In this report, the East*South interaction is significant. This means the effect of one variable on heat flux varies based on the other variable. If South has a low setting (31.84), heat flux is reduced by increasing East. But if South is set high (40.55), the heat flux increases as East gets higher.

Diagnostic Report

Multiple regression diagnostic report for Minitab's Assistant

The Diagnostic Report shows you the plot of residuals versus fitted values, and indicates any unusual points that ought to be investigated. This report has flagged two points, but these are not necessarily problematic, since based on the criteria for large residuals we'd expect roughly 5% of the observations to be flagged. The report also identifies two points that had unusual X values; clicking the points reveals which worksheet row they are in.
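
The same flagging rule is easy to mimic outside Minitab. The sketch below, run on simulated data rather than the solar study, computes standardized residuals and flags any with absolute value greater than 2, which is why roughly 5% of well-behaved observations get flagged by chance alone.

```python
# Sketch: flag large standardized residuals (|r| > 2), the same idea the
# Diagnostic Report uses. The model and data are simulated stand-ins.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import OLSInfluence

rng = np.random.default_rng(2)
df = pd.DataFrame({"x": rng.uniform(0, 10, 40)})
df["y"] = 3 + 2 * df["x"] + rng.normal(0, 1, 40)

fit = smf.ols("y ~ x", data=df).fit()
std_resid = OLSInfluence(fit).resid_studentized_internal  # standardized residuals

flagged = np.flatnonzero(np.abs(std_resid) > 2)
print(f"Rows with large residuals: {flagged}")
print(f"Flagged {len(flagged)} of {len(df)} points (~{100 * len(flagged) / len(df):.0f}%)")
```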

Model Building Report

Multiple regression model building report for Minitab's Assistant

The Model Building Report details how the Assistant arrived at the final regression model. It also contains the regression equation, identifies the variables that contribute the most information, and indicates whether the X variables are correlated. In this model, North contributes the most information. Even though East is not significant on its own, the Assistant includes it because it is part of a higher-order term.

This is a good opportunity to point out how the Assistant helps ensure that an analysis is done in the best way. For example, the Assistant uses standardized X variables to create the regression model. That's because standardizing the X variables removes most of the correlation between linear and higher-order terms, which reduces the chance of adding these terms to your model if they aren't needed. However, the Assistant still displays the final model in natural (unstandardized) units.
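
A quick numerical sketch shows why this matters. Using arbitrary made-up values for one predictor, centering it (one simple form of standardization) removes almost all of the correlation between the predictor and its square:

```python
# Sketch: centering a predictor removes most of the correlation between the
# predictor and its square. The values are arbitrary, e.g. a focal point setting.
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(30, 40, 50)          # raw predictor values
x_centered = x - x.mean()

print("corr(x, x^2) raw:     ", round(np.corrcoef(x, x**2)[0, 1], 3))
print("corr(x, x^2) centered:", round(np.corrcoef(x_centered, x_centered**2)[0, 1], 3))
```

With the raw values the correlation is nearly 1, so the linear and quadratic terms carry almost the same information; after centering it drops close to zero.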

Prediction and Optimization Report

Multiple regression prediction and optimization report for Minitab's Assistant

The Assistant's Prediction and Optimization Report provides solutions for obtaining the targeted heat flux value of 234. The optimal settings for the focal points have been identified as East 37.82, South 31.84, and North 16.01. The model predicts that these settings will deliver a heat flux of 234, with a prediction interval of 216 to 252. But the Assistant provides alternate solutions you may want to consider, particularly in cases where specialized subject area expertise might be critical.
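
Conceptually, this target search is an optimization problem: find focal-point settings, within their observed ranges, whose predicted heat flux is as close to 234 as possible. The sketch below illustrates that idea with scipy; the predict() function and the bounds are made-up stand-ins, not the Assistant's actual fitted model or data ranges.

```python
# Sketch: search for settings that hit a target response of 234. The predict()
# function is a hypothetical stand-in for a fitted regression model.
import numpy as np
from scipy.optimize import minimize

def predict(x):
    """Hypothetical fitted model: heat flux as a function of (East, South, North)."""
    east, south, north = x
    return (2.0 * east + 2.5 * south + 3.0 * north
            - 0.5 * (north - 18.0) ** 2 + 0.03 * east * south)

def distance_from_target(x, target=234.0):
    # Squared distance between the predicted response and the target
    return (predict(x) - target) ** 2

bounds = [(31.0, 40.0), (31.8, 40.6), (16.0, 20.5)]   # assumed ranges for East, South, North
start = [np.mean(b) for b in bounds]

result = minimize(distance_from_target, start, bounds=bounds)
east, south, north = result.x
print(f"Settings: East={east:.2f}, South={south:.2f}, North={north:.2f}, "
      f"predicted flux={predict(result.x):.1f}")
```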

Report Card

Multiple regression report card for Minitab's Assistant

Finally, the Report Card prevents you from missing potential problems that could make your results unreliable. In this case, the report suggests collecting a larger sample and investigating the unusual residuals. It also shows that normality is not an issue for these data, and it provides a helpful reminder to validate the model's optimal values by doing confirmation runs.

The Assistant's methods are based on established statistical practice, guidelines in the literature, and simulations performed by Minitab's statisticians. You can read the technical white paper for Multiple Regression in the Assistant if you would like all the details.

 

How to Avoid Overfitting Your Regression Model


Overfitting a model is a real problem you need to beware of when performing regression analysis. An overfit model results in misleading regression coefficients, p-values, and R-squared statistics. Nobody wants that, so let's examine what overfit models are, and how to avoid falling into the overfitting trap.

Put simply, an overfit model is too complex for the data you're analyzing. Rather than reflecting the entire population, an overfit regression model is perfectly suited to the noise, anomalies, and random features of the specific sample you've collected. When that happens, the overfit model is unlikely to fit another random sample drawn from the same population, which would have its own quirks.

A good model should fit not just the sample you have, but any new samples you collect from the same population. 

For an example of the dangers of overfitting regression models, take a look at this fitted line plot:

Example of an overfit regression model

Even though this model looks like it explains a lot of variation in the response, it's too complicated for this sample data. In the population, there is no true relationship between the predictor and this response, as is explained in detail here.

Basics of Inferential Statistics

For more insight into the problems with overfitting, let's review a basic concept of inferential statistics: drawing conclusions about a population from a random sample. The sample data are used to provide unbiased estimates of population parameters and relationships, and to test hypotheses about the population.

In inferential statistics, the size of your sample affects the amount of information you can glean about the population. If you want to learn more, you need larger sample sizes. Trying to wrest too much information from a small sample isn't going to work very well.

For example, with a sample size of 20, you could probably get a good estimate of a single population mean. But estimating two population means with a total sample size of 20 is a riskier proposition. If you want to estimate three or more population means with that same sample, any conclusions you draw are going to be pretty sketchy. 

In other words, trying to learn too much from a sample leads to results that aren't as reliable as we'd like. In this example, as the number of observations per parameter drops from 20 to 10 to 6.7 and beyond, the parameter estimates become less and less reliable. A new sample would likely yield different parameter estimates.

How Sample Size Relates to an Overfit Model

Similarly, overfitting a regression model results from trying to estimate too many parameters from too small a sample. In regression, a single sample is used to estimate the coefficients for all of the terms in the model. That includes every predictor, interaction, and polynomial term. As a result, the number of terms you can safely include depends on the size of your sample.

Larger samples permit more complex models, so if the question or process you're investigating is very complicated, you'll need a sample size large enough to support that complexity. With an inadequate sample size, your model won't be trustworthy.

So your sample needs enough observations for each term. In multiple linear regression, 10-15 observations per term is a good rule of thumb. A model with two predictors and an interaction, therefore, would require 30 to 45 observations—perhaps more if you have high multicollinearity or a small effect size. 
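
That rule of thumb is just multiplication, but writing it down makes it easy to apply while planning a study. The helper below is a hypothetical convenience function, not a Minitab feature.

```python
# Sketch: the 10-15 observations-per-term rule of thumb as a quick calculation.
def recommended_n(n_terms, per_term=(10, 15)):
    """Rule-of-thumb sample size range for a multiple regression model."""
    return n_terms * per_term[0], n_terms * per_term[1]

# Two predictors plus one interaction = 3 terms
print(recommended_n(3))   # (30, 45)
```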

Avoiding Overfit Models

You can detect overfitting through cross-validation, which assesses how well your model predicts new observations. Partitioning your data is one way to check how the model fits observations that weren't used to estimate it.

For linear models, Minitab calculates predicted R-squared, a cross-validation method that doesn't require a separate sample. To calculate predicted R-squared, Minitab systematically removes each observation from the data set, estimates the regression equation, and determines how well the model predicts the removed observation.

A model that performs poorly at predicting the removed observations probably conforms to the specific data points in the sample, and can't be generalized to the full population. 
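
If you want to see the leave-one-out idea in code, the sketch below computes predicted R-squared from the PRESS statistic on simulated data. Minitab reports this value automatically for linear models, so the script is only there to make the calculation concrete.

```python
# Sketch: predicted R-squared via the PRESS statistic (leave-one-out residuals).
# The data are simulated; the deleted residuals are obtained from the ordinary
# residuals and the hat-matrix diagonal rather than by refitting n times.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.outliers_influence import OLSInfluence

rng = np.random.default_rng(4)
df = pd.DataFrame({"x1": rng.uniform(0, 10, 30), "x2": rng.uniform(0, 10, 30)})
df["y"] = 5 + 2 * df["x1"] - 1.5 * df["x2"] + rng.normal(0, 2, 30)

fit = smf.ols("y ~ x1 + x2", data=df).fit()

h = OLSInfluence(fit).hat_matrix_diag            # leverage of each observation
press = ((fit.resid / (1 - h)) ** 2).sum()       # sum of squared deleted residuals
ss_total = ((df["y"] - df["y"].mean()) ** 2).sum()

print(f"R-squared:           {fit.rsquared:.3f}")
print(f"Predicted R-squared: {1 - press / ss_total:.3f}")
```

When a model is overfit, its predicted R-squared is typically much lower than its ordinary R-squared, which is the warning sign to look for.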

The best solution to an overfitting problem is avoidance. Identify the important variables, think about the model you are likely to specify, and then plan ahead to collect a sample large enough to handle all the predictors, interactions, and polynomial terms your response variable might require.

Jim Frost offers some good advice about selecting a model in How to Choose the Best Regression Model. Also, check out his post about how too many phantom degrees of freedom can lead to overfitting, too.
