Regression Analysis | Minitab

The Danger of Overfitting Regression Models


Example of an overfit regression model

In regression analysis, overfitting a model is a real problem. An overfit model can cause the regression coefficients, p-values, and R-squared to be misleading. In this post, I explain what an overfit model is and how to detect and avoid this problem.

An overfit model is one that is too complicated for your data set. When this happens, the regression model becomes tailored to fit the quirks and random noise in your specific sample rather than reflecting the overall population. If you drew another sample, it would have its own quirks, and your original overfit model would not likely fit the new data.

Instead, we want our model to approximate the true model for the entire population. Our model should not only fit the current sample, but new samples too.

The fitted line plot illustrates the dangers of overfitting regression models. This model appears to explain a lot of variation in the response variable. However, the model is too complex for the sample data. In the overall population, there is no real relationship between the predictor and the response. You can read about the model here.

Fundamentals of Inferential Statistics

To understand how overfitting causes these problems, we need to go back to the basics for inferential statistics.

The overall goal of inferential statistics is to draw conclusions about a larger population from a random sample. Inferential statistics uses the sample data to provide the following:

  • Unbiased estimates of properties and relationships within the population.
  • Hypothesis tests that assess statements about the entire population.

An important concept in inferential statistics is that the amount of information you can learn about a population is limited by the sample size. The more you want to learn, the larger your sample size must be.

You probably understand this concept intuitively, but here’s an example. If you have a sample size of 20 and want to estimate a single population mean, you’re probably in good shape. However, if you want to estimate two population means using the same total sample size, it suddenly looks iffier. If you increase it to three population means and more, it starts to look pretty bad.

The quality of the results worsens when you try to learn too much from a sample. As the number of observations per parameter decreases in the example above (20, 10, 6.7, etc.), the estimates become more erratic and a new sample is less likely to reproduce them.

Applying These Concepts to Overfitting Regression Models

In a similar fashion, overfitting a regression model occurs when you attempt to estimate too many parameters from a sample that is too small. Regression analysis uses one sample to estimate the values of the coefficients for all of the terms in the equation. The sample size limits the number of terms that you can safely include before you begin to overfit the model. The number of terms in the model includes all of the predictors, interaction effects, and polynomial terms (to model curvature).

Larger sample sizes allow you to specify more complex models. For trustworthy results, your sample size must be large enough to support the level of complexity that is required by your research question. If your sample size isn’t large enough, you won’t be able to fit a model that adequately approximates the true model for your response variable. You won’t be able to trust the results.

Just like the example with multiple means, you must have a sufficient number of observations for each term in a regression model. Simulation studies show that a good rule of thumb is to have 10-15 observations per term in multiple linear regression.

For example, if your model contains two predictors and the interaction term, you’ll need 30-45 observations. However, if the effect size is small or there is high multicollinearity, you may need more observations per term.
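To see what happens when you violate this rule of thumb, here's a minimal simulation sketch in Python (using NumPy and scikit-learn as a stand-in for the Minitab analysis; the sample size and degrees are illustrative choices of mine). It fits both a simple and a deliberately overcomplicated model to pure noise, and the overcomplicated one appears to explain variation that isn't really there.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=15).reshape(-1, 1)  # small sample: n = 15
y = rng.normal(size=15)           # pure noise: no true relationship with x

for degree in (1, 8):
    # degree 8 means 8 terms for 15 observations: far below 10-15 per term
    X = PolynomialFeatures(degree, include_bias=False).fit_transform(x)
    r2 = LinearRegression().fit(X, y).score(X, y)
    print(f"{degree}-term model: in-sample R-squared = {r2:.2f}")
```

The inflated R-squared of the 8-term model comes entirely from fitting the quirks of this one sample.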

How to Detect and Avoid Overfit Models

Cross-validation can detect overfit models: it partitions your data and determines how well your model generalizes to data it was not fit to. This process helps you assess how well the model predicts new observations that weren't used in the model estimation process.

Minitab statistical software provides a great cross-validation solution for linear models by calculating predicted R-squared. This statistic is a form of cross-validation that doesn't require you to collect a separate sample. Instead, Minitab calculates predicted R-squared by systematically removing each observation from the data set, estimating the regression equation, and determining how well the model predicts the removed observation.

If the model does a poor job at predicting the removed observations, this indicates that the model is probably tailored to the specific data points that are included in the sample and not generalizable outside the sample.
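If you want to see the mechanics behind predicted R-squared, here's a minimal hand-rolled sketch in Python (my own version, not Minitab's code). It computes the PRESS statistic using the leverage shortcut, which gives each leave-one-out residual without actually refitting the model n times.

```python
import numpy as np

def predicted_r_squared(X, y):
    """Predicted R-squared via the PRESS statistic.

    X is the design matrix (including a column of ones for the intercept)
    and y is the response. The leave-one-out residual equals the ordinary
    residual divided by (1 - h_ii), where h_ii is the leverage, so no
    refitting loop is needed.
    """
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    h = np.einsum("ij,jk,ik->i", X, np.linalg.inv(X.T @ X), X)  # leverages
    press = np.sum((resid / (1 - h)) ** 2)
    return 1 - press / np.sum((y - y.mean()) ** 2)
```

A predicted R-squared far below the ordinary R-squared is the signature of an overfit model.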

To avoid overfitting your model in the first place, collect a sample that is large enough so you can safely include all of the predictors, interaction effects, and polynomial terms that your response variable requires. The scientific process involves plenty of research before you even begin to collect data. You should identify the important variables, the model that you are likely to specify, and use that information to estimate a good sample size.

For more about the model selection process, read my blog post, How to Choose the Best Regression Model. Also, check out my post about overfitting regression models by using too many phantom degrees of freedom. The methods described above won't necessarily detect this problem.


Regression with Meat Ants: Analyzing a Count Response (Part 1)


Ever use dental floss to cut soft cheese? Or Alka Seltzer to clean your toilet bowl? You can find a host of nonconventional uses for ordinary objects online. Some are more peculiar than others.

Ever use ordinary linear regression to evaluate a response (outcome) variable of counts? 

Technically, ordinary linear regression was designed to evaluate a continuous response variable. A continuous response variable, such as temperature or length, is measured along a continuous scale that includes fractional (decimal) values. In practice, however, ordinary linear regression is often used to evaluate a response of count data, which are whole numbers such as 0, 1, 2, and so on.

You can do that. Just like you can use a banana to clean a DVD. But there are things to watch out for if you do that. To examine issues related to performing ordinary linear regression analysis with count data, consider the following scenario.

Kids, Ants, and Sandwiches

A bored kid in a backyard makes a great scientist. One day, three Australian kids wondered which of their lunch sandwiches would attract more meat ants: Peanut butter, Vegemite, or Ham and pickles.

Note: Meat ants are an aggressive species of Australian ant that can kill a poisonous cane toad. Vegemite is a slightly bitter, salty brown paste made from brewer’s yeast extract.

To test their hypotheses, the kids started dropping pieces of the three sandwiches and counting the number of ants on each sandwich after a set amount of time. Years later, as an adult, one of the kids replicated this childhood experiment with increased rigor. You can find the details of his modified experiment and the sample data it produced on the website of the American Statistical Association.

Preparing the Data

To make the data and the results easier to interpret, I coded and sorted the original sample data set using the Code and Sort commands in Minitab's Data menu. If you want to see those data manipulation maneuvers, click here to open the project file in Minitab, then open the Report Pad to see the instructions. If you don't have a copy of Minitab, you can download a free 30-day trial version.

After coding and sorting, the combination of factor levels for each sandwich used as ant bait is easy to see in the worksheet, and the data values are arranged in the order that they were collected.

For example, row 9 shows that ham and pickles on rye with butter was the 9th piece of sandwich bait used—and it attracted 65 meat ants.

Performing Linear Regression

Are meat ants statistically more likely to swarm a ham sandwich—or will the pickles be a turnoff? Do they gravitate to the creamy comfort of butter? Or will salty, malty Vegemite drive them wild?

To evaluate the data using ordinary linear regression, choose Stat > Regression > Fit Regression Model. Fill out the dialog box as shown below and click OK.

First, examine the ANOVA table to determine whether any of the predictors are statistically significant.

At the 0.1 level of significance, both Filling and Butter predictors are statistically significant (p-value < 0.1). What matters to a meat ant, it seems, is not the bread, but what's between it.

To see how each of the levels of the factors relate to the number of ants (the response), examine the Coefficients table.

Each coefficient value is calculated in relation to the reference level for the variable, which has a coefficient of 0. Whatever level isn’t shown in the table is the reference level. So for the Filling variable, the reference level is Vegemite.

Tip: You can see the reference levels used for each variable by clicking the Coding button on the Regression dialog box. If you want the coefficients to be calculated relative to a different level, simply change the reference level in the drop-down list and rerun the analysis.

So what do these coefficient values mean? Generally speaking, larger coefficients are associated with a response of greater magnitude. The positive coefficients indicate a positive association, and the negative coefficients indicate a negative association.

For example, the positive coefficient of 27.28 for ham and pickles indicates that many more ants are attracted to ham and pickles than to Vegemite. The p-value of 0.000 for the coefficient indicates that the difference between ham and pickles and Vegemite is statistically significant. Based on these results, meat ants appear to be aptly named!

The Regression Equation: Caveat with a Count Response

The output for ordinary linear regression also includes a regression equation. The equation can be used to estimate the value of the response for specific values of the predictor variables.

For categorical predictors, substitute a value of 1 into the equation for the levels at which you want to predict a response, and substitute 0 for the other levels.

For example, using the equation above, the number of meat ants that you can expect to be attracted by a peanut butter sandwich, without butter, on white bread, is estimated at: 24.31 + 7.04(0) + 1.12(0) - 1.21(1) + 0.0(0) + 8.31(1) + 27.28(0) + 0.0(1) + 11.40(0) ≈ 31.41 ants. (You can have Minitab do these calculations for you. Simply choose Stat > Regression > Regression > Predict and enter the predictor levels in the dialog box.)
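If you'd like to reproduce that kind of calculation outside Minitab, here's a minimal sketch using Python's statsmodels formula interface. The data values and column names are illustrative stand-ins, not the actual experiment's numbers; the point is that C(...) dummy-codes each factor against a reference level, just as the hand calculation above substitutes 1s and 0s.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Illustrative worksheet layout (made-up counts, not the real data)
ants = pd.DataFrame({
    "Ants":    [72, 34, 21, 39, 58, 18, 30, 63, 26],
    "Filling": ["HamPickles", "PeanutButter", "Vegemite", "PeanutButter",
                "HamPickles", "Vegemite", "PeanutButter", "HamPickles",
                "Vegemite"],
    "Bread":   ["Rye", "White", "Wholemeal", "Rye", "White", "Rye",
                "Wholemeal", "Wholemeal", "White"],
    "Butter":  ["Yes", "No", "Yes", "Yes", "No", "No", "Yes", "No", "Yes"],
})

# C(...) dummy-codes each categorical factor against a reference level
model = smf.ols("Ants ~ C(Filling) + C(Bread) + C(Butter)", data=ants).fit()

# Predict for a peanut butter sandwich on white bread, without butter
new = pd.DataFrame({"Filling": ["PeanutButter"],
                    "Bread": ["White"],
                    "Butter": ["No"]})
print(model.predict(new))
```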

One issue that can arise if you use ordinary linear regression with a count response is that, at certain predictor levels, the regression equation may estimate negative values for the response. But a negative "count" of ants—or anything else—doesn't make any sense. In that case, the equation may not be practically useful.

For this particular data set, it's not a problem. Using the regression equation, the lowest possible estimated response is for a Vegemite sandwich on white bread without butter (24.31 - 1.21), which yields an estimate of about 23 ants. Negative estimates don't occur here, mainly because the counts in this data set are all considerably greater than 0. But often that's not the case.

Evaluating the Model Fit and Assumptions

Regardless of whether you're performing ordinary linear regression with a continuous response variable or a discrete response variable of counts, it's important to assess the model fit, investigate extreme outliers, and check the model assumptions. If there's a serious problem, your results might not be valid.

The R-squared (adj) value suggests this model explains about half of the variation of the ant count (47.35%). Not great—but not bad for a linear regression model with only a few categorical predictors. For this particular analysis, the ANOVA output also includes a p-value for lack-of-fit.

If the p-value for lack-of-fit is less than 0.05, there's statistically significant evidence that the model does not fit the data adequately. For this model, the p-value is greater than 0.05, so there's not sufficient evidence to conclude that the model doesn't fit well. That's a good thing.

Minitab's regression output also flags unusual observations, based on the size of their residuals. Residuals, also called "model errors", measure how much the response values estimated by the regression model differ from the actual response values in your data. The smaller a residual, the closer the value estimated by the model is to the actual value in your data. If a residual is unusually large, it suggests that the observation may be an outlier that's "bucking the trend" of your model.

For the ant count sample data, three observations are flagged as unusual:

If you see unusual values in this table, it's not a cause for alarm. Generally, you can expect roughly 5% of the data values to have large standardized residuals. But if there's a lot more than that, or if the size of a residual is unusually large, you should investigate.
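Outside of Minitab, the same check takes a couple of lines. Continuing the statsmodels sketch from above (so `model` is the assumed fitted OLS results object), this flags any observation whose standardized residual exceeds 2 in absolute value:

```python
import numpy as np
from statsmodels.stats.outliers_influence import OLSInfluence

# Standardized (internally studentized) residuals from the fitted model
resid_std = OLSInfluence(model).resid_studentized_internal
flagged = np.flatnonzero(np.abs(resid_std) > 2)
print(f"{flagged.size} of {resid_std.size} observations flagged:", flagged)
```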

For this sample data set of 48 observations, the number of unusual observations is not worrisome. However, two of the observations (circled in red) appear to be very much out-of-whack with the other observations. To figure out why, I went back to the original sample data set online, and found this note from the experimenter:

"Two results are large outliers. A reading of 97 was due to…leaving a portion of sandwich behind from the previous observation (i.e., there were already ants there); and one of 2 was due to [the sandwich portion be placed] too far away from the entrance to the [ant] hill.”

Because these outliers can be attributed to a special (out-of-the-ordinary) cause, it would be OK to remove them and re-run the analysis, as long as you clearly state that you have done so (and why). However, in this case, removing these two outliers doesn't significantly change the overall results of the linear regression analysis anyway (for brevity, I won't include those results here).

Finally, examine the model assumptions for the regression analysis. In Minitab, choose Stat > Regression > Fit Regression Model. Then click Graphs and check Four in one.

The two plots on the left (the Normal Probability Plot and the Histogram) help you assess whether the residuals are normally distributed. Although normality of the residuals is a formal assumption for ordinary linear regression, the analysis is fairly robust (resilient) to this assumption if the data set is sufficiently large (greater than 15 or so). Here, the points fall along the line of the normal probability plot and the histogram shows a fairly normal distribution. All is well.

Constant variance of the residuals is a more critical assumption for linear regression. That means the residuals should be distributed fairly evenly and randomly across all the fitted (estimated) values. To assess constant variance, look at the Residuals versus Fits plot in the upper right. In the plot above, the points appear to be randomly scattered on both sides of the line representing a residual value of 0. Again, no evidence of a problem.
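For readers working outside Minitab, here's a sketch of the same two diagnostic plots with matplotlib and SciPy, again assuming the fitted statsmodels results object `model` from the earlier sketch:

```python
import matplotlib.pyplot as plt
import scipy.stats as stats

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Normality check: the points should hug the straight line
stats.probplot(model.resid, dist="norm", plot=ax1)
ax1.set_title("Normal probability plot of residuals")

# Constant-variance check: look for random scatter around zero,
# with no funnel ("megaphone") pattern
ax2.scatter(model.fittedvalues, model.resid)
ax2.axhline(0, linestyle="--")
ax2.set_title("Residuals versus fits")
ax2.set_xlabel("Fitted value")
ax2.set_ylabel("Residual")
plt.show()
```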

With this sample data, using ordinary linear regression with a count response seems to work OK. But with different count data, might things have worked out differently? We'll examine that in the next post (Part 2).

Meanwhile, kick back and fix yourself a ham and pickle sandwich on rye with butter. And keep an eye out for meat ants.

Regression with Meat Ants: Analyzing a Count Response (Part 2)


My previous post showed an example of using ordinary linear regression to model a count response. For that particular count data, shown by the blue circles on the dot plot below, the model assumptions for linear regression were adequately satisfied.

But frequently, count data contain many values equal to or close to 0. Also, the distribution of the counts may be right-skewed. In the quality field, this commonly occurs when you count the number of defects on each item, or the number of defectives in a sample.

So let's suppose that the number of ants coming to each sandwich portion was instead the count data shown by the red square symbols on the dot plot.

If you want to follow along, open the Minitab project file with the new count data. Set up and analyze the ordinary linear regression model the same way as in Part 1. You should get the following result:


For the new count data, notice that the general relationship between the predictors and the count response is similar to the one in the original data set. Both Filling and Butter are statistically significant predictors of the ant count response (p < 0.1). As before, the coefficients table shows that Ham and Pickles, With Butter are the predictor levels associated with the highest ant count.

But for the new count data, look what happens to the regression equation:


The equation now yields negative counts for some predictor values. For example, the estimated ant count for a Vegemite sandwich on white bread without butter is approximately -0.4. So if you drop that particular sandwich on a sidewalk in Australia, you can expect roughly negative half an ant to appear. A model that predicts antimatter. Hmm. Intriguing. But not very practical.

What about the model assumptions?

With the new count response data, the Residuals Versus Fits plot suggests that the critical assumption of constant variance may be violated. The spread of the residuals appears to increase as the fitted values of the model increase. This classic "megaphone" pattern in the residual plot is a problem—the model estimates get more erratic at higher fitted values.

When this happens, one common approach is to transform the response data to stabilize the variance. In fact, Minitab's linear regression analysis includes an option to perform a Box-Cox transformation (which is a family of power transformations that includes log transform, square root, and other transformation functions) for situations like this. But here's the catch: In many cases, count data can be problematic to transform, especially if they contain the value 0.

For example, try to perform the Box-Cox transformation in Minitab with the new count data, and you'll get this error message.

Even if you try a transformation that can handle counts of 0, you might run into problems due to poor discrimination in your count data. So, when you use ordinary linear regression with a count response, and one of the critical assumptions isn't met—you may find yourself up a creek without a log (or other) transform.

And even if your count data don't include 0, or you manage to find a transformation that works (or use sleight-of-hand to replace 0s in your data set with very minuscule decimal values to make all the data positive), the resulting model with the transformed values may still yield problematic estimates for a count response.
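You can see the zero-count problem directly with SciPy's Box-Cox function (standing in here for Minitab's Box-Cox option); a minimal sketch with made-up counts:

```python
import numpy as np
from scipy import stats

counts = np.array([0, 1, 1, 2, 3, 5, 8, 13])  # made-up counts with a zero

try:
    transformed, lam = stats.boxcox(counts)
except ValueError as err:
    # Every member of the Box-Cox family needs strictly positive data
    print("Box-Cox failed:", err)

# A common workaround is log(y + 1), but the model fit to the
# transformed values may still yield awkward estimates for counts.
shifted = np.log1p(counts)
```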

Now what?

Well, instead of using Alka Seltzer to clean your toilet bowl, how about using a product that was specifically designed to clean it, such as a toilet bowl cleaner?

That is, instead of using ordinary linear regression, which is technically designed to evaluate a continuous response, why not use a regression analysis specifically designed to analyze a count response? Stay tuned for the next post (Part 3).
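One such analysis is Poisson regression, a generalized linear model built for counts. As a preview, here's my sketch in Python's statsmodels (reusing the illustrative `ants` data frame from the Part 1 sketch, not anything from Minitab): the log link guarantees that predicted counts are never negative.

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

# `ants` is the illustrative data frame from the Part 1 sketch
poisson_fit = smf.glm(
    "Ants ~ C(Filling) + C(Bread) + C(Butter)",
    data=ants,
    family=sm.families.Poisson(),  # log link by default
).fit()
print(poisson_fit.summary())
# exp(coefficient) is the multiplicative effect on the expected count,
# and predictions exp(x'beta) are always positive.
```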

Regression in the Real World


I recently guest lectured for an applied regression analysis course at Penn State. Now, before you begin making certain assumptions—because as any statistician will tell you, assumptions are important in regression—you should know that I have no teaching experience whatsoever, and I’m not much older than the students I addressed.

I’m just 5 years removed from my undergraduate days at Virginia Tech, so standing behind the podium felt backwards. But it certainly provoked fond memories of my days in Hutcheson Hall—home to Virginia Tech’s Department of Statistics—learning the same concepts these students are learning.

Minitab's regression menu

I remember what it was like to be in their shoes. For instance, I had no idea how calculating an R-squared value by hand could meaningfully contribute to my life after college—especially when software like Minitab can so easily do the work for me.

Now I know better.

I wanted to show these students how regression and other statistical concepts could be applied in their future lines of work, and how integral a role statisticians can play in any business. As a student, I failed to grasp this because I was more concerned with the letter grade I needed to obtain to maintain a high GPA. Giving this guest lecture was an opportunity to renounce my old, flawed mentality.

What 5 years in the real world has taught me

Since graduating, I’ve engaged with many Minitab users, and I’ve worked with Minitab’s statistical consultants, who address a variety of business requests for statistical help. I’ve encountered numerous practical applications of how the tools in Minitab help people who need to analyze their data—many of whom lack formal statistical training—select the proper analysis, draw useful conclusions, and make key business decisions that lead to improved processes and increased profits.

If you want to draw actionable conclusions from your data, selecting and using the appropriate statistical tool is half the battle! In so many cases, people choose the wrong tools because they misunderstand what kind of data they have, or they haven’t fully defined the problem they are trying to solve. Questions like ‘Are my data categorical or continuous?’, ‘Are my observations independent?’, ‘How are my data distributed?’, and ‘Does this analysis assume a particular distribution?’ are often difficult to answer, making selection of a statistical tool even more intimidating.

Assistant regression menu in Minitab 17

Fortunately, Minitab makes this entire process easy for professionals with any level of statistical expertise—our software provides the proper tool belt for solving statistical problems, including tools in the Assistant menu which offer detailed guidance to help the professional confidently choose the right analyses and make informed decisions for their business.

But formal training in statistical methods can give students a big advantage in the workplace. And regression—a very practical tool for modeling data and predicting outcomes—was my platform for communicating this idea, based on the experiences of some of Minitab’s own consulting statisticians.

What problems can regression solve?

If the data were accessible, I could use a regression analysis to show one thing that hasn’t changed about college—early morning classes have awful attendance rates, no matter what the subject is. My first lecture was during a Friday 8:00 a.m. section; the room was about half empty. And of those who actually attended, about half drowsily filtered in during the first 10 minutes of class.

But as I talked, I could practically see the synapses firing for some of those students. I presented several examples where regression came to the rescue of businesses in real world settings. These businesses had particular questions, and Minitab’s statistical consultants used regression to provide the answers.

I showed the students how:

  • A pharmaceutical company used regression to assess the stability of the active ingredient in a drug to predict its shelf life in order to meet FDA regulations and identify a suitable expiration date for the drug.
  • A credit card company applied regression analysis to predict monthly gift card sales and improve yearly revenue projections.
  • A hotel franchise used regression to identify a profile for and predict potential clients who might default on a timeshare loan in order to reduce loan qualification rates among high-risk clients, adjust interest rates based on client risk factors, and minimize company losses.
  • An insurance company used regression to determine the likelihood of a true problem existing when a home insurance claim was filed, in order to discourage customers from filing excessive or petty claims.

The results of these regression analyses, along with help from Minitab’s statistical consultants, gave these companies the confidence to make decisions they knew would improve their business. They were encouraged to identify solutions to address problem areas, and to implement new processes within their organization as well as new strategies to promote products and services to their clientele—knowing they could collect data and use the same tools again in the future to prove that changes were impactful.

The moral of the story

In a world filled with data, students who learn statistics leave college with a skill set that is highly sought after. When it comes to working with data, they will have advantages over professionals who don’t have formal statistical training.

But as most of us know, there’s more to data analysis than memorizing a bunch of equations. And even experienced statisticians can forget some of the nuances involved in analyses they haven’t used in a while. What they don’t forget is how to attack a problem.

In the end, that’s what I wanted these students to understand—that data-driven questions really boil down to problem-solving. What question am I trying to answer, and how do I tackle it?

The real world is about identifying a problem when we encounter it, choosing the right tool to solve it, and interpreting the answer in a way that drives a manager or executive to enact change within a business. Because businesses face real questions and challenges that statisticians—and software like Minitab—can help answer.

Specification Limits and Stability Studies


I was recently asked a couple of questions about stability studies in Minitab.

Question 1: If I enter a lower and upper spec in the Stability Study dialog window, why do I see only one confidence bound per fitted line on the resulting graph? Shouldn’t there be two?

You use a stability study to analyze the stability of a product over time and to determine the product's shelf life. In order to run this in Minitab, you need:

  • Measurement data
  • Time variable
  • Batch factor (optional)

Shown below is a sample of the first 14 rows of a Stability Study data set. The full data set can be found in our Sample Data folder within Minitab Statistical Software. You can download a free 30-day trial of the software if you're not already using it. You can access this folder via File > Open Worksheet, then click on the 'Look in Sample Data Folder' button. The file is called shelflife.mtw. 

 

The Month column records the age of the product, in months, at the time each measurement was collected. The Batch column identifies the batch that the product came from. In the sixth row, for example, the drug concentration percentage for Batch 2 at 3 months was 99.478%.

With this information, the stability study will help you estimate the average length of time that the response will be within specification. To satisfy my inquisitor’s first question, we will use a lower spec of 90% and an upper spec of 105%. 

The Stability Study dialog box:

 
 

The Resulting Graph:

Minitab first checks whether the starting point of the fitted line is between the specs, and then determines the direction of the slope of the fitted lines before deciding which limit to calculate the shelf life from. If the decrease in the mean response over time is significant, then Minitab calculates the shelf life relative to the lower specification limit.

If the increase in the mean response over time is significant, Minitab calculates the shelf life relative to the upper specification limit. Which bound appears on the graph therefore depends on which spec Minitab has sided with: the 95% lower bound is shown only in relation to the corresponding fitted line above it. Conceptually, if the slope of the mean response line is trending downward, you look at where its worst-case scenario, the 95% lower bound, intersects the lower spec. The overall shelf life for the batches is 54.79 months for a lower spec of 90% concentration.
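At its core, the shelf-life calculation is a line-crossing problem. Here's a deliberately simplified sketch in Python that intersects only the fitted mean line with the spec (Minitab intersects the 95% confidence bound instead, which yields a shorter, more conservative shelf life; the coefficients below are made-up values):

```python
def shelf_life(intercept, slope, lower_spec=None, upper_spec=None):
    """Months until the fitted mean line crosses the relevant spec.

    Simplified: the real calculation intersects the 95% confidence
    bound with the spec, giving a more conservative answer.
    """
    if slope < 0 and lower_spec is not None:
        return (lower_spec - intercept) / slope   # decreasing response
    if slope > 0 and upper_spec is not None:
        return (upper_spec - intercept) / slope   # increasing response
    return None  # no significant trend toward a spec: no shelf life

# Made-up example: concentration starts near 100% and falls 0.14%/month
print(shelf_life(100.1, -0.14, lower_spec=90))  # about 72 months
```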

Question 2: I get asterisks for Shelf Life for each Batch, as shown below:

Batch      Shelf Life
1          *
2          *
3          *
4          *
5          *
Overall    *

This question is closely related to the first. The answer depends on the slope’s direction and on which specification—lower or upper—you have entered. Most likely, you won’t run into this situation if:

a.  Your fitted line has a significant negative slope and you are only inputting a lower spec.

b.  Your fitted line has a significant positive slope and you are only inputting an upper spec.

If you run a stability study with two specs, you may receive these asterisks if the mean response at time = 0 is not within both specifications. You can see this when we use a lower spec of 90 and an upper spec of 98:

 

For a batch with a negative slope whose response starts out above the upper spec, the response is out of spec from time 0, so there is no period during which the product is within spec, at least based on this model's prediction. Minitab can't calculate a shelf life for those batches.

It’s a different story in the first question we discussed, as the mean response at time = 0 was below the upper spec:

On a side note, there is another situation which can cause you to obtain all asterisks for the shelf life of the batches. This will happen when the slopes of all fitted lines on the graph are simply not significant. 

I hope this information helps you when you perform your next stability study!

 

Statistical Analyses of the House Freedom Caucus and the Search for a New Speaker


Flag of the United States of America

With Speaker John Boehner resigning, Kevin McCarthy quitting before the vote for him to be Speaker, and a possible government shutdown in the works, the Freedom Caucus has certainly been in the news frequently! Depending on your political bent, the Freedom Caucus has caused quite a disruption for either good or bad.

Who are these politicians? The Freedom Caucus is a group of approximately 40 Republicans in the U.S. House of Representatives. You may also know this group as the “Hell No” caucus, and they are a key part of the fractured Republican House. In all of the articles and blogs I’ve read, they are described as an extremely conservative, far-right group. This extreme conservatism is generally considered to be the defining characteristic.

However, in the Republican presidential race, we’ve seen that the usual debate over the candidates’ conservative credentials has been overshadowed by the outsiders. In other words, there’s an assessment of each candidate’s conservativeness as well as their establishmentarianism.

Is there evidence that an establishment/anti-establishment split is also a factor among the Republicans in the House of Representatives and their search for a new Speaker of the House? In this blog post, I’ll use data and statistical analyses to test these hypotheses!

Data for these Analyses

I obtained the data for these analyses from voteview.com. This group runs an algorithm that uses roll call votes to estimate each politician’s conservativeness and their support of the party establishment. I added a variable that identifies Freedom Caucus membership using the information in this Wikipedia article.

For these data, higher conservative scores indicate that the politician is more conservative. Higher establishmentarianism scores indicate that the politician is more supportive of the establishment while lower scores indicate an anti-establishment position.

Scatterplot of the House Republicans

Graphing the data is always a good place to start for any analysis. The scatterplot below displays a point for each Republican member of the House by their Establishment and Conservativeness scores. The data points that are further right are more conservative. The points that are closer to the bottom are more anti-establishment. Red points identify members of the Freedom Caucus.

Scatterplot of House Republicans

The graph shows that not all members of the Freedom Caucus are extremely conservative. Some are right in the middle! However, all members of the Freedom Caucus are at least on the right half of the graph. These members are also in the bottom, anti-establishment half, which keeps the door open for the hypothesis that we'll test. 

Binary Logistic Regression

Let’s test this formally with statistics. To do this, I’ll use binary logistic regression in Minitab statistical software because the response variable is binary. The Republican House members can only either belong to the Freedom Caucus (Yes) or not (No).

Response information table for binary logistic regression

The Response Information table displays general information about the analysis. There are 36 members of the Freedom Caucus out of 247 House Republicans in the analysis.

Deviance table for binary logistic regression

The Deviance Table is like the ANOVA table in a linear regression analysis. This table shows us that both the Conservativeness and Establishmentarianism of the politicians are very statistically significant (p = 0.000). We can conclude that changes in the values of these two predictors are associated with changes in the probability that a politician is a member of the Freedom Caucus.

The interaction between the two predictors was not statistically significant, so I did not include it in the final model.
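For anyone who wants to mirror this analysis outside Minitab, here's a minimal sketch in Python's statsmodels. The data frame `house` and its column names are my assumptions about how you might lay out the voteview.com scores, one row per House Republican:

```python
import statsmodels.formula.api as smf

# Assumed columns: Caucus (1 = Freedom Caucus member, 0 = not),
# Conservativeness, Establishmentarianism
logit_fit = smf.logit(
    "Caucus ~ Conservativeness + Establishmentarianism", data=house
).fit()
print(logit_fit.summary())
# A negative coefficient on Establishmentarianism means membership
# becomes more likely as a politician becomes more anti-establishment.
```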

Graph the Results to Understand the Binary Logistic Regression Model

The easiest way to understand these results is to graph them. When you fit a variety of model types in Minitab 17, the analysis stores that model in the worksheet. You can then use a variety of handy features to quickly and easily explain what your model really means.

The graph below displays the probabilities associated with the values of the two predictors. The highest probabilities for Freedom Caucus membership are in the bottom right for politicians who are both very conservative and very anti-establishment.

Contour plot of the probability of belonging to the Freedom Caucus

In the main effects plot below, Minitab graphs the effect of each variable independently while the other variable is held constant.

Main effects plot of conservativeness and establishmentarianism

On the Conservativeness side, the graph shows that as a politician becomes more conservative (by moving right), their probability of membership in the Freedom Caucus increases. In fact, the probability really starts to shoot up fast around a score of 0.5. On the Establishmentarianism side, as a politician becomes more anti-establishment (by moving left), their probability of Freedom Caucus membership also increases at an increasing rate.

Collectively, the statistical analyses show that membership in the Freedom Caucus is not as simple as being on the far right end of the political spectrum. Instead, this group has a mixture of very conservative and anti-establishment sentiment driving their actions. Understanding this multidimensional fracture in the Republican Party helps explain why it is so difficult to form a more cohesive caucus and to choose a new Speaker of the House.

When Kevin McCarthy refused to run for Speaker, many called on Paul Ryan as the ideal candidate to unify the House Republicans. Although Ryan appears to have declined this call to duty, he provides a notion of what the ideal Speaker looks like in this new environment.

To compare McCarthy to Ryan across both characteristics, I standardized their raw scores to account for any differences in the scaling of the two variables. The table shows their Z-values, which give the number of standard deviations that each politician falls from the House Republican mean for each variable.

             Conservatism    Establishmentarianism
McCarthy     -0.169           0.549
Ryan          0.496          -1.180

Compared to McCarthy, Ryan has a moderately more conservative score, but he is notably more anti-establishment. This larger difference indicates which way the political winds are blowing!
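For the record, standardizing the raw scores is one line per variable. Continuing the assumed `house` data frame from the sketch above (with an assumed Name column for the politicians):

```python
# z = (raw score - House Republican mean) / standard deviation
for col in ("Conservativeness", "Establishmentarianism"):
    house[col + "_z"] = (house[col] - house[col].mean()) / house[col].std()

print(house.loc[house["Name"].isin(["McCarthy", "Ryan"]),
                ["Name", "Conservativeness_z", "Establishmentarianism_z"]])
```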

Beware of Phantom Degrees of Freedom that Haunt Your Regression Models!


As Halloween approaches, you are probably taking the necessary steps to protect yourself from the various ghosts, goblins, and witches that are prowling around. Monsters of all sorts are out to get you, unless they’re sufficiently bribed with candy offerings!

I’m here to warn you about a ghoul that all statisticians and data scientists need to be aware of: phantom degrees of freedom. These phantoms are really sneaky. You can be out, fitting a regression model, looking at your output, and thinking everything is fine. Then, whammo, these phantoms get you! They suck the explanatory and predictive power right out of your regression model but, deviously, leave all of the output looking just fine. Now that’s truly spooky!

In this blog post, I’ll show you how these phantoms work and how to avoid their dastardly deeds!

What Are Normal Degrees of Freedom in Regression Models?

I’ve written previously about the dangers of overfitting your regression model. An overfit model is one that is too complicated for your data set.

You can learn only so much from a data set of a given size. A degree of freedom is a measure of how much you’ve learned. Your model uses these degrees of freedom with every parameter that it estimates. If you use too many, you’re overfitting the model. The end result is that the regression coefficients, p-values, and R-squared can all be misleading.

You can detect overfit models by looking at the number of observations per parameter estimate and assessing the predicted R-squared. However, these methods won’t necessarily detect the misbegotten effects of summoning an excessive number of phantom degrees of freedom!

In the degrees of freedom (DF) column in the ANOVA table below, you can see that this regression model uses 3 degrees of freedom out of a total of 28. It appears that this model is fine. Or is it? <Cue evil laugh!>

Analysis of variance table for a regression model

What Are Phantom Degrees of Freedom?

Phantom degrees of freedom are devilish because they latch onto you through the manner in which you settle on the final model. They are not detectable in the output for the final model even as they haunt your regression models.

The dangers of invoking too many phantom degrees of freedom!

Every time your incantation adds or removes predictors from a model based on a statistical test, you invoke a phantom degree of freedom because you’re learning something from your data set. However, even when you summon many phantom degrees of freedom during the model selection process, they are not evident in Minitab’s output for the final model. That is what makes them phantoms.

When you invoke too many phantoms, your regression model becomes haunted. This occurs because you’re performing many statistical tests, and every statistical test has a false positive rate. When you try many different models, you're bound to find variables that appear to be significant but are correlated only by chance. These relationships are nothing more than ghostly apparitions!

To protect yourself from this type of bewitching, you need to understand the environment that these phantoms inhabit. Phantom degrees of freedom have the strongest powers when you have a small-to-moderate sample size, many potential predictors, correlated predictors, and when the light of knowledge does not illuminate your conception of the true model.

In this scenario, you are likely to fit many possible models, adding and removing different predictors, and testing curvature and interaction terms in an attempt to conjure an answer out of the darkness. Perhaps you use an automatic incantation procedure like stepwise or best subsets regression. If you have multicollinearity, the parameter estimates are particularly unhinged.

The ANOVA table we saw above appears to be perfectly normal, but it could be haunted. To divine the truth, you must understand the entire ritual that incited the final model to materialize. If you start out with 20 variables, a sample size of 29, and fit many models to see what works, you could conjure a possessed model beguiling you to accept false conclusions.

In fact, this method of dredging through data to see what sticks casts such a diabolical spell that it can manifest a statistically significant regression model with a high R-squared from completely random data! Beware—this is the environment that the phantoms inhabit!

How to Protect Yourself from the Phantom Degrees of Freedom

To protect yourself from phantom degrees of freedom, information and advance planning are your best talismans. Use the following rites to shine the light of truth on your research and to guide yourself out of the darkness:

  • Conduct prior research about the important variables and their relationships to help you specify the best regression model without the need for data mining.
  • Collect a large enough sample size to support the level of model complexity that you will need.
  • Avoid data mining and keep track of how many phantom degrees of freedom you raise before arriving at your final model.

For more information about avoiding haunted models, read my post about How to Choose the Best Regression Model.

Happy Halloween!

 

"Buer." Licensed under Public Domain via Commons.

Practical Statistical Problem Solving Using Minitab to Explore the Problem


By Matthew Barsalou, guest blogger

A problem must be understood before it can be properly addressed. A thorough understanding of the problem is critical when performing a root cause analysis (RCA), and an RCA is necessary if an organization wants to implement corrective actions that truly address the root cause of the problem. An RCA may also be necessary for process improvement projects; it is necessary to understand the cause of the current level of performance before attempts are made to improve that performance.

Many statistical tests related to problem-solving can be performed using Minitab Statistical Software. However, the actual test you select should be based upon the type of data you have and what needs to be understood. The figure below shows various statistical options structured in a cause-and-effect diagram with the main branches based on characteristics that describe what the tests and methods are used for.

The main branch labeled “differences” is split into two high-level sub-branches: hypothesis tests that have an assumption of normality, and non-parametric tests of medians. The hypothesis tests assume data is normally distributed and can be used to compare means, variances, or proportions to either a given value or to the value of a second sample. An ANOVA can be performed to compare the means of two or more samples.

The non-parametric tests listed in the cause-and-effect diagram are used to compare medians, either to a specified value, or two or more medians, depending upon which test is selected. The non-parametric tests provide an option when data is too skewed to use other options, such as a Z-test.

Time may also be of interest when exploring a problem. If your data are recorded in order of occurrence, a time series plot can be created to show each value at the time it was produced; this may give insights into potential changes in a process.

A trend analysis looks much like the time series plot; however, Minitab also tests for potential trends in the data such as increasing or decreasing values over time. Exponential smoothing options are available to assign exponentially decreasing weights to the values over time when attempting to predict future outcomes.

Relationships can be explored using various types of regression analysis to identify potential correlations in the data such as the relationship between the hardness of steel and the quenching time of the steel. This can be helpful when attempting to identify the factors that influence a process. Another option for understanding relationships is Design of Experiments (DoE), where experiments are planned specifically to economically explore the effects and interactions between multiple factors and a response variable.

Another main branch is for capability and stability assessments. There are two main sub-branches here; one is for measures of process capability and performance and the other is for Statistical Process Control (SPC), which can assess the stability of a process.

The measures of process performance and capability can be useful for establishing the baseline performance of a process; this can be helpful in determining whether process improvement activities have actually improved the process. The SPC sub-branch is split into three lower-level sub-branches: control charts for attribute data, such as the number of defective units; control charts for continuous data, such as diameters; and time-weighted charts that don’t give all values equal weights.

Control charts can be used both for assessing the current performance of a process, such as by using an individuals chart to determine if the process is in a state of statistical control, and for monitoring the performance of a process, such as after improvements have been implemented.

Exploratory data analysis (EDA) can be useful for gaining insights into the problem using graphical methods. The individual value plot is useful for simply observing the position of each value relative to the other values in a data set. A box plot, for example, can be helpful when comparing the means, medians, and spread of data from multiple processes. The purpose of EDA is not to form conclusions, but to gain insights that can be helpful in forming tentative hypotheses or in deciding which type of statistical test to perform.

The tests and methods presented here do not cover all available statistical tests and methods in Minitab; however, they do provide a large selection of basic options to choose from.

These tools and methods are helpful when exploring a problem, but their use should not be limited to problem exploration. They can also be helpful for planning and verifying improvements. For example, an individual value plot may indicate that one process performs better than a comparable process, and this can then be confirmed using a two-sample t-test, as in the sketch below. Or, the settings of the better process can be used to plan a DoE to identify the optimal settings for the two processes, and the improvements can be monitored using an Xbar-S chart for the two processes.
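As one concrete illustration, here's a minimal sketch of that follow-up two-sample t-test in Python with SciPy (the measurements are made-up values for two hypothetical processes):

```python
from scipy import stats

process_a = [9.8, 10.1, 10.0, 9.9, 10.2, 10.0]   # made-up measurements
process_b = [10.4, 10.6, 10.3, 10.5, 10.7, 10.4]

# Welch's two-sample t-test (does not assume equal variances)
t_stat, p_value = stats.ttest_ind(process_a, process_b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```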

 

About the Guest Blogger

Matthew Barsalou is a statistical problem resolution Master Black Belt at BorgWarner Turbo Systems Engineering GmbH. He is a Smarter Solutions certified Lean Six Sigma Master Black Belt, ASQ-certified Six Sigma Black Belt, quality engineer, and quality technician, and a TÜV-certified quality manager, quality management representative, and auditor. He has a bachelor of science in industrial sciences, a master of liberal studies with emphasis in international business, and a master of science in business administration and engineering from the Wilhelm Büchner Hochschule in Darmstadt, Germany. He is the author of the books Root Cause Analysis: A Step-By-Step Guide to Using the Right Tool at the Right Time, Statistics for Six Sigma Black Belts, and The ASQ Pocket Guide to Statistics for Six Sigma Black Belts.


So Why Is It Called "Regression," Anyway?


Did you ever wonder why statistical analyses and concepts often have such weird, cryptic names?

One conspiracy theory points to the workings of a secret committee called the ICSSNN. The International Committee for Sadistic Statistical Nomenclature and Numerophobia was formed solely to befuddle and subjugate the masses. Its mission: To select the most awkward, obscure, and confusing name possible for each statistical concept.

A whistle-blower recently released the following transcript of a secretly recorded ICSSNN meeting:

"This statistical analysis seems pretty straightforward…"

“What does it do?”

“It describes the relationship between one or more 'input' variables and an 'output' variable. It gives you an equation to predict values for the 'output' variable, by plugging in values for the input variables."

“Oh dear. That sounds disturbingly transparent.”

“Yes. We need to fix that—call it something grey and nebulous. What do you think of 'regression'?”

“What’s 'regressive' about it?”

“Nothing at all. That’s the point!”

“Re-gres-sion. It does sound intimidating. I’d be afraid to try that alone.”

“Are you sure it’s completely unrelated to anything?  Sounds a lot like 'digression.' Maybe it’s what happens when you add up umpteen sums of squares…you forget what you were talking about.”

“Maybe it makes you regress and relive your traumatic memories of high school math…until you  revert to a fetal position?”

“No, no. It’s not connected with anything concrete at all.”

“Then it’s perfect!”

 “I don’t know...it only has 3 syllables. I’d feel better if it were at least 7 syllables and hyphenated.”

“I agree. Phonetically, it’s too easy…people are even likely to pronounce it correctly. Could we add an uvular fricative, or an interdental retroflex followed by a sustained turbulent trill?”

The Real Story: How Regression Got Its Name

Conspiracy theories aside, the term “regression” in statistics was probably not a result of the workings of the ICSSNN. Instead, the term is usually attributed to Sir Francis Galton.

Galton was a 19th century English Victorian who wore many hats: explorer, inventor, meteorologist, anthropologist, and—most important for the field of statistics—an inveterate measurement nut. You might call him a statistician’s statistician. Galton just couldn’t stop measuring anything and everything around him.

During a meeting of the Royal Geographical Society, Galton devised a way to roughly quantify boredom: he counted the number of fidgets of the audience in relation to the number of breaths he took (he didn’t want to attract attention using a timepiece). Galton then converted the results on a time scale to obtain a mean rate of 1 fidget per minute per person. Decreases or increases in the rate could then be used to gauge audience interest levels. (That mean fidget rate was calculated in 1885. I’d guess the mean fidget rate is astronomically higher today—especially if glancing at an electronic device counts as a fidget.)

Galton also noted the importance of considering sampling bias in his fidget experiment:

“These observations should be confined to persons of middle age. Children are rarely still, while elderly philosophers will sometimes remain rigid for minutes.”

But I regress…

Galton was also keenly interested in heredity. In one experiment, he collected data on the heights of 205 sets of parents with adult children. To make male and female heights directly comparable, he rescaled the female heights, multiplying them by a factor of 1.08. Then he calculated the average of the two parents' heights (which he called the “mid-parent height”) and divided them into groups based on the range of their heights. The results are shown below, replicated on a Minitab graph.

For each group of parents, Galton then measured the heights of their adult children and plotted their median heights on the same graph.

Galton fit a line to each set of heights, and added a reference line to show the average adult height (68.25 inches).

Like most statisticians, Galton was all about deviance. So he represented his results in terms of deviance from the average adult height.

Based on these results, Galton concluded that as heights of the parents deviated from the average height (that is as they became taller or shorter than the average adult), their children tended to be less extreme in height. That is, the heights of the children regressed to the average height of an adult.

He calculated the rate of regression as 2/3 of the deviance value. So if the average height of the two parents was, say, 3 inches taller than the average adult height, their children would tend to be (on average) approximately 2/3*3 = 2 inches taller than the average adult height.

Galton published his results in a paper called “Regression towards Mediocrity in Hereditary Stature.”

So here’s the irony: The term regression, as Galton used it, didn't refer to the statistical procedure he used to determine the fit lines for the plotted data points. In fact, Galton didn’t even use the least-squares method that we now most commonly associate with the term “regression.” (The least-squares method had already been developed some 80 years previously by Gauss and Legendre, but wasn’t called “regression” yet.) In his study, Galton just "eyeballed" the data values to draw the fit line.

For Galton, “regression” referred only to the tendency of extreme data values to "revert" to the overall mean value. In a biological sense, this meant a tendency for offspring to revert to average size ("mediocrity") as their parentage became more extreme in size. In a statistical sense, it meant that, with repeated sampling, a variable that is measured to have an extreme value the first time tends to be closer to the mean when you measure it a second time. 

Later, as he and other statisticians built on the methodology to quantify correlation relationships and to fit lines to data values, the term “regression” became associated with the statistical analysis that we now call regression. But it was just by chance that Galton's original results using a fit line happened to show a regression of heights. If his study had shown increasing deviance of children's heights from the average compared to their parents, perhaps we'd be calling it "progression" instead.

So, you see, there’s nothing particularly “regressive” about a regression analysis.

And that makes the ICSSNN very happy.

Don't Regress....Progress

Never let intimidating terminology deter you from using a statistical analysis. The sign on the door is often much scarier than what's behind it. Regression is an intuitive, practical statistical tool with broad and powerful applications.

If you’ve never performed a regression analysis before, a good place to start is the Minitab Assistant. See Jim Frost’s post on using the Assistant to perform a multiple regression analysis. Jim has also compiled a helpful compendium of blog posts on regression.

And don’t forget Minitab Help. In Minitab, choose Help > Help. Then click Tutorials > Regression, or Stat Menu > Regression.

Sources

Bulmer, M. Francis Galton: Pioneer of Heredity and Biometry. Johns Hopkins University Press, 2003.

Davis, L. J. Obsession: A History. University of Chicago Press, 2008.

Galton, F. “Regression towards Mediocrity in Hereditary Stature.”  http://galton.org/essays/1880-1889/galton-1886-jaigi-regression-stature.pdf

Gillham, N. W. A Life of Sir Francis Galton. Oxford University Press, 2001.

Gould, S. J. The Mismeasure of Man. W. W. Norton, 1996.

How to Compare Regression Slopes


If you perform linear regression analysis, you might need to compare different regression lines to see if their constants and slope coefficients are different. Imagine there is an established relationship between X and Y. Now, suppose you want to determine whether that relationship has changed. Perhaps there is a new context, process, or some other qualitative change, and you want to determine whether that affects the relationship between X and Y.

For example, you might want to assess whether the relationship between the height and weight of football players is significantly different than the same relationship in the general population.

You can graph the regression lines to visually compare the slope coefficients and constants. However, you should also statistically test the differences. Hypothesis testing helps separate the true differences from the random differences caused by sampling error so you can have more confidence in your findings.

In this blog post, I’ll show you how to compare a relationship between different regression models and determine whether the differences are statistically significant. Fortunately, these tests are easy to do using Minitab statistical software.

In the example I’ll use throughout this post, there is an input variable and an output variable for a hypothetical process. We want to compare the relationship between these two variables under two different conditions. Here is the Minitab project file with the data.

Comparing Constants in Regression Analysis

When the constants (or y intercepts) in two different regression equations are different, this indicates that the two regression lines are shifted up or down on the Y axis. In the scatterplot below, you can see that the Output from Condition B is consistently higher than Condition A for any given Input value. We want to determine whether this vertical shift is statistically significant.

Scatterplot with two regression lines that have different constants.

To test the difference between the constants, we just need to include a categorical variable that identifies the qualitative attribute of interest in the model. For our example, I have created a variable for the condition (A or B) associated with each observation.

To fit the model in Minitab, I’ll use: Stat > Regression > Regression > Fit Regression Model. I’ll include Output as the response variable, Input as the continuous predictor, and Condition as the categorical predictor.

In the regression analysis output, we’ll first check the coefficients table.

Coefficients table that shows that the constants are different

This table shows us that the relationship between Input and Output is statistically significant because the p-value for Input is 0.000.

The coefficient for Condition is 10 and its p-value is significant (0.000). The coefficient tells us that the vertical distance between the two regression lines in the scatterplot is 10 units of Output. The p-value tells us that this difference is statistically significant—you can reject the null hypothesis that the distance between the two constants is zero. You can also see the difference between the two constants in the regression equation table below.

Regression equation table that shows constants that are different
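If you ever want to run the same test outside Minitab, here's a minimal sketch in Python with statsmodels. The data are synthetic stand-ins generated to mimic the example (the real values live in the Minitab project file linked above), and the column names Output, Input, and Condition simply mirror the example:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    # Synthetic stand-in for the example data: Condition B is shifted up by 10.
    rng = np.random.default_rng(0)
    n = 50
    inp = rng.uniform(0, 20, n)
    cond = rng.choice(["A", "B"], n)
    out = 2 * inp + 10 * (cond == "B") + rng.normal(0, 2, n)
    df = pd.DataFrame({"Output": out, "Input": inp, "Condition": cond})

    # Including Condition as a categorical predictor tests whether the two
    # regression lines have different constants (a vertical shift).
    fit = smf.ols("Output ~ Input + C(Condition)", data=df).fit()
    print(fit.params["C(Condition)[T.B]"])   # estimated vertical distance
    print(fit.pvalues["C(Condition)[T.B]"])  # tests whether the shift is zero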

Comparing Coefficients in Regression Analysis

When two slope coefficients are different, a one-unit change in a predictor is associated with different mean changes in the response. In the scatterplot below, it appears that a one-unit increase in Input is associated with a greater increase in Output in Condition B than in Condition A. We can see that the slopes look different, but we want to be sure this difference is statistically significant.

Scatterplot that shows two slopes that are different

How do you statistically test the difference between regression coefficients? It sounds like it might be complicated, but it is actually very simple. We can even use the same Condition variable that we did for testing the constants.

We need to determine whether the coefficient for Input depends on the Condition. In statistics, when we say that the effect of one variable depends on another variable, that’s an interaction effect. All we need to do is include the interaction term for Input*Condition!

In Minitab, you can specify interaction terms by clicking the Model button in the main regression dialog box. After I fit the regression model with the interaction term, we obtain the following coefficients table:

Coefficients table that shows different slopes

The table shows us that the interaction term (Input*Condition) is statistically significant (p = 0.000). Consequently, we reject the null hypothesis and conclude that the difference between the two coefficients for Input (below, 1.5359 and 2.0050) does not equal zero. We also see that the main effect of Condition is not significant (p = 0.093), which indicates that the difference between the two constants is not statistically significant.

Regression equation table that shows different slopes
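The same sketch extends to the slope test: swapping + for * in the formula adds the interaction term, and its coefficient is the estimated difference between the two slopes (this continues the hypothetical df built in the earlier sketch):

    import statsmodels.formula.api as smf

    # df as constructed in the earlier sketch; * adds the Input:Condition term.
    fit2 = smf.ols("Output ~ Input * C(Condition)", data=df).fit()
    print(fit2.params["Input:C(Condition)[T.B]"])   # difference between the slopes
    print(fit2.pvalues["Input:C(Condition)[T.B]"])  # tests whether the slopes are equal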

It is easy to compare and test the differences between the constants and coefficients in regression models by including a categorical variable. These tests are useful when you can see differences between regression models and you want to defend your conclusions with p-values.

If you're learning about regression, read my regression tutorial!

When Should You Fit a Non-Hierarchical Regression Model?


In the world of linear models, a hierarchical model contains all lower-order terms that comprise the higher-order terms that also appear in the model. For example, a model that includes the interaction term A*B*C is hierarchical if it includes these terms: A, B, C, A*B, A*C, and B*C.
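To make the definition concrete, here is a small illustration using Python's patsy formula notation (the columns A, B, and C are placeholders, not data from any particular study). The * operator expands to the full hierarchical set, while a non-hierarchical model spells out only the terms it keeps:

    import pandas as pd
    from patsy import dmatrix

    df = pd.DataFrame({"A": [1.0, 2, 3, 4], "B": [2.0, 1, 4, 3], "C": [3.0, 4, 1, 2]})

    # A*B*C is shorthand for the hierarchical set:
    # A + B + C + A:B + A:C + B:C + A:B:C
    print(dmatrix("A * B * C", df).design_info.term_names)

    # A non-hierarchical model omits a lower-order term (here, A:B).
    print(dmatrix("A + B + C + A:C + B:C + A:B:C", df).design_info.term_names)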

Minitab dialog box that asks about a non-hierarchical regression model

Fitting the correct regression model can be as much of an art as it is a science. Consequently, there's not always a best model that everyone agrees on. This uncertainty carries over to hierarchical models because statisticians disagree on their importance. Some think that you should always fit a hierarchical model whereas others will say it's okay to leave out insignificant lower-order terms in specific cases.

Beginning with Minitab 17, you have the flexibility to specify either a hierarchical or a non-hierarchical linear model for a variety of analyses in regression, ANOVA, and designed experiments (DOE). In the example above, if A*B is not statistically significant, why would you include it in the model? Or, perhaps you’ve specified a non-hierarchical model, have seen this dialog box, and you aren’t sure what to do?

In this blog post, I’ll help you decide between fitting a hierarchical or a non-hierarchical regression model.

Practical Reasons to Fit a Hierarchical Linear Model

Reason 1: The terms are all statistically significant or theoretically important

This one is a no-brainer—if all the terms necessary to produce a hierarchical model are statistically significant, you should probably include all of them in the regression model. However, even when a lower-order term is not statistically significant, theoretical considerations and subject area knowledge can suggest that it is a relevant variable. In this case, you should probably still include that term and fit a hierarchical model.

If the interaction term A*B is statistically significant, it can be hard to imagine that the main effect of A is not theoretically relevant at all even if it is not statistically significant. Use your subject area knowledge to decide!

Reason 2: You standardized your continuous predictors or have a DOE model

If you standardize your continuous predictors, you should fit a hierarchical model so that Minitab can produce a regression equation in uncoded (or natural) units. When the equation is in natural units, it’s much easier to interpret the regression coefficients.

If you standardize the predictors and fit a non-hierarchical model, Minitab can only display the regression equation in coded units. For an equation in coded units, the coefficients reflect the coded values of the data rather than the natural values, which makes the interpretation more difficult.

You should always consider a hierarchical model for DOE models because they always use standardized predictors. Starting with Minitab 17, standardizing the continuous predictors is an option for other linear models.

Even if you aren’t using a DOE model, this reason probably applies to you more often than you realize in the context of hierarchical models. When your model contains interaction terms or polynomial terms, you have a great reason to standardize your predictors. These higher-order terms often cause high levels of multicollinearity, which can produce poorly estimated coefficients, cause the coefficients to switch signs, and sap the statistical power of the analysis. Standardizing the continuous predictors can reduce the multicollinearity and related problems that are caused by higher-order terms.

Read my blog post about multicollinearity, VIFs, and standardizing the continuous predictors.

Why You Might Not Want to Fit a Hierarchical Linear Model

Models that contain too many terms can be relatively imprecise and can have a lessened ability to predict the values of new observations.

Consequently, if the reasons to fit a hierarchical model do not apply to your scenario, you can consider removing lower-order terms if they are not statistically significant.

Discussion

In my view, the best time to fit a non-hierarchical regression model is when a hierarchical model forces you to include many terms that are not statistically significant. Your model might be more precise without these extra terms.

However, keep an eye on the VIFs to assess multicollinearity. VIFs greater than 5 indicate that multicollinearity might be causing problems. If the VIFs are high, you may want to standardize the predictors, which can tip the balance towards fitting a hierarchical model. On the other hand, removing the interaction terms that are not significant can also reduce the multicollinearity.

Minitab output that shows the VIFs

You can fit the hierarchical model with standardization first to determine which terms are significant. Then, fit a non-hierarchical model without standardization and check the VIFs to see if you can trust the coefficients and p-values. You should also check the residual plots to be sure that you aren't introducing a bias by removing the terms.
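If you want to check VIFs outside Minitab as part of that workflow, here's a minimal sketch with statsmodels' variance_inflation_factor, using a made-up predictor and its square, the classic higher-order-term culprit:

    import numpy as np
    import pandas as pd
    from statsmodels.stats.outliers_influence import variance_inflation_factor
    from statsmodels.tools import add_constant

    rng = np.random.default_rng(1)
    x = rng.normal(10, 2, 100)
    X = add_constant(pd.DataFrame({"x": x, "x_sq": x ** 2}))

    # Skip the constant in column 0; VIFs well above 5 signal trouble.
    for i, name in enumerate(X.columns[1:], start=1):
        print(name, variance_inflation_factor(X.values, i))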

Keep in mind that some statisticians believe you should always fit a hierarchical model. Their rationale, as I understand it, is that a lower-order term provides more basic information about the shape of the response function and a higher-order term simply refines it. This approach has more of a theoretical basis than a mathematical basis. It is not problematic as long as you don’t include too many terms that are not statistically significant.

Unfortunately, there is not always a clear-cut answer to the question of whether you should fit a hierarchical model. I hope this post at least helps you sort through the relevant issues.

When Is It Crucial to Standardize the Variables in a Regression Model?


In statistics, there are things you need to do so you can trust your results. For example, you should check the sample size, the assumptions of the analysis, and so on. In regression analysis, I always urge people to check their residual plots.

In this blog post, I present one more thing you should do so you can trust your regression results in certain circumstances—standardize the continuous predictor variables. Before you groan about having one more thing to do, let me assure you that it’s both very easy and very important. In fact, standardizing the variables can actually reveal statistically significant findings that you might otherwise miss!

When and Why to Standardize the Variables

You should standardize the variables when your regression model contains polynomial terms or interaction terms. While these types of terms can provide extremely important information about the relationship between the response and predictor variables, they also produce excessive amounts of multicollinearity.

Multicollinearity is a problem because it can hide statistically significant terms, cause the coefficients to switch signs, and make it more difficult to specify the correct model.

Your regression model almost certainly has an excessive amount of multicollinearity if it contains polynomial or interaction terms. Fortunately, standardizing the predictors is an easy way to reduce multicollinearity and the associated problems that are caused by these higher-order terms. If you don’t standardize the variables when your model contains these types of terms, you are at risk of both missing statistically significant results and producing misleading results.
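Here is a hedged sketch of that effect in Python with statsmodels, on synthetic data built so that the group effect operates entirely through the interaction. Uncentered, the main effect looks insignificant because its coefficient describes the group difference at x = 0; after centering, the same coefficient describes the difference at the mean of x, and the significance appears:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(42)
    n = 60
    x = rng.normal(10, 2, n)
    g = rng.integers(0, 2, n)  # two conditions, coded 0 and 1
    y = 5 + 1.5 * x + 0.8 * x * g + rng.normal(0, 2, n)
    df = pd.DataFrame({"y": y, "x": x, "g": g})

    # Uncentered: the coefficient for g is the group difference at x = 0,
    # which is truly zero here, so g looks insignificant.
    print(smf.ols("y ~ x * g", df).fit().pvalues["g"])

    # Centered: the coefficient for g is the group difference at the mean
    # of x (about 8 units here), which is strongly significant.
    df["xc"] = df["x"] - df["x"].mean()
    print(smf.ols("y ~ xc * g", df).fit().pvalues["g"])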

How to Standardize the Variables

Minitab's coding dialog box

Many people are not familiar with the standardization process, but in Minitab 17 it’s as easy as choosing an option and then proceeding along normally. All you need to do is click the Coding button in the main dialog and choose an option from Standardize continuous predictors.

To reduce multicollinearity caused by higher-order terms, choose an option that includes Subtract the mean or use Specify low and high levels to code as -1 and +1.

These two methods reduce the amount of multicollinearity. In my experience, both methods produce equivalent results. However, it’s easy enough to try both methods and compare the results. The -1 to +1 coding scheme is the method that DOE models use. I tend to use Subtract the mean because it’s a more intuitive process. Subtracting the mean is also known as centering the variables.

One caution: the other two standardization methods won't reduce the multicollinearity.

How to Interpret the Results When You Standardize the Variables

Conveniently, you can usually interpret the regression coefficients in the normal manner even though you have standardized the variables. Minitab uses the coded values to fit the model, but it converts the coded coefficients back into the uncoded (or natural) values—as long as you fit a hierarchical model. Consequently, this feature is easy to use and the results are easy to interpret.

I’ll walk you through an example to show you the benefits, how to identify problems, and how to determine whether they have been resolved. This example comes from a previous post where I show how to compare regression slopes. You can get the data here.

In the first model, the response variable is Output and the predictors are Input, Condition, and the interaction term, Input*Condition.

Regression results with unstandardized predictor variables

For the results above, if you use a significance level of 0.05, Input and Input*Condition are statistically significant, but Condition is not significant. However, VIFs greater than 5 suggest problematic levels of multicollinearity, and the VIFs for Condition and the interaction term are right around 5.

I’ll refit the model with the same terms but I’ll standardize the continuous predictors using the Subtract the mean method.

Regression results with standardized predictor variables

These results show that multicollinearity has been reduced because all of the VIFs are less than 5. Importantly, Condition is now statistically significant. Multicollinearity was obscuring the significance in the first model! The coefficients table shows the coded coefficients, but Minitab has converted them back into uncoded coefficients in the regression equation. You interpret these uncoded values in the normal manner.

This example shows the benefits of standardizing the variables when your regression model contains polynomial terms and interaction terms. You should always standardize when your model contains these types of terms. It is very easy to do and you’ll have more confidence that you’re not missing something important!

For more information, see my blog post What Are the Effects of Multicollinearity and When Can I Ignore Them? That post provides a more detailed explanation about the effects of multicollinearity and a different example of how standardizing the variables can reveal significant findings, and even a changing coefficient sign, that would have otherwise remained hidden.

If you're learning about regression, read my regression tutorial!

Understanding Interactions with NBA 3-Point Shooting


What is an interaction? It’s when the effect of one factor depends on the level of another factor. Interactions are important when you’re performing ANOVA, DOE, or a regression analysis. Without them, your model may be missing an important term that helps explain variability in the response!

For example, let’s consider 3-point shooting in the NBA. We previously saw that the number of 3-point attempts per game has been steadily increasing in the NBA. And there is no better example of this than the Golden State Warriors, who shoot 35% of their shots from behind the arc (2nd in the NBA). Seeing as how the Warriors currently lead the NBA in points per 100 possessions (a better indicator of offense than points per game since it accounts for pace), could it be that shooting more 3-pointers increases the number of points you score? For every NBA team since 1981, I collected their season totals for points per 100 possessions (ORtg) and the percentage of field goal attempts from 3-point range (3PAr). For example, if your 3PAr is 0.30, then 30% of your field goal attempts are 3-pointers (and the other 70% are from 2). Here is a fitted line plot of the two variables.

Fitted Line Plot

ANOVA Table

At first glance, it doesn’t look like shooting a lot of your shots from 3 has any effect on a team’s offensive rating. However, we’re missing an important variable. The Golden State Warriors don’t score a lot of points just because they shoot a lot of 3-pointers. They score a lot because they shoot a lot of 3-pointers and they make a lot of them.

So now let’s include each team’s percentage of successful 3-pointers (3P%) in the model.

ANOVA Table

Both of our terms are now significant, but the R-squared value is only 4.53. That means that our model explains only 4.53% of the variation in a team’s offensive rating. This is because we’re still leaving out an important term: the interaction! If your percentage of successful 3-pointers is low and you shoot a lot of 3-pointers, your offensive rating is going to be lower than if your percentage of successful 3-pointers is high and you shoot a lot of 3-pointers.

Let’s see what happens when we include the interaction term:

ANOVA Table

The interaction term is significant in the model, and our R-squared value has now increased to 20.27%!
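For readers who want to experiment outside Minitab, here's a sketch of the same model progression in Python with statsmodels. The data below are synthetic, generated so that the payoff to shooting more 3s depends on making them; the real post uses collected team-season totals, and the column names are stand-ins for ORtg, 3PAr, and 3P%:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf

    rng = np.random.default_rng(3)
    n = 1000
    rate = rng.uniform(0.05, 0.40, n)   # 3PAr: share of field goal attempts from 3
    pct = rng.normal(0.35, 0.03, n)     # 3P%: share of 3-point attempts made
    ortg = 100 + 1000 * (rate - 0.225) * (pct - 0.35) + rng.normal(0, 3, n)
    nba = pd.DataFrame({"ORtg": ortg, "ThreePAr": rate, "ThreePpct": pct})

    # Main effects only: explains almost nothing.
    print(smf.ols("ORtg ~ ThreePAr + ThreePpct", nba).fit().rsquared)

    # The interaction captures the "shoot more only if you make them" effect.
    print(smf.ols("ORtg ~ ThreePAr * ThreePpct", nba).fit().rsquared)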

When an interaction term is significant in the model, you should ignore the main effects of the variables and focus on the effect of the interaction. Minitab provides several tools to better help you understand this effect. The easiest to use is the line plot.

Interaction Plot

In this plot, the red line represents the highest value for percentage of successful 3-pointers (3P%) in the data, and the blue line represents the lowest. When you shoot significantly more 2-pointers than 3-pointers (the left side of the 3PAr axis) the offensive rating is similar for both the high and low settings of 3P%. But as you shoot fewer 2-pointers and more 3-pointers, offensive rating goes up for the high-success setting of 3-point shooting percentage, and drastically drops for the low-success setting.

Because 3P% is a continuous variable, we should be interested in seeing effects of the interaction for more than just the high and low setting. This can be accomplished using a contour plot.

Contour Plot

Now we can see the full range of values for both 3P% and 3PAr. The colors represent different ranges for offensive rating. Dark green represents a higher rate for offensive rating, while light green and blue represent lower offensive ratings.

We see that if your percentage of successful 3-pointers (3P%) is between approximately 33% and 38%, your 3PAr doesn’t have a large effect on your offensive rating. A 3P% above 38% means that you should shoot more 3-pointers, whereas a percentage below 33% means that you should shoot fewer 3-pointers.

Now that we understand how the interaction works, let’s use our results to look at some NBA teams. So far in this NBA season, only five teams fall outside the 3P% range of 33% to 38%. Two teams make more than 38% of their 3-pointers (Warriors and Spurs) and three teams make less than 33% (Heat, Timberwolves, and Lakers). So do the Warriors and Spurs correctly shoot a high percentage of their field goals from 3, and do the Heat, Timberwolves, and Lakers shoot a high percentage of their shots from 2?

Contour Plot

The Warriors are good at shooting 3s, and they know it. They have the highest 3-point percentage in the NBA, and shoot the second-highest percentage of their field goals from 3 (the Rockets, who shoot the highest percentage of their field goals from 3, are not shown on the plot). On the other side, the Timberwolves are bad at shooting 3s, and they know it. They have the second-worst 3-point percentage and shoot the lowest percentage of their field goals from 3. The Heat also shoot poorly from 3, but they don’t take a lot as they rank 24th in the NBA in percentage of field goal attempts from 3-point range.

The interesting teams are the Spurs and the Lakers. The Spurs are second in the league, making 39.3% of their 3-pointers. However, only 22.4% of their field goals are 3-pointers, which is 26th in the league. They could benefit by shooting an even higher percentage of their shots from 3. And then there’s the Lakers. Despite ranking dead last in 3-point percentage, they shoot 29% of their field goals from 3. That’s good for 14th in the league. From this analysis, the Lakers are taking too many 3-pointers.

Now, this model purposely leaves out other predictors that could affect offensive rating (like 2-point shooting percentage). So don’t assume that 3-point shooting is all that goes into offensive rating. But it does give us a simple example of how interactions work and how you can use them to look at a real life process. Interactions can be an important part of any data model, so don’t neglect them!

What Is Complete Separation in Binary Logistic Regression?



When running a binary logistic regression and many other analyses in Minitab, we estimate parameters for a specified model based on the sample data that has been collected. Most of the time, we use what is called Maximum Likelihood Estimation. However, based on specifics within your data, sometimes these estimation methods fail. What happens then?

Specifically, during binary logistic regression, an error comes up often enough that I want to explain what exactly it means, and offer some potential remedies for it. When you attempt to run your model, you may see the following error:

error

What's going on here? First, let's see what causes this error. Take a look at the following data set consisting of one response variable, Y, and one predictor variable, X.

X:  1  2  3  4  4  5  5  6
Y:  0  0  0  0  0  1  1  1

Note the key pattern. This data set can be simply described as follows:

If X <= 4, then Y = 0 without fail. Similarly, if X > 4, then Y = 1, again without fail. This is what is known as "separation."

This "perfect prediction" of the response is what causes the estimates, and thus your model, to fail. 

Often, separation occurs when the data set is too small to observe events with low probabilities. In the example above, it may be possible to observe a Y value of 1 with an X of less than 4; however, with a small sample and low probabilities, we didn't observe any instances of this in our data collection. The more predictors there are in the model, the more likely separation is to occur, because the individual groups in the data have smaller sample sizes.

Essentially, separation occurs when there is a category or range of a predictor with only one value of the response. We need variation in the response within each category or range of the predictors to estimate the model.
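You can reproduce the failure with the little data set above; a minimal sketch in Python with statsmodels (whether the failure surfaces as an error or a warning depends on the statsmodels version):

    import numpy as np
    import statsmodels.api as sm

    # The data set from above: X <= 4 always gives Y = 0, X > 4 always gives Y = 1.
    X = np.array([1, 2, 3, 4, 4, 5, 5, 6], dtype=float)
    Y = np.array([0, 0, 0, 0, 0, 1, 1, 1])

    # Maximum likelihood has no finite solution: the likelihood keeps improving
    # as the coefficient for X grows without bound.
    try:
        print(sm.Logit(Y, sm.add_constant(X)).fit().params)
    except Exception as err:
        print(type(err).__name__, ":", err)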

So when separation happens, what can we do to proceed? With the data as is, there's no way to estimate those parameters; however, there are some things we can do to work around this issue.

1. Obtain more data. If possible, being able to get more data increases the probability that you will obtain different values for your response, thus eliminating the separation. If possible, this is a good first step. 

2. Consider an alternative model. The more terms are in the model, the more likely that separation occurs for at least one variable. When you select terms for the model, you can check whether the exclusion of a term allows the maximum likelihood estimates to converge. If a useful model exists that does not use the term, you can continue the analysis with the new model.

3. Depending on the predictor variable in question, you may be able to manipulate your groupings to something that has events occurring. For example, you may have a predictor in your model with groups for both "Oranges" and "Apples." With such specific groups, it may be possible to see separation. However, that separation may disappear if you can combine those two levels into one specific grouping, such as "Fruit."

Seeing an error message like this can be frustrating, but it doesn't have to be the end of the line if you know some ways to work around it. Keep in mind these steps when analyzing a model, and you can overcome data issues such as this in the future.

Five Reasons Why Your R-squared Can Be Too High


I’ve written about R-squared before and I’ve concluded that it’s not as intuitive as it seems at first glance. It can be a misleading statistic because a high R-squared is not always good and a low R-squared is not always bad. I’ve even said that R-squared is overrated and that the standard error of the estimate (S) can be more useful.

Even though I haven’t always been enthusiastic about R-squared, that’s not to say it isn’t useful at all. For instance, if you perform a study and notice that similar studies generally obtain a notably higher or lower R-squared, you should investigate why yours is different because there might be a problem.

In this blog post, I look at five reasons why your R-squared can be too high. This isn’t a comprehensive list, but it covers some of the more common reasons.

Is A High R-squared Value a Problem?

Very high R-squared

A very high R-squared value is not necessarily a problem. Some processes can have R-squared values that are in the high 90s. These are often physical processes where you can obtain precise measurements and there's low process noise.

You'll have to use your subject area knowledge to determine whether a high R-squared is problematic. Are you modeling something that is inherently predictable? Or, not so much? If you're measuring a physical process, an R-squared of 0.9 might not be surprising. However, if you're predicting human behavior, that's way too high!

Compare your study to similar studies to determine whether your R-squared is in the right ballpark. If your R-squared is too high, consider the following possibilities. To determine whether any apply to your model specifically, you'll have to use your subject area knowledge, information about how you fit the model, and data specific details.

Reason 1: R-squared is a biased estimate

bathroom scale

The R-squared in your regression output is a biased estimate based on your sample—it tends to be too high. This bias is a reason why some practitioners don’t use R-squared at all but use adjusted R-squared instead.

R-squared is like a broken bathroom scale that tends to read too high. No one wants that! Researchers have long recognized that regression’s optimization process takes advantage of chance correlations in the sample data and inflates the R-squared.

Adjusted R-squared does what you’d do with that broken bathroom scale. If you knew the scale was consistently too high, you’d reduce it by an appropriate amount to produce a weight that is correct on average.

Adjusted R-squared does this by comparing the sample size to the number of terms in your regression model. Regression models that have many samples per term produce a better R-squared estimate and require less shrinkage. Conversely, models that have few samples per term require more shrinkage to correct the bias.
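In formula terms, the standard textbook adjustment with n observations and k terms in the model is:

$$ R^2_{\text{adj}} = 1 - \left(1 - R^2\right)\frac{n - 1}{n - k - 1} $$

The smaller n is relative to k, the larger the shrinkage.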

For more information, read my posts about Adjusted R-squared and R-squared shrinkage.

Reason 2: You might be overfitting your model

An overfit model is one that is too complicated for your data set. You’ve included too many terms in your model compared to the number of observations. When this happens, the regression model becomes tailored to fit the quirks and random noise in your specific sample rather than reflecting the overall population. If you drew another sample, it would have its own quirks, and your original overfit model would not likely fit the new data.

Adjusted R-squared doesn't always catch this, but predicted R-squared often does. Read my post about the dangers of overfitting your model.

Reason 3: Data mining and chance correlations

If you fit many models, you will find variables that appear to be significant but are correlated only by chance. While your final model might not be too complex for the number of observations (Reason 2), problems occur when you fit many different models to arrive at the final model. Data mining can produce high R-squared values even with entirely random data!
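A quick simulation makes the point; this sketch in Python mines 100 random candidate predictors for the 5 that happen to correlate best with a random response:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm

    # Entirely random data: no predictor has any real relationship with y.
    rng = np.random.default_rng(0)
    y = pd.Series(rng.normal(size=30))
    X = pd.DataFrame(rng.normal(size=(30, 100)))

    # "Data mine" by keeping the 5 predictors most correlated with y.
    best = X.corrwith(y).abs().nlargest(5).index
    model = sm.OLS(y, sm.add_constant(X[best])).fit()
    print(round(model.rsquared, 2))  # often around 0.5 despite pure noise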

Before performing regression analysis, you should already have an idea of what the important variables are along with their relationships, coefficient signs, and effect magnitudes based on previous research. Unfortunately, recent trends have moved away from this approach thanks to large, readily available databases and automated procedures that build regression models.

For more information, read my post about using too many phantom degrees of freedom.

Reason 4: Trends in Panel (Time Series) Data

If you have time series data and your response variable and a predictor variable both have significant trends over time, this can produce very high R-squared values. You might try a time series analysis, or include time-related variables in your regression model, such as lagged and/or differenced variables. Conveniently, these analyses and functions are all available in Minitab statistical software.

Reason 5: Form of a Variable

It's possible that you're including different forms of the same variable for both the response variable and a predictor variable. For example, if the response variable is temperature in Celsius and you include a predictor variable of temperature in some other scale, you'd get an R-squared of nearly 100%! That's an obvious example, but the same thing can happen more subtly.

For more information about regression models, read my post about How to Choose the Best Regression Model.


Specification Limits and Stability Studies


I was recently asked a couple of questions about stability studies in Minitab.

Question 1:  If I enter a lower and an upper spec in the Stability Study dialog window, why do I see only one confidence bound per fitted line on the resulting graph? Shouldn’t there be two?

You use a stability study to analyze the stability of a product over time and to determine the product's shelf life. In order to run this in Minitab, you need:

  • Measurement Data
  • Time Variable
  • Batch Factor (optional)

Shown below is a sample of the first 14 rows of a Stability Study data set. The full data set can be found in our Sample Data folder within Minitab Statistical Software, via Help > Sample Data in Minitab 17.3. Search for and open the ShelfLife.mtw file. You can download a free 30-day trial of the software if you're not already using it.

The Month column represents the product's age, in months, when the measurement was collected. The Batch column identifies the batch the product came from. In the sixth row, for example, the drug concentration percentage for Batch 2 at 3 months was 99.478%.

With this information, the stability study will help you estimate the average length of time that the response will be within specification. To satisfy my inquisitor’s first question, we will use a lower spec of 90% and an upper spec of 105%. 

The Stability Study dialog box:

 
 

The Resulting Graph:

shelf life for all batches

Minitab first checks to see if the starting point of the fitted line is between specs, and then determines the direction of the slope of the fitted lines before deciding what limit to calculate the shelf life from. If the decrease in the mean response is significant, then Minitab calculates the shelf life relative to the lower specification limit.

If the increase in the mean response over time is significant, Minitab calculates the shelf life relative to the upper specification limit. The bound that is displayed depends on which spec Minitab has selected; thus, the 97.5% lower bound is shown only in relation to the corresponding fitted line above it. From a conceptual standpoint, if the slope of the mean response line is trending downward, then you'd be looking at where its worst case, the 97.5% lower bound, intersects with the lower spec. Using the lower spec of 90% concentration, the overall shelf life for the batches is 53.39 months.
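Conceptually, the shelf life is just the time at which the 97.5% lower confidence bound for the mean response crosses the lower spec. Here's a rough sketch of that calculation in Python with statsmodels, using made-up concentration values for a single batch (not the ShelfLife.mtw data):

    import numpy as np
    import statsmodels.api as sm

    months = np.array([0, 3, 6, 9, 12, 18, 24], dtype=float)
    conc = np.array([100.2, 99.5, 99.1, 98.4, 97.9, 96.8, 95.9])  # made-up values
    lower_spec = 90.0

    fit = sm.OLS(conc, sm.add_constant(months)).fit()

    # The lower limit of a two-sided 95% CI for the mean response is the
    # one-sided 97.5% lower bound. Scan a time grid for the first crossing.
    grid = np.linspace(0, 120, 1201)
    lower_bound = fit.get_prediction(sm.add_constant(grid)).conf_int(alpha=0.05)[:, 0]
    crossed = grid[lower_bound < lower_spec]
    print(crossed[0] if crossed.size else "no crossing within the grid")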

Question 2: I get asterisks for Shelf Life for each Batch, as shown below:

Batch      Shelf Life
1              *
2              *
3              *
4              *
5              *
Overall        *

This question is closely related to the first. The answer depends on the slope's direction and which specification, lower or upper, you have chosen. Most likely, you won't run into this situation if:

a.  Your fitted line has a significant negative slope and you are only inputting a lower spec.

b.  Your fitted line has a significant positive slope and you are only inputting an upper spec.

If you run a stability study with two specs, you may receive these asterisks if the mean response at time = 0 is not within both specifications. You can see this when we use a lower spec of 90 and an upper spec of 98:

shelf life for all batches 2

For all batches with a negative slope, if the response starts out above the upper spec, it is out of spec from time 0, so there is no period during which the model predicts the response to be within both specs. Minitab can't calculate a shelf life for those batches.

It’s a different story in the first question we discussed (and first graph we showed), as the mean response at time = 0 was below the upper spec.

On a side note, there is another situation which can cause you to obtain all asterisks for the shelf life of the batches. This will happen when the slopes of all fitted lines on the graph are simply not significant. 

I hope this information helps you when you perform your next stability study!

 

See How Easily You Can Do a Box-Cox Transformation in Regression


Translink Ticket Vending Machine found at all train stations in south-east Queensland.

For one reason or another, the response variable in a regression analysis might not satisfy one or more of the assumptions of ordinary least squares regression. The residuals might follow a skewed distribution or the residuals might curve as the predictions increase. A common solution when problems arise with the assumptions of ordinary least squares regression is to transform the response variable so that the data do meet the assumptions. Minitab makes the transformation simple by including the Box-Cox button. Try it for yourself and see how easy it is!

The government in Queensland, Australia shares data about the number of complaints about its public transportation service. 

I’m going to use the data set titled “Patronage and Complaints.” I’ll analyze the data a bit more thoroughly later, but for now I want to focus on the transformation. The variables in this data set are the date, the number of passenger trips, the number of complaints about a frequent rider card, and the number of other customer complaints. I'm using the range of the data from the week ending July 7th, 2012 to December 22nd 2013.  I’m excluding the data for the last week of 2012 because ridership is so much lower compared to other weeks.

If you want to follow along, you can download my Minitab data sheet. If you don't already have it, you can download Minitab and use it free for 30 days

Let’s say that we want to use the number of complaints about the frequent rider card as the response variable. The number of other complaints and the date are the predictors. The resulting normal probability plot of the residuals shows an s-curve.

The residuals do not appear normal.

Because we see this pattern, we’d like to go ahead and do the Box-Cox transformation. Try this:

  1. Choose Stat > Regression > Regression > Fit Regression Model.
  2. In Responses, enter the column with the number of complaints on the go card.
  3. In Continuous Predictors, enter the columns that contain the other customer complaints and the date.
  4. Click Options.
  5. Under Box-Cox transformation, select Optimal λ.
  6. Click OK.
  7. Click Graphs.
  8. Select Individual plots and check Normal plot of residuals.
  9. Click OK twice.

The residuals are more normal.

The probability plot that results is more linear, although it still shows outlying observations where the number of complaints in the response are very high or very low relative to the number of other complaints. You'll still want to check the other regression assumptions, such as homoscedasticity.
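If you ever need the same transformation outside Minitab, scipy estimates the optimal lambda directly. A minimal sketch, with made-up positive response values:

    import numpy as np
    from scipy import stats
    from scipy.special import inv_boxcox

    y = np.array([23.0, 41.0, 12.0, 68.0, 35.0, 51.0, 29.0, 44.0])  # must be > 0

    y_transformed, lam = stats.boxcox(y)  # lam is the maximum likelihood lambda
    print(round(lam, 3))

    # After fitting the regression on y_transformed, map predictions back:
    y_back = inv_boxcox(y_transformed, lam)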

So there it is, everything that you need to know to use a Box-Cox transformation on the response in a regression model. Easy, right? Ready for some more? Check out more of the analysis steps that Minitab makes easy.

The image of the Translink vending machine is by Brad Wood and is licensed for reuse under this Creative Commons License.

 

Regression versus ANOVA: Which Tool to Use When


Suppose you’ve collected data on cycle time, revenue, the dimension of a manufactured part, or some other metric that’s important to you, and you want to see what other variables may be related to it. Now what?

thinker

When I graduated from college with my first statistics degree, my diploma was bona fide proof that I'd endured hours and hours of classroom lectures on various statistical topics, including linear regression, ANOVA, and logistic regression.

However, there wasn’t a single class that put it all together and explained which tool to use when. I have all of this data for my Y and X's and I want to describe the relationship between them, but what do I do now?

Back then, I wish someone had clearly laid out which regression or ANOVA analysis was most suited for this type of data or that. Let's start with how to choose the right tool for a continuous Y…

Continuous Y, Continuous X(s)

Example:

     Y: Weights of adult males

     X’s: Age, Height, Minutes of exercise per week

What tool should you use?  Regression

Where’s that in Minitab?  Stat > Regression > Regression > Fit Regression Model

 

Continuous Y, Categorical X(s)

Example:

     Y: Your Mario Kart Wii score

     X’s: Wii controller type (racing wheel or standard), whether you stand or sit while playing, character (Mario, Luigi, Yoshi, Bowser, Peach)

What tool should you use?  ANOVA

Where’s that in Minitab?  Stat > ANOVA > General Linear Model > Fit General Linear Model

 

Continuous Y, Continuous AND Categorical X(s)

Example:

     Y: Number of hours people sleep per night

     X’s: Age, activity prior to sleeping (none, read a book, watch TV, surf the internet), whether or not the person has young children…“I had a bad dream, I'm thirsty, there’s a monster under my bed!”

What tool should you use?  You have a choice of using either ANOVA or Regression

Where’s that in Minitab? Stat > ANOVA > General Linear Model > Fit General Linear Model or Stat > Regression > Regression > Fit Regression Model

I personally prefer GLM because it offers multiple comparisons, which are useful if you have a significant categorical X with more than 2 levels. For example, suppose activity prior to sleep is significant. Comparisons will tell you which of the 4 levels—none, read a book, watch TV, surf the Internet—are significantly different from one another.

Do people who watch TV sleep, on average, the same as people who surf the Internet, but significantly less than people who do nothing or read? Or, perhaps, are internet surfers significantly different from the other three categories? Comparisons help you detect these differences.
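Under the hood, ANOVA and regression fit the same linear model, so one sketch in Python with statsmodels covers this mixed case; the sleep data below are made up to mirror the example:

    import numpy as np
    import pandas as pd
    import statsmodels.formula.api as smf
    from statsmodels.stats.multicomp import pairwise_tukeyhsd

    rng = np.random.default_rng(7)
    n = 120
    activity = rng.choice(["none", "book", "tv", "internet"], n)
    age = rng.uniform(20, 60, n)
    sleep = 8 - 0.02 * age - 0.7 * (activity == "internet") + rng.normal(0, 0.5, n)
    df = pd.DataFrame({"sleep": sleep, "age": age, "activity": activity})

    # One model handles the continuous and categorical X's together.
    print(smf.ols("sleep ~ age + C(activity)", data=df).fit().pvalues)

    # Multiple comparisons among the activity levels (ignoring age here).
    print(pairwise_tukeyhsd(df["sleep"], df["activity"]))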

Categorical Y

If Y is categorical, then you can use logistic regression for your continuous and/or categorical X’s. The 3 types of logistic regression are:

     Binary:  Y with 2 levels (yes/no, pass/fail)

     Ordinal:  Y with more than 2 levels that have a natural order (low/medium/high)

     Nominal:  Y with more than 2 levels that have no order (sedan/SUV/minivan/truck)

So the next time you have a bunch of X’s and a Y and you want to see if there's a relationship between them, here is a summary of which tool to use when:

Tool Selection Guide

For step-by-step instructions on how to use General Regression, General Linear Model, or Logistic Regression in Minitab Statistical Software, just navigate to any of these tools in Minitab and click Help in the bottom left corner of the dialog. You will then see ‘example’ located at the top of the Help screen. And Minitab customers can always contact Minitab Technical Support at 814-231-2682 or www.minitab.com/contact-us. Our Tech Support team is staffed with statisticians, and best of all, accessing them is free!

Using Fitness Tracker Data to Make Wise Decisions: Are You Working Out in the Right Zone?


Technology is very much part of our lives nowadays. We use our smartphones to have video calls with our friends and family, and watch our favourite TV shows on tablets. Technology has also transformed the fitness industry with the increasing popularity of fitness trackers.

Recently, I got myself a fitness watch and it's becoming my favourite gadget. It can track how many steps I’ve taken, my heart rate during a workout, and how many calories I've burned during my workout and over the whole day. Based on the calories burned, I can adjust my diet to ensure I have eaten what I require for the day. I’ve been collecting data from my weekly Zumba sessions, gym workouts and lunch-time walks. After collecting data for over a month, I decided to do some analysis with it using Minitab. Below is a snapshot of the data I collected in Minitab.

fitbit data

For each activity, I have the following information:

  • Duration of exercise in minutes and seconds
  • Time spent (rounded to the nearest minute) in the peak/high-intensity exercise heart-rate zone—heart rate greater than 85% of maximum
  • Time spent (rounded to the nearest minute) in the cardio/medium-to-high-intensity exercise heart-rate zone—heart rate 70 to 84% of maximum
  • Time spent (rounded to the nearest minute) in the fat-burn/low-to-medium-intensity exercise heart-rate zone—heart rate 50 to 69% of maximum
  • Average heart rate during the session
  • Total calories burned during the session

It appears that a higher average heart rate results in more calories burned, and that the total also depends on the time spent in the different heart-rate zones. Let’s do some calculation using correlation coefficients.

Correlation - Cardio and Calories

As expected, all three variables are positively correlated with calories burned. However, spending hours on the treadmill is probably not a very good way to burn calories. With the best summer weather just around the corner, I need a more efficient way to exercise to lose the few pounds from my indulgence in the winter months!
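The same check is quick to sketch outside Minitab with pandas; the handful of sessions below are made-up numbers standing in for the tracker export:

    import pandas as pd

    workouts = pd.DataFrame({
        "peak_min":    [2, 0, 5, 1, 8, 0],
        "cardio_min":  [20, 10, 25, 15, 28, 5],
        "fatburn_min": [15, 30, 10, 20, 8, 35],
        "avg_hr":      [138, 118, 146, 128, 152, 112],
        "calories":    [310, 210, 360, 255, 400, 190],
    })
    print(workouts.corr()["calories"])  # correlation of each column with calories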

According to research, exercising at higher intensity can result in more calories burned due to the “afterburn” effect. The afterburn effect is the additional calories burned after intensive exercise. Recently, at my local gym, they have introduced 30-minute HIIT (high-intensity interval training) sessions, which I am considering taking. Hence, fitting a regression model using my data will probably help me make the decision.

In Minitab, I opened Stat > Regression > Regression > Fit Regression Model, and completed the dialog and sub-dialog boxes as shown below.

fitbit regression dialog

fitbit regression subdialog

Instead of using a trial-and-error approach to select terms for the model, I will use the stepwise approach to help me identify suitable terms for the model.

fitbit stepwise regression

And after I press OK on each of my dialogs, Minitab returns the regression equation:

Regression Equation for Fitbit Data

fitbit regression model summary

The final model is quite decent, as the three types of R-squared values are all above 80%. This implies I can use this model to make predictions. The regression equation appears complex, but I can use the response optimizer in Minitab 17 to identify optimum settings to achieve my goal.

There is a common belief that 1 pound of fat (0.45 kilogram) is approximately equal to 3500 calories. Let’s say I aim to burn about 300 calories in each session. This means after about 12 sessions I would have lost approximately a pound of fat, provided I also had a healthy diet. Since exercising at higher heart rate tends to burn more calories, I will also aim to maintain an average heart rate between, say, 128 and 148, which for me works out as somewhere between 70-80% of maximum heart rate.

With all the conditions above, using Stat > Regression > Regression > Response Optimizer, here are some screenshots of the dialog boxes.

response optimizer for fitbit

response optimizer options for fitbit data

My target calorie burn rate is 300, and getting above 300 would be a bonus. Hence, I am using 310 as the upper limit.

fitbit upper limit

I would like to spend no more than 45 minutes per session, and hence I am using a maximum of 30 minutes exercising in the cardio zone and 15 minutes in the fat-burn zone.

Response optimization output for fitbit data

Fitbit optimizer response plot

To achieve my goal, I need to exercise in the cardio zone for about 21 minutes, exercise in the fat burn zone for about 15 minutes, and maintain my average heart rate at about 148 for the session.

I understand that the HIIT sessions involve very intense bursts of exercise followed by short, sometimes active, recovery periods. This type of training gets and keeps your heart rate up. Based on  this, if out of a 30-minute HIIT session I can maintain about 21 minutes in the cardio zone, and spend the rest of the session exercising in the fat-burn zone, I will be close to achieving my goal. I can always supplement this by a few minutes on the exercise bike or cross-trainer after the class. 

Another good feature with the response optimizer is that I can evaluate different settings to see how the changes can affect the response. Let's consider the days when the HIIT class is not offered and I need to use the machines. I normally go for a longer session on the cross trainer (20-30 minutes), followed by a quick 10-minute session on the step machine. From past experience, I can easily get into the cardio heart-rate zone when using the cross-trainer. Now I can use the optimizer to predict the calories burned for 30 minutes of working out in the cardio zone and 10 minutes in the fat-burn zone. I will also use a lower average heart rate of 140.

By clicking on the current setup, I can input new settings.

Fitbit response optimizer new settings

response optimizer for fitbit data cardio heart rate zone

Well, this solution is not too far off from my target of 300 calories burned!

It’s turned out to be an enjoyable and informative experience analysing my own fitness data to see what my best workout options are. Taking the data collected by my fitness tracker and doing further analysis on it has definitely helped me to decide on how to exercise wisely and efficiently.   

 

Gym photo by Indigo Fitness Club Zurich, used under Creative Commons 2.0 license. 

Using Marginal Plots, aka "Stuffed-Crust Charts"


In my last post, we took the red pill and dove deep into the unarguably fascinating and uncompromisingly compelling world of the matrix plot. I've stuffed this post with information about a topic of marginal interest...the marginal plot.

Margins are important. Back in my English composition days, I recall that margins were particularly prized for the inverse linear relationship they maintained with the number of words that one had to string together to complete an assignment. Mathematically, that relationship looks something like this:

Bigger margins = fewer words

stuffed crust

In stark contrast to my concept of margins as information-free zones, the marginal plot actually utilizes the margins of a scatterplot to provide timely and important information about your data. Think of the marginal plot as the stuffed-crust pizza of the graph world. Only, instead of extra cheese, you get to bite into extra data. And instead of filling your stomach with carbs and cholesterol, you're filling your brain with data and knowledge. And instead of arriving late and cold because the delivery driver stopped off to canoodle with his girlfriend on his way to your house (even though he's just not sure if the relationship is really working out: she seems distant lately and he's not sure if it's the constant cologne of consumables about him, or the ever-present film of pizza grease on his car seats, on his clothes, in his ears?)

...anyway, unlike a cold, late pizza, marginal plots are always fresh and hot, because you bake them yourself, in Minitab Statistical Software.

I tossed some randomly-generated data around and came up with this half-baked example. Like the pepperonis on a hastily prepared pie, the points on this plot are mostly piled in the middle, with only a few slices venturing to the edges. In fact, some of those points might be outliers. 

Scatterplot of C1 vs C2

If only there were an easy, interesting, and integrated way to assess the data for outliers when we make a scatterplot.

Boxplots are a useful way to look for outliers. You could make separate boxplots of each variable, like so:

Boxplot of C1  Boxplot of C2

It's fairly easy to relate the boxplot of C1 to the values plotted on the y-axis of the scatterplot. But it's a little harder to relate the boxplot of C2 to the scatterplot, because the y-axis on the boxplot corresponds to the x-axis on the scatterplot. You can transpose the scales on the boxplot to make the comparison a little easier. Just double-click one of the axes and select Transpose value and category scales:

Boxplot of C2, Transposed

That's a little better. The only thing that would be even better is if you could put each boxplot right up against the scatterplot...if you could stuff the crust of the scatterplot with boxplots, so to speak. Well, guess what? You can! Just choose Graph > Marginal Plot > With Boxplots, enter the variables, and click OK.

Marginal Plot of C1 vs C2

Not only are the boxplots nestled right up next to the scatterplot, but they also share the same axes as the scatterplot. For example, the outlier (asterisk) on the boxplot of C2 corresponds to the point directly below it on the scatterplot. Looks like that point could be an outlier, so you might want to investigate further. 
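Outside Minitab, the closest off-the-shelf equivalent I know of is seaborn's jointplot, which stuffs the margins of a scatterplot with the distribution of each variable (histograms rather than boxplots). A minimal sketch on random data:

    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(5)
    x = rng.normal(10, 2, 100)
    y = rng.normal(20, 3, 100)

    # Scatterplot in the middle, marginal histograms in the "crust."
    sns.jointplot(x=x, y=y, kind="scatter", marginal_kws={"bins": 20})
    plt.show()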

Marginal plots can also help alert you to other important complexities in your data. Here's another half-baked example. Unlike our pizza delivery guy's relationship with his girlfriend, it looks like the relationship between the fake response and the fake predictor represented in this scatterplot really is working out: 

Scatterplot of Fake Response vs Fake Predictor 

In fact, if you use Stat > Regression > Fitted Line Plot, the fitted line appears to fit the data nicely. And the regression analysis is highly significant:

Fitted Line_ Fake Response versus Fake Predictor

Regression Analysis: Fake Response versus Fake Predictor

The regression equation is
Fake Response = 2.151 + 0.7723 Fake Predictor

S = 2.12304   R-Sq = 50.3%   R-Sq(adj) = 49.7%

Analysis of Variance
Source      DF       SS       MS      F      P
Regression   1  356.402  356.402  79.07  0.000
Error       78  351.568    4.507
Total       79  707.970

But wait. If you create a marginal plot instead, you can augment your exploration of these data with histograms and/or dotplots, as I have done below. Looks like there's trouble in paradise:

Marginal Plot of Fake Response vs Fake Predictor, with Histograms Marginal Plot of Fake Response vs Fake Predictor, with Dotplots

Like the poorly made pepperoni pizza, the points on our plot are distributed unevenly. There appear to be two clumps of points. The distribution of values for the fake predictor is bimodal: that is, it has two distinct peaks. The distribution of values for the response may also be bimodal.

Why is this important? Because the two clumps of toppings may suggest that you have more than one metaphorical cook in the metaphorical pizza kitchen. For example, it could be that Wendy, who is left handed, started placing the pepperonis carefully on the pie and then got called away, leaving Jimmy, who is right handed, to quickly and carelessly complete the covering of cured meats. In other words, it could be that the two clumps of points represent two very different populations. 

When I tossed and stretched the data for this example, I took random samples from two different populations. I used 40 random observations from a normal distribution with a mean of 8 and a standard deviation of 1.5, and 40 random observations from a normal distribution with a mean of 13 and a standard deviation of 1.75. The two clumps of data are truly from two different populations. To illustrate, I separated the two populations into two different groups in this scatterplot: 

 Scatterplot with Groups

This is a classic conundrum that can occur when you do a regression analysis. The regression line tries to pass through the center of the data. And because there are two clumps of data, the line tries to pass through the center of each clump. This looks like a relationship between the response and the predictor, but it's just an illusion. If you separate the clumps and analyze each population separately, you discover that there is no relationship at all: 

Fitted Line_ Fake Response 1 versus Fake Predictor 1

Regression Analysis: Fake Response 1 versus Fake Predictor 1

The regression equation is
Fake Response 1 = 9.067 - 0.1600 Fake Predictor 1

S = 1.64688   R-Sq = 1.5%   R-Sq(adj) = 0.0%

Analysis of Variance
Source      DF       SS       MS     F      P
Regression   1    1.609  1.60881  0.59  0.446
Error       38  103.064  2.71221
Total       39  104.673

Fitted Line_ Fake Response 2 versus Fake Predictor 2

Regression Analysis: Fake Response 2 versus Fake Predictor 2

The regression equation is
Fake Response 2 = 12.09 + 0.0532 Fake Predictor 2

S = 1.62074   R-Sq = 0.3%   R-Sq(adj) = 0.0%

Analysis of Variance
Source      DF      SS       MS     F      P
Regression   1   0.291  0.29111  0.11  0.741
Error       38  99.818  2.62679
Total       39 100.109

If only our unfortunate pizza delivery technician could somehow use a marginal plot to help him assess the state of his own relationship. But alas, I don't think a marginal plot is going to help with that particular analysis. Where is that guy anyway? I'm getting hungry. 
