22: Prediction
Chapter 22 Guiding Questions
What is the difference between prediction and explanation?
What assumptions underlie regression-based prediction models?
When is prediction appropriate in applied research contexts?
How can predictive results be misinterpreted or overstated?
22.1 Predictive Thinking
In quantitative research, prediction involves using statistical models to estimate the value of a dependent variable based on one or more independent variables. The goal is to understand the relationship between these variables and to apply that understanding to forecast outcomes for future or unseen data. Prediction plays a crucial role in supporting evidence-based decision-making and helping researchers anticipate likely results. It is important to distinguish prediction from explanation, which focuses on identifying the underlying causes of a relationship. Prediction is primarily concerned with forecasting outcomes as accurately as possible, while explanation aims to test theories and understand why those outcomes occur.
22.2 Understanding Regression
Regression is a statistical method used to model the relationship between an independent variable (also called a predictor) and a dependent variable (also called an outcome). It helps quantify how changes in the predictor are associated with changes in the outcome, often for the purpose of making predictions or understanding the nature of the relationship. While other relational tests such as correlation measure the strength and direction of a relationship between two variables, regression goes further by modeling that relationship and estimating how much the outcome changes for each unit change in the predictor. This makes regression especially useful for forecasting and for evaluating the unique contribution of one or more predictors.
Regression models can range from simple to complex, depending on the number of predictors involved and the nature of the outcome variable. A simple regression model includes one predictor and is often used to examine the basic relationship between two variables. More complex models can include multiple predictors, interaction terms, or transformations of variables to capture more detailed relationships. Regression can also be applied to different types of outcome variables, such as continuous outcomes in linear regression or categorical outcomes in logistic regression. Despite these differences, the core purpose remains the same: to understand how variables relate to one another and to make informed predictions based on those relationships.
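The mechanics of the simplest case can be sketched in a few lines of code. The following is an illustrative pure-Python example (separate from the Jamovi workflow, with hypothetical data): it fits a simple regression line y = a + b·x using the closed-form least-squares formulas, where the slope is the covariance of x and y divided by the variance of x.

```python
# Illustrative sketch: fit a simple linear regression y = a + b*x
# using the closed-form least-squares formulas.

def simple_regression(xs, ys):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope b = covariance(x, y) / variance(x)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x  # intercept passes through the means
    return a, b

# Hypothetical data lying exactly on the line y = 1 + 2x.
a, b = simple_regression([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)  # intercept 1.0, slope 2.0
```

With real data the points scatter around the line, and the same formulas return the line that minimizes the squared residuals.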
22.3 Multiple Linear Regression
Multiple Linear Regression is used to predict the value of a continuous dependent variable based on two or more independent variables. The relationship between the dependent variable and each independent variable is assumed to be linear, meaning that changes in the predictors are associated with consistent, proportional changes in the outcome. This method is commonly applied when researchers aim to understand how multiple factors together relate to or influence a single outcome.
When including nominal independent variables in a regression model, they must be transformed into a numerical format the model can interpret, typically through a process called dummy coding. Dummy coding creates binary variables to represent the categories of a nominal variable. For variables with only two categories, one dummy variable is created to indicate membership in one category. For nominal variables with more than two categories, multiple dummy variables are created, one fewer than the total number of categories. Each dummy variable compares one category to a designated reference group.
This approach allows the regression model to estimate how each category influences the dependent variable relative to the reference group. The coefficient for each dummy variable reflects how much higher or lower the predicted outcome is compared to the baseline category, while holding all other variables constant.
Assumptions
Multiple Linear Regression relies on several key assumptions to produce valid results. First, there should be a linear relationship between the dependent variable and each independent variable. The residuals (the differences between observed and predicted values) should be approximately normally distributed. The model also assumes homoscedasticity, meaning that the variance of the residuals is consistent across all levels of the independent variables. Observations must be independent of one another, meaning that the value of one observation should not influence another. If data are collected over time or in sequence, the model also assumes no autocorrelation, which occurs when residuals are correlated across observations. Finally, there should be little or no multicollinearity among the independent variables; predictors should not be too highly correlated with each other.
How To: Multiple Linear Regression
To run the Multiple Linear Regression in Jamovi, go to the Analyses tab, select Regression, then Linear Regression.
Select the DV and move it to the Dependent Variable box.
Move interval IVs to the Covariates box.
Move nominal IVs to the Factors box.
Optional: Under the Reference Levels drop-down, select the group of interest for each nominal IV.
Under the Assumption Checks drop-down, select: Autocorrelation test, Collinearity statistics, Normality test, Q-Q plot of residuals, Residual plots, and Cook’s distance.
Under the Model Fit drop-down, select: R, R², Adjusted R², AIC (optional), and F test.
Under the Model Coefficients drop-down, select: Standardized estimate and Confidence intervals.
Under the Estimated Marginal Means drop-down, move the first variable in the left box to the Marginal Means box.
Select the Add New Term Button and add each additional variable in the left box as a separate term.
Under Output, select Marginal means plots and Marginal means tables.
TIP: Jamovi automatically applies dummy coding to the nominal variables.
Understanding the Output
The output from the Multiple Linear Regression is shown below. The screenshots separate the results for display purposes, but the full output appears in a single Jamovi output window when all test options are selected.
Figure 22.1a. Multiple Linear Regression Results with Assumption Tests
Figure 22.1b. Multiple Linear Regression Results with Q–Q plot for Residual Normality
Figure 22.1c. Multiple Linear Regression Results with Residual Plots
Figure 22.1d. Multiple Linear Regression Results with Model Fit Measures and Regression Coefficients
Figure 22.1e. Multiple Linear Regression Results with Estimated Marginal Means with Plot
Figure 22.1f. Multiple Linear Regression Results with Estimated Marginal Means with Plot
Begin by examining the Model Fit Measures table. The R² value indicates the proportion of variance in the dependent variable that is explained by the predictors in the model. The adjusted R² provides a similar estimate but adjusts for the number of predictors included in the model. The overall model test (F test) evaluates whether the set of predictors, taken together, explains a statistically significant amount of variance in the outcome variable.
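The quantities in the Model Fit Measures table follow standard formulas, which can be sketched as follows (pure Python; the sums of squares and sample sizes are hypothetical, and n is the sample size while k is the number of predictors):

```python
# Illustrative formulas behind the Model Fit Measures table.
# ss_residual: sum of squared residuals; ss_total: total sum of squares.

def model_fit(ss_residual, ss_total, n, k):
    r2 = 1 - ss_residual / ss_total                 # proportion of variance explained
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)   # penalizes extra predictors
    f = (r2 / k) / ((1 - r2) / (n - k - 1))         # overall model F statistic
    return r2, adj_r2, f

# Hypothetical model with 2 predictors and n = 30 observations.
r2, adj_r2, f = model_fit(ss_residual=40.0, ss_total=100.0, n=30, k=2)
print(round(r2, 3), round(adj_r2, 3), round(f, 2))
```

Note how the adjusted R² is always at or below R²: adding predictors can only raise R², so the adjustment discounts that automatic gain.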
Next, review the Model Coefficients table. Each row represents a predictor in the model. The unstandardized estimate (B) indicates how much the dependent variable is expected to change for a one-unit increase in the predictor, assuming the other predictors remain constant. The standard error (SE) reflects the precision of that estimate. The t statistic and p-value indicate whether the relationship between the predictor and the dependent variable is statistically significant. The standardized estimate (β) expresses the strength of the relationship in standardized units, allowing for comparison of the relative influence of predictors measured on different scales. The confidence interval provides a range of plausible values for the coefficient.
The Cook’s distance summary helps identify observations that have a disproportionate influence on the regression model. A common guideline is that Cook’s distance values greater than 1 may indicate highly influential observations, while values substantially larger than the average may warrant closer inspection.
The Durbin–Watson test evaluates whether residuals are independent across observations. The Durbin–Watson statistic ranges from 0 to 4, with values near 2 indicating little or no autocorrelation. Values substantially below 2 suggest positive autocorrelation, while values substantially above 2 suggest negative autocorrelation. Jamovi also reports a p-value, which tests the null hypothesis that the residuals are independent. A non-significant p-value suggests that the assumption of independent errors has not been violated.
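The Durbin–Watson statistic itself is simple to compute from the residuals: it is the sum of squared differences between successive residuals divided by the sum of squared residuals. A sketch with hypothetical residuals (pure Python):

```python
# Illustrative Durbin-Watson statistic. Values near 2 suggest little
# autocorrelation; below 2 positive, above 2 negative autocorrelation.

def durbin_watson(residuals):
    diffs = sum((residuals[i] - residuals[i - 1]) ** 2
                for i in range(1, len(residuals)))
    return diffs / sum(e ** 2 for e in residuals)

# Hypothetical residuals that alternate in sign: negative autocorrelation
# pushes the statistic above 2.
print(durbin_watson([1.0, -1.0, 1.0, -1.0, 1.0]))  # 3.2
```

Residuals that drift slowly in one direction would instead give large same-sign runs, small successive differences, and a statistic well below 2.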
The collinearity statistics assess whether predictors are too highly correlated with one another. The variance inflation factor (VIF) indicates how much the variance of a regression coefficient is inflated due to correlation among predictors. A common guideline is that VIF values above 5 (and especially above 10) may indicate problematic multicollinearity. The tolerance value, which is the reciprocal of VIF, should generally be greater than .20 to indicate that predictors are not excessively correlated.
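In the special case of exactly two predictors, VIF and tolerance reduce to simple functions of the correlation between them: tolerance = 1 − r² and VIF = 1 / (1 − r²). A sketch with hypothetical, nearly collinear predictors (pure Python):

```python
# Illustrative VIF/tolerance for the two-predictor case.

def pearson_r(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

def vif_two_predictors(x1, x2):
    tolerance = 1 - pearson_r(x1, x2) ** 2  # unshared variance in x1
    return 1 / tolerance, tolerance

# Hypothetical predictors that are almost perfectly correlated:
# VIF is far above 10 and tolerance far below .20, flagging a problem.
vif, tol = vif_two_predictors([1, 2, 3, 4, 5], [2.1, 3.9, 6.0, 8.1, 9.9])
print(vif, tol)
```

With more than two predictors, the same idea applies, but the r² is taken from regressing each predictor on all of the others.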
The normality test and Q–Q plot assess whether the residuals are approximately normally distributed. When the points in the Q–Q plot follow the diagonal reference line, the normality assumption is considered reasonable.
The residual plots provide a visual check of model assumptions such as linearity and homoscedasticity. A random scatter of points around the horizontal axis generally suggests that the model assumptions are satisfied.
Finally, the Estimated Marginal Means tables and plots illustrate the predicted values of the dependent variable at selected values of the predictors. These values help visualize how the predicted outcome changes as the predictors increase or decrease and provide confidence intervals that indicate the precision of those predictions.
Phrasing Results: Multiple Linear Regression
Use this template to phrase significant model results:
A Multiple Linear Regression was conducted to examine whether [IV1] and [IV2] significantly influenced [DV].
The overall regression model was significant, F(df1, df2) = [F statistic], p < [approximate p-value], with an R² of [R² value], indicating that approximately [XX]% of the variance in [DV] was explained by the model.
Use this template to phrase the significant coefficient results:
[IV1] was a significant predictor of [DV], showing a [positive/negative] relationship (B = [unstandardized estimate], p < [approximate p-value]).
[IV2] also significantly predicted [DV] and demonstrated a [positive/negative] relationship (B = [unstandardized estimate], p < [approximate p-value]).
The standardized coefficient for [IV1] (β = [standardized estimate]) indicated that it had a greater influence on [DV] than [IV2] (β = [standardized estimate]).
Use this template to phrase non-significant model results:
A Multiple Linear Regression was conducted to examine whether [IV1] and [IV2] influenced [DV].
The overall regression model was not significant, F(df1, df2) = [F statistic], p = [p-value], R² = [R² value].
Use this template to phrase the non-significant coefficient results:
[IV1] was not a significant predictor of [DV] (B = [estimate], p = [p-value]).
[IV2] was also not a significant predictor (B = [estimate], p = [p-value]).
TIP: Replace the content inside the brackets with your variables and results, then remove the brackets.
TIP: Follow the coefficient template to add more IVs as needed.
TIP: Use a mix of the phrasing language from the templates to match your results if you have a mix of significant and non-significant results within your coefficients.
22.4 Binomial Logistic Regression
Binomial Logistic Regression is used when the dependent variable is binary, meaning it has two possible outcomes. This model predicts the probability that an observation falls into one of the two outcomes based on one or more independent variables. Unlike linear regression, which predicts a continuous value, logistic regression models the log odds of the outcome. Log odds are the natural logarithm of the odds, where odds represent the ratio of the probability of the event occurring to the probability of it not occurring. The model then uses a logistic function to convert the log odds into predicted probabilities between 0 and 1. Logistic regression is widely used when the outcome is categorical and researchers want to estimate the likelihood of a specific outcome given the values of the predictors.
As in multiple linear regression, nominal independent variables must be transformed into binary variables using dummy coding. These dummy variables are then included in the model as predictors, allowing the model to estimate the effect of each category relative to a reference group.
Assumptions
Binomial Logistic Regression relies on several key assumptions to produce valid results. First, the observations should be independent, meaning that the response of one participant does not influence another. Logistic regression does not require the predictors to be normally distributed or have equal variances, but it does assume a linear relationship between each continuous predictor and the log odds of the outcome. Additionally, the model assumes there is little or no multicollinearity among the independent variables. Highly correlated predictors can distort the estimated effects.
How To: Binomial Logistic Regression
To run the Binomial Logistic Regression in Jamovi, go to the Analyses tab, select Regression, then 2 Outcomes-Binomial under Logistic Regression.
Select the two-group nominal DV and move it to the Dependent Variable box.
Move interval IVs to the Covariates box.
Move nominal IVs to the Factors box.
Under the Reference Levels drop-down, select the baseline DV variable group (e.g., the group opposite of the outcome of interest).
Optional: Select the group of interest for each nominal IV.
Under the Assumption Checks drop-down, select Collinearity statistics.
Under the Model Fit drop-down, select: Deviance, AIC (optional), Overall model test, and Nagelkerke’s R².
Under the Model Coefficients drop-down, select: Odds ratio and Confidence interval under Odds ratio.
Under the Estimated Marginal Means drop-down, move the first variable in the left box to the Marginal Means box.
Select the Add New Term Button and add each additional variable in the left box as a separate term.
Under Output, select Equal cell weight, Confidence intervals, Marginal means plots, and Marginal means tables.
Under the Prediction drop-down, select: Cut-off plot and AUC.
Under Predictive Measures, select: Classification table, Accuracy, Specificity, and Sensitivity.
TIP: Jamovi automatically applies dummy coding to the nominal variables.
Understanding the Output
The output from the Binomial Logistic Regression is shown below. The screenshots separate the results for display purposes, but the full output appears in a single Jamovi output window when all test options are selected.
Figure 22.2a. Binomial Logistic Regression Results with Variable Reference Levels Selected and Assumption Test
Figure 22.2b. Binomial Logistic Regression Results with Model Fit and Coefficient Estimates with Odds Ratios
Figure 22.2c. Binomial Logistic Regression Results with Estimated Marginal Means and Plot
Figure 22.2d. Binomial Logistic Regression Results with Estimated Marginal Means and Plot
Figure 22.2e. Binomial Logistic Regression Results with Estimated Marginal Means and Plot for Nominal Predictor
Figure 22.2f. Binomial Logistic Regression Prediction Results with Cut-Off Plot
Interpretation of a binomial logistic regression begins with the overall model test, which evaluates whether the set of predictors collectively improves the ability to classify outcomes compared with a model that contains no predictors. A statistically significant result indicates that the predictors, taken together, provide useful information for distinguishing between the outcome categories.
Next, examine the model coefficients table, which reports the estimated effect of each predictor on the log odds of the outcome. The Estimate indicates the direction and size of the relationship, the Z statistic and p value test whether the predictor contributes significantly to the model, and the odds ratio expresses the effect in more interpretable terms. Odds ratios greater than 1 indicate increased odds of the outcome as the predictor increases, whereas odds ratios less than 1 indicate decreased odds. The confidence interval helps assess the precision of the estimate; intervals that include 1 suggest the predictor may not meaningfully change the odds.
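The odds ratio is the exponential of the log-odds coefficient, which a short sketch with hypothetical coefficient values makes concrete (pure Python):

```python
import math

# Illustrative: a logistic-regression coefficient (on the log odds
# scale) maps to an odds ratio via exponentiation. A coefficient of 0
# gives OR = 1, meaning no change in the odds.

def odds_ratio(beta):
    return math.exp(beta)

# Hypothetical coefficients.
print(round(odds_ratio(0.0), 2))     # 1.0 -> no change in odds
print(round(odds_ratio(0.693), 2))   # 2.0 -> odds roughly double
print(round(odds_ratio(-0.693), 2))  # 0.5 -> odds roughly halve
```

This is also why a confidence interval that includes 1 on the odds ratio scale corresponds to one that includes 0 on the coefficient scale.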
For categorical predictors, the coefficients compare each category to a reference group, allowing interpretation of how the odds of the outcome differ across categories.
The collinearity statistics assess whether predictors are excessively correlated with one another. Variance inflation factor (VIF) values near 1 and high tolerance values suggest that multicollinearity is not a concern and that each predictor contributes unique information to the model.
The estimated marginal means section presents predicted probabilities of the outcome at representative values of each predictor (typically one standard deviation below the mean, the mean, and one standard deviation above the mean, or across categories). These probabilities help translate the logistic model into more intuitive terms by illustrating how the likelihood of the outcome changes across different predictor values.
Finally, the prediction output evaluates how well the model classifies cases. The classification table compares observed outcomes with predicted outcomes based on a chosen probability cut-off. Predictive measures summarize performance: accuracy reflects overall correct classifications, sensitivity indicates how well the model identifies cases in the positive category, specificity reflects correct identification of the negative category, and the area under the curve (AUC) summarizes the model’s overall ability to distinguish between the two outcome groups. Higher AUC values indicate stronger discriminatory ability.
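Accuracy, sensitivity, and specificity follow directly from the cells of the classification table; a sketch with a hypothetical confusion matrix (pure Python):

```python
# Illustrative predictive measures from a classification table.
# tp/fn count cases in the positive category; tn/fp in the negative.

def predictive_measures(tp, fp, tn, fn):
    accuracy = (tp + tn) / (tp + fp + tn + fn)  # overall correct classifications
    sensitivity = tp / (tp + fn)                # correctly identified positives
    specificity = tn / (tn + fp)                # correctly identified negatives
    return accuracy, sensitivity, specificity

# Hypothetical table: 40 true positives, 10 false positives,
# 45 true negatives, 5 false negatives.
acc, sens, spec = predictive_measures(tp=40, fp=10, tn=45, fn=5)
print(acc, round(sens, 3), round(spec, 3))  # 0.85 0.889 0.818
```

Changing the probability cut-off trades sensitivity against specificity, which is what the cut-off plot visualizes.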
Phrasing Results: Binomial Logistic Regression
Use this template to phrase significant model results:
A Binomial Logistic Regression was conducted to examine whether [IV1] and [IV2] influenced the likelihood of [DV target group].
A significant model was found, χ²(df) = [χ² statistic], p < [approximate p-value], with a Nagelkerke's R² of [Nagelkerke's R² value]. This pseudo-R² indicates that the model explained approximately [XX]% of the variation in the likelihood of [DV target group].
Use this template to phrase the significant coefficient results:
[IV1 reference group] was [more/less] likely than [IV1 group 2] to [DV target group] (β = [estimate], OR = [odds ratio], p < [approximate p-value]).
Additionally, [IV2 reference group] was [more/less] likely than [IV2 group 2] to [DV target group] (β = [estimate], OR = [odds ratio], p < [approximate p-value]).
Use this template to phrase non-significant model results:
A Binomial Logistic Regression was conducted to examine whether [IV1] and [IV2] influenced the likelihood of [DV target group].
The overall model was not significant, χ²(df) = [χ² statistic], p = [p-value].
Use this template to phrase the non-significant coefficient results:
The comparison between [IV1 reference group] and [IV1 group 2] was not significant for predicting [DV target group] (β = [estimate], OR = [odds ratio], p = [p-value]).
Additionally, the comparison between [IV2 reference group] and [IV2 group 2] was not significant (β = [estimate], OR = [odds ratio], p = [p-value]).
TIP: Replace the content inside the brackets with your variables and results, then remove the brackets.
TIP: Follow the coefficient template to add more IVs as needed.
TIP: Use a mix of the phrasing language from the templates to match your results if you have a mix of significant and non-significant results within your coefficients.
22.5 Assuming Regression Establishes Causation
Regression models estimate relationships and predictive patterns among variables. However, statistical association does not imply causation. Without experimental control, random assignment, and appropriate design safeguards, regression results reflect relationships within the data rather than cause-and-effect mechanisms. Interpreting regression coefficients as evidence of causal influence in non-experimental studies overstates what the design can support. Causal claims depend on research design, not statistical procedure alone.
Chapter 22 Summary and Key Takeaways
Prediction in applied research involves using statistical models to estimate an outcome based on the values of one or more predictor variables. Two commonly used predictive models are multiple linear regression and binomial logistic regression. Multiple linear regression estimates a continuous outcome using several independent variables, while binomial logistic regression predicts the probability of a binary outcome. Both models require attention to assumptions, such as linearity (for linear regression), multicollinearity, and independence of observations. Nominal predictors must be properly coded, typically using dummy variables, to ensure accurate estimation. These regression methods are widely used across disciplines to support decision-making, identify key predictors, and forecast future outcomes. Both regression techniques can be performed efficiently in Jamovi, which provides tools for model estimation, assumption checks, and output interpretation.
Regression models how predictor variables are related to an outcome and is used to explain or predict changes in that outcome.
Multiple linear regression is used to predict a continuous outcome from two or more predictors.
Binomial logistic regression is used to predict the probability of a binary outcome using categorical or continuous predictors.
Dummy coding is required when including nominal predictors in regression models.