20: Prediction
20.1 Predictive Thinking
In quantitative research, prediction involves using statistical models to estimate the value of a dependent variable based on one or more independent variables. The goal is to understand the relationship between these variables and to apply that understanding to forecast outcomes for future or unseen data. Prediction plays a crucial role in supporting evidence-based decision-making and helping researchers anticipate likely results. It is important to distinguish prediction from explanation, which focuses on identifying the underlying causes of a relationship. Prediction is primarily concerned with forecasting outcomes as accurately as possible, while explanation aims to test theories and understand why those outcomes occur.
20.2 Understanding Regression
Regression is a statistical method used to model the relationship between an independent variable (also called a predictor) and a dependent variable (also called an outcome). It helps quantify how changes in the predictor are associated with changes in the outcome, often for the purpose of making predictions or understanding the nature of the relationship. While other relational tests such as correlation measure the strength and direction of a relationship between two variables, regression goes further by modeling that relationship and estimating how much the outcome changes for each unit change in the predictor. This makes regression especially useful for forecasting and for evaluating the unique contribution of one or more predictors.
Regression models can range from simple to complex, depending on the number of predictors involved and the nature of the outcome variable. A simple regression model includes one predictor and is often used to examine the basic relationship between two variables. More complex models can include multiple predictors, interaction terms, or transformations of variables to capture more detailed relationships. Regression can also be applied to different types of outcome variables, such as continuous outcomes in linear regression or categorical outcomes in logistic regression. Despite these differences, the core purpose remains the same: to understand how variables relate to one another and to make informed predictions based on those relationships.
20.3 Multiple Linear Regression
Multiple Linear Regression is used to predict the value of a continuous dependent variable based on two or more independent variables. The relationship between the dependent variable and each independent variable is assumed to be linear, meaning that changes in the predictors are associated with consistent, proportional changes in the outcome. This method is commonly applied when researchers aim to understand how multiple factors together relate to or influence a single outcome.
When including nominal independent variables in a regression model, they must be transformed into a numerical format the model can interpret, typically through a process called dummy coding. Dummy coding creates binary variables to represent the categories of a nominal variable. For variables with only two categories, one dummy variable is created to indicate membership in one category. For nominal variables with more than two categories, multiple dummy variables are created, one fewer than the total number of categories. Each dummy variable compares one category to a designated reference group.
This approach allows the regression model to estimate how each category influences the dependent variable relative to the reference group. The coefficient for each dummy variable reflects how much higher or lower the predicted outcome is compared to the baseline category, while holding all other variables constant.
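Jamovi performs this dummy coding automatically (see the how-to below), but the transformation itself can be sketched in Python with pandas. The variable name `region` and its categories are hypothetical, used only to show the k − 1 coding scheme:

```python
import pandas as pd

# Hypothetical data: "region" is a nominal variable with three categories.
df = pd.DataFrame({"region": ["North", "South", "West", "North", "South"]})

# drop_first=True keeps k-1 dummies for k categories; the dropped
# category ("North", first alphabetically) becomes the reference group.
dummies = pd.get_dummies(df["region"], prefix="region", drop_first=True)
print(dummies.columns.tolist())  # ['region_South', 'region_West']
```

Each remaining dummy column is 1 for rows in that category and 0 otherwise, so its regression coefficient compares that category to the reference group.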
Assumptions
Multiple Linear Regression relies on several key assumptions to produce valid results. First, there should be a linear relationship between the dependent variable and each independent variable. The residuals (the differences between observed and predicted values) should be approximately normally distributed. The model also assumes homoscedasticity, meaning that the variance of the residuals is consistent across all levels of the independent variables. Observations must be independent of one another, meaning that the value of one observation should not influence another. If data are collected over time or in sequence, the model also assumes no autocorrelation, which occurs when residuals are correlated across observations. Finally, there should be little or no multicollinearity among the independent variables; predictors should not be too highly correlated with each other.
How To: Multiple Linear Regression
To run the Multiple Linear Regression in Jamovi, go to the Analyses tab, select Regression, then Linear Regression.
- Select the DV and move it to the Dependent Variable box.

- Move interval IVs to the Covariates box.
- Move nominal IVs to the Factors box.
- Optional: Under the Reference Levels drop-down, select the group of interest for each nominal IV.
- Under the Assumption Checks drop-down, select: Autocorrelation test, Collinearity statistics, Normality test, Q-Q plot of residuals, Residual plots, and Cook’s distance.
- Under the Model Fit drop-down, select: R, R², Adjusted R², AIC, and F test.
- Under the Model Coefficients drop-down, select: Standardized estimate and Confidence intervals.
- Under the Estimated Marginal Means drop-down, move the first variable in the left box to the Marginal Means box.
- Select the Add New Term Button and add each additional variable in the left box as a separate term.
- Under Output, select Marginal means plots and Marginal means tables.
TIP: Jamovi automatically applies dummy coding to the nominal variables.
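The model Jamovi estimates in these steps is ordinary least squares. As a rough illustration only, the fit can be sketched in plain numpy on simulated data; the variable names (`hours`, `attended`, `score`) and the true coefficients are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100
hours = rng.uniform(0, 10, n)       # hypothetical interval IV
attended = rng.integers(0, 2, n)    # hypothetical dummy-coded nominal IV
score = 50 + 3 * hours + 5 * attended + rng.normal(0, 2, n)  # DV

# Design matrix with an intercept column, then OLS via least squares.
X = np.column_stack([np.ones(n), hours, attended])
beta, *_ = np.linalg.lstsq(X, score, rcond=None)

# R^2: proportion of variance in the DV explained by the model.
pred = X @ beta
r2 = 1 - ((score - pred) ** 2).sum() / ((score - score.mean()) ** 2).sum()
print(np.round(beta, 1), round(r2, 2))
```

The recovered coefficients land near the true values used to simulate the data, and the dummy coefficient is read exactly as described above: the predicted difference in `score` between the two groups, holding `hours` constant.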
Phrasing Results: Multiple Linear Regression
Use this template to phrase significant model results:
- A Multiple Linear Regression was conducted to examine whether [IV1] and [IV2] significantly influenced [DV].
- The overall regression model was significant, F(df1, df2) = [F statistic], p < [approximate p-value], with an R² of [R² value], indicating that approximately [XX]% of the variance in [DV] was explained by the model.
Use this template to phrase the significant coefficient results:
- [IV1] was a significant predictor of [DV], showing a [positive/negative] relationship (β = [unstandardized estimate], p < [approximate p-value]).
- [IV2] also significantly predicted [DV] and demonstrated a [positive/negative] relationship (β = [unstandardized estimate], p < [approximate p-value]).
- The standardized coefficient for [IV1] (β* = [standardized estimate]) indicated that it had a greater influence on [DV] than [IV2] (β* = [standardized estimate]).
Use this template to phrase non-significant model results:
- A Multiple Linear Regression was conducted to examine whether [IV1] and [IV2] influenced [DV].
- The overall regression model was not significant, F(df1, df2) = [F statistic], p = [p-value], R² = [R² value].
Use this template to phrase the non-significant coefficient results:
- [IV1] was not a significant predictor of [DV] (β = [estimate], p = [p-value]).
- [IV2] was also not a significant predictor (β = [estimate], p = [p-value]).
TIP: Replace the content inside the brackets with your variables and results, then remove the brackets.
TIP: Follow the coefficient template to add more IVs as needed.
TIP: Use a mix of the phrasing language from the templates to match your results if you have a mix of significant and non-significant results within your coefficients.
20.4 Binomial Logistic Regression
Binomial logistic regression is used when the dependent variable is binary, meaning it has two possible outcomes. This model predicts the probability that an observation falls into one of the two outcomes based on one or more independent variables. Unlike linear regression, which predicts a continuous value, logistic regression models the log odds of the outcome. Log odds are the natural logarithm of the odds, where odds represent the ratio of the probability of the event occurring to the probability of it not occurring. The model then uses a logistic function to convert the log odds into predicted probabilities between 0 and 1. Logistic regression is widely used when the outcome is categorical and researchers want to estimate the likelihood of a specific outcome given the values of the predictors.
As in multiple linear regression, nominal independent variables must be transformed into binary variables using dummy coding. These dummy variables are then included in the model as predictors, allowing the model to estimate the effect of each category relative to a reference group.
Assumptions
Binomial logistic regression relies on several key assumptions to produce valid results. First, the observations should be independent, meaning that the response of one participant does not influence another. Logistic regression does not require the predictors to be normally distributed or have equal variances, but it does assume a linear relationship between each continuous predictor and the log odds of the outcome. Additionally, the model assumes there is little or no multicollinearity among the independent variables. Highly correlated predictors can distort the estimated effects.
How To: Binomial Logistic Regression
To run the Binomial Logistic Regression in Jamovi, go to the Analyses tab, select Regression, then 2 Outcomes-Binomial under Logistic Regression.
- Select the two-group nominal DV and move it to the Dependent Variable box.
- Move interval IVs to the Covariates box.
- Move nominal IVs to the Factors box.
- Under the Reference Levels drop-down, set the DV's reference level to the group that is the opposite of the outcome of interest.
- Optional: Select the group of interest for each nominal IV.
- Under the Assumption Checks drop-down, select Collinearity statistics.
- Under the Model Fit drop-down, select: Deviance, AIC, Overall model test, and Nagelkerke’s R².
- Under the Model Coefficients drop-down, select Odds ratio and, beneath it, Confidence interval.
- Under the Estimated Marginal Means drop-down, move the first variable in the left box to the Marginal Means box.
- Select the Add New Term Button and add each additional variable in the left box as a separate term.
- Under Output, select Equal cell weight, Confidence intervals, Marginal means plots, and Marginal means tables.
- Under the Prediction drop-down, select: Cut-off plot and AUC.
- Under Predictive Measures, select: Classification table, Accuracy, Specificity, and Sensitivity.
TIP: Jamovi automatically applies dummy coding to the nominal variables.
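Behind the GUI, binomial logistic regression is typically fit by maximum likelihood. As an illustration only, a minimal numpy sketch using Newton-Raphson (iteratively reweighted least squares) on simulated data, with the hypothetical true model log odds = −0.5 + 1.2x, and odds ratios obtained as e^β:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500
x = rng.normal(size=n)
# Hypothetical true model: log odds = -0.5 + 1.2 * x
p = 1 / (1 + np.exp(-(-0.5 + 1.2 * x)))
y = rng.binomial(1, p)  # binary DV

# Fit by Newton-Raphson (iteratively reweighted least squares).
X = np.column_stack([np.ones(n), x])
beta = np.zeros(2)
for _ in range(25):
    mu = 1 / (1 + np.exp(-X @ beta))    # predicted probabilities
    grad = X.T @ (y - mu)               # gradient of the log-likelihood
    W = mu * (1 - mu)                   # observation weights
    hess = X.T @ (X * W[:, None])       # negative Hessian
    beta = beta + np.linalg.solve(hess, grad)

odds_ratios = np.exp(beta)  # e^beta: multiplicative change in odds per unit
print(np.round(beta, 2), np.round(odds_ratios, 2))
```

An odds ratio above 1 means the predictor increases the odds of the target outcome; here the slope's odds ratio is near e^1.2 ≈ 3.3, meaning each one-unit increase in `x` roughly triples the odds.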
Phrasing Results: Binomial Logistic Regression
Use this template to phrase significant model results:
- A Binomial Logistic Regression was conducted to examine whether [IV1] and [IV2] influenced the likelihood of [DV target group].
- A significant model was found, χ²(df) = [χ² statistic], p < [approximate p-value], with a Nagelkerke’s R² of [Nagelkerke’s R² value]. This indicates that the model explained approximately [XX]% of the variance in the likelihood of [DV target group].
Use this template to phrase the significant coefficient results:
- [IV1 reference group] was [more/less] likely than [IV1 group 2] to be in the [DV target group] (β = [estimate], OR = [odds ratio], p < [approximate p-value]).
- Additionally, [IV2 reference group] was [more/less] likely than [IV2 group 2] to be in the [DV target group] (β = [estimate], OR = [odds ratio], p < [approximate p-value]).
Use this template to phrase non-significant model results:
- A Binomial Logistic Regression was conducted to examine whether [IV1] and [IV2] influenced the likelihood of [DV target group].
- The overall model was not significant, χ²(df) = [χ² statistic], p = [p-value].
Use this template to phrase the non-significant coefficient results:
- The comparison between [IV1 reference group] and [IV1 group 2] was not significant for predicting [DV target group] (β = [estimate], OR = [odds ratio], p = [p-value]).
- Additionally, the comparison between [IV2 reference group] and [IV2 group 2] was not significant (β = [estimate], OR = [odds ratio], p = [p-value]).
TIP: Replace the content inside the brackets with your variables and results, then remove the brackets.
TIP: Follow the coefficient template to add more IVs as needed.
TIP: Use a mix of the phrasing language from the templates to match your results if you have a mix of significant and non-significant results within your coefficients.
Chapter 20 Summary and Key Takeaways
Prediction in applied research involves using statistical models to estimate an outcome based on the values of one or more predictor variables. Two commonly used predictive models are multiple linear regression and binomial logistic regression. Multiple linear regression estimates a continuous outcome using several independent variables, while binomial logistic regression predicts the probability of a binary outcome. Both models require attention to assumptions, such as linearity (for linear regression), multicollinearity, and independence of observations. Nominal predictors must be properly coded, typically using dummy variables, to ensure accurate estimation. These regression methods are widely used across disciplines to support decision-making, identify key predictors, and forecast future outcomes. Both regression techniques can be performed efficiently in Jamovi, which provides tools for model estimation, assumption checks, and output interpretation.
- Regression models how predictor variables are related to an outcome and is used to explain or predict changes in that outcome.
- Multiple linear regression is used to predict a continuous outcome from two or more predictors.
- Binomial logistic regression is used to predict the probability of a binary outcome using categorical or continuous predictors.
- Dummy coding is required when including nominal predictors in regression models.