3 Confirmatory Factor Analysis
You can download the R code used in this lab by right-clicking this link and selecting “Save Link As…” in the drop-down menu: confirmatoryfactoranalysis.R
3.1 Loading R Packages
Remember, you only need to install a package once. But if you want to use the functionality of a package, you will need to “load” the package into your environment. To do that for lavaan (and the semTools and psych packages, which we’ll also use in this lab), we use the library() function:
library(lavaan)
library(semTools)
library(psych)
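As a reminder, if any of these packages are missing from your library, you can install them once with install.packages() (shown commented out here so it doesn’t re-run every time you run the script):
# install.packages(c("lavaan", "semTools", "psych"))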
3.2 Loading data into our environment
Typically, you will load your own data into your environment, like we did in the Correlations lab. However, you can also use datasets that are included with R packages. To access those datasets, you can use the data() function:
data("HolzingerSwineford1939")
If you look at your Environment tab, you should see a new data frame called HolzingerSwineford1939. We can take a look at the variables in this data frame using the describe() function that is part of the psych package:
describe(HolzingerSwineford1939)
vars n mean sd median trimmed mad min max range skew
id 1 301 176.55 105.94 163.00 176.78 140.85 1.00 351.00 350.00 -0.01
sex 2 301 1.51 0.50 2.00 1.52 0.00 1.00 2.00 1.00 -0.06
ageyr 3 301 13.00 1.05 13.00 12.89 1.48 11.00 16.00 5.00 0.69
agemo 4 301 5.38 3.45 5.00 5.32 4.45 0.00 11.00 11.00 0.09
school* 5 301 1.52 0.50 2.00 1.52 0.00 1.00 2.00 1.00 -0.07
grade 6 300 7.48 0.50 7.00 7.47 0.00 7.00 8.00 1.00 0.09
x1 7 301 4.94 1.17 5.00 4.96 1.24 0.67 8.50 7.83 -0.25
x2 8 301 6.09 1.18 6.00 6.02 1.11 2.25 9.25 7.00 0.47
x3 9 301 2.25 1.13 2.12 2.20 1.30 0.25 4.50 4.25 0.38
x4 10 301 3.06 1.16 3.00 3.02 0.99 0.00 6.33 6.33 0.27
x5 11 301 4.34 1.29 4.50 4.40 1.48 1.00 7.00 6.00 -0.35
x6 12 301 2.19 1.10 2.00 2.09 1.06 0.14 6.14 6.00 0.86
x7 13 301 4.19 1.09 4.09 4.16 1.10 1.30 7.43 6.13 0.25
x8 14 301 5.53 1.01 5.50 5.49 0.96 3.05 10.00 6.95 0.53
x9 15 301 5.37 1.01 5.42 5.37 0.99 2.78 9.25 6.47 0.20
kurtosis se
id -1.36 6.11
sex -2.00 0.03
ageyr 0.20 0.06
agemo -1.22 0.20
school* -2.00 0.03
grade -2.00 0.03
x1 0.31 0.07
x2 0.33 0.07
x3 -0.91 0.07
x4 0.08 0.07
x5 -0.55 0.07
x6 0.82 0.06
x7 -0.31 0.06
x8 1.17 0.06
x9 0.29 0.06
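If you want a quick look at the raw data rather than summary statistics, base R’s head() prints the first six rows:
head(HolzingerSwineford1939)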
You can also learn more about this built-in dataset by going to its help page:
?HolzingerSwineford1939
3.3 CFA Step 1: Model Specification
To specify a model in lavaan, we have to write it out in lavaan syntax and assign it to an object (here called HSmodel). Basic syntax for a CFA follows this template:
factorname =~ indicator1 + indicator2 + indicator3
In our model, we have three factors (visual, textual, and speed), each measured by three items (e.g., for visual: x1, x2, and x3). You don’t have to specify that the factors are hypothesized to be correlated; lavaan does this automatically (scroll down for an example of a CFA in which we specify that the factors should not be correlated).
# CFA model specification
HSmodel <- "visual =~ x1 + x2 + x3
            textual =~ x4 + x5 + x6
            speed =~ x7 + x8 + x9"
3.4 CFA Step 2: Model Estimation
Next, we need to estimate the model using our data. In our lecture, we went over the four phases of estimation, but in R, model estimation simplifies to using the cfa() function with our model syntax and data arguments:
fit1 <- cfa(model = HSmodel,
            data = HolzingerSwineford1939)
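As an aside: by default, cfa() scales each factor by fixing its first loading to 1 (the marker-variable approach; you can see those 1.000s in the output below). If you would rather fix the factor variances to 1 instead, lavaan’s std.lv argument does this. A minimal sketch (fit1_std is just a name we pick here):
fit1_std <- cfa(model = HSmodel,
                data = HolzingerSwineford1939,
                std.lv = TRUE) # identify factors by fixing their variances to 1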
3.5 CFA Step 3: Interpreting Model Fit and Parameter Estimates
There are several ways of extracting the model fit and parameter estimates from our fitted lavaan model (called fit1). The most typical way to look at this output is by using the summary() function. Within this function, we can ask for some extra output (fit measures, standardized estimates, R-squares):
summary(fit1,
fit.measures = TRUE,
standardized = TRUE,
rsquare = TRUE)
lavaan 0.6.17 ended normally after 35 iterations
Estimator ML
Optimization method NLMINB
Number of model parameters 21
Number of observations 301
Model Test User Model:
Test statistic 85.306
Degrees of freedom 24
P-value (Chi-square) 0.000
Model Test Baseline Model:
Test statistic 918.852
Degrees of freedom 36
P-value 0.000
User Model versus Baseline Model:
Comparative Fit Index (CFI) 0.931
Tucker-Lewis Index (TLI) 0.896
Loglikelihood and Information Criteria:
Loglikelihood user model (H0) -3737.745
Loglikelihood unrestricted model (H1) -3695.092
Akaike (AIC) 7517.490
Bayesian (BIC) 7595.339
Sample-size adjusted Bayesian (SABIC) 7528.739
Root Mean Square Error of Approximation:
RMSEA 0.092
90 Percent confidence interval - lower 0.071
90 Percent confidence interval - upper 0.114
P-value H_0: RMSEA <= 0.050 0.001
P-value H_0: RMSEA >= 0.080 0.840
Standardized Root Mean Square Residual:
SRMR 0.065
Parameter Estimates:
Standard errors Standard
Information Expected
Information saturated (h1) model Structured
Latent Variables:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
visual =~
x1 1.000 0.900 0.772
x2 0.554 0.100 5.554 0.000 0.498 0.424
x3 0.729 0.109 6.685 0.000 0.656 0.581
textual =~
x4 1.000 0.990 0.852
x5 1.113 0.065 17.014 0.000 1.102 0.855
x6 0.926 0.055 16.703 0.000 0.917 0.838
speed =~
x7 1.000 0.619 0.570
x8 1.180 0.165 7.152 0.000 0.731 0.723
x9 1.082 0.151 7.155 0.000 0.670 0.665
Covariances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
visual ~~
textual 0.408 0.074 5.552 0.000 0.459 0.459
speed 0.262 0.056 4.660 0.000 0.471 0.471
textual ~~
speed 0.173 0.049 3.518 0.000 0.283 0.283
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.x1 0.549 0.114 4.833 0.000 0.549 0.404
.x2 1.134 0.102 11.146 0.000 1.134 0.821
.x3 0.844 0.091 9.317 0.000 0.844 0.662
.x4 0.371 0.048 7.779 0.000 0.371 0.275
.x5 0.446 0.058 7.642 0.000 0.446 0.269
.x6 0.356 0.043 8.277 0.000 0.356 0.298
.x7 0.799 0.081 9.823 0.000 0.799 0.676
.x8 0.488 0.074 6.573 0.000 0.488 0.477
.x9 0.566 0.071 8.003 0.000 0.566 0.558
visual 0.809 0.145 5.564 0.000 1.000 1.000
textual 0.979 0.112 8.737 0.000 1.000 1.000
speed 0.384 0.086 4.451 0.000 1.000 1.000
R-Square:
Estimate
x1 0.596
x2 0.179
x3 0.338
x4 0.725
x5 0.731
x6 0.702
x7 0.324
x8 0.523
x9 0.442
Model Fit
The output above is great, but it can be a lot. To look at just the fit indices, you can use the fitMeasures() function instead. This function will return a ton of fit indices if you do not include the fit.measures argument. Here, we select the main indices that we’re interested in:
fitMeasures(fit1,
fit.measures = c("chisq", "df", "pvalue",
"cfi", "rmsea", "rmsea.ci.lower",
"rmsea.ci.upper", "srmr"))
chisq df pvalue cfi rmsea
85.306 24.000 0.000 0.931 0.092
rmsea.ci.lower rmsea.ci.upper srmr
0.071 0.114 0.065
We can include output = "text" to make the output look a little bit nicer (more like the summary output above):
fitMeasures(fit1,
fit.measures = c("chisq", "df", "pvalue",
"cfi", "rmsea", "rmsea.ci.lower",
"rmsea.ci.upper", "srmr"),
output = "text")
Model Test User Model:
Test statistic 85.306
Degrees of freedom 24
P-value 0.000
User Model versus Baseline Model:
Comparative Fit Index (CFI) 0.931
Root Mean Square Error of Approximation:
RMSEA 0.092
Confidence interval - lower 0.071
Confidence interval - upper 0.114
Standardized Root Mean Square Residual:
SRMR 0.065
Based on all fit indices, the fit of this CFA is poor. First, the Chi-square statistic is significant, indicating poor fit. Sometimes, with larger sample sizes, a significant Chi-square simply reflects a lot of small, trivial misspecifications. However, if we compare the CFI, TLI, RMSEA, and SRMR values to their suggested cutoffs (CFI and TLI ≥ .95, RMSEA ≤ .06, SRMR ≤ .08), they also indicate that the model fits the data poorly.
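If you want to pull just these four indices and eyeball them against the cutoffs, here is a quick sketch (the cutoffs vector is our own bookkeeping, not part of lavaan):
# Conventional cutoffs: CFI/TLI >= .95, RMSEA <= .06, SRMR <= .08
cutoffs <- c(cfi = .95, tli = .95, rmsea = .06, srmr = .08)
round(fitMeasures(fit1, fit.measures = names(cutoffs)), 3)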
Parameter Estimates
There is also a function we can use to extract just the standardized estimates: standardizedSolution(). Within this function, we can specify that some of the output (z-statistics and p-values) is left out. Typically, the significance of parameter estimates is evaluated using the unstandardized solution (which we saw above with the summary() function), so we should not focus on the significance of the standardized estimates. Again, we can include output = "text" to get output that is easier to read:
standardizedSolution(fit1,
zstat = FALSE, pvalue = FALSE,
output = "text")
Latent Variables:
est.std Std.Err ci.lower ci.upper
visual =~
x1 0.772 0.055 0.664 0.880
x2 0.424 0.060 0.307 0.540
x3 0.581 0.055 0.473 0.689
textual =~
x4 0.852 0.023 0.807 0.896
x5 0.855 0.022 0.811 0.899
x6 0.838 0.023 0.792 0.884
speed =~
x7 0.570 0.053 0.465 0.674
x8 0.723 0.051 0.624 0.822
x9 0.665 0.051 0.565 0.765
Covariances:
est.std Std.Err ci.lower ci.upper
visual ~~
textual 0.459 0.064 0.334 0.584
speed 0.471 0.073 0.328 0.613
textual ~~
speed 0.283 0.069 0.148 0.418
Variances:
est.std Std.Err ci.lower ci.upper
.x1 0.404 0.085 0.238 0.571
.x2 0.821 0.051 0.722 0.920
.x3 0.662 0.064 0.537 0.788
.x4 0.275 0.038 0.200 0.350
.x5 0.269 0.038 0.194 0.344
.x6 0.298 0.039 0.221 0.374
.x7 0.676 0.061 0.557 0.794
.x8 0.477 0.073 0.334 0.620
.x9 0.558 0.068 0.425 0.691
visual 1.000 1.000 1.000
textual 1.000 1.000 1.000
speed 1.000 1.000 1.000
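Relatedly, if you want the unstandardized estimates as a data frame (e.g., for building a table in a report), lavaan’s parameterEstimates() function returns them:
parameterEstimates(fit1) # unstandardized estimates with SEs and confidence intervals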
3.6 CFA Step 4: Making Model Adjustments
Our model did not fit our data very well, so we may want to make some adjustments to our model. Remember, there are always ways to make a model fit better (adding parameters), but if they are not informed and justified by theory then we end up with a model that will only fit our data (like the outfits made specifically for Taylor Swift during her Eras tour) and will not generalize to new samples (which is bad!).
We can use the modindices() function (short for “modification indices”) to identify the parameters that, when added, would result in the largest improvements in model fit (in terms of a reduction in the model Chi-square statistic). In the code below, we ask that only parameters with relatively large modification indices (10 or above) are returned, and that the output is sorted from largest improvement to smallest.
modindices(fit1, minimum.value = 10, sort = TRUE)
lhs op rhs mi epc sepc.lv sepc.all sepc.nox
30 visual =~ x9 36.411 0.577 0.519 0.515 0.515
76 x7 ~~ x8 34.145 0.536 0.536 0.859 0.859
28 visual =~ x7 18.631 -0.422 -0.380 -0.349 -0.349
78 x8 ~~ x9 14.946 -0.423 -0.423 -0.805 -0.805
Model Re-Specification
We can use the information from the modification indices to re-specify our model. The largest index is for adding a factor loading from the visual factor to x9 (mi = 36.41), which is “Speeded discrimination straight and curved capitals”. From the description of x9, we can see that this test involves visual ability as well, so I feel that we can justify this modification. This means that the item x9 now loads onto two factors; such a second loading is also called a cross-loading. By adding this cross-loading, we’re making the interpretation of the visual and speed factors more complex.
# Reanalysis
HSmodel2 <- "visual =~ x1 + x2 + x3 + x9
             textual =~ x4 + x5 + x6
             speed =~ x7 + x8 + x9"
Model Estimation
Again, model estimation is straightforward:
fit2 <- cfa(model = HSmodel2,
            data = HolzingerSwineford1939)
Model Fit and Parameter Interpretation
We can look at the model fit and parameter estimates of this new model.
summary(fit2,
fit.measures = TRUE,
standardized = TRUE,
rsquare = TRUE)
lavaan 0.6.17 ended normally after 34 iterations
Estimator ML
Optimization method NLMINB
Number of model parameters 22
Number of observations 301
Model Test User Model:
Test statistic 52.382
Degrees of freedom 23
P-value (Chi-square) 0.000
Model Test Baseline Model:
Test statistic 918.852
Degrees of freedom 36
P-value 0.000
User Model versus Baseline Model:
Comparative Fit Index (CFI) 0.967
Tucker-Lewis Index (TLI) 0.948
Loglikelihood and Information Criteria:
Loglikelihood user model (H0) -3721.283
Loglikelihood unrestricted model (H1) -3695.092
Akaike (AIC) 7486.566
Bayesian (BIC) 7568.123
Sample-size adjusted Bayesian (SABIC) 7498.351
Root Mean Square Error of Approximation:
RMSEA 0.065
90 Percent confidence interval - lower 0.042
90 Percent confidence interval - upper 0.089
P-value H_0: RMSEA <= 0.050 0.133
P-value H_0: RMSEA >= 0.080 0.158
Standardized Root Mean Square Residual:
SRMR 0.045
Parameter Estimates:
Standard errors Standard
Information Expected
Information saturated (h1) model Structured
Latent Variables:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
visual =~
x1 1.000 0.885 0.759
x2 0.578 0.098 5.918 0.000 0.511 0.435
x3 0.754 0.103 7.291 0.000 0.667 0.590
x9 0.437 0.081 5.367 0.000 0.387 0.384
textual =~
x4 1.000 0.989 0.851
x5 1.115 0.066 17.016 0.000 1.103 0.856
x6 0.926 0.056 16.685 0.000 0.916 0.838
speed =~
x7 1.000 0.666 0.612
x8 1.207 0.185 6.540 0.000 0.804 0.795
x9 0.675 0.112 6.037 0.000 0.450 0.447
Covariances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
visual ~~
textual 0.396 0.072 5.506 0.000 0.453 0.453
speed 0.177 0.055 3.239 0.001 0.301 0.301
textual ~~
speed 0.136 0.051 2.675 0.007 0.206 0.206
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.x1 0.576 0.100 5.731 0.000 0.576 0.424
.x2 1.120 0.100 11.153 0.000 1.120 0.811
.x3 0.830 0.087 9.515 0.000 0.830 0.651
.x9 0.558 0.060 9.336 0.000 0.558 0.550
.x4 0.373 0.048 7.800 0.000 0.373 0.276
.x5 0.444 0.058 7.602 0.000 0.444 0.267
.x6 0.357 0.043 8.285 0.000 0.357 0.298
.x7 0.740 0.086 8.595 0.000 0.740 0.625
.x8 0.375 0.094 3.973 0.000 0.375 0.367
visual 0.783 0.134 5.842 0.000 1.000 1.000
textual 0.978 0.112 8.728 0.000 1.000 1.000
speed 0.444 0.097 4.567 0.000 1.000 1.000
R-Square:
Estimate
x1 0.576
x2 0.189
x3 0.349
x9 0.450
x4 0.724
x5 0.733
x6 0.702
x7 0.375
x8 0.633
But how do we know if this model is better (in terms of fitting our data) than the original CFA?
3.7 Comparing Multiple Models to Each Other
To compare the fit of the two models, we can use the compareFit() function:
comp_fit1_fit2 <- compareFit(fit1, fit2)
summary(comp_fit1_fit2)
################### Nested Model Comparison #########################
Chi-Squared Difference Test
Df AIC BIC Chisq Chisq diff RMSEA Df diff Pr(>Chisq)
fit2 23 7486.6 7568.1 52.382
fit1 24 7517.5 7595.3 85.305 32.923 0.32567 1 9.586e-09 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
####################### Model Fit Indices ###########################
chisq df pvalue rmsea cfi tli srmr aic bic
fit2 52.382† 23 .000 .065† .967† .948† .045† 7486.566† 7568.123†
fit1 85.306 24 .000 .092 .931 .896 .065 7517.490 7595.339
################## Differences in Fit Indices #######################
df rmsea cfi tli srmr aic bic
fit1 - fit2 1 0.027 -0.036 -0.052 0.02 30.923 27.216
The output of this function includes a comparison of fit based on the model Chi-square: the Chi-square difference test. If this test is significant, it means that the model with fewer estimated parameters (here fit1) fits the data significantly worse than the model with more estimated parameters (here fit2), so we should select the model with more parameters (fit2). If this test is not significant, it means that the additional parameters of fit2 did not improve fit enough to result in a better model fit than that of fit1, which means that we should stick with the simpler model (fit1). What do the results above tell us?
The output also includes comparisons of the relative fit indices (AIC and BIC). Remember: lower values indicate better fit. Finally, the output also includes comparisons of other fit indices (e.g., CFI, RMSEA). These are sometimes also used to compare the fit of several models. However, there are no clear, universal guidelines on how different these indices need to be before they indicate an improvement/worsening in model fit.
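As an aside, if you only need the Chi-square difference test (and not the full comparison table above), base lavaan can produce it without semTools via the anova() method, which is equivalent to lavTestLRT():
anova(fit1, fit2) # Chi-square (likelihood ratio) difference test only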
3.8 How to specify different types of CFAs
Here is an example of how to specify a hierarchical CFA, in which the three factors are indicators of a higher-order ability factor:
<- "visual =~ x1 + x2 + x3
HSmodel3 textual =~ x4 + x5 + x6
speed =~ x7 + x8 + x9
ability =~ visual + textual + speed"
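Estimating this model works exactly like before (fit3 is just a name we pick here):
fit3 <- cfa(model = HSmodel3,
            data = HolzingerSwineford1939)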
Here is an example of a CFA in which the factors are specified to be uncorrelated with each other:
<- "visual =~ x1 + x2 + x3
HSmodel4 textual =~ x4 + x5 + x6
speed =~ x7 + x8 + x9
visual ~~ 0*textual
visual ~~ 0*speed
textual ~~ 0*speed"
Factor correlations can be fixed to 0 (i.e., removed from the CFA) using the following template:
factorname1 ~~ 0*factorname2
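As a shortcut, you can also keep the original correlated-factors syntax and let cfa() fix all factor covariances to 0 for you via its orthogonal argument (fit4 is just a name we pick here):
fit4 <- cfa(model = HSmodel, # the original three-factor syntax
            data = HolzingerSwineford1939,
            orthogonal = TRUE) # fixes all factor covariances to 0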
Here is an example of a CFA in which the factors are specified to be correlated but the correlations are constrained to be equal (i.e., their parameter estimate is going to be the exact same value):
<- "visual =~ x1 + x2 + x3
HSmodel5 textual =~ x4 + x5 + x6
speed =~ x7 + x8 + x9
visual ~~ a*textual
visual ~~ a*speed
textual ~~ a* speed"
To constrain parameters to be equal, we can give them the same label (this is different from the factor name that we use to specify/name latent factors). The general format for these equality constraints is:
factorname1 ~~ label*factorname2
Or for constraining factor loadings to be equivalent:
factorname1 =~ label*x1 + label*x2 + label*x3
Your label can be any text string (e.g., a, b, eq), but remember to use different labels for different equality constraints. So, if you want to constrain both your loadings and your factor correlations, use the label a for the loadings and b for the correlations.
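Because HSmodel5 only adds equality constraints, it is nested in the original model, so you could test whether constraining the factor correlations to be equal significantly worsens fit (fit5 is just a name we pick here):
fit5 <- cfa(model = HSmodel5,
            data = HolzingerSwineford1939)
anova(fit5, fit1) # equal correlations vs. freely estimated correlations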
3.9 Summary
In this R lab, you learned how to specify, estimate, evaluate and interpret CFAs. In addition, you learned how to re-specify a CFA and compare the fit across several models to select the best-fitting model. Finally, you were introduced to some examples of alternative CFA configurations, such as the higher-order CFA.