12 Latent Class Analysis
12.1 Loading R Packages
Load the required packages for this lab into your R environment.
library(tidySEM) # mixture models, descriptives, and results tables
library(dplyr)   # data manipulation
library(tidyr)   # reshaping (pivot_longer, pivot_wider)
library(ggplot2) # plotting
12.2 Loading Data
Load the data into your environment. For this lab we will use a dataset based on N = 972 children whose caregivers completed a survey about their socio-emotional development. You can download the data by right-clicking this link and selecting “Save Link As…” in the drop-down menu: data/preschool.csv. Make sure to save it in the folder you are using for this class.
The full dataset and more information about this project can be found here: https://ldbase.org/datasets/38d4a723-c167-4908-a250-2cf29a4ff49b.
plbs <- read.csv("data/preschool.csv")

# Recode values of 1.5 to 1, then convert all items to ordered factors
plbs <- plbs %>%
  mutate(across(everything(), ~ if_else(.x == 1.5, 1, .x)),
         across(everything(), ~ factor(.x,
                                       labels = c("Does not Apply",
                                                  "Sometimes Applies",
                                                  "Most Often Applies"),
                                       ordered = TRUE)))
colnames(plbs) <- c("dontcare", "cooperates", "distracted", "willinghelp", "silly", "acceptnew")
We will focus on six items of the Preschool Learning Behaviors Scale (PLBS). Here are the items:
- Adopts a don’t care attitude to success and failure.
- Cooperates in group activities.
- Is distracted too easily by what is going on in the room, or seeks distractions.
- Is willing to be helped.
- Invents silly ways of doing things.
- Accepts new activities without fear or resistance.
Caregivers could respond with one of three options: Does Not Apply, Sometimes Applies, Most Often Applies.
The tidySEM package has a function, descriptives(), that we can use to get some basic descriptive statistics for our items:
desc <- descriptives(plbs)
desc <- desc[, c("name", "n", "missing", "unique", "mode", "mode_value")]
desc
name n missing unique mode mode_value
1 dontcare 972 0 4 639 Does not Apply
2 cooperates 972 0 4 610 Most Often Applies
3 distracted 972 0 4 411 Sometimes Applies
4 willinghelp 972 0 4 624 Most Often Applies
5 silly 972 0 4 443 Does not Apply
6 acceptnew 972 0 4 492 Most Often Applies
We can also use ggplot() to visualize the distribution of item responses:
plbs_plot <- plbs
plbs_plot <- pivot_longer(plbs_plot, everything())
ggplot(plbs_plot, aes(x = value)) +
  geom_bar() + # counts of each discrete response; equivalent to geom_histogram(stat = "count")
  facet_wrap(~ name) +
  theme_bw()
12.3 Estimate range of latent class solutions
We typically come into an LCA with some theory-based expectations for how many classes might exist in the population. In addition, our overall sample size might put some restrictions on how many classes can feasibly be estimated.
set.seed(9710)
res <- mx_lca(data = plbs, classes = 1:6)
saveRDS(res, file = "data/LCA_results.RDS")
Instead of running the code above, you can download the results by right-clicking here and saving the file into your data folder.
res <- readRDS("data/LCA_results.RDS")
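As a quick sanity check on model complexity, we can count the free parameters by hand. Here is a minimal sketch, assuming the parameterization reflected in the fit table below: 2 thresholds per item per class for our 6 three-category items, plus K − 1 class proportions.

# Free parameters for a K-class LCA with 6 three-category ordinal items
n_params <- function(K, items = 6, thresholds_per_item = 2) {
  K * items * thresholds_per_item + (K - 1)
}
sapply(1:6, n_params) # 12 25 38 51 64 77, matching the Parameters column below

Dividing N = 972 by these counts gives the np_ratio column reported in the next section.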
12.4 Class Solution Comparison
fit <- table_fit(res) # model fit table
fit[, c("Name", "LL", "Parameters", "n", "AIC", "BIC", "saBIC", "Entropy", "prob_min",
        "prob_max", "n_min", "n_max", "np_ratio", "np_local")]
Name LL Parameters n AIC BIC saBIC Entropy
1 1 -5197.036 12 972 10418.072 10476.62 10438.512 1.0000000
2 2 -4946.100 25 972 9942.199 10064.18 9984.783 0.6514148
3 3 -4906.229 38 972 9888.457 10073.87 9953.185 0.6825587
4 4 -4873.592 51 972 9849.184 10098.03 9936.056 0.6264103
5 5 -4856.184 64 972 9840.367 10152.65 9949.382 0.6686257
6 6 -4847.063 77 972 9848.126 10223.84 9979.284 0.7046337
prob_min prob_max n_min n_max np_ratio np_local
1 1.0000000 1.0000000 1.00000000 1.0000000 81.00000 81.000000
2 0.8918582 0.9022232 0.49588477 0.5041152 38.88000 40.166667
3 0.6625695 0.9015618 0.08436214 0.5000000 25.57895 6.833333
4 0.6602120 0.8180830 0.04629630 0.3456790 19.05882 3.750000
5 0.7194970 0.8397305 0.04320988 0.3364198 15.18750 3.500000
6 0.5899152 0.8377095 0.02366255 0.3353909 12.62338 1.916667
What class solution is preferred according to the BIC?
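To check programmatically, here is a minimal sketch (assuming fit retains the Name and BIC columns printed above):

# The solution with the lowest BIC is preferred;
# in the table above, that is the 2-class solution.
fit$Name[which.min(fit$BIC)]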
We might want to explore additional class solutions, but we can see that, starting with the four-class solution, the smallest class has fewer than 5 observations per parameter (the np_local column), which is quite low, especially given the total sample size. Estimating additional classes may cause the model to not be identified within classes. We can visualize the drop in BIC across class solutions using the plot() function:
plot(fit)
The output above also gives us information about the entropy of the model. This value summarizes class separation. It actually represents \(1 - entropy\), but most people who use it for mixture models report it simply as entropy. If entropy is 0, there is no separation between classes: their response distributions overlap fully. If entropy is 1, the classes are fully separated: their response distributions do not overlap at all. Higher values are therefore preferred.
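For reference, here is a minimal sketch of how relative entropy can be computed from individual posterior class probabilities (an assumption: post holds the probability columns of class_prob(res, type = "individual"), with rows summing to 1):

# Relative entropy: 1 - sum(-p * log(p)) / (n * log(K)),
# where p are individual posterior class probabilities
relative_entropy <- function(post) {
  n <- nrow(post)
  K <- ncol(post)
  plogp <- ifelse(post > 0, -post * log(post), 0) # treat 0 * log(0) as 0
  1 - sum(plogp) / (n * log(K))
}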
Lo-Mendell-Rubin Likelihood Ratio Test
We can also use the Lo-Mendell-Rubin likelihood ratio test (LMR-LRT) to make pairwise comparisons across class solutions, testing the null hypothesis that two solutions fit equally well (so a p-value < .05 indicates that the solution with more classes fits significantly better).
lr_lmr(res)
Lo-Mendell-Rubin adjusted Likelihood Ratio Test:
null alt lr df p w2 p_w2
mix1 mix2 10.5 13 2.22e-16 0.588 2.22e-16
mix2 mix3 NA NA NA NA NA
mix3 mix4 NA NA NA NA NA
mix4 mix5 NA NA NA NA NA
mix5 mix6 NA NA NA NA NA
Bootstrapped Likelihood Ratio Test
An alternative to the LMR-LRT is the bootstrapped likelihood ratio test (BLRT). This approach is very computationally expensive (i.e., slow), so we only request 5 bootstrapped replications here. For a publication you should use a much larger number (1000+). You can use packages such as future to run the replications on multiple CPUs and speed things up with parallel computing (see the sketch after the code below).
set.seed(1)
res_blrt <- BLRT(res, replications = 5)
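As a hedged sketch of the parallel set-up (it is an assumption that BLRT() respects a registered future plan in your version of tidySEM; check the documentation):

library(future)
plan(multisession, workers = 4) # spread bootstrap replications over 4 cores
res_blrt <- BLRT(res, replications = 1000)
plan(sequential) # return to single-core processing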
12.5 Interpreting Final Model Results
Before interpreting final estimates, we reorder the latent classes so that they are arranged from largest to smallest:
res_final <- mx_switch_labels(res[[2]])
Classification Diagnostics
We can look at several indicators of classification accuracy and reliability. First, we can look at how many individuals are classified into each class based on their highest posterior probability:
class_prob(res_final, type = "sum.mostlikely") # assumes no classification error
$sum.mostlikely
class count proportion
1 class1 490 0.5041152
2 class2 482 0.4958848
The estimates above assume that each participant is classified into the correct class (does not incorporate any uncertainty). We can also ask for estimates that do account for that uncertainty. If there is no classification error (i.e., every person is classified into 1 class with 100% certainty), the above table would be identical to the below table (which does account for error):
class_prob(res_final, type = "sum.posterior") # includes classification error
$sum.posterior
class count proportion
1 class1 497.3807 0.5117086
2 class2 474.6193 0.4882914
#class_prob(res_final, type = "individual") # classification probabilities for each individual
Are there any major differences between the two classification tables? What does that say about the severity of classification errors?
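One way to answer this is to put the two count tables side by side. A minimal sketch, reusing the list elements printed above:

# Compare modal-assignment counts with posterior-weighted counts;
# small differences suggest classification error is mild
most <- class_prob(res_final, type = "sum.mostlikely")$sum.mostlikely
post <- class_prob(res_final, type = "sum.posterior")$sum.posterior
data.frame(class = most$class,
           mostlikely = most$count,
           posterior = post$count,
           difference = post$count - most$count)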
Next, we can use a cross-table to better understand the certainty of our classifications. Again, we can either ignore classification error, using the function below:
class_prob(res_final, type = "mostlikely.class") # values on diagonal should be > .7
$mostlikely.class
assigned.1 assigned.2
avgprob.1 0.89185824 0.1081418
avgprob.2 0.09777682 0.9022232
Or we can include classification error, using the function below:
class_prob(res_final, type = "avg.mostlikely") # values on diagonal should be > .7
$avg.mostlikely
meanprob.class1 meanprob.class2
assigned.1 0.9052924 0.09470764
assigned.2 0.1115926 0.88840737
For both tables, values on the diagonal represent classification certainty and should be close to 1.
For which of these classes is classification certainty lowest?
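To answer this programmatically, we can extract the diagonal. A small sketch, assuming the cross-table is returned as a numeric matrix as printed above:

# Class-specific classification certainty sits on the diagonal
cp <- class_prob(res_final, type = "mostlikely.class")$mostlikely.class
diag(cp) # the smallest value marks the least certain class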
Parameter Estimates
Now that we’ve evaluated the classification diagnostics, we can interpret the parameter estimates using the table_results() function. With ordinal indicators in LCA, the most relevant estimates are the thresholds. Differences in thresholds across classes indicate that persons in different classes switch from one response option to the next at a different point along the latent response scale. The code below allows us to focus on those thresholds:
table_results(res_final, columns = c("label", "est", "confint", "class")) %>%
  filter(stringr::str_detect(label, "Thresholds")) %>%
  mutate(label = stringr::str_split_i(label, "\\.", 2)) %>%
  pivot_wider(id_cols = label, names_from = class,
              values_from = est:confint, names_vary = "slowest")
# A tibble: 12 × 5
label est_class1 confint_class1 est_class2 confint_class2
<chr> <chr> <chr> <chr> <chr>
1 Thresholds[1,1] 1.04 [0.85, 1.24] -0.12 [-0.26, 0.02]
2 Thresholds[1,2] -2.86 [-3.91, -1.80] -1.35 [-1.52, -1.18]
3 Thresholds[1,3] 0.05 [-0.10, 0.20] -1.24 [-1.45, -1.03]
4 Thresholds[1,4] -2.26 [-2.60, -1.91] -1.70 [-1.91, -1.49]
5 Thresholds[1,5] 0.00 [-0.12, 0.13] -0.23 [-0.36, -0.11]
6 Thresholds[1,6] -1.69 [-1.90, -1.47] -1.09 [-1.25, -0.94]
7 Thresholds[2,1] 2.10 [1.69, 2.52] 1.07 [0.92, 1.23]
8 Thresholds[2,2] -1.62 [-2.02, -1.21] 0.55 [0.34, 0.75]
9 Thresholds[2,3] 1.58 [1.26, 1.91] 0.07 [-0.07, 0.21]
10 Thresholds[2,4] -0.90 [-1.06, -0.74] 0.10 [-0.04, 0.24]
11 Thresholds[2,5] 1.56 [1.34, 1.78] 0.96 [0.81, 1.11]
12 Thresholds[2,6] -0.59 [-0.74, -0.43] 0.58 [0.42, 0.74]
To make it easier to see if response patterns vary across classes, we can use the table_prob() function, which transforms the thresholds into response probabilities. These do not come with a confidence interval, so ideally you first examine the thresholds to note relevant differences across classes and then look at these probabilities to see what they really mean:
table_prob(res_final) %>%
pivot_wider(names_from = group,
values_from = Probability)
# A tibble: 18 × 4
Variable Category class1 class2
<chr> <int> <dbl> <dbl>
1 dontcare 1 0.852 0.454
2 dontcare 2 0.131 0.405
3 dontcare 3 0.0178 0.141
4 cooperates 1 0.00213 0.0884
5 cooperates 2 0.0507 0.619
6 cooperates 3 0.947 0.293
7 distracted 1 0.519 0.107
8 distracted 2 0.424 0.422
9 distracted 3 0.0569 0.471
10 willinghelp 1 0.0120 0.0443
11 willinghelp 2 0.173 0.496
12 willinghelp 3 0.815 0.460
13 silly 1 0.502 0.408
14 silly 2 0.439 0.423
15 silly 3 0.0597 0.169
16 acceptnew 1 0.0457 0.138
17 acceptnew 2 0.233 0.581
18 acceptnew 3 0.721 0.281
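As a worked check of how table_prob() relates to the thresholds (assuming the default probit parameterization, where thresholds are cut points on a standard-normal latent response): for dontcare in class 1, the two thresholds are 1.04 and 2.10, so:

pnorm(1.04)               # P(Does not Apply)     ~ 0.852
pnorm(2.10) - pnorm(1.04) # P(Sometimes Applies)  ~ 0.131
1 - pnorm(2.10)           # P(Most Often Applies) ~ 0.018

These match the first three rows of the probability table above.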
The threshold results above indicate that all thresholds are significantly different across classes (this aligns with the high classification certainty we saw in the diagnostics). Focusing on the response probabilities, it appears that caregivers in class 1 are more likely to endorse positive learning behaviors (cooperates, willing to be helped, accepts new activities), whereas caregivers in class 2 are more likely to endorse negative learning behaviors (don’t care, distracted). The two classes respond similarly to the silly item.
Visualize the classes
Tables can provide a lot of detail, but it can be challenging to understand the bigger picture. That’s why we like to visualize the response patterns of the latent classes. The plot_prob() function includes all classes in one plot, using subplots. The theme() element below was added to make the item labels more readable.
plot_prob(res_final) +
theme(axis.text.x = element_text(angle=45, vjust = 0.5))
What label would you give each of these classes?
12.6 Predicting membership for new individuals
Finally, you can use the LCA results to predict the most likely class membership of new cases, as long as you have their scores on the indicator items. Below, I define a new participant (reusing an existing case from the original sample so that the variables are ordered factors) who has a somewhat inconsistent response pattern (e.g., the distracted item Sometimes Applies):
plbs_new <- plbs[10, ]
plbs_new
dontcare cooperates distracted willinghelp
10 Does not Apply Most Often Applies Sometimes Applies Most Often Applies
silly acceptnew
10 Does not Apply Sometimes Applies
predict(res_final, newdata = plbs_new)
Running mix2 with 0 parameters
class1 class2 predicted
[1,] 0.8485255 0.1514745 1
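If you want to score a genuinely new respondent instead of recycling a row, the factor levels and their ordering must match those used during estimation exactly. A minimal sketch (plbs_new2 is a hypothetical case invented for illustration):

# Build a new case from scratch; levels must match the estimation data
lvls <- c("Does not Apply", "Sometimes Applies", "Most Often Applies")
plbs_new2 <- data.frame(
  dontcare    = factor("Sometimes Applies",  levels = lvls, ordered = TRUE),
  cooperates  = factor("Most Often Applies", levels = lvls, ordered = TRUE),
  distracted  = factor("Most Often Applies", levels = lvls, ordered = TRUE),
  willinghelp = factor("Sometimes Applies",  levels = lvls, ordered = TRUE),
  silly       = factor("Sometimes Applies",  levels = lvls, ordered = TRUE),
  acceptnew   = factor("Does not Apply",     levels = lvls, ordered = TRUE)
)
predict(res_final, newdata = plbs_new2)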
12.7 Summary
In this R lab, you were introduced to the steps involved in specifying, estimating, evaluating, comparing, and interpreting the results of latent class analyses. For now, this is the final lab for this class!