Discussion:
[R-sig-ME] pre/post with partial participation
Paul Johnson
2018-11-13 16:11:22 UTC
Permalink
I have a crazy ANOVA question and would appreciate your advice.

We have a project that did a pre-post measurement, but the
participation in the data collection was haphazard. There are only 19
people that participated pre-post, but there are about 40 that
participated only in the pre phase and 30 that participated in the
post phase.

I don't have the data to show you, but I made some up. I've got
ID=1:19 for the ones who are both in pre and post data samples, and ID
20:59 are in pre only and 60:89 are post only.

In my first example, the data has no true random effect and lmer gets
the correct estimate, the random effect is estimated as 0.00, or close
to it. I find that slight differences in the way I generate the data
(either I get exactly 0.0 or 7 x 10^-13 or similar).

This way of making the data generates a "full pre/post" data set and
then throws away pre and post observations for the missing cases:

set.seed(234234)
dat4 <- data.frame(ID = rep(1:89, 2), x = gl(2, 89, labels = c("pre", "post")))
err <- rnorm(length(dat4$x), 0, 1)
b <- 0
beta <- 4
dat4$y <- ifelse(dat4$x == "pre", 40 + err, 40 + beta + err) + b
## Now omit the
## post measurement for ID 20:59
## pre measurement for ID 60:89
dat4 <- dat4[!(dat4$ID %in% 20:59 & dat4$x == "post"), ]
dat4 <- dat4[!(dat4$ID %in% 60:89 & dat4$x == "pre"), ]

library(lme4)

m1 <- lmer(y ~ x + (1 | ID), dat4)
summary(m1)

Output shows nearly 0 random ID variance:

Linear mixed model fit by REML ['lmerMod']
Formula: y ~ x + (1 | ID)
Data: dat4

REML criterion at convergence: 313.3

Scaled residuals:
Min 1Q Median 3Q Max
-2.4502 -0.6261 -0.0361 0.5753 3.5321

Random effects:
Groups Name Variance Std.Dev.
ID (Intercept) 7.342e-13 8.569e-07
Residual 1.043e+00 1.021e+00
Number of obs: 108, groups: ID, 89

Fixed effects:
Estimate Std. Error t value
(Intercept) 39.8946 0.1330 299.99
xpost 4.2518 0.1974 21.54

I thought that was a happy result, the pre/post effect is estimated
reasonably and the estimator does not find a random effect if there is
none.

Then I put in a random effect.
set.seed(234234)
dat5 <- data.frame(ID = rep(1:89, 2), x = gl(2, 89, labels = c("pre", "post")))
err <- rnorm(length(dat5$x), 0, 1)
b <- rep(rnorm(89, 0, 1), 2)
beta <- 4
dat5$y <- ifelse(dat5$x == "pre", 40 + err, 40 + beta + err) + b
dat5 <- dat5[!(dat5$ID %in% 20:59 & dat5$x == "post"), ]
dat5 <- dat5[!(dat5$ID %in% 60:89 & dat5$x == "pre"), ]
m2 <- lmer(y ~ x + (1 | ID), dat5)
summary(m2)
summary(m2)
Linear mixed model fit by REML ['lmerMod']
Formula: y ~ x + (1 | ID)
Data: dat5

REML criterion at convergence: 367.1

Scaled residuals:
Min 1Q Median 3Q Max
-1.8537 -0.3634 -0.0253 0.3611 2.1063

Random effects:
Groups Name Variance Std.Dev.
ID (Intercept) 1.2441 1.1154
Residual 0.6632 0.8144
Number of obs: 108, groups: ID, 89

Fixed effects:
Estimate Std. Error t value
(Intercept) 39.9205 0.1704 234.29
xpost 4.1333 0.2069 19.98


This "seems" to work reasonably. What dangers await?
--
Paul E. Johnson http://pj.freefaculty.org
Director, Center for Research Methods and Data Analysis http://crmda.ku.edu

To write to me directly, please address me at pauljohn at ku.edu.
Hugo Quené
2018-11-14 12:16:48 UTC
Permalink
Dear Paul,

Thanks for the interesting example.
In this case we do know the true ID effects (b) so we can inspect the
table(dat5$ID) -> tab5 # nr of obs per ID
plot( b[1:89], ranef(m2)$ID[,1], xlab="true ID effect", ylab="estimated ID
effect", pch=ifelse(tab5==2,16,1), cex=2,
     main="m2 of dat5" )
abline(a=0,b=1,lty=4)
legend("top",pch=c(16,1), legend=c("two obs","one obs"), pt.cex=2, ncol=2 )
This confirms that the random estimates for ID deviate more from their
true value if there is only 1 data point available than if there are 2
data points available. With more data points per ID it becomes easier to
separate ID (b) and residual (err) random effects. In other words, some
of the err variance is now considered as part of ID variance. Thus with
the incomplete data in dat5, the variance between ID is overestimated
(estimate 1.24, true 1.00), as illustrated in the plot. Conversely, the
err variance is underestimated (estimate 0.66, true 1.00).

HTH! With kind regards, Hugo Quené
--
Prof.dr. Hugo Quené | hoogleraar Kwantitatieve Methoden | onderwijsdirecteur Undergraduate School | Dept Talen Literatuur en Communicatie | Utrecht inst of Linguistics OTS | Universiteit Utrecht | Trans 10 | kamer1.43 | 3512 JK Utrecht | The Netherlands |+31 30 253 6070 |***@uu.nl |www.uu.nl/gw/medewerkers/HQuene |www.hugoquene.nl |uu.academia.edu/HugoQuene <http://uu.academia.edu/HugoQuene> |


[[alternative HTML version deleted]]
Loading...