[R-sig-ME] Non-Normal and Heteroskedastic Residuals in Longitudinal Model Due to Non-Normal DV - Percentile Bootstrap Sufficient, or Wild Bootstrap Needed?

Discussion:

David Jones

2018-06-11 13:59:24 UTC

I am looking to model quality of life (QOL) as a DV over time. The DV
shows strong negative skew. I am wondering about the best way to
handle this (more detail below). Frequency distribution of QOL and
example code are also at the end of this message.

Many participants just say that their quality of life is great, and
thus there is a ceiling effect with many values clustered at the
highest value. While the distribution resembles y=e^x, I have not been
able to fit a distribution via GLMM that results in normally
distributed and homoskedastic residuals (including gamma and inverse
gaussian). A number of DV transformations have not worked either
(e.g., log, exponential, Box-Cox), in large part because of the large
proportion of values at the maximum level of QOL, which creates a
spike at the end of the distribution. I could try zero-inflated models
by reflecting the DV (multiplying by -1 and shifting so that the
minimum is 0), but even then there would still be a disproportionate
number of values clustered at one end.

My question: I am particularly interested in fixed effects parameters
from a longitudinal model, and was thinking of testing these
parameters by using percentile bootstrap CIs via confint(). However,
the residuals from a lmer model are both non-normal and
heteroskedastic - will percentile bootstrap of beta coefficients
address this, or can only the wild bootstrap address these issues (as
it is targeted to residuals)? I have a basic understanding of the
bootstrap but am not an expert regarding its use in linear models.

Many thanks!

# Example lmer code
library(lme4)
model <- lmer(QOL ~ poly(time, 2) + (time | ID), data = dataset, REML = FALSE)
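
For reference, a minimal sketch of what the percentile bootstrap via confint() looks like in lme4. Since `dataset` is not available here, the sleepstudy data shipped with lme4 stands in for it (an assumption for illustration only), and `nsim = 200` is an arbitrary small value. One caveat worth keeping in mind: as far as the lme4 documentation describes it, confint(method = "boot") runs bootMer(), whose default is a parametric bootstrap that simulates new responses from the fitted model. The intervals therefore still lean on the Gaussian, constant-variance assumptions that the residuals appear to violate.

```r
library(lme4)

# Stand-in data: sleepstudy (Reaction, Days, Subject) replaces the
# unavailable `dataset`; the model mirrors the structure in the post.
fit <- lmer(Reaction ~ poly(Days, 2) + (Days | Subject),
            data = sleepstudy, REML = FALSE)

set.seed(1)
# Percentile bootstrap intervals for the fixed effects only ("beta_").
# Refits on simulated responses may emit convergence or singular-fit
# messages; those are suppressed here to keep the output readable.
ci <- suppressWarnings(suppressMessages(
  confint(fit, parm = "beta_", method = "boot",
          boot.type = "perc", nsim = 200)
))
print(ci)
```

With real data, a larger `nsim` (1000 or more) is the usual choice for percentile intervals.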

# Frequency distribution

QOL valid_percent
25 0.000308261
30 0.000308261
32 0.000308261
34 0.000616523
38 0.000308261
41 0.000308261
45 0.000308261
46 0.000308261
47 0.000308261
48 0.000616523
49 0.000616523
50 0.000616523
51 0.000308261
52 0.000308261
53 0.001541307
54 0.000616523
55 0.001233046
56 0.000616523
57 0.000924784
58 0.000308261
59 0.000924784
60 0.000924784
61 0.001849568
62 0.001541307
63 0.003082614
64 0.001849568
65 0.00215783
66 0.002466091
67 0.004007398
68 0.002466091
69 0.004007398
70 0.002466091
71 0.003699137
72 0.006781751
73 0.004932183
74 0.006781751
75 0.006165228
76 0.007090012
77 0.007706535
78 0.008631319
79 0.010789149
80 0.015104809
81 0.014488286
82 0.01541307
83 0.020345253
84 0.025893958
85 0.03298397
86 0.036066585
87 0.053020962
88 0.064426634
89 0.080147966
90 0.088779285
91 0.452219482

Philippi, Tom

2018-06-14 17:54:05 UTC

David--

I apologize in advance for not answering your precise question, but no one
else has responded, and this response might be more helpful than nothing.

If I understand your frequency data, nearly half of your observations are
tied at the extreme value of 91. No transform is going to make that
distribution approximately normal. Without rather large sample sizes, most
forms of bootstrapping will not produce confidence intervals with nominal
and symmetric coverage. Further, modeling changes in the _mean_ of such
values can obscure or misrepresent how the distribution changes over time.

If you are primarily interested in the fixed effects, would quantile
regression perhaps address your questions of interest? I don't know
"quality of life", but in my field, when I have oddly-distributed response
variables, I'm almost always interested in more than the mean, as the
temporal changes are more than a simple shift of the entire distribution.
For your example data, if 45% of the responses were 91, then longitudinal
trends in a mean are driven by a mixture of changes in that fraction plus
shifts in the length or width of the tail of lower values. Quantile
regression on the lower quantiles (the median in the above data is 90)
might be more informative, as well as more applicable to such data. If
subjects either converge on high scores over time, or start out with high
scores but then diverge as some fraction of subjects accumulate health
problems and have their scores decline over time, quantile regression might
better characterize such changes.
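
The point about the mean versus the quantiles can be checked directly from the frequency table in the original post. The sketch below uses only those published proportions; the helper name `wquant` is made up for illustration. It reports the share of observations at the ceiling, the weighted mean, and a few lower quantiles.

```r
# Frequency table copied from the original post:
# QOL value and proportion of observations at that value.
qol  <- c(25, 30, 32, 34, 38, 41, 45:91)
prop <- c(
  0.000308261, 0.000308261, 0.000308261, 0.000616523, 0.000308261, 0.000308261,
  0.000308261, 0.000308261, 0.000308261, 0.000616523, 0.000616523, 0.000616523,
  0.000308261, 0.000308261, 0.001541307, 0.000616523, 0.001233046, 0.000616523,
  0.000924784, 0.000308261, 0.000924784, 0.000924784, 0.001849568, 0.001541307,
  0.003082614, 0.001849568, 0.00215783,  0.002466091, 0.004007398, 0.002466091,
  0.004007398, 0.002466091, 0.003699137, 0.006781751, 0.004932183, 0.006781751,
  0.006165228, 0.007090012, 0.007706535, 0.008631319, 0.010789149, 0.015104809,
  0.014488286, 0.01541307,  0.020345253, 0.025893958, 0.03298397,  0.036066585,
  0.053020962, 0.064426634, 0.080147966, 0.088779285, 0.452219482
)
stopifnot(length(qol) == length(prop))

# Share of observations sitting at the maximum score.
p_ceiling <- prop[qol == max(qol)]

# Mean implied by the table.
mean_qol <- sum(qol * prop)

# Hypothetical helper: smallest QOL value whose cumulative proportion
# reaches tau (an empirical quantile from grouped data).
wquant <- function(tau) qol[which(cumsum(prop) >= tau)[1]]

taus <- c(0.10, 0.25, 0.50)
q <- sapply(taus, wquant)
names(q) <- taus

print(p_ceiling)
print(mean_qol)
print(q)
```

The median comes out at 90, as stated above, while about 45% of observations sit at 91, so most of the variation is in the lower quantiles.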

I have used lqmm with longitudinal data on limpet sizes with fixed plots as
random effects, and am exploring it for temporal trends in water quality.
The vignette for lqmm uses the Orthodont data from nlme, and includes the
equivalent of (1 + time | subject) as a random effect. lqmm includes a
bootstrap function for objects of class lqm or lqmm. I have yet to
simulate highly skewed or mixture model WQ data to see if (when)
bootstrapped confidence intervals have reasonable coverage, but that is in
the queue for this fall.
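
A rough sketch of the lqmm workflow described above, using the Orthodont data from nlme. The centered covariate `age_c`, the choice of `tau = 0.25`, the diagonal covariance structure, and `R = 50` bootstrap replicates are all assumptions made for illustration, not settings taken from the vignette; check the lqmm documentation before relying on the argument names.

```r
library(lqmm)

# Orthodont: dental distance measured at ages 8-14 within Subject.
data(Orthodont, package = "nlme")
Orthodont$age_c <- Orthodont$age - 11  # centered age (illustrative choice)

set.seed(1)
# Linear quantile mixed model for the lower quartile, with a random
# intercept and random slope per subject (uncorrelated: "pdDiag").
fit_q <- lqmm(fixed = distance ~ age_c, random = ~ age_c, group = Subject,
              covariance = "pdDiag", tau = 0.25, data = Orthodont)
print(coef(fit_q))

# Bootstrap the fitted model; R is kept small so the sketch runs quickly.
bt <- boot(fit_q, R = 50, seed = 1)
print(summary(bt))
```

For the QOL data, the same call would be repeated over several values of `tau` to see whether the time trend differs across the lower quantiles.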

Also, perhaps the real experts on this list can chime in on the form of
your model. While I understand mixed models with a linear term for time as
both a fixed effect and a within-subject random effect, I'm not clear on what
it means to have linear and quadratic fixed-effect terms but only a linear
within-subject term, especially if subjects differ in starting or drop-out times.

My apologies for not directly answering your question. And certainly your
mileage will vary.

Tom

"To do science is to search for repeated patterns, not simply to accumulate
facts..." --Robert MacArthur 1972, Geographical Ecology

"Statistical methods of analysis are intended to aid the interpretation of
data that are subject to appreciable haphazard variability" --Cox &
Hinkley 1974; Theoretical Statistics

David Jones

2018-06-14 23:47:05 UTC

Hi Tom,

Thank you for your detailed follow up. You are correct that many of
the observations are at the extreme value. I am fortunate to have a
fairly large sample (~700 participants with roughly 8 timepoints
each), and I would be hopeful that bootstrapping could come to the
rescue. That being said, it's a tricky situation as you suggest.
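
One bootstrap that does not simulate from the fitted Gaussian model is the case (cluster) bootstrap, which resamples whole participants with replacement and refits the model each time. The sketch below is only an illustration under stated assumptions: the data are simulated (60 hypothetical subjects, 8 time points, scores capped at 91), the names `sim`, `fit_fun`, and `boot_once` are made up, and `B = 100` is far too small for real use. It is not a wild bootstrap, and whether its intervals have good coverage for data this skewed would need to be checked by simulation.

```r
library(lme4)

set.seed(42)
# Simulated stand-in data with a ceiling at 91 (not the real QOL data).
n_id <- 60
n_t  <- 8
sim <- expand.grid(time = 0:(n_t - 1), ID = seq_len(n_id))
b0 <- rnorm(n_id, mean = 92, sd = 5)[sim$ID]      # subject intercepts
b1 <- rnorm(n_id, mean = -0.5, sd = 0.5)[sim$ID]  # subject slopes
sim$QOL <- pmin(91, round(b0 + b1 * sim$time + rnorm(nrow(sim), sd = 3)))

# Fixed effects from a quadratic growth model with random linear slopes.
fit_fun <- function(d) {
  fixef(lmer(QOL ~ time + I(time^2) + (time | ID), data = d, REML = FALSE))
}

# One bootstrap replicate: draw subjects with replacement, keep each
# drawn subject's full trajectory, and relabel IDs so that a subject
# drawn twice is treated as two separate clusters.
boot_once <- function() {
  ids <- sample(unique(sim$ID), replace = TRUE)
  d <- do.call(rbind, lapply(seq_along(ids), function(k) {
    rows <- sim[sim$ID == ids[k], ]
    rows$ID <- k
    rows
  }))
  fit_fun(d)
}

B <- 100
boot_est <- suppressWarnings(suppressMessages(replicate(B, boot_once())))

# Percentile intervals: 2.5% and 97.5% quantiles of each coefficient.
ci <- t(apply(boot_est, 1, quantile, probs = c(0.025, 0.975)))
print(ci)
```

With ~700 participants, each refit is slower, so the replicates would typically be run in parallel.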

I had not considered quantile regression, and a mixed quantile
approach might be a great way to get at this. I am very grateful for
this overall suggestion as well as the specifics to look for in the
lqmm vignette (and how it corresponds to nlme). It is a difficult
analytic situation and your input has been very helpful.

David
