[R-sig-ME] GLMM for proportions

Discussion:

Thierry Onkelinx

2018-06-06 14:24:15 UTC

Dear Nicolas,

The cbind(success, failure) notation is used when we aggregate (sum)
the number of successes and failures. The data generating process
behind it is a series of trials, each of which results in either success
or failure. Hence their sums will be integers.

We need to know more about your data generating process in order to
give you sensible advice. Rescaling the data by using different units is
wrong. Compare binom.test(c(1, 9)) and binom.test(c(1000, 9000)). Both
yield exactly the same proportion, but their confidence intervals are
very different. Why? c(1000, 9000) is much more informative than c(1,
9).
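
The comparison above can be run directly. The following is only a small sketch in base R; the object names (small, large, width_small, width_large) are made up for illustration.

```r
# Same observed proportion (0.1), very different amounts of information.
small <- binom.test(c(1, 9))        # 1 success, 9 failures
large <- binom.test(c(1000, 9000))  # 1000 successes, 9000 failures

# The point estimates are identical ...
est_small <- unname(small$estimate)
est_large <- unname(large$estimate)

# ... but the exact 95% confidence intervals differ greatly in width.
width_small <- diff(small$conf.int)
width_large <- diff(large$conf.int)

print(c(estimate_small = est_small, estimate_large = est_large))
print(c(width_small = width_small, width_large = width_large))
```

The interval from 10 trials is far wider than the one from 10000 trials, which is why changing the unit of the counts changes the inference.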

Best regards,

ir. Thierry Onkelinx
Statisticus / Statistician

Vlaamse Overheid / Government of Flanders
INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR NATURE
AND FOREST
Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality Assurance
***@inbo.be
Havenlaan 88 bus 73, 1000 Brussel
www.inbo.be

///////////////////////////////////////////////////////////////////////////////////////////
To call in the statistician after the experiment is done may be no
more than asking him to perform a post-mortem examination: he may be
able to say what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does
not ensure that a reasonable answer can be extracted from a given body
of data. ~ John Tukey
///////////////////////////////////////////////////////////////////////////////////////////


poulin

2018-06-06 14:42:16 UTC

Yes, I meant milliseconds instead of seconds, but I made a mistake. It's not
milliseconds but 0.01 s. My bad.

But as I explained in my second mail, it is rather a matter of going from
frames (lasting 0.04 s) to units of 0.01 s.

Actually, the main problem was that the videos did not all last the same
amount of time. Hence it is not possible to analyse durations without
reference to the total duration of each video.

Nicolas Poulin
Ingénieur de Recherche
Centre de Statistique de Strasbourg (CeStatS)
http://www.math.unistra.fr/CeStatS/

Tél : 03 68 85 0189

IRMA, UMR 7501
Université de Strasbourg et CNRS
7 rue René-Descartes
67084 Strasbourg Cedex

This is more generally a GLM (rather than GLMM) question.
Can you clarify a little bit more? When you say "ms instead of s" do
you mean milliseconds rather than seconds?
If you actually have durations, a Gamma(link="log") or plain
log-Normal analysis (i.e. log-transform and then linear model) might
work. In either case, values of exactly zero will be technically
problematic, and will require you to think a bit more about the
data-generating process.
If you have fractions of a time interval then Beta regression might
work (in glmmTMB or brms or mgcv), or you can logit transform or
(old-fashionedly) arcsin-sqrt transform ...
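
The options listed above could look roughly as follows in base R. This is only a sketch on simulated data: the variables (group, total_s, success_s, prop) are hypothetical, random effects for video or individual are left out, and the offset for video length is an assumption added here to account for videos of different duration. Beta regression is not shown because it needs an extra package such as glmmTMB.

```r
set.seed(3)
# Hypothetical data: one row per video, with the time spent in "success"
# and the total video length, both in seconds.
n_video <- 40
group <- factor(rep(c("a", "b"), each = n_video / 2))
total_s <- runif(n_video, min = 20, max = 60)
prop <- rbeta(n_video, shape1 = ifelse(group == "a", 3, 4), shape2 = 6)
success_s <- prop * total_s

# Option 1: Gamma GLM with log link on the duration itself.
# The offset (an assumption of this sketch) puts videos of different
# length on a common scale.
fit_gamma <- glm(success_s ~ group + offset(log(total_s)),
                 family = Gamma(link = "log"))

# Option 2: log-Normal analysis (log-transform, then a linear model).
fit_lognormal <- lm(log(success_s) ~ group + offset(log(total_s)))

# Option 3: logit-transformed fraction of the time interval.
fit_logit <- lm(qlogis(prop) ~ group)

# Option 4: arcsine-square-root transformed fraction.
fit_asin <- lm(asin(sqrt(prop)) ~ group)

# Values of exactly zero break the log and logit transforms.
zero_problem <- c(log(0), qlogis(0))
print(zero_problem)
```

Which of these is appropriate depends on the data-generating process; the sketch only shows that each option is a short call once the data are one row per video.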

Dear list,
I have a question regarding GLMMs for proportions fitted with lme4.
Such models are fitted using the binomial family. When I fit such
models, I use cbind(success, failure) on the left-hand side of the formula.
The problem arises when, for example, the data are durations (duration of
success and duration of failure) that are not integers when expressed in
seconds.
When fitting a GLM, one can directly use the proportion of success as the
response on the left-hand side of the formula. When trying to do this for
a GLMM, one gets the warning message: « In eval(family$initialize,
rho): non-integer #successes in a binomial glm! »
To avoid this, the biologists I sometimes work with used ms instead of s
for the durations of success and failure, but then the associated
tests are too powerful...
I am not able to tell whether the displayed warning message is of concern
or not.
So my question is: do you think it is better to use ms instead of s, or
to use the proportion directly?
Thanks in advance for any help that can be provided.
Best regards

_______________________________________________
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models

Andrews, Chris

2018-06-06 14:42:49 UTC

And when Thierry says to sum the numbers of successes and failures, he is referring to outcomes of _independent_ trials. It is unlikely that your counts of microseconds are from independent trials.
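
A small simulation can illustrate why independence matters. This is a sketch under made-up assumptions: each video is a two-state chain in which the behaviour in one frame tends to persist into the next, and the values n_frames, n_videos and stay are arbitrary choices, not taken from the study.

```r
set.seed(42)
n_frames <- 500    # frames per video (assumed)
n_videos <- 2000   # number of simulated videos (assumed)
stay <- 0.95       # probability that the state does not change between frames

# Simulate one video and return its proportion of "success" frames.
simulate_video <- function(n, stay) {
  start <- rbinom(1, 1, 0.5)                    # state in the first frame
  flips <- rbinom(n - 1, 1, 1 - stay)           # 1 = state changes at this frame
  state <- (start + c(0, cumsum(flips))) %% 2   # running state, 0 or 1
  mean(state)
}

props <- replicate(n_videos, simulate_video(n_frames, stay))

var_observed <- var(props)             # actual variance of the proportion
var_binomial <- 0.5 * 0.5 / n_frames   # variance if frames were independent
print(c(observed = var_observed, binomial = var_binomial,
        ratio = var_observed / var_binomial))
```

With persistent behaviour the proportion per video varies far more than the binomial formula assumes, so standard errors based on frame counts (or on any finer time unit) are too small.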

Chris

Crump, Ron

2018-06-06 15:44:44 UTC

It seems more like a set of different time-series (one for each video) with
(0,1) data at each point (frame within video) rather than one observation
per video. So some random effects to model video and time? Auto-correlation
(does that work in a logistic GLMM?)? Random (video) intercepts? Plus some
other stuff.

Ron.

poulin

2018-06-06 14:33:36 UTC

Thanks, Thierry, for this advice. Yes, I was aware of this. Actually, the
data were obtained by analysing videos frame by frame. The video's
temporal resolution was such that each frame's "duration" is considered
to be 0.04 s. My first advice to the biologists was to use the numbers of
frames as the numbers of successes and failures. They did not want this
because they want to speak (and analyse) in terms of real duration.
Hence, using ms instead of frames multiplies the number of attempts
by 4.
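
The effect of multiplying the counts by 4 can be checked on simulated data. The following is only a sketch in base R with hypothetical variables (group, frames, success, failure) and a plain GLM rather than a GLMM; the point is what happens to the standard errors.

```r
set.seed(1)
# Hypothetical data: 30 videos in two groups, counts of frames per video.
n_video <- 30
group <- factor(rep(c("a", "b"), each = n_video / 2))
frames <- sample(200:400, n_video, replace = TRUE)   # total frames per video
p_true <- ifelse(group == "a", 0.3, 0.4)
success <- rbinom(n_video, size = frames, prob = p_true)
failure <- frames - success

# Counts in frames (0.04 s each).
fit_frames <- glm(cbind(success, failure) ~ group, family = binomial)
# Same data in units of 0.01 s: every count is multiplied by 4.
fit_scaled <- glm(cbind(4 * success, 4 * failure) ~ group, family = binomial)

se_frames <- sqrt(diag(vcov(fit_frames)))
se_scaled <- sqrt(diag(vcov(fit_scaled)))

print(rbind(frames = coef(fit_frames), scaled = coef(fit_scaled)))
print(se_frames / se_scaled)   # ratio of standard errors
```

The coefficients do not change, but the standard errors are halved, because the model believes it has seen four times as many trials. That is the sense in which the tests become too powerful.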

They published the results last year
(https://peerj.com/articles/3227/), but someone wrote to the editor to
say that the statistical approach was wrong and that the proportions
should be used directly in the GLMM. This person did not mention that
doing so triggers a warning message.

Best regards

Nicolas Poulin

Ben Bolker

2018-06-06 14:47:50 UTC

The problem even with using frames is that it's hard to believe that
the behaviour in one frame is independent of the behaviour in the next
(an assumption of the binomial response). So I agree that a binomial
approach is probably wrong.

Possibilities:

- using a quasibinomial model would take care of at least some of the
non-independence problem
- a Beta model
- transformed ratios
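
As an illustration of the first possibility, here is a sketch of a quasibinomial fit in base R. The data are simulated with extra between-video variation (all names and parameter values are hypothetical), and random effects are again left out.

```r
set.seed(7)
# Hypothetical data: 40 videos of 500 frames each, in two groups.
n_video <- 40
group <- factor(rep(c("a", "b"), each = n_video / 2))
frames <- rep(500, n_video)
# The success probability varies between videos (Beta distributed), so the
# counts are overdispersed relative to a binomial distribution.
mu <- ifelse(group == "a", 0.3, 0.4)
p_video <- rbeta(n_video, shape1 = mu * 5, shape2 = (1 - mu) * 5)
success <- rbinom(n_video, size = frames, prob = p_video)
failure <- frames - success

fit_bin <- glm(cbind(success, failure) ~ group, family = binomial)
fit_quasi <- glm(cbind(success, failure) ~ group, family = quasibinomial)

dispersion <- summary(fit_quasi)$dispersion
se_bin <- coef(summary(fit_bin))[, "Std. Error"]
se_quasi <- coef(summary(fit_quasi))[, "Std. Error"]

print(dispersion)
print(se_quasi / se_bin)   # inflation of the standard errors
```

The point estimates are the same in both fits; the quasibinomial fit inflates the standard errors by the square root of the estimated dispersion, which absorbs at least part of the non-independence between frames.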
