[R-sig-ME] Phylogenetic Logistic Regression for non-binary data: best practices and programs?

Discussion:

jonnations

2018-05-31 14:42:01 UTC

Hi Listserv,

I am new to this type of work and have tried to make this as clear as
possible.

I am working on a project that models habitat use (y = ground (0) vs.
tree (1)) as a function of body size (x, continuous). My y values come from
the formula:

y = (tree captures / tree effort) /
    ((tree captures / tree effort) + (ground captures / ground effort))

which should provide a ratio of captures in a given habitat while
accounting for effort. My y values are mostly binary, but some species'
values are between 0 and 1. The data look like this example:

y = c(0, 0, 0, 0, 0, 0, 0.25, 0.4, 0.6, 0.9, 0.9, 1, 1, 1, 1)
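
A minimal base R sketch of the effort-corrected ratio described above. The
function name habitat_score, its argument names, and the capture and effort
numbers are invented for illustration; they are not real data.

```r
# Effort-corrected proportion of captures made in trees.
# All names and numbers here are illustrative assumptions.
habitat_score <- function(tree_captures, tree_effort,
                          ground_captures, ground_effort) {
  tree_rate   <- tree_captures / tree_effort
  ground_rate <- ground_captures / ground_effort
  tree_rate / (tree_rate + ground_rate)
}

# Hypothetical species: 3 tree captures and 9 ground captures,
# each from 100 trap-nights of effort.
score <- habitat_score(3, 100, 9, 100)
print(score)  # 0.25
```

A species caught only on the ground scores exactly 0 and one caught only in
trees scores exactly 1, which is where the boundary values come from.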

My goal for the model is to use the species with known habitat "scores" to
predict the habitat value (y) of species from their body size value (x).

There are 2 "random" effects in the model, the relatedness of the species
(the phylogeny, Rp) and the intraspecific variation of the x measurement
(Rs). Both are very important because my 150 data points are distributed
among 22 species.

Using logistic regression, the model takes the form:

logit(Pr(Y = 1)) = a + Bx + Rp + Rs + e

I have two questions for the group. First, is it appropriate to use
logistic regression (or a logit link) on these kinds of non-binary y
values? I have found several examples online of logistic regression with
non-binary variables (links below) but I have not found a publication with
a study design like mine.

Second, are there any suggestions for programs to set up the model? I am
interested in using a Bayesian GLMM method (MCMCglmm, JAGS, etc.). However,
I am worried that the programs will view these data as non-binary and
either insist on an ordinal regression (not what I am doing) or otherwise
impose categorical groupings on the response variable and produce strange
results. Can any GLMM program handle my Rp, Rs, and the non-binary nature
of the y values?

I hope this is clear. Any suggestions will be greatly appreciated! Thanks
for your help and patience.

Best,
Jon

Links mentioned above:
https://stats.stackexchange.com/questions/33562/choose-best-model-between-logit-probit-and-nls?rq=1
https://stats.stackexchange.com/questions/69886/using-logistic-regression-for-a-continuous-dependent-variable?rq=1

--
Jonathan A. Nations
PhD Candidate
Esselstyn Lab
Museum of Natural Sciences
Louisiana State University

Paul Buerkner

2018-05-31 15:19:33 UTC

Hi Jon,

A few thoughts about your response variable first.

When dealing with proportions (values between 0 and 1), the beta
distribution is the usual choice. However, the beta distribution
cannot handle observations at the boundary (i.e. y = 0 or y = 1).

That's obviously a problem for your data. We have multiple options to deal
with that:

We can use a zero-one-inflated-beta distribution which models the data as
three separate processes (0, 1, and everything in between).
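
To make the "three separate processes" concrete, here is a small base R
simulation sketch. The function r_zoib is hypothetical, and the parameter
names (zoi, coi, mu, phi) follow what I understand to be the
parameterization of the zero_one_inflated_beta() family in brms; please
treat that mapping and all numeric values as assumptions.

```r
# Simulate from a zero-one-inflated beta distribution.
# zoi: probability that an observation is exactly 0 or 1
# coi: probability that such a boundary observation is 1 rather than 0
# mu, phi: mean and precision of the beta part for values inside (0, 1)
r_zoib <- function(n, zoi, coi, mu, phi) {
  boundary <- rbinom(n, 1, zoi) == 1
  y <- rbeta(n, mu * phi, (1 - mu) * phi)        # interior values
  y[boundary] <- rbinom(sum(boundary), 1, coi)   # exact 0s and 1s
  y
}

set.seed(1)
y_sim <- r_zoib(1000, zoi = 0.7, coi = 0.4, mu = 0.6, phi = 5)
mean(y_sim == 0 | y_sim == 1)  # share of boundary values, close to zoi
```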

Alternatively, and probably what I would prefer for your data, one
could model the data using an ordinal distribution. This requires the
values between 0 and 1 to be broken up into (not too many) discrete
categories, which leads to some information loss but is at least more
informative than a simple 0/1 treatment as in logistic regression.
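
As a sketch of that categorization step in base R, using the example y
values from the original post. The cut point at 0.5 and the category labels
are arbitrary illustrative choices, not a recommendation.

```r
y <- c(0, 0, 0, 0, 0, 0, 0.25, 0.4, 0.6, 0.9, 0.9, 1, 1, 1, 1)

# Keep exact 0 and exact 1 as their own categories and split the
# interior values at 0.5 (an assumed, arbitrary threshold).
y_label <- ifelse(y == 0, "ground",
           ifelse(y == 1, "tree",
           ifelse(y <= 0.5, "mostly_ground", "mostly_tree")))

y_cat <- factor(
  y_label,
  levels  = c("ground", "mostly_ground", "mostly_tree", "tree"),
  ordered = TRUE
)
table(y_cat)
```

With only a handful of interior values, fewer categories keep each one from
being nearly empty.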

You can fit all of the models above in combination with phylogenetic
structures using the brms R package (some are available in MCMCglmm as
well). Type vignette("brms_phylogenetics") in R for more details.
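
A rough sketch of what such a model call could look like, wrapped in a
function so that nothing is fitted until it is called. Several things here
are assumptions: the wrapper fit_habitat_model and the column names (y_cat,
body_size, species, phylo) are hypothetical, the gr() and data2 syntax is
that of recent brms versions (older versions used a different argument for
the covariance matrix), and mapping Rs to a plain species term is only one
possible reading of the original model.

```r
# Sketch only: requires the brms and ape packages when the function is called.
# `dat` is assumed to have columns y_cat (ordered factor), body_size, species,
# and phylo (a copy of species whose values match the tip labels of `tree`).
fit_habitat_model <- function(dat, tree, ...) {
  # Covariance matrix among species implied by the phylogeny
  A <- ape::vcv.phylo(tree)

  brms::brm(
    y_cat ~ body_size +
      (1 | gr(phylo, cov = A)) +  # phylogenetic effect (Rp)
      (1 | species),              # species effect across repeated rows (Rs?)
    data   = dat,
    data2  = list(A = A),
    family = brms::cumulative("logit"),
    ...
  )
}
```

For the zero-one-inflated beta alternative, the same structure should work
with the raw proportion as the response and
family = brms::zero_one_inflated_beta().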

Best,
Paul

jonnations

2018-05-31 20:14:06 UTC

Hi Paul,

Thank you for the quick response! This is exactly the kind of information I
was hoping for. I had only heard of brms in passing, but after looking
through the vignettes it seems like a good choice. I have been interested
in Stan's algorithms, but correctly scripting a phylogenetic GLMM from
scratch seemed daunting.

Quick question concerning ordinal regression: I thought, perhaps naively,
that ordinal models always "categorize" data and fit separate "slopes" for
each category. My ultimate goal is to use the model to predict a response
value from an explanatory variable for species lacking habitat (response)
data. I had anticipated a posterior distribution of a continuous response
variable for each "newdata" value in predict().

I see that there are several additional ordinal families in brms that I am
unfamiliar with. Perhaps one of these would be best for predicting a
(continuous?) response value, or maybe it is just "better" to predict
membership in an a priori discrete category.

Your thoughts would be greatly appreciated. Thanks again for the help!

Jon

Paul Buerkner

2018-05-31 20:31:24 UTC

Hi Jon,

See https://psyarxiv.com/x8swp/ for a detailed introduction to ordinal
models. For your data I think the cumulative() family probably makes the
most sense among the ordinal families.

Please keep in mind that ordinal models do not "automatically" categorize
the response. You have to categorize it yourself.
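
On the "separate slopes" question, a small base R sketch of how a cumulative
logit model turns one linear predictor into category probabilities. In its
basic form the model has a single slope shared by all categories plus one
threshold between each pair of adjacent categories. The function
cumulative_probs and every number below are hypothetical.

```r
# Category probabilities under a cumulative logit model:
# P(Y <= k) = plogis(tau_k - eta), with eta = b * x shared by all categories.
cumulative_probs <- function(eta, thresholds) {
  cum <- c(plogis(thresholds - eta), 1)  # P(Y <= k) for k = 1..K
  diff(c(0, cum))                        # P(Y = k)
}

# Assumed values: three thresholds (four categories) and one slope.
thresholds <- c(-1, 0, 1)
b <- 0.8
p_small <- cumulative_probs(b * -2, thresholds)  # small-bodied species
p_large <- cumulative_probs(b *  2, thresholds)  # large-bodied species

# One continuous summary per prediction: the expected category on 1..4.
sum(p_small * 1:4)
sum(p_large * 1:4)
```

So a prediction for a new species is a probability for each category, and
those probabilities can be summarized on a continuous scale if needed.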

Paul

Post by jonnations
Hi Paul,
Thank you for the quick response! This is the exact kind of information I
was hoping for. I had just heard of brms in passing, but after looking
through the vignettes it seems like a good choice. I have been interested
in STAN's algorithms but correctly scripting a phylogenetic glmm from
scratch seemed daunting.
Quick question concerning ordinal regression: I though - perhaps naively -
that ordinal models always "categorize" data and fit separate "slopes" for
each category. My ultimate goal is to use the model to predict a response
value from an explanatory variable for species lacking habitat (response)
data. I had anticipated a posterior distribution of a continuous response
variable for each "newdata" value in predict().
I see that there are several additional ordinal families in brms that I am
unfamiliar with. Perhaps one of these would be best for predicting a
(continuous?) response value, or maybe it is just "better" to predict
membership in an apriori discrete category.
Your thoughts would be greatly appreciated. Thanks again for the help!
Jon

Post by Paul Buerkner
Hi Jon,
a few thoughts about your response variable first.
When dealing with proportions (values between 0 and 1) the beta
distribution is what is usually being used. However, the beta distribution
cannot handle observations at the boundary (i.e. y = 0 or 1).
That's obviously a problem for your data. We have multiple options to
We can use a zero-one-inflated-beta distribution which models the data as
three separate processes (0, 1, and everything in between).
Alternatively, and probably something I would prefer for your data, one
could model the data using an ordinal distribution. This will require the
values between 0 and 1 to be brokenup into (not too many) discrete
categories,
which will lead to some information loss but at least is more informative
than a simple 0 1 treatement as in logistic regression.
You can fit all of these models above in combination with phylogenetic
structures using the brms R package (some are available in MCMCglmm as
well). Type vignette("brms_phylogenetics") in R for more details.
Best,
Paul

Post by jonnations
Hi Listserv,
I am new to this type of work and have tried to make this as clear as
possible.
I am working on a project that models habitat use (y = ground(0) vs.
tree(1)) and body size (x = body size, continuous). My y variables are from
y=((tree captures / tree effort)) / (tree captures / tree effort) +
(ground captures / ground effort)
which should provide a ratio of captures in a given habitat while
accounting for effort. My y values are mostly binary, but some species'
y = c(0, 0, 0, 0, 0, 0, 0.25, 0.4, 0.6, 0.9, 0.9, 1, 1, 1, 1)
My goal for the model is to use the species with known habitat "scores" to
predict the habitat value (y) of species from their body size value (x).
There are 2 "random" effects in the model, the relatedness of the species
(the phylogeny, Rp) and the intraspecific variation of the x measurement
(Rs). These are both very important as my 150 data points are distributed
between 22 species.
Using logistic regression, the model takes the form: logit (Pr ( Y = 1 ))
= a + Bx + Rp + Rs + e
I have two questions for the group. First, is it appropriate to use
logistic regression (or a logit link) on these kinds of non-binary y
values? I have found several examples online of logistic regression with
non-binary variables (links below) but I have not found a publication with
a study design like mine.
Second, any suggestions of programs for setting up the model? I am
interested in using a bayesian glmm method (MCMCglmm, jags, etc.), however
I am worried that the programs will view these data as non-binary and
either insist on an ordinal regression (not what I am doing) or otherwise
provide categorical groupings on the response variable and produce strange
results. Can any glmm program handle my Rp, Rs, and the non-binary nature
of the y variables?
I hope this is clear. Any suggestions will be greatly appreciated! Thanks
for your help and patience.
Best,
Jon
https://stats.stackexchange.com/questions/33562/choose-best-
model-between-logit-probit-and-nls?rq=1
https://stats.stackexchange.com/questions/69886/using-logist
ic-regression-for-a-continuous-dependent-variable?rq=1
--
Jonathan A. Nations
PhD Candidate
Esselstyn Lab
Museum of Natural Sciences
Louisiana State University
[[alternative HTML version deleted]]
_______________________________________________
https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models

--
Jonathan A. Nations
PhD Candidate
Esselstyn Lab <http://www.museum.lsu.edu/esselstyn>
Museum of Natural Sciences <http://sites01.lsu.edu/wp/mns>
Louisiana State University

[[alternative HTML version deleted]]