[R-sig-ME] LMER-CorpusData

Discussion:

Taha Omidian

2018-10-07 22:46:38 UTC

Hello,

I’m trying to fit a mixed effects model to my corpus data. The data has a hierarchical structure. I need to make sure that the final model reflects this nested structure.

My final model looks like this:

theMdl <- lmer(dis.norm.j$transformed ~ disciplinaryGroup + genreGroup + level + (1|student_id) + (1|levelA) + (1|levelB) + (1|levelC), data=thedata, control=lmerControl("bobyqa"))

where

levelA is genreGroup:genreFamily:student_id
levelB is disciplinaryGroup:discipline:student_id
levelC is level:student_id

Here is a link to my data and R script: https://www.dropbox.com/sh/46r6lv6n89bromk/AABMc8MQmAYhRC3ubJ0Ii7Wma?dl=0

Thanks

Taha

Phillip Alday

2018-10-09 10:10:37 UTC

I don't think this is the model you're looking for...

1. It's really weird to have your predictors in one dataframe and your
dependent variable in a different one. Are you really sure that the rows
line up like you think they do? If so, why not join the dataframes
earlier (with merge(), plyr::join() or one of the dplyr *_join()
functions such as dplyr::left_join())?

I'm overall quite nervous about namespaces / scope / etc. in your code
-- using attach() isn't recommended practice, especially when you mix
and match things (e.g. your levelX variables aren't in your dataframe,
but the other predictors are). You have to be really careful to make
sure you're using the data you think you're using.

You can do it like you have it, but it makes me very nervous in terms of
computing what you think you're computing.
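
A minimal sketch of such a join in base R, assuming both dataframes carry
the text id in a column named id (that shared key is an assumption, and
the toy values below are made up, not taken from the linked data):

```r
# Hypothetical stand-ins for the two objects used in the original call.
thedata <- data.frame(
  id         = c("t1", "t2", "t3", "t4"),
  student_id = c("s1", "s1", "s2", "s2"),
  level      = c("1", "2", "1", "2")
)
dis.norm.j <- data.frame(
  id          = c("t3", "t1", "t4", "t2"),  # deliberately in a different row order
  transformed = c(0.30, 0.10, 0.40, 0.20)
)

# Join on the text id so that alignment no longer depends on row order.
merged <- merge(thedata, dis.norm.j, by = "id")

# Every text should match exactly once; stop if rows were lost or duplicated.
stopifnot(nrow(merged) == nrow(thedata), !anyDuplicated(merged$id))
merged
```

With the response stored in the same dataframe as the predictors, the
formula can refer to it by name and data= covers every variable, so
attach() is no longer needed.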

2. Your levelX variables put the same predictor in both the fixed effects
and a grouping variable (the part of the random effect after the |).
This generally doesn't make sense -- there are a number of posts on this
mailing list to that effect (see also
https://rpubs.com/INBOstats/both_fixed_random and
https://www.muscardinus.be/2017/08/fixed-and-random/) -- but it depends
on your data.

In other words, seeing your model specification isn't quite enough -- we
also need to know something about your data, more than your variable
names alone reveal. Even though I work a lot with language data, I still
can't tell enough from your variable names and code what your data
actually represent.

Best,
Phillip

Taha Omidian

2018-10-10 10:52:52 UTC

Hi Phillip,

Thanks so much for your reply.

I think the best way to describe the data is to start with the aim of our study. The purpose of our study is to investigate the effect of discipline, genre, and level of study on the use of certain word combinations in learner writing. To represent learner writing, we compiled a corpus of texts collected from students in 30 different disciplines and at four levels of study. Texts in the corpus were then categorised based on their genres (13 genres).

Following this, we classified the disciplines into four major disciplinary groupings. Genres were also grouped under 5 broad categories based on their social purposes. We then searched the corpus for occurrences of 278 word combinations (e.g., on the other hand) and recorded their normalised frequency of occurrence for each text (labeled as ref.norm in our data).

To me, our data is structured in a hierarchical fashion (for each predictor). So here is what we have in our data:

-Students (student_id col) contributed multiple texts (id col)

-Each text is nested within different disciplines (discipline col) which are clustered within four disciplinary groupings (disciplinaryGroup col)

-Each text is nested within genres (genreFamily col) which are grouped into five genre groups (genreGroup col)

-Each text is nested within four levels of study (level col)

Predictors (based on the labels in our data) are: disciplinaryGroup, genreGroup, level
Dependent variable (based on its label in our data) is: ref.norm

So I need to know how this nested structure can be reflected in an LME model.

As always thanks for your help.

T

Phillip Alday

2018-10-17 16:05:44 UTC

Hi Taha,

You can use the term "collocation" with me -- it's more precise than
"word combination". ;)

What seems to be missing from your model are your particular
collocations -- are you doing a separate model for each collocation? Or
are you looking at the combined frequency of all the collocations?
Assuming the answer to one of these questions is yes (and each has its
own implications and potential pitfalls for your inferences) ...

I would massively reduce your random effects structure. I propose the
following basic structure for the model (under the assumption that each
student is only in one discipline):

ref.norm ~ 1 + disciplinaryGroup + genreGroup + level + (1|student_id)
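
As a rough sketch, this formula could be fitted with lme4 as below. The
data are simulated purely for illustration: the group labels, sample
sizes and effect sizes are made up, and only the column names follow the
thread.

```r
library(lme4)

set.seed(1)
n_students <- 40
student <- rep(seq_len(n_students), each = 5)   # 5 texts per student (made up)
n <- length(student)

# Each student sits in exactly one disciplinary group (hypothetical labels).
group_of_student <- rep(c("AH", "LS", "PS", "SS"), length.out = n_students)

thedata <- data.frame(
  student_id        = factor(student),
  disciplinaryGroup = factor(group_of_student[student]),
  genreGroup        = factor(sample(paste0("G", 1:5), n, replace = TRUE)),
  level             = factor(sample(1:4, n, replace = TRUE))
)

# Simulated response: a student-specific intercept plus residual noise.
thedata$ref.norm <- 2 + rnorm(n_students, sd = 0.5)[student] + rnorm(n)

# Random intercept per student; all three predictors as fixed effects.
basic <- lmer(ref.norm ~ 1 + disciplinaryGroup + genreGroup + level +
                (1 | student_id), data = thedata)
summary(basic)
```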

I would seriously consider using an interaction model among
disciplinaryGroup, genreGroup and level, if you have enough data to do
so. Depending on which combinations of disciplinaryGroup, genreGroup and
level are present in the data, this may give you warnings about a
rank-deficient model matrix and dropped columns, but that's okay. lme4
is just telling you that it can't estimate interactions for combinations
that didn't occur and so it won't try.
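
The interaction formula itself is not spelled out in this message. One
possible reading, offered only as an assumption, is the full three-way
interaction of the fixed effects. The sketch below fits it to simulated
data (made-up labels and sizes) in which one combination never occurs,
so that lme4 reports the rank deficiency and drops the affected columns:

```r
library(lme4)

set.seed(4)
n_students <- 60
student <- rep(seq_len(n_students), each = 10)  # 10 texts per student (made up)
n <- length(student)
group_of_student <- rep(c("AH", "LS", "PS", "SS"), length.out = n_students)

thedata <- data.frame(
  student_id        = factor(student),
  disciplinaryGroup = factor(group_of_student[student]),
  genreGroup        = factor(sample(paste0("G", 1:5), n, replace = TRUE)),
  level             = factor(sample(1:4, n, replace = TRUE))
)
thedata$ref.norm <- 2 + rnorm(n_students, sd = 0.5)[student] + rnorm(n)

# Remove one combination entirely so that some interaction cells are empty.
thedata <- subset(thedata, !(disciplinaryGroup == "AH" & genreGroup == "G5"))

# Assumed formula: full three-way interaction, random intercept per student.
inter <- lmer(ref.norm ~ 1 + disciplinaryGroup * genreGroup * level +
                (1 | student_id), data = thedata)

# Fewer coefficients than the 4 * 5 * 4 = 80 cells, because columns were dropped.
length(fixef(inter))
```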

If each student also produced texts in multiple genre groups, then I
would see if changing (1|student_id) to (1+genreGroup|student_id)
improved the fit.

Is each student measured at different levels? If so, then you can
consider doing the same thing for level as for genreGroup, i.e.
(1+level|student_id).

I'm not sure I would include text id in the model because it's not
"repeated" in any meaningful sense and would thus be an
observation-level random effect. Text id essentially is just a way of
distinguishing between repetitions within each unit/student of the
student grouping.

Now, assuming that you don't care about particular disciplines or
genres, but rather just want to see if they account for any additional
variance beyond the coarser disciplinaryGroup and genreGroup
categorizations, you could include them as random effects:

ref.norm ~ 1 + disciplinaryGroup + genreGroup + level + (1|student_id) +
(1|discipline) + (1|genreFamily)
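
A sketch of this model on simulated data, assuming 30 disciplines in 4
groups and 13 genre families in 5 groups as described earlier in the
thread. The assignment of disciplines and families to groups, the number
of students and all effect sizes are made up.

```r
library(lme4)

set.seed(2)
n_students <- 120
student <- rep(seq_len(n_students), each = 5)   # 5 texts per student (made up)
n <- length(student)

# Nesting: student -> discipline -> disciplinaryGroup (hypothetical mapping).
discipline_of_student <- rep(1:30, length.out = n_students)
group_of_discipline   <- rep(c("AH", "LS", "PS", "SS"), length.out = 30)
# Nesting: genreFamily -> genreGroup (hypothetical mapping).
group_of_family <- rep(paste0("G", 1:5), length.out = 13)

family     <- sample(1:13, n, replace = TRUE)   # genre varies within student
discipline <- discipline_of_student[student]

thedata <- data.frame(
  student_id        = factor(student),
  discipline        = factor(discipline),
  disciplinaryGroup = factor(group_of_discipline[discipline]),
  genreFamily       = factor(family),
  genreGroup        = factor(group_of_family[family]),
  level             = factor(sample(1:4, n, replace = TRUE))
)
thedata$ref.norm <- 2 +
  rnorm(n_students, sd = 0.5)[student] +
  rnorm(30, sd = 0.4)[discipline] +
  rnorm(13, sd = 0.4)[family] +
  rnorm(n)

# Plain (1 | factor) terms; nesting and crossing are read off the data.
crossed <- lmer(ref.norm ~ 1 + disciplinaryGroup + genreGroup + level +
                  (1 | student_id) + (1 | discipline) + (1 | genreFamily),
                data = thedata)
VarCorr(crossed)
```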

You don't have to explicitly nest student_id within discipline -- lme4
already picks up on that (provided each student has a unique ID across
disciplines). genreFamily is (at least partially) crossed with
student_id and discipline, and lme4 also picks up on that. (More
precisely, the mathematical formulation that lme4 uses deals with such
structures without any extra work.) This formulation assumes that the
effects of student/discipline and genre are additive; you could
potentially add in a (1|student_id:genreFamily) or
(1|discipline:genreFamily), but (1) I don't think this would explain
that much more variation and (2) you would need a *lot* of data for this
to actually be meaningful and not just overfitting.

Overfitting is actually a potential problem for all of these more
complicated models: make sure that AIC and BIC aren't getting worse!
(The likelihood-ratio test is invalid for non-nested models and tricky
for nested models that only differ in their variance components.
Rejecting a variance component is the same thing as saying it's equal to
zero, which is at the edge of the parameter space for variance, which
means the p-values from the LRT aren't right.)
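
A minimal sketch of such a comparison on simulated data (made-up sizes
and effects). Fitting both models by maximum likelihood (REML = FALSE)
is a choice made here for the illustration, not something stated above.

```r
library(lme4)

set.seed(3)
n_students <- 90
student <- rep(seq_len(n_students), each = 6)   # 6 texts per student (made up)
n <- length(student)
discipline <- rep(1:30, length.out = n_students)[student]

thedata <- data.frame(
  student_id = factor(student),
  discipline = factor(discipline),
  level      = factor(sample(1:4, n, replace = TRUE))
)
thedata$ref.norm <- 2 +
  rnorm(n_students, sd = 0.5)[student] +
  rnorm(30, sd = 0.5)[discipline] +
  rnorm(n)

# Same fixed effects; the larger model adds one variance component.
m_small <- lmer(ref.norm ~ 1 + level + (1 | student_id),
                data = thedata, REML = FALSE)
m_large <- lmer(ref.norm ~ 1 + level + (1 | student_id) + (1 | discipline),
                data = thedata, REML = FALSE)

# Lower is better for both criteria; BIC penalises extra parameters harder.
ic <- data.frame(
  model = c("m_small", "m_large"),
  AIC   = c(AIC(m_small), AIC(m_large)),
  BIC   = c(BIC(m_small), BIC(m_large))
)
ic
```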

Assuming that each discipline only occurs within one discipline group,
disciplinaryGroup:discipline is the same thing as discipline. Same thing
for genreGroup:genreFamily.
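
This can be checked in base R before building any combined factor. The
toy dataframe below is made up; only the column names follow the thread.

```r
# Toy data: each discipline appears under exactly one disciplinary group.
thedata <- data.frame(
  disciplinaryGroup = c("AH", "AH", "LS", "LS", "PS", "PS"),
  discipline        = c("History", "English", "Biology", "Biology",
                        "Physics", "Chemistry")
)

# Nested means: every discipline is paired with exactly one group.
groups_per_discipline <- tapply(thedata$disciplinaryGroup, thedata$discipline,
                                function(g) length(unique(g)))
is_nested <- all(groups_per_discipline == 1)

# Under nesting, the observed group:discipline combinations are just the
# disciplines again, so the combined factor has no additional levels.
combined <- interaction(thedata$disciplinaryGroup, thedata$discipline,
                        sep = ":", drop = TRUE)
same_partition <- nlevels(combined) == length(unique(thedata$discipline))

c(is_nested = is_nested, same_partition = same_partition)
```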

Finally, please note that depending on your exact normalization
procedure, a standard Gaussian model with identity link (i.e. "linear")
might not be the right model for the job. I'm thinking in particular
about issues that can arise when your normalization procedure results in
a measure that's bounded on [0,1].

Best,
Phillip
