[R-sig-ME] Analyzing similarity scores between subjects

Discussion:

Han Zhang

2018-08-08 17:22:59 UTC

Hi all,

I have a modeling problem involving similarity scores between subjects.
During 4 time points in my experiment, I sampled eye movements of my
subjects. At each time point, each subject was in one of two states, Y or
N. I have no control over the state; it is purely observational.
My data produce 4 similarity matrices: at each time point, every subject
was compared to every other subject on some similarity measure of eye
movements (self-comparisons excluded). Each matrix contains three types of
comparison: N-N, N-Y, and Y-Y. My hypothesis is that the eye movements of
those in state N were more similar to each other, compared to N-Y, or Y-Y.
So N-N > N-Y or Y-Y.

I came up with a model like this:

lmer(dist ~ type + (1|sub_i) + (1|sub_i:type) + (1|segment) +
     (1|segment:type) + (1|sub_i:segment) + (1|sub_i:segment:type),
     data = data, REML = FALSE)

where dist is the similarity score, type is a 3-level factor (n-n, n-y,
y-y), sub_i is subject ID, segment is sample ID. I was
trying to build a model with a "maximal" random structure.

Have I correctly specified my model? I have two concerns:
(1) because any given data point in the matrix belongs to two subjects, i
and j, should I include random effects for both subject i and subject j?

(2) Because each matrix is symmetrical, I am duplicating my data in the
above model. Should I use only the unique pairwise comparisons and do
something like this:

lmer(dist ~ type + (1|segment) + (1|segment:type), data = half_data, REML = FALSE)

Thanks!

--
Han Zhang
Graduate Student
Combined Program in Education and Psychology
University of Michigan, Ann Arbor
Email: ***@umich.edu
Phone: 1-734-680-6031

Jon Baron

2018-08-08 19:59:00 UTC

This is a tough problem! And I'm not sure I can solve it without the
data (and I am not willing to go that far), or ever. But here are some
thoughts. If I had these data, I would not automatically think of
using a multi-level model, but, the more I think about it, the more
sense it makes. (And I would first look at something really simple to
see if my hypothesis has a chance of being correct.)

First, 4 time points may not be enough to treat time as a random
effect. It might (or might not) make sense to treat time as a fixed
effect and look at its interaction with type. It may be that time
segment does not matter at all. But if there is an interaction you
need to worry about coding the variables so that you can still
interpret the main effect of type.
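
The coding point can be illustrated with a small base-R sketch. It uses lm() on simulated, balanced data as a stand-in for the mixed model, and the data frame `d` and its values are made up for illustration. With sum-to-zero contrasts on segment, the coefficient for type is the type difference averaged over segments, rather than the difference at the reference segment that the default treatment coding would give.

```r
# Simulated, balanced toy data (hypothetical; no real effect is built in)
set.seed(3)
d <- expand.grid(type = c("N-N", "N-Y", "Y-Y"),
                 segment = factor(1:4),
                 rep = 1:5)
d$type <- factor(d$type, levels = c("N-N", "N-Y", "Y-Y"))
d$dist <- rnorm(nrow(d))

# Sum-to-zero coding for segment: segment effects and their interactions
# with type each sum to zero across the 4 segments
contrasts(d$segment) <- contr.sum(4)

# lm() is only a stand-in here; the same contrasts apply in an lmer() formula
fit <- lm(dist ~ type * segment, data = d)

# Coefficient for N-Y vs N-N, now interpretable as averaged over segments
avg_effect <- unname(coef(fit)["typeN-Y"])

# In a balanced design this matches the difference in marginal means
marginal_diff <- mean(d$dist[d$type == "N-Y"]) - mean(d$dist[d$type == "N-N"])
```

With unbalanced data, as is likely for observational states, the coefficient is still an unweighted average over segments but will no longer equal the raw difference in marginal means.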

Second, it seems to me that you need random-effect terms for both
subjects in each pair. And you should use only unique pairs, so that
you do not double-count (as you realize).

Thus, the model I would think of would be something like:

lmer(dist_ij ~ type_ij*segment + (1|sub_i) + (1|sub_j))

I'm not sure about which random slopes to include, if any, but with
all of them it would be something like:

lmer(dist_ij ~ type_ij*segment + (1+type_ij*segment|sub_i) + (1+type_ij*segment|sub_j))

Maybe you don't need the 1 in the last grouping term.

I'm just using the "ij" notation to indicate that you have a matrix or
data frame with one row for each unique pair in each segment.
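
A minimal base-R sketch of building such a data frame for one segment follows. The subject IDs, states, and similarity values are invented for illustration; only the column names dist_ij, type_ij, sub_i, and sub_j follow the notation above. Taking the upper triangle keeps each unordered pair exactly once.

```r
# Toy inputs for a single segment (all values hypothetical)
set.seed(1)
subs  <- paste0("s", 1:5)
state <- c(s1 = "N", s2 = "N", s3 = "Y", s4 = "N", s5 = "Y")

# Symmetric similarity matrix; the diagonal (self-comparison) is undefined
sim <- matrix(runif(25), 5, 5, dimnames = list(subs, subs))
sim <- (sim + t(sim)) / 2
diag(sim) <- NA

# Row/column indices of the upper triangle: one entry per unordered pair
idx <- which(upper.tri(sim), arr.ind = TRUE)

pairs <- data.frame(
  sub_i   = rownames(sim)[idx[, 1]],
  sub_j   = colnames(sim)[idx[, 2]],
  dist_ij = sim[idx],
  stringsAsFactors = FALSE
)

# Pair type from the number of Y members, so the order of i and j is irrelevant
n_y <- (state[pairs$sub_i] == "Y") + (state[pairs$sub_j] == "Y")
pairs$type_ij <- factor(c("N-N", "N-Y", "Y-Y")[n_y + 1],
                        levels = c("N-N", "N-Y", "Y-Y"))
```

Repeating this per segment and stacking the results with rbind(), plus a segment column, would give one row per unique pair per segment.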

I'm not sure whether "segment" should be a number or a factor.

Jon

--
Jonathan Baron, Professor of Psychology, University of Pennsylvania
Home page: http://www.sas.upenn.edu/~baron
Editor: Judgment and Decision Making (http://journal.sjdm.org)

Tom Philippi

2018-08-08 23:57:09 UTC

Han--
At the risk of sending you down a completely different rabbit hole:

With n subjects, each subject contributes to n - 1 similarities at each
time. Depending on the properties of your similarity score (triangle
inequality and such), even in your half_data example you don't have
n * (n - 1) / 2 independent observations. One outlier individual will
produce n - 1 low similarities.
The standard approach to such dependent response variables is to retain the
matrix of similarities with subjects as rows & columns, but permute the
values of the predictor variables across the subjects for ANOSIM or Mantel
tests.
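
The dependence among pairwise values can be seen in a toy base-R example. The one-dimensional feature `x` and the similarity definition below are assumptions made purely for illustration, not the measure used in the study. With one outlying subject, the lowest similarities among the unique pairs all involve that subject.

```r
# Hypothetical 1-D feature for 6 subjects; s6 is the outlier
x <- c(s1 = 0.1, s2 = 0.2, s3 = 0.3, s4 = 0.4, s5 = 0.5, s6 = 5.0)
n <- length(x)

# Toy similarity: negative absolute difference (higher means more similar)
sim <- -abs(outer(x, x, "-"))

# Unique pairs from the upper triangle
idx   <- which(upper.tri(sim), arr.ind = TRUE)
pairs <- data.frame(i   = names(x)[idx[, 1]],
                    j   = names(x)[idx[, 2]],
                    sim = sim[idx],
                    stringsAsFactors = FALSE)

# Number of unique pairs, n * (n - 1) / 2
n_pairs <- nrow(pairs)

# The n - 1 least similar pairs, and whether each involves the outlier
lowest           <- pairs[order(pairs$sim)[1:(n - 1)], ]
involves_outlier <- lowest$i == "s6" | lowest$j == "s6"
```

Every row of `lowest` shares subject s6, so those values rise and fall together rather than behaving as separate observations.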

Your hypothesis seems to be simply that pairwise similarity within group N
(N-N) is greater than within group Y (Y-Y) or between groups (N-Y). If you
had a single time or bout, that would be anosim (analysis of similarity),
where some metric (e.g., mean similarity of N-N) is calculated for the real
data, then the N & Y values are permuted across subjects, and the metric
from the real data is compared to the distribution of the metric across the
permutations (a form of Mantel Test). One _could_ use a complex mixed
model for the metric, but for your stated hypothesis there is no reason
to. [Also, since none of the pairwise similarities are changed for that
null distribution, just which ones are considered as N-N, the mean of N-N
similarities is a sufficient statistic.] Several packages have functions
to perform this test, including ade4, ape, vegan.
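
For a single time point, the permutation logic described above can be sketched in base R without any of those packages. The similarity matrix, the group sizes, and the helper name mean_nn() are all hypothetical, and the upward shift given to N-N pairs exists only so the toy data contain an effect. This is a sketch of the idea, not a substitute for the packaged tests.

```r
set.seed(42)
n     <- 12
state <- rep(c("N", "Y"), each = n / 2)

# Toy symmetric similarity matrix; N-N pairs are shifted upward on purpose
sim  <- matrix(rnorm(n * n), n, n)
sim  <- (sim + t(sim)) / 2
is_n <- state == "N"
sim[is_n, is_n] <- sim[is_n, is_n] + 2
diag(sim) <- NA

# Test statistic: mean similarity over the unique N-N pairs
mean_nn <- function(sim, state) {
  in_n <- state == "N"
  nn   <- outer(in_n, in_n, "&") & upper.tri(sim)
  mean(sim[nn])
}

obs <- mean_nn(sim, state)

# Null distribution: shuffle the state labels across subjects.
# The similarity matrix itself is never changed.
n_perm <- 999
perm   <- replicate(n_perm, mean_nn(sim, sample(state)))

# One-sided p-value, counting the observed labelling as one permutation
p_value <- (sum(perm >= obs) + 1) / (n_perm + 1)
```

Because only the labels move, the dependence among similarities that share a subject is carried through every permutation, which is the point of permuting subjects rather than pairs.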

Because you have 4 times for the same subjects, things in terms of
hypotheses get much more complicated. Do individual subjects tend to
persist in the same state Y or N, or is state at time X not predictive of
state at time X + 1? If subjects tend to persist in N or Y, then you might
set up hypotheses across all 4 time matrices, and permute the observed
sequences of 4 states (N-N-N-Y) across individual subjects. If a subject's
state at time X + 1 is independent of state at time X, for pairs of
subjects with N-N at 1-3 times, you could ask if their similarity at N-N
times is greater than at the other times. Are these timepoints in a
treatment, where you have hypotheses about the effect getting stronger (or
weaker) over time? Some such hypotheses are simple to test with functions
in vegan (or ape, etc.), while others may require explicit coding of
restricted permutations.
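
One of the restricted schemes mentioned above, permuting whole 4-state sequences across subjects, might be coded along these lines in base R. The data are simulated with no real effect, and the pooled statistic and the name pooled_mean_nn() are assumptions chosen for illustration; other statistics would fit the same frame.

```r
set.seed(7)
n_sub <- 10
n_seg <- 4

# states[s, k]: state of subject s in segment k (toy data)
states <- matrix(sample(c("N", "Y"), n_sub * n_seg, replace = TRUE),
                 n_sub, n_seg)

# One symmetric similarity matrix per segment (toy data)
sims <- lapply(seq_len(n_seg), function(k) {
  m <- matrix(rnorm(n_sub^2), n_sub, n_sub)
  (m + t(m)) / 2
})

# Pooled statistic: mean similarity over unique N-N pairs in all segments
pooled_mean_nn <- function(sims, states) {
  vals <- unlist(lapply(seq_along(sims), function(k) {
    is_n <- states[, k] == "N"
    nn   <- outer(is_n, is_n, "&") & upper.tri(sims[[k]])
    sims[[k]][nn]
  }))
  mean(vals)
}

obs <- pooled_mean_nn(sims, states)

# Restricted permutation: shuffle rows of `states`, so each subject's
# sequence of 4 states moves as a block to another subject
perm <- replicate(999, {
  shuffled <- states[sample(n_sub), , drop = FALSE]
  pooled_mean_nn(sims, shuffled)
})

p_value <- (sum(perm >= obs) + 1) / (length(perm) + 1)
```

Shuffling rows keeps the number of N and Y subjects in each segment fixed and preserves any persistence of state within a subject, which an unrestricted shuffle of every cell would destroy.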

There are also general approaches to permutation tests for ANOVA (Anderson
2001) and outer partitioning of dissimilarity matrices (McArdle & Anderson
2001, implemented in vegan::adonis)

These approaches do not explicitly use fixed vs random effects. Rather,
the within-subject correlated measures are accounted for via restrictions
on the permutations. See, for example, Jari's reply here:
http://r-sig-ecology.471788.n2.nabble.com/Adonis-and-Random-Effects-td7577863.html

This may or may not be useful for your particular question. I hope it's at
least worth your time to think about.

Tom

Anderson, M.J., 2001. A new method for non-parametric multivariate analysis
of variance. *Austral Ecology*, *26*(1), pp.32-46.

Legendre, P. and Anderson, M.J., 1999. Distance-based redundancy analysis:
testing multispecies responses in multifactorial ecological experiments.
*Ecological Monographs*, *69*(1), pp.1-24.

McArdle, B.H. and Anderson, M.J., 2001. Fitting multivariate models to
community data: a comment on distance-based redundancy analysis.
*Ecology*, *82*(1), pp.290-297.
