Evaluation Metrics

Tasks A and B

For Tasks A and B, the official rankings will use the disattenuated Pearson correlation between the predictions and the actual survey outcomes. Because the reliabilities are fixed constants, this metric is a rescaling of the Pearson correlation and ranks systems identically, but it accounts for measurement error and thus yields values with larger variance, which makes system performances easier to compare. The measurement error (accounted for by its inverse, reliability) is taken from the literature on the reliability of psychological distress questionnaires (0.77; Ploubidis et al., 2017) and of similar language-based predictions (0.70; Park et al., 2015). The metric is thus:

$$ r_{dis} = \frac{r_{Pearson}}{\sqrt{0.77 \times 0.70}} $$

We will also report results using the Root Mean Squared Error (RMSE), as we presume many participants will use methods that optimize the mean squared error (MSE). We are still deciding on the final metric for the Innovation Challenge (it will likely be similar to BLEU). The exact script used for the evaluation will be provided to participants.
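The two metrics for Tasks A and B can be sketched as follows. This is a minimal illustration, not the official evaluation script: it assumes the standard correction for attenuation (Pearson r divided by the square root of the product of the two reliabilities), and the function and constant names are hypothetical.

```python
import math

# Reliabilities quoted in the text; the constant names are illustrative only.
RELIABILITY_SURVEY = 0.77    # psychological distress questionnaires (Ploubidis et al., 2017)
RELIABILITY_LANGUAGE = 0.70  # language-based predictions (Park et al., 2015)


def pearson_r(predicted, actual):
    """Plain Pearson correlation between two equal-length sequences."""
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    return cov / math.sqrt(var_p * var_a)


def disattenuated_r(predicted, actual,
                    rel_x=RELIABILITY_LANGUAGE, rel_y=RELIABILITY_SURVEY):
    """Pearson r corrected for attenuation (assumed form of the metric)."""
    return pearson_r(predicted, actual) / math.sqrt(rel_x * rel_y)


def rmse(predicted, actual):
    """Root mean squared error between predictions and outcomes."""
    squared = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    return math.sqrt(squared / len(predicted))
```

Under these assumptions the correction multiplies every Pearson r by the same constant (roughly 1.36), so a disattenuated value can exceed 1 and the ordering of systems is unchanged.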

Park, G., Schwartz, H. A., Eichstaedt, J. C., Kern, M. L., Kosinski, M., Stillwell, D. J., … & Seligman, M. E. (2015). Automatic personality assessment through social media language. Journal of Personality and Social Psychology, 108(6), 934.

Ploubidis, G. B., Sullivan, A., Brown, M., & Goodman, A. (2017). Psychological distress in mid-life: Evidence from the 1958 and 1970 British birth cohorts. Psychological Medicine, 47(2), 291-303.

Innovation Challenge

The goal of this challenge is to predict the language written in age 50 essays.

Predictions will be evaluated in two ways:

  • Difference between the generated essays and the true age 50 essays according to BLEU score.
  • Difference between the predicted and actual relative frequencies of psychological words (see below).
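The second form of evaluation can be illustrated with a short sketch. The text does not specify the tokenizer or how the difference is aggregated, so both are assumptions here: a simple regex tokenizer that keeps apostrophes and treats punctuation marks as tokens, and a mean absolute difference across words. All names are hypothetical.

```python
import re
from collections import Counter

# Assumed tokenizer: lowercase words (keeping apostrophes) plus single punctuation marks.
TOKEN_RE = re.compile(r"[a-z0-9]+(?:['’][a-z]+)?|[^\sa-z0-9]")


def tokenize(text):
    """Split text into lowercase word and punctuation tokens."""
    return TOKEN_RE.findall(text.lower())


def relative_frequencies(text, vocabulary):
    """Share of all tokens in `text` accounted for by each vocabulary word."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens)
    return {w: (counts[w] / total if total else 0.0) for w in vocabulary}


def mean_absolute_difference(predicted, actual):
    """Assumed aggregate: average absolute gap over the words in `actual`."""
    return sum(abs(predicted[w] - actual[w]) for w in actual) / len(actual)
```

For example, `relative_frequencies("We hope we can", ["we", "hope", "us"])` gives 0.5 for "we", 0.25 for "hope", and 0.0 for "us".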

From the original NCDS dataset, we include all essays from every study participant who wrote at least 50 words at age 11, had the age 11 controls (gender, social class), and wrote anything at age 50. There is no minimum word count for the age 50 essays because they are the target rather than the input: in an application setting, there would be no way to apply exclusion criteria based on future behavior, and even predicting the length of the essay is a useful step.
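The inclusion criteria above can be expressed as a simple filter. This is only a sketch: the record layout and field names (`age11_essay`, `age50_essay`, `gender`, `social_class`) are hypothetical, and words are counted by splitting on whitespace, which may differ from the counting used to build the dataset.

```python
def is_included(participant, min_age11_words=50):
    """Return True if a participant record meets the stated inclusion criteria."""
    # Assumption: word count = number of whitespace-separated tokens.
    age11_words = len((participant.get("age11_essay") or "").split())
    age50_words = len((participant.get("age50_essay") or "").split())
    has_controls = (participant.get("gender") is not None
                    and participant.get("social_class") is not None)
    # No minimum length for the age 50 essay beyond being non-empty.
    return age11_words >= min_age11_words and has_controls and age50_words > 0
```

A dataset would then be filtered with something like `[p for p in participants if is_included(p)]`.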

Psychological Words

These psychological words were chosen based on six LIWC (Linguistic Inquiry and Word Count; Pennebaker et al., 2015) categories that are often found to be related to mental health (I, We, Negation, Tentative, Certainty, and Articles; Pennebaker et al., 20XX; Schwartz et al., 2013) and on word correlations in affect (Preotiuc et al., 2016) and depression (Schwartz et al., 2014) corpora. In addition, we include two “meta features”: number of words and average length of words.

The final lists consist of all LIWC words from the given categories that appeared in at least 1% of the age 50 training essays. The Affect and Depression words, which are less frequent “content” words, had to appear in at least 0.1% of the age 50 essays and were further limited to the 30 words per category most strongly correlated with that category (Benjamini-Hochberg corrected p < .05).
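The selection procedure described above can be sketched roughly as follows. The code assumes whitespace tokenization and that "most strongly correlated" means largest absolute correlation (the lists include negatively correlated words); the significance filtering is not shown, and all function names are hypothetical.

```python
def document_frequency_filter(essays, candidate_words, min_fraction):
    """Keep candidate words that occur in at least `min_fraction` of the essays."""
    # Assumption: tokens are lowercase whitespace-separated strings.
    token_sets = [set(essay.lower().split()) for essay in essays]
    n_essays = len(token_sets)
    kept = []
    for word in candidate_words:
        doc_freq = sum(1 for tokens in token_sets if word in tokens)
        if n_essays and doc_freq / n_essays >= min_fraction:
            kept.append(word)
    return kept


def top_correlated(correlations, k=30):
    """Return the k words with the largest absolute correlation."""
    return sorted(correlations, key=lambda w: abs(correlations[w]), reverse=True)[:k]
```

With these helpers, the LIWC lists would correspond to `min_fraction=0.01`, and the Affect and Depression lists to `min_fraction=0.001` followed by `top_correlated(..., k=30)`.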

Meta Features

Number of words
Avg. length of words
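As a rough illustration, the two meta features could be computed as below. Whitespace tokenization and counting word length in characters are assumptions; the function name and output keys are hypothetical.

```python
def meta_features(essay):
    """Compute number of words and average word length for one essay."""
    words = essay.split()  # assumption: whitespace-separated tokens
    n_words = len(words)
    avg_length = sum(len(w) for w in words) / n_words if n_words else 0.0
    return {"num_words": n_words, "avg_word_length": avg_length}
```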

LIWC – We

we
our
us

LIWC – Tentative

hope
hopefully
or
some
if
may
maybe
lot
most
perhaps
any
lots
often
quite
something
try
unclear
generally
might
fairly
hoping
pretty
trying
lucky
somewhere
occasional
sometimes

LIWC – Negation

not
no
don’t
without
can’t
cannot
dont
won’t

LIWC – I

i
my
me
myself
i’m
i’ve
i’ll
i’d
im

LIWC – Certain

all
always
every
never
ever
sure
especially
everything

LIWC – Article

the
a
an

Affect

!
happy
great
love
all
day
had
for
? (Negative Correlation)
good
not (Negative Correlation)
fun
family
do (Negative Correlation)
finally
friends
today
need (Negative Correlation)
new
a
best
get (Negative Correlation)
much
what (Negative Correlation)
that (Negative Correlation)
someone (Negative Correlation)
year
made
when (Negative Correlation)
with

Depression

i
weekend (Negative Correlation)
.
! (Negative Correlation)
great (Negative Correlation)
football (Negative Correlation)
feel
me
don’t
want
i’ll
i’m
alone
the (Negative Correlation)
for (Negative Correlation)
myself
won’t
pain
day (Negative Correlation)
maybe
someone
think
could
i’ve
really
family (Negative Correlation)
wife (Negative Correlation)
our (Negative Correlation)
relax (Negative Correlation)
we (Negative Correlation)