CLPsych-2018 Shared Task: Call for Participation
Predicting Current and Future Psychological Health from Childhood Essays

We invite participants for the 2018 CLPsych Shared Task.

Motivation and Background

This shared task seeks to encourage new methods not only for analyzing current language use as a signal for mental health, as in previous CLPsych shared tasks, but also for understanding childhood language as a marker of future psychological health over individual lifetimes. Predicting well-being from language in the short term is valuable for applications such as intake assessment and ongoing monitoring. However, predictions about the long-term future, an area with little work thus far from the NLP community, can aid a different class of applications: the identification of early-life markers and the development of preventative care.

The unprecedented data for this task come from the National Child Development Study (NCDS), also known as the 1958 British Birth Cohort Study, which has followed a cohort of all children born in Great Britain during a single week in March 1958. Participants have been surveyed at various points in their lives to monitor their health and socioeconomic status. At age 11, they wrote short essays on where they saw themselves at age 25, fourteen years in the future; these essays will be used to predict aspects of their mental health at ages 11, 23, 33, 42, and 50. Additional non-linguistic variables, including gender and childhood parental social class, will be made available as well (Power & Elliott, 2005).


Task A: Cross-Sectional Psychological Health at Age 11

Input: Age 11 essays and socio-demographic controls

Output: Psychological health at age 11 (Behavioral scores from teachers)

This subtask looks at prediction in the short term, answering the question of what a person’s language tells us about their current psychological health.

Task B: Future Psychological Health

Input: Age 11 essays and socio-demographic controls

Output: Psychological distress at ages 23, 33, 42, 50*.

This subtask answers the question of how well we can know, at age 11, what a person’s psychological health will be at different stages of life. *The age 50 outcome is withheld from the training set.

Innovation Challenge: Future Psychological Language Generation

In addition to the tasks above, we invite researchers to participate in an innovation challenge: At age 50, the NCDS participants were asked to write a new essay on where they saw themselves a further ten years down the road. Here, we invite participants to use the age 11 essays along with the socio-demographics already available and attempt to predict the language at age 50.

Input: Age 11 essays and socio-demographic controls

Outputs: Age 50 frequency of psychological words, Age 50 essays.

From mental health to demographics and personality, language use seems to be a window into many aspects of who we are and how we are doing (Pennebaker, 2011; Coppersmith et al., 2014; Schwartz & Ungar, 2015; Kern et al., 2016). In contrast to traditional psychological assessments, which typically capture one to a few psychological factors, language-based assessments have the advantage of, theoretically, being able to capture an unlimited number of constructs. The goal of this moonshot task is to motivate methods that move the field closer to open-vocabulary outputs in psychological prediction, outputs not limited to any predefined category or construct. The first output seeks to predict the frequency of words deemed psychologically relevant according to the literature (e.g. singular versus plural pronouns, ‘excited’, ‘hate’, ‘friends’), assessed by correlation, while the second output ultimately seeks to produce the entire age 50 essay, evaluated against the true age 50 essays (e.g. using BLEU score). We invite participants to tackle one or both of these challenges. Manuscripts for this task which address the goals of the CLPsych workshop will be considered as part of the review process for the workshop proceedings.
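As a rough illustration of the essay-level evaluation, the sketch below computes a simplified BLEU-1 score (modified unigram precision with a brevity penalty) in pure Python. The final challenge metric, n-gram order, and smoothing choices are still undecided, so the function name and toy essay fragments here are purely hypothetical.

```python
import math
from collections import Counter

def bleu1(reference, hypothesis):
    """BLEU-1: modified unigram precision with a brevity penalty."""
    ref_counts = Counter(reference)
    hyp_counts = Counter(hypothesis)
    # Clip each hypothesis word count by its count in the reference.
    overlap = sum(min(count, ref_counts[word]) for word, count in hyp_counts.items())
    precision = overlap / max(len(hypothesis), 1)
    # Brevity penalty discourages trivially short hypotheses.
    if len(hypothesis) >= len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(hypothesis))
    return bp * precision

# Hypothetical tokenized essay fragments (not real NCDS data).
ref = "i will be married with two children and working as a teacher".split()
hyp = "i will be married with children and work as a nurse".split()
print(round(bleu1(ref, hyp), 3))
```

Full BLEU additionally combines higher-order n-gram precisions via a geometric mean; established implementations (e.g. NLTK's `sentence_bleu`) would likely be closer to whatever official metric is chosen.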


For Tasks A and B, we plan to use the disattenuated Pearson correlation between the predictions and the actual survey outcomes for the official rankings. This metric is a monotonic rescaling of the Pearson correlation that accounts for measurement error, yielding values with larger variance and thus easier comparisons between system performances. The measurement error (captured via reliability estimates) is taken from the literature on the reliability of the psychological distress questionnaires (0.77; Ploubidis et al., 2017) and of similar language-based predictions (0.70; Park et al., 2015). The metric is thus:

    r_dis = r(predictions, outcomes) / sqrt(0.77 × 0.70)

We will also report results using the Root Mean Squared Error (RMSE), as we presume many participants will use methods that optimize the MSE. We are still deciding on the final metric for the Innovation Challenge (it will likely be similar to BLEU). The exact evaluation script will be provided to participants.
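A minimal sketch of both Task A/B metrics follows, assuming the disattenuated correlation is simply the Pearson r divided by the square root of the product of the two cited reliabilities (0.77 and 0.70). The function names and toy prediction vectors are illustrative only; this is not the official evaluation script.

```python
import math

def disattenuated_r(pred, true, rel_outcome=0.77, rel_pred=0.70):
    """Pearson correlation divided by sqrt(product of reliabilities)."""
    n = len(pred)
    mean_p, mean_t = sum(pred) / n, sum(true) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, true))
    sd_p = math.sqrt(sum((p - mean_p) ** 2 for p in pred))
    sd_t = math.sqrt(sum((t - mean_t) ** 2 for t in true))
    r = cov / (sd_p * sd_t)
    return r / math.sqrt(rel_outcome * rel_pred)

def rmse(pred, true):
    """Root mean squared error between predictions and outcomes."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)) / len(pred))

# Toy predictions and survey outcomes for illustration.
pred = [1.2, 2.1, 2.9, 4.2]
true = [1.0, 2.0, 3.0, 4.0]
print(disattenuated_r(pred, true), rmse(pred, true))
```

Note that disattenuation can push a strong correlation above 1.0, since 1/sqrt(0.77 × 0.70) ≈ 1.36; rankings are unaffected because every system is rescaled by the same constant.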

Park, G., Schwartz, H. A., Eichstaedt, J. C., Kern, M. L., Kosinski, M., Stillwell, D. J., … & Seligman, M. E. (2015). Automatic personality assessment through social media language. Journal of personality and social psychology, 108(6), 934.
Ploubidis, G. B., Sullivan, A., Brown, M., & Goodman, A. (2017). Psychological distress in mid-life: evidence from the 1958 and 1970 British birth cohorts. Psychological medicine, 47(2), 291-303.

Please note signup has now closed! We thank everyone for their interest.



Anticipated Timeline

  • As soon as convenient: Start human subjects review process at your organization; see help below.  
  • March 5: Release of training data: ~8,000 anonymized childhood essays + gender and social class controls + psychological outcomes + ~8,000 matched anonymized adult essays
  • March 19 (extended from March 15): Task Signup Deadline
  • April 2: Test set released: 1,000 anonymized childhood essays + gender and social class control variables
  • April 9: Test set predictions due
  • April 16: Results announced; participant manuscripts due
  • June 5: Workshop


Human Subjects Review

Every effort has been made to anonymize the data. However, even research with de-identified data must undergo human subjects review at one’s home institution. Within manuscript submissions, all participants must affirm that an appropriate review was completed at their home organization. In the US, this is likely to be quite straightforward: most US academic institutions will consider the data to fall into the “exempt” category under the revised Common Rule, and many university ethics boards already specifically list the NCDS data as “exempt”; however, only an institutional review board (IRB) can make that decision. We are providing a Template Letter with information about the dataset in order to make this process smooth for those who have not previously done research involving human subjects review.


Data Access Procedure 

  1. All team members must sign up individually with the UK Data Service.
  2. All team members must sign and return the Data Use Agreement; we would prefer that all team members sign the same form.
  3. If all is in order, you will receive the data within one week (typically less than 4 days).


Shared Task Organizers

H. Andrew Schwartz, Veronica Lynn*, Alissa Goodman, Kate Niederhoffer, Kate Loveys, Philip Resnik

*Primary point of contact:



Coppersmith, G., Dredze, M., & Harman, C. (2014). Quantifying mental health signals in Twitter. In Proceedings of the Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality (pp. 51-60).

Kern, M. L., Park, G., Eichstaedt, J. C., Schwartz, H. A., Sap, M., Smith, L. K., & Ungar, L. H. (2016). Gaining insights from social media language: Methodologies and challenges. Psychological methods, 21(4), 507.

Pennebaker, J. W. (2011). The secret life of pronouns: What our words say about us. New York, NY: Bloomsbury Press.

Power, C., & Elliott, J. (2005). Cohort profile: 1958 British birth cohort (National Child Development Study). International journal of epidemiology, 35(1), 34-41.

Schwartz, H. A., & Ungar, L. H. (2015). Data-driven content analysis of social media: a systematic overview of automated methods. The ANNALS of the American Academy of Political and Social Science, 659(1), 78-94.