Triaging content in online peer-support: an overview of the 2017 CLPsych shared task
The 2017 CLPsych shared task involved automatically prioritising content from online peer-support forums, specifically those hosted by ReachOut.com, according to how urgently it requires attention from a moderator.
Motivation and Background
ReachOut.com is an Australian non-profit established in 1996 to support young people. It offers online resources about everyday topics like family, school and friendships, as well as more difficult issues such as alcohol and drug addiction, gender identity, sexuality, and mental health concerns.
ReachOut also provides a safe online community where young people can chat anonymously and share their experiences. This community is looked after by a team of professional and volunteer moderators, who watch for anything that might require attention and respond when needed with encouragement, compassion and links to relevant resources. In extreme cases they will redact content that is overly distressing or triggering, or where the author has jeopardised their own safety or anonymity, and there is an escalation process to follow when forum members might be at risk of harm.
This shared task is about helping these moderators by automatically identifying content that requires their attention, and triaging it so that urgent content can be responded to more quickly and consistently. It repeats the 2016 shared task, but with additional data.
Task and data
The task was structured as a supervised classification problem, in which participants were asked to automatically classify forum posts into one of the following four categories.
|Category||Description|
|crisis||the author or someone else is at risk of harm|
|red||the post should be responded to as soon as possible|
|amber||the post should be responded to at some point, if the community does not rally strongly around it|
|green||the post can be safely ignored or left for the community to address|
Ground truth for the task was provided via manual annotation. The annotation process was complex and somewhat subjective; further details can be found in the 2016 shared task overview paper.
Participants were initially given a training dataset of 65,756 forum posts, of which 1,188 were manually annotated with the expected category. Participants were then free to use whatever algorithms and features they desired, as long as the features only involved information that would be available at the time a post was first created (i.e. excluding any information about how people respond to a post, how often it is viewed, or whether it is edited).
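As a concrete, purely hypothetical illustration of a system that respects this constraint, here is a minimal baseline sketch: TF-IDF features over the post text (which is fixed at creation time) fed to a logistic regression classifier. The function name and data-loading step are our own assumptions; this is not any team's actual system.

```python
# A minimal, hypothetical baseline (not any team's actual system): TF-IDF over
# the post text, which is available at creation time, plus logistic regression.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_baseline(train_texts, train_labels):
    """Fit a four-way (crisis/red/amber/green) classifier on post text only."""
    model = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), min_df=2),
        # class_weight="balanced" compensates for the heavy green majority
        LogisticRegression(max_iter=1000, class_weight="balanced"),
    )
    model.fit(train_texts, train_labels)
    return model

# Usage sketch: train on the 1,188 annotated posts, then label the test posts.
# model = train_baseline(train_texts, train_labels)
# predictions = model.predict(test_texts)
```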
After seven weeks, participants were given a test set of 92,207 forum posts, of which 400 were identified as requiring annotation. They were expected to automatically annotate these 400 posts and promptly return them to the task coordinator, who compared each submission against the ground truth to calculate the official metrics. Participants were free to submit as many runs as they liked (e.g. to compare different algorithms or feature representations), and were judged only on their top-performing run.
The full dataset is available to researchers who wish to continue developing and evaluating triage systems, or to conduct other research projects in online peer support. Please apply for access.
The shared task attracted strong participation, with 15 teams of researchers from 20 institutions, across 7 countries.
|Team name||Team members||Institution||Country|
|Altszyler et al.||Edgar Altszyler, Ariel J. Berenstein, Diego Fernandez Slezak||Universidad de Buenos Aires, Hospital de Niños Ricardo Gutierrez||Argentina|
|Desmet and Jacobs||Bart Desmet & Gilles Jacobs||Ghent University||Belgium|
|French et al.||Leon French, Derek Howard, Jacob Ritchie & Geoffrey Woollard||Centre for Addiction and Mental Health Toronto||Canada|
|Gamaarachchige et al.||Prasadith Kirinde Gamaarachchige, Diana Inkpen & Ruba Skaik||University of Ottawa||Canada|
|Han et al.||Sifei Han, Tung Tran & Ramakanth Kavuluru||University of Kentucky||United States|
|Hoenen||Armin Hoenen||CEDIFOR / Goethe Uni Frankfurt am Main||Germany|
|Kennington & Mehrpouyan||Casey Kennington & Hoda Mehrpouyan||Boise State University||United States|
|Miftahutdinov et al.||Zulfat Miftahutdinov, Elena Tutubalina & Ilseyar Alimova||Kazan Federal University||Russia|
|Morales and Levitan||Michelle Morales & Rivka Levitan||The City University of New York||United States|
|Nair et al.||Suraj Nair, Ayah Zirikly & Philip Resnik||University of Maryland & George Washington University||United States|
|Qadir et al.||Ashequl Qadir, Oladimeji Farri, Sadid Hasan, Joey Liu, Vivek Datla, Kathy Lee & Yuan Ling||Philips Research North America||United States|
|Rose and Bex||Dylan Rose & Peter Bex||Northeastern University||United States|
|Vajjala||Sowmya Vajjala||Iowa State University||United States|
|Xia and Liu||Xianyi Xia & Dexi Liu||Jiangxi University of Finance and Economics||China|
|Yates et al.||Andrew Yates, Nazli Goharian, Arman Cohan & Sean MacAvaney||Max Planck Institute for Informatics & Georgetown University||Germany, United States|
The shared task had three official metrics: macro-averaged F1, F1 for flagged vs. non-flagged posts, and F1 for urgent vs. non-urgent posts.
Macro-averaged F1 score
The macro-averaged F1 score, calculated after excluding the majority (i.e. green) class, was the official metric from the 2016 shared task. It captures an algorithm's ability to cleanly separate the remaining severity labels: crisis, red and amber.
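As a sketch of how this metric can be computed with scikit-learn, assuming the gold and predicted labels are lists of strings (this is an illustration, not the official evaluation script):

```python
# Illustrative only: macro-averaged F1 over the three non-green classes.
from sklearn.metrics import f1_score

def macro_f1_non_green(gold, pred):
    """Average the per-class F1 over crisis, red and amber, excluding green."""
    return f1_score(gold, pred, labels=["crisis", "red", "amber"], average="macro")
```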
|Team||Recall||Precision||F1||Accuracy|
|Xia and Liu||0.434||0.546||0.467||0.708|
|Nair et al.||0.435||0.538||0.461||0.678|
|Han et al.||0.450||0.451||0.447||0.683|
|Qadir et al.||0.373||0.625||0.436||0.683|
|Gamaarachchige et al.||0.429||0.462||0.413||0.678|
|French et al.||0.404||0.401||0.391||0.668|
|Miftahutdinov et al.||0.365||0.410||0.373||0.653|
|Yates et al.||0.352||0.312||0.319||0.700|
|Kennington & Mehrpouyan||0.307||0.299||0.281||0.650|
|Desmet and Jacobs||0.306||0.177||0.219||0.275|
|Rose and Bex||0.227||0.230||0.187||0.443|
|Morales and Levitan||0.071||0.250||0.086||0.540|
The top teams provide solid improvements over 2016, where the top three teams all achieved an F1 of 0.42.
F1 for flagged vs. non-flagged
This metric separates the posts that moderators need to do something with (i.e. crisis, red, amber) from those they can safely ignore (i.e. green). This is arguably the most important metric, since a low recall here could cause moderators to miss posts that they should pay attention to, while a low precision would increase their workload.
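A sketch of how this binarisation can be scored, again assuming lists of label strings (the helper name is ours, not part of any official tooling):

```python
# Illustrative sketch: collapse the four labels into flagged vs. non-flagged,
# then compute the F1 of the positive (flagged) class.
from sklearn.metrics import f1_score

FLAGGED = {"crisis", "red", "amber"}

def flagged_f1(gold, pred):
    gold_bin = [int(label in FLAGGED) for label in gold]
    pred_bin = [int(label in FLAGGED) for label in pred]
    return f1_score(gold_bin, pred_bin)
```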
|Team||Recall||Precision||F1||Accuracy|
|Yates et al.||0.902||0.865||0.883||0.890|
|French et al.||0.913||0.844||0.877||0.883|
|Gamaarachchige et al.||0.897||0.829||0.862||0.868|
|Xia and Liu||0.842||0.871||0.856||0.870|
|Qadir et al.||0.837||0.870||0.853||0.868|
|Han et al.||0.875||0.826||0.850||0.858|
|Nair et al.||0.793||0.849||0.820||0.840|
|Miftahutdinov et al.||0.793||0.820||0.807||0.825|
|Kennington & Mehrpouyan||0.788||0.824||0.806||0.825|
|Rose and Bex||0.690||0.575||0.627||0.623|
|Desmet and Jacobs||0.815||0.482||0.606||0.513|
|Morales and Levitan||0.299||0.743||0.426||0.630|
This is again an improvement over 2016, where the top team achieved an F1 of 0.87.
F1 for urgent vs. non-urgent
This metric separates the posts that moderators need to respond to quickly (i.e. crisis, red) from those for which they can afford to take their time.
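The same binarisation sketch applies here, with only crisis and red counted as positive. This version also reports recall, precision and accuracy, mirroring the four columns of the table below (again an illustration, not the official script):

```python
# Illustrative sketch: urgent (crisis, red) vs. non-urgent, reporting the same
# four columns as the results tables: recall, precision, F1 and accuracy.
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

URGENT = {"crisis", "red"}

def urgent_scores(gold, pred):
    gold_bin = [int(label in URGENT) for label in gold]
    pred_bin = [int(label in URGENT) for label in pred]
    return {
        "recall": recall_score(gold_bin, pred_bin),
        "precision": precision_score(gold_bin, pred_bin),
        "f1": f1_score(gold_bin, pred_bin),
        "accuracy": accuracy_score(gold_bin, pred_bin),
    }
```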
|Team||Recall||Precision||F1||Accuracy|
|Nair et al.||0.744||0.615||0.673||0.838|
|Han et al.||0.644||0.604||0.624||0.825|
|Xia and Liu||0.556||0.676||0.610||0.840|
|French et al.||0.644||0.563||0.601||0.808|
|Gamaarachchige et al.||0.533||0.649||0.585||0.830|
|Qadir et al.||0.456||0.759||0.569||0.845|
|Miftahutdinov et al.||0.422||0.603||0.497||0.808|
|Yates et al.||0.333||0.909||0.488||0.843|
|Desmet and Jacobs||0.589||0.262||0.363||0.535|
|Kennington & Mehrpouyan||0.211||0.760||0.330||0.808|
|Rose and Bex||0.167||0.294||0.213||0.723|
|Morales and Levitan||0.022||1.000||0.043||0.780|
These results are similar to 2016, where the best-performing team achieved an F1 of 0.69.
We are currently preparing a more detailed article, and will update this page with a citation when it becomes available. In the meantime, please feel free to contact David Milne if you have any other questions.
Please cite this article as:
Milne, D.N. (2017) Triaging content in online peer-support: an overview of the 2017 CLPsych shared task. [Online]. Available: http://clpsych.org/shared-task-2017