Shared Task 2017

Triaging content in online peer-support: an overview of the 2017 CLPsych shared task

The 2017 CLPsych shared task involved automatically prioritising content in online peer support (specifically the forums hosted by ReachOut.com) by how urgently it requires attention from a moderator.

Motivation and Background

ReachOut.com is an Australian non-profit established in 1996 to support young people. It offers online resources about everyday topics like family, school and friendships, as well as more difficult issues such as alcohol and drug addiction, gender identity, sexuality, and mental health concerns.

ReachOut also provides a safe online community for young people to chat anonymously and share their experiences. This community is maintained and looked after by a team of professional and volunteer moderators who listen out for anything that might require attention, responding when needed with encouragement, compassion and links to relevant resources. In extreme cases they will redact content that is overly distressing or triggering, or where the author has jeopardised their own safety or anonymity. There is an escalation process to follow when forum members might be at risk of harm.

This shared task is about helping these moderators by automatically identifying content that requires their attention, and triaging it so that urgent content can be responded to more quickly and consistently. It repeats the 2016 shared task, but with additional data.

Task and data

The task was structured as a supervised classification problem, in which participants were asked to automatically classify forum posts into one of the following four categories.

crisis: the author or someone else is at risk of harm
red: the post should be responded to as soon as possible
amber: the post should be responded to at some point, if the community does not rally strongly around it
green: the post can be safely ignored or left for the community to address

Ground truth for the task was provided via manual annotation. The annotation process was complex and somewhat subjective; further details can be found in the 2016 shared task overview paper.

Participants were initially given a training dataset of 65,756 forum posts, of which 1,188 were annotated manually with the expected category. Participants were then free to use whatever algorithms and features they desired, as long as the features only involved information that would be available at the time a post was first created (i.e. nothing about how people respond to a post, how often it is viewed, whether it is edited, etc.).
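To make the setup concrete, here is a minimal baseline sketch (not any participant's actual system) that trains a four-class classifier on the annotated subset using only text available at creation time. The file name and column names (body, label) are assumptions about how the released data might be organised, not the official format.

# A minimal four-class baseline: TF-IDF over the post body plus logistic
# regression, trained on the annotated posts only. Illustrative sketch;
# the file and column names are assumed, not the official data format.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

posts = pd.read_csv("training_posts.csv")      # hypothetical export of the training data
labelled = posts.dropna(subset=["label"])      # keep only the manually annotated posts

# Restrict features to information available when the post was created:
# here, just the post body (no view counts, replies or edit history).
texts = labelled["body"].fillna("")
labels = labelled["label"]                     # crisis / red / amber / green

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=2),
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)

# Rough internal estimate only; official scoring used the held-out test set.
print(cross_val_score(model, texts, labels, cv=5, scoring="f1_macro").mean())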

After 7 weeks, the participants were then given a test set of 92,207 forum posts, of which 400 were identified as requiring annotation. They were expected to automatically annotate these 400 posts and promptly return them to the task coordinator, who would compare their submission against the ground truth to calculate the official metrics. Participants were free to submit as many runs as they liked (i.e. to compare different algorithms or feature representations), and would only be judged on their top-performing run.

The full dataset is available for researchers who wish to continue developing and evaluating triage systems, or to conduct other research projects in online peer support. Please apply here for access.

Participants

The shared task attracted strong participation, with 15 teams of researchers from 20 institutions across 7 countries.

Team name | Team members | Institution | Country
Altszyler et al. | Edgar Altszyler, Ariel J. Berenstein, Diego Fernandez Slezak | Universidad de Buenos Aires, Hospital de Niños Ricardo Gutierrez | Argentina
Desmet and Jacobs | Bart Desmet & Gilles Jacobs | Ghent University | Belgium
French et al. | Leon French, Derek Howard, Jacob Ritchie & Geoffrey Woollard | Centre for Addiction and Mental Health, Toronto | Canada
Gamaarachchige et al. | Prasadith Kirinde Gamaarachchige, Diana Inkpen & Ruba Skaik | University of Ottawa | Canada
Han et al. | Sifei Han, Tung Tran & Ramakanth Kavuluru | University of Kentucky | United States
Hoenen | Armin Hoenen | CEDIFOR / Goethe Uni Frankfurt am Main | Germany
Kennington & Mehrpouyan | Casey Kennington & Hoda Mehrpouyan | Boise State University | United States
Miftahutdinov et al. | Zulfat Miftahutdinov, Elena Tutubalina & Ilseyar Alimova | Kazan Federal University | Russia
Morales and Levitan | Michelle Morales & Rivka Levitan | The City University of New York | United States
Nair et al. | Suraj Nair, Ayah Zirikly & Philip Resnik | University of Maryland & George Washington University | United States
Qadir et al. | Ashequl Qadir, Oladimeji Farri, Sadid Hasan, Joey Liu, Vivek Datla, Kathy Lee & Yuan Ling | Philips Research North America | United States
Rose and Bex | Dylan Rose & Peter Bex | Northeastern University | United States
Vajjala | Sowmya Vajjala | Iowa State University | United States
Xia and Liu | Xianyi Xia & Dexi Liu | Jiangxi University of Finance and Economics | China
Yates et al. | Andrew Yates, Nazli Goharian, Arman Cohan & Sean MacAvaney | Max Planck Institute for Informatics & Georgetown University | Germany, United States

Results

The shared task has three official metrics: macro-averaged f1 score, f1 for flagged vs. non-flagged posts, and f1 for urgent vs. non-urgent posts.

Macro-averaged f1 score

The macro-averaged f1 score, calculated after excluding the majority (i.e. green) class, was the official metric of the 2016 shared task. It captures an algorithm's ability to cleanly separate the four severity labels.
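For reference, here is a sketch of how this metric could be reproduced with scikit-learn, assuming gold and predicted are parallel lists of category names for the 400 scored test posts (the Accuracy column is assumed to be plain four-class accuracy):

# Official metric 1: macro-averaged recall, precision and f1 over the
# crisis, red and amber classes (green is excluded from the average).
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

NON_GREEN = ["crisis", "red", "amber"]

def official_macro_scores(gold, predicted):
    return {
        "recall": recall_score(gold, predicted, labels=NON_GREEN, average="macro"),
        "precision": precision_score(gold, predicted, labels=NON_GREEN, average="macro"),
        "f-measure": f1_score(gold, predicted, labels=NON_GREEN, average="macro"),
        "accuracy": accuracy_score(gold, predicted),  # assumed: accuracy over all four classes
    }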

Team Recall Precision F-measure Accuracy
Xia and Liu 0.434 0.546 0.467 0.708
Altszyler 0.489 0.446 0.462 0.695
Nair et al. 0.435 0.538 0.461 0.678
Han et al. 0.450 0.451 0.447 0.683
Qadir et al. 0.373 0.625 0.436 0.683
Gamaarachchige et al. 0.429 0.462 0.413 0.678
French et al. 0.404 0.401 0.391 0.668
Vajjala 0.368 0.413 0.388 0.643
Miftahutdinov et al. 0.365 0.410 0.373 0.653
Hoenen 0.355 0.479 0.324 0.523
Yates et al. 0.352 0.312 0.319 0.700
Kennington & Mehrpouyan 0.307 0.299 0.281 0.650
Desmet and Jacobs 0.306 0.177 0.219 0.275
Rose and Bex 0.227 0.230 0.187 0.443
Morales and Levitan 0.071 0.250 0.086 0.540

The top teams provide a solid improvement over 2016, when the top three teams all achieved an f-measure of 0.42.

F1 for flagged vs. non-flagged

This metric separates the posts that moderators need to do something with (i.e. crisis, red, amber) from those they can safely ignore (i.e. green). This is arguably the most important metric, since a low recall here could cause moderators to miss posts that they should pay attention to, while a low precision would increase their workload.
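A sketch of this binarisation, under the same assumptions about gold and predicted as above:

# Official metric 2: binary f1 for flagged (crisis, red, amber) vs.
# non-flagged (green) posts.
from sklearn.metrics import f1_score, precision_score, recall_score

FLAGGED = {"crisis", "red", "amber"}

def flagged_scores(gold, predicted):
    gold_bin = ["flagged" if y in FLAGGED else "green" for y in gold]
    pred_bin = ["flagged" if y in FLAGGED else "green" for y in predicted]
    return {
        "recall": recall_score(gold_bin, pred_bin, pos_label="flagged"),
        "precision": precision_score(gold_bin, pred_bin, pos_label="flagged"),
        "f-measure": f1_score(gold_bin, pred_bin, pos_label="flagged"),
    }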

Team Recall Precision F-measure Accuracy
Altszyler 0.908 0.903 0.905 0.913
Yates et al. 0.902 0.865 0.883 0.890
French et al. 0.913 0.844 0.877 0.883
Gamaarachchige et al. 0.897 0.829 0.862 0.868
Xia and Liu 0.842 0.871 0.856 0.870
Qadir et al. 0.837 0.870 0.853 0.868
Han et al. 0.875 0.826 0.850 0.858
Vajjala 0.875 0.797 0.834 0.840
Nair et al. 0.793 0.849 0.820 0.840
Miftahutdinov et al. 0.793 0.820 0.807 0.825
Kennington & Mehrpouyan 0.788 0.824 0.806 0.825
Hoenen 0.875 0.647 0.744 0.723
Rose and Bex 0.690 0.575 0.627 0.623
Desmet and Jacobs 0.815 0.482 0.606 0.513
Morales and Levitan 0.299 0.743 0.426 0.630

This is again a good improvement over 2016, where the top team achieved 0.87 f-measure.

F1 for urgent vs. non-urgent

This metric separates the posts that moderators need to respond to quickly (i.e. crisis, red) from those for which they can afford to take their time.
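This mirrors the flagged/non-flagged computation with a different positive set; again a sketch under the same assumptions:

# Official metric 3: binary f1 for urgent (crisis, red) vs. non-urgent
# (amber, green) posts.
from sklearn.metrics import f1_score

URGENT = {"crisis", "red"}

def urgent_f1(gold, predicted):
    gold_bin = ["urgent" if y in URGENT else "other" for y in gold]
    pred_bin = ["urgent" if y in URGENT else "other" for y in predicted]
    return f1_score(gold_bin, pred_bin, pos_label="urgent")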

Team Recall Precision F-measure Accuracy
Altszyler 0.667 0.706 0.686 0.863
Nair et al. 0.744 0.615 0.673 0.838
Han et al. 0.644 0.604 0.624 0.825
Xia and Liu 0.556 0.676 0.610 0.840
French et al. 0.644 0.563 0.601 0.808
Vajjala 0.589 0.609 0.599 0.823
Gamaarachchige et al. 0.533 0.649 0.585 0.830
Qadir et al. 0.456 0.759 0.569 0.845
Hoenen 0.822 0.357 0.498 0.628
Miftahutdinov et al. 0.422 0.603 0.497 0.808
Yates et al. 0.333 0.909 0.488 0.843
Desmet and Jacobs 0.589 0.262 0.363 0.535
Kennington & Mehrpouyan 0.211 0.760 0.330 0.808
Rose and Bex 0.167 0.294 0.213 0.723
Morales and Levitan 0.022 1.000 0.043 0.780

These results are similar to 2016, when the best-performing team achieved an f-measure of 0.69.

Further information

We are currently preparing a more detailed article, and will update this page with a citation when it becomes available. In the meantime, please feel free to contact David Milne if you have any other questions.

If you would like to stay informed about future CLPsych shared tasks, then please sign up to the mailing list and/or join the CLPsych Google group.


Please cite this article as:

Milne, D.N. (2017) Triaging content in online peer-support: an overview of the 2017 CLPsych shared task. [Online]. Available: http://clpsych.org/shared-task-2017