Making Sense of Microposts (#Microposts2015)

Big things come in small packages

Named Entity rEcognition and Linking (NEEL) Challenge

Microposts are a highly popular medium for sharing facts, opinions and/or emotions. They comprise an invaluable wealth of data, ready to be mined for training predictive models. Following the success of the challenges in 2013 and 2014, we are pleased to announce the NEEL challenge, which will be part of the #Microposts2015 Workshop at the World Wide Web 2015 (WWW 2015) conference.

The challenge task is to automatically recognise entities and their types in English Microposts, and to link them to the corresponding English DBpedia 2014 resources (where a link exists). Participants will have to automatically extract expressions formed by discrete (and typically short) sequences of words (e.g., "Barack Obama", "London", "Rakuten") and recognise their types (e.g., Person, Location, Organisation) from a collection of Microposts. At the linking stage, participants must disambiguate each spotted named entity to the corresponding DBpedia resource, or to a NIL reference if it does not match any resource in DBpedia. The 2015 challenge will also evaluate the end-to-end performance of each system, by measuring the time each submitted algorithm takes to analyse the corpus.
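To make the expected behaviour concrete, here is a purely illustrative Python sketch of the three steps (extraction, typing, linking) on an invented tweet; the mention offsets, type labels and DBpedia URIs below are examples only and are not taken from the challenge data:

# Illustrative only: an invented tweet annotated with mention offsets,
# entity types and DBpedia links, mirroring extraction, typing and linking.
tweet = "Barack Obama meets Rakuten executives in London"
annotations = [
    ("Barack Obama", "Person", "http://dbpedia.org/resource/Barack_Obama"),
    ("Rakuten", "Organisation", "http://dbpedia.org/resource/Rakuten"),
    ("London", "Location", "http://dbpedia.org/resource/London"),
]
for mention, entity_type, resource in annotations:
    start = tweet.index(mention)      # character offset where the mention begins
    end = start + len(mention)
    print(start, end, entity_type, resource)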

We welcome and aim to attract participants from the previous Microposts workshop challenges, as well as from the TREC, TAC KBP and ERD shared tasks, to the #Microposts2015 NEEL challenge.

Award Sponsor: SpazioDati

Dataset

The dataset comprises tweets extracted from a collection of over 18 million tweets. They include event-annotated tweets provided by the Redites project, covering multiple noteworthy events from 2011 and 2013 (including the death of Amy Winehouse, the London Riots, the Oslo bombing and the Westgate Shopping Mall shootout), and tweets extracted from the Twitter firehose in 2014. Since the challenge task is to automatically recognise and link entities, we have built our dataset considering both event and non-event tweets. While event tweets are more likely to contain entities, non-event tweets enable us to evaluate how well a system avoids false positives in the entity extraction phase. The training set is built on top of the entire corpus of the NEEL 2014 Challenge, which we have further extended by typing the entities and adding NIL references.

Following the Twitter ToS, we will provide only tweet IDs and annotations for the training set, and tweet IDs for the test set. We will also provide a common framework for mining these datasets from Twitter. The training set will be released as a TSV file, following the TAC KBP format, where each line consists of the following fields:

1st: tweet ID
2nd, 3rd: start and end offsets, expressed as the number of UTF-8 characters counted from 0 (the beginning of the tweet)
4th: link to the DBpedia resource, or a NIL reference (there may be different NIL references in the corpus; each NIL reference is reused when multiple mentions in the text represent the same entity)
5th: entity type

Fields are separated by tabs. Entity mentions and URIs are listed according to their position in the tweet. We will announce the release of the dataset via @Microposts2015, here and on the mailing list – subscribe to the #Microposts2015 Google group.
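As a minimal parsing sketch for the format described above (the file name neel2015_training.tsv is hypothetical; the actual file name will be announced with the release):

import csv
from collections import namedtuple

Annotation = namedtuple("Annotation", "tweet_id start end link entity_type")

def load_annotations(path="neel2015_training.tsv"):
    # Read the TAC-KBP-style TSV: tweet ID, start offset, end offset,
    # DBpedia link or NIL reference, entity type.
    annotations = []
    with open(path, encoding="utf-8", newline="") as f:
        for row in csv.reader(f, delimiter="\t"):
            if not row:
                continue  # skip blank lines
            tweet_id, start, end, link, entity_type = row
            annotations.append(Annotation(tweet_id, int(start), int(end), link, entity_type))
    return annotations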

Evaluation & Submission

Evaluation

Participants are required to implement their systems as publicly accessible web services following a REST-based protocol, which will be publicised before the release of the training set, and to submit (up to 10) contending entries to the registry of NEEL challenge services. Upon registration of a service, calls to each entry will be scheduled in two different time windows:

D-Time to test the APIs;
T-Time for the final evaluation and metric computations.

In the final stage, each participant may submit up to 3 contending entries.
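The REST-based protocol has not yet been published, so the following is only a hypothetical sketch of the kind of publicly accessible service expected; the endpoint name and the JSON request/response fields are invented placeholders and will need to be replaced once the official specification is released:

# Hypothetical sketch of a NEEL annotation web service (Flask).
# Endpoint name and JSON fields are invented, not the official protocol.
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/annotate", methods=["POST"])
def annotate():
    payload = request.get_json(force=True)
    text = payload.get("text", "")
    # A real entry would run entity recognition, typing and linking here;
    # this placeholder returns an empty annotation list.
    return jsonify({"text": text, "annotations": []})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)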

We will use the metrics proposed by TAC KBP 2014 and, in particular, we will focus on:

[tagging]   strong_typed_mention_match (checks entity mention boundaries and type)
[linking]   strong_link_match
[clustering]   mention_ceaf (KB/NIL detection)

Additionally, another metric will be considered to estimate the computation time:

[latency*]   the time taken by each contending entry to process the corpus

To ensure the correctness of the results and to avoid any loss, we will trigger N calls per entry and statistically evaluate the metrics.
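As a rough, unofficial illustration of the tagging metric (the official TAC KBP 2014 scorer will be used for the actual evaluation), strong_typed_mention_match can be read as exact-match precision, recall and F1 over (tweet ID, start, end, type) tuples:

# Unofficial illustration: a prediction counts as correct only if its
# boundaries and type both match a gold mention exactly.
def strong_typed_mention_match(gold, predicted):
    # gold, predicted: sets of (tweet_id, start, end, entity_type) tuples
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy example with invented annotations.
gold = {("101", 0, 12, "Person"), ("101", 23, 29, "Location")}
pred = {("101", 0, 12, "Person"), ("101", 23, 29, "Organisation")}
print(strong_typed_mention_match(gold, pred))   # (0.5, 0.5, 0.5)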

Accompanying written submission

A description of the approach followed, as a 2-page extended abstract. Further details are on the submissions page.


Deadlines

  • Intent to participate: 26 Jan 2015 (extended from 20 Jan; soft deadline) – register at http://goo.gl/forms/MLcSidVTbj
  • Release of the REST API specs: 2 Feb 2015
  • Release of training set: 15 Feb 2015

  • Registration of contending entries: 2 Mar 2015
  • D-Time: 10-20 Mar 2015
  • T-Time: 11-20 Apr 2015

  • Written submission: 07 Apr 2015
  • Acceptance notification: 27 Apr 2015
  • Camera-ready: 11 May 2015

Prize

A prize of €1500, generously sponsored by SpazioDati, will be awarded to the highest-ranking submission. SpazioDati is an Italian startup focused on text analytics and big data. One of SpazioDati's key components is dataTXT, a text-analytics engine available on SpazioDati's API platform, Dandelion. The dataTXT named-entity extraction system has proven to be very effective and efficient on short and fragmented texts such as Microposts. By teaming up with SpazioDati to make the challenge possible, the #Microposts workshop organisers wish to highlight new entity extraction methods and algorithms to pursue in such challenging scenarios.

Challenge Chairs

A. Elizabeth Cano, Knowledge Media Institute, The Open University, UK

Giuseppe Rizzo, EURECOM, France

 

Challenge Dataset Chairs

Andrea Varga, Swiss Re, UK

Bianca Pereira, Insight at National University of Ireland, Galway

 

Evaluation Committee

Gabriele Antonelli, SpazioDati, Italy
Ebrahim Bagheri, Ryerson University, Canada
Pierpaolo Basile, University of Bari, Italy
Grégoire Burel, KMi, Open University, UK
Leon Derczynski, The University of Sheffield, UK
Milan Dojchinovski, Czech Technical University, Czech Republic
Guillaume Erétéo, Vigiglobe, France
Andrés García-Silva, Universidad Politécnica de Madrid, Spain
Anna Lisa Gentile, The University of Sheffield, UK
Miguel Martinez-Alvarez, Signal, UK
José M. Morales del Castillo, El Colegio de México, Mexico
Bernardo Pereira Nunes, PUC-Rio, Brazil
Daniel Preoţiuc-Pietro, The University of Sheffield, UK
Giles Reger, The University of Manchester, UK
Irina Temnikova, Qatar Computing Research Institute, Qatar
Victoria Uren, Aston University, UK

Download the CfP

Contact workshop organisers or challenge chairs
