| Title | Cross-Document Coreference for Cross-Media Film Indexing |
| --- | --- |
| Publication Type | Conference Paper |
| Year of Publication | 2006 |
| Authors | Tomadaki, E.; Salway, A. |
| Conference Name | LREC 2006 Workshop on Crossing Media for Improved Information Access |
| Conference Location | Genoa, Italy |
Potentially, rich representations of film content could be extracted and merged from various texts, such as screenplays, audio description and plot summaries, in order to improve video indexing. As a first step, this requires solving the cross-document coreference (CDCR) task. The CDCR task is difficult in this new scenario because the texts each select and present information about film events very differently; furthermore, the set of possible events is relatively unconstrained. In order to propose new solutions for CDCR, we first analysed how two different text types select and present information about the same film events. We present a corpus-based analysis of the language used in plot summaries and in audio description, which suggests that while both use similar words to refer to entities, they use very different words to refer to events; there is little systematic relation between the words each uses to refer to events. Based on our results, we propose and evaluate four heuristics for the CDCR task that match nouns, functional roles and some verbs, and that take into account the number of expected matches according to event aspect. At best we achieved precision of 49% and recall of 32% over 375 CDCR instances between plot summaries and audio descriptions. These figures are low compared to many information retrieval and extraction tasks, but we believe that: (i) they may be close to the best possible, given the differences between the text types and the fact that they refer to an unconstrained set of events; (ii) they are high enough to start leveraging the information in the texts for video indexing purposes.
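As an illustration of the kind of heuristic the abstract describes, the sketch below implements a simple noun-overlap matcher for CDCR between plot-summary events and audio-description events. All names, the Jaccard scoring function, and the threshold are assumptions made for this sketch; the paper's actual heuristics (which also use functional roles, verbs, and event aspect) are not reproduced here.

```python
# Hypothetical sketch of a noun-matching heuristic for cross-document
# coreference (CDCR): an event description from a plot summary and one
# from an audio description are proposed as coreferent when their noun
# sets overlap sufficiently. Events are represented simply as lists of
# nouns, assumed to have been extracted beforehand (e.g. by a POS tagger).

def noun_overlap(nouns_a, nouns_b):
    """Jaccard overlap between the noun sets of two event descriptions."""
    a, b = set(nouns_a), set(nouns_b)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)

def match_events(summary_events, audio_events, threshold=0.3):
    """Propose coreference links (index pairs) between events whose
    noun overlap meets the threshold. The threshold is illustrative."""
    links = []
    for i, ev_a in enumerate(summary_events):
        for j, ev_b in enumerate(audio_events):
            if noun_overlap(ev_a, ev_b) >= threshold:
                links.append((i, j))
    return links

# Toy example: nouns from one plot-summary event and two
# audio-description events.
summary = [["harry", "letter", "owl"]]
audio = [["harry", "owl", "window"], ["vernon", "car"]]
print(match_events(summary, audio))  # -> [(0, 0)]
```

Precision and recall of such a matcher would then be computed by comparing the proposed links against manually annotated coreference instances, as the authors do over their 375 instances.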