|Natural Language Processing methodology for tracking diachronic changes in the 20th century English language
|Year of Publication
|Štajner S, Mitkov R, Leech G
|Journal of Research Design and Statistics in Linguistics and Communication Science
Since the 1990s, when its more recent additions were released and diachronic study became possible, the ‘Brown family’ of corpora has been widely used across the linguistic community for various synchronic and diachronic studies. However, the methodology used in these studies did not take advantage of modern, state-of-the-art Natural Language Processing (NLP) tools, relying instead on part-of-speech (POS) tagging, often with manual post-editing. Most previous work (e.g. Leech et al., 2009; Davies, 2013) has focused mainly on the linguistic interpretation of the results and on proposing hypotheses about how language changes, giving little consideration to whether the results were statistically sound. This work aims to fill these gaps by proposing a novel, NLP-motivated methodology that employs a fully automatic feature extraction procedure and conducts a thorough statistical analysis. It thus offers a promising basis for future large-scale studies while reducing the amount of human effort required. The choice of statistical tests in this study was evaluated and confirmed to be correct by several procedures that rely on leading machine learning algorithms.