|Title||Diachronic Changes in Text Complexity in 20th Century English Language: An NLP Approach|
|Publication Type||Conference Paper|
|Year of Publication||2012|
|Authors||Štajner, S, Mitkov, R|
|Conference Name||The Eighth International Conference on Language Resources and Evaluation (LREC)|
|Publisher||European Language Resources Association (ELRA)|
|Conference Location||Istanbul, Turkey|
A syntactically complex text may pose a problem both for human comprehension and for various NLP tasks. Many studies in text simplification address this problem, aiming to transform a given text into a simplified form that is accessible to a wider audience. In this study, we investigated the natural tendency of texts in 20th century English. Are they becoming syntactically more complex over the years, requiring a higher literacy level and greater effort from readers, or are they becoming simpler and easier to read? We examined several factors of text complexity (average sentence length, Automated Readability Index, sentence complexity and passive voice) in the 20th century for the two main English language varieties, British and American, using the ‘Brown family’ of corpora. In British English, we compared the complexity of texts published in 1931, 1961 and 1991, while in American English we compared the complexity of texts published in 1961 and 1992. Furthermore, we demonstrated how state-of-the-art NLP tools can be used to automatically extract some complex features from the raw text version of the corpora.
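Two of the features named in the abstract, average sentence length and the Automated Readability Index (ARI), can be computed from raw text alone. The sketch below is illustrative only and is not the authors' pipeline: it uses the standard ARI formula, 4.71 × (characters per word) + 0.5 × (words per sentence) − 21.43, and it assumes a naive regex-based sentence splitter and tokenizer. The function names (`split_sentences`, `tokenize_words`, `complexity_features`) are hypothetical.

```python
import re


def split_sentences(text):
    """Naive splitter (an assumption): break after '.', '!' or '?' plus whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize_words(sentence):
    """Naive tokenizer (an assumption): alphanumeric runs, allowing inner ' or -."""
    return re.findall(r"[A-Za-z0-9]+(?:['-][A-Za-z0-9]+)*", sentence)


def complexity_features(text):
    """Return average sentence length (in words) and ARI for a raw text."""
    sentences = split_sentences(text)
    words = [w for s in sentences for w in tokenize_words(s)]
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")

    # ARI counts letters and digits only, so inner apostrophes/hyphens are skipped.
    n_chars = sum(1 for w in words for ch in w if ch.isalnum())

    avg_sentence_length = len(words) / len(sentences)
    ari = 4.71 * (n_chars / len(words)) + 0.5 * avg_sentence_length - 21.43
    return {"avg_sentence_length": avg_sentence_length, "ari": ari}


if __name__ == "__main__":
    sample = "The cat sat. The dog ran fast."
    print(complexity_features(sample))
```

Applying the same function to each text category of each corpus year would give values that can be compared across time, although a real study would need a more robust sentence splitter (abbreviations such as "Dr." break the naive one). Sentence complexity and passive voice require syntactic analysis and are not covered by this sketch.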