Establishment of a List of Non-Compositional Multi-Word Combinations of English for Journalism Learners

This research describes an attempt to establish a pedagogically useful list of the most frequent semantically non-compositional multi-word combinations for English for Journalism learners in an EFL context, who need to read English news in their field of study. The list was compiled from the NOW (News on the Web) Corpus, by far the largest English news database. To capture opaque multi-word combinations that are both in widespread use and of pedagogical value, the researcher applied a set of selection criteria when using the corpus. Based on frequency, meaningfulness, and semantic non-compositionality, a total of 318 non-compositional multi-word combinations of 2 to 5 words, excluding phrasal verbs, were selected; they accounted for approximately 2% of the total words in the corpus. The list, which is not highly technical in nature, contains the most commonly used multi-word units across various topic areas, and newsreaders are likely to encounter these phrasal expressions very often. As with individual word lists, it is hoped that this list of opaque expressions may serve as a reference for English for Journalism teaching.


Introduction
A text or a discourse is not only made up of individual words but also a large number of multi-word sequences, in which some of the words frequently co-occur with others and form relatively fixed word combinations. This phenomenon is generally referred to as formulaic language (Schmitt, 2010).
Formulaic language is ubiquitous and makes up a large proportion of any discourse (Nattinger and DeCarrico, 1992). Drawing upon the London-Lund Corpus, Altenberg (1993) estimated that various multi-word combinations accounted for as much as 80% of the total words in the corpus. In their study on the idiom principle, Erman and Warren (2000) reported that more than 55% of the words in an English text belonged to prefabricated multi-word expressions.
Because formulaic language takes so many forms, it is difficult to treat every instance consistently. This research therefore used multi-word combinations as an umbrella term for miscellaneous fixed combinations of words. Among the plethora of multi-word expressions, the researcher was most concerned with word combinations that may pose reading comprehension problems if they are not known. Not all multi-word combinations are equally semantically compositional or transparent. Martinez and Murphy (2011) pointed out that opaque formulaic sequences may negatively affect reading comprehension or cause deceptive comprehension, especially when they are composed of the most frequent general words and are thus concealed among known words. English learners may presume that they are familiar with these very common words (e.g. as, of, in, well, that) when in fact they are not acquainted with the words in combination (e.g. as of, as well, as well as, in that), and so deduce a wrong meaning. If no distinction is made between individual general words and general word combinations, the latter may be overlooked or misinterpreted.
As such, this research focused on a semantically non-compositional subset of formulaic language. However, it excluded phrasal verbs, since they form such a large subset of formulaic language that they merit separate research of their own. This research sought to answer the following two questions. RQ1. Apart from phrasal verbs, what are the most frequent non-compositional multi-word combinations in English news articles? RQ2. Apart from phrasal verbs, what is the lexical coverage of the most frequent non-compositional multi-word combinations in English news articles?

Literature Review
Formulaic language is multi-faceted. In some cases, formulaic expressions tend to abandon their semantically compositional meaning in favor of a holistic one (Nattinger and DeCarrico, 1992). Semantic transparency is related to semantic compositionality. Compositionality signifies how easily a multi-word combination can be interpreted from its component words. Conversely, non-compositionality denotes that the meaning of a multi-word unit as a whole cannot be derived by decoding its constituent parts; that is, the individual words of a multi-word unit do not help each other to reveal the meaning of the whole. Lewis (1993) called the varying degrees of compositionality "a spectrum of idiomaticity" (p. 98).
Along the axis of idiomaticity, Howarth (1998) put forward a framework for categorizing multi-word combinations from least to most idiomatic: free combinations, restricted collocations, figurative idioms and pure idioms. At the extreme end of compositionality, free combinations deliver the literal meanings of their component words and allow substitution, having the highest degree of semantic transparency (e.g. video games, free games, indoor games). Restricted collocations are word combinations in which substitution is possible only within certain limits. Specifically, at least one word has a non-literal meaning and at least one word is used in its literal sense, and the whole combination is still more or less transparent (Cowie, 1998) (e.g. keep an eye on, make a comeback). Figurative idioms have metaphorical meanings as a whole, which are separate from their literal meanings (e.g. a house of cards, a smoking gun). With little connection to the meanings of their constituent parts, pure idioms need to be explained and learned as a whole (e.g. cut the mustard, red herring).
This research focused its attention on non-compositionality because non-compositional multi-word combinations form distinct meanings and can be learned like single words. According to Nation (2006), lexical text coverage is defined as "the percentage of running words in the text known by the reader" (p. 61) and is regarded as an indicator of whether a text is likely to be adequately understood. When lexical text coverage is calculated on the basis of known words, multi-word combinations are not taken into account. As such, the lexical coverage of a text may be overestimated when non-compositional multi-word combinations are hidden among known words and their meanings as a whole happen to be unknown to learners. In this case, knowledge of non-compositional multi-word expressions may help to fill the gap in lexical text coverage that individual words fail to account for (Martinez and Murphy, 2011).
In the literature, there are two fundamental approaches used to retrieve recurrent multi-word combinations: a frequency-based approach and a phraseological approach (Nesselhauf, 2005). The former mainly relies on statistical measures as screening criteria, whereas the latter primarily resorts to linguistic analysis and hence manual examination is inevitable.
The pre-determined cut-off points in the literature for frequency and dispersion have been arbitrary, subject to researchers' goals. Biber et al. (1999) adopted a very flexible cut-off point at a minimum of ten times per million words across five or more texts. They found that 3-word bundles occurred over 60,000 times and 4-word bundles over 5,000 times per million tokens, accounting for approximately 21% of the 5.3 million words of the academic section of the Longman Corpus. Biber et al. (2004) were more cautious in choosing lexical bundles from their corpora, setting a relatively high frequency cut-off at 40 times per million words. Following Biber et al., Hyland (2008) increased the cut-off value from a minimum of 10 times to 20 times per million words and required lexical bundles to occur in at least 10% of the texts when selecting them from his 3.5-million-word corpus of research articles, master's dissertations and PhD theses.
Present-day n-gram programs ensure the properties of frequency and multi-text occurrence but fail to deal adequately with the retrieval of meaningful units. Based purely on statistical figures, a phrase extractor may generate a long list of multi-word sequences, some of which carry little meaning (e.g. that do not, and there being) and some of which are grammatically ill-formed (e.g. was found in the, of the distribution of). Though frequent, such word combinations may not be "pedagogically compelling" (Simpson-Vlach and Ellis, 2010, p. 493).
To identify the most frequent spoken collocations, Shin and Nation (2008, p. 341) proposed a set of selection criteria, one of which was "grammatical well-formedness" and involved a great deal of manual checking. From the spoken section of the British National Corpus, they targeted sequences of words which do not span "immediate constituents" (two neighboring phrases/clauses) (Bloomfield, 1933, p. 161), because a grammatically well-formed word sequence is a comprehensible unit. For instance, "the fact that" is more understandable than "fact that the", since the retrieval of the former follows the dividing principle of immediate constituents.
By compiling a 25-million-token corpus of research articles across five academic domains, Durrant (2009) endeavored to make a listing of positionally-variable collocations for students from a wide range of departments. Relying on the log-likelihood and Mutual Information, he identified the most frequent 1,000 academic collocations (e.g. respect to, number of, effect on, effects on, was used). However, some collocations fail to contribute to the learning of grammatical patterns if they are not extended to longer word sequences (e.g. was used). Some collocations can be combined into one for learning together (e.g. effect on, effects on), while others are apparently incomplete so that they are not suitable for direct teaching (e.g. respect to, number of).
To tackle the problem of teachability, Simpson-Vlach and Ellis (2010) proposed the notion of Formula Teaching Worth (FTW) by incorporating the Mutual Information (MI) score into their weeding procedure in lieu of a merely frequency-based approach. MI is a statistical measure of the cohesiveness of words, which signifies the degree to which the words are bound together (Stubbs, 2007). In one of their cases, the word sequence "with which the" occurred more frequently than expected (passing a certain threshold of both frequency and range). In contrast, the expression "on the other hand" cohered much more than would be expected by chance, as shown by its high MI score. The expression "with which the" would come at the top if frequency were the top priority in ranking formulaic sequences, while "on the other hand" would rank high if the MI score were considered first. In the light of identifiable meanings, the latter seems more noteworthy for teaching than the former. After a series of reliability and validity checks, Simpson-Vlach and Ellis (2010) concluded that the FTW, which combines frequency and MI, may provide teachers with a basis for prioritization when judging whether multi-word sequences are pedagogically compelling.
Also relevant to this study is the cross-disciplinary Academic Collocation List (ACL). Ackermann and Chen (2013) compiled a corpus of over 25 million tokens from the Pearson International Corpus of Academic English (PICAE). Using MI and t-score for initial filtering and then a panel of experts for screening, they retrieved the 2,468 most frequent lexical collocations, which were claimed to be immediately operationalizable for EAP teachers to help students increase collocational competence in academic English. Despite its relevance for learners with academic goals, the ACL, which includes free word combinations (e.g. further research, academic writing), seems so unwieldy that it may overburden students before they can concentrate on the collocations they need most urgently.
The review of previous studies has helped to shape the present approach to selecting recurrent multi-word combinations for inclusion in the list for pedagogical purposes. In view of the fact that not all multi-word units are of equal importance to learners with specific purposes, this research adopted semantic non-compositionality as a point of departure.

The Corpus
The NOW (News on the Web) Corpus is the largest well-balanced English news corpus to date. At the time of this research, it contained 7.3 billion words of data retrieved from web-based newspapers from 2010 onward. Automated scripts run every day to add texts to the corpus, so the corpus is continually growing by 140 to 160 million words each month. Because it is updated daily, the corpus reflects contemporary English as it evolves. This has important implications for the learning of non-compositional multi-word expressions, since a very low frequency in the NOW Corpus may indicate that a phrase is of little pedagogical value.

The Procedure
The selection of recurrent multi-word combinations for inclusion in the list involved quantitative and qualitative approaches. The frequency measure resembled those used for lexical bundles in past studies in some ways. To lessen subjectivity, the researcher referred to Shin and Nation (2008) as well as Ackermann and Chen (2013) and thereby formulated two questions to guide the judgment. They were used to gauge meaningfulness and well-formedness after candidate multi-word sequences had been initially identified.
The software Collocate (Barlow, 2004) was used to retrieve multi-word sequences from a copy of the NOW Corpus downloaded for offline use. The span parameter for multi-word length was set from 2 to 6 words. Frequencies drop drastically as word sequences are extended to five words or beyond (Hyland, 2008). Though recurrent 6-word combinations may be relatively rare, they were also included for thoroughness.
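The retrieval step can be illustrated with a minimal Python sketch. It is not the algorithm implemented in Collocate, whose internals are not described here; the function name count_ngrams and the toy token sequence are assumptions made purely for illustration. The sketch counts every contiguous word sequence of 2 to 6 words in a tokenized text.

```python
from collections import Counter


def count_ngrams(tokens, min_n=2, max_n=6):
    """Count every contiguous word sequence of length min_n..max_n."""
    counts = Counter()
    for n in range(min_n, max_n + 1):
        # Slide a window of n tokens across the text.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Toy input invented for illustration; it is not drawn from the NOW Corpus.
tokens = "as well as in order to as well as".split()
ngrams = count_ngrams(tokens)
print(ngrams[("as", "well", "as")])  # 2
```

In a real application, the counts would be accumulated over the whole corpus before any frequency cut-off is applied.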
The next decision was which frequency level to use as a cut-off. Since other sifting measures were to follow, a less rigorous criterion was set to begin with, namely five times per million words. For a single word to enter the BNC first 5,000 most frequent word families, the word and its family members altogether need to occur at least 7.87 times per million words (Nation, 2012). Consequently, the cut-off was set at a minimum of five times per million words rather than 10 to 40 times as in previous research; for a corpus of 7.3 billion words, this amounts to a minimum of 36,500 occurrences.
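The conversion from a normalized rate to a raw count is simple arithmetic and can be checked with a short sketch. The function name raw_cutoff is hypothetical; the figures are those reported above.

```python
def raw_cutoff(per_million, corpus_tokens):
    """Convert a per-million-words threshold into a raw occurrence count."""
    return per_million * corpus_tokens / 1_000_000


# Five occurrences per million words in a 7.3-billion-word corpus.
minimum_occurrences = raw_cutoff(5, 7_300_000_000)
print(minimum_occurrences)  # 36500.0
```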
After the frequency-based measures, the strength of word co-occurrence was taken into account. There are several statistical measures to determine collocational strength. MI indicates the degree of mutual dependence of two or more words. The t score and log-likelihood ratio (LLR) are two measures of certainty of a word pairing. MI tends to give high scores to collocations having less frequent components but having strong associations between words, whereas the t score and LLR are sensitive to frequency in the sense that higher scores are associated with higher frequency of occurrence, and hence their scores are often high for functional/grammatical collocations. In consideration of possible multi-word combinations with less frequent constituent words, MI was adopted for ranking. MI complements the frequency measure. Frequency screening favors word sequences that may occur due to the high frequency of their components and may not have distinctive meanings (e.g. of which the). Since higher MI means greater association between words than is expected by chance, recurrent multi-word combinations with a high MI score are more likely to be meaningful. According to Hunston (2002), collocations with an MI score greater than 3 are considered strong. Therefore those candidate word sequences at the top of the ranked list by MI may be close to being integral in meaning. As a result, those multi-word combinations with both high frequency and high MI were first chosen while those appearing at the bottom of both frequency and MI rankings were removed. Multi-word combinations with the MI score lower than the default value (=3) were eliminated at this stage (e.g., with which can be).
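For two-word sequences, MI is commonly computed as the base-2 logarithm of the observed co-occurrence frequency divided by the frequency expected by chance. The sketch below assumes that standard formulation; the exact computation used by Collocate, and its extension to sequences longer than two words, may differ. The function name and all counts are invented for illustration.

```python
import math

MI_THRESHOLD = 3  # cut-off attributed to Hunston (2002) in the text


def mutual_information(pair_freq, freq_x, freq_y, corpus_tokens):
    """Pointwise MI of a two-word sequence: log2(observed / expected)."""
    # Expected co-occurrences if the two words were independent.
    expected = freq_x * freq_y / corpus_tokens
    return math.log2(pair_freq / expected)


# Hypothetical counts in a one-million-token corpus.
strong = mutual_information(64, 1_000, 2_000, 1_000_000)    # rarer words, tight bond
weak = mutual_information(500, 50_000, 30_000, 1_000_000)   # very frequent words

print(strong >= MI_THRESHOLD, weak >= MI_THRESHOLD)  # True False
```

The second pair co-occurs more often in absolute terms yet falls below the threshold, which is the sense in which MI complements a purely frequency-based screen.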
Subsequently, meaningfulness, grammatical well-formedness and semantic non-compositionality guided manual checking. The multi-word combinations to be included in the list had to have a meaning and be learnable as a whole. This criterion would help to make the present multi-word list comparable to an individual word list. To lessen subjectivity, four questions were used as selection criteria. Q1. Does the candidate multi-word combination convey a meaning? Q2. Does the candidate multi-word combination cross the boundary of an immediate constituent/phrase? Q3. Does the construct of the candidate multi-word combination behave like an individual lexical item, which is unlikely to be further analyzed into the form-meaning link of its subparts? Q4. Does the meaning of the candidate multi-word combination as a whole remain or marginally remain when each component word is decoded with its core meaning?
The researcher-teacher and her colleague each made an independent judgment of every candidate word combination. A 3-point scale was used, and the responses of yes, not sure and no were coded as 1, 0.5 and 0 respectively. When the two raters disagreed or the answer was "not sure", the entry was reserved for further examination.
For Q1 to Q4, Cohen's Kappa statistics were computed as inter-rater reliability tests. The kappa values were 0.91, 0.92, 0.87 and 0.89 respectively (all >0.80), revealing a high level of agreement between the two raters.
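As an illustration of how such a statistic is obtained, the following is a minimal sketch of unweighted Cohen's kappa for two raters using the 1 / 0.5 / 0 coding described above. Whether an unweighted or a weighted variant was used in this research is not stated, so the choice here is an assumption, and the ratings are invented.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length lists of codes."""
    n = len(rater_a)
    # Proportion of items on which the two raters gave the same code.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's marginal distribution.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    categories = set(freq_a) | set(freq_b)
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    return (observed - expected) / (1 - expected)


# Invented ratings for ten candidates (1 = yes, 0.5 = not sure, 0 = no).
rater_a = [1, 1, 1, 1, 0, 0, 0, 0, 0.5, 0.5]
rater_b = [1, 1, 1, 1, 0, 0, 0, 0.5, 0.5, 0.5]
kappa = cohens_kappa(rater_a, rater_b)
print(round(kappa, 2))  # 0.85
```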

Data Processing
To make the list serve its pedagogical purpose, two major modifications were made. The first dealt with partial overlap, a situation where a longer phrase was the combination of two or more shorter phrases, each of which could occur independently as a subset of the longer one. Take due to the/an absence of as an example. One of its subsets, due to, appeared 1,123,999 times, while the other three, the/an absence, absence of and the/an absence of, appeared 68,255, 220,825 and 388,967 times respectively. The prepositional phrase due to may have been connected with nouns or noun phrases other than the/an absence of. Similarly, the/an absence of was only one of the combinations built on absence of, for example, a complete absence of, a total absence of and an absence of. The absence is a free word combination, so it was not included in the current list. Since the four phrases due to, the/an absence of, absence of and due to the/an absence of can each stand alone as a meaningful unit, they were examined separately on the basis of their respective frequencies to decide whether they should be included in the list.
To make the list more compact, a word sequence in its usual form and its possible variants with the same meaning were combined. The examples include based on/upon, even if/though, and so on/forth, with the first word appearing more frequently than the second word.
Step 1 yielded an effective frequency threshold of more than 36,500 occurrences; Steps 2 and 3 yielded an effective MI threshold greater than 6.

The Most Frequent Non-Compositional Multi-Word Combinations in English News Texts
A total of 318 non-compositional expressions of 2 to 5 words were ultimately chosen and formed the multi-word combinations list. The list consists of 153 two-word, 103 three-word, 56 four-word and 6 five-word phrasal expressions commonly used in English news articles.
The RANGE program (Heatley et al., 2004) was used to examine the vocabulary levels of the individual word tokens of the frequent non-compositional multi-word combinations. This software is installed with the ranked twenty-five 1,000 English word-family lists derived from the British National Corpus (BNC) and the Corpus of Contemporary American English (COCA) according to their occurring frequency and dispersion in the corpora (Nation, 2012). The multi-word combinations list consists of 869 running words and involves 335 word types as well as 298 word families. The BNC/COCA first 1,000 word families account for 87.72% of the total words in the present list and the second 1,000 make up 5.15%. The combined coverage percentage of the first 2,000 word families is 92.87%. The percentage of the third 1,000 word families is 2.42%, the third highest lexical coverage after the first 2,000 high-frequency word families. After the first 4,000 word families, the coverage percentage of additional 1,000 word families rapidly reduces to less than 1%.
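The cumulative coverage figures follow directly from the per-band percentages reported above, as the short sketch below shows. The variable names are hypothetical; the percentages are those given in the text.

```python
from itertools import accumulate

# Coverage (%) of the first, second and third 1,000 BNC/COCA word families.
band_coverage = [87.72, 5.15, 2.42]

# Running total after each additional 1,000 word families.
cumulative = [round(total, 2) for total in accumulate(band_coverage)]
print(cumulative)  # [87.72, 92.87, 95.29]
```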
As can be seen above, a large number of non-compositional multi-word combinations are composed of very general words, most of which (95.29%) are from the first 3,000 most frequent words in the BNC/COCA. The pairings or strings of content words (nouns, lexical verbs, adjectives or adverbs) and function words (determiners, conjunctions, prepositions, pronouns, auxiliary verbs, modals and quantifiers) form a common pattern in the present list, for example, much as (=though), as well as, in order to, there + be, and to do with. Among the instances, the everyday words as, well, order, do, much and there do not have an independent meaning but are a component of a repertoire of multi-word combinations that make up a text, as Sinclair (1991) has claimed. Without specialist knowledge involved, these semantically non-compositional multi-word combinations occur across a wide range of subject areas with their high-frequency component words.
Concerning the structure of 2-word combinations, the vast majority (132 out of 153) are grammatically conditioned pairs, namely a content word combined with a function word, as opposed to only 21 lexical collocations, in which a content word is tied to a content word (e.g. simply put, no matter, so far, very few). Phrasal prepositions come second (26/132=19.7%) (e.g. as for, apart from, as per, according to), followed in third place by the pattern of a preposition + a noun (20/132=15.2%) (e.g. at once, at times, in place, in question).
The three patterns as ~ as, a ~ of, and by + noun phrase are productive among the 3-word combinations, as in the cases of as far as, as much as, as soon as, a host of, a range of, a couple of, by means of, by way of and by virtue of. These three patterns contribute to the description of quantity, the coverage of a subject or an approach.
For 4-word sequences, the prepositional phrase is the most common structure, comprising 57% of all forms in the category of 4-word combinations (=32/56). They are, for instance, on one's own account, in the event of/that, in the light of, in the wake of, with a view to, on the grounds of/that.
In the present list, two 5-word combinations extended from 3-word combinations can still be semantically opaque, as shown in the instances of as far as…be concerned and have nothing/much/little/something to do with.

The Lexical Coverage of the Most Frequent Non-Compositional Multi-Word Combinations in the English News Corpus
The present multi-word combinations list contains a total of 318 phrases of 2 to 5 words with an accumulation of 33,917,223 individual instances and 101,751,907 running words, which makes up almost 2% of the tokens in the English News Corpus.
At first sight, 2% lexical coverage in the English News Corpus does not appear to be worth noticing. However, if not recognized, the non-compositional multi-word combinations may impede reading comprehension. Native English-speaking children view a vocabulary load of two unknown words per hundred words as difficult reading (Carver, 1994). Some scholars (Hu and Nation, 2000; Schmitt et al., 2011) regard one unknown word in every fifty words (98% lexical coverage) as the minimum threshold necessary for adequate comprehension. If 2% unknown words is a critical benchmark for unassisted understanding of a text, then the present non-compositional multi-word combinations should not be neglected. As such, the researcher would like to propose the inclusion of the non-compositional multi-word combinations in English for Journalism syllabi.

Findings
The major concern of this research was to create a semantically non-compositional subset of formulaic language for English for Journalism learners for receptive use. By means of a principled set of criteria, a total of 318 multi-word combinations of 2 to 5 words were selected, and they made up 2% of the total words in the English News Corpus. The present list contains the most widely used phrases across various everyday topics. As much as 95.29% of the words in the non-compositional multi-word combinations come from the BNC/COCA first 3,000 word families. Accordingly, the selected multi-word combinations can bridge the gap between the lexical coverage that the most general words can and cannot account for in a text. Irrespective of topic area, English news readers may come across these phrases while reading everyday news. The present multi-word combinations list is short and may be a viable option for English for Journalism learners to learn in a short time.
Despite arbitrary decisions on cut-off values in the compilation of the frequent non-compositional multi-word combinations, there may be some advantages to overt instruction of these frequent expressions. The effectiveness of learning opaque expressions is worth investigation but beyond the present focus. It is hoped that the present multi-word expressions list may provide some inspiration for future empirical studies and teaching materials development for Journalism purposes.

Pedagogical Implications
Although the present multi-word combinations list provides a window onto the journalistic register, itemized phrasal expressions are still not enough for EFL undergraduates. As with the learning of individual words, the non-compositional multi-word combinations should be learned in context rather than in isolation. English for Journalism teachers can raise their students' consciousness of how opaque phrases behave in context with the help of free online concordancers (e.g. Compleat Lexical Tutor at http://www.lextutor.ca/concordancers; NOW at https://corpus.byu.edu/now/). By using corpora, students can gain direct access to abundant examples of authentic language, resulting in a better understanding of the use of certain semantically non-compositional phrases. Classroom exercises using concordances may be undertaken, for instance, in the form of gap-fill exercises. With more exposure to English news, EFL undergraduates will consolidate the lexical knowledge acquired from the present opaque multi-word combinations list.