"Phrases in English" - Getting Started

Before you start...

Please take a moment to read the FAQ to understand what this site means by n-grams and phrase-frames, and how the BNC defines words and tags them grammatically with POS codes. Then familiarize yourself with the normalization conventions which are specific to this database. Finally, please remember that this site contains only a subset of the words and phrases in the BNC, those occurring three times or more.

The drop-down menu under "Grams" affords access to all the query interfaces.

The following discussion shows screen shots from the "Explore N-Grams" page. Important differences for "Explore Phrase-Frames" are outlined at the end.

Clicking on one of the "Explore..." links launches a frameset with a query pane on the left and a blank pane for query results on the right. If either side is too narrow, a scrollbar appears at the bottom of the window. To change the relative sizes, click on the divider, hold down the mouse button, and drag the divider to resize the panes.

While numerous [Query] buttons are strewn around the page for convenience, they are actually unnecessary: just hit the "Enter" key to submit a query.

Selecting the value of n

Select the number of words in n-grams or phrase-frames by clicking on the radio button to the left of the number. Try each number to develop a feeling for the kinds of datasets which match each value of n. A value of 1 returns a list of individual words. The highest-frequency two-word phrases tend to be fragmentary building blocks of language, including "de-fused" contractions like do n't, I 'm; other useful groups of 2-grams are compound nouns written as two words as well as adjective-noun and adverb-adjective collocations. Recurring 3-grams and 4-grams are often (almost) complete familiar phrases. The larger n is, the larger the number of distinct n-grams in the corpus grows, while the percentage of n-grams meeting the cutoff criteria (here: minimum of 3 occurrences in the corpus) declines. For example, there are almost 1.6 M 3-grams occurring 5 or more times, and over 1.3 M 2-grams; in contrast, fewer than half the former number of 4-grams cross the threshold, and the number of 6-grams that qualify is less than 4% of the figure for 3-grams. For n values of 5 and greater, formulaic expressions restricted to highly specific circumstances become increasingly prominent in the data, and the total frequency of any given 5- or 6-gram is relatively small. Values of n greater than 6 almost exclusively reflect formulaic language and quotations.
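The relationship between n and the frequency cutoff can be tried out on a small scale. The following is a minimal Python sketch, not the procedure used to build this database: the function names and the toy token list are invented for illustration, and the cutoff of 3 is taken from the inclusion threshold stated in "Before you start...".

```python
from collections import Counter

def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def apply_cutoff(counts, min_freq=3):
    """Keep only the n-grams that occur at least min_freq times."""
    return {gram: freq for gram, freq in counts.items() if freq >= min_freq}

# Invented toy text with de-fused contractions; this is not BNC data.
tokens = ("i do n't know " * 3 + "i do n't think so").split()

for n in (1, 2, 3, 4):
    counts = count_ngrams(tokens, n)
    kept = apply_cutoff(counts)
    # Columns: n, distinct n-grams, distinct n-grams meeting the cutoff
    print(n, len(counts), len(kept))
```

On real text the second column grows with n while the share surviving the cutoff shrinks, which is the pattern described above.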

Note the links to jump down to the word-form and POS-code filters, as well as the ubiquitous [Query] button.

Display options and numeric query conditions

The options in this section display "tool tips", i.e. short explanations that pop up when you hover your mouse cursor over them. Several display options appear in the right-hand column. As a general rule, for efficient querying you should limit the display options to what you actually need.

Count only reports the total number of items which match your query, not their total frequency. This provides a useful quick indicator of the validity of your criteria which is more efficient than retrieving and displaying an entire dataset.
Display POS tags shows the part of speech tags assigned to each word in a separate column to the right of the words matching your query. There is a link to an explanation of the codes. Deselect this option if you do not intend to use this grammatical information.
Format results as table produces an easy-to-read table of the dataset. Tabular format is required if you want to have clickable links to citations from the corpus for n-grams or links to list variants for phrase-frames. If you intend to save the results for computer analysis, not view them directly, uncheck the box for quicker results and a more compact download.
The Save settings link stores your current settings and reloads them when you return to this page. Click Reload defaults to restore your saved settings after changing any of the values. [These two features have not been implemented yet.]

Numeric Conditions

Minimum frequency restricts the result dataset to items which occur at least the specified number of times. Since the cutoff for inclusion in the database is a minimum frequency of 3, specifying lower values is the same as specifying 3. To improve response time, specify a cutoff value as high as possible for your purposes; if too few items match your query, you can reduce the cutoff value later. Follow this link for other tips to speed up your queries.
Maximum frequency restricts the result dataset to items which occur at most the specified number of times. This is useful for focusing on less frequent items. In combination with minimum frequency it permits systematic study of successive frequency ranges of items.
Start with item specifies the first item in the dataset to display. While this value is typically 1, entering a higher number allows you to resume viewing data at a point where you have left off or to skip very frequent items.
Important: When you have retrieved the first "chunk" of a dataset, click the Next > button in the query results pane to retrieve any successive chunks. (With Internet Explorer and Opera browsers the Next > button appears blue when there are more chunks remaining and gray when not; Netscape does not color the button blue to indicate data remaining in the dataset.)
Data chunk size specifies how many items to retrieve and display per chunk, up to 10,000. Large values may take a long time to download over a slow connection. A future version of this website will allow fetching complete datasets as compressed files.
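How the numeric conditions interact can be pictured with a short Python sketch. It is an illustration only, not the site's server code: the function name, the default chunk size of 100 and the sample frequencies are assumptions, while the minimum frequency of 3 and the chunk limit of 10,000 come from the descriptions above.

```python
MAX_CHUNK = 10_000  # upper limit on chunk size stated above

def select_chunk(items, min_freq=3, max_freq=None, start=1, chunk_size=100):
    """Apply the frequency limits, then return one chunk of the results.

    items is a list of (phrase, frequency) pairs; start is 1-based,
    like the "Start with item" field.
    """
    min_freq = max(min_freq, 3)  # lower values behave like 3
    matched = [
        (phrase, freq)
        for phrase, freq in items
        if freq >= min_freq and (max_freq is None or freq <= max_freq)
    ]
    size = min(chunk_size, MAX_CHUNK)
    first = start - 1
    chunk = matched[first:first + size]
    more_remaining = first + size < len(matched)  # would Next > be active?
    return chunk, more_remaining

# Invented frequencies for illustration only.
items = [("of the", 900), ("in the", 700), ("do n't", 400),
         ("i 'm", 50), ("much needed", 4)]

chunk, more = select_chunk(items, min_freq=10, start=2, chunk_size=2)
print(chunk, more)
```

Combining min_freq and max_freq in successive calls gives the "successive frequency ranges" mentioned under Maximum frequency.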

Order specifies (you guessed it) the order in which items are displayed: from most to least frequent or vice-versa, in alphabetical order of the items, or in alphabetical order of the POS tags. If you choose an order other than alphabetical, items with identical values for the primary sort key appear in alphabetical order of the word or phrase. Alphabetical sorts are the most efficient option.
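The tie-breaking rule can be sketched in Python with a two-part sort key. The rows and frequencies below are invented, and the sketch says nothing about how the database itself sorts.

```python
# Invented (phrase, frequency) rows for illustration only.
rows = [("in the", 700), ("of the", 900), ("at the", 700), ("do n't", 400)]

# Most to least frequent; equal frequencies fall back to alphabetical order.
by_freq_desc = sorted(rows, key=lambda row: (-row[1], row[0]))

# Least to most frequent, with the same alphabetical tie-break.
by_freq_asc = sorted(rows, key=lambda row: (row[1], row[0]))

# Plain alphabetical order of the items.
alphabetical = sorted(rows, key=lambda row: row[0])

for phrase, freq in by_freq_desc:
    print(freq, phrase)
```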

Focusing your search with filters

Without filters, every item that meets your numeric conditions is included in the results dataset, which degrades database response time. Filters narrow the dataset down to items which match specific criteria. You can match word-forms, POS tags, or both; if both criteria are specified for a given position, items must match all criteria (logical AND) to pass the filter.

To match alternate word-forms or POS tags, enter them all in the same field separated by spaces (logical OR).

Unfortunately there are no semantic filters. For example, by specifying both word-form and POS tag you can distinguish the verb match from the noun, but you cannot distinguish instances in which the latter means 'sports contest' from 'incendiary device', 'corresponding entity' etc.

Both word-form and POS filters support wildcards: * matches any number of characters; ? matches one character, no more, no less.

Check the exclude box to eliminate the forms or tags you specify from the dataset: then only items which do not match your specifications are retrieved.
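The filter logic described above (OR within a field, AND between the word-form and POS fields, inversion by the exclude box) can be modelled in a few lines of Python. This is a sketch under assumptions, not the site's implementation: the function names are invented, the exclude box is assumed to apply per field, and Python's fnmatch is used as a stand-in for the site's wildcard matching (fnmatch also gives special meaning to square brackets, which the site may not).

```python
from fnmatch import fnmatchcase

def field_matches(value, field_text, exclude=False):
    """Test one value against one filter field.

    field_text holds space-separated alternatives (logical OR);
    an empty field places no restriction on the value.
    """
    alternatives = field_text.split()
    if not alternatives:
        return True
    hit = any(fnmatchcase(value, pattern) for pattern in alternatives)
    return not hit if exclude else hit

def position_matches(word, pos, word_filter="", pos_filter="",
                     exclude_word=False, exclude_pos=False):
    """Both fields must pass (logical AND) for the position to match."""
    return (field_matches(word.lower(), word_filter.lower(), exclude_word)
            and field_matches(pos, pos_filter, exclude_pos))

# The verb "match" passes a V* filter, the noun does not.
print(position_matches("match", "VVI", word_filter="match", pos_filter="V*"))
print(position_matches("match", "NN1", word_filter="match", pos_filter="V*"))
```

Listing the alternatives explicitly, as in country countries, keeps the result set smaller than a broad pattern such as countr*.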

If there is an entry in a filter field, the field is colored light green, or else light red if the exclude box is checked. These color codes make it easy to spot which fields have entries and whether they specify inclusion or exclusion.

Note the < Top link to jump back to the top, e.g. to change the value of n. Decreasing the value of n hides some of the filter fields, but the values are preserved and reappear when the higher value of n is restored. Careful: clicking the Clear Filters button erases everything directly, without confirmation. It is a good idea to click this button when starting a new query lest you unintentionally carry over filters from a previous search.

Matching word forms

The Words & Phrases Database normalization process converts all characters to lower case, so matching is not case sensitive: Pole and pole both return the same dataset. To match more than one specific word-form, it is generally most efficient to specify all forms, separating them with spaces: country countries will match only these two forms, while wildcard countr* matches additional forms such as countrified, countryman, and countryside, resulting in a larger dataset with unnecessary items. You can increase the number of useful matches by specifying orthographic variants such as realise realize and database data-base.

The corpus preserves the spelling of original texts, so compound forms might be written together as a single word, with a hyphen, or as two separate words. Consequently you may need to run separate queries searching on up to three possibilities:

Query 1: Word 1 = database data-base
Query 2: Word 1 = data, Word 2 = base

or

Query 1: Word 1 = much-needed
Query 2: Word 1 = much, Word 2 = needed

In the normalization process sentence punctuation was removed. Three kinds of punctuation remain within word-forms:

hyphen - in hyphenated forms: much-needed
underscore _ in "multiword units": of_course in_spite_of
apostrophe ' in contracted forms and possessives, typically "de-fused" into separate word-forms: she 's does n't

Using wildcards you can find forms with these punctuation marks: *_* matches all multiword forms with underscore, and *-* returns hyphenated forms. Forms with apostrophes are "de-fused" into their components; to reconstruct them look for 2-grams and specify *'* for word 2.
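As a rough illustration of these patterns, the Python sketch below applies them to a handful of invented items and rejoins de-fused 2-grams. The helper name is hypothetical, and fnmatch merely approximates the site's wildcard behavior.

```python
from fnmatch import fnmatchcase

def rejoin_contraction(bigram):
    """Rejoin a de-fused 2-gram when word 2 contains an apostrophe."""
    first, second = bigram.split()
    if fnmatchcase(second, "*'*"):
        return first + second
    return bigram

# Invented sample items in the database's normalized (lower-case) form.
words = ["of_course", "much-needed", "in_spite_of", "database"]
bigrams = ["she 's", "does n't", "much needed"]

multiword = [w for w in words if fnmatchcase(w, "*_*")]
hyphenated = [w for w in words if fnmatchcase(w, "*-*")]
rejoined = [rejoin_contraction(b) for b in bigrams]

print(multiword)
print(hyphenated)
print(rejoined)
```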

Matching POS codes

The CLAWS parser assigns each "word" in the corpus a Part Of Speech (POS) code as detailed here. These codes may be specified in various ways:

Enter the three-character POS code directly in the field, separating multiple codes with spaces, e.g. NN1 NN2 matches both singular and plural common nouns. Wildcards can be used without sacrificing efficiency, e.g. N* matches any noun.
Enter the numeric values from this table. With this method, 17 18 matches both singular and plural common nouns. Numeric values can also be specified as ranges by joining the lowest and highest value in the range with a hyphen: the range 16-19 matches any noun. These numeric values are specific to this website. In some cases the POS tags have been reordered to create ranges corresponding to POS supercategories. For example, the class articles has been moved out of alphabetical order so the range 12-15 includes all determiners.
Select a POS description from the drop-down list to the left of the entry field. This list includes single POS tags and supercategories defined with both wildcards and numeric ranges. Caution: selecting a category from this list erases anything already appearing in the corresponding field.
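The numeric notation, with single values and hyphenated ranges mixed in one field, can be expanded as in the following Python sketch. The function name is invented, and the sketch deliberately does not reproduce the site's table that maps numbers to POS tags.

```python
def parse_numeric_pos(spec):
    """Expand a field such as '12-15 17 18' into a set of integers.

    Ranges are inclusive: the lowest and highest values both belong.
    """
    codes = set()
    for part in spec.split():
        if "-" in part:
            low, high = part.split("-", 1)
            codes.update(range(int(low), int(high) + 1))
        else:
            codes.add(int(part))
    return codes

nouns = parse_numeric_pos("16-19")        # the range given above for any noun
common_nouns = parse_numeric_pos("17 18")  # singular and plural common nouns

# The common-noun codes fall inside the noun range.
print(common_nouns <= nouns)
```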

Tip: to get a feeling for different ways to specify POS tags, select various entries from the drop-down list and study the codes that appear in the POS field.

Differences between n-gram and phrase-frame queries

There are several subtle differences between the "Explore N-Grams" and the "Explore Phrase-Frames" interfaces:

Since phrase-frames must consist of more than one lexical unit, the smallest meaningful value of n is 2.
There is one additional pseudo-POS-tag which can be matched, the "wildword" code -*- . Entering it forces the wildword to appear in the corresponding position (i.e. as the first, second etc. word). Careful: at most one wildword may be specified per query, and each query must have at least one position for which both the word-form and POS-tag are unspecified.
Phrase-frame query results show only the phrase-frames. To see the actual variants of a phrase-frame, click on it in the results pane; the variants will appear in a new window. (If you have a pop-up window blocker, disable it for this site!)
Additional numeric conditions and ordering options for phrase-frames are highlighted in green in the screenshot below. Reminder: here as elsewhere, references to counts and frequencies imply "for the numeric conditions and word and POS filters specified".
- Minimum and maximum limits can be specified for both the absolute number of variants and the total number of occurrences of all variants of a phrase-frame.
- In addition to ordering results by the total number of occurrences of all variants of a phrase-frame, one can order them by the total number of variants of each phrase-frame.
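The two phrase-frame measures, number of variants and total occurrences of all variants, can be illustrated with a small Python sketch. It assumes that a phrase-frame groups n-grams which are identical except for the word in one position; the "*" placeholder, the function names, the sample frequencies and the minimum of two variants are all choices made for this sketch, not details taken from the database.

```python
from collections import defaultdict

WILDWORD = "*"  # placeholder chosen for this sketch

def phrase_frames(ngram_counts):
    """Group n-grams into frames that leave exactly one position open."""
    frames = defaultdict(dict)
    for ngram, freq in ngram_counts.items():
        words = ngram.split()
        for i in range(len(words)):
            frame = " ".join(words[:i] + [WILDWORD] + words[i + 1:])
            frames[frame][ngram] = freq
    return frames

def summarize(frames, min_variants=2):
    """Return (frame, variant count, total occurrences), most variants first."""
    rows = [
        (frame, len(variants), sum(variants.values()))
        for frame, variants in frames.items()
        if len(variants) >= min_variants
    ]
    return sorted(rows, key=lambda row: (-row[1], row[0]))

# Invented 4-gram frequencies for illustration only.
ngram_counts = {
    "at the end of": 50,
    "at the top of": 20,
    "at the start of": 10,
    "in the end of": 3,
}

for frame, n_variants, total in summarize(phrase_frames(ngram_counts)):
    print(frame, n_variants, total)
```

Sorting on the third element of each row instead would correspond to ordering by total occurrences.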

Result pages

All kinds of query results appear in a "pane" next to the query form and have several clickable buttons at the top:

Print prints the results pane.
Next > and < Back show the following / preceding "chunk" of the current dataset. They are "grayed out" when no more data are available.
Save Page saves the entire page including the Options header, either in HTML format if the "Format results as table..." option has been specified, or else as a text file. Save Data saves only the data portion of the page in text format starting with the "chunk" identifier line. By default files are saved to the desktop with a unique filename derived from the current date, query number and chunk range; the user may specify a different filename or target folder. Currently this function works only with Internet Explorer under Windows. Custom security settings may be required for it to work properly. Click here to troubleshoot problems with saving pages.
POS Codes displays an explanation of the tags if they are shown.

N-Grams

Click on any word or phrase to display a random set of 50 concordances of the item from the corpus in a separate window.

Phrase-Frames

Click on any phrase-frame to display variants in a separate window. The frequency cutoffs, first item and chunk size are determined by the parameters specified in the phrase-frame query.

Phrase-Frame Variants