How "Phrases in English" Was Made
Modification and Normalization of the BNC Data
To interpret query results correctly, users of this site should
understand the normalization conventions followed in compiling the PIE database. Generally speaking, these conventions were
adopted to limit the number of non-essential distinctions in the data both to permit linguistic patterns to emerge more
clearly and to improve database performance. These conventions and the exclusion of items falling
below the frequency threshold of three mean that queries against this database can yield different
frequency counts than queries directly against the BNC database.
-
Upper-case characters were converted to lower case. Since proper
nouns have different POS tags from common nouns, a proper noun such as
Pole can still be distinguished from pole, but proper adjectives
are not marked.
-
Accented characters (café, façade) were mapped onto their
"plain" equivalents (cafe, facade), both because source texts are
not always consistent or correct in their use of diacritics and because entry of
plain-character queries is easier from English-language keyboards. Other
SGML
character entities were treated variously
as detailed here.
-
All numerals of any magnitude and degree of precision were mapped onto a single #,
so both 31,298,435 and 0.0095 appear as #, and ranges
of numerals like 1931-35 appear as #-#. The primary
motivation was to highlight lexical patterns involving numbers and dates
which would otherwise be obscured by the large number of variants. About 1
in 50 "words" in the BNC is a sequence of numerals, and the data contain
about 45,000 different sequences, of which 60% occur just once; fewer than
half the numbers meet the frequency cutoff of three or more occurrences for inclusion in the PIE database.
In the follow-on phase an additional database is planned specifically
to study numerals and numbers in the BNC data.
-
Each token identified by the
CLAWS parser with a lexical or morphemic POS tag (i.e. not a
punctuation tag) was treated as a "word".
"Multiword
units" (such as in spite of) are joined by underscores, not spaces (in_spite_of).
"Fused"
forms, both contractions like isn't and possessives like boy's,
are separated into their components: is n't, boy 's.
Somewhat inconsistently, can't was de-fused into can n't, but
won't and ain't were separated into wo n't and ai n't. Following BNC's
usage, variant spellings without or with a space are treated as one or two
words respectively, and hyphenated variants are treated as a single
distinct word: data base (two words); database, data-base
(distinct one-word types).
-
Punctuation marks other
than quotation marks and commas were treated as end-of-phrase markers, as were
segment boundaries assigned by the CLAWS parser, and word-external punctuation was stripped.
-
All sequences of 1-8 "words" and POS tags in each phrase (see the preceding item) were
isolated and tallied to construct the database. Words and n-grams
occurring fewer than three times in the entire corpus were dropped.
From this database phrase-frames were derived and tallied. All
phrase-frames with two or more variants were retained.
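The character-level conventions and the n-gram tally described above can be sketched roughly as follows. This is an illustrative Python sketch, not the PowerBasic code used to build the database. The names normalize_token, count_ngrams and NUMERAL are hypothetical, the numeral pattern is an assumption about what counts as a "numeral", and the input is assumed to be already split into phrases of word tokens (tokenization, de-fusing and POS tagging are not reproduced here).

```python
import re
import unicodedata
from collections import Counter

# Assumed numeral pattern: digits with optional thousands separators
# and an optional decimal part. The real rules may differ.
NUMERAL = re.compile(r"\d+(?:,\d{3})*(?:\.\d+)?")


def normalize_token(token: str) -> str:
    """Lower-case a token, strip diacritics, and map numerals to '#'."""
    token = token.lower()
    # Decompose accented characters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", token)
    token = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return NUMERAL.sub("#", token)


def count_ngrams(phrases, max_n=8, min_freq=3):
    """Tally all 1..max_n-grams within each phrase; keep those seen min_freq+ times."""
    counts = Counter()
    for phrase in phrases:
        tokens = [normalize_token(t) for t in phrase]
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_freq}
```

For example, normalize_token("1931-35") yields "#-#" and normalize_token("Café") yields "cafe". Because n-grams are collected within each phrase only, no n-gram spans an end-of-phrase marker.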
The Words and Phrases database uses the
MySQL database server with a Web user
interface programmed in PHP. The text
normalization procedures described above are programmed in
PowerBasic incorporating routines coded
by the developer for KWiCFinder and
kfNgram,
which generates phrase-frames and chargrams as well as n-grams.
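How phrase-frames might be derived from the tallied n-grams can be sketched as below. This section does not define a phrase-frame, so the sketch assumes that a phrase-frame is an n-gram with one word position replaced by a wildcard (*), and that its variants are the words attested in that position. The function name phrase_frames and the input format (a mapping from n-gram tuples to frequencies) are hypothetical and do not reflect kfNgram's actual interface.

```python
from collections import defaultdict


def phrase_frames(ngram_counts, min_variants=2):
    """Group n-grams that differ in exactly one position into frames.

    ngram_counts maps tuples of words to frequencies. Returns a dict that
    maps each frame (a tuple containing one '*') to its variants and their
    frequencies, keeping only frames with at least min_variants variants.
    """
    frames = defaultdict(dict)
    for gram, count in ngram_counts.items():
        if len(gram) < 2:
            continue  # a one-word frame would be a bare wildcard
        for i in range(len(gram)):
            frame = gram[:i] + ("*",) + gram[i + 1:]
            frames[frame][gram[i]] = count
    return {f: v for f, v in frames.items() if len(v) >= min_variants}
```

Given counts for "the end of", "the rest of" and "the end to", the frame "the * of" would be retained with the variants end and rest, while "* end of" has a single variant and would be dropped.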