Corpus Purpose:
This corpus is designed to represent general written American English from the founding era of the United States of America (i.e., 1765-1799). It attempts to represent general writing by sampling language from multiple registers (see Biber, 1993). Biber (1993) argues that register diversity, more so than corpus size, is useful for general language studies because language can vary so vastly from one register to another. Therefore, register is a key variable that must be considered when designing corpora and interpreting results from them. Thus, although this corpus does not fully represent American English from the founding era, it is currently the best corpus in existence for representing written language from that time period because it is both large and register-diversified. We provide a detailed description of the composition of this corpus below.
Current Status:
Version 3.00 was built 4 February 2019.
It includes corrections of OCR errors and adjusted word counts.
Current sources include 119,801 texts from three sources for a total of 133,488,113 words.
Source | Documents | Words |
Evans Early American Imprints | 2,646 | 62,582,540 |
Founders Online | 116,854 | 38,259,148 |
HeinOnline | 277 | 32,481,530 |
Farrands | 847 | 693,750 |
United States Statutes at Large | 479 | 444,673 |
Elliots | 145 | 373,957 |
Totals | 127,840 | 138,182,839 |
Version 5.3.0
Current sources include 127,840 texts from three sources for a total of 138,182,839 words.
The Initial Three Sources are:
Founders Online (https://founders.archives.gov/): over 90,000 records (mostly personal records, letters, diaries, etc.) from the National Archives.
Broken down by word count, the Founders Online records we are using represent the following founders.
Author | Words |
Washington Papers | 12,044,694 |
Adams Papers | 7,274,489 |
Hamilton Papers | 3,895,699 |
Franklin Papers | 2,578,518 |
Jefferson Papers | 1,726,603 |
Madison Papers | 119,680 |
HeinOnline (the largest legal publisher in the United States):
Around 300 records, mostly session laws, executive department reports, and legal treatises. For the most recent title list click here.
Evans Bibliography of Early American Imprints, covering the time frame of 1760 to 1799. For the most recent title list click here. Around 3,000 texts from Evans’s work American Bibliography: A Chronological Dictionary of All Books, Pamphlets and Periodical Publications Printed in the United States of America from the Genesis of Printing in 1639 Down to and Including the Year 1820; with Bibliographical and Biographical Notes. We were given the third of Evans that is available, and about half of that fell within our time frame. It was shared with us by the University of Michigan’s Text Creation Partnership (TCP).
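To see how the Founders Online material is distributed across the individual founders, the per-author word counts listed above can be converted into proportions. The following is only a minimal sketch: the dictionary `founders_words` and the helper `word_shares` are hypothetical names introduced for illustration, and the counts are copied directly from the author table rather than recomputed from the corpus files.

```python
# Word counts copied from the Founders Online author table above.
# The variable and function names are illustrative assumptions,
# not part of any COFEA tooling.
founders_words = {
    "Washington Papers": 12_044_694,
    "Adams Papers": 7_274_489,
    "Hamilton Papers": 3_895_699,
    "Franklin Papers": 2_578_518,
    "Jefferson Papers": 1_726_603,
    "Madison Papers": 119_680,
}


def word_shares(counts):
    """Return each entry's fraction of the summed word count."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


shares = word_shares(founders_words)

# Print the authors from largest to smallest share of the listed words.
for name, share in sorted(shares.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<20} {share:6.1%}")
```

The shares are relative to the sum of the six listed author rows only, so they describe the balance among these founders rather than their share of the whole corpus.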
Goal: Develop a large, balanced corpus of English-language materials available between 1760 and 1799.
Background:
COFEA was initially conceptualized by James Phillips in 2015, while he was a visiting professor at BYU Law School.
It covers the time period starting with the reign of King George III and ending with the death of George Washington (1760-1799), making it the oldest historical corpus of American English, and possibly the first in existence for that time period.
References
Biber, D. (1993). Using register-diversified corpora for general language studies. Computational Linguistics, 19(2), 219-241.
(Updated 22/2/2024)