Corpus of Founding Era American English (COFEA)

Corpus Purpose:

This corpus is designed to represent general written American English from the founding era of the United States of America (i.e., 1765-1799). It attempts to represent general writing by sampling language from multiple registers (see Biber, 1993). Biber (1993) argues that register diversity, more so than corpus size, is what makes a corpus useful for general language studies, because language can vary so vastly from one register to another. Register is therefore a key variable that must be considered when designing corpora and interpreting results from them. Thus, although this corpus does not fully represent American English from the founding era, it is both large and register-diversified, which makes it currently the best corpus in existence for representing written language from that time period. We provide a detailed description of the composition of this corpus below.

Current Status:  

Version 3.00 was built 4 February 2019.

It includes corrections of OCR errors and adjusted word counts.

Current sources include 119,801 texts from six sources for a total of 133,488,113 words.

Source                             Documents        Words
Evans Early American Imprints          2,645   62,660,171
Founders Online                      115,408   37,057,114
HeinOnline                               277   32,237,273
Farrands                                 847      689,755
United States Statutes at Large          479      470,345
Elliots                                  145      373,455
Totals                               119,801  133,488,113
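
The totals row can be checked against the per-source figures with a few lines of Python. This is only an illustrative sketch: the names `SOURCES`, `REPORTED_TOTALS`, `corpus_totals`, and `word_shares` are hypothetical and not part of any COFEA tooling, and the numbers are copied by hand from the Version 3.00 table.

```python
# Per-source (documents, words) figures copied from the Version 3.00 table.
# All names here are hypothetical; this is not part of any COFEA tooling.
SOURCES = {
    "Evans Early American Imprints": (2_645, 62_660_171),
    "Founders Online": (115_408, 37_057_114),
    "HeinOnline": (277, 32_237_273),
    "Farrands": (847, 689_755),
    "United States Statutes at Large": (479, 470_345),
    "Elliots": (145, 373_455),
}

# Totals row as reported in the table: (documents, words).
REPORTED_TOTALS = (119_801, 133_488_113)


def corpus_totals(sources):
    """Sum documents and words across all sources."""
    documents = sum(docs for docs, _ in sources.values())
    words = sum(words for _, words in sources.values())
    return documents, words


def word_shares(sources):
    """Return each source's fraction of the total word count."""
    _, total_words = corpus_totals(sources)
    return {name: words / total_words for name, (_, words) in sources.items()}


if __name__ == "__main__":
    print("Totals match table:", corpus_totals(SOURCES) == REPORTED_TOTALS)
    for name, share in word_shares(SOURCES).items():
        print(f"{name}: {share:.1%} of words")
```

Run this way, the per-source figures sum to the reported totals. Founders Online contributes the great majority of documents, while Evans Early American Imprints contributes the largest share of words.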

 

Version 2.1

Version 2.1 included 95,133 texts from three sources for a total of 138,892,619 words.

The initial three sources are:

Founders Online (https://founders.archives.gov/): over 90,000 records (mostly personal records, letters, diaries, etc.) from the National Archives.

Broken down by word count, the Founders Online records we are using represent the following founders.

Author Words
Washington Papers 12,044,694
Adams Papers 7,274,489
Hamilton Papers 3,895,699
Franklin Papers 2,578,518
Jefferson Papers 1,726,603
Madison Papers 119,680

HeinOnline (the largest legal publisher in the United States)

Around 300 records. These are mostly session laws, executive department reports, and legal treatises. For the most recent title list click here.

Evans Bibliography of Early American Imprints, covering the time frame of 1760 to 1799. For the most recent title list click here. Around 3,000 texts from Evans's work American Bibliography: A Chronological Dictionary of All Books, Pamphlets and Periodical Publications Printed in the United States of America from the Genesis of Printing in 1639 Down to and Including the Year 1820; with Bibliographical and Biographical Notes. We were given a third of the available Evans texts, and about half of those were within our time frame. They were shared with us by the University of Michigan’s Text Creation Partnership (TCP).

Goal: Develop a large, balanced corpus of English language materials available between 1760 and 1799.

Background:

COFEA was initially conceptualized by James Phillips in 2015 while he was a visiting professor at BYU Law School.

It covers the time period starting with the reign of King George III and ending with the death of George Washington (1760-1799), making it the oldest historical corpus of American English, and possibly the first in existence for that time period.

References

Biber, D. (1993). Using register-diversified corpora for general language studies. Computational Linguistics, 19(2), 219-241.

 

(Updated 10/11/2019)

David Armond

Head of Infrastructure & Technology

BYU Law
Sara White

Corpus Linguistics Fellow