Law & Corpus Linguistics — Background

Corpus linguistics is an approach to language research that uses a principled collection of texts (i.e., a corpus) to better understand patterns of language use. Analysis of these patterns can yield insight into, among other things, the meaning of words and phrases. Linguists and lexicographers have long understood that corpora are a far better guide to interpretation than native-speaker intuition or even dictionaries. With advances in computer technology, the use of corpus linguistics in research has expanded dramatically. Legal scholars and judges have only recently begun to tap the potential of this method, largely because most have been unaware of its possibilities.

What counts as corpus linguistics?

Biber, Conrad, and Reppen (1998) identified four characteristics that define corpus-based analyses and indicate how they can be used:

  1. They are empirical analyses of actual patterns of language use in natural texts;
  2. They are based on large and principled collections of texts called corpora to represent a target domain of language use;
  3. They use computers for analysis and employ both automatic and interactive techniques;
  4. They rely on both quantitative and qualitative analytic techniques to be successful (Biber, Conrad, & Reppen, 1998, p. 4).

Several points from these four hallmarks of corpus linguistics deserve emphasis. First, a corpus is used to find generalizable patterns in a large dataset. In making a rhetorically convincing argument, a writer will often select one or two well-chosen examples to illustrate a point. In corpus linguistics, however, instead of focusing solely on a small number of specific examples, one can find systematic, robust patterns that recur again and again. Thus, instead of relying only on anecdotal evidence to support a linguistic claim, one can offer a large body of evidence in support of one's thesis.
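
As a rough illustration of what looking for recurring patterns rather than hand-picked examples can mean in practice, the sketch below counts the words that appear near a search term across a set of texts. It is a minimal sketch, not a description of any particular corpus tool: the four sentences, the names TOY_CORPUS, tokenize, and collocates, and the three-word window are all assumptions made for the example, and a real study would draw on a far larger, principled collection of texts.

```python
import re
from collections import Counter

# Hypothetical toy corpus, invented for illustration only.
TOY_CORPUS = [
    "The officer was told to carry a firearm on patrol.",
    "She had to carry the groceries up the stairs.",
    "He would carry a firearm in the glove box of his car.",
    "They carry a heavy burden of proof.",
]


def tokenize(text):
    """Lowercase a text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def collocates(texts, node, window=3):
    """Count the words found within `window` tokens of each use of `node`."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        for i, token in enumerate(tokens):
            if token == node:
                left = tokens[max(0, i - window):i]
                right = tokens[i + 1:i + window + 1]
                counts.update(left + right)
    return counts


counts = collocates(TOY_CORPUS, "carry")
# Show the five words that most often appear near "carry".
print(counts.most_common(5))
```

The counts supply the quantitative side of the analysis; an analyst would still read the individual sentences to judge what each use means, which is the qualitative side.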

Second, corpora represent a target domain of language use. A corpus composed of 21st-century face-to-face conversations among British English speakers will not share the same situational, linguistic, or functional characteristics as the language of 19th-century U.S. Supreme Court opinions. Even when the language varieties under consideration differ less sharply than in that example, a corpus, and any findings based on its analysis, will be generalizable only to the extent that the corpus shares characteristics with the target domain of language use.
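
One way to see why the target domain matters is to compare how often the same word occurs in corpora drawn from different domains. Because corpora differ in size, raw counts are usually converted to a rate, such as occurrences per million words, before they are compared. The sketch below assumes two tiny invented samples (CONVERSATION and OPINIONS) and a hypothetical helper, per_million; the resulting numbers show only the arithmetic and say nothing about real usage.

```python
import re


def tokenize(text):
    """Lowercase a text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def per_million(texts, word):
    """Return the occurrences of `word` per million tokens in `texts`."""
    tokens = [token for text in texts for token in tokenize(text)]
    if not tokens:
        return 0.0  # An empty corpus has no meaningful rate.
    return tokens.count(word) / len(tokens) * 1_000_000


# Hypothetical stand-ins for two domains of language use.
CONVERSATION = [
    "well i think we should just go now",
    "shall we go then",
]
OPINIONS = [
    "the court shall affirm the judgment",
    "the statute shall be construed narrowly",
]

conv_rate = per_million(CONVERSATION, "shall")
opin_rate = per_million(OPINIONS, "shall")
print(f"conversation: {conv_rate:.0f} per million words")
print(f"opinions: {opin_rate:.0f} per million words")
```

A rate computed from one domain describes that domain only; applying it to another assumes the two share the relevant characteristics.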

Third, and perhaps most importantly, corpus linguistics is a scientific discipline that relies on both quantitative and qualitative analytic techniques. Its primary goal is therefore to use methodologically valid techniques to discover objective reality. The methods of corpus linguistics are designed to minimize bias, promote replicability, and produce results that are generalizable.

In conclusion, corpus linguistics is a methodological attempt to leverage computers to identify patterns of language use in large sets of data in order to make generalizable claims. Because so much of legal scholarship revolves around linguistic questions, corpus methods offer a scientifically valid way to learn about objective reality and so to answer those questions.


Biber, D., Conrad, S., & Reppen, R. (1998). Corpus linguistics: Investigating language structure and use. Cambridge: Cambridge University Press.

(Updated 10/8/2019)