Principled Text De-duping

Status: 
Text scripts were written in late 2016 to identify possible duplicate texts in COFEA.

Goal: 
Develop a process for deduping text within any given corpus.

Background:
There are at least three types of duplication in the BYU Law Corpora.  The first is duplicate text caused by multiple sources of the same document: for instance, a text that appears in a Founders papers collection and is also reprinted in a volume of collected works.  The second is similar, but arises when a party quoting the original source republishes it in whole or in part.  The third is idiomatic language that may be repeated in texts prepared for multiple individuals, or that recurs across many texts but carries limited noetic value.  Deciding how each type should be treated requires advice from professional linguists before a satisfactory technical solution can be developed.
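
As a rough illustration only, the three types above could be flagged for human review with word-shingle overlap. This is a minimal sketch and not the 2016 scripts: the function names, the 5-word shingle size, and the 0.8 and 3-document thresholds are all hypothetical placeholders that would need to be set with linguistic advice.

```python
import re
from collections import Counter


def shingles(text, n=5):
    """Return the set of n-word shingles in text (lowercased, punctuation ignored)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < n:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Shingles shared by both sets, as a fraction of their union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def containment(a, b):
    """Fraction of a's shingles that also appear in b."""
    return len(a & b) / len(a) if a else 0.0


def classify_pair(text_a, text_b, n=5, dup_threshold=0.8, quote_threshold=0.8):
    """Label a pair of texts; thresholds are assumed values, not tuned ones."""
    a, b = shingles(text_a, n), shingles(text_b, n)
    # Type 1: both texts are almost entirely the same material.
    if jaccard(a, b) >= dup_threshold:
        return "likely duplicate"
    # Type 2: one text is largely contained inside a longer one.
    if max(containment(a, b), containment(b, a)) >= quote_threshold:
        return "possible republication"
    return "distinct"


def common_shingles(texts, n=5, min_docs=3):
    """Type 3 candidates: shingles that occur in at least min_docs texts."""
    counts = Counter()
    for text in texts:
        counts.update(shingles(text, n))  # a set, so each text counts once
    return {s for s, c in counts.items() if c >= min_docs}
```

A pair labeled "likely duplicate" or "possible republication" is only a candidate for review. Shingles returned by common_shingles (for example, a stock letter closing) could be excluded before pairs are compared, so that formulaic language alone does not cause a match.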