Genealogy How-Tos


Free genealogy and family tree search tips, how-to articles, and online genealogy tutorials

Paper on Tools and Techniques of Genealogical Research by Joseph C. Wolf

Tips and Tricks to Genealogy

Issues in automating genealogical research


A Metric-Based Machine Learning Approach to Genealogical
Record Linkage

S. Ivie, G. Henry, H. Gatrell and C. Giraud-Carrier
Department of Computer Science, Brigham Young University

Genealogical Record Linkage (GRL) is the process of determining whether two pedigrees
refer to the same base individual. Unlike other record linkage problems, GRL datasets
have a large number of attributes that frequently are sparsely populated with no
definitive limit. A metric-based, machine learning approach has been developed. In this
approach, innovative comparison metrics were developed for the three basic types of
data: names, dates and locations. In addition, two more advanced comparisons were
developed to handle one-to-many relationships (e.g., an individual may have 0 to an
unknown number of children). Using these metrics and Clementine’s C5.0 decision tree
learning algorithm (with costs and boosting), high levels of accuracy, precision, and
recall were achieved on a large post-blocking, standardized database.
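
To make the idea of per-field comparison metrics concrete, here is a minimal sketch in Python. It is not the authors' implementation: the function names, the record layout (a dict with "name", "birth_year", "birth_place" and "children" keys), the 10-year tolerance and the 0.8 match threshold are all assumptions chosen for illustration. It scores names, dates and locations separately, adds one simple one-to-many comparison for children, and collects the scores into a feature vector of the kind a decision tree learner could be trained on.

```python
from difflib import SequenceMatcher


def name_similarity(a, b):
    """Similarity in [0, 1] between two names; 0.0 if either is missing."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()


def year_similarity(y1, y2, tolerance=10):
    """1.0 for equal years, falling linearly to 0.0 at `tolerance` years apart."""
    if y1 is None or y2 is None:
        return 0.0
    return max(0.0, 1.0 - abs(y1 - y2) / tolerance)


def place_similarity(p1, p2):
    """Jaccard overlap of the comma-separated parts of two place strings."""
    if not p1 or not p2:
        return 0.0
    s1 = {part.strip().lower() for part in p1.split(",") if part.strip()}
    s2 = {part.strip().lower() for part in p2.split(",") if part.strip()}
    if not s1 or not s2:
        return 0.0
    return len(s1 & s2) / len(s1 | s2)


def children_similarity(c1, c2, threshold=0.8):
    """One-to-many comparison: the fraction of names in the shorter child
    list that have a close match somewhere in the longer list."""
    if not c1 or not c2:
        return 0.0
    small, large = sorted((c1, c2), key=len)
    matched = sum(
        1 for n in small
        if any(name_similarity(n, m) >= threshold for m in large)
    )
    return matched / len(small)


def feature_vector(rec_a, rec_b):
    """Per-field scores for one candidate pair (hypothetical record layout)."""
    return [
        name_similarity(rec_a.get("name"), rec_b.get("name")),
        year_similarity(rec_a.get("birth_year"), rec_b.get("birth_year")),
        place_similarity(rec_a.get("birth_place"), rec_b.get("birth_place")),
        children_similarity(rec_a.get("children"), rec_b.get("children")),
    ]
```

In this sketch a missing field scores 0.0, which does not distinguish "unknown" from "different". With attributes as sparsely populated as the abstract describes, a real system would likely want to represent missing values explicitly.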


Information Genealogy

NSF-Project IIS-0812091

Cornell University
Department of Computer Science

Project Goals

In many areas of life, we now have almost complete electronic archives reaching back for well over a decade. This includes, for example, the body of research papers in computer science, all news articles written in the US, and most people’s personal email. However, we have only rather limited methods for analyzing and understanding these collections. While keyword-based retrieval systems allow efficient access to individual documents in these archives, we still lack methods for understanding a corpus as a whole. In particular, these archives have grown through an "evolutionary" process, where new documents are influenced by the content of already existing documents and where ideas are iteratively refined. While this dependency structure is important for a high-level understanding of a collection and for improved retrieval (e.g. PageRank), little is explicitly represented or known about the influence between documents, their authors, and their effects on each other.
This project addresses the task of automatically detecting the influence structure and flow of ideas in document corpora that have grown over time (e.g. scientific literature, political debates, news, email, wikis, blogs). We call this the problem of Information Genealogy, where we trace the origin and development of ideas over time. A key difference from most prior work is that we will not require the existence of a formal citation network, but will rely primarily on the content of the documents.
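
As a rough illustration of what relying on content instead of citations can mean, the Python sketch below links each document to the earlier document whose wording it most resembles. This is not the project's method: it substitutes plain bag-of-words cosine similarity for whatever models the project actually uses, and the function names and the (doc_id, timestamp, text) tuple layout are assumptions made for the example.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts for one document."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def likely_influences(docs):
    """docs: list of (doc_id, timestamp, text) tuples.

    Returns {doc_id: (earlier_doc_id, score)} giving the most similar
    earlier document, or None for the earliest document."""
    ordered = sorted(docs, key=lambda d: d[1])
    vectors = {doc_id: bag_of_words(text) for doc_id, _, text in ordered}
    result = {}
    for i, (doc_id, _, _) in enumerate(ordered):
        best = None
        for earlier_id, _, _ in ordered[:i]:
            score = cosine(vectors[doc_id], vectors[earlier_id])
            if best is None or score > best[1]:
                best = (earlier_id, score)
        result[doc_id] = best
    return result
```

For example, given three short texts dated 2001, 2003 and 2005, where the 2005 text reuses most of the 2001 text's vocabulary, likely_influences would point the 2005 document back to the 2001 one. High similarity only suggests a possible influence; it does not establish one.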


Related Publications

B. Shaparenko, T. Joachims, Identifying the Original Contribution of a Document via Language Modeling, poster abstract, Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), ACM, 2009.
B. Shaparenko, T. Joachims, Information Genealogy: Uncovering the Flow of Ideas in Non-Hyperlinked Document Databases, Proceedings of the ACM Conference on Knowledge Discovery and Data Mining (KDD), ACM, 2007.
S. Pohl, F. Radlinski, T. Joachims, Recommending Related Papers Based on Digital Library Access Records, Proceedings of the Joint Conference on Digital Libraries (JCDL), 2007.
B. Shaparenko, R. Caruana, J. Gehrke, and T. Joachims, Identifying Temporal Patterns and Key Players in Document Collections. Proceedings of the IEEE ICDM Workshop on Temporal Data Mining: Algorithms, Theory and Applications (TDM-05), pp. 165–174, 2005.

Acknowledgement and Disclaimer

This material is based upon work supported by the National Science Foundation under CAREER Award IIS-0812091. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author and do not necessarily reflect the views of the National Science Foundation (NSF).


Proof Standards for Genealogy