Multiple links to other defining sources for DATA MINING

Data mining
Data mining commonly involves four classes of tasks:
  • Clustering - is the task of discovering groups and structures in the data that are in some way or another "similar", without using known structures in the data.
  • Classification - is the task of generalizing known structure to apply to new data. For example, an email program might attempt to classify an email as legitimate or spam. Common algorithms include decision tree learning, nearest neighbor, naive Bayesian classification, neural networks and support vector machines.
  • Regression - Attempts to find a function which models the data with the least error.
  • Association rule learning - Searches for relationships between variables. For example a supermarket might gather data on customer purchasing habits. Using association rule learning, the supermarket can determine which products are frequently bought together and use this information for marketing purposes. This is sometimes referred to as market basket analysis.
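As a rough illustration of the market basket idea, the sketch below counts how often pairs of products appear in the same basket. The function name, the basket contents, and the 50% support threshold are all invented for this example; it is a minimal sketch of pair counting, not a full association rule learner.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Return {pair: support} for item pairs bought together often enough.

    Support is the fraction of baskets that contain both items.
    """
    pair_counts = Counter()
    for basket in baskets:
        # Sort and de-duplicate so (a, b) and (b, a) count as one pair.
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    total = len(baskets)
    return {
        pair: count / total
        for pair, count in pair_counts.items()
        if count / total >= min_support
    }


# Hypothetical purchase data, invented for illustration.
baskets = [
    ["bread", "butter", "milk"],
    ["bread", "butter"],
    ["milk", "eggs"],
    ["bread", "butter", "eggs"],
]
pairs = frequent_pairs(baskets, min_support=0.5)
print(pairs)
```

With this sample data only bread and butter clear the threshold, since they appear together in three of the four baskets.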

Database and Data MINING Algorithm Information

Mining the wealth of online genealogy data

Genealogy Datamining resources.

Data Mining


article on data mining


Blog "Mine all data" issues in mining genealogy


Genealogy data mining made simple by Missouri state Genealogy association


Mining Databases on the web 26 page paper.

about Datamining


Data Mining is a threat to privacy


In this paper, I suggest that applying data mining to large data sets and repositories of genealogical information will enhance the ability of historians and genealogical researchers to conduct more efficient and effective research, and will benefit both researchers and the organizations that support genealogical research efforts.


Genealogical research usually begins with a small data set covering a particular family’s relationships, history and origins. In time, the amount of research material discovered and collected regarding the members of a single family can grow substantially in size. For some researchers, the interest in genealogy grows to the point that they develop single-surname repositories and turn the work into a profession (Guild, 2008). With this increasing amount of data and information, new methods of discovery need to be considered, including the application of data mining techniques for seeking out previously undiscovered patterns and clusters of information.

Genealogical data repositories for a single family will quickly grow into the gigabytes when storage factors such as copies of original records, digital photographs, video collections, references, and various electronic documents are considered. Storage needs grow faster still when a researcher has taken on multiple surnames throughout a family, or when information on all the individuals of a single surname is collected. Additional data resources for the genealogist include existing electronic resources, online databases and repositories, OCR-compatible documents, and other digital archives. Going through all of this raw data and finding relevant information becomes increasingly difficult and time-consuming for individual researchers.

Data mining bears many similarities to statistical analysis and information extraction, and as a result it can be very useful in the analysis of large data repositories. Data mining in genealogical repositories can be used to extract information about previously unknown relationships, to determine implicit relationships from large collections of people, and to extract potentially useful information through an automated system (Elmasri, 2007).

Data Mining in Genealogy

In order for data mining to be effective, we must have specific goals for, or applications of, the data discovered. In the case of genealogical data, we hope to identify historic patterns, familial relationships between different persons for which explicit relationships have not been identified, geographic patterns of distribution within family groups, and specific patterns in time periods. These various classes of attributes within the data set will enable the researcher to spend less time manually searching through data sets, and more time verifying the relationships discovered through data mining.

Since genealogical data repositories often have some inherent structure, creating models for discovery is less cumbersome than with less structured data sets, and creating clusters within the data sets is easier. Data modeling is “the act of building a model in one situation where you know the answer and then applying it to another situation that you don’t” (Thearling, 2008). For instance, large genealogical data sets are stored using the GEDCOM (GEnealogical Data COMmunication) standard. The GEDCOM standard was created “to provide a flexible, uniform format for exchanging computerized genealogical data” (LDS, 1996). This structured data model allows for the creation of class labels from the attributes within it, such as birth place, birth date, surname, place and date classes. However, due to the often disparate collections of text documents, database sources, binary files, etc., using clustering techniques in data mining should prove to be a more reliable source of relevant information than hierarchies and decision trees.
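To make the idea of drawing class labels from GEDCOM attributes more concrete, here is a minimal sketch that reads the name, birth date and birth place of each individual from GEDCOM-style lines. It assumes only the basic line layout of level number, optional cross-reference, tag and value; the function name and the sample records are invented for illustration, and a real file would need fuller handling (continuation lines, character sets, other event types).

```python
def parse_gedcom_individuals(text):
    """Collect NAME and BIRT DATE/PLAC values for each INDI record."""
    people = {}
    current = None      # cross-reference id of the INDI record being read
    in_birth = False    # True while inside a level-1 BIRT structure
    for raw in text.splitlines():
        parts = raw.strip().split(" ", 2)
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        level = int(parts[0])
        if level == 0:
            in_birth = False
            # Individual records start with a line like: 0 @I1@ INDI
            if len(parts) == 3 and parts[2] == "INDI":
                current = parts[1]
                people[current] = {}
            else:
                current = None
            continue
        if current is None:
            continue  # line belongs to a record we are not collecting
        tag = parts[1]
        value = parts[2] if len(parts) == 3 else ""
        if level == 1:
            in_birth = (tag == "BIRT")
            if tag == "NAME":
                people[current]["name"] = value
        elif level == 2 and in_birth and tag in ("DATE", "PLAC"):
            key = "birth_date" if tag == "DATE" else "birth_place"
            people[current][key] = value
    return people


# Invented sample records, for illustration only.
sample = """
0 HEAD
1 SOUR EXAMPLE
0 @I1@ INDI
1 NAME John /Franklin/
1 BIRT
2 DATE 12 MAR 1801
2 PLAC Orange County, North Carolina
0 @I2@ INDI
1 NAME Mary /Franklin/
0 TRLR
"""
people = parse_gedcom_individuals(sample)
for xref, facts in people.items():
    print(xref, facts)
```

Each resulting dictionary holds the attributes (surname inside the slashes, birth date, birth place) that could serve as class labels in a later mining step.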

Hierarchies and clustering both present ideal frameworks for data mining genealogical collections. However, hierarchies and decision trees require the creation of specific models in advance of data mining, which is not always applicable to the unstructured data often found in genealogy repositories. Clustering is the processing of data into partitions without having a predefined training class for doing analysis; it places records into groups of similar data and also into groups of dissimilar data (Elmasri, 2007).
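The sketch below shows clustering without a predefined training class on a small scale: surname spellings are grouped purely by how similar their strings are. It uses a simple greedy "leader" scheme with Python's standard difflib module, chosen here for brevity and not taken from Elmasri; the 0.75 threshold and the sample names are assumptions made for the example.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a 0..1 string similarity score, ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cluster_names(names, threshold=0.75):
    """Greedily group names whose similarity to a cluster's first member
    meets the threshold; otherwise start a new cluster."""
    clusters = []
    for name in names:
        for cluster in clusters:
            # Compare against the first member (the cluster's "leader").
            if similarity(name, cluster[0]) >= threshold:
                cluster.append(name)
                break
        else:
            clusters.append([name])
    return clusters


# Invented spelling variants, for illustration only.
surnames = ["Franklin", "Franklyn", "Francklin", "Smith", "Smyth", "Frankland"]
clusters = cluster_names(surnames)
for group in clusters:
    print(group)
```

No labels are supplied in advance: the Franklin variants end up in one group and the Smith variants in another only because of the similarity measure. A researcher would still need to verify whether records grouped this way really belong to the same family.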

Data mining can also be used in the development of a genealogical data repository or warehouse to find meaningful patterns within existing data sets and information collections. By using data mining on small collections of data in the early stages, we can better define those elements that will structure a future data warehouse, where applicable. Data mining can also be used after the creation of a data warehouse to find different rules and patterns, since the data has been cleansed and transformed into the necessary structure for analysis (Betz, 2006).

In genealogical repositories it is common for much of the data to exist in an unstructured format, such as a text paragraph in a PDF file. This free-form text is not part of the typical data mining environment, so it requires data analysts to spend more time imposing some type of structure on the data before and after processing. Domain expertise will facilitate this process, as will the interpretative power of the researcher. Ultimately, discovering relational patterns unknown a priori may both improve extraction accuracy and uncover informative trends in the data (Betz, 2006).

Future Implications

The introduction of data mining into the realm of genealogical research opens new doors for businesses to develop new customer models that assist with research, and also allows individual researchers to leverage advances in storage capacity and computing power. Large organizations that have built up massive data collections, such as The Church of Jesus Christ of Latter-day Saints, will gain the most economically from creating a data mining infrastructure. In this scenario, where user-contributed genealogical repositories reside, such organizations could create web services built upon data mining the vast collections of material they have accumulated, charging a fee for access to and use of the data mining results.

Another area of further research is the contribution of data from social networking sites to the genealogical data repository. By leveraging the vast amount of information being published online in such applications as Facebook, MySpace or LinkedIn, genealogy researchers can mine relationships and social contexts of living members to develop an enhanced understanding of family dynamics.

Genealogical data mining creates a number of privacy concerns for the researcher or organization that wishes to leverage it as part of a business model. While many organizations and researchers use the tools available to them to hide important information about living persons, it is always possible that some aspect of the data has not been sanitized and vetted of personal information. This is an important consideration for organizations that intend to provide public access to genealogical data in general and to data mining services in particular.

A final concern for many genealogists who have labored to collect their data is the issue of ownership and copyright. Many researchers have spent incalculable hours compiling their data repositories and are very reluctant to part with their work. This creates a dilemma for large repositories that are user populated and make ideal candidates for data mining; in these cases, it is best that the researchers doing the data mining and the genealogical researcher conduct due diligence on keeping data clean and the ownership clearly delineated.


Betz, Jonathan, Culotta, Aron, and McCallum, Andrew. Integrating Probabilistic Extraction Models and Data Mining to Discover Relations and Patterns in Text. 2006.

Elmasri, R. and S. B. Navathe (2007). Fundamentals of Database Systems, 5th ed. Addison Wesley.

Guild of One-Name Studies. 2008.

LDS. The GEDCOM Standard Release 5.5. 1996. Family History Department, The Church of Jesus Christ of Latter-day Saints.

Thearling, Kurt. An Introduction to Data Mining.



Alamance County Gen Soc Data Mining 11 Sep 2006 Page 1 of 4

Genealogical Data Mining by Ben Franklin

Definition of Data Mining
The expression “data mining” is now widely used in the Information Systems arena. However, the exact meaning of the expression is still not widely known. It implies “digging through tons of data” to uncover patterns and relationships. Data mining:
•Is a tool that supports research and allows new assertions to be made by disclosing previously undisclosed details in large amounts of data
•Integrates the results of research in database management, statistics and artificial intelligence
•Opens new horizons where the traditional methods are not adequate for efficient data analyses
Data mining in a genealogical context is the technique of rapidly acquiring new information, without stopping to categorize, analyze and make assertions. Data mining is used to gather, collate, and organize information so that it can be more effectively searched.
Throughout this presentation, I will be mentioning the successes that I’ve experienced in pursuing a single-surname database of Franklins. I do not assume that you are interested in this lineage, but I will use it as an example.
“Bad” data mining
Another type of data mining, an illicit one, is used to surreptitiously gather e-mail addresses and other personal information for malicious use by spammers, etc. This is not the sense of the term as used in this presentation.
Acquiring “Data”
In order to leverage the analysis of our data, first we must have something to analyze. Historically, the way we acquired data in our computers was that we would either manually type it into the computer ourselves, copy it from a CD, or find and download it from the Internet. These methods are OK to start. Allow me to suggest a few more ways of acquiring data:
Convert Hardcopies to Data Via OCR
Optical Character Recognition (OCR) software can be a very useful tool for turning mounds of paper into searchable information in our computers. This topic will be discussed further later in this presentation.
Periodically the New England Historical and Genealogical Society (NEHGS) gives access to some of its on-line databases. For instance, during the 2004 Thanksgiving Holiday the NEHGS offered free access to their database. I searched their on-line index for Franklins, and downloaded 1,600 GIF images representing pages of old issues of the NEHGS Journal that pertain to Franklins.

JSTOR (Journal Storage Project) is an ongoing project to develop a digital library in support of the arts and sciences. It initially consisted of about fifteen journal titles in the areas of economics and history and contained approximately 750K journal page images. These journals are fully searchable. Those of you who are members of an affiliating institution, such as students or faculty of a university, can take advantage of this collection of scholarly journals.
I first became interested in this when I found that digital copies of the William and Mary Quarterly can be downloaded there. This is a very useful journal for early Virginia history and genealogy. I was able to download 90 pages that pertain to Franklins in W&M, and from the remainder of JSTOR I found about another 200 pages of various biographies and other journals with articles about Franklins.
Heritage Quest Books
Heritage Quest has a large number (25k+) of books that are available to download on-line. These can be downloaded 50 pages-at-a-time and are in (graphic) PDF format. Thus far I have downloaded about 25K pages of Franklin material from this resource. Heritage Quest can be accessed for free by anyone with a valid Durham County or Orange County Library card. See
BYU’s Digital Archives
BYU’s Family History Collection is growing rapidly. They currently (15 Sep 2005) have 3944 books that are available to download on-line. These can be downloaded one page at a time and are in (graphic) PDF format.
Copy Data from the Census Index
The census index can be copied and saved to your system for later searches. However, unless you are going to use it as a framework for more information or in ways that are not possible on-line, it might be better to just access it from the website.
You can easily reformat this data, for example:
•To use it as a framework for census abstracts
•To create a spreadsheet of census data
•To build a GEDCOM.
To do this, you will need to use your word processing software or text editor (such as TextPad). This is quite difficult to explain in detail, but will be demonstrated during the presentation.
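For readers who prefer a script to a text editor, the reformatting can also be sketched in a few lines of Python. Everything here is an assumption made for illustration: the "Surname, Given | Age | County" layout of the copied index lines, the function names, and the sample entries. The GEDCOM produced is only a skeleton (its header is not complete), meant to show the shape of the output.

```python
import csv
import io


def split_entry(line):
    """Split 'Surname, Given | Age | County' into its four fields."""
    name, age, county = [field.strip() for field in line.split("|")]
    surname, given = [part.strip() for part in name.split(",", 1)]
    return surname, given, age, county


def census_to_csv(lines):
    """Return spreadsheet-ready CSV text, one row per census entry."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["surname", "given", "age", "county"])
    for line in lines:
        writer.writerow(split_entry(line))
    return out.getvalue()


def census_to_gedcom(lines, year):
    """Return a skeletal GEDCOM with one INDI record per census entry."""
    records = ["0 HEAD"]  # a real header needs more sub-records
    for number, line in enumerate(lines, start=1):
        surname, given, age, county = split_entry(line)
        records += [
            f"0 @I{number}@ INDI",
            f"1 NAME {given} /{surname}/",
            "1 CENS",
            f"2 DATE {year}",
            f"2 PLAC {county}",
            f"2 AGE {age}",
        ]
    records.append("0 TRLR")
    return "\n".join(records)


# Invented index entries, for illustration only.
entries = ["Franklin, John | 34 | Orange", "Franklin, Mary | 29 | Durham"]
csv_text = census_to_csv(entries)
gedcom_text = census_to_gedcom(entries, 1850)
print(csv_text)
print(gedcom_text)
```

The CSV text can be saved to a file and opened in a spreadsheet, while the GEDCOM text gives a framework to which further facts can be attached later.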
Acquiring Hardcopies
Identify the books that interest you, using PERSI, the FHLC, references in other researchers’ data, etc. Then you can order the film at the FHC, go to a local library, borrow the material via interlibrary loan, etc. [Getting books and films is a beginner topic...]
However, in a “data mining” mode, you do NOT stop to read the material. Merely copy it for later analysis. Too many people travel thousands of miles to get to the FHL in Salt Lake City, and then spend their time reading books. No. Focus on copying. Thorough analysis requires a lot of time. It is time that you can ill afford at the FHL, unless you live very close to it.
Dealing with Images
This freeware application can be used to view, convert, or print almost any type of graphic image. It is very, very useful. Download it at:
Print it and OCR it
When you have information in digital form, such as GIF images of pages of the NEHGS, for instance, OCR accuracy may improve dramatically if you print a hardcopy of the page and then scan it and OCR it. This is despite the fact that the OCR program can read GIF files directly. The problem is that in order to save disk space and improve download times, the original images are scanned at a very low resolution and the step of printing the image actually improves it.
Popular OCR applications include Scansoft’s Omnipage Pro, and Scansoft’s TextBridge.
More About PDFs
You will find a number of on-line sources are available in Adobe’s Portable Document Format (PDF). This can be viewed and printed using the Adobe Acrobat freeware application that can be downloaded from the Adobe website. There are two basic forms of data in a PDF - graphic and textual. Most of the PDFs that you can download, such as from BYU or Heritage Quest, are in purely graphical form. That means that the text cannot be searched. You will need to OCR this data to convert it to searchable data.
Finding Stuff in Your Data Mine
Windows Explorer Search
For those who use Windows, using the Search capability of Windows Explorer provides a rudimentary search function. This will enable you to find specific text strings within your data mine.
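The same kind of text-string search can be sketched in Python for anyone who wants to script it. This is a minimal sketch under stated assumptions: the function name, the restriction to .txt files, and the example folder name are invented here, and the data mine is assumed to have already been converted to plain text (for example by OCR).

```python
import re
from pathlib import Path


def search_files(root, pattern, suffixes=(".txt",)):
    """Yield (path, line_number, line) for every line matching pattern.

    The pattern is a regular expression, matched without regard to case.
    Only files whose extension is listed in suffixes are read.
    """
    regex = re.compile(pattern, re.IGNORECASE)
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in suffixes:
            text = path.read_text(encoding="utf-8", errors="replace")
            for number, line in enumerate(text.splitlines(), start=1):
                if regex.search(line):
                    yield path, number, line
```

For example, `list(search_files("my_data_mine", r"frankl[iy]n"))` would list every line mentioning Franklin or Franklyn in the text files under a hypothetical folder named my_data_mine.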
The GREP command comes from the UNIX world, where it is used to search within files for data in a flexible way. It is much more powerful and faster than Windows Explorer. There are several versions of this software that can be downloaded from various sites on the Internet. My favorite is wingrep. It may be downloaded from:
Google Desktop
Like the popular web searching site, “Google”, there is an application that you can download to your personal computer called Google Desktop. It is one of the most powerful and flexible search applications for the data on your own computer. It gives convenient access to information on your computer and from the web. It is a desktop search application that provides full text search for your email, computer files, music, photos, chats and web pages that you’ve viewed. By making the data on your computer searchable, Google Desktop puts your information easily within your reach and frees you from having to manually organize your files, emails and bookmarks. It makes searching your computer as easy as searching the web with Google. It can search:
•Email in various client formats, including: Gmail, Outlook, Outlook Express, Netscape Mail, Mozilla Mail and Thunderbird
•Files on your computer, including text, Word, Excel, PowerPoint, PDF, MP3, image, audio, and video files. You can even search your media files by meta-tag: for instance, by artist name and song title, not just the file name.
•Web pages you’ve viewed using Internet Explorer, Netscape, Mozilla and Firefox.
•Chats from AOL 7, AOL Instant Messenger, and MSN Messenger

Book: Advanced Data Mining and Applications: Third International Conference, ADMA, by Reda Alhajj

Book on Data Mining.