Description
The National Library of Scotland[1] holds a wealth of digitised archival documents that record information about Scotland’s past, including 700 Post Office (PO) Directories from all over Scotland, dating from the 1700s to the 1940s. However, the Directories would be far more useful if the information in them were structured, making it easier to recognise and search for entities such as people or places. In our project, we are focussing on Edinburgh PO Directories from the early 20th century, with the goal of converting the entries into structured data and then linking entities across directories from different years.
The PO Directories have been digitised using optical character recognition (OCR), but the quality of the OCR output is far from perfect. The first step of the project involves parsing each entry into chunks corresponding to forenames, surnames, occupations and addresses. We are adopting supervised machine learning for this task, since it offers the best prospects of coping with inconsistent formatting and OCR errors. We have a small amount of annotated training data, and will expand it as the project progresses. Machine learning experiments are being run in WEKA[2] and have so far included naive Bayes classifiers, logistic regression and decision trees.
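As a rough illustration of this parsing step, the sketch below labels each token of an entry with a small categorical naive Bayes classifier written in plain Python. It is not the project's WEKA setup: the features (comma-separated field position, presence of a digit, capitalisation), the label names, the helper functions and the example entries are all assumptions made for illustration. The features deliberately ignore exact spelling, so a misread surname such as "Camphell" is handled the same way as a clean one.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(entry):
    """Split an entry into word tokens and standalone commas."""
    return re.findall(r"[^\s,]+|,", entry)


def token_features(tokens, i):
    """Hypothetical features that do not depend on exact spelling."""
    token = tokens[i]
    return {
        # Index of the comma-separated field the token sits in (capped at 3).
        "field": min(tokens[:i].count(","), 3),
        "has_digit": any(ch.isdigit() for ch in token),
        "capitalised": token[:1].isupper(),
    }


class NaiveBayes:
    """Categorical naive Bayes with add-one smoothing."""

    def fit(self, examples):
        self.label_counts = Counter()
        self.value_counts = defaultdict(Counter)  # (label, feature) -> value counts
        self.feature_values = defaultdict(set)    # feature -> values seen in training
        for features, label in examples:
            self.label_counts[label] += 1
            for name, value in features.items():
                self.value_counts[(label, name)][value] += 1
                self.feature_values[name].add(value)
        return self

    def predict(self, features):
        total = sum(self.label_counts.values())
        scores = {}
        for label, n in self.label_counts.items():
            score = math.log(n / total)  # log prior
            for name, value in features.items():
                seen = self.value_counts[(label, name)][value]
                score += math.log((seen + 1) / (n + len(self.feature_values[name])))
            scores[label] = score
        return max(scores, key=scores.get)


def training_examples(annotated):
    """Yield (features, label) pairs for every non-comma token."""
    for entry, labels in annotated:
        tokens = tokenize(entry)
        positions = [i for i, t in enumerate(tokens) if t != ","]
        for i, label in zip(positions, labels):
            yield token_features(tokens, i), label


def parse(entry, model):
    """Return (token, predicted label) pairs, skipping commas."""
    tokens = tokenize(entry)
    return [(t, model.predict(token_features(tokens, i)))
            for i, t in enumerate(tokens) if t != ","]


# Invented entries in an assumed "surname, forename, occupation, address" layout.
FIELDS = ["SURNAME", "FORENAME", "OCCUPATION", "ADDRESS", "ADDRESS", "ADDRESS"]
ANNOTATED = [
    ("Smith, John, grocer, 12 High Street", FIELDS),
    ("Brown, Mary, dressmaker, 4 Leith Walk", FIELDS),
    ("Wilson, James, baker, 27 Rose Street", FIELDS),
]

model = NaiveBayes().fit(training_examples(ANNOTATED))
parsed = parse("Camphell, Robert, tailor, 9 George Street", model)
for token, label in parsed:
    print(f"{label:<11}{token}")
```

With only three tidy training entries the model is effectively learning field position, so this shows the mechanics rather than realistic accuracy. Real entries with missing fields, abbreviations or merged tokens would need richer features and far more annotated data.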
Once we have succeeded in extracting structured information, it will be used to populate a database. To identify people across different years, we will explore approaches to record linkage based on work by Peter Christen[3]. If time allows, the database will be made accessible through a web-based front-end interface.
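To make the record linkage idea concrete, the following minimal sketch pairs records from two directory years. It uses standard blocking (only records sharing a cheap key are compared) followed by an averaged string similarity, which is one generic approach and not necessarily the one the project will adopt. The blocking key, the compared fields, the 0.85 threshold and the sample records are all hypothetical. Addresses are left out of the comparison on the assumption that people may move between years.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def blocking_key(record):
    """Hypothetical key: initial of surname plus initial of forename."""
    return (record["surname"][:1] + record["forename"][:1]).lower()


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_score(rec_a, rec_b, fields=("surname", "forename", "occupation")):
    """Average field similarity between two records."""
    return sum(similarity(rec_a[f], rec_b[f]) for f in fields) / len(fields)


def link(dir_a, dir_b, threshold=0.85):
    """Return (id_a, id_b, score) for candidate pairs scoring at or above threshold."""
    # Index the second directory by blocking key so that only records
    # in the same block are ever compared.
    blocks = defaultdict(list)
    for rec in dir_b:
        blocks[blocking_key(rec)].append(rec)

    matches = []
    for rec_a in dir_a:
        for rec_b in blocks.get(blocking_key(rec_a), []):
            score = record_score(rec_a, rec_b)
            if score >= threshold:
                matches.append((rec_a["id"], rec_b["id"], score))
    return matches


# Invented records; "Smilh" imitates an OCR misreading of "Smith".
dir_1910 = [
    {"id": "1910-1", "surname": "Smith", "forename": "John", "occupation": "grocer"},
    {"id": "1910-2", "surname": "Brown", "forename": "Mary", "occupation": "dressmaker"},
]
dir_1911 = [
    {"id": "1911-1", "surname": "Smilh", "forename": "John", "occupation": "grocer"},
    {"id": "1911-2", "surname": "Stewart", "forename": "James", "occupation": "baker"},
]

matches = link(dir_1910, dir_1911)
for id_a, id_b, score in matches:
    print(id_a, id_b, round(score, 2))
```

Blocking on initials keeps the number of comparisons small, but it would miss a person whose initial was itself misread by the OCR. A more error-tolerant key would trade extra comparisons for better recall.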
One of the main benefits of this project will be to provide historians with an open-source research tool for exploring Scotland’s history. It also serves as an example of what can be done with Open Data and will hopefully encourage more GLAMs (Galleries, Libraries, Archives and Museums) to adopt open licensing for their collections. Although the Post Office Directories are openly licensed, they are not easy to work with in their current form, and an additional goal of the project is to make them more widely accessible as Open Data. The project can also serve as a springboard for similar systems in the future, as developers can learn what is and is not feasible and which pitfalls to expect along the way. Finally, it shows what can be achieved when researchers from different fields collaborate on Open Data.
References:
[1] National Library of Scotland. 2011. 1846-1975 – Post Office Edinburgh and Leith directory. [ONLINE] Available at: http://digital.nls.uk/91168907. [Accessed 25 November 2015]
[2] Hall, Mark; Frank, Eibe; Holmes, Geoffrey; Pfahringer, Bernhard; Reutemann, Peter; Witten, Ian H. (2009). “The WEKA Data Mining Software: An Update”, SIGKDD Explorations, Volume 11, Issue 1.
[3] Christen, Peter (2012). “A Survey of Indexing Techniques for Scalable Record Linkage and Deduplication”, Data Matching.