Fuzzy-matched Record Linkage and Deduplication

When combining datasets it is often necessary to de-duplicate records or to establish links between records that don’t share a common identifier. In these cases we typically need to compare open-text fields that might exhibit a range of spelling variations or occasionally include optional parts. Our company, for example, could be represented unambiguously in the UK by the company registration number (06357985) that is assigned by Companies House. In real-world datasets, however, it is not uncommon to find it listed as “Infonomics”, “Infonomics Ltd”, “Infonomics Limited”, “Infonomix” etc.

Fuzzy-matching is the process of matching records despite these inconsistencies in how they are identified. We’ve used the Freely Extensible Biomedical Record Linkage (Febrl) toolkit, which provides a set of Python routines and a user interface.

The overall process works as follows:

identify blocks of possible matches, or rather separate each record from obviously impossible matches, so that you don’t need to compare every record with every other one
convert the text strings into standardised forms, for example by using phonetic encodings or by stripping out extraneous characters
compare the standardised fields using metrics like edit distance to quantify several measures of similarity between candidate matches in each block
classify candidate pairs by weighting the measures of similarity to produce an overall score that is effective for matching; this can use supervised machine learning if a ground truth is available, such as a partial set of known matches (typically taken from exact string matches), or otherwise unsupervised learning methods
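
The four steps can be sketched in a few lines of plain Python. This is only an illustration using the standard library, not Febrl's API: the suffix list, the three-character blocking key, the function names and the 0.75 threshold are all assumptions made for the example, and a fixed threshold stands in for the learned classifier described in the last step.

```python
import re
from collections import defaultdict
from itertools import combinations

# Hypothetical list of legal-form suffixes to strip during standardisation.
SUFFIXES = {"ltd", "limited", "plc", "llp"}

def standardise(name):
    """Lower-case, replace punctuation with spaces and drop legal-form suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in SUFFIXES)

def block_key(std_name):
    """Crude blocking key (an assumption): first three characters of the name."""
    return std_name[:3]

def edit_distance(a, b):
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def similarity(a, b):
    """Edit distance rescaled to the range 0..1, where 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest

def link(names, threshold=0.75):
    """Return sorted index pairs whose standardised names score >= threshold."""
    std = [standardise(n) for n in names]
    # Blocking: only records that share a key are ever compared.
    blocks = defaultdict(list)
    for idx, s in enumerate(std):
        blocks[block_key(s)].append(idx)
    matches = []
    for members in blocks.values():
        for i, j in combinations(members, 2):
            # Classification: a fixed threshold, not a trained model.
            if similarity(std[i], std[j]) >= threshold:
                matches.append((i, j))
    return sorted(matches)

# The first four names come from the text above; the last is a made-up non-match.
names = ["Infonomics", "Infonomics Ltd", "Infonomics Limited",
         "Infonomix", "Acme Widgets Ltd"]
pairs = link(names)
```

Here the four spellings of our name fall into one block and are linked to each other, while the made-up name lands in a different block and is never compared. A real pipeline would combine several similarity measures per pair and learn how to weight them rather than rely on one hand-picked cut-off.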

We’ve had good success in using these techniques for linking records about company names or administrative geographies. The real power is that once you have a standardised identifier for a business or place, you can relate those records to the wealth of data that is published against those identifiers (such as company accounts or national statistics).

data-migration data-cleansing fuzzy-matching python machine-learning