Deduplication Services

Duplicate data plagues every database and mailing list. Duplication is inevitable, and it keeps growing. Duplicate records steadily and significantly erode the quality of your data. You can slow the growth, but you cannot stop it completely.

Deduplication (AKA: De-duplication, De-duping)

In databases, duplication means that data mining totals, aggregates, and the key business decisions based on them will be inaccurate and therefore misleading. With mailing lists, the most obvious reason to de-duplicate is to save the money that would otherwise be wasted on printing, labeling, and mailing two or more pieces to the same address. But mailing duplicate pieces has subtler and more damaging effects than simply wasting your money.

Your customers feel disregarded if they get more than one piece per mailing – after all, don’t you? The most charitable conclusion a customer will draw from receiving several copies of the same mailing piece is that you’re not very competent at keeping track of them (…or of your own records, for that matter!).

With donor appeals, the results can be even more disastrous. Besides the normal drawbacks in customer perception, there is also a chance that you will alienate a benefactor, losing their future donations as a result.

Exact Duplicates

Definition: In these records, names and addresses are spelled the same way, including all spaces and punctuation. These are easy to find. One of the most common methods is to build a special field calculated for just this purpose (aka: Matchcode) and match records on it.
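As a rough illustration of the Matchcode idea, the sketch below builds a key from the name and address fields exactly as entered and groups records that share it. The field names (`name`, `address`, `zip`) and the sample records are assumptions made for the example, not a description of any particular tool.

```python
from collections import defaultdict


def matchcode(record):
    """Build an exact-match key from the fields, unchanged (spaces and punctuation kept)."""
    # The "|" separator is an assumption: any character unlikely to occur in the data works.
    return "|".join([record["name"], record["address"], record["zip"]])


def exact_duplicate_groups(records):
    """Return the groups of records that share the same matchcode."""
    groups = defaultdict(list)
    for record in records:
        groups[matchcode(record)].append(record)
    return [group for group in groups.values() if len(group) > 1]


# Hypothetical sample data: records 1 and 2 are identical, record 3 differs in spelling.
records = [
    {"id": 1, "name": "Joe Smith", "address": "12 Main St.", "zip": "31082"},
    {"id": 2, "name": "Joe Smith", "address": "12 Main St.", "zip": "31082"},
    {"id": 3, "name": "Joe Smith", "address": "12 Main Street", "zip": "31082"},
]
dupes = exact_duplicate_groups(records)
```

Record 3 is not caught here because "12 Main Street" differs from "12 Main St."; an exact key only finds records that agree character for character.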

Cause: The consistency in order taking, sales lead collection, and mailing list management needed to produce many exact duplicates almost never exists, so a high count of exact duplicates indicates that the same records have been appended to your table or file more than once. Your data entry is probably not the problem; look to your IT department or Database Administrator (DBA) for the answer.

Near Duplicates

Definition: Here the people and destinations in your list are the same, but the spelling varies or the data contains typos. These are much more difficult to cull from your database; the records don’t match exactly, but they are True Duplicates (see below) nonetheless. Finding them takes fairly sophisticated software (see table below).
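One simple way to surface near duplicates is to normalize each record and score every pair by string similarity. The sketch below uses Python's standard `difflib.SequenceMatcher`; the 85 percent threshold, the field names, and the sample records are illustrative assumptions, and commercial de-duplication software generally uses more elaborate matching than this.

```python
import re
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase, drop punctuation, and collapse runs of whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def percent_match(a, b):
    """Similarity of two strings as a percentage from 0 to 100."""
    return 100.0 * SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def suspects(records, threshold=85.0):
    """Compare every pair of records; return (id_a, id_b, score) at or above threshold."""
    found = []
    for i, a in enumerate(records):
        for b in records[i + 1:]:
            score = percent_match(a["name"] + " " + a["address"],
                                  b["name"] + " " + b["address"])
            if score >= threshold:
                found.append((a["id"], b["id"], round(score, 1)))
    return found


# Hypothetical sample data: records 1 and 2 differ by one letter and a period.
records = [
    {"id": 1, "name": "Joe Smith", "address": "12 Main St."},
    {"id": 2, "name": "Joe Smyth", "address": "12 Main St"},
    {"id": 3, "name": "Ann Lee", "address": "400 Oak Avenue"},
]
pairs = suspects(records)
```

Comparing every pair grows quadratically with the number of records, so a sketch like this suits small lists only; larger files are usually split into blocks (for example, by postal code) before pairs are compared.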

Cause: This happens primarily for two reasons: incomplete or undecipherable data submitted for data entry (e.g., hand-written forms), or bad data entry. (Lists gathered from Web sites where each person enters their own data will cause this and many other problems.) Avoiding this problem is all but impossible; however, training your data entry people, or using professional data entry personnel to enter sales and response data, will help a lot.

True Duplicates

Definition: These are also the same people and addresses, but mail house de-duping software normally fails to identify them because the records look too different to the computer. They can be an exaggerated Near Duplicate, but they are much more difficult to find. Whatever form they take, any set of records that will send more than one piece of mail to the same people at the same location are True Duplicates. Special software on the market can ferret these anomalies out of your database, but every such product requires some skill to use properly for a fully cleaned list. Here at Northwest Database Services, we use specially written de-duplication software that can easily root out the kinds of Near and True Duplicates seen in the table below:

Table: Near and True Duplicates in a Database (examples of near and true duplicates that our de-duplication service can find and remove)

Cause: Near and True Duplicates are difficult to find lurking in your mailing list, yet they are constantly being added during data entry. Clean data, and the assignment of a data management person to review all new records before they are incorporated into your database, will go a long way toward solving this common problem.

What we do

NW Database offers the following de-duplication services:

  • Match records using any field, any part of a field, or a combination of fields. We can choose to include an entire field’s data, or a string of characters within the field or fields selected.
  • We can select up to four hierarchical ranking fields on which to sort the findings within each group of duplicate records. We mark the first record in each dupe-group as a potential save and then mark the others for potential exclusion or deletion by default. Ranking fields tell us which record in each set you would prefer to keep (usually based on your business rules).
  • Select the percentage of match required for records to be identified as likely duplicates (Suspects).
  • We can view the identified duplicate records grouped together and change which record or records are to be removed (aka Casualties) or kept (aka Survivors). The grouped view also makes it easy to drag and drop the best data between records.
  • Merge or append important data between records, typically from casualties to survivors.
  • Produce a Dupe Report, with an optional summary sheet, which displays up to eight client-selected fields, as well as the percent of match, and which records were marked as “Keep” vs. “Delete” (survivor vs. casualty).
  • We can output these choices to other tables:
    • Only unique records (survivors).
    • Only duplicate records (casualties).
    • All records, with duplicates grouped and marked.
    • Only the dupe sets.
    • A table of casualty and survivor IDs for use in queries that would realign child table IDs with the correct parent table records (most often used when working with relational databases).
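The ranking, merging, and ID-table steps in the list above can be sketched roughly as follows. This is a minimal illustration under stated assumptions: the field names, the "latest order date wins" rule, and the helper names `pick_survivor` and `merge_into` are all hypothetical, and real business rules would decide the ranking.

```python
def pick_survivor(group, ranking_fields):
    """Sort a dupe group by the ranking fields; the first record is the survivor."""
    # Assumption: higher values rank first (for example, the most recent order date).
    ranked = sorted(group,
                    key=lambda r: tuple(r[f] for f in ranking_fields),
                    reverse=True)
    return ranked[0], ranked[1:]


def merge_into(survivor, casualties, fields):
    """Fill the survivor's empty fields from the casualties, in rank order."""
    merged = dict(survivor)
    for casualty in casualties:
        for field in fields:
            if not merged.get(field) and casualty.get(field):
                merged[field] = casualty[field]
    return merged


# Hypothetical dupe group; ISO dates sort correctly as plain strings.
group = [
    {"id": 7, "name": "Joe Smith", "phone": "", "last_order": "2019-03-01"},
    {"id": 9, "name": "Joe Smith", "phone": "", "last_order": "2021-06-15"},
    {"id": 4, "name": "J. Smith", "phone": "555-0100", "last_order": "2018-11-20"},
]
survivor, casualties = pick_survivor(group, ["last_order"])
survivor = merge_into(survivor, casualties, ["phone"])

# Casualty ID -> survivor ID, the kind of table used to realign child records.
id_map = {c["id"]: survivor["id"] for c in casualties}
```

Here the record with the latest order becomes the survivor and picks up the phone number that only a casualty carried, while `id_map` records which discarded ID now belongs to which kept ID.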

We can also update child table IDs directly, changing the foreign IDs on child records from the ID of a discarded record to the ID of its retained counterpart. (This way, instead of three versions of Joe Smith’s record with, for example, two donation records each, what remains is one record for him with all six of his donation records attached.)
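The Joe Smith example can be reproduced in a small sketch using Python's built-in `sqlite3` module. The table names (`donors`, `donations`), the column names, and the casualty-to-survivor mapping are assumptions for illustration; a production database would have its own schema and constraints.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE donors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE donations (id INTEGER PRIMARY KEY, donor_id INTEGER, amount REAL);
    INSERT INTO donors VALUES (1, 'Joe Smith'), (2, 'Joe Smyth'), (3, 'J. Smith');
    INSERT INTO donations (donor_id, amount) VALUES
        (1, 25), (1, 50), (2, 10), (2, 20), (3, 5), (3, 15);
""")

# Hypothetical casualty ID -> survivor ID mapping from an earlier de-duplication pass.
id_map = {2: 1, 3: 1}

with conn:  # one transaction: all updates and deletes commit together or not at all
    for casualty_id, survivor_id in id_map.items():
        # Re-point the child records at the survivor before removing the casualty.
        conn.execute("UPDATE donations SET donor_id = ? WHERE donor_id = ?",
                     (survivor_id, casualty_id))
        conn.execute("DELETE FROM donors WHERE id = ?", (casualty_id,))

donor_count = conn.execute("SELECT COUNT(*) FROM donors").fetchone()[0]
donation_count = conn.execute(
    "SELECT COUNT(*) FROM donations WHERE donor_id = 1").fetchone()[0]
```

Updating the child rows before deleting the parent, inside a single transaction, keeps donation records from being orphaned if anything fails partway through.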

We Are Here To Help!

Office

Sandersville, GA 31082

Email

gch [@] nwdatabase.com
To use email, remove the brackets

 

Call Us

(478)412-2156