MARC Sort and file deduping

Background

The MARC Sort utility orders a file of MARC records by the value of any MARC Tag. It can also split the results into two files: one for non-duplicate records and one for duplicate records. 1)

It has always been possible to use MARC Sort to dedupe a file, but doing so took three separate steps that were something of a puzzle:

  1. Sort the file into dupes and non-dupes
  2. Sort the dupes using a Date in the second key: keep either the earliest or most recent record from each dupe group and discard the others
  3. Concatenate the non-dupes from step #1 with the kept dupes from step #2


New option to dedupe a file

With version 233, the three steps above have been bundled into one mouse-click, thanks to the addition of two special 'Dedupe' options (see the screenshot that follows). We will illustrate this by deduping a file of OCLC records: each record has an OCLC number in the 001, and each record has an 005. 2)

First we define the Sort keys. In Tag 1, enter 001. Leave the normalization rule set to the default ('AlphaNumeric').
In Tag 2, enter 005, then on the far right set the Normalization rule to 'Date'.

Leave all of the Sort options set to their defaults.

In the Output options, select the new option 'Dedupe/Keep last dupe'. Since we are sorting in 'Ascending order', records with later dates will file after records with earlier dates, which means that the most recent copy of any dupe will be the one that is retained.
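The interaction between ascending order and 'Keep last' can be sketched in a few lines of Python. This is only an illustration, not MARC Sort's own code: the records are hypothetical (001, 005) pairs, `keep_one` is an invented helper, and the `keep="first"` branch is an assumption about how the other dedupe option would behave. It relies on the fact that an 005 is a fixed-width timestamp (yyyymmddhhmmss.f), so plain string order is also chronological order.

```python
# Hypothetical records: (001 control number, 005 timestamp) pairs.
records = [
    ("ocm00000001", "20200115093000.0"),
    ("ocm00000001", "20211229162100.0"),
    ("ocm00000001", "20190704120000.0"),
]


def keep_one(group, keep="last"):
    """Sort a dupe group ascending by 005 and keep the first or last record."""
    ordered = sorted(group, key=lambda rec: rec[1])
    return ordered[-1] if keep == "last" else ordered[0]


latest = keep_one(records, keep="last")     # most recent 005 in the group
earliest = keep_one(records, keep="first")  # oldest 005 in the group
print(latest)
print(earliest)
```

With an ascending sort, keeping the last record of a group is the same as keeping the one with the most recent 005.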

Leave the filenames set to their defaults. The setup form for this task should now look like this:

When you are ready, click Run. The program will now dedupe the file using the options specified.

Results

When a dedupe option is chosen, the utility actually makes two separate runs. First, all of the records with the same 001 will be output to one file ('dupes'), and all of the records with unique 001s will be output to another file ('non-dupes').

In a dedupe run, the second sort key is not used during the first sort.

Second, the program processes the dupes from the first run and organizes them first by 001, then by 005. This is where the 005 kicks in: if three records have the same 001, they are ordered according to the value of the 005 (or whatever tag was used for the second key). We chose 'Keep last' above, so for each group of records with the same 001, the program selects the last record in the group for output to the results file ('dupes2').

Finally, the program will concatenate the file of unique records from the first pass ('non-dupes') with the file of kept dupes from the second pass ('dupes2') to create a file of deduped records.
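The two passes and the final concatenation described above can be modeled roughly as follows. This is a minimal sketch under stated assumptions, not the utility's implementation: records are plain dicts with hypothetical '001' and '005' keys rather than real MARC records, and the function name `dedupe` and its return values are invented for the example.

```python
from collections import defaultdict


def dedupe(records, keep="last"):
    """Two-pass dedupe sketch; each record is a dict with '001' and '005' keys."""
    # Pass 1: group on the first sort key only (the second key is not used yet).
    groups = defaultdict(list)
    for rec in records:
        groups[rec["001"]].append(rec)
    non_dupes = [g[0] for g in groups.values() if len(g) == 1]
    dupes = [rec for g in groups.values() if len(g) > 1 for rec in g]

    # Pass 2: order the dupes by 001, then 005, and keep one record per group.
    dupes.sort(key=lambda rec: (rec["001"], rec["005"]))
    kept = {}
    for rec in dupes:
        if keep == "last" or rec["001"] not in kept:
            kept[rec["001"]] = rec
    dupes2 = list(kept.values())

    # Final step: concatenate the unique records with the kept dupes.
    return non_dupes + dupes2, len(dupes), len(non_dupes), len(dupes2)


sample = [
    {"001": "a", "005": "20200101000000.0"},
    {"001": "b", "005": "20200101000000.0"},
    {"001": "a", "005": "20210101000000.0"},
    {"001": "c", "005": "20190101000000.0"},
    {"001": "c", "005": "20180101000000.0"},
    {"001": "c", "005": "20220101000000.0"},
]
result, n_dupes, n_unique, n_kept = dedupe(sample)
print(n_dupes, n_unique, n_kept, len(result))
```

Because the last step is a plain concatenation, the deduped output in this sketch is not re-sorted as a whole: the unique records come first, followed by the kept dupes.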

On completion, a summary of results will pop up:

This tells us that there were 2316 records in the file we started with: 2009 of them shared an 001 with at least one other record, and 307 had unique 001s. Going through the 2009 records with duplicate 001s, the program kept 997 of them. 3) Finally, the program concatenated the 307 records that were not duplicates with the 997 dupes that were kept, creating a result file of 1304 records.
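The figures in the summary can be cross-checked with a little arithmetic. The variable names below are invented for the example; the numbers are the ones reported above.

```python
total = 2316      # records in the source file
dupes = 2009      # records whose 001 also occurs in another record
non_dupes = 307   # records with a unique 001
kept = 997        # dupes retained after the second pass

# The first pass splits the source file into dupes and non-dupes.
assert dupes + non_dupes == total

# The result file is the non-dupes plus the kept dupes.
result = non_dupes + kept
discarded = dupes - kept
print(result, discarded)

# Every source record is either in the result file or discarded.
assert result + discarded == total
```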

This final file of 1304 records thus represents a deduped version of the original source file.

Note that if you use one of the dedupe options on a file that has no dupes, the program will pop up a message after the first pass saying that 'No dupes were found'.

1)
By 'duplicate' we do not mean that two records are identical, but that two (or more) records have sort keys that evaluated to the same string (after the sort options were applied). For example, if you sort a file on 008/07, then all records with the same 008/Date 1 would be considered duplicates.
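A small sketch may make this definition concrete. The `normalize` function below is a hypothetical stand-in, not MARC Sort's actual 'AlphaNumeric' rule, and the sample values are invented. The point is only that values which differ on the page can still evaluate to the same key.

```python
import re
from collections import defaultdict


def normalize(value):
    # Hypothetical rule: drop everything except letters and digits, then uppercase.
    return re.sub(r"[^0-9A-Za-z]", "", value).upper()


values = ["ocm 12345", "OCM12345", "ocm-12345", "ocm67890"]

# Group the raw values by the string their sort key evaluates to.
groups = defaultdict(list)
for v in values:
    groups[normalize(v)].append(v)

# A 'duplicate' group is any key shared by two or more records.
duplicate_groups = {k: v for k, v in groups.items() if len(v) > 1}
print(duplicate_groups)
```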
2)
If your records aren't set up like this, keep in mind that MARC Sort can sort a file on any MARC Tag and then order the dupes chronologically on another field, such as the 008/Date 1.
3)
The fact that 2009 dupes were found, and 997 were kept, tells us that some of the groups of duplicates contained more than two records.
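The reasoning behind this inference can be written out as a short calculation. It assumes, as described above, that exactly one record is kept per dupe group, so the number of kept records equals the number of groups. The variable names are invented for the example.

```python
dupes = 2009
kept = 997  # one record kept per dupe group, so this is also the group count

# If every group were exactly a pair, there would be 2 * kept dupes.
if_all_pairs = 2 * kept

# Records beyond the first two of their group.
extra = dupes - if_all_pairs
print(if_all_pairs, extra)
```

Since `extra` is greater than zero, at least one group (and at most `extra` groups) must contain three or more records.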
233/marcsort_dedupe.txt · Last modified: 2021/12/29 16:21 (external edit)
CC Attribution-Share Alike 4.0 International