Using MARC Sort to dedupe a file

The MARC Sort utility orders a file of MARC records by the value of any MARC tag. It can also split the results so that all duplicate records 1) are written to one file and all non-duplicate records are written to another.

<!– Many customers have asked whether this utility can dedupe a file; this tutorial shows you how.

Although there is no one-click step to deduping a file, it is possible to dedupe a file by running MARC Sort twice: once to organize the source records into duplicate records and non-duplicate records, and then a second pass on the duplicate records file to pull out the records that you want to keep (in the deduped results).

Here are the steps for deduping a file with MARC Sort. This tutorial assumes that every record in the source file has an OCLC number in the 001, and that each record also has an 005. 2)

Start MARC Report, select your file, then select 'MARC Sort' from the Utilities menu.

Define the Sort keys. In Tag 1, enter 001. Leave the normalization rule set to the default ('AlphaNumeric'). In Tag 2, enter 005, then on the far right set the Normalization rule to 'Date'.

Leave all of the Sort options set to their defaults.

In the Output options, select 'Pull all dupes/Split'. You can leave the filenames set to their defaults. The setup form for this task should now look like this:

When you are ready, click Run.

This task will output all the records that have duplicate 001s into a file called 'dupes.mrc', earliest record first (based on 005), and all of the records that have unique 001s to a second file called 'nondupes.mrc'.

If there are no dupes in your file, only 'nondupes.mrc' will be created.
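The effect of this first pass can be illustrated with a minimal Python sketch. It is not how MARC Sort is implemented; it assumes each record has already been parsed into a plain dict with '001' and '005' keys, and the function name `split_dupes` and the sample values are hypothetical.

```python
from collections import defaultdict


def split_dupes(records):
    """Split records into (dupes, nondupes) by their 001 value.

    Each record is a plain dict standing in for a parsed MARC record
    (an assumption for illustration, not MARC Sort's internal format).
    """
    groups = defaultdict(list)
    for rec in records:
        groups[rec["001"]].append(rec)

    dupes, nondupes = [], []
    for key in sorted(groups):
        group = groups[key]
        if len(group) > 1:
            # The 005 is a timestamp (yyyymmddhhmmss.f), so plain string
            # order puts the earliest record first.
            dupes.extend(sorted(group, key=lambda r: r["005"]))
        else:
            nondupes.extend(group)
    return dupes, nondupes


# Hypothetical sample data: one 001 value occurs twice.
records = [
    {"001": "ocm00000002", "005": "20200101120000.0"},
    {"001": "ocm00000001", "005": "20210615093000.0"},
    {"001": "ocm00000001", "005": "20190301080000.0"},
    {"001": "ocm00000003", "005": "20180101000000.0"},
]
dupes, nondupes = split_dupes(records)
```

Here `dupes` plays the role of 'dupes.mrc' and `nondupes` the role of 'nondupes.mrc'.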

Here comes the tricky part.

Double-click on the picture to the left, and use the file dialog to select the dupes.mrc file from the first Sort run above.

Tag 1 should still be 001, so there's no need to change it.

Go back to the Runtime options section in the middle of the form, and select 'Keep last dupe'.

Last, but not least, change the results filename from 'dupes.mrc' to something else, say 'dupes.latest.mrc'.

Now press Run.

If you followed these steps, you will now have three files:

1. dupes.mrc: all the OCLC number dupes (we will discard this file)
2. nondupes.mrc: all the records that weren't dupes
3. dupes.latest.mrc: the most recent record from each set of OCLC dupes
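The second pass, which keeps only the last record of each duplicate set, can be sketched the same way. This is an illustration under the same assumptions as before (records as plain dicts, already ordered by 001 and then by 005); `keep_last_dupe` is a hypothetical name, not a MARC Sort function.

```python
def keep_last_dupe(sorted_dupes):
    """Return the last record seen for each 001 value.

    Assumes the input is already ordered by 001 and then by 005, so the
    last record in each run is the most recent one.
    """
    latest = {}
    for rec in sorted_dupes:
        # A later record with the same 001 overwrites the earlier one.
        latest[rec["001"]] = rec
    return list(latest.values())


# Hypothetical contents of the duplicates file, earliest 005 first.
sorted_dupes = [
    {"001": "ocm00000001", "005": "20190301080000.0"},
    {"001": "ocm00000001", "005": "20210615093000.0"},
    {"001": "ocm00000004", "005": "20200101000000.0"},
    {"001": "ocm00000004", "005": "20200102000000.0"},
    {"001": "ocm00000004", "005": "20200103000000.0"},
]
latest = keep_last_dupe(sorted_dupes)
```

After this step, `latest` holds one record per OCLC number, which corresponds to the contents of 'dupes.latest.mrc'.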

The last step is to use the Concatenate Files utility to create a single file.

Quit MARC Sort.

Select 'Concatenate Files' from the Utilities menu.

Click the Source files button. Find and select files #2 and #3 from the list above (nondupes.mrc and dupes.latest.mrc). Click the Result File button. Enter a filename for the new file.

Click Go.
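For readers who script this last step, a rough equivalent is shown below. It relies on the fact that records in a binary MARC file are self-delimiting, so files can be joined byte for byte. It is a sketch only and makes no claim about how the Concatenate Files utility works internally; the function name, the result filename 'deduped.mrc', and the placeholder bytes are all assumptions.

```python
import shutil
import tempfile
from pathlib import Path


def concatenate_files(sources, result):
    """Write the bytes of each source file to result, in the order given."""
    with open(result, "wb") as out:
        for src in sources:
            with open(src, "rb") as f:
                shutil.copyfileobj(f, out)


# Demo in a temporary directory; the placeholder bytes stand in for real
# MARC records, each ending with the record terminator byte 0x1D.
workdir = Path(tempfile.mkdtemp())
(workdir / "nondupes.mrc").write_bytes(b"RECORD-A\x1d")
(workdir / "dupes.latest.mrc").write_bytes(b"RECORD-B\x1d")

concatenate_files(
    [workdir / "nondupes.mrc", workdir / "dupes.latest.mrc"],
    workdir / "deduped.mrc",  # hypothetical result filename
)
combined = (workdir / "deduped.mrc").read_bytes()
```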

That's it. –>

1) By 'duplicate' we do not mean that two records are identical; we mean two or more records whose sort keys evaluate to the same string after the sort options have been applied. For example, if you set TAG to 008 and SUBF to 007, then all records that have the same 008/Date 1 will be considered duplicates :-).
2) If your records aren't set up like this, keep in mind that MARC Sort can sort a file on any MARC tag and then order the dupes on another field, such as the 008/Date 1.
help/sortdedupe.txt · Last modified: 2021/12/29 16:21 (external edit)