— |
233:marcsort_dedupe [2021/12/29 16:21] (current) |
||
---|---|---|---|
Line 1: | Line 1: | ||
+ | **MARC Sort and file deduping** | ||
+ | |||
+ | __Background__ | ||
+ | |||
+ | The purpose of the MARC Sort utility is to order a file of MARC records by the value of any MARC Tag. In addition to this functionality, | ||
+ | |||
+ | It has always been possible to use MARC Sort to dedupe a file, but the steps to do so were something of a puzzle: | ||
+ | - Sort the file into dupes and non-dupes | ||
+ | - Sort the dupes using a date as the second key, then keep either the earliest or the most recent record from each dupe group and discard the others | ||
+ | - Concatenate the non-dupes from step #1 with the kept dupes from step #2 | ||
+ | |||
+ | \\ | ||
+ | |||
+ | __New option to dedupe a file__ | ||
+ | |||
+ | With version 233 the above three steps have been bundled into one mouse-click, | ||
+ | |||
+ | First we define the Sort keys. In Tag 1, enter 001. Leave the normalization rule set to the default (' | ||
+ | In Tag 2, enter 005, then on the far right set the Normalization rule to ' | ||
+ | |||
+ | Leave all of the Sort options set to their defaults. | ||
+ | |||
+ | In the Output options, select the new option ' | ||
+ | |||
+ | Leave the filenames set to their defaults. The setup form for this task should now look like this: | ||
+ | |||
+ | {{: | ||
+ | |||
+ | When you are ready, click Run. The program will now dedupe the file using the options specified. | ||
+ | |||
+ | __Results__ | ||
+ | |||
+ | When a dedupe option is chosen, the utility actually makes two separate runs. First, all of the records with the same 001 will be output to one file (' | ||
+ | |||
+ | __In a dedupe run, the second sort key is not used during the first sort.__ | ||
+ | |||
+ | Second, the program processes the dupes from the first run and organizes them first by 001, then by 005. We said 'Keep Last' above, so for each group of records with the same 001, the program will select the last record in each group for output to the results file (' | ||
+ | |||
+ | Finally, the program will concatenate the file of unique records from the first pass (' | ||
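The three passes described above can be sketched in a few lines of Python. This is only an illustration of the logic, not the utility's actual code: the record model (a dict holding the 001 and 005 values as strings), the function name `dedupe`, the `keep` parameter and the order of the output are all assumptions made for the example, and the 005 values are compared as plain strings rather than through any normalization rule.

```python
from collections import defaultdict


def dedupe(records, keep="last"):
    """Dedupe records on 001, using 005 to choose which dupe to keep.

    `records` is a list of dicts with string values under the keys
    "001" and "005" (a hypothetical stand-in for real MARC records).
    """
    if keep not in ("first", "last"):
        raise ValueError("keep must be 'first' or 'last'")

    # Pass 1: group on the first sort key only (001); 005 is ignored here.
    groups = defaultdict(list)
    for rec in records:
        groups[rec["001"]].append(rec)
    uniques = [g[0] for g in groups.values() if len(g) == 1]
    dupes = [rec for g in groups.values() if len(g) > 1 for rec in g]

    # Pass 2: order the dupes by 001, then 005, and keep one per group.
    dupes.sort(key=lambda r: (r["001"], r["005"]))
    kept = {}
    for rec in dupes:
        # 'last': later records overwrite earlier ones.
        # 'first': only the first record seen for an 001 is stored.
        if keep == "last" or rec["001"] not in kept:
            kept[rec["001"]] = rec

    # Pass 3: concatenate the unique records with the kept dupes.
    result = uniques + list(kept.values())
    summary = {
        "input": len(records),
        "dupes": len(dupes),
        "uniques": len(uniques),
        "kept": len(kept),
        "output": len(result),
    }
    return result, summary
```

Calling `dedupe(records, keep="first")` would model keeping the earliest record from each group instead of the most recent one.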
+ | |||
+ | On completion, a summary of results will pop up: | ||
+ | |||
+ | {{: | ||
+ | |||
+ | This tells us that the file we started with contained 2316 records: 2009 of them shared an 001 with at least one other record, and 307 had a unique 001. Going through the 2009 records with duplicate 001s, the program kept 997 of them((the fact that 2009 dupes were found, and 997 were kept, tells us that some of the groups of duplicates contained more than two records)). Finally, the program concatenated the 307 records that were not duplicates with the 997 dupes that were kept, creating a result file of 1304 records. | ||
+ | |||
+ | This final file of 1304 records thus represents a deduped version of the original source file. | ||
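The counts in the summary can be cross-checked with a little arithmetic. The short Python sketch below uses only the figures quoted above; the function name `check_summary` and its field names are made up for the illustration.

```python
def check_summary(total, dupes, kept):
    """Derive the remaining counts from the three reported figures."""
    uniques = total - dupes          # records whose 001 occurs only once
    output = uniques + kept          # size of the final deduped file
    discarded = dupes - kept         # dupes dropped in the second pass
    avg_group_size = dupes / kept    # one record is kept per dupe group
    return {
        "uniques": uniques,
        "output": output,
        "discarded": discarded,
        "avg_group_size": avg_group_size,
    }


summary = check_summary(total=2316, dupes=2009, kept=997)
print(summary["uniques"])    # 307
print(summary["output"])     # 1304
print(summary["discarded"])  # 1012
```

Because one record is kept per group, 997 kept records means 997 dupe groups. The average group size (2009 / 997) comes out slightly above 2, which is why at least some groups must have contained more than two records.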
+ | |||
+ | Note that if you use one of the dedupe options on a file that has no dupes, the program will pop up a message after the first pass saying that 'No dupes were found' | ||