This shows you the differences between two versions of the page.

plp:crosschecks:x260b [2010/05/26 15:48]
plp:crosschecks:x260b [2013/04/27 09:09] (current)
Line 1: Line 1:
 +====== X260B Crosscheck ======
 +The purpose of this crosscheck is to prevent records representing different productions, publications, etc. of an item from matching. 
 +X260B often fails because of variations in the form of the publisher name given in 260 subfield $b. For this reason, X260B is a good target for PLP's [[plp:synonyms|synonym rules]]. PLP also performs an unusual amount of processing for this crosscheck in an effort to reduce the number of XCFails it generates.
 +===== Pre-processing =====
 +===== Data extraction =====
 +  * Publisher names are extracted from MARC 260 subfield $b; all occurrences of subfield $b are used
 +  * If the extracted data contains a semicolon, the string is broken at each ';', creating additional publisher strings
 +  * It is typical (after normalization; see the section that follows) for each record to generate several publisher strings
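The extraction step above can be sketched as follows. This is a minimal illustration, not PLP's actual code: the function name `extract_publisher_strings` is hypothetical, the input is assumed to be a list of the record's 260 $b values already pulled out as plain strings, and trimming whitespace and dropping empty pieces are assumptions the page does not state.

```python
def extract_publisher_strings(subfield_b_values):
    """Split every 260 $b value at semicolons into separate publisher strings."""
    publishers = []
    for value in subfield_b_values:          # all occurrences of $b are used
        for piece in value.split(";"):       # break the string at each ';'
            piece = piece.strip()            # assumption: surrounding blanks are trimmed
            if piece:                        # assumption: empty pieces are dropped
                publishers.append(piece)
    return publishers
```

For example, a record with two $b occurrences, one of which contains a semicolon, would yield three publisher strings before normalization.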
 +===== Normalization =====
 +  * Any data enclosed within square brackets is deleted
 +  * [[plp:normalization|standard normalization]] is applied
 +  * The following common strings are deleted: [[x260b stopwords|X260B Stopwords]]
 +  * If the resulting string is shorter than 3 characters or longer than 64, processing stops and an empty publisher is returned
 +  * The following common phrases are then deleted: ' & COMPANY INC', ' & CO INC', ' & COMPANY', ' AND COMPANY INC', ' AND CO INC', ' AND COMPANY', ' AND CO', ' INC', ' COMPANY', ' CO', ' & CO'
 +  * If the result contains any blank spaces, an attempt is made to extract the most meaningful term from the string and add the result as a separate 'publisher'. For example:
 +      if we have:       this step will create an additional item for:      
 +      A KNOPF           KNOPF
 +      J P GETTY         GETTY
 +      JOHN P GETTY      GETTY
 +During this special search for 'most meaningful' terms, the following processing occurs:
 +  * the words 'AND', 'FOR', 'OF' are deleted (the three most common words in the 260 $b)
 +  * of the remaining words, only the first three are retained
 +  * a second [[x260b_stopwords2|list of stopwords]] is applied to the third word
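The normalization steps and the search for a 'most meaningful' term might look roughly like the sketch below. Several points are assumptions, because the page does not spell them out: uppercasing and collapsing whitespace stand in for the linked standard normalization; both stopword lists are passed in as parameters, since their contents live on other pages; the company phrases are removed only at the end of the string, longest phrase first; and the last surviving word is taken as the 'most meaningful' term, which is consistent with the three examples above but not confirmed by them. The function names are hypothetical.

```python
import re

# Phrases listed on the page; trying the longest first is an assumption.
COMPANY_PHRASES = sorted(
    [" & COMPANY INC", " & CO INC", " & COMPANY", " AND COMPANY INC",
     " AND CO INC", " AND COMPANY", " AND CO", " INC", " COMPANY", " CO",
     " & CO"],
    key=len, reverse=True,
)
COMMON_WORDS = {"AND", "FOR", "OF"}


def normalize_publisher(raw, stopwords=()):
    """Return the normalized publisher string, or '' if it is rejected."""
    text = re.sub(r"\[[^\]]*\]", "", raw)      # delete data in square brackets
    text = " ".join(text.upper().split())      # stand-in for standard normalization
    # Assumption: stopwords are removed as whole words.
    text = " ".join(w for w in text.split() if w not in stopwords)
    if len(text) < 3 or len(text) > 64:        # too short or too long: empty publisher
        return ""
    for phrase in COMPANY_PHRASES:             # assumption: trailing match only
        if text.endswith(phrase):
            text = text[: -len(phrase)]
            break
    return text


def meaningful_term(normalized, stopwords2=()):
    """Return an extra one-word publisher, or None if nothing can be extracted."""
    if " " not in normalized:
        return None
    # Drop AND/FOR/OF, then keep only the first three remaining words.
    words = [w for w in normalized.split() if w not in COMMON_WORDS][:3]
    # The second stopword list applies to the third word only.
    if len(words) == 3 and words[2] in stopwords2:
        words = words[:2]
    # Assumption: the last surviving word is the 'most meaningful' one.
    return words[-1] if words else None
```

Under these assumptions, `normalize_publisher("A Knopf Inc [New York]")` gives `A KNOPF`, and `meaningful_term("A KNOPF")` adds `KNOPF` as a separate publisher.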
 +===== Processing rules =====
 +  * If the publisher string for either record is blank, the crosscheck passes
 +  * If any of the publisher strings from one record match any publisher string in the other, the crosscheck passes
 +  * Otherwise, the crosscheck fails
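The three rules above reduce to a set comparison. The sketch below assumes each record has already been turned into a list of normalized publisher strings, treats a record as 'blank' when it has no non-empty string, and compares strings by exact equality (synonym rules are not modeled). The function name `x260b_passes` is hypothetical.

```python
def x260b_passes(publishers_a, publishers_b):
    """Return True if the X260B crosscheck passes for two records."""
    a = {p for p in publishers_a if p}   # ignore empty publisher strings
    b = {p for p in publishers_b if p}
    if not a or not b:                   # either record is blank: pass
        return True
    return not a.isdisjoint(b)           # any shared publisher string: pass
```

Because every record usually contributes several strings, including any extracted 'most meaningful' terms, a single shared string such as `KNOPF` is enough for the crosscheck to pass.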
CC Attribution-Noncommercial-Share Alike 3.0 Unported