List search: Matching items from a list in MARC Review

MARC Review can match MARC data (specified on the pattern form) against a list of items stored in a text file. For example, you could search your database for a list of ISBNs, or you could validate the contents of a subject heading field against a vocabulary or thesaurus.

For a 'List search', a text file meeting the following criteria is needed:

  • Each item, or string, in the list must be on a separate line.
  • The list must not contain any null (empty) lines, as a null line represents the end of the list to the program.
  • In general, the list data should be entered in the same form as it appears in your MARC data 1).
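
The criteria above can be checked before the file is handed to MARC Review. The following is a minimal Python sketch and is not part of MARC Review; the function name find_empty_lines and the sample ISBNs are hypothetical. It flags whitespace-only lines as well as truly empty ones, which is a precaution on our part rather than documented program behavior.

```python
def find_empty_lines(text):
    """Return the 1-based numbers of empty lines in a list file's text.

    Hypothetical helper: whitespace-only lines are also reported, as a
    precaution (whether the program treats them as empty is an assumption).
    """
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if line.strip() == ""
    ]


# Illustrative list with an empty third line.
sample = "9780131103627\n9780262033848\n\n9780201633610\n"
print(find_empty_lines(sample))  # [3]
```

An empty result means the list contains no null lines, so every item should be read.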


It's also important to make sure that the file containing your list of items is in a folder that is not likely to be moved. MARC Review saves the name of the file (and not the content of the list), so if the file is moved, the saved review becomes invalid.

Once you have a list of items to search, the next step is to start MARC Review, press Next, and go to the pattern form.

Specify the pattern options as you normally would, keeping in mind that:
1) The 'Data' box is reserved for the filename containing the list, and
2) All other options on this form will be applied to every item in the list.

Do not type into the Data box.

Instead, activate the special 'List search' menu by right-clicking on the Data box.

Two types of list matching are supported. This article helps you decide what 'kind' of list matching you want to perform:

  • simple string matching, or
  • value list matching.

SIMPLE STRING MATCH

Simple string matching uses the default MARC Review pattern match behavior. When applied to a list, this type of matching checks whether any string in the list is present in the MARC field specified on the pattern form.

The following examples describe the default MARC Review pattern match behavior; remember, this behavior will be applied to each item in your list.

For example, if the pattern form specifies

TAG=650
SUBF=a
DATA=librar
CASE=False

then the program will find all records that contain a 650 $a with the term 'librar' in it:

$aDigital libraries
$aFriends of the library
$aInternational librarianship
$aLibrarians
$aLibraries and people with disabilities
$aLibrary catalogs
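
The behavior described above amounts to a case-insensitive substring test applied once per list item. Here is a minimal Python sketch of that idea; it is an illustration, not MARC Review's code, and the name simple_match is hypothetical.

```python
def simple_match(field_data, terms, case_sensitive=False):
    """Return True if any term occurs anywhere inside field_data."""
    if not case_sensitive:
        field_data = field_data.lower()
        terms = [term.lower() for term in terms]
    return any(term in field_data for term in terms)


# Sample 650 $a contents; 'Cataloging' is added as a non-matching heading.
headings = [
    "Digital libraries",
    "Friends of the library",
    "International librarianship",
    "Librarians",
    "Libraries and people with disabilities",
    "Library catalogs",
    "Cataloging",
]

hits = [h for h in headings if simple_match(h, ["librar"])]
print(len(hits))  # 6
```

With a list file, terms would hold every line of the file, and a field matches as soon as one of them is found.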

We can anchor a match to the beginning or end of the MARC field by using regular expressions. Continuing with the 650 $a example, if we change DATA to:

DATA=^librar
CASE=False
RegEx=True

it then matches only subfields beginning with the string:

$aLibrarians
$aLibraries and people with disabilities
$aLibrary catalogs
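
The same left-anchored match can be tried outside the program. This sketch uses Python's re module as a stand-in; MARC Review's regular expression dialect may differ in details.

```python
import re

headings = [
    "Digital libraries",
    "Friends of the library",
    "International librarianship",
    "Librarians",
    "Libraries and people with disabilities",
    "Library catalogs",
]

# '^' anchors the match to the start of the subfield data;
# re.IGNORECASE plays the role of CASE=False.
pattern = re.compile(r"^librar", re.IGNORECASE)
anchored_hits = [h for h in headings if pattern.search(h)]
print(anchored_hits)
```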

Matching to the end of a field is a bit harder, because a pattern like:

DATA=librar.*$
CASE=False
RegEx=True

will still match a heading like:

$aLibraries and people with disabilities

because of the 'greediness' of the regular expression support in the program: '.*' consumes everything between 'librar' and the end of the field. So what we would have to do is something like this:

DATA=libraries$||library$||librarians$||librarianship$
CASE=True
RegEx=True

turning on case-sensitivity so that we do not match a heading that begins with the term, such as '$aLibraries' (additionally, we could add a blank space in front of our search terms).
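
Both end-of-field patterns can be compared in a short Python sketch. It assumes that '||' in the Data box acts as an OR between alternatives, which is how the example above reads; Python's re module writes that as a single '|'. Python is only a stand-in here, not the program's own engine.

```python
import re

headings = [
    "Digital libraries",
    "Friends of the library",
    "International librarianship",
    "Librarians",
    "Libraries and people with disabilities",
    "Library catalogs",
]

# 'librar.*$' is satisfied by anything that follows 'librar',
# so it does not restrict the match to the end of the field.
loose = re.compile(r"librar.*$", re.IGNORECASE)
loose_hits = [h for h in headings if loose.search(h)]

# Spelling out each complete word before '$' does. No IGNORECASE flag,
# which corresponds to CASE=True.
strict = re.compile(r"libraries$|library$|librarians$|librarianship$")
strict_hits = [h for h in headings if strict.search(h)]

print(len(loose_hits), len(strict_hits))  # 6 3
```

Because the strict pattern is case-sensitive, the one-word heading 'Librarians' is skipped as well, since it begins with a capital letter.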

Note: The maximum length of a 'simple string list' is 5000 items. The reason for this limit is that this type of search is very inefficient: in effect, the program creates one pattern for each item in the list.

Example: Using a simple string list to find errors reported by MARC Analysis


VALUE LIST MATCH

Value list string matching was added to the program in version 236. The purpose of a value list is to support controlled vocabularies. Examples of value lists are everything from the MARC Code List for Languages to the Library of Congress Subject Headings.

Ideally, MARC fields that are to contain data from a controlled vocabulary should be entered using dropdown menus that contain all available values. For example, in MARC Report we may click on the 008 element for 'Language' and press <F1> to select a code from a list of valid codes.

However, this type of data entry may not be feasible for a large list of subject headings, which may contain many thousands (or hundreds of thousands) of items. Value list searching makes it possible to validate MARC data against a controlled vocabulary 'after the fact', so to speak.

Value list matching in MARC Review is implemented somewhat differently than “simple string” matching. Whereas above, we asked the question: 'is the specified data (“librar”) present in the field we are searching (“650$a”)?', value list searching asks the question: 'is the (content of the) field I am searching present in the value list I have specified?'.

Value list matching is always left-anchored in MARC Review. We assume, by definition, that a subject heading of:

650 $aLibraries.

should never match a value list item like:

Technical services (Libraries)

Thus, we do not match strings within strings when validating a term in a MARC record against a value list. This has a benefit in that we can programmatically support very large lists when the search term is left-anchored.

Also, in value list matching there is no 'Regular expression' option, as matching is always left-anchored. Instead, this option is replaced on the pattern form with two new options: 'Partial match' and 'Normalize data'.

If the 'Partial match' option is not selected (the default), then the MARC field:

650 $aLibraries

will match only one item from the LCSH value list:

Libraries

Thus, the default action in a value list search is to require an exact match between the MARC field and an item in the list.

If, on the other hand, 'Partial match' is selected, then the MARC field:

650 $aLibraries

will match all of the following items from an LCSH value list:

Libraries
Libraries (Rooms)
Libraries and adult education
Libraries and booksellers
Libraries and colleges
Libraries and community
Libraries and distance education
Libraries and education
...
Libraries, Medical

In this case, the effect is similar to a 'browse heading' search (as the matching is left-anchored). One benefit of this option (in MARC Review) is that it will allow shorter value lists, if needed.
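
One way to see why left-anchoring makes very large lists practical is a binary search over a sorted list. This is a sketch under our own assumptions, not a description of how MARC Review is implemented; the name value_list_match is hypothetical, and case-sensitivity and normalization are left out.

```python
from bisect import bisect_left


def value_list_match(field_data, sorted_values, partial=False):
    """Look up field_data in a sorted value list.

    Exact mode: the field must equal a list item.
    Partial mode: the field must be the beginning of some list item.
    """
    # First list item that sorts at or after field_data.
    index = bisect_left(sorted_values, field_data)
    if index == len(sorted_values):
        return False
    if partial:
        return sorted_values[index].startswith(field_data)
    return sorted_values[index] == field_data


values = sorted([
    "Libraries",
    "Libraries (Rooms)",
    "Libraries and education",
    "Libraries, Medical",
    "Technical services (Libraries)",
])

print(value_list_match("Libraries and", values))                # False
print(value_list_match("Libraries and", values, partial=True))  # True
```

Each lookup inspects only a handful of items however long the list is, which would not be possible if strings had to be found inside other strings.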

The 'Normalize data' option (selected by default) is important in a list search. The purpose of this option is to normalize both the list items and the data selected from the MARC records, in the same way that search terms are normalized when we search an OPAC 2).

For example, if 'Normalize data' is not selected and a MARC record contains:

650 $aLibraries.

the MARC record will fail the pattern match (against the LCSH example listed above) because of the ending period. For complete details on the normalization used in a value list search, please click here.
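
The effect of normalization can be sketched in Python. The rules below (punctuation replaced by a space, runs of whitespace collapsed, case left untouched) are an assumption made for illustration only; the normalization actually used by MARC Review is described in its own help page and may differ.

```python
import re


def normalize(text):
    """Assumed normalization: drop punctuation and collapse whitespace."""
    without_punctuation = re.sub(r"[^\w\s]", " ", text)
    return " ".join(without_punctuation.split())


value_list = ["Libraries", "Libraries (Rooms)", "Libraries, Medical"]
field_data = "Libraries."

# Without normalization the ending period prevents an exact match.
raw_match = field_data in value_list

# With normalization applied to both sides, the match succeeds.
normalized_values = {normalize(value) for value in value_list}
normalized_match = normalize(field_data) in normalized_values

print(raw_match, normalized_match)  # False True
```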

Note: the Case sensitive option is also supported in value list matching.

Note: Value lists should not contain subfield delimiters.

For example, do not enter:

$aLibraries and education

in a value list.


Note: When matching against a list in MARC Review, only two 'Rules' are applicable, but their sense is slightly different, depending on the type of list:

  • 'And' -- In a 'simple' search, returns True if any of the items in the list are present in the MARC field.
  • 'And' -- In a 'value list' search, returns True if the MARC field is present in the value list.
  • 'Not' -- In a 'simple' search, returns True if none of the items in the list are present in the MARC field.
  • 'Not' -- In a 'value list' search, returns True if the MARC field is not present in the value list.
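
The four cases above can be written as one small function. This is a hedged Python sketch; the name list_rule and its arguments are hypothetical, and the matching is reduced to plain containment (exact, case-sensitive, no normalization).

```python
def list_rule(rule, kind, field_data, items):
    """Evaluate an 'And' or 'Not' rule for a 'simple' or 'value list' search."""
    if kind == "simple":
        # Is any list item present inside the MARC field?
        found = any(item in field_data for item in items)
    elif kind == "value list":
        # Is the MARC field itself present in the list?
        found = field_data in items
    else:
        raise ValueError(f"unknown list type: {kind!r}")

    if rule == "And":
        return found
    if rule == "Not":
        return not found
    raise ValueError(f"unknown rule: {rule!r}")


items = ["Libraries", "Library catalogs"]
field = "Libraries and education"

print(list_rule("And", "simple", field, items))      # True
print(list_rule("And", "value list", field, items))  # False
```

A 'Not' rule with a value list is the natural choice for validation, since it returns the fields that are missing from the vocabulary.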


Summary

To summarize, the simple string list method tries to match the items in your list (one at a time) against the data in the Tag/Subfield specified on the pattern form. The matching can be applied with or without case-sensitivity. It stops at the first hit.

A value list search works the other way around: it tries to match the data in the Tag/Subfield specified on the pattern form against the items in your list. The matching can be applied with options for case-sensitivity, normalization, and partial match. It also stops at the first hit.

These methods seem very similar.

Yet running each method on the same file can yield very different results. To demonstrate this, we ran a list search in each mode using the RDA list of relationship designators as our list, and the 700 $e as the MARC pattern. The file used contained 240721 DLC records coded as 'rda'.

Using the 'simple string list' option,

69792 tag(s) in 45558 record(s) matched the pattern:
AND 700$e=D:\MARC\rda appendix I elements.txt [Case=Y] 

Using the 'value list' option:

51056 tag(s) in 35350 record(s) matched the pattern: 
AND 700$e=Item In List (D:\MARC\rda appendix I elements.txt)  [Case=Y] 

Same MARC file, same text list, two different results.

One example should illustrate the difference.

In 2014, RDA removed the relationship designator for

editor of compilation

and replaced it with the (already in use)

editor

In our list, of course, we had the latest form, 'editor'. But the DLC file, which goes back to the beginning of MARC/RDA in 2010, had many instances of 'editor of compilation'.

So, the simple string list, which matches list terms against MARC data, successfully matched 'editor' against both

700 $e editor
700 $e editor of compilation

whereas, with the value list method, 'editor of compilation' did not match any terms in the supplied list.
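
The difference in counts can be reproduced on a toy scale. In this Python sketch the designator list and the 700 $e values are small hypothetical samples, not the data from the test described above, and both methods are reduced to plain case-sensitive comparisons.

```python
# Hypothetical excerpt of a relationship designator list.
designators = ["author", "editor", "illustrator"]

# Hypothetical 700 $e contents taken from a handful of records.
subfields = ["editor", "editor of compilation", "author", "translator"]

# Simple string list: is any list item found inside the subfield?
simple_hits = [s for s in subfields if any(d in s for d in designators)]

# Value list: is the whole subfield an item in the list?
designator_set = set(designators)
value_hits = [s for s in subfields if s in designator_set]

print(len(simple_hits), len(value_hits))  # 3 2
```

The extra hit in the simple method is 'editor of compilation', which contains the list item 'editor' but is not itself in the list.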

1) but see below for more options on controlling matching
2) One notable difference is that MARC Review allows us to specify case-sensitivity separately, whereas in an OPAC, normalization implies ignoring case
help/mr_list_search_236.txt · Last modified: 2017/02/04 12:04 by richard
CC Attribution-Noncommercial-Share Alike 3.0 Unported