MARC Global and PCRE

Getting to know subpatterns

Many of the MARC Review PCRE examples demonstrate the use of subpatterns in MARC Review. These same subpatterns may also be used in the top form of the Change Data task in MARC Global.

In brief, when patterns enclosed in parens in the top pattern match, the matching data is saved into a memory space; these memory spaces (see note 1) can then be accessed in the bottom pattern using a 'back reference' technique, where:

'\1' refers to memory space #1, i.e. whatever matched the first subpattern
'\2' refers to memory space #2, i.e. whatever matched the second subpattern

The subpattern numbering (i.e. the determination of 'first', 'second', etc.) follows simple sequential order: subpatterns are counted by their opening paren, from left to right.
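
A minimal sketch of the numbering rule, using Python's `re` module as a stand-in for PCRE (an assumption made for illustration only; MARC Global is driven through its forms, not through code). It shows what 'sequential order' means once subpatterns are nested: the outer paren opens first, so it becomes subpattern 1.

```python
import re

# Subpatterns are numbered by the position of their opening paren.
match = re.match(r"((a)(b))", "ab")
groups = match.groups()
print(groups)  # ('ab', 'a', 'b')
```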

Here is a quick example to illustrate. Suppose you have sentences in your abstracts that are missing the blank space between the end of one sentence and the start of another. And, for some unknown reason :-), you have decided that you want to add the missing blank spaces.

Without PCRE, you would have to change every period (full-stop) to period + blank space–and take your chances. But now, the following will do the job quite nicely, with little need for a subsequent clean-up review:

In the example we create two subpatterns:

([a-z]\.)
([A-Z]) 

And in the 'New Data' form we reference them as:

\1
\2

If the 520 field looks like this:

They include excellent overviews of history for 
each country.Long available as a print series, the 
web versions are easy to navigate and read.

then the first subpattern ([a-z]\.) matches a lowercase letter followed by a period:

y.   (each countr'y.')

and the second subpattern ([A-Z]) matches an uppercase letter following the first subpattern:

L    ('L'ong available)

Therefore, when we refer to the subpattern matches in the 'New data' section,

\1 is equivalent to 'y.'
\2 is equivalent to the 'L' (after 'y.')

and thus, our replacement pattern

\1 \2 

basically says

replace 'y.L' with 'y. L'

Note: the blank space between \1 and \2 is not a separator–it is data to be added between the two subpatterns!
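
The same replacement can be tried outside the program. The following is only a sketch in Python, whose `re` module accepts this particular pattern and the '\1 \2' back references unchanged; it is not how MARC Global runs the job internally.

```python
import re

field_520 = (
    "They include excellent overviews of history for "
    "each country.Long available as a print series, the "
    "web versions are easy to navigate and read."
)

# Top pattern: two adjacent subpatterns. Bottom pattern: \1, a blank, \2.
fixed = re.sub(r"([a-z]\.)([A-Z])", r"\1 \2", field_520)
print(fixed)
```

The final period in the field is left alone, because no uppercase letter follows it and so the second subpattern cannot match.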


Much of the power of PCRE lies in subpatterns, so it's important to know a little bit about them, and thus, a few more examples from real-life (or should that be 'real data'?) follow.

Example: Changing variant values to a common value

With the changes made to form subdivisions over the years, it's often difficult to predict in what subfield a subdivision might be found.

Let's say you want to find all fields with the subdivisions “Maps, Physical” and “Bathymetric maps”, whether they appear in $x or $v, and change them all to $v “Maps”.

The best way to do this in the past would be to use two separate reviews (see note 2). Beginning with version 236, this type of task can now be performed in a single review:

Data to Change:

TAG=651
DATA=±[xv](Maps, Physical|Bathymetric maps)
REGULAR EXPRESSION=Yes

New data:

REGULAR EXPRESSION=No
DATA=±vMaps

Explanation: In tag 651, match a subfield delimiter followed by either 'x' or 'v' (the character class '[xv]'), followed by the subpattern '(Maps, Physical|Bathymetric maps)', where '|' indicates an 'or' condition.

Replace the entire match (the delimiter, the subfield code, and whichever phrase matched the subpattern) with '±vMaps'.

In fact, if you had more variants that you wanted to change to 'Maps', you could simply add them to the subpattern, separated by pipes:

(Maps, Physical|Bathymetric maps|Maps, Tourist|Maps, Topographic)

etc.
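
A rough Python equivalent of this review, for experimenting with the alternation before running it on real records. The '±' character is used here the way this page displays it, as a stand-in for the MARC subfield delimiter, and the three sample fields are made up for illustration.

```python
import re

DELIM = "±"  # stand-in for the subfield delimiter, as displayed on this page

pattern = re.compile(DELIM + r"[xv](Maps, Physical|Bathymetric maps)")

# Hypothetical 651 field contents.
fields_651 = [
    "±aAtlantic Ocean±xBathymetric maps",
    "±aAlps±vMaps, Physical",
    "±aAlps±vMaps",
]
changed = [pattern.sub(DELIM + "vMaps", f) for f in fields_651]
for before, after in zip(fields_651, changed):
    print(before, "->", after)
```

The third field is untouched: '±vMaps' on its own matches neither alternative in the subpattern.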

Example: Changing a 'c' into a ©

MARC Global recognizes the PCRE syntax to specify diacritics, both in the 'Data to Change', and the 'New Data' boxes.

For example, to change the letter 'c' into the copyright symbol, you would set up a Change Data form either like this

or like this

Of course, when dealing with diacritics, you will need to know the character encoding of your records, so each of the above examples should be combined with an appropriate pre-processing pattern.

For example, for Unicode records, set up a pattern like the following before entering the 'Change data' parameters:

IMPORTANT

Do not use the older MARC Review curly brace syntax, e.g.

New data={xC2}{xA9}

as a replacement pattern in MARC Global, as it will, in this case, replace the 'c' with the string '{xC2}{xA9}'.

The reason for this is twofold.

First, the curly brace syntax is now deprecated. Replace the old curly brace syntax with the standard backslash notation in your reviews.

Second, the old curly brace syntax requires the regular expression checkbox be selected, and that checkbox isn't usually enabled for the 'New data' option.
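
To see why the two notations behave so differently, here is a small Python sketch. It assumes UTF-8 records, where the copyright symbol is the byte pair C2 A9 (the same two bytes named in the curly brace example above); the sample data 'c2004' is hypothetical.

```python
# The bytes written \xC2\xA9 are the UTF-8 encoding of the copyright symbol.
copyright_utf8 = b"\xC2\xA9"
symbol = copyright_utf8.decode("utf-8")

# Hypothetical data: a 'c' standing in for the symbol before a year.
before = b"c2004"
after = before.replace(b"c", copyright_utf8, 1)

# The curly brace text, when it is treated as plain data, is inserted as-is.
literal = before.replace(b"c", b"{xC2}{xA9}", 1)

print(after.decode("utf-8"))   # ©2004
print(literal.decode("utf-8")) # {xC2}{xA9}2004
```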

Example: Fixing uppercase titles

In MARC Global, it is now possible to use a regular expression in the bottom part of the 'Change Data' form:

In the screenshot above, we are matching uppercase words in the title (literally, a subpattern of one or more uppercase characters followed by a non-word character).

In the 'New Data' section, we now have the opportunity to use regular expressions. The example in the screenshot shows another use of the backslash: to refer to a subpattern occurrence.

When a subpattern is referenced this way in a replacement pattern, it's like saying 'insert that subpattern here'. Thus, the replacement pattern can be read as:

\L  Begin lowercase
\1  insert subpattern 1 (followed by a blank space)
\E  End lowercase

Whatever matches subpattern 1 (an uppercase word) will be converted to lowercase. Since we have set the 'Data Occ' option to 'All' in the top section, the change will be applied to all uppercase words in the title.

This is, in essence, the first step in creating an autoreview to convert upper-case titles to 'library' title case:

'BEST OF THE BOLSHOI.'  Changed to: 'best of the bolshoi.'
'MUSIC FOR AMERICA.'    Changed to: 'music for america.'
'THE CINNAMON BEAR.'    Changed to: 'the cinnamon bear.'
'ORCHESTRA WORKS'       Changed to: 'orchestra WORKS'
'BUFFALO DANCE /'       Changed to: 'buffalo dance /'
'WRC RADIO AIRCHECK :'  Changed to: 'wrc radio aircheck :'
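
Python's `re` module has no '\L ... \E' case operators, so the sketch below swaps in a replacement function to get the same effect. The top pattern is an assumption reconstructed from the description above (one or more uppercase characters followed by a non-word character), not copied from the form.

```python
import re

def lowercase_words(title: str) -> str:
    # Assumed top pattern: uppercase letters plus one non-word character.
    # The function plays the role of \L\1\E in the 'New Data' box.
    return re.sub(r"([A-Z]+\W)", lambda m: m.group(1).lower(), title)

print(lowercase_words("BEST OF THE BOLSHOI."))  # best of the bolshoi.
print(lowercase_words("ORCHESTRA WORKS"))       # orchestra WORKS
```

The second title shows the leftover case: 'WORKS' is the last thing in the data, so there is no non-word character after it for the pattern to match.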

There are two steps left:

1. Take care of titles with ending words that did not match (because they were not followed by a non-word character):

TAG=245
SUBF=a
DATA=(\W[A-Z]+)$
DATA OCC=First
REGULAR EXPRESSION=Yes
NEW DATA=\L\1\E
REGULAR EXPRESSION=Yes

This will change 'orchestra WORKS' to 'orchestra works', etc.

2. And, uppercase the first letter of the title:

TAG=245
SUBF=a
DATA=^([a-z])
DATA OCC=First
REGULAR EXPRESSION=Yes
NEW DATA=\U\1\E
REGULAR EXPRESSION=Yes

This will give us the final result:

'best of the bolshoi '  Changed to: 'Best of the bolshoi '
'music for america '    Changed to: 'Music for america '
'the cinnamon bear '    Changed to: 'The cinnamon bear '
'orchestra works'       Changed to: 'Orchestra works'
'buffalo dance /'       Changed to: 'Buffalo dance /'
'wrc radio aircheck :'  Changed to: 'Wrc radio aircheck :'

These few examples also illustrate the complexity of dealing with all-uppercase data, since we have now lowercased proper names like 'bolshoi', lost acronyms ('WRC'), etc.
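
The two follow-up steps can be sketched in Python as well. As before, this is only an approximation: replacement functions stand in for '\L\1\E' and '\U\1\E', which Python's `re` does not support, and the function name is made up for this example.

```python
import re

def finish_title(title: str) -> str:
    # Step 1: lowercase a trailing uppercase word, as \L\1\E would.
    title = re.sub(r"(\W[A-Z]+)$", lambda m: m.group(1).lower(), title)
    # Step 2: uppercase the first letter, as \U\1\E would.
    return re.sub(r"^([a-z])", lambda m: m.group(1).upper(), title)

print(finish_title("orchestra WORKS"))  # Orchestra works
print(finish_title("buffalo dance /"))  # Buffalo dance /
```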

Example: Putting names in direct order

Using sub-patterns and PCRE, it is now possible to change the positions of pieces of data in a MARC record quite simply. For example, a common task in some data manipulations is to put names into direct order:

Satie, Erik   -->   Erik Satie

To do this in MARC Global, go to the Change Data form and enter, at the top:

TAG=100
SUBF=a
DATA=(\w+), (.*)
REGULAR EXPRESSION=Yes

And in the 'New data' section:

NEW DATA=\2 \1
REGULAR EXPRESSION=Yes

The results will be as follows:

'Geptner, V. G.'          Changed to: 'V. G. Geptner'
'Rachmaninoff, Sergei,'   Changed to: 'Sergei Rachmaninoff,'
'Beethoven, Ludwig van,'  Changed to: 'Ludwig van Beethoven,'
'Dingfelder, Ingrid.'     Changed to: 'Ingrid. Dingfelder'
'Hantaèi, Pierre,'        Changed to: 'HantaèPierre, i'
'Gilmore, Horace W.,'     Changed to: 'Horace W. Gilmore,'
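
For names without diacritics or trailing commas, the swap is easy to try in Python, which accepts the same pattern and back references. This is a sketch only: Python's '\w' is Unicode-aware on text strings, unlike the byte-oriented behaviour discussed next, so the sample is limited to plain ASCII names.

```python
import re

def direct_order(name: str) -> str:
    # \1 is the surname, \2 is everything after the comma and blank.
    return re.sub(r"(\w+), (.*)", r"\2 \1", name)

print(direct_order("Satie, Erik"))          # Erik Satie
print(direct_order("Geptner, V. G."))       # V. G. Geptner
print(direct_order("Dingfelder, Ingrid."))  # Ingrid. Dingfelder
```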

At first glance, this looks OK, but look closer at the name containing the diacritic. It got jumbled, because of the following quirk in PCRE:

In UTF-8 mode, characters with values greater than 128 never match 
\d, \s, or \w, and always match \D, \S, and \W. This is true even 
when Unicode character property support is available. These 
sequences retain their original meanings from before UTF-8 support 
was available, mainly for efficiency reasons. 

So, if we are working with diacritics, and that's often going to be the case with name fields, we must adjust our top pattern as follows:

TAG=100
SUBF=a
DATA=([\w\x80-\xFF]+), ([\w\x80-\xFF]+)
REGULAR EXPRESSION=Yes

We create a character class […] containing \w (any ASCII 'word' character), and to it we add the high-bit bytes (\x80-\xFF), which lie above the ASCII range and are used to carry diacritics (whether in MARC-8 or UTF-8).

Re-running this job, the jumbled entry now becomes:

'Hantaèi, Pierre,'	 Changed to: 'Pierre Hantaèi,'
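
Both behaviours can be reproduced with Python byte strings, where '\w' is ASCII-only, much like the PCRE behaviour quoted above. This is a sketch under the assumption that the record is UTF-8, so the 'è' is carried as two bytes above \x7F.

```python
import re

name = "Hantaèi, Pierre,".encode("utf-8")

# \w alone stops at the diacritic bytes, so only the final 'i' is captured.
ascii_only = re.sub(rb"(\w+), (.*)", rb"\2 \1", name)

# Adding \x80-\xFF to the class lets the whole surname match.
widened = re.sub(rb"([\w\x80-\xFF]+), ([\w\x80-\xFF]+)", rb"\2 \1", name)

print(ascii_only.decode("utf-8"))  # HantaèPierre, i
print(widened.decode("utf-8"))     # Pierre Hantaèi,
```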

Of course, there are still several clean-up jobs to follow, some of which may depend on how dates and name qualifiers will be handled. We will leave these tasks as a (hopefully) enjoyable puzzle for the user to solve.

MARC Review help for diacritics

This excerpt is based on the current MARC Review help page and is followed by some suggestions of how diacritics should be searched in the PCRE environment. It is worth repeating the following excerpt from the PCRE documentation:

In UTF-8 mode, characters with values greater than 128 never match 
\d, \s, or \w, and always match \D, \S, and \W. This is true even 
when Unicode character property support is available. These 
sequences retain their original meanings from before UTF-8 support 
was available, mainly for efficiency reasons. 

DIACRITICS

To search for a character not on your keyboard, enter the numeric value of the character enclosed in curly braces. You may use either decimal or hexadecimal notation for this number; decimal numbers must be zero-filled to three digits and fall within the range 000-255; hex numbers must begin with an 'x' and fall within the range 00-FF.

For example, decimal {031} or hex {x1F} will match the MARC subfield delimiter.

Note that entering a character in this manner, as of version 236, requires that the regular expression option be selected.

A more interesting example; entering:

[{x7F}-{xFF}]

as a pattern and selecting the regular expression option will find all diacritics in a field. This works because MARC Review performs a character substitution for 'curly brace' expressions before it processes a regular expression (thus, the regular expression engine never sees the curly braces).
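
The substitution step described above might look something like the following Python sketch. The function name and its details are hypothetical; this is one way such a pre-pass could work, not MARC Review's actual code.

```python
import re

_CURLY = re.compile(r"\{x([0-9A-Fa-f]{2})\}|\{(\d{3})\}")

def expand_curly(pattern: str) -> bytes:
    """Hypothetical pre-pass: swap {xNN} or {NNN} for the byte it names."""
    def to_char(m):
        value = int(m.group(1), 16) if m.group(1) else int(m.group(2))
        if value > 255:
            raise ValueError(f"{m.group(0)} is outside the range 000-255")
        return chr(value)
    return _CURLY.sub(to_char, pattern).encode("latin-1")

# Made-up field data: two subfields, each led by the delimiter byte 1F.
field = b"\x1faTitle\x1fbSubtitle"
delimiter_a = re.compile(expand_curly("{x1F}a"))
print(delimiter_a.search(field).start())  # 0
```

Because the byte is dropped in unescaped, a code such as {x5B} would hand the regular expression engine a bare '[', which is one more reason to prefer the backslash notation.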

However, if you are planning to make full use of the PCRE support in the program, then it might be a good idea not to use MARC Review's curly brace technique for matching diacritics inside a regular expression (especially as this usage could be interpreted as a 'repetition quantifier' by PCRE).

Instead, use PCRE's '\x' + the hex code of the character. For the example above, use:

[\x7F-\xFF]

and select the regular expression checkbox.
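
A short Python sketch of both points, using byte strings so that '\x7F-\xFF' is read as a range of byte values. The sample name is assumed to be UTF-8 encoded.

```python
import re

data = "Hantaèi, Pierre,".encode("utf-8")

# \x escapes inside a character class: every byte from 7F through FF.
high_bytes = re.findall(rb"[\x7F-\xFF]", data)
print(high_bytes)  # [b'\xc3', b'\xa8']

# Braces after a character are a repetition quantifier: a{3} means 'aaa'.
quantified = re.fullmatch(rb"a{3}", b"aaa") is not None
print(quantified)  # True
```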

MARC Report customizations and PCRE

In MARC Review, it has always been possible to embed multiple patterns inside one pattern using double pipes (||). For example–

TAG=100
SUBF=a
DATA=Doe, John||Doe, Jane

–will find all authors named 'Doe', whether 'John' or 'Jane'. This is not a regular expression. Watch out in PCRE, however, where single pipes (which function in a similar way) do imply a regular expression.
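
The difference can be sketched in Python. The double-pipe handling shown here is a guess at the general idea (each alternative treated as a plain substring), not the program's actual logic, and the function name is hypothetical.

```python
import re

def matches_double_pipe(pattern: str, data: str) -> bool:
    # Hypothetical reading of '||': each alternative is a plain substring test.
    return any(alt in data for alt in pattern.split("||"))

print(matches_double_pipe("Doe, John||Doe, Jane", "Doe, Jane, 1950-"))  # True

# As plain text, '.' is just a period; as a regular expression it is a wildcard.
print(matches_double_pipe("A.B||C", "AxB"))   # False
print(bool(re.search(r"A.B|C", "AxB")))       # True
```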

As mentioned above in the section on diacritics, enclosing a character code in curly braces (to search for diacritics) is not a regular expression in MARC Report, whereas using curly braces to indicate the minimum/maximum number of occurrences to match is indeed PCRE syntax.

In MARC Global, entering '^' or '$' alone, as a pattern, has always matched the beginning and end of a field or subfield, respectively. This makes '^' a synonym for inserting data, and '$' a synonym for appending data. This is an extension of true regular expression usage, including PCRE, where '^' and '$' by themselves match only a position and no actual data.
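
The insert/append reading of '^' and '$' can be illustrated in Python, where replacing a position has the same visible effect. The subfield text and the added strings below are made up for the example.

```python
import re

subfield = "Music for America"  # hypothetical subfield data

# Replacing the start position inserts; replacing the end position appends.
inserted = re.sub(r"^", "The ", subfield)
appended = re.sub(r"$", " /", subfield)

print(inserted)  # The Music for America
print(appended)  # Music for America /
```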

'newlines', 'returns', 'tabs', etc., are not (or should not be) found in MARC, so ignore discussion of these when reading PCRE docs. As a rule, PCRE's metacharacter for a space character– \s –should find only blank spaces in MARC data.

'perl-compatible …' is not quite the same as 'perl …' regular expressions; for this reason we advise not to use 'perldocs' documentation as a regular expression reference.

1) or buckets, or whatever you want to call them
2) or four reviews, if you did not use a regular expression
236/pcre_and_mg.txt · Last modified: 2021/12/29 16:21 (external edit)
CC Attribution-Share Alike 4.0 International