MARC Global: 'Lazy' and 'Greedy' regular expressions

In PCRE, the regular expression engine used in MARC Review and MARC Global, pattern matching is 'greedy' by default. This means that a pattern will match as much data as it can.

Consider the following MARC data:

±aBall, Michael,$±cRevd. Dr.
±aRivers, Larry E.,$±d1950-
±aWells, H. G.±q(Herbert George),±d1866-1946.
±aWade, William,±cSir,±d1918-
±aSimon, Simple.

What if we just wanted to boil these headings down to the person's Name ($a) and strip out what follows, such as the Title ($c), Fuller form of the name ($q), and Date ($d)? We would want to start at the $a and end at the first following subfield.

How about this:

^(±a.*)(±.*$)

and simply

\1

for the replacement, with regular expression checked in both places, thinking that we will capture everything from the $a to the next subfield delimiter (and save the match in the first subpattern)?

This works well until we run into headings with more than one subfield after the $a. For example, the result of using this pattern on the 'H. G. Wells' heading is as follows:

±aWells, H. G.±q(Herbert George),

and not what we were hoping for:

±aWells, H. G.

Here, we see that instead of stopping the match at the first subfield delimiter ($q), PCRE does not stop the match until it finds the last subfield delimiter ($d) after the $a in the heading; again, this is because it defaults to matching as much as possible, not as little as possible.
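The greedy result can be reproduced outside the program. The following is a minimal sketch that uses Python's re module as a stand-in for PCRE; that substitution is an assumption, although both engines are greedy by default. The function name keep_first_group is made up for this illustration.

```python
import re

# Python's re module stands in for PCRE here (an assumption of this sketch);
# like PCRE, its quantifiers are greedy unless told otherwise.
GREEDY = re.compile(r"^(±a.*)(±.*$)")


def keep_first_group(heading: str) -> str:
    """Replace the whole match with the first subpattern (the \\1 replacement)."""
    return GREEDY.sub(r"\1", heading)


heading = "±aWells, H. G.±q(Herbert George),±d1866-1946."
# The first .* runs to the end of the heading and backtracks only as far as
# the LAST subfield delimiter, so the $q subfield stays in the first group.
print(keep_first_group(heading))  # ±aWells, H. G.±q(Herbert George),
```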

This is where the new 'Lazy match' option will be useful to us: it disables the default 'greedy' behavior. If we select this option, PCRE will stop at the earliest possible position that satisfies the pattern (i.e. the $q), so that the new result would indeed be:

±aWells, H. G.
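For comparison, here is a sketch of the lazy behavior in Python. Python's re module has no checkbox like 'Lazy match', so the lazy quantifier *? is written out explicitly. It is an assumption of this sketch that the option behaves as if every quantifier in the pattern were lazy. The function name name_only is hypothetical.

```python
import re

# Assumption: the 'Lazy match' option acts as if each quantifier were lazy,
# so both .* are written as .*? here.
LAZY = re.compile(r"^(±a.*?)(±.*?$)")


def name_only(heading: str) -> str:
    """Keep only the text up to the first subfield delimiter after $a."""
    return LAZY.sub(r"\1", heading)


heading = "±aWells, H. G.±q(Herbert George),±d1866-1946."
# The first .*? grows one character at a time and stops at the FIRST ±.
print(name_only(heading))  # ±aWells, H. G.
```

A heading with no subfield after the $a, such as '±aSimon, Simple.', does not match the pattern at all and is left unchanged.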

Note that the 'Lazy match' option is only available in Change Data. It is not necessary in other parts of the program where regular expressions are available, i.e. in pattern matching, because the purpose of pattern matching in MARC Review and MARC Global is simply to arrive at a Yes/No outcome: is it a match, or not? But when a pattern matches and a replacement is to be made, the latter operation is handled by the Change Data task, which now supports both 'Lazy' and 'Greedy' matching.
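The Yes/No point can be checked with a small sketch, again using Python's re module as an assumed stand-in for PCRE. For the pattern used in this article, the greedy and lazy forms agree on whether a heading matches; they differ only in how the matched text is divided between the subpatterns. The function name is_match is made up for this illustration.

```python
import re

GREEDY = re.compile(r"^(±a.*)(±.*$)")
LAZY = re.compile(r"^(±a.*?)(±.*?$)")  # lazy form written out explicitly


def is_match(heading: str) -> tuple[bool, bool]:
    """Return (greedy_matches, lazy_matches) for one heading."""
    return (GREEDY.search(heading) is not None,
            LAZY.search(heading) is not None)


for heading in ("±aWells, H. G.±q(Herbert George),±d1866-1946.",
                "±aSimon, Simple."):
    # Both flags are always equal: True for Wells, False for Simon.
    print(heading, is_match(heading))
```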


If you have played a bit with the above example, you may have noticed that it is not a perfect solution. For example, running the change on this heading:

±aWade, William,±cSir,±d1918-

will produce this result:

±aWade, William,

Is there a way to clean up that trailing comma without running a second pass?

Yes, in some cases we can get around this issue by moving the trailing punctuation out of the first subpattern (the one we are keeping) and into the beginning of the second subpattern, and then specifying new trailing punctuation in the replacement.

So, for example, instead of this:

^(±a.*)(±.*$)

use this:

^(±a.*)([\.,]±.*$)

and add the terminating period to the replacement:

\1.

The end result is now:

±aWade, William.
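The punctuation variant can be sketched the same way. As before, Python's re module stands in for PCRE and the lazy quantifiers are written out explicitly, both of which are assumptions of this sketch. The function name name_with_period and the 'Doe, Jane' heading are made up for illustration.

```python
import re

# The trailing period or comma now belongs to the second (discarded) group.
LAZY_PUNCT = re.compile(r"^(±a.*?)([\.,]±.*?$)")


def name_with_period(heading: str) -> str:
    """Keep the name from $a and end it with a single period."""
    return LAZY_PUNCT.sub(r"\1.", heading)


print(name_with_period("±aWade, William,±cSir,±d1918-"))
# ±aWade, William.
print(name_with_period("±aWells, H. G.±q(Herbert George),±d1866-1946."))
# ±aWells, H. G.

# Made-up heading with no period or comma before the next delimiter:
# the pattern does not match, so the heading comes back unchanged.
print(name_with_period("±aDoe, Jane±d1900-"))
# ±aDoe, Jane±d1900-
```

The last case shows why this only works in some cases: the pattern requires a period or comma directly before the next subfield delimiter.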
239/pcre_greed.txt · Last modified: 2013/04/27 09:09 (external edit)
CC Attribution-Noncommercial-Share Alike 3.0 Unported