MARC Review and PCRE

PCRE stands for Perl Compatible Regular Expressions, an open source library supported by MARC Review and MARC Global.

A copy of the official documentation for using PCRE patterns can be found here.
(The license for the PCRE documentation is the same as for the PCRE code, which can be found here.)

The goal of this page is to provide some examples of the new regular expression usage in MARC Review and MARC Global.

If you need help with regular expressions, please email us; that's one of the things we are here for.

There are many tutorials for regular expressions available on the web. Keep in mind that these tutorials are often written for programmers, so you may want to check for one that matches your own level of understanding.

http://www.regular-expressions.info/tutorial.html
A good R/E site for beginners, although it does contain advertisements.

http://en.wikipedia.org/wiki/Regular_expression_examples
A page of basic R/E examples from Wikipedia.

http://www.php.net/manual/en/reference.pcre.pattern.syntax.php
A perhaps more readable version of the PCRE documentation.


MARC Review introduction to regular expressions

This very brief introduction to R/E is from the current MARC Review help page:

REGULAR EXPRESSION

If the Regular Expression box is selected, the program will treat the pattern entered in the DATA box as a regular expression. The most common metacharacters used in pattern matching are:

  .       matches any single character

  ^       anchors match to the beginning of the data 
  $       anchors match to the end of the data

  *       matches 0 or more of the preceding expression

  [       begin character class definition
  ]       end character class definition	
  -       within a character class, indicates a range of characters

          E.g., [a-d] matches a, b, c, or d
          E.g., [^a-d] matches any character except a, b, c, or d
          E.g., [A-Z] matches any uppercase letter from A to Z

  \       removes special meaning from above metacharacters 

As an example, if the Regular Expression box is checked, and your data pattern contains:

 200[0-2]

the program will match any data that contains '2000', '2001', or '2002'. If the Regular Expression box were not checked, the program would try to match the literal string '200[0-2]'.
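
The difference between the two modes can be tried outside the program. The following is a minimal sketch in Python, whose `re` module handles this simple pattern the same way PCRE does; the function names and sample strings are assumptions made for illustration and are not part of MARC Review.

```python
import re

PATTERN = "200[0-2]"

def matches_as_regex(data: str) -> bool:
    # Regular Expression box checked: [0-2] is read as a character class.
    return re.search(PATTERN, data) is not None

def matches_as_literal(data: str) -> bool:
    # Box unchecked: the pattern is searched for character by character.
    return PATTERN in data

# Hypothetical sample data.
samples = ["c2001.", "c1999.", "200[0-2]"]
regex_hits = [s for s in samples if matches_as_regex(s)]
literal_hits = [s for s in samples if matches_as_literal(s)]
```

Here only 'c2001.' matches as a regular expression, and only the string '200[0-2]' itself matches as a literal.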

NOTE: within square brackets, '^' negates a match. Therefore, to find all instances of invalid subfield coding, we could use the following expression:

±[^0-9a-z] 

This would match any subfield delimiter ± that is followed by a character not in the character class 0-9a-z.
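
As a rough illustration, the same negated class can be tested in Python. This sketch assumes the subfield delimiter appears in the data as the ± character used on this page, and the sample fields are invented for the example.

```python
import re

# "±" stands in for the subfield delimiter, as displayed on this page.
INVALID_SUBFIELD_CODE = re.compile(r"±[^0-9a-z]")

# Hypothetical field data.
fields = [
    "±aTitle :±bsubtitle /±cauthor.",   # all subfield codes are valid
    "±aTitle :±Bsubtitle",              # uppercase code B
    "±aTitle ±-note",                   # punctuation used as a code
]
flagged = [f for f in fields if INVALID_SUBFIELD_CODE.search(f)]
```

Only the second and third fields are flagged, because each contains a delimiter followed by something other than a digit or lowercase letter.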

Do not use commas to separate individual values in a character class. For example, this is the correct way to pattern match the ten numeric digits and the uppercase letters 'A', 'B', and 'C':

[0-9ABC]

By contrast, the following regular expression treats the comma as one more member of the class, so it will also match any data that contains a comma:

[0-9,A,B,C]
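
The effect of the stray commas is easy to check. This is a small Python sketch (character classes behave the same way in Python's `re` and in PCRE for this case); the sample string is an assumption chosen to contain a comma but none of the intended characters.

```python
import re

correct = re.compile(r"[0-9ABC]")
with_commas = re.compile(r"[0-9,A,B,C]")

# Contains a comma, but no digit and no uppercase A, B, or C.
data = "xyz, pqr"

correct_hit = correct.search(data) is not None       # False
comma_hit = with_commas.search(data) is not None     # True: the comma matched
```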

MARC Review and PCRE

With PCRE, the program's metacharacter support is increased a great deal:

  .       matches any single character

  ^       anchors match to the beginning of the data 
  $       anchors match to the end of the data

  *       matches 0 or more of the preceding expression
  +       matches 1 or more of the preceding expression
  ?       matches 0 or 1 of the preceding expression

  [       begin character class definition
  ]       end character class definition	
  -       within a character class, indicates a range of characters
  
  |       alternative pattern separator
          E.g., red|black matches "red" or "black"

  (       begin subpattern
  )       end subpattern

  {       begin repetition quantifier
  }       end repetition quantifier

  \       if followed by one of the above, treat the character literally
          E.g., \* in a pattern will match the asterisk character
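
A few of the added metacharacters can be tried with the sketch below. It uses Python's `re` module as a stand-in for PCRE (the two agree on these constructs); the helper name `found` and the sample strings are assumptions for illustration only.

```python
import re

def found(pattern: str, data: str) -> bool:
    # True if the pattern matches anywhere in the data.
    return re.search(pattern, data) is not None

checks = {
    # + requires at least one "b"
    "plus": found(r"ab+c", "abbbc") and not found(r"ab+c", "ac"),
    # ? makes the "u" optional
    "question": found(r"colou?r", "color") and found(r"colou?r", "colour"),
    # | separates alternatives
    "alternation": found(r"red|black", "a black cover")
    and not found(r"red|black", "a blue cover"),
    # \* matches a literal asterisk
    "escape": found(r"\*", "5 * 3") and not found(r"\*", "5 x 3"),
}
```

Every entry in `checks` evaluates to True.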

A second use of backslash provides a way of encoding non-printing 
characters in patterns in a visible manner:
  \cx       "control-x", where x is any character
  \n        linefeed (hex 0A)
  \r        carriage return (hex 0D)
  \t        tab (hex 09)
  \ddd      character with octal code ddd
  \xhh      character with hex code hh
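
The escapes for non-printing characters can be checked with a short Python sketch. Python's `re` module accepts \t, \n, \xhh and three-digit octal codes in the same way, but it does not support \cx, so that form is left out here. The sample string is an assumption.

```python
import re

# Hypothetical data: a tag, a tab character, then some text.
record_line = "245\tA title"

tab_found = re.search(r"\t", record_line) is not None       # tab (hex 09)
hex_found = re.search(r"\x41", record_line) is not None     # hex 41 is "A"
octal_found = re.search(r"\101", record_line) is not None   # octal 101 is "A"
newline_found = re.search(r"\n", record_line) is not None   # no linefeed present
```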

Another use of backslash is for specifying generic character types. The 
following are always recognized:
  \d     any decimal digit
  \D     any character that is not a decimal digit
  \h     any horizontal whitespace character
  \H     any character that is not a horizontal whitespace character
  \s     any whitespace character
  \S     any character that is not a whitespace character
  \v     any vertical whitespace character
  \V     any character that is not a vertical whitespace character
  \w     any "word" character
  \W     any "non-word" character

These character type sequences can appear both inside and outside 
character classes. 
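
The generic character types can be explored with the following Python sketch. Python's `re` module supports \d, \D, \s, \S, \w and \W as described above, but not \h, \H, \v or \V as character types, so only the first group is shown. The sample string is invented for the example.

```python
import re

# Hypothetical subfield data.
data = "ISBN 0306406152 (pbk.)"

digits = re.findall(r"\d+", data)        # runs of decimal digits
words = re.findall(r"\w+", data)         # runs of "word" characters
non_words = re.findall(r"\W", data)      # single non-word characters
in_class = re.findall(r"[\d.]+", data)   # \d used inside a character class
```

Here `digits` is ['0306406152'], `words` is ['ISBN', '0306406152', 'pbk'], and `in_class` picks up both the number and the full stop.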

In addition, the following use of backslash is available in MARC Global:
  \L .. \E   lowercase all characters between '\L' and '\E'
  \U .. \E   uppercase all characters between '\U' and '\E'

REPETITION

The general repetition quantifier specifies a minimum and maximum number
of permitted matches, by giving the two numbers in curly brackets (braces),
separated by a comma. For example:

  z{2,4}

matches "zz", "zzz", or "zzzz". 

A closing brace on its own is not a special character. 

If the second number is omitted, but the comma is present, there is no 
upper limit; if the second number and the comma are both omitted, the 
quantifier specifies an exact number of required matches. Thus

  [aeiou]{3,}

matches at least 3 successive vowels, but may match many more, while

  \d{8}

matches exactly 8 digits. 
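
The three quantifier forms above can be tried with this Python sketch (Python's `re` treats {m,n}, {m,} and {m} the same way). The helper `first_match` and the sample strings are assumptions for illustration.

```python
import re

def first_match(pattern: str, data: str):
    # Return the text of the first match, or None if there is none.
    m = re.search(pattern, data)
    return m.group(0) if m else None

bounded = first_match(r"z{2,4}", "jazzzzzy")          # five z's available
too_few = first_match(r"z{2,4}", "jazy")              # only one z
open_ended = first_match(r"[aeiou]{3,}", "queueing")  # run of five vowels
exact = first_match(r"\d{8}", "20211229 16:21")
no_exact = first_match(r"\d{8}", "2021-12-29")        # digits are interrupted
```

Note that z{2,4} stops after four of the five available z's, while [aeiou]{3,} takes the whole run of vowels.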

An opening curly bracket that appears in a position where a quantifier
is not allowed, or one that does not match the syntax of a quantifier, 
is taken as a literal character. For example, {,6} is not a quantifier, 
but a literal string of four characters. 

For convenience, the three most common quantifiers have single-character 
abbreviations:

  *    is equivalent to {0,}
  +    is equivalent to {1,}
  ?    is equivalent to {0,1}
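
One way to convince yourself of these equivalences is to compare each abbreviation with its long form on the same data. The sketch below does this in Python, where the same equivalences hold; the sample strings and the helper name `results` are assumptions.

```python
import re

# Hypothetical test strings, including the empty string.
samples = ["", "a", "aa", "aaa", "b", "ab"]
pairs = [(r"a*", r"a{0,}"), (r"a+", r"a{1,}"), (r"a?", r"a{0,1}")]

def results(pattern: str):
    # fullmatch requires the whole string to match, which keeps the
    # comparison strict.
    return [re.fullmatch(pattern, s) is not None for s in samples]

equivalent = all(results(short) == results(long) for short, long in pairs)
```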

MARC Review PCRE examples

1. How to find titles that contain an acronym in the subfield $c

TAG=245 
SUBF=c 
DATA=[A-Z]{2,}\W
REGULAR EXPRESSION=Yes

Explanation: an uppercase letter [A-Z] occurring two or more times {2,} followed by a non-word character \W
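
The pattern can be tested on sample data before running it against a file. This Python sketch uses invented subfield $c values; it also shows that an acronym at the very end of the data, with nothing after it, is not found, because \W requires a following character.

```python
import re

ACRONYM = re.compile(r"[A-Z]{2,}\W")

# Hypothetical 245 $c data.
subfield_c = [
    "prepared by the IEEE Computer Society.",
    "by John Smith.",
    "UNESCO",   # acronym with no following character
]
hits = [s for s in subfield_c if ACRONYM.search(s)]
```

Only the first value is a hit.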

2. How to find titles that contain two (or more) consecutive acronyms in the subfield $a. This example demonstrates the use of a subpattern, that is, a regular expression enclosed in parentheses:

TAG=245 
SUBF=a
DATA=([A-Z]{2,}\W){2,}
REGULAR EXPRESSION=Yes

Explanation: the subpattern ([A-Z]{2,}\W), consisting of an uppercase letter occurring two or more times followed by a non-word character, occurring two or more times.

Some results of #2:

$aProfessional ASP.NET 1.1
$aIBM PC update.
$aRF/IF signal processing handbook.
$aCyberlaw @ SA II
$aScholarly book reviews on CD-ROM
$aKI-ES-KI, directory of key contacts in Canadian education

Note: '.', '/', '-' are non-word characters.
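
To see which part of each title the subpattern actually matches, the results can be replayed in Python. This is a sketch only: it uses four of the results listed above plus one invented title with a single acronym, and the variable names are assumptions.

```python
import re

CONSECUTIVE_ACRONYMS = re.compile(r"([A-Z]{2,}\W){2,}")

titles = [
    "$aProfessional ASP.NET 1.1",
    "$aIBM PC update.",
    "$aRF/IF signal processing handbook.",
    "$aKI-ES-KI, directory of key contacts in Canadian education",
    "$aIBM update.",   # hypothetical: only one acronym, so no match
]

matched_text = {}
for title in titles:
    m = CONSECUTIVE_ACRONYMS.search(title)
    matched_text[title] = m.group(0) if m else None
```

The matched text includes the non-word character that follows each acronym, for example 'ASP.NET ' and 'KI-ES-KI,'.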

3. How to find subject headings that contain word(s) followed by the word 'fiction'. Easy one!

TAG=650
SUBF=a
DATA= fiction

4. How to find subject headings that contain two words followed by the word 'fiction'.

TAG=650
SUBF=a
DATA=\w+\s\w+\sfiction
REGULAR EXPRESSION=Yes

Explanation: One or more word characters \w+ followed by a single whitespace character \s followed by one or more word characters \w+ followed by a single whitespace character \s followed by 'fiction'.

Using a subpattern we can restate #4 as:

DATA=(\w+\s){2}fiction

This doesn't make the pattern much shorter, but we can now simply change the '2' if we want to change the number of words that must precede 'fiction'.

Some results of #4:

$aLatin American fiction
$aSpanish American fiction
$aStar Trek fiction.
$aStar Wars fiction
$aYoung adult fiction
$aYoung adult fiction, English
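
Both forms of the pattern can be compared in Python, as sketched below. The helper `words_before_fiction` is a hypothetical convenience for changing the word count, and 'Science fiction' is an invented heading added to show a non-match.

```python
import re

explicit = re.compile(r"\w+\s\w+\sfiction")
grouped = re.compile(r"(\w+\s){2}fiction")

def words_before_fiction(count: int):
    # Hypothetical helper: build the grouped pattern for any word count.
    return re.compile(r"(\w+\s){%d}fiction" % count)

headings = [
    "Latin American fiction",
    "Star Trek fiction.",
    "Young adult fiction, English",
    "Science fiction",   # hypothetical: only one word before 'fiction'
]

explicit_hits = [h for h in headings if explicit.search(h)]
grouped_hits = [h for h in headings if grouped.search(h)]
one_word_hits = [h for h in headings if words_before_fiction(1).search(h)]
```

The two forms return the same three headings, and lowering the count to 1 brings in 'Science fiction' as well.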

5. How to find summaries (520) containing at least a certain number of words (here, 50):

TAG=520
SUBF=a
DATA=(\w+\s){50,}
REGULAR EXPRESSION=Yes

Explanation: the subpattern (\w+\s), i.e., a word followed by a whitespace character, occurring at least 50 times {50,}

6. How to find summaries (520) containing at least 50 words but fewer than 100:

TAG=520
SUBF=a
DATA=(\w+\s){50,99}
REGULAR EXPRESSION=Yes

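Explanation: the subpattern (\w+\s), i.e., a word followed by a whitespace character, occurring at least 50 times but no more than 99 times {50,99}. Note that the pattern is not anchored, so a summary with 100 or more consecutive words also matches, because it contains a run of 50 to 99 words.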
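
The word-count patterns in #5 and #6 can be checked against generated text. In this Python sketch the helper `summary` is a hypothetical generator that produces a given number of words, each followed by one space. It also shows that an unanchored {50,99} still matches longer text, stopping after 99 words.

```python
import re

AT_LEAST_50 = re.compile(r"(\w+\s){50,}")
FROM_50_TO_99 = re.compile(r"(\w+\s){50,99}")

def summary(word_count: int) -> str:
    # Hypothetical test data: each word is followed by one space.
    return "word " * word_count

short_hit = AT_LEAST_50.search(summary(49)) is not None    # False
exact_hit = AT_LEAST_50.search(summary(50)) is not None    # True
mid_hit = FROM_50_TO_99.search(summary(75)) is not None    # True

# A 120-word summary still matches; the match covers the first 99 words.
long_match = FROM_50_TO_99.search(summary(120))
words_consumed = len(long_match.group(0).split()) if long_match else 0
```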


Please click here for MARC Global examples

Compatibility note

Customers who interface directly with MARC Report's validation module (valplus.dll) from their own software, and who have moved valplus.dll from its installed location, will find that valplus.dll fails to load if the PCRE library (pcrelib.dll), now distributed with the program, cannot be found.

The solution is to simply copy the file pcrelib.dll to wherever valplus.dll is being copied (if applicable), as valplus expects the PCRE library to be in the same folder.

We are not going to change valplus to look elsewhere for the file.

236/pcre_and_mr.txt · Last modified: 2021/12/29 16:21 (external edit)
CC Attribution-Share Alike 4.0 International