MARC Global: Split long fields

Use this option to break long fields into smaller fields at a specified length.

Why?

Some OPACs may truncate a display field at a certain number of characters, and some systems may return an error when trying to load a record with a very long field. You might use this option to identify these long fields and split them into shorter fields. Typical examples for this type of change would be the MARC 505 and 520 fields.

The option form for 'Split long fields' looks like this:

In the top part of the form you must enter a pattern that specifies the tags to match. This pattern must use the same syntax as that specified in the MARC Review Help file for identifying fields of a specified length, with one exception: the MARC Global change does not apply to subfields–it applies to the tag only.

There are two options:

The first is the requested maximum length of the tags created by the 'split' processing. We refer to this number below as the 'Break At' position. This may be the same as the length specified above, or it may be shorter–but it cannot be longer.

The second option is whether to leave a blank space at the end of each new tag, except the last. Depending on how your system reconstructs these fields for display, a blank space might be needed when the fields are re-joined.

The program uses the following logic to split the fields.

As each record is processed, the pattern in the top part of the form is applied. If no fields match, the program advances to the next record. If fields match, each field longer than the requested length is split as follows. Beginning at the position specified in the 'Break At' option, the program reads the field backwards until it finds a blank space. It then copies all of the data from the starting position (initially, the beginning of the field) up to the blank space into a buffer. The next byte after the blank space becomes the new starting position, and the 'Break At' position is advanced accordingly. The process is repeated until there are not enough bytes remaining to support a new 'Break At' position. At this point, the remainder of the field is added as the last entry in the new tag buffer.
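The splitting logic described above can be sketched in a few lines of Python. This is only an illustration, not the program's actual code: the function name `split_field` and its parameters are hypothetical, the text is treated as one byte per character, the per-piece overhead is ignored, and raising an error when no blank space is found is an assumption about how an unsplittable field might be handled.

```python
def split_field(text: str, break_at: int, keep_space: bool = False) -> list[str]:
    """Split text into pieces no longer than break_at, breaking at blank spaces.

    Hypothetical sketch: ignores the $8 prefix, indicators and terminator.
    """
    pieces = []
    start = 0
    # Keep splitting while more than break_at characters remain.
    while len(text) - start > break_at:
        # Read backwards from the break position to the nearest blank space.
        cut = text.rfind(" ", start + 1, start + break_at)
        if cut == -1:
            # Assumption: a run with no blank space cannot be split.
            raise ValueError("no blank space found before the break position")
        # Optionally keep the blank at the end of the piece.
        end = cut + 1 if keep_space else cut
        pieces.append(text[start:end])
        # The byte after the blank becomes the new starting position.
        start = cut + 1
    # Whatever remains becomes the last piece.
    pieces.append(text[start:])
    return pieces
```

With `keep_space=True`, joining the pieces back together with no separator reproduces the original text, which is the situation the second option on the form is meant to support.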

Next, the new (buffered) fields are added to the MARC record in the order that they were created. To support sorting and programmatic reconstruction of the split field, a field link and sequence number identifier is added to the beginning of each new field (in subfield $8, exactly as described in the 'General sequencing' section of http://www.loc.gov/marc/bibliographic/ecbdcntf.html ).

For example, if a 505 is split into three pieces, each new piece would have the following coding added to the beginning of the field:

$81.1\x
$81.2\x
$81.3\x

The next 505 that is split from the same record would have:

$82.1\x
$82.2\x
...

and so on.
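As a rough sketch of how another program might use these $8 values to put a split field back together, the Python below groups pieces by link number, sorts them by sequence number, and joins the text. The names `link_value` and `rejoin` are hypothetical, and the input is assumed to be simple (subfield $8 value, text) pairs that have already been extracted from the record.

```python
import re
from collections import defaultdict

# Matches values such as 1.2\x : link number, sequence number, link type 'x'.
LINK_RE = re.compile(r"^(\d+)\.(\d+)\\x$")


def link_value(link: int, seq: int) -> str:
    """Build the $8 value for one piece, e.g. 1.2\\x for link 1, piece 2."""
    return f"{link}.{seq}\\x"


def rejoin(pieces: list[tuple[str, str]], separator: str = "") -> dict[int, str]:
    """Group (subfield $8 value, text) pairs by link number and join in order."""
    groups = defaultdict(list)
    for value, text in pieces:
        match = LINK_RE.match(value)
        if match is None:
            continue  # not a general-sequencing link; skipped in this sketch
        link, seq = int(match.group(1)), int(match.group(2))
        groups[link].append((seq, text))
    return {
        link: separator.join(text for _, text in sorted(parts, key=lambda p: p[0]))
        for link, parts in groups.items()
    }
```

If the pieces were written without a trailing blank space, passing `separator=" "` restores the blank that was dropped at each break.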

Adding the field link and sequence number, repeating the indicators of the original tag, inserting a leading subfield, and adding a field terminator together add about 12-14 bytes of overhead to each new tag created. This overhead is taken into account when the program sets each 'break at' position, so that the final length of the tag never exceeds the specified value.
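One way to account for the 12-14 byte figure is sketched below in Python. The byte breakdown is an assumption made for illustration (two indicators, a two-byte subfield delimiter and code for $8, the link value itself, a two-byte delimiter and code for the leading text subfield, and a one-byte field terminator); it is not taken from the program, and the function names are hypothetical.

```python
def piece_overhead(link: int, seq: int) -> int:
    """Estimate the bytes added to one new tag (assumed breakdown, see above)."""
    indicators = 2
    # Subfield delimiter + code '8' + a value such as 1.1\x
    link_subfield = 2 + len(f"{link}.{seq}\\x")
    # Subfield delimiter + code for the subfield that carries the text
    leading_subfield = 2
    terminator = 1
    return indicators + link_subfield + leading_subfield + terminator


def text_budget(break_at: int, link: int, seq: int) -> int:
    """Bytes left for the text itself once the overhead is subtracted."""
    return break_at - piece_overhead(link, seq)
```

Under these assumptions the overhead is 12 bytes when the link and sequence numbers are single digits and 14 bytes when both have two digits, so a 'break at' value of 100 would leave roughly 86-88 bytes for text in each piece.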

Finally, once the new tags have been added, the original matching tags (the ones that were 'split') are deleted.

Notes and caveats

First, please keep in mind that this is a machine process, and the split fields produced may not be 'pretty' in some cases.

This option is intended only for MARC fields that contain words, like the 505 or 520.

MARC Fields that do not contain blank spaces cannot be split using this option.

The minimum break at position is 100 bytes.

It is not possible to get fields that exactly match the 'break at' length using this routine; but all fields, except for the last, should be approximately the 'break at' length, while never exceeding it.

This option, although it works as designed, may not work as expected with long fields that have already been split, since there is no way to tell if, for example, two 505s in an existing record have already been split (none of the example records we have seen take measures to indicate the sequencing of split tags). Some manual re-ordering may be necessary in this case, especially if the previously split tags are not in order to begin with.

You may wonder if it makes sense to set the 'break at' position to a value smaller than the length specified in the pattern. For example, if you know that the system chokes on fields longer than 4000 bytes, then breaking the tag at that point could conceivably generate a tag that is only a couple of bytes long (if the original tag was, say, 4002 bytes long). But breaking these tags at, for example, 3800 bytes, would mean the shortest length of a split tag would be about 201 bytes, which should create a readable portion of text, unless of course the original tag was, say, 7602 bytes long (two 3800-byte pieces plus a 2-byte remainder)! So using this option wisely may require a bit of research: find out if you have any tags with lengths that are just over a multiple of the 'break at' position and perhaps handle them manually.
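The arithmetic behind this advice can be checked with a small Python sketch. It is deliberately simplified: it ignores the per-piece overhead and the fact that breaks fall on blank spaces, so the real lengths will differ somewhat. The function names and the 50-byte threshold are hypothetical.

```python
def last_piece_length(field_length: int, break_at: int) -> int:
    """Approximate length of the final piece (no overhead, no word boundaries)."""
    remainder = field_length % break_at
    # A field that divides evenly ends with a full-length piece.
    return remainder if remainder else break_at


def risky_lengths(lengths: list[int], break_at: int, min_readable: int = 50) -> list[int]:
    """Return the field lengths whose final piece would be shorter than min_readable."""
    return [
        n for n in lengths
        if n > break_at and last_piece_length(n, break_at) < min_readable
    ]
```

For instance, `last_piece_length(4002, 4000)` is 2 and `last_piece_length(4001, 3800)` is 201, while a field of 7602 bytes split at 3800 would again end in a 2-byte piece. Running something like `risky_lengths` over the field lengths in a file is one way to find the tags worth handling manually.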

There is presently no overt option to suppress the addition of the field link and sequence numbering. In our testing, we found the resulting data easily became jumbled up without this information. However, as a reward for reading the Help this far, we can tell you that adding '/8' to the end of the 'Break At' position on the options form will indeed suppress this behavior.

244/split_long_fields.txt · Last modified: 2015/03/26 09:33 (external edit)
CC Attribution-Noncommercial-Share Alike 3.0 Unported