MARC-8 and UTF-8 data validation

It is becoming increasingly important that records contain only those characters permitted by the coding scheme indicated in the leader/09. In the days when MARC-8 was the only game in town, every byte was a character, and batchloading processes did not need to perform much validation on the data at this level.

With UTF-8, however, although most characters in a record still occupy one byte, all diacritics occupy at least two bytes, and some characters (including CJK) require three or more. When a file is fed to a batchloading process, the processor is in most cases put into either MARC-8 mode or UTF-8 mode, and any record containing a character that does not match that mode will cause an error and be rejected.
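The byte counts described above can be checked with a few lines of Python. This is only an illustrative sketch; the sample characters and the helper name `utf8_length` are our own choices and are not part of MARC Report.

```python
# Sample characters chosen for illustration only.
samples = {
    "plain ASCII letter": "a",
    "combining acute accent (U+0301)": "\u0301",
    "CJK ideograph (U+4E2D)": "\u4e2d",
}


def utf8_length(ch: str) -> int:
    """Return the number of bytes needed to encode ch in UTF-8."""
    return len(ch.encode("utf-8"))


for label, ch in samples.items():
    print(f"{label}: {utf8_length(ch)} byte(s)")
```

The ASCII letter takes one byte, the combining accent two, and the CJK ideograph three.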

This version of MARC Report contains improved support for checking the validity of the characters used in a record. This support will be evident in several aspects of the program: when editing or viewing records, when running batch mode, when running filing indicator checks in MARC Report, or fixing filing indicators in MARC Global, etc.

By default, MARC Report checks character coding by consulting the value of the leader/09: if it is blank, the record is checked using the MARC-8 coding rules; if it is 'a', the UTF-8 coding rules are used. Because this decision is made record by record, MARC Report does not require that all records in a file use the same character coding.
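The per-record decision can be sketched roughly as follows. This is not MARC Report's own code; the function name `coding_rules` and the sample leaders are assumptions made for illustration.

```python
def coding_rules(record: bytes) -> str:
    """Pick the coding rules for one record from leader/09 (hypothetical helper)."""
    if len(record) < 24:
        raise ValueError("record is shorter than a 24-byte leader")
    code = record[9:10]  # leader/09, counting from zero
    if code == b" ":
        return "MARC-8"
    if code == b"a":
        return "UTF-8"
    raise ValueError(f"unexpected leader/09 value: {code!r}")


# Two made-up leaders that differ only in position 09.
marc8_leader = b"00000nam  2200000 a 4500"
utf8_leader = b"00000nam a2200000 a 4500"

for leader in (marc8_leader, utf8_leader):
    print(coding_rules(leader))
```

Because the choice is made for each record separately, a file that mixes both kinds of record poses no problem for this approach.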

When a character coding error is found, an 'invalid character' message will appear in the message list on the right. Depending on whether the record is MARC-8 or UTF-8, slightly different wording may be used. But in either case, clicking on the message will put your cursor on the field and display a note that should guide you to the offending character:

In the example above, the note gives a position of '8' in the 260 field, which would be the diacritic between the 'n' and the 'v' (the position is counted relative to the indicators). If you are in Edit mode, the program should put your cursor on the problem character.1)
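For readers who want to see how such a position could be located outside the program, here is a minimal Python sketch. It assumes the field is available as raw bytes starting at the first indicator and that offsets are counted from zero; MARC Report's own counting convention may differ. The field content is invented, and `first_invalid_utf8_offset` is a hypothetical helper.

```python
from typing import Optional


def first_invalid_utf8_offset(field: bytes) -> Optional[int]:
    """Return the offset of the first byte that breaks UTF-8 decoding, or None."""
    try:
        field.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start  # index of the byte where decoding failed
    return None


# Invented field: two blank indicators, subfield delimiter 0x1F, code 'a',
# then text containing the MARC-8 combining acute (0xE2) before its base letter.
field = b"  \x1faKarvin\xe2a"
print(first_invalid_utf8_offset(field))
```

In UTF-8, the byte 0xE2 announces a three-byte sequence, so finding it directly before an ASCII letter is enough to make the field invalid.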

Search a file for invalid characters

You can easily find all records with invalid characters in a file using the following steps:

  1. Select your file in MARC Report
  2. Go to the Options menu and select Cataloging Checks
  3. In the 'Current Set' box on the left, select the 'INVALID CHARACTER' set
  4. Uncheck all the main options on the page, as in the screenshot below
  5. Go to the 'Validation' page and deselect 'Enable Validation'
  6. Click Save

(What you have just done is to turn off all error checking in the program except for the checks for invalid characters.)

Now go to the File menu and select 'Run Batch mode'. Depending on the size of your file, you will soon have a report of any/all invalid character problems.
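If you want a rough cross-check outside MARC Report, a comparable scan can be sketched in Python. This is an assumption-laden illustration, not the program's method: it only tests records coded 'a' in the leader/09 for well-formed UTF-8, skips MARC-8 records entirely, and uses invented sample data that is not a complete MARC record (there is no directory). The function name `find_invalid_utf8_records` is hypothetical.

```python
from typing import Iterator, Tuple

RECORD_TERMINATOR = b"\x1d"  # MARC record terminator


def find_invalid_utf8_records(data: bytes) -> Iterator[Tuple[int, int]]:
    """Yield (record_number, byte_offset) for UTF-8 records that fail to decode."""
    records = [r for r in data.split(RECORD_TERMINATOR) if r]
    for number, record in enumerate(records, start=1):
        if record[9:10] != b"a":
            continue  # MARC-8 (or unknown) record: not checked in this sketch
        try:
            record.decode("utf-8")
        except UnicodeDecodeError as err:
            yield number, err.start


# Invented sample data: a leader followed by a scrap of text, no directory.
utf8_leader = b"00000nam a2200000 a 4500"
marc8_leader = b"00000nam  2200000 a 4500"
data = RECORD_TERMINATOR.join([
    utf8_leader + "Karvina\u0301".encode("utf-8"),  # valid UTF-8
    utf8_leader + b"Karvin\xe2a",                   # MARC-8 diacritic, UTF-8 leader
    marc8_leader + b"Karvin\xe2a",                  # MARC-8 record, skipped
]) + RECORD_TERMINATOR

for number, offset in find_invalid_utf8_records(data):
    print(f"record {number}: invalid byte at offset {offset}")
```

Only the second sample record is reported, since it carries a MARC-8 diacritic under a UTF-8 leader code.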

1)
What to do in this case? It's obvious that something is wrong with the record, as it contains MARC-8 diacritics but a UTF-8 code in the leader/09. We recommend replacing the record if possible.
232/charactercoding.txt · Last modified: 2021/12/29 16:21 (external edit)