Character display issues in MARCXML

When viewing or editing a record in MARC Report, a MARCXML view of the record can be obtained by pressing the <F5> key. In most cases, special characters, diacritics, foreign script, etc., display correctly in this view.

In some cases, however, the MARCXML view might display some characters as empty boxes. Though this may seem an anomaly, it's likely that the underlying character is coded correctly. The rest of this article explains why this happens and what can be done about it.

When <F5> is pressed, the first thing that the program does is run its built-in MARC to XML conversion. This creates an XML document with UTF-8 encoding; if there are any encoding issues, the document will not display, and/or an error message pointing to the invalid character will be shown by the program. On the other hand, if a MARCXML record is subsequently displayed in a new window, then there are likely no problems in the encoding.
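As an illustration of the kind of check described above, the following Python sketch reports the position of the first byte that is not valid UTF-8. It is not MARC Report's own conversion code; the function name first_invalid_utf8 is hypothetical, and the sample byte strings are made up for the example.

```python
def first_invalid_utf8(data: bytes):
    """Return (offset, byte value) of the first byte that breaks strict
    UTF-8 decoding, or None if the data decodes cleanly."""
    try:
        data.decode("utf-8")  # strict decoding raises on any invalid sequence
    except UnicodeDecodeError as err:
        return err.start, data[err.start]
    return None


# Valid UTF-8: the letter H followed by subscript one (E2 82 81)
good = "H\u2081".encode("utf-8")

# Invalid UTF-8: a lone Latin-1 e-acute byte (E9) inside otherwise ASCII text
bad = b"caf\xe9 noir"

print(first_invalid_utf8(good))
print(first_invalid_utf8(bad))
```

If the function returns None, the data is at least well-formed UTF-8, which corresponds to the case where the MARCXML view opens without an error.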

The MARCXML view in MARC Report uses an OLE control that takes its display settings from Internet Explorer. Therefore, the font selected in IE will be the one used in an 'F5' display in MARC Report 1).

However, although a record may be correctly encoded in UTF-8, not all characters can be displayed in all fonts. For example, consider the UTF-8 encoding for subscript one: \xE2\x82\x81 2). It's possible that you have only one font on your computer that supports the correct display of this character: Arial Unicode MS. Unless your IE settings use this font, this character is likely to display as an empty box.
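The byte values quoted for subscript one can be reproduced with a few lines of Python. This is only a sketch for checking the numbers; the helper name describe is hypothetical and is not part of MARC Report.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Return the Unicode name, code point, and UTF-8 bytes of one character."""
    encoded = ch.encode("utf-8")
    return {
        "name": unicodedata.name(ch),
        "code_point": f"U+{ord(ch):04X}",
        "hex": " ".join(f"{b:02X}" for b in encoded),
        "decimal": "-".join(str(b) for b in encoded),
    }


info = describe("\u2081")  # U+2081 is subscript one
print(info)
```

The "hex" entry corresponds to the sequence given in the text, and the "decimal" entry to the form given in footnote 2.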

Therefore, seeing an empty box in the 'F5' display is not an indication that the underlying character in the record is incorrectly encoded. The only way to know for sure is to open the MARC record in a binary editor, go to the position where the character is supposed to be, and check whether it is indeed encoded with the correct byte sequence.
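If a binary editor is not at hand, the same byte-level check can be approximated with a short script. The sketch below searches raw bytes for the UTF-8 sequence of subscript one and prints the surrounding bytes in hexadecimal. The function names find_sequence and hex_context are hypothetical, and the sample data stands in for the contents of a real record.

```python
def find_sequence(data: bytes, seq: bytes) -> list:
    """Return every byte offset at which seq occurs in data."""
    offsets = []
    pos = data.find(seq)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(seq, pos + 1)
    return offsets


def hex_context(data: bytes, offset: int, width: int = 8) -> str:
    """Return the bytes from `width` before offset to `width` after it, as hex."""
    start = max(0, offset - width)
    chunk = data[start:offset + width]
    return " ".join(f"{b:02X}" for b in chunk)


SUBSCRIPT_ONE = b"\xe2\x82\x81"

# Stand-in for raw record bytes; a real check would read the file in binary mode
sample = "H\u2081 and H\u2081".encode("utf-8")

for off in find_sequence(sample, SUBSCRIPT_ONE):
    print(off, hex_context(sample, off, 4))
```

To examine an actual file, the sample could be replaced with bytes read via open(path, "rb"), where path is whatever raw UTF-8 MARC file you want to inspect.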

Another option for the empty box issue is to change your IE font options. To do this in version 8 of Internet Explorer, start IE, and select 'Tools–Internet options–Fonts'; then, in the 'Webpage font' box, select the font called 'Arial Unicode MS' and click OK. Next, click the 'Accessibility' button (next to the 'Fonts' button at the bottom), and select the option 'Ignore font styles specified on web pages'. Click OK twice and exit IE. If you now open MARC Report and use <F5> to view a record containing a subscript one, it will display correctly.

Helpful links

LC has an XML file on its site that contains the encodings for all MARC-8 characters 3). For each character, it lists the MARC-8 encoding, the UTF-8 encoding, and the Unicode code point.
http://www.loc.gov/marc/specifications/codetables.xml

A good general reference site for character encoding and font display issues is:
http://www.fileformat.info
For example, if you go to this site and search 'subscript one', a link to a vast amount of information about this one character will appear, including all imaginable encodings, a browser test page, etc.

Finally, we have found WinHex to be a good binary editor (for viewing the actual bytes used in a raw MARC record):
http://www.winhex.com/winhex/
It is extremely fast and knows no bounds with regard to file sizes.

1) Even if IE is not your default browser, this dependency still applies
2) or: 226-130-129, in decimal notation
3) essentially, every character likely to be used in a MARC record
help/f5_and_fonts.txt · Last modified: 2013/04/27 09:09 (external edit)
CC Attribution-Noncommercial-Share Alike 3.0 Unported