Marc/XML Utilities changes in 233

Before version 233, the MARC to XML utility could not process a MARC file larger than 2.1 GB. This limitation has been removed.

The performance of the XML to MARC utility has been greatly improved.

A new option in the MarcToXml utility writes each record in a MARC file to its own MARCXML file (a necessary step if you want to publish your MARC data using OAI).
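
As a rough illustration of what one-record-per-file output looks like (this is not the MarcToXml implementation), the Python sketch below splits a MARCXML collection into standalone record documents. It assumes the input uses the standard MARCXML namespace; the function name `split_records` is hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List

MARC_NS = "http://www.loc.gov/MARC21/slim"  # standard MARCXML namespace


def split_records(xml_text: str) -> List[str]:
    """Return each <record> in a MARCXML document as its own XML string.

    Hypothetical helper for illustration; assumes namespaced MARCXML input.
    """
    # Serialize with the conventional 'marc:' prefix instead of 'ns0:'.
    ET.register_namespace("marc", MARC_NS)
    root = ET.fromstring(xml_text)
    documents = []
    for record in root.iter("{%s}record" % MARC_NS):
        record.tail = None  # drop whitespace that followed the element
        documents.append(ET.tostring(record, encoding="unicode"))
    return documents
```

Each returned string can then be written to a file of its own (for example 'record-000001.xml', a made-up naming scheme) before being handed to an OAI repository.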

The MarcToXml utility now writes any records that it cannot convert to a file called 'hybrids-nnnn.mrc' (where 'nnnn' is a datestamp). These records contain encodings that are not valid in MARC-8 but may be valid in UTF-8, hence the term 'hybrid'. In some cases, these records can be repaired simply by changing leader/09 to 'a' and re-validating with the 'INVALID CHARS' cataloging check set in MARC Report. The errant codes may also reflect a local practice (for example, a non-compliant character used as a currency symbol in the 020), in which case the characters can be corrected with MARC Global. Beyond these two cases, repairing the offending characters may not be practical, especially where MARC-8 escape sequences are involved; it is then best to obtain a new record for the resource from LC.

A few minor problems in both conversion utilities have been fixed in 233:

  • sluggish screen response/updates
  • the MARC Mapping being used for CJK code x21x23x28 was incorrect
  • the program did not recognize qualified tags (e.g., '<marc:leader>')
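
To show what the qualified-tag fix means in practice, here is a minimal Python sketch (not the utility's own code) that reads the leader whether the element is written as '<marc:leader>', as '<leader>' under a default namespace, or with no namespace at all. The helper names `local_name` and `read_leader` are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Optional


def local_name(tag: str) -> str:
    """Strip the '{namespace}' part that ElementTree puts before a tag name."""
    return tag.rsplit("}", 1)[-1]


def read_leader(xml_text: str) -> Optional[str]:
    """Return the text of the first leader element, qualified or not."""
    for elem in ET.fromstring(xml_text).iter():
        if local_name(elem.tag) == "leader":
            return elem.text
    return None
```

Comparing on the local name is a deliberately lenient choice; a stricter reader would also check that the namespace is the MARCXML one.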

Converting MARC-8 records to UTF-8 encoding

It is possible, using the XML conversion utilities in MARC Report, to convert MARC-8 records to UTF-8 by running the two utilities back-to-back:

  1. Run the MARC to XML utility on your file of MARC-8 records
  2. Run the XML to MARC utility on the XML file created by step #1

If you are going to use this technique on a large file (millions of records), we recommend that you first split the MARC source file into records that contain diacritics or other non-ASCII characters and records that do not (see this article for how to do this in MARC Review). Records without non-ASCII characters can be converted to UTF-8 simply by changing the leader/09 code from ' ' (blank) to 'a', which is easily done in MARC Global. Then run the two steps listed above on the file that contains the records with diacritics. On a large file, this approach may save several hours of processing time.
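
For readers who want to see the split-and-flip idea outside MARC Review and MARC Global, the following Python sketch works directly on the raw bytes of a MARC file. It assumes every input record is MARC-8 (leader/09 blank) and treats a record as plain ASCII only if it has no byte above 0x7F and no ESC byte, since MARC-8 escape sequences can switch character sets while using ASCII-range bytes. All function names are hypothetical, and the whole file is held in memory for brevity.

```python
RECORD_TERMINATOR = b"\x1d"  # ISO 2709 end-of-record byte
ESCAPE = 0x1B                # starts a MARC-8 character-set escape sequence


def iter_records(data: bytes):
    """Yield each raw MARC record, terminator included."""
    for chunk in data.split(RECORD_TERMINATOR):
        if chunk.strip():  # skip the empty piece after the last terminator
            yield chunk + RECORD_TERMINATOR


def is_plain_ascii(record: bytes) -> bool:
    """True if the record has no high-bit bytes and no escape sequences."""
    return all(b < 0x80 and b != ESCAPE for b in record)


def flip_to_utf8(record: bytes) -> bytes:
    """Set leader/09 to 'a'; the record length is unchanged."""
    return record[:9] + b"a" + record[10:]


def split_for_conversion(data: bytes):
    """Return (flipped ASCII-only records, records needing full conversion)."""
    plain, needs_conversion = [], []
    for record in iter_records(data):
        if is_plain_ascii(record):
            plain.append(flip_to_utf8(record))
        else:
            needs_conversion.append(record)
    return b"".join(plain), b"".join(needs_conversion)
```

The second value returned is the file to run through the two utilities; the first is already valid UTF-8 because ASCII bytes are encoded identically in both schemes. For files with millions of records, a streaming reader would be preferable to loading everything at once.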

233/xml_utilities.txt · Last modified: 2013/04/27 09:09 (external edit)