MARC ANALYSIS

MARC Analysis reads a MARC file and gathers statistics about the MARC coding contained in the file. This utility does not change the file in any way and is safe to run on any MARC file.

The main output of MARC Analysis is a plain-text report. This report is self-explanatory and, as it can be very long, it will not be repeated here. Also note that many of the report sections described here can be turned on or off–see 'OPTIONS', below.

At the top of the report are general statistics about the file, for example: the number of records in the file, the average record size, the average number of tags per record, a list of tags present in every record, a list of subfields used, and so on. (For an up-to-date list of the 'general' statistics, simply run the utility and open the report.)

After these general file statistics, a detailed report for every tag follows.

For the leader and fixed fields, each data element is broken out and reported on separately.

For the leader, a list and count of all values for each position is given (with the exception of the base address; record lengths are optionally counted in ranges of 50 bytes; see the option for this below).

The fixed fields are reported on in groups based on the record format or material type; within each of these groups, a list and count of all values for each position is given.

For each tag, the main report includes the following seven columns:

Records: the number of records in which the tag appeared
TotOccs: the total occurrence count of the tag in the file
MaxOccs: the maximum occurrences of the tag in a single record
TotalSz: the total number of bytes used by the tag's data
AvgSize: the average size of the tag's data (in bytes)
Longest: the longest occurrence of the tag (in bytes)
Shortest: the shortest occurrence of the tag (in bytes)
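As an illustration only, the seven columns could be computed along the following lines. This is a minimal Python sketch, not the program's actual code: the function name `tag_stats` and the input layout (each record as a list of `(tag, data_bytes)` pairs) are assumptions made for the example.

```python
from collections import defaultdict

def tag_stats(records):
    """records: list of records; each record is a list of (tag, data_bytes) pairs.

    Hypothetical helper: returns the seven report columns for every tag.
    """
    stats = defaultdict(lambda: {"Records": 0, "TotOccs": 0, "MaxOccs": 0,
                                 "TotalSz": 0, "Longest": 0, "Shortest": None})
    for record in records:
        per_record = defaultdict(int)          # occurrences of each tag in this record
        for tag, data in record:
            s = stats[tag]
            size = len(data)
            per_record[tag] += 1
            s["TotOccs"] += 1
            s["TotalSz"] += size
            s["Longest"] = max(s["Longest"], size)
            s["Shortest"] = size if s["Shortest"] is None else min(s["Shortest"], size)
        for tag, occs in per_record.items():
            stats[tag]["Records"] += 1
            stats[tag]["MaxOccs"] = max(stats[tag]["MaxOccs"], occs)
    for s in stats.values():
        s["AvgSize"] = s["TotalSz"] / s["TotOccs"]
    return dict(stats)

sample = [
    [("245", b"Title one"), ("650", b"Cats"), ("650", b"Dogs and cats")],
    [("245", b"Second"), ("650", b"Birds")],
]
result = tag_stats(sample)
print(result["650"])
```

In this sample, tag 650 appears in two records, three times in total, and at most twice in a single record.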

In addition, for variable tags, the total number of times each indicator and subfield value appeared in the tag is also listed.

For each subfield that appeared in a tag:

- a breakdown of the occurrences of the subfield in the tag (how many times the subfield appeared once in the tag, how many times it appeared twice, etc.)
- the average size of the subfield's data (in bytes)
- the shortest occurrence of the subfield (in bytes)
- the longest occurrence of the subfield (in bytes)

For all subfields recorded in a tag:

- a list of subfield usage patterns and their counts
- a list of subfield punctuation marks and their counts

Following this detailed tag-by-tag report, two character encoding reports are displayed. Each tries to account for every byte that appears in the file (except for those bytes that are used in the directory). The first table lists all of the MARC-8 bytes, and the second table lists all of the UTF-8 sequences. The UTF-8 sequences are only tracked for those records where the Leader/09 byte is set to 'a'. NOTE: It's possible that the total bytes counted will differ between the MARC-8 and UTF-8 reports (because a UTF-8 sequence can comprise more than one byte).

Following this are a number of lists that add some detail to the general statistics. For example, the list of tags present in the file ordered by frequency, the list of tags present in the file ordered by size (number of bytes), etc. These lists display both a count and a percentage for each tag. This information sheds light on what tags are used most in a file, what tags are used least, etc.

RUNNING MARC ANALYSIS

Simply press the 'Analyze' button to start MARC Analysis.

When the analysis run is complete, a report will be generated. Press 'Text' to view the report in your default text editor.

Click 'Exit', or the [x] in the top right corner, to return to MARC Report.

Once the program is running, press the 'Stop' button if you need to abort the run, in which case a partial report will be generated.

The name of the MARC file being analyzed is given in the status bar at the bottom. To use a different MARC file, double-click on the picture (the green rolling hills of the Coromandel peninsula)–this will open Windows Explorer and let you select any file on your system.

RESULTS

The report generated by MARC Analysis is written to a file in the same directory as the MARC file that you are analyzing. The filename is composed by appending '-analysis.txt' to the end of the MARC filename. If the program determines that the directory of the MARC source file is not writeable (because of a permissions issue), it will redirect the report to the 'Batch Reports' directory (as defined in your MARC Report options).
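The naming rule above can be sketched roughly as follows. This Python fragment is illustrative only: `report_path` and `fallback_dir` are hypothetical names, and it assumes that the suffix is appended to the complete filename, extension included (the text above does not say whether the extension is kept).

```python
import os

def report_path(marc_path, fallback_dir, suffix="-analysis.txt"):
    """Hypothetical helper: choose where the report would be written.

    fallback_dir stands in for the 'Batch Reports' directory.
    """
    directory, filename = os.path.split(os.path.abspath(marc_path))
    # Use the source directory when it is writeable, else the fallback.
    target_dir = directory if os.access(directory, os.W_OK) else fallback_dir
    return os.path.join(target_dir, filename + suffix)

print(os.path.basename(report_path("records.mrc", "BatchReports")))
```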

You can also optionally navigate and save the results to a directory, and to a filename, of your own choosing.

You can also output the report in HTML format; in this case it will be called '-analysis.html' by default. The HTML report can be viewed in your web browser; one reason for using this format might be to easily browse from a tag to the MARC documentation at LC.

OPTIONS

Click the Options button to change the program options. There are three groups of options: Data collection, Custom lists, and Output options.

DATA COLLECTION

These options make it possible to eliminate unwanted sections from the report.

To quickly enable/disable all of the Data Collection options click the 'Select/Unselect All' links at the bottom of the form.

Note that unselecting the data collection options has little effect on performance, because the data is collected anyway. The main point of these options is to customize the resulting report.

LEADER ELEMENTS

If selected, the program will build a table of each code that appears, and the number of times it is used, in each of the positions defined in the Leader.
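A rough Python sketch of such a per-position table follows. It is not the program's code: the function name `leader_tables` is hypothetical, and only a handful of single-character Leader positions are tallied, chosen for the example.

```python
from collections import Counter, defaultdict

# A few single-character Leader positions (MARC 21 bibliographic format).
POSITIONS = {
    5: "Record status",
    6: "Type of record",
    7: "Bibliographic level",
    9: "Character coding scheme",
    17: "Encoding level",
    18: "Descriptive cataloging form",
}

def leader_tables(leaders):
    """Hypothetical helper: count the codes found at each Leader position."""
    tables = defaultdict(Counter)
    for leader in leaders:
        for pos in POSITIONS:
            tables[pos][leader[pos]] += 1
    return tables

tables = leader_tables([
    "00714cam a2200205 a 4500",
    "01142nam a2200301 a 4500",
])
for pos, label in POSITIONS.items():
    print(f"Leader/{pos:02d} ({label}): {dict(tables[pos])}")
```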

RECORD LENGTHS

This option groups all the record lengths in the file (Leader/00-04) into 50-byte categories. For example:

Record Length Analysis

450-499:	56	(2.583%)
500-549:	51	(2.352%)
550-599:	51	(2.352%)
600-649:	87	(4.013%)
650-699:	116	(5.351%)
700-749:	117	(5.397%)
...

The first row can be read as '56 records, representing 2.583% of the records in the file, had record lengths between 450 and 499 bytes'.
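The grouping can be pictured with a short Python sketch. The function name `length_ranges` is hypothetical, and the record lengths are assumed to have already been read from Leader/00-04 as integers.

```python
from collections import Counter

def length_ranges(record_lengths, width=50):
    """Hypothetical helper: group record lengths into ranges of `width` bytes."""
    counts = Counter((n // width) * width for n in record_lengths)
    total = len(record_lengths)
    rows = []
    for low in sorted(counts):
        pct = 100.0 * counts[low] / total
        rows.append(f"{low}-{low + width - 1}:\t{counts[low]}\t({pct:.3f}%)")
    return rows

rows = length_ranges([450, 499, 500, 612])
print("\n".join(rows))
```

Integer division by the range width gives the lower bound of each range, so 450 and 499 both land in the '450-499' row.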

001 PREFIXES

If selected, the program builds a list of alphabetic prefixes in the 001 field. A prefix is considered to be any group of non-numeric characters that begins the 001 field.
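The prefix rule can be expressed as a regular expression. The Python sketch below is an illustration under assumptions: `prefix_001` is a hypothetical name, and control numbers with no prefix are simply left out of the list.

```python
import re
from collections import Counter

# Any run of non-digit characters at the start of the 001 value.
_PREFIX = re.compile(r"^\D+")

def prefix_001(value):
    """Hypothetical helper: return the leading non-numeric prefix, or ''."""
    match = _PREFIX.match(value)
    return match.group(0) if match else ""

values = ["ocm12345678", "ocn123456789", "on1234567890", "12345", "ocm00000001"]
counts = Counter(p for p in map(prefix_001, values) if p)
print(counts)
```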

003 CODES

If selected, the program builds a list of the control number identifiers that appear in the 003 field. The maximum length of this field is 8 characters; the program removes any blank spaces at the end of the field before adding the string to the list.

005/LENGTH=

If selected, the program builds a list of dates from the 005 field. By default, the list is created using only the first six positions in the 005, i.e. YYYYMM (down to the month level).

The length of the data collection from the 005 field can be controlled by changing the value for the '005/Length' option; by default, it is set to '6'. By changing the '6' (YYYYMM) to a '4' (YYYY), for example, a list of 4-digit years will be generated.

Note: the format of the 005 field is as follows: YYYYMMDDHHMMSS.F, where F represents a fraction of a second (a single digit, tenths of a second, in MARC 21).
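The effect of the length setting can be sketched in a few lines of Python. The function name `dates_005` is hypothetical; the sketch assumes the 005 values have already been extracted as strings.

```python
from collections import Counter

def dates_005(values, length=6):
    """Hypothetical helper: count 005 values truncated to `length` characters."""
    return Counter(v[:length] for v in values)

sample = ["20211229162100.0", "20211201080000.0", "20200115120000.0"]
by_month = dates_005(sample)        # default length 6: YYYYMM
by_year = dates_005(sample, 4)      # length 4: YYYY
print(by_month, by_year)
```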

006/007/008 ELEMENTS

If selected, the program will build a table of each code that appears, and the number of times it is used, in each of the positions defined in the 006, 007, and 008.

The reports for these tags are grouped into sections. For the 006, these sections are defined by the form of material (006/00); for the 007, these sections are defined by the type of material (007/00); and for the 008, these sections are defined by the record type (as coded in the Leader/06 and /07).

In addition, each fixed field report begins with a table of 'common' elements, that is, elements that are not dependent on material type (006/00, 007/00, and 008 Date Type, Date 1, Date 2, Country code, Language code, Modified record code, and Cataloging source). The report of values broken down by material type or record type follows.

008/00 LENGTH=

The length of the data collection from the 008 Date Entered field can be controlled by changing the value for the '008/00 Len' option; by default it is set to '2', which will collect only the 'year' of the entry.

LCCN PREFIXES

If selected, the program builds a list of the prefixes used in the 010 field. A prefix is three characters long in an old LCCN (pre-2001), and two characters long in a new LCCN.

ISBN UNIQUENESS

If selected, when the ISBN stats are being collected, the program will additionally convert all ISBN-10 values to ISBN-13, and then dedupe the results. The results of this processing will be summarized in the 'Match Key' section, and reported in detail following the tag report for the 020.
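The conversion and dedupe step can be sketched as follows. The ISBN-13 check digit calculation is the standard one; everything else (the function names, and the assumption that bare ISBNs have already been extracted from the 020 without qualifiers such as '(pbk.)') is illustrative and not taken from the program.

```python
def isbn10_to_isbn13(isbn10):
    """Convert a 10-character ISBN to ISBN-13 (standard 978 prefix rule)."""
    core = "978" + isbn10[:9]                      # drop the old check digit
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(core))
    check = (10 - total % 10) % 10
    return core + str(check)

def unique_isbns(values):
    """Hypothetical helper: normalize to ISBN-13 and remove duplicates."""
    seen = set()
    for raw in values:
        isbn = raw.replace("-", "").strip().upper()
        if len(isbn) == 10:
            isbn = isbn10_to_isbn13(isbn)
        seen.add(isbn)
    return seen

print(unique_isbns(["0306406152", "9780306406157", "0-306-40615-2"]))
```

All three input values refer to the same book, so only one ISBN-13 remains after deduping.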

GMDS (245 $h)

If selected, the program will build a list of all of the GMDs found in tag 245 subfield $h. Some normalizations are performed on this list (case, spacing, and punctuation are ignored).

SUBFIELD PATTERNS

If selected, the program will maintain a list of subfield patterns for each tag in the file. This option can make a report somewhat unruly for a large file, as subfield patterns are presented in the order in which they appear in the field and are not condensed in any way.

The following example shows the subfield pattern for a 260 field from a small file (about 1000 records):

Subfield Patterns (Counts/Patterns)

811	abc
169	aabc
43	ababc
17	c
16	aababc
7	ab
6	abbc
4	bc
2	abca
2	aabbc
1	ac
1	abcg
1	abbbc
1	abababc
1	abaabc
1	aabbbc
1	aabababc
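A pattern is simply the subfield codes of one field occurrence, joined in the order they appear. A minimal Python sketch follows, assuming each field occurrence is available as a list of `(code, value)` pairs; the function name `subfield_patterns` is hypothetical.

```python
from collections import Counter

def subfield_patterns(fields):
    """fields: list of field occurrences, each a list of (code, value) pairs."""
    return Counter("".join(code for code, _ in field) for field in fields)

fields_260 = [
    [("a", "New York :"), ("b", "Wiley,"), ("c", "2001.")],
    [("a", "Boston :"), ("b", "Beacon,"), ("c", "1987.")],
    [("a", "London ;"), ("a", "New York :"), ("b", "Routledge,"), ("c", "1999.")],
    [("c", "2005.")],
]
patterns = subfield_patterns(fields_260)
for pattern, count in patterns.most_common():
    print(f"{count}\t{pattern}")
```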

SUBFIELD PUNCTUATION

If selected, the program will maintain a list of each punctuation mark that appears before each subfield code in a variable tag, and a count of the punctuation mark's occurrences. The example that follows shows the subfield punctuation table for the 100 Tag (from a file of about 5000 records):

Subfield punctuation

Code  Punct  Occs
4:      .      5
c:      ,     88
c:      .      5
d:      ,   1054
e:      ,   9219
e:      -    348
e:      .     30
e:      )      4
q:      .    250

RELATIONSHIP DESIGNATORS

If selected, the program will build a frequency table for each MARC content designator that is defined either as a 'relator term' or a 'relationship'. At last count (version 251), there were over 80 such content designators defined in MARC.

This report appears at the end of the analysis report, after the last tag (and before the character set usage report, if defined). The relationship designator report may become very large (thousands of rows). If that is not desired, use a custom tag list (see below) to target only those relationships you are interested in (eg. 100 $e, 700 $e, etc.).

Note that the relationship designator report supports some of the options found in the Normalization and Output section of the Custom lists page: Normalize to lowercase, excel-friendly output, sort by count, output custom list(s) only (in the latter case, this report is considered a 'custom list'). If the 'Normalize' option is selected (recommended), then ending punctuation will be removed from the collected strings.

A brief excerpt from a relationship designator report follows.

710 $4
act: 1
cmp: 2
dst: 2
pbl: 1
prf: 4
pro: 1
Total for 710 $4: 11

710 $e
(production company): 2
a film distributor: 1
actor: 4
animation production company: 2
animator: 9
arranger: 1
artist: 8
audio producer: 5
audio publisher: 1
author: 35
...

Apart from the statistical aspect, this report will be useful for finding typos, missing subfield coding, and unwanted relationship terms in your records. If you are having difficulty finding the offending strings in your records, re-run the report with 'Normalization' disabled.

CHARACTER SET USAGE

If selected, the program will track each byte in the file, and count the number of times it occurs. Two separate tables will be maintained: one for MARC-8 encoding, and another for UTF-8 encoding. The program will determine the record's encoding from the value of Leader/09; so, if this value is not coded appropriately these tables may be correspondingly inaccurate. In addition, a third table is (optionally) produced for UTF-8 sequences, that is, characters in unicode records that are represented by two or more bytes.

Hint: If you have some records (with leader/09 coded 'a') containing foreign script, run MARC Analysis on the file and open the results in Notepad. Scroll down to the section on 'UTF-8 Multi-Byte Sequences' and in the 'Visual' column you may see a rendering of the foreign script character if your text viewer supports unicode.

For each row in the table, the following columns will be generated: codepoint (decimal), codepoint (hex), occurrence count, percent of total characters counted, and label. Here is a short example:

065	x41	222430		0.429	Latin Uppercase A
066	x42	176286		0.340	Latin Uppercase B
067	x43	580179		1.119	Latin Uppercase C
068	x44	210523		0.406	Latin Uppercase D

If there are UTF-8 encoded records in the file, a second table will be generated for each multibyte sequence that is encountered. The columns generated in this table are somewhat different: codepoint (hex), display character, occurrence count, percent of total characters counted. Note that to view display characters with codepoints beyond the extended Latin codepage, a unicode-compliant text editor will be needed (eg. notepad or editpad, as opposed to textpad). A brief example follows.

xC3 x80 	À	7	0.000
xC3 x81 	Á	30	0.000
xC3 x85 	Å	7	0.000
xC3 x86 	Æ	2	0.000
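A table of this kind can be approximated with the Python sketch below. It is illustrative only: `multibyte_sequences` is a hypothetical name, the input is assumed to be valid UTF-8 field data, and the percentage is taken against the number of characters (an assumption about how the program computes it).

```python
from collections import Counter

def multibyte_sequences(data):
    """data: UTF-8 encoded bytes. Returns rows of (hex bytes, character, count, percent)."""
    text = data.decode("utf-8")
    counts = Counter(ch for ch in text if len(ch.encode("utf-8")) > 1)
    total = len(text)
    rows = []
    for ch, n in sorted(counts.items()):
        hex_bytes = " ".join(f"x{b:02X}" for b in ch.encode("utf-8"))
        rows.append((hex_bytes, ch, n, round(100.0 * n / total, 3)))
    return rows

# Escapes are used so the sample does not depend on the editor's encoding.
sample = "\u00c0 la carte, \u00c5ngstr\u00f6m, \u00c0".encode("utf-8")
rows = multibyte_sequences(sample)
for hex_bytes, ch, n, pct in rows:
    print(f"{hex_bytes}\t{ch}\t{n}\t{pct:.3f}")
```

The sample holds 23 characters in 27 bytes, which is why byte totals and character totals can differ for UTF-8 records.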

CUSTOM LISTS

Use this panel to define up to eight custom tag lists. For each tag/subfield defined, the program will build a frequency table by collecting each string that it finds and counting the number of times that the string appears.

Enter a tag/subfield in the box on the left using the format '040a' (no space between the tag and the subfield). Only one subfield can be specified per list (so if you want to create two lists using subfields from the same tag, enter them separately, eg. '852a', '852b'). The subfield code is required–you cannot simply enter a tag here.

'Length' is used to limit the length of each string being tracked. It is set to 1 by default. Change this number to adjust the length of the strings that you want to track. The maximum number allowed in this box is 64 (which means the program will truncate any subfield longer than 64 bytes). Note that MARC subfield delimiters are not counted in the length of the string.

For example, to generate a list of holdings codes and their counts in a file, enter the tag/subfield of the holdings code (e.g. '852a', or '949a', etc), then enter the maximum length of a holdings code (e.g. '5'). If the length of your holdings codes varies, enter the largest number that will collect all unique occurrences.

The length is important, as it determines the amount of granularity in the result. For example, to get a breakdown of call numbers in a database, a length of 3 for DDC, or 6 for LCC, will produce a nice overview, whereas a longer length, like 12, might simply dump every call number in the system. The better way to do the latter would be to use MARC Review, as it provides far more options for configuring the output than MARC Analysis.
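The effect of 'Length' on granularity can be seen in a small Python sketch. The function name `custom_list` is hypothetical, and the sample values stand in for DDC call numbers already extracted from a subfield.

```python
from collections import Counter

MAX_LENGTH = 64  # upper limit described for the 'Length' box

def custom_list(values, length=1):
    """Hypothetical helper: frequency table of values truncated to `length`."""
    length = min(length, MAX_LENGTH)
    return Counter(v[:length] for v in values)

ddc = ["813.54", "813.6", "641.5", "813.52"]
overview = custom_list(ddc, 3)    # one row per DDC class
detailed = custom_list(ddc, 12)   # effectively one row per call number
print(overview)
print(detailed)
```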

There is an option to normalize the strings that are collected by a custom Tag List. If this option is selected, each string will be shifted into uppercase, punctuation will be removed, leading and trailing blanks trimmed, and sequences of internal multiple blanks will be reduced to one blank.
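The normalization steps listed above might look like this in Python. This is a sketch under assumptions: `normalize` is a hypothetical name, only ASCII punctuation is removed, and punctuation is deleted rather than replaced by a blank.

```python
import re
import string

# Translation table that deletes ASCII punctuation characters.
_PUNCT = str.maketrans("", "", string.punctuation)

def normalize(value):
    """Uppercase, strip punctuation, trim, and collapse runs of blanks."""
    value = value.upper().translate(_PUNCT)
    return re.sub(r" {2,}", " ", value).strip()

print(normalize("  New  York : Wiley, "))
```

With normalization on, 'New York : Wiley,' and 'NEW YORK: WILEY' are counted as the same string.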

There is also an option to output custom tag lists only. If this is selected, the report will contain only the general file statistics followed by the information for the tag(s) defined in your custom list.

NOTE: Data collected by a custom taglist is stored in system memory. Running 'Custom lists' with long maximum lengths on a very large file (millions of records) could conceivably exhaust system memory.

MATCH KEY STATS

On the right panel of the Custom lists page are a few options that apply to the Match Key analysis section of the report.

You can suppress this section altogether by unchecking 'Compile match key stats'.

The primary match keys (LCCN, ISBN, ISSN, OCLC) are not editable (the point of showing them is simply to demonstrate what data is going to be collected).

But there are two slots where the user can enter their own 'match keys', which will be tracked in the same section of the report.

SUMMARY LISTS

Any or all of the following statistical summaries may be added to a MARC Analysis report. If there are any selections, the results appear at the end of the report.

Frequency: What % of records contain each tag used in the file
Occurrence: How many times each tag is used in the file
Overall Size: How many bytes does each tag use in the file
Maximum Length: What is the longest, 2nd longest, etc., tag in the file
Minimum Length: What is the shortest, 2nd shortest, etc., tag in the file
Average Length: What is the longest, on average, tag in the file
Content designator length summary: what is the most common content length in the file
Tag Summary: one row for each tag, each row with the 7 main columns (listed above)

You may limit the number of rows that are added by these statistics, eg., show only the top 10, 20, etc., items in each category. Note: this limit does not apply to the tag summary report.

TAG SUMMARY

This report is selected using the summary lists option (see previous section). By default, this summary appears at the end of the main report. However, it can also be output to a separate report by selecting the “Output Tag Summary separately” option. In the latter case, the summary is omitted from the main report.

A benefit of a separate file for the tag summary is that it may be loaded into a program that supports tables (like Excel). This is not possible if the summary is part of the main report.

You may also apply one of these categories to the overall report by right-clicking on the corresponding list item. For example, if you right-click on the second category–“How many times each tag is used in the file”–and set your limit to '10', then only the 10 tags that appear in the selected category will appear in the main body of the report (where the complete statistics for each tag are normally reported).

ONLY DISPLAY DETAILS FOR

This option makes it possible to reduce the size of the report so that it focuses on a single tag, or a range of tags.

To achieve this, enter a three-digit tag, a list of tags ('100,700'), or a range of tags (eg. '600-655', or '6XX'), or any combination of the preceding, and:

Turn off all of the options on the Data collection page
Make sure no Custom lists are defined
Turn off the Match Key analysis option
Turn off all Tag Lists to be added

The result will be a report that includes just the results for the tags that were entered here, after the general summary (which may not be suppressed).

Note: if any Custom Tag(s) are specified, the output for them will not be suppressed, even if they are not in the list of 'detail' tags.
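To make the accepted tag formats concrete, here is a Python sketch that expands a specification such as '100,700', '600-655', or '6XX' into a set of tags. It is not the program's parser: `expand_tag_spec` is a hypothetical name, and the handling of blanks and lowercase 'x' is an assumption.

```python
ALL_TAGS = [f"{n:03d}" for n in range(1000)]

def expand_tag_spec(spec):
    """Hypothetical helper: expand a comma-separated tag spec into a set of tags."""
    tags = set()
    for part in spec.split(","):
        part = part.strip().upper()
        if not part:
            continue
        if "-" in part:                      # range, e.g. '600-655'
            low, high = part.split("-", 1)
            tags.update(f"{n:03d}" for n in range(int(low), int(high) + 1))
        elif "X" in part:                    # wildcard, e.g. '6XX'
            tags.update(t for t in ALL_TAGS
                        if all(p in ("X", c) for p, c in zip(part, t)))
        else:                                # single tag, e.g. '100'
            tags.add(part)
    return tags

print(sorted(expand_tag_spec("100,700")))
print(len(expand_tag_spec("245, 6XX, 700-710")))
```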

phelp/helpmarcanalysis.txt · Last modified: 2021/12/29 16:21 (external edit)
