— |
phelp:helpmarcanalysis [2021/12/29 16:21] (current) |
||
---|---|---|---|
Line 1: | Line 1: | ||
+ | MARC ANALYSIS | ||
+ | |||
+ | MARC Analysis reads a MARC file and gathers statistics about the MARC coding contained in the file. This utility does not change the file in any way and is safe to run on any MARC file. | ||
+ | |||
+ | The main output of MARC Analysis is a plain-text report. This report is self-explanatory and, as it can be very long, it will not be repeated here. Also note that many of the report sections described here can be turned on or off--see ' | ||
+ | |||
+ | At the top of the report are general statistics about the file, for example: the number of records in the file, the average record size, the average number of tags per record, a list of tags present in every record, a list of subfields used, and so on. (For an up-to-date list of the ' | ||
+ | |||
+ | After these general file statistics, a detailed report for every tag follows. | ||
+ | |||
+ | For the leader and fixed fields, each data element is broken out and reported on separately. | ||
+ | |||
+ | For the leader, a list and count of all values for each position is given (with the exception of the base address; record lengths are optionally counted in ranges of 50 bytes; see the option for this below). | ||
+ | |||
+ | The fixed fields are reported on in groups based on the record format or material type; within each of these groups, a list and count of all values for each position is given. | ||
+ | |||
+ | For each tag, the main report includes the following seven columns: | ||
+ | Records--the number of records in which the tag appeared | ||
+ | TotOccs--the total occurrence count of the tag in the file | ||
+ | MaxOccs--the maximum occurrences of the tag in a single record | ||
+ | TotalSz--the total number of bytes used by the tag's data | ||
+ | AvgSize--the average size of the tag's data (in bytes) | ||
+ | Longest--the longest occurrence of the tag (in bytes) | ||
+ | Shortest--the shortest occurrence of the tag (in bytes) | ||
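
As an illustration only, the seven columns above could be computed along the following lines. This is a minimal sketch, not the program's actual code: it assumes each record has already been parsed into a list of (tag, data) pairs, and the function name `tag_stats` is hypothetical.

```python
from collections import defaultdict

def tag_stats(records):
    """Compute the seven per-tag columns.

    records: list of records; each record is a list of (tag, data_bytes)
    pairs (an assumed, already-parsed representation).
    """
    stats = defaultdict(lambda: {"Records": 0, "TotOccs": 0, "MaxOccs": 0,
                                 "TotalSz": 0, "Longest": 0, "Shortest": None})
    for record in records:
        per_record = defaultdict(int)  # occurrences of each tag in this record
        for tag, data in record:
            s = stats[tag]
            size = len(data)
            per_record[tag] += 1
            s["TotOccs"] += 1
            s["TotalSz"] += size
            s["Longest"] = max(s["Longest"], size)
            s["Shortest"] = size if s["Shortest"] is None else min(s["Shortest"], size)
        for tag, occs in per_record.items():
            stats[tag]["Records"] += 1
            stats[tag]["MaxOccs"] = max(stats[tag]["MaxOccs"], occs)
    for s in stats.values():
        s["AvgSize"] = s["TotalSz"] / s["TotOccs"]
    return dict(stats)
```

Note that Records counts a tag once per record, while TotOccs counts every occurrence, so the two differ whenever a tag repeats within a record.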
+ | |||
+ | In addition, for variable tags, the total number of times each indicator and subfield value appeared in the tag is also listed. | ||
+ | |||
+ | For each subfield that appeared in a tag: | ||
+ | --a breakdown of the occurrences of the subfield in the tag (how many times the subfield appeared once in the tag, how many times it appeared twice, etc.) | ||
+ | --the average size of the subfield' | ||
+ | --the shortest occurrence of the subfield (in bytes) | ||
+ | --the longest occurrence of the subfield (in bytes) | ||
+ | |||
+ | For all subfields recorded in a tag: | ||
+ | --a list of subfield usage patterns and their counts | ||
+ | --a list of subfield punctuation marks and their counts | ||
+ | |||
+ | Following this detailed tag-by-tag report, two character encoding reports are displayed. Each tries to account for every byte that appears in the file (except for those bytes that are used in the directory). The first table lists all of the MARC-8 bytes, and the second table lists all of the UTF-8 sequences. The UTF-8 sequences are only tracked for those records where the Leader/09 byte is set to ' | ||
+ | |||
+ | Following this are a number of lists that add some detail to the general statistics. Examples include the list of tags present in the file ordered by frequency and the list of tags present in the file ordered by size (number of bytes). These lists display both a count and a percentage for each tag. This information shows which tags are used most in a file, which are used least, and so on. | ||
+ | |||
+ | |||
+ | RUNNING MARC ANALYSIS | ||
+ | |||
+ | Simply press the ' | ||
+ | |||
+ | When the analysis run is complete, a report will be generated. Press ' | ||
+ | |||
+ | Click ' | ||
+ | |||
+ | Once the program is running, press the ' | ||
+ | |||
+ | The name of the MARC file being analyzed is given in the status bar at the bottom. To use a different | ||
+ | |||
+ | |||
+ | RESULTS | ||
+ | |||
+ | The report generated by MARC Analysis is written to a file in the same directory as the MARC file that you are analyzing. The filename is composed by appending ' | ||
+ | |||
+ | You can also optionally navigate and save the results to a directory, and to a filename, of your own choosing. | ||
+ | |||
+ | You can also output the report in HTML format; in this case it will be called ' | ||
+ | |||
+ | |||
+ | OPTIONS | ||
+ | |||
+ | Click the Options button to change the program options. There are three groups of options: Data collection, Custom lists, and Output options. | ||
+ | |||
+ | |||
+ | DATA COLLECTION | ||
+ | |||
+ | These options make it possible to eliminate unwanted sections from the report. | ||
+ | |||
+ | To quickly enable/ | ||
+ | |||
+ | Note that unselecting the data collection options has little effect on performance, | ||
+ | |||
+ | LEADER ELEMENTS | ||
+ | |||
+ | If selected, the program will build a table of each code that appears, and the number of times it is used, in each of the positions defined in the Leader. | ||
+ | |||
+ | RECORD LENGTHS | ||
+ | |||
+ | This option groups all the record lengths in the file (Leader/ | ||
+ | |||
+ | Record Length Analysis | ||
+ | |||
+ | 450-499: | ||
+ | 500-549: | ||
+ | 550-599: | ||
+ | 600-649: | ||
+ | 650-699: | ||
+ | 700-749: | ||
+ | ... | ||
+ | |||
+ | The first row can be read as '56 records, representing 2.583% of the records in the file, had record lengths between 450 and 499 bytes'. | ||
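
The grouping described above can be sketched as follows. This is a rough illustration under stated assumptions, not the program's implementation: the function name `length_ranges` and its `width` parameter are hypothetical, and the record lengths are assumed to have been read from the leader already.

```python
from collections import Counter

def length_ranges(record_lengths, width=50):
    """Group record lengths into fixed-width ranges.

    Returns rows of (range label, record count, percent of all records).
    """
    buckets = Counter((n // width) * width for n in record_lengths)
    total = len(record_lengths)
    rows = []
    for start in sorted(buckets):
        count = buckets[start]
        label = f"{start}-{start + width - 1}"
        rows.append((label, count, round(100 * count / total, 3)))
    return rows
```

Integer division by the range width gives each length its bucket, so a 499-byte record lands in 450-499 and a 500-byte record in 500-549.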
+ | |||
+ | 001 PREFIXES | ||
+ | |||
+ | If selected, the program builds a list of alphabetic prefixes in the 001 field. A prefix is considered to be any group of non-numeric characters that begins the 001 field. | ||
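
The prefix rule above (any run of non-numeric characters at the start of the 001) might be expressed like this. It is a sketch only; the names `PREFIX` and `prefix_counts` are hypothetical and the control numbers are assumed to be plain strings.

```python
import re
from collections import Counter

# A prefix is the leading run of non-digit characters, if any.
PREFIX = re.compile(r"^[^0-9]+")

def prefix_counts(control_numbers):
    """Count the leading non-numeric prefix of each 001 value."""
    counts = Counter()
    for value in control_numbers:
        match = PREFIX.match(value)
        if match:  # values that start with a digit have no prefix
            counts[match.group(0)] += 1
    return counts
```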
+ | |||
+ | 003 CODES | ||
+ | |||
+ | If selected, the program builds a list of the control number identifiers that appear in the 003 field. The maximum length of this field is 8 characters; the program removes any blank spaces at the end of the field before adding the string to the list. | ||
+ | |||
+ | 005/ | ||
+ | |||
+ | If selected, the program builds a list of dates from the 005 field. By default, the list is created using only the first six positions in the 005, i.e., YYYYMM (down to the month level). | ||
+ | |||
+ | The length of the data collection from the 005 field can be controlled by changing the value for the ' | ||
+ | |||
+ | Note: the format of the 005 field is as follows: YYYYMMDDHHMMSS.[ms], | ||
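
To show how the collection length changes the granularity of the date list, here is a small sketch. The function name `date_counts` and its `length` parameter are hypothetical stand-ins for the program option, and the 005 values are assumed to be plain strings.

```python
from collections import Counter

def date_counts(fields_005, length=6):
    """Count 005 values truncated to the first `length` positions.

    length=6 keeps YYYYMM; length=4 would keep only YYYY.
    """
    return Counter(value[:length] for value in fields_005)
```

A shorter length yields fewer, larger groups; a longer one yields a more detailed (and longer) list.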
+ | |||
+ | 006/007/008 ELEMENTS | ||
+ | |||
+ | If selected, the program will build a table of each code that appears, and the number of times it is used, in each of the positions defined in the 006, 007, and 008. | ||
+ | |||
+ | The reports for these tags are grouped into sections. For the 006, these sections are defined by the form of material (006/00); for the 007, these sections are defined by the type of material (007/00); and for the 008, these sections are defined by the record type (as coded in the Leader/06 and /07). | ||
+ | |||
+ | In addition, each fixed field report begins with a table of ' | ||
+ | |||
+ | 008/00 LENGTH= | ||
+ | |||
+ | The length of the data collection from the 008 Date Entered field can be controlled by changing the value for the ' | ||
+ | |||
+ | LCCN PREFIXES | ||
+ | |||
+ | If selected, the program builds a list of the prefixes used in the 010 field. A prefix is three characters long in an old LCCN (pre-2001), and two characters long in a new LCCN. | ||
+ | |||
+ | ISBN UNIQUENESS | ||
+ | |||
+ | If selected, when the ISBN stats are being collected, the program will additionally convert all ISBN-10 values to ISBN-13, and then dedupe the results. The results of this processing will be summarized in the 'Match Key' section, and reported in detail following the tag report for the 020. | ||
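
The conversion and dedupe step can be sketched with the standard ISBN-13 check digit calculation. This is an illustration only: the function names are hypothetical, and the input is assumed to be bare ISBN strings (the program's own handling of qualifiers or invalid values is not described here).

```python
def isbn10_to_isbn13(isbn10):
    """Convert a 10-character ISBN to ISBN-13 (978 prefix, new check digit)."""
    core = "978" + isbn10[:9]  # drop the old ISBN-10 check character
    # ISBN-13 weights alternate 1, 3, 1, 3, ... across the first 12 digits.
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(core))
    check = (10 - total % 10) % 10
    return core + str(check)

def unique_isbns(isbns):
    """Normalize every ISBN to its 13-digit form and return the distinct set."""
    seen = set()
    for raw in isbns:
        isbn = raw.replace("-", "").strip().upper()
        if len(isbn) == 10:
            isbn = isbn10_to_isbn13(isbn)
        seen.add(isbn)
    return seen
```

Converting before deduping matters because the same book may be recorded once as an ISBN-10 and once as an ISBN-13.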
+ | |||
+ | GMDS (245 $h) | ||
+ | |||
+ | If selected, the program will build a list of all of the GMDs found in tag 245 subfield $h. Some normalizations are performed on this list (case, spacing, and punctuation are ignored). | ||
+ | |||
+ | SUBFIELD PATTERNS | ||
+ | |||
+ | If selected, the program will maintain a list of subfield patterns for each tag in the file. This option can make a report somewhat unruly for a large file, as subfield patterns are presented in the order in which they appear in the field and are not condensed in any way. | ||
+ | |||
+ | The following example shows the subfield pattern for a 260 field from a small file (about 1000 records): | ||
+ | |||
+ | Subfield Patterns (Counts/ | ||
+ | 811 abc | ||
+ | 169 aabc | ||
+ | 43 ababc | ||
+ | 17 c | ||
+ | 16 aababc | ||
+ | 7 ab | ||
+ | 6 abbc | ||
+ | 4 bc | ||
+ | 2 abca | ||
+ | 2 aabbc | ||
+ | 1 ac | ||
+ | 1 abcg | ||
+ | 1 abbbc | ||
+ | 1 abababc | ||
+ | 1 abaabc | ||
+ | 1 aabbbc | ||
+ | 1 aabababc | ||
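
A pattern table like the one above can be produced by joining the subfield codes of each field in order and counting the resulting strings. The sketch below assumes each field occurrence is a list of (code, value) pairs; the function name `subfield_patterns` is hypothetical.

```python
from collections import Counter

def subfield_patterns(fields):
    """Count subfield-code patterns for one tag.

    fields: occurrences of the tag, each a list of (code, value) pairs
    in the order the subfields appear. Repeated codes are kept as-is.
    """
    return Counter("".join(code for code, _ in field) for field in fields)
```

Because repeated codes are not condensed, 'abc' and 'aabc' are counted as different patterns, which is why the list can grow long for a large file.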
+ | |||
+ | SUBFIELD PUNCTUATION | ||
+ | |||
+ | If selected, the program will maintain a list of each punctuation mark that appears before each subfield code in a variable tag, and a count of the punctuation mark's occurrences. The example that follows shows the subfield punctuation table for the 100 Tag (from a file of about 5000 records): | ||
+ | |||
+ | Subfield punctuation | ||
+ | Code Punct Occs | ||
+ | 4: . 5 | ||
+ | c: , 88 | ||
+ | c: . 5 | ||
+ | d: , 1054 | ||
+ | e: , 9219 | ||
+ | e: - 348 | ||
+ | e: . 30 | ||
+ | e: ) 4 | ||
+ | q: . 250 | ||
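
One way to build such a table is to look at the last non-blank character before each subfield code. The sketch below is an assumption about the approach, not the program's actual logic: it treats only ASCII punctuation as punctuation, ignores the first subfield in a field, and uses the hypothetical name `subfield_punctuation`.

```python
import string
from collections import Counter

def subfield_punctuation(fields):
    """Count the punctuation mark that precedes each subfield code.

    fields: occurrences of one tag, each a list of (code, value) pairs.
    Returns a Counter keyed by (code, punctuation mark).
    """
    counts = Counter()
    for field in fields:
        # Pair each subfield with the one that comes before it.
        for (_, prev_value), (code, _) in zip(field, field[1:]):
            text = prev_value.rstrip()
            if text and text[-1] in string.punctuation:
                counts[(code, text[-1])] += 1
    return counts
```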
+ | |||
+ | RELATIONSHIP DESIGNATORS | ||
+ | |||
+ | If selected, the program will build a frequency table for each MARC content designator that is defined either as a ' | ||
+ | |||
+ | This report appears at the end of the analysis report, after the last tag (and before the character set usage report, if defined). The relationship designator report may become very large (thousands of rows). If that is not desired, use a custom tag list (see below) to target only those relationships you are interested in (e.g., 100 $e, 700 $e, etc.). | ||
+ | |||
+ | Note that the relationship designator report supports some of the options found in the Normalization and Output section of the Custom lists page: Normalize to lowercase, Excel-friendly output, sort by count, output custom list(s) only (in the latter case, this report is considered a ' | ||
+ | |||
+ | A brief excerpt from a relationship designator report follows. | ||
+ | |||
+ | 710 $4 | ||
+ | act: 1 | ||
+ | cmp: 2 | ||
+ | dst: 2 | ||
+ | pbl: 1 | ||
+ | prf: 4 | ||
+ | pro: 1 | ||
+ | Total for 710 $4: 11 | ||
+ | |||
+ | 710 $e | ||
+ | (production company): 2 | ||
+ | a film distributor: | ||
+ | actor: 4 | ||
+ | animation production company: 2 | ||
+ | animator: 9 | ||
+ | arranger: 1 | ||
+ | artist: 8 | ||
+ | audio producer: 5 | ||
+ | audio publisher: 1 | ||
+ | author: 35 | ||
+ | ... | ||
+ | |||
+ | Apart from the statistical aspect, this report will be useful for finding typos, missing subfield coding, and unwanted relationship terms in your records. If you are having difficulty finding the offending strings in your records, re-run the report with ' | ||
+ | |||
+ | CHARACTER SET USAGE | ||
+ | |||
+ | If selected, the program will track each byte in the file, and count the number of times it occurs. Two separate tables will be maintained: one for MARC-8 encoding, and another for UTF-8 encoding. The program will determine the record' | ||
+ | |||
+ | Hint: If you have some records (with leader/09 coded ' | ||
+ | |||
+ | For each row in the table, the following columns will be generated: codepoint (decimal), codepoint (hex), occurrence count, percent of total characters counted, and label. Here is a short example: | ||
+ | |||
+ | 065 x41 222430 0.429 Latin Uppercase A | ||
+ | 066 x42 176286 0.340 Latin Uppercase B | ||
+ | 067 x43 580179 1.119 Latin Uppercase C | ||
+ | 068 x44 210523 0.406 Latin Uppercase D | ||
+ | |||
+ | If there are UTF-8 encoded records in the file, a second table will be generated for each multibyte sequence that is encountered. The columns generated in this table are somewhat different: codepoint (hex), display character, occurrence count, percent of total characters counted. Note that to view display characters with codepoints beyond the extended Latin codepage, a Unicode-compliant text editor will be needed (e.g., notepad or editpad, as opposed to textpad). A brief example follows. | ||
+ | |||
+ | xC3 x80 À 7 0.000 | ||
+ | xC3 x81 Á 30 0.000 | ||
+ | xC3 x85 Å 7 0.000 | ||
+ | xC3 x86 Æ 2 0.000 | ||
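
For illustration, rows in the style of the UTF-8 table above could be derived as follows. This is a sketch under assumptions: the field data is taken to be already decoded to text, and the function names `multibyte_counts` and `format_row` are hypothetical.

```python
from collections import Counter

def multibyte_counts(text):
    """Count each character whose UTF-8 encoding is longer than one byte."""
    counts = Counter()
    for ch in text:
        encoded = ch.encode("utf-8")
        if len(encoded) > 1:  # single-byte (ASCII) characters are skipped
            counts[encoded] += 1
    return counts

def format_row(encoded, count, total):
    """Format one row: hex bytes, display character, count, percent of total."""
    hex_bytes = " ".join(f"x{b:02X}" for b in encoded)
    percent = 100 * count / total
    return f"{hex_bytes} {encoded.decode('utf-8')} {count} {percent:.3f}"
```

The hex column shows the encoded bytes of the sequence rather than the Unicode code point, matching the 'xC3 x80' style of the example.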
+ | |||
+ | |||
+ | CUSTOM LISTS | ||
+ | |||
+ | Use this panel to define up to eight custom tag lists. For each tag/ | ||
+ | |||
+ | Enter a tag/ | ||
+ | |||
+ | ' | ||
+ | |||
+ | For example, to generate a list of holdings codes and their counts in a file, enter the tag/ | ||
+ | |||
+ | The length is important, as it determines the amount of granularity in the result. For example, to get a breakdown of call numbers in a database, a length of 3 for DDC, or 6 for LCC, will produce a nice overview, whereas a longer length, like 12, might simply dump every call number in the system. The better way to do the latter would be to use MARC Review, as it provides far more options for configuring the output than MARC Analysis. | ||
+ | |||
+ | There is an option to normalize the strings that are collected by a custom Tag List. If this option is selected, each string will be shifted into uppercase, punctuation will be removed, leading and trailing blanks trimmed, and sequences of internal multiple blanks will be reduced to one blank. | ||
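
The normalization steps listed above can be approximated in a few lines. This is a sketch, not the program's own routine: it removes ASCII punctuation only (an assumption), and the function name `normalize` is hypothetical.

```python
import string

# Translation table that deletes ASCII punctuation characters.
_PUNCT = str.maketrans("", "", string.punctuation)

def normalize(value):
    """Uppercase, strip punctuation, trim, and collapse runs of blanks."""
    value = value.upper().translate(_PUNCT)
    # split() with no argument drops leading/trailing whitespace and
    # treats any run of whitespace as a single separator.
    return " ".join(value.split())
```

With normalization applied, strings that differ only in case, spacing, or punctuation are counted together in the custom list.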
+ | |||
+ | There is also an option to output custom tag lists only. If this is selected, the report will contain only the general file statistics followed by the information for the tag(s) defined in your custom list. | ||
+ | |||
+ | NOTE: Data collected by a custom taglist is stored in system memory. Running ' | ||
+ | |||
+ | MATCH KEY STATS | ||
+ | |||
+ | On the right panel of the Custom lists page are a few options that apply to the Match Key analysis section of the report. | ||
+ | |||
+ | You can suppress this section altogether by unchecking ' | ||
+ | |||
+ | The primary match keys (LCCN, ISBN, ISSN, OCLC) are not editable (the point of showing them is simply to demonstrate what data is going to be collected). | ||
+ | |||
+ | But there are two slots where the user can enter their own 'match keys', which will be tracked in the same section of the report. | ||
+ | |||
+ | SUMMARY LISTS | ||
+ | |||
+ | Any or all of the following statistical summaries may be added to a MARC Analysis report. If there are any selections, the results appear at the end of the report. | ||
+ | |||
+ | Frequency: What % of records contain each tag used in the file | ||
+ | Occurrence: How many times each tag is used in the file | ||
+ | Overall Size: How many bytes does each tag use in the file | ||
+ | Maximum Length: What is the longest, 2nd longest, etc., tag in the file | ||
+ | Minimum Length: What is the shortest, 2nd shortest, etc., tag in the file | ||
+ | Average Length: Which tag is the longest, on average, in the file | ||
+ | Content designator length summary: What is the most common content length in the file | ||
+ | Tag Summary: one row for each tag, each row with the 7 main columns (listed above) | ||
+ | |||
+ | You may limit the number of rows that are added by these statistics, e.g., show only the top 10, 20, etc., items in each category. | ||
+ | Note: this limit does not apply to the tag summary report. | ||
+ | |||
+ | TAG SUMMARY | ||
+ | |||
+ | This report is selected using the summary lists option (see previous section). By default, this summary appears at the end of the main report. However, it can also be output to a separate report by selecting the " | ||
+ | |||
+ | A benefit of a separate file for the tag summary is that it may be loaded into a program that supports tables (like Excel). This is not possible if the summary is part of the main report. | ||
+ | |||
+ | You may also apply one of these categories to the overall report by right-clicking on the corresponding list item. For example, if you right-click on the second category--" | ||
+ | |||
+ | ONLY DISPLAY DETAILS FOR | ||
+ | |||
+ | This option makes it possible to reduce the size of the report so that it focuses on just a single tag, or a range of tags. | ||
+ | |||
+ | To achieve this, enter a three-digit tag, a list of tags (' | ||
+ | |||
+ | Turn off all of the options on the Data collection page | ||
+ | Make sure no Custom lists are defined | ||
+ | Turn off the Match Key analysis option | ||
+ | Turn off all Tag Lists to be added | ||
+ | |||
+ | The result will be a report that includes just the results for the tags that were entered here, after the general summary (which may not be suppressed). | ||
+ | |||
+ | Note: if any Custom Tag(s) are specified, the output for them will not be suppressed, even if they are not in the list of ' | ||