phelp:helpmarcsplit [2017/05/07 22:28] |
phelp:helpmarcsplit [2021/12/29 16:21] (current) |
||
---|---|---|---|
+ | UTILITIES--MARC SPLIT | ||
+ | |||
+ | MARC Split allows you to break a large MARC file into several/ | ||
+ | |||
+ | WHY USE MARC SPLIT | ||
+ | |||
+ | Here are a few scenarios where MARC SPLIT would be useful. | ||
+ | |||
+ | First, say you need to transfer a large database across the internet. You decide (or are told) to split the database into several smaller files, then ftp each file separately. In this case, if the ftp job aborts, you will only need to resend one smaller file. MARC Split' | ||
+ | |||
+ | Second, say you want to copy your database to a set of removable media (floppy, CD, USB drive, etc.), and want each disk to contain a readable MARC file of roughly the same size. MARC Split' | ||
+ | |||
+ | Finally, you might use MARC Split to split your database into files by: holdings code, record type, media type, or any other distinct MARC element. For example, say you have 12 holdings codes in your system, and you need to generate 12 files, where each file contains all bib records to which each holdings code is attached. MARC Split' | ||
+ | |||
+ | Note: To split a file into two files based on the presence/ | ||
+ | |||
+ | SPLIT OPTIONS | ||
+ | |||
+ | There are three ways to split a file: | ||
+ | - by the number of records per file, | ||
+ | - by the number of bytes per file, | ||
+ | - by the data values in the file itself | ||
+ | |||
+ | |||
+ | By Records | ||
+ | |||
+ | This option is suitable for the first scenario discussed above. Click on ' | ||
+ | |||
+ | The size of each file in bytes will vary, of course, due to the variable-length nature of MARC. | ||
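As a rough illustration of a By Records split, here is a minimal Python sketch. It is not the utility's own code. It relies only on the MARC rule that the first five bytes of each record's leader give the record length. The names `iter_records`, `split_by_records` and `make_fake_record` are made up for this example, and the fake records are test stand-ins, not valid MARC.

```python
def iter_records(data: bytes):
    """Yield each raw record, using the 5-digit length in Leader/00-04."""
    pos = 0
    while pos < len(data):
        length = int(data[pos:pos + 5])
        yield data[pos:pos + length]
        pos += length


def split_by_records(data: bytes, per_file: int):
    """Group records into chunks holding at most `per_file` records each."""
    if per_file < 1:
        raise ValueError("per_file must be at least 1")
    chunks, current = [], []
    for rec in iter_records(data):
        current.append(rec)
        if len(current) == per_file:
            chunks.append(b"".join(current))
            current = []
    if current:  # the last chunk may hold fewer records
        chunks.append(b"".join(current))
    return chunks


def make_fake_record(body: bytes) -> bytes:
    """Hypothetical stand-in: 24-byte leader + body + record terminator."""
    total = 24 + len(body) + 1
    return f"{total:05d}".encode() + b" " * 19 + body + b"\x1d"


records = [make_fake_record(b"a" * n) for n in (1, 10, 3, 7, 5)]
chunks = split_by_records(b"".join(records), 2)
print([len(c) for c in chunks])  # chunk sizes in bytes differ
```

Each chunk holds the same number of records (except possibly the last), but the byte sizes differ because the records themselves differ in length.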
+ | |||
+ | If for some reason (perhaps for a web catalog) you need to save each record in your database to a separate file, you can do it by setting the Number of units to 1. Be careful when doing this for a large file. First, check that there is enough disk space since, as a rule, each file created by the Windows operating system uses several times more disk space than the size of the MARC record itself. Second, be sure to direct the output to an empty folder; this will make cleanup a lot easier if something goes wrong, or if you just change your mind. | ||
+ | |||
+ | By Size | ||
+ | |||
+ | This option is suitable when you are working with a known, limited amount of disk space, such as removable media. Click on the 'By Size' option; then enter the capacity of the storage medium, or set a maximum file size, in kilobytes, in the ' | ||
+ | |||
+ | The number of files created will be roughly equal to the size of your source file in bytes divided by the number of bytes entered. The number of records in each file will vary, of course, due to the variable-length nature of MARC. | ||
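The By Size behavior can be pictured with the following Python sketch. It is an assumption about how such a split could work, not a description of the utility's internals: records are kept whole, and a new file is started whenever the next record would push the current file past the limit. The function name `split_by_size` is hypothetical.

```python
def split_by_size(records, max_bytes):
    """Pack whole records into files without exceeding max_bytes per file.

    A single record larger than max_bytes still gets a file of its own.
    """
    files, current, size = [], [], 0
    for rec in records:
        if current and size + len(rec) > max_bytes:
            files.append(current)
            current, size = [], 0
        current.append(rec)
        size += len(rec)
    if current:
        files.append(current)
    return files


# Five dummy "records" of 40, 40, 40, 90 and 10 bytes, with a 100-byte limit.
records = [b"x" * n for n in (40, 40, 40, 90, 10)]
files = split_by_size(records, 100)
print([sum(len(r) for r in f) for f in files])  # [80, 40, 100]
print([len(f) for f in files])                  # [2, 1, 2]
```

Here 220 bytes of input with a 100-byte limit yields three files, which is close to (but not exactly) 220 / 100, and the record count per file varies.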
+ | |||
+ | Note that the size of a storage device these days is typically given in gigabytes, e.g. '8 GB', or megabytes ('650 MB'). You may find it necessary to convert this rounded-off number into kilobytes. There are many free converters on the web that will do this (google 'byte converter' | ||
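If you would rather do the conversion yourself, the arithmetic is short. The Python sketch below is only an illustration. The function name `to_kilobytes` is made up, and which convention applies (1 MB = 1024 KB or 1 MB = 1000 KB) is an assumption you should check against your media, since device makers often use 1000.

```python
def to_kilobytes(value, unit, base=1024):
    """Convert a size in KB, MB or GB to kilobytes.

    base=1024 is the binary convention; pass base=1000 for the decimal one.
    """
    exponents = {"KB": 0, "MB": 1, "GB": 2}
    return value * base ** exponents[unit.upper()]


print(to_kilobytes(650, "MB"))           # 665600
print(to_kilobytes(8, "GB"))             # 8388608
print(to_kilobytes(8, "GB", base=1000))  # 8000000
```

When in doubt, enter the smaller figure so that each output file is sure to fit.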
+ | |||
+ | By Data | ||
+ | |||
+ | The By Data option is useful if you need to split a file by the value of a MARC content designator, and that content designator contains relatively consistent data. For example, you might need to split a file | ||
+ | - by language code | ||
+ | - by holdings code | ||
+ | - by media type | ||
+ | - by publication date | ||
+ | - or by any other MARC content designator | ||
+ | |||
+ | In a 'By Data' run, the output files are named using the strings contained in the specified MARC content designators (to which we add the file extension ' | ||
+ | |||
+ | In practice, Split By Data is a two-phase operation. | ||
+ | |||
+ | The utility first processes the selected MARC file in read-only mode; the purpose of this run is to gather statistics on the proposed files to be created. When that's done, the proposed file list is displayed, and you can either approve it, or cancel it and go back to the options. If you approve the proposed file list, the utility runs a second time to create the output files. | ||
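The two-phase flow described above can be sketched in Python as follows. This is a simplified model, not the utility's code: records are plain dictionaries rather than MARC records, nothing is written to disk, and the names `propose_files`, `split_by_data` and `approve` are hypothetical.

```python
from collections import Counter


def propose_files(records, key_of):
    """Phase 1 (read-only): count how many records each file would receive."""
    return Counter(key_of(rec) for rec in records)


def split_by_data(records, key_of, approve):
    """Run phase 1, ask for approval, then run phase 2 only if approved.

    `records` must be a list (it is read twice); `approve` is a callback
    that receives the proposed file list and returns True or False.
    """
    proposal = propose_files(records, key_of)
    if not approve(proposal):
        return None  # cancelled: nothing is output
    output = {key: [] for key in proposal}
    for rec in records:  # phase 2: second pass over the same data
        output[key_of(rec)].append(rec)
    return output


records = [
    {"id": 1, "337a": "video"},
    {"id": 2, "337a": "audio"},
    {"id": 3, "337a": "video"},
]
by_media = lambda rec: rec["337a"]
print(propose_files(records, by_media))  # Counter({'video': 2, 'audio': 1})
result = split_by_data(records, by_media, approve=lambda p: len(p) <= 10)
print(sorted(result))                    # ['audio', 'video']
```

Because the first pass only counts, you can inspect the proposed list and cancel before any file is created.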
+ | |||
+ | It's important to note that the Split utility does not attempt to validate the contents of the specified content designator. Any typos in your data will end up becoming filenames. Thus, continuing with the above example for 337 $a, you might end up with filenames like: ' | ||
+ | |||
+ | Using the options | ||
+ | |||
+ | Enter the location of the data in the ' | ||
+ | 040a - original cataloging agency code | ||
+ | 337a - media type (string value) | ||
+ | 852a - holdings code | ||
+ | 949l - holdings code | ||
+ | |||
+ | For fixed fields, enter the three-digit tag, followed by '/', | ||
+ | 000/06 - record type | ||
+ | 008/07 - publication date | ||
+ | 008/35 - language code | ||
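To make the two notations concrete, here is a small Python sketch that tells them apart. It is based only on the examples listed above ('040a' style for variable fields, '008/35' style for fixed fields). The function name `parse_location` and the returned dictionary layout are hypothetical, and the utility itself may accept forms not covered here.

```python
import re


def parse_location(spec):
    """Split a location string into tag plus subfield code or byte offset."""
    spec = spec.strip()
    fixed = re.fullmatch(r"(\d{3})/(\d+)", spec)
    if fixed:
        return {"tag": fixed.group(1), "kind": "fixed",
                "offset": int(fixed.group(2))}
    subfield = re.fullmatch(r"(\d{3})([a-z0-9])", spec)
    if subfield:
        return {"tag": subfield.group(1), "kind": "subfield",
                "code": subfield.group(2)}
    raise ValueError(f"unrecognized location: {spec!r}")


print(parse_location("040a"))    # {'tag': '040', 'kind': 'subfield', 'code': 'a'}
print(parse_location("008/35"))  # {'tag': '008', 'kind': 'fixed', 'offset': 35}
```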
+ | |||
+ | In the 'Data length' | ||
+ | |||
+ | The Data Length option is important since it will determine, for non-coded values, the number of files created--the greater the Data Length, the greater the number of files that will be created. | ||
+ | |||
+ | The 'If data repeats' | ||
+ | |||
+ | The first option is 'First occ only', which means Split will process only the first occurrence of each content designator. So, if you are splitting on holdings code, and a holdings field has data like this: | ||
+ | |||
+ | | ||
+ | |||
+ | --only ' | ||
+ | |||
+ | On the other hand, if 'Every occ' is selected, then the program will output one copy of the record for each unique occurrence of a string. Referring to the previous example: | ||
+ | |||
+ | | ||
+ | |||
+ | --there will be one record output for ' | ||
+ | |||
+ | If ' | ||
+ | |||
+ | Finally, if 'Split to file' is selected, and a record contains more than one occurrence, that record will be output to a file named (literally) ' | ||
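The difference between 'First occ only' and 'Every occ' can be shown with a short Python sketch. It models only those two choices, and only the selection of values, not the writing of records. The function name `select_keys`, the mode strings "first" and "every", and the sample holdings codes are all made up for illustration.

```python
def select_keys(values, mode):
    """Return the data values a record would be filed under.

    values: every occurrence found in one record, in record order.
    mode:   "first" keeps only the first occurrence;
            "every" keeps each distinct value once, in first-seen order.
    """
    if mode == "first":
        return values[:1]
    if mode == "every":
        seen, unique = set(), []
        for value in values:
            if value not in seen:
                seen.add(value)
                unique.append(value)
        return unique
    raise ValueError(f"unknown mode: {mode!r}")


found = ["MAIN", "BRANCH", "MAIN"]  # hypothetical holdings codes
print(select_keys(found, "first"))  # ['MAIN']
print(select_keys(found, "every"))  # ['MAIN', 'BRANCH']
```

With "every", the record would be copied once for each value returned, so this sample record would land in two files.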
+ | |||
+ | Normalization and deduping | ||
+ | |||
+ | The following normalizations will be applied to each data value found (whether the ' | ||
+ | - blank spaces in fixed field values are replaced with '#' | ||
+ | - leading and trailing blank spaces are deleted; internal blanks are preserved | ||
+ | - any character present that is not permitted in a Windows filename is replaced with ' | ||
+ | - strings longer than the value set in the 'Data length' | ||
+ | |||
+ | If ' | ||
+ | - MARC-8 diacritics are converted to ASCII approximations | ||
+ | - All punctuation marks except '#', | ||
+ | - Consecutive blank spaces within a string are reduced to a single space | ||
+ | - the string is shifted to lowercase | ||
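The second set of normalizations might look something like the Python sketch below. Several points are assumptions, not facts about the utility: the input is taken to be already decoded to Unicode (MARC-8 decoding is not shown, and Unicode decomposition is swapped in for the diacritic conversion), punctuation is assumed to be removed and not replaced, and only '#' is treated as an exempt character because the full list of exceptions is not available here. The function name `normalize_key` is hypothetical.

```python
import re
import unicodedata


def normalize_key(value, keep="#"):
    """Approximate the optional normalization steps on one data value."""
    # Diacritics to ASCII approximations: decompose, then drop non-ASCII marks.
    decomposed = unicodedata.normalize("NFKD", value)
    ascii_only = decomposed.encode("ascii", "ignore").decode("ascii")
    # Drop punctuation, keeping letters, digits, blanks and the `keep` set.
    cleaned = "".join(
        ch for ch in ascii_only if ch.isalnum() or ch == " " or ch in keep
    )
    # Reduce runs of blanks to one, trim the ends, and shift to lowercase.
    return re.sub(r" +", " ", cleaned).strip().lower()


print(normalize_key("Vidéo  Recording!"))  # video recording
print(normalize_key("##a"))                # ##a
```

Normalizing before the values are compared means that 'Video' and 'video ' would end up in the same output file instead of two.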
+ | |||
+ | If ' | ||
+ | |||
+ | | ||
+ | |||
+ | --deduping would remove the second occurrence of ' | ||
+ | |||
+ | By default, any records that lack the specified Tag/Subf are written to a separate file, named (literally) ' | ||
+ | |||
+ | FILENAME PREFIX | ||
+ | |||
+ | This option applies to By Records and By Size splits, but not to the By Data option. | ||
+ | |||
+ | Enter up to seven characters to be used as a filename prefix. The default is the letter 'F'. Whatever you enter here will be used to name the output files. For example, if you enter ' | ||
+ | |||
+ | Notes | ||
+ | |||
+ | The number of zeroes in the file sequence numbers is fixed (at 6) and does not depend on the number of files generated. | ||
+ | |||
+ | The file extension of a MARC file created by the Split utility will always be ' | ||
+ | |||
+ | |||
+ | OUTPUT FOLDER | ||
+ | |||
+ | Enter the folder to which you want the split files to be output. The folder-select dialog (activated by pressing the ' | ||
+ | |||
+ | Note: for a split By Data type of run, the output is always written to a sub-folder of the ' | ||
+ | |||
+ | If you type in the folder name manually, be sure to end with a trailing backslash. | ||
+ | |||
+ | LAST WORDS | ||
+ | |||
+ | There are a few cases where the Split By Data option might fail. The most prominent of these is when the specified Tag/Subf results in too many files being created. The current file creation limit is 1,000, and this could easily be overrun on a MARC file of modest size (with ensuing negative consequences to your system) by selecting a common MARC field. | ||
+ | |||
+ | In fact, preventing this type of disaster is one of the reasons why Split By Data runs twice: first, in read-only mode, to discover the data and create a proposed list of files; then, after approval, it runs again to actually output the proposed files. | ||
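The kind of safety check the read-only pass makes possible can be sketched in a few lines of Python. This is an illustration only: the function name `check_proposal` is made up, and whether the utility treats exactly 1,000 files as allowed is an assumption.

```python
FILE_LIMIT = 1000  # the file creation limit quoted above


def check_proposal(keys, limit=FILE_LIMIT):
    """Return (ok, count) for the distinct values found in the first pass."""
    count = len(set(keys))
    return count <= limit, count


print(check_proposal(["eng", "fre", "eng"]))           # (True, 2)
print(check_proposal(str(n) for n in range(1001)))     # (False, 1001)
```

A field whose values are nearly unique per record (a title, say) fails this check on almost any file of more than a thousand records.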
+ | |||
+ | |||
+ | |||