MARC Report 248: Program enhancements
The main enhancement in version 2.48 is to add a new capability, named 'Split By Data', to the MARC Split utility. This makes it possible to take any MARC file and split it into smaller files based on the data values found in a specified Tag/Subfield.
Split By Data: Options
MARC Split, as you know, allows you to break a large MARC file into several smaller ones. With version 2.48, there are now three methods available to split a file:
- by the number of records per file,
- by the number of bytes per file,
- by the data values in the file itself
Here is a screenshot of the updated MARC Split options form, with the new options for 'By Data' highlighted:
The By Data option might be useful if you need to split a file by the value of a MARC content designator, and that content designator contains relatively consistent data.
For example, you might need to split a MARC file
- by language code
- by holdings code
- by media type
- by publication date
and so on.
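As a rough illustration of what splitting on a content designator involves, here is a minimal Python sketch that pulls one value out of a raw MARC (ISO 2709) record, either from a position in a control field such as 008/15 or from a subfield such as 700e. This is not MARC Report's own code: the function names `fields` and `split_key`, their parameters, and the default length are hypothetical, and the sketch assumes well-formed records.

```python
# Minimal sketch, not MARC Report internals. All names here are hypothetical.

FIELD_END = b"\x1e"   # ISO 2709 field terminator
SUBFIELD = b"\x1f"    # ISO 2709 subfield delimiter


def fields(record: bytes):
    """Yield (tag, field_bytes) pairs by walking the record directory."""
    base = int(record[12:17])           # leader 12-16: base address of data
    directory = record[24:base - 1]     # 12-byte entries; skip the terminator
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        tag = entry[0:3].decode("ascii")
        length = int(entry[3:7])        # field length, terminator included
        start = int(entry[7:12])        # offset relative to the base address
        yield tag, record[base + start: base + start + length - 1]


def split_key(record: bytes, tag: str, subfield: str = "",
              offset: int = 0, length: int = 32):
    """Return the first matching value, or None when there is no hit.

    Roughly mirrors Repeat='First occ only'. The default length of 32 is
    an arbitrary choice for this sketch.
    """
    for field_tag, data in fields(record):
        if field_tag != tag:
            continue
        if tag < "010":
            # Control field (001-009): slice by character position.
            value = data[offset:offset + length]
        else:
            # Data field: skip the indicators, then look for the subfield.
            value = None
            for chunk in data.split(SUBFIELD)[1:]:
                if chunk[:1].decode("ascii", "replace") == subfield:
                    value = chunk[1:1 + length]
                    break
            if value is None:
                continue
        return value.decode("utf-8", "replace")
    return None
```

With these assumptions, `split_key(record, "008", offset=15, length=3)` would return the three-character country code, and `split_key(record, "700", "e")` the first relator term found.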
How does it work
For a quick demo, try these steps.
Select a reasonably large MARC file and then select MARC Split from the Utilities menu.
On the 'Split' form, select the By Data radio button and set up the form as follows:
Down in the Output folder section, click on the 'ellipsis button' and select a folder where the results will be output.
Now press Run. The green progress bar should start moving. Press 'Cancel' if you need to abort the job for any reason.
When the green bar reaches the end of the file, a report will pop up. Have a look at it, and then press Cancel. Read on to find out what's happening.
This utility is a bit different from the others, in that it runs two passes on the source file. The first pass is a report mode, where the program applies the options on the form to determine which files will be created. When this pass completes, it pops up the proposed results and asks you to confirm them1). If you confirm the proposed results, the utility runs a second pass to actually write the output files.
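The two-pass flow described here can be sketched in Python as follows. This is only an illustration under stated assumptions, not the utility's actual implementation: `read_records`, `key_of`, and `confirm` are hypothetical callables supplied by the caller, and the no-hits name is borrowed from the filename that appears in the sample report.

```python
from collections import Counter
from pathlib import Path

# Name used for records with no value; taken from the sample report listing.
NO_HITS = "_SplitByDataNoHits"


def first_pass(records, key_of):
    """Report pass: count records per key. Nothing is written to disk."""
    counts = Counter()
    for rec in records:
        counts[key_of(rec) or NO_HITS] += 1
    return counts


def second_pass(records, key_of, out_dir):
    """Write pass: append each raw record to the file named after its key."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for rec in records:
        key = key_of(rec) or NO_HITS
        # Opening per record is slow but keeps the sketch short. Keys are
        # used as filenames unchanged; a real tool would need to sanitize.
        with open(out_dir / f"{key}.mrc", "ab") as fh:
            fh.write(rec)


def split_by_data(read_records, key_of, out_dir, confirm):
    """Run the report pass, ask for confirmation, then run the write pass.

    read_records() must return a fresh iterator each time it is called,
    because the source is read twice.
    """
    counts = first_pass(read_records(), key_of)
    if not confirm(counts):
        return None            # user pressed Cancel: no files are created
    second_pass(read_records(), key_of, out_dir)
    return counts
```

Keeping the report pass free of side effects is what makes it safe to cancel: nothing touches the output folder until the proposed results have been accepted.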
The reason for this two-stage processing is that the Split By Data utility has the potential to create a lot of files (thousands of them), because the filenames are generated using the actual data found in the specified Tag/Subf. So in the example above, which uses the MARC Country code in the 008, the resulting output filenames might be:
nyu.mrc
cau.mrc
enk.mrc
mau.mrc
This will not normally be a problem for data like that found in fixed fields, as coded data values are usually very well controlled and relatively limited in number. But let's say that, instead of the scenario above, we tried:
You will no doubt find quite a different story, as the control we have over our variable field data is not as good. Here's a piece of a report from the same file, with Tag/Subf set to 700e instead of 008/15:
film co director                        1
film composer                           2
film diector                            2
film directer                           1
film director                        6268
film director editor of moving i        1
film director film producer             2
film director screenwriter              2
film diretor                            1
film distributor                        4
film editor                            43
film narrator                           1
film photographer                       1
film pproducer                          1
film pro ducer                          1
film procer                             1
film prodcuer                           4
film produce                            1
film producer                        7064
film producer editor of moving i        1
film proudcer                           2
film publisher                          1
film screenwriter                       2
filmmaker                              75
filmproducer                            3
films producer                          3
fim director                            1
fim producer                            1
Lots of typos here; the catalogers must not be MARC Report users at this library!
So this brings us to an important note about Split By Data:
There is no attempt to validate the contents of the specified Tag/Subf.
Keep this in mind when using this option.
To avoid a scenario where thousands of files are output to an unwitting user's hard drive, we have arbitrarily set a limit on the number of files that Split By Data will create. If that limit is exceeded, then the first pass will fail2). The limit in version 2.48 is 1,000 files. That should accommodate most needs; if it doesn't, let us know about it.
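A guard of this kind might look like the following Python sketch. Only the figure of 1,000 comes from the text above; the names `MAX_FILES`, `TooManyFilesError`, and `count_keys_with_limit` are hypothetical, and aborting as soon as the limit is passed (rather than after the whole file has been read) is an assumption about how the failure could be detected.

```python
from collections import Counter

MAX_FILES = 1000  # limit stated for version 2.48


class TooManyFilesError(Exception):
    """Raised when a split would create more output files than allowed."""


def count_keys_with_limit(keys, max_files=MAX_FILES):
    """Count records per key, failing once the distinct-key limit is exceeded."""
    counts = Counter()
    for key in keys:
        counts[key] += 1
        # Each distinct key becomes one output file.
        if len(counts) > max_files:
            raise TooManyFilesError(
                f"split would create more than {max_files} files"
            )
    return counts
```

Because the check runs during the counting pass, a badly chosen Tag/Subf is caught before any output file exists.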
Finally, just a reminder of the obvious: the MARC Split utility, and Split By Data in particular, takes a MARC file and splits it into smaller files based on your options. If you don't want all these files, and simply want a report or list of, say, every MARC Country code in your file and the number of times it appears, or every relationship designator, etc., this is not the tool to use.
Instead, use the Custom List option in MARC Analysis. It is much faster and much more flexible, and it has none of the limits that have been imposed on Split By Data.
Sample report 1: MARC Country code
MARC REPORT 2.48                                        03/14/17 3:02 PM

Split By Data on 008/15, Length=3, Repeat='First occ only'

MARC Source File:  D:\un\_big_marc\verified-161001.mrc
Records processed: 220695
Report filename:   D:\un\_big_marc\splitResults\SplitReport-17031401.txt

The current split options will generate the following results:

Number of files:   177
Number of records: 220695
Number of no-hits: 0

Filename    Record count
nyu               107916
cau                25161
mnu                 8791
meu                 8108
mau                 7364
ilu                 5136
enk                 5037
miu                 4663
onc                 3925
nju                 3462
azu                 3095
xxu                 2798
mdu                 2539
ctu                 2317
pau                 2293
ohu                 2017
flu                 1906
tnu                 1701
vau                 1655
dcu                 1637
wiu                 1574
sp#                 1574
oru                 1403
txu                 1389
cou                 1027
xx#                  990
wau                  923
inu                  781
mx#                  697
ncu                  598
vtu                  552
nmu                  537
utu                  528
###                  464
iau                  415
gau                  374
oku                  364
scu                  341
mou                  291
bcc                  262
riu                  255
at#                  253
nbu                  246
nhu                  241
nvu                  237
ksu                  210
gw#                  194
deu                  184
quc                  182
ag#                  160
mtu                  150
aru                  130
alu                  129
lau                  103
kyu                  100
%%%                   84
fr#                   79
ja#                   77
cc#                   73
vra                   70
ne#                   62
xxk                   60
si#                   50
ck#                   49
idu                   43
stk                   40
it#                   38
xxc                   36
msu                   35
wyu                   32
xna                   26
nz#                   26
ii#                   25
hiu                   24
sz#                   23
abc                   22
aku                   19
sw#                   17
ie#                   16
wvu                   14
ve#                   12
pr#                   12
ko#                   12
is#                   11
ch#                   10
au#                   10
[report truncated; another 90 items follow]
Sample report 2: RDA Carrier type
MARC REPORT 2.48                                        03/14/17 12:15 PM

Split By Data on 338a, Length=32, Repeat='First occ only'

MARC Source File:  D:\un\_big_marc\verified-161001.mrc
Records processed: 220695
Report filename:   D:\un\_big_marc\splitResults\SplitReport-17031401.txt

The current split options will generate the following results:

Number of files:   12
Number of records: 221154
Number of no-hits: 163782

Filename              Record count
_SplitByDataNoHits          163782
volume                       42741
videodisc                     9163
audio disc                    4926
other                          460
sheet                           35
computer disc                   30
object                           8
videocassette                    5
unspecified                      2
card                             1
audiocassette                    1

No records will be output more than once.