X-CAT – Concatenate XML files

Because a well-formed XML document must have exactly one root element, XML files cannot be joined using a typical concatenation utility.

X-Cat attempts to perform an XML-aware concatenation of the selected files.
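
To illustrate why a plain byte-for-byte join is not enough, here is a minimal Python sketch (not part of X-Cat; the sample documents and the helper name `is_well_formed` are made up for this illustration). Joining two complete documents end to end produces two root elements, which an XML parser rejects.

```python
import xml.etree.ElementTree as ET

# Two small, individually well-formed documents (illustrative data only).
doc_a = "<collection><record>1</record></collection>"
doc_b = "<collection><record>2</record></collection>"

def is_well_formed(text):
    """Return True if text parses as a single XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

# A typical concatenate utility simply places one file after the other.
naive = doc_a + doc_b

print(is_well_formed(doc_a))  # each input parses on its own
print(is_well_formed(naive))  # the joined text has two root elements and does not parse
```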

BASIC CONCATENATION STEPS

The basic steps for concatenating XML files are the same as for any other concatenation process.

First, choose the files that you want to concatenate, or join together. Second, choose the name and location for your results. And third, start the concatenation program.

The first time you run the program, the dialogs that select files will open in your My Documents folder. After that, however, these file selection dialogs will open in the last selected folder.

There are two ways to select files: using an explorer dialog, or setting filename patterns.

SELECT FILES WITH EXPLORER

To select files with explorer, click on the 'Select/explorer' tab.

To launch an explorer dialog, click anywhere in the open space of this tab. You may then navigate to folders and select files (to select multiple files at the same time, hold down the <Ctrl> key or <Shift> key during selection). When you press the 'Open' button, the selected files will be added to the 'Select/explorer' window.

Alternatively, you may drag and drop XML files from an open explorer window onto the 'Select/explorer' window.

Note that every file in this window will have a checkbox next to it. If you uncheck a file, it will be excluded from the concatenation processing. Click the 'File|Clear File List' menu to empty the Source Files list and start from scratch.

SELECT FILES WITH A PATTERN

To select files using a filename pattern, click on the 'Select/patterns' tab. Enter the file pattern(s) you want to match in the window below; the typical filename pattern for this program would be:

*.xml

If you do not specify a path in your pattern, then the program will look for the files in the same folder as the xcat.exe (which probably is not what you want). To change this behavior, enter a path below (in the box labelled 'Set a relative path'). This will become the starting path that the program will use for all patterns entered above.

Once the patterns have been entered, click the 'Validate Patterns' button on the right. The program will apply the patterns and report the number of files that match each pattern.

If you want to select files from multiple folders, it may be necessary to enter a fully qualified path for each pattern.
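
The pattern handling described in this section can be sketched in Python as follows. This is only an approximation under stated assumptions: it uses Python's `glob` wildcard rules, which may differ from those X-Cat applies, and the function name `count_matches` and its parameters are hypothetical.

```python
import glob
import os

def count_matches(patterns, base_path="."):
    """Return {pattern: number of matching files}.

    Relative patterns are resolved against base_path (the role played by
    the 'Set a relative path' box). os.path.join leaves an absolute
    pattern unchanged, so fully qualified patterns ignore base_path.
    """
    counts = {}
    for pattern in patterns:
        full_pattern = os.path.join(base_path, pattern)
        # Count regular files only; folders that happen to match are skipped.
        matches = [p for p in glob.glob(full_pattern) if os.path.isfile(p)]
        counts[pattern] = len(matches)
    return counts
```

Calling `count_matches(["*.xml"], base_path)` with a folder of your choice reports how many files each pattern would select, much as the 'Validate Patterns' button does.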

RESULTS FILE

Once the file selections have been completed, return to the main tab ('Results file') and set the filename for the results. Clicking on the results filename box will generate a default results filename in the format:

xcat-results-YYMMDDnn.xml

where YYMMDD will be the current date, and 'nn' will be a unique sequence number.

You may use the default filename, or change it to whatever you wish.
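
A default name in this format could be generated along the following lines. This is a hedged sketch, not X-Cat's actual logic: the function name `default_results_name` is hypothetical, and it assumes that 'nn' starts at 01 and that the first name not already present in the target folder is chosen.

```python
import os
from datetime import date

def default_results_name(folder, today=None):
    """Return the first unused name of the form xcat-results-YYMMDDnn.xml."""
    today = today or date.today()
    stamp = today.strftime("%y%m%d")  # YYMMDD
    # Assumption: the sequence number runs from 01 to 99.
    for seq in range(1, 100):
        name = f"xcat-results-{stamp}{seq:02d}.xml"
        if not os.path.exists(os.path.join(folder, name)):
            return name
    raise RuntimeError("no free sequence number left for this date")
```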

RUN

Once the file selection is complete, and the results filename has been specified, press 'Run' to concatenate the files. There is an 'Abort' button which may be used if you want to stop the concatenation.

When you press 'Run', the program will copy each selected file in turn to the results file. The status bar at the bottom of the form will provide an indication of which file is currently being processed; the progress bar above it will indicate how far the overall processing has to go.

RECORD COUNTS

There are no 'record' counts in X-Cat as (unlike MARC) there's no way to really know what constitutes a 'record'.

DUPLICATE FILES

Duplicate files are not removed from the file list, but they are left unselected by default. For example, if you select a file more than once, each extra copy will be added to the Source Files list, but its checkbox will not be selected. If you want these files to be concatenated more than once, you will have to manually check them.
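
The behaviour can be modelled with a small Python sketch. The data structure and the function names `add_source_file` and `files_to_concatenate` are hypothetical and are used only to illustrate the rule; they do not reflect X-Cat's internals.

```python
def add_source_file(source_list, path):
    """Append path to the list; a repeated path is added unchecked."""
    already_listed = any(entry["path"] == path for entry in source_list)
    source_list.append({"path": path, "checked": not already_listed})

def files_to_concatenate(source_list):
    """Return only the checked entries, in list order."""
    return [entry["path"] for entry in source_list if entry["checked"]]

sources = []
add_source_file(sources, "a.xml")
add_source_file(sources, "b.xml")
add_source_file(sources, "a.xml")   # duplicate: listed, but unchecked
print(files_to_concatenate(sources))

sources[2]["checked"] = True        # the user manually checks the duplicate
print(files_to_concatenate(sources))
```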

XML DOCUMENT PROCESSING DETAILS

This program attempts to process all of the files specified as XML documents.

The first task is to test-load each file. This step validates the XML structure and extracts the document element. Any file that fails this step will be removed from the list of files specified.
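
A test-load step of this kind might look like the following Python sketch. It uses the standard `xml.etree.ElementTree` parser as a stand-in for whatever parser X-Cat uses, and the function name `preload` is hypothetical.

```python
import xml.etree.ElementTree as ET

def preload(paths):
    """Try to parse each file.

    Returns (loadable, rejected): loadable maps each parsable path to the
    tag of its document element; rejected lists the files that failed.
    Note: for namespaced documents ElementTree reports tags in the form
    '{namespace-uri}localname'.
    """
    loadable, rejected = {}, []
    for path in paths:
        try:
            loadable[path] = ET.parse(path).getroot().tag
        except (ET.ParseError, OSError):
            # Not well-formed, or unreadable: drop it from the list.
            rejected.append(path)
    return loadable, rejected
```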

The second task is to check the document elements extracted by the first step. For best results, they should all be the same. If they are not, the program will display a warning, listing the variant document elements that it found. If you override this warning, the program will continue to concatenate the files, even though it is possible the results will be corrupt.
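
The consistency check amounts to comparing the document element names collected during the test-load. A minimal Python sketch, with the hypothetical function name `variant_document_elements`:

```python
def variant_document_elements(root_tags):
    """Return the distinct document element names if there is more than one.

    An empty list means all files share the same document element, so no
    warning is needed.
    """
    distinct = sorted(set(root_tags))
    return distinct if len(distinct) > 1 else []

# All files agree: nothing to warn about.
print(variant_document_elements(["collection", "collection"]))

# Mixed document types: these names would be listed in the warning.
print(variant_document_elements(["collection", "OAI-PMH", "collection"]))
```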

Once this pre-processing is taken care of, the program begins to concatenate the files.

In general, for each file, the program collects all children of the document element (or root node).

For example, in a MARCXML or MODS file, the root node might be '<collection>' or '<modsCollection>', and thus, the children will be all of the '<record>' or '<mods>' elements.

Or, in an OAI file, the root node might be '<OAI-PMH>', and the children would be the '<responseDate>', '<request>', and '<ListRecords>' elements.

The first file in the file list is treated slightly differently than the rest. The first file is itself loaded into the resulting XML document; from there the program makes a list of its top-level elements. Then, each subsequent file is loaded into a scratch XML document, and the program searches each file for top-level elements that match the list created from the first file. Each matching element is then appended to the resulting XML document.
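
The procedure just described can be sketched in Python as follows. This is an illustration under stated assumptions, not X-Cat's source: it works on XML strings rather than files, uses `xml.etree.ElementTree`, and the function name `concatenate` is hypothetical.

```python
import xml.etree.ElementTree as ET

def concatenate(documents):
    """Concatenate a list of XML strings; return the resulting root element."""
    # The first document becomes the resulting document.
    result = ET.fromstring(documents[0])
    # List of top-level element names found in the first document.
    wanted = {child.tag for child in result}

    for text in documents[1:]:
        # Each later document is loaded into a scratch tree.
        scratch = ET.fromstring(text)
        for child in scratch:
            if child.tag in wanted:
                result.append(child)
    return result

first = "<collection><record>1</record></collection>"
second = "<collection><record>2</record><record>3</record></collection>"
merged = concatenate([first, second])
print(ET.tostring(merged, encoding="unicode"))
```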

In the event that different types of documents are being concatenated, and the warning about doing this was overridden, it's possible that a subsequent document may not match any top-level nodes from the first document. In this case, the program builds a new list of top-level nodes for that document, and appends them ('mashing' might be a more applicable term) into the resulting XML document.
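
As a rough Python sketch of this fallback (the function name `elements_to_append` is hypothetical, and the exact rule X-Cat applies may differ): when none of a document's top-level elements match the list from the first file, all of its top-level elements are taken instead.

```python
import xml.etree.ElementTree as ET

def elements_to_append(scratch_root, wanted_tags):
    """Choose which top-level elements of a later document to append."""
    matches = [child for child in scratch_root if child.tag in wanted_tags]
    if matches:
        return matches
    # No overlap with the first document: fall back to every top-level
    # element of this document (the 'mashing' case).
    return list(scratch_root)

marc = ET.fromstring("<collection><record/></collection>")
mods = ET.fromstring("<modsCollection><mods/><mods/></modsCollection>")
wanted = {"record"}
print([e.tag for e in elements_to_append(marc, wanted)])
print([e.tag for e in elements_to_append(mods, wanted)])
```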

There's one other special type of handling that may take place. Consider several OAI files with a top-level structure like so:

File 1:

<OAI-PMH>
<responseDate>
<request>
<ListRecords>

File 2:

<OAI-PMH>
<responseDate>
<request>
<ListRecords>

and so on.

Using the logic described above, the resulting XML document will look like this:

<OAI-PMH>
<responseDate>
<request>
<ListRecords>
<responseDate>
<request>
<ListRecords>
...

</OAI-PMH>

There's nothing wrong with this.

However, it is possible to eliminate the extra <responseDate> and <request> elements from the results by specifying the '<ListRecords>' element on the XML Options tab of the program.

This 'XML Options' tab contains a single option. If it is empty or unchecked, the standard processing of top-level elements applies. Otherwise, only the element entered in this option will be concatenated from subsequent files to the first.
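
The effect of this option on the OAI example can be sketched in Python. The function name `concatenate_oai` and its `only_tag` parameter are hypothetical stand-ins for the option, the element name is given here without angle brackets, and the sample documents use empty elements for brevity.

```python
import xml.etree.ElementTree as ET

def concatenate_oai(documents, only_tag=None):
    """Concatenate XML strings, optionally copying only one element name."""
    result = ET.fromstring(documents[0])
    if only_tag:
        # Option set: only this element is taken from subsequent files.
        wanted = {only_tag}
    else:
        # Option empty: standard processing of top-level elements.
        wanted = {child.tag for child in result}
    for text in documents[1:]:
        for child in ET.fromstring(text):
            if child.tag in wanted:
                result.append(child)
    return result

page = "<OAI-PMH><responseDate/><request/><ListRecords/></OAI-PMH>"

standard = [c.tag for c in concatenate_oai([page, page])]
filtered = [c.tag for c in concatenate_oai([page, page], only_tag="ListRecords")]
print(standard)  # responseDate and request appear once per file
print(filtered)  # responseDate and request appear only once, from the first file
```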

phelp/helpxcat.txt · Last modified: 2021/12/29 16:21 (external edit)
CC Attribution-Share Alike 4.0 International