New UTF8 category in MARC Analysis

In the past, MARC Analysis reported UTF-8 multi-byte sequences in a single table like this:

UTF-8 Multi-Byte Sequences (each counted as 1 character)
Hex		Visual	Count	Percent
xE3 x81 x88 	え	1	0.000
xE3 x81 x8E 	ぎ	1	0.000
xE3 x81 xA8 	と	1	0.000
xE3 x81 xB2 	ひ	1	0.000
xE3 x82 x8B 	る	2	0.000
xE3 x82 x92 	を	1	0.000
xE3 x83 x88 	ト	1	0.000
etc.
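
As a rough illustration of how a table like this could be produced, the Python sketch below counts each distinct multi-byte UTF-8 sequence in a byte string and prints it in the same hex notation. It is not MARC Analysis's own code: the function names `count_multibyte_sequences` and `hex_label` and the sample data are assumptions made for the example, and the Percent column is left out.

```python
from collections import Counter


def count_multibyte_sequences(data: bytes) -> Counter:
    """Count each multi-byte UTF-8 sequence in data (hypothetical helper).

    Every sequence counts as one character, however many bytes it uses.
    """
    counts = Counter()
    for ch in data.decode("utf-8"):
        encoded = ch.encode("utf-8")
        if len(encoded) > 1:  # single-byte (ASCII) characters are skipped
            counts[encoded] += 1
    return counts


def hex_label(seq: bytes) -> str:
    """Format a byte sequence in the 'xE3 x81 x88' style used in the table."""
    return " ".join(f"x{b:02X}" for b in seq)


# Sample data invented for this example.
sample = "をとる る".encode("utf-8")
counts = count_multibyte_sequences(sample)
for seq, n in sorted(counts.items()):
    print(f"{hex_label(seq)}\t{seq.decode('utf-8')}\t{n}")
```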

Beginning with version 2.32, a new UTF-8 category follows the one above. For each MARC tag, it reports the number of multi-byte UTF-8 sequences found in that tag and the number of records in which they were found. The category looks like this:

UTF-8 Multi-Byte Sequences by MARC Tag and Number of records: 
Tag	Seq	Records Found In
020:	113	111
100:	226	129
110:	20	13
130:	21	16
210:	5	5
222:	16	14
240:	21	15
245:	1120	414
etc.
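
The per-tag tally can be sketched in Python as follows. This is only an illustration, not the MARC Analysis implementation. It assumes each record is a simple list of (tag, value) pairs, and it assumes that "Records Found In" means the number of records in which the tag contains at least one multi-byte sequence. The names `multibyte_count` and `tally_by_tag` and the sample records are hypothetical.

```python
from collections import defaultdict


def multibyte_count(value: str) -> int:
    """Number of characters in value that need more than one byte in UTF-8."""
    return sum(1 for ch in value if len(ch.encode("utf-8")) > 1)


def tally_by_tag(records):
    """Return {tag: (sequence_count, record_count)} (hypothetical helper).

    records is assumed to be an iterable of records, each record being
    a list of (tag, value) pairs.
    """
    seq_by_tag = defaultdict(int)
    recs_by_tag = defaultdict(int)
    for record in records:
        tags_seen = set()  # tags with multi-byte data in this record
        for tag, value in record:
            n = multibyte_count(value)
            if n:
                seq_by_tag[tag] += n
                tags_seen.add(tag)
        for tag in tags_seen:
            recs_by_tag[tag] += 1  # count each record once per tag
    return {tag: (seq_by_tag[tag], recs_by_tag[tag]) for tag in sorted(seq_by_tag)}


# Sample records invented for this example.
records = [
    [("100", "をとる"), ("245", "ひと"), ("245", "Plain title")],
    [("245", "ト")],
    [("245", "ASCII only")],
]
result = tally_by_tag(records)
print("Tag\tSeq\tRecords Found In")
for tag, (seqs, recs) in result.items():
    print(f"{tag}:\t{seqs}\t{recs}")
```

Tracking the tags seen per record in a set keeps a repeated tag (two 245 fields in one record, for instance) from inflating the record count.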
232/marcanalysisutf8bytag.txt · Last modified: 2021/12/29 16:21 (external edit)