Hi, Terry,
Thanks for mentioning this enhancement. I've been meaning to write about a related issue and this is a good opportunity to do so. I don't claim to be a Unicode expert but I've worked with it quite a bit and would like to offer a perspective.
First of all, I'm not sure I understand exactly what you are proposing. Your comments about the C and KC forms seem to be lumping them together without acknowledging that KC is lossy, so I hope these would be two different options.
Secondly, while I agree that the D form is inconvenient for many purposes, it's not lossy and in a sense is the most fundamental form (the C form is defined as "Canonical Decomposition, followed by Canonical Composition"). Conversions between the D and C forms are lossless and fairly trivial using software libraries available in at least some programming languages, so neither of these forms is really problematic.
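To make the distinction concrete, here is a minimal Python sketch using the standard `unicodedata` module. The sample characters (i with macron, and the "fi" ligature) are just illustrations chosen for this example; they are not taken from any particular record.

```python
import unicodedata

# i with macron as a single precomposed code point (C form)
composed = "\u012b"

# D form: base letter "i" followed by U+0304 COMBINING MACRON
decomposed = unicodedata.normalize("NFD", composed)

# Going back to the C form restores the original exactly (lossless)
round_trip = unicodedata.normalize("NFC", decomposed)

# KC is lossy: a compatibility character such as the "fi" ligature
# (U+FB01) is folded into plain "f" + "i" and cannot be recovered
ligature = "\ufb01"
folded = unicodedata.normalize("NFKC", ligature)
kept = unicodedata.normalize("NFC", ligature)  # C form leaves it alone
```

After this runs, `decomposed` is two code points, `round_trip` equals `composed`, and `folded` is the two-letter string "fi" while `kept` is still the single ligature character.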
The problem I've run into with MARCedit is that when MARCXML UTF8 input files contain composed (C form) characters and I specify conversion to MARC8, the composed characters don't get converted; instead they are turned into hex entities (e.g., &#x012B; for i with macron). This causes serious problems for display and indexing when we load these records into our Innovative system.
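For records that already contain such hex entities, one possible repair is to expand each entity back into its character and then decompose the result. The sketch below is only an illustration under stated assumptions: the function names `expand_hex_refs` and `repair_field` are hypothetical, and it assumes the entities have the `&#x...;` form with a valid code point.

```python
import re
import unicodedata

# Matches hexadecimal numeric character references such as &#x012B;
HEX_REF = re.compile(r"&#x([0-9A-Fa-f]+);")

def expand_hex_refs(text: str) -> str:
    """Replace each hex entity with the character it names."""
    return HEX_REF.sub(lambda m: chr(int(m.group(1), 16)), text)

def repair_field(text: str) -> str:
    """Expand hex entities, then decompose to the D form so that
    base letters and combining marks are separate code points."""
    return unicodedata.normalize("NFD", expand_hex_refs(text))
```

Named entities such as `&amp;` are deliberately left untouched, since only the numeric hex form is the problem described here.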
My solution for now is to pull the files into the MARCedit editor without the MARC8 conversion, compile them, run the records through a Perl script that decomposes the characters, pull these into the editor with the MARC8 option checked, then compile them again. I still end up with a few problems, but they may be caused by problems with the original data (in this case, harvested via OAI-PMH from the Hathi Trust). Of course, I'm ignoring the fact that there are many Unicode characters that can't be converted to MARC8, but that is probably a non-issue in this particular case.
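The decomposition step in that workflow amounts to very little code. The following Python sketch is not the Perl script mentioned above, only the same idea under some assumptions: the input is UTF-8 text such as MARCXML (decomposing inside binary MARC records would change byte counts, so record and field lengths would have to be recalculated), and the function name `decompose_stream` is hypothetical.

```python
import unicodedata
from typing import TextIO

def decompose_stream(src: TextIO, dst: TextIO) -> None:
    """Copy text from src to dst, converting every line to the D form.

    Normalizing line by line is safe because a line break never
    combines with the characters around it.
    """
    for line in src:
        dst.write(unicodedata.normalize("NFD", line))
```

It could be used with two files opened as `open(path, encoding="utf-8", newline="")`, so that the text is decoded as UTF-8 and line endings pass through unchanged.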
Anyway, it would be really useful to me for MARCedit to offer separate output options for UTF8 NFD, UTF8 NFC, and MARC8; NFKC would be fine, too, but I'm not sure I would use it.
Many thanks for all your good work!
Mike
--
Michael Kreyche
Systems Librarian / Associate Professor
Libraries and Media Services
Kent State University
330-672-1918
> Date: Thu, 6 May 2010 10:48:49 -0700
> From: "Reese, Terry" <[log in to unmask]>
> Subject: UTF8 changes a coming for international users
>
> I'm thinking that for most folks, this message won't be applicable -
> but I wanted to throw this out to avoid any confusion.
>
> As many people know, the state of Unicode usage in MARC, well, sucks.
> While many systems utilize UTF8 in their records, the UTF8 notation
> that is used isn't actually used in the real world. In order to allow
> for lossy conversion between UTF8 and MARC8, the recommend UTF8
> notation in MARC is to utilize combining glyphs, so you have two
> unique characters that are placed side by side to create a single
> combining character. Within an ILS system, I'm fairly certain that
> vendors normalize this data to what is called the "KC" or "C" UTF8
> notations to turn these multiple glyphs into a single, UTF8 character
> for indexing purposes. However, when dealing with XML data or
> utilizing MARC UTF8 data outside of the system, data formatted in
> this format is getting to be problematic - specifically for
> international users that tend to utilize the "KC" or "C" flavors.
>
> So what I'm going to be doing in the next update is adding a small
> option in the preferences and an option in the character conversion
> tool that will allow the user to select the current UTF-8
> translation, which represents the current recommend MARC21 UTF8
> implementation for lossless data conversion, and an additional option
> which will allow you to normalize the text - to move from the current
> state of things, to the more standardized "KC" notation. Obviously,
> data run through this normalization process won't be able to be taken
> back to MARC8 - but for my many international users, they simply
> don't care about that.
>
> I'm going to try and make sure that this change is done so that it
> doesn't cause any confusion - but I wanted to give folks a heads up
> since I'm hoping to have an update for MarcEdit early next week.
________________________________________________________________________
This message comes to you via MARCEDIT-L, a Listserv(R) list for technical and instructional support in MarcEdit. If you wish to communicate directly with the list owners, write to [log in to unmask] To unsubscribe, send a message "SIGNOFF MARCEDIT-L" to [log in to unmask]