MARCEDIT-L Archives

May 2010

MARCEDIT-L@LISTSERV.GMU.EDU

Subject:
From:
"Reese, Terry" <[log in to unmask]>
Reply To:
MarcEdit support in technical and instructional matters <[log in to unmask]>
Date:
Mon, 10 May 2010 14:32:46 -0700
Content-Type:
text/plain
Michael, 

The changes have been implemented as follows:

1) Internally, in the MARCEngine, when translating UTF-8 data to MARC-8, the program will first normalize the data to the KD form.  This is the normalization that matches the MARC-8 spec -- so presumably, this will eliminate the need for you to normalize your data after processing and limit the number of NCR (numeric character reference) items that show up in the records.

2) In the Options, you will be able to select from two normalizations -- the KD (which will tell MarcEdit to work as it does now), or the C (canonical) -- which will translate composite characters to their primary code points.  Regardless of the option that you use, the translation to MARC-8 will remain lossless, because of the internal normalization that I'll add for MARC-8 translation to ensure that folks don't accidentally run into trouble due to this new feature.
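
To make the two forms in the list above concrete, here is a minimal sketch using Python's standard unicodedata module.  It only illustrates the Unicode normalization forms themselves; it is not MarcEdit's internal code and does not model the MARC-8 repertoire.  The helper names to_form and code_points are made up for this example.

```python
import unicodedata


def to_form(text: str, form: str = "NFKD") -> str:
    """Normalize text to a Unicode normalization form (NFC, NFD, NFKC, NFKD)."""
    return unicodedata.normalize(form, text)


def code_points(text: str) -> list:
    """List the code points in text, e.g. ['U+0069', 'U+0304']."""
    return [f"U+{ord(ch):04X}" for ch in text]


# Precomposed "i with macron", the character from the &#x12B; example.
composed = "\u012B"

# KD (and D) split it into a base letter plus a combining macron.
decomposed = to_form(composed, "NFKD")

# C puts the two code points back together into the single precomposed one.
recomposed = to_form(decomposed, "NFC")

print(code_points(composed))    # ['U+012B']
print(code_points(decomposed))  # ['U+0069', 'U+0304']
print(code_points(recomposed))  # ['U+012B']
```

The decomposed sequence (base letter followed by a combining mark) is the shape that the MARC-8 conversion expects, which is why a precomposed character that is not decomposed first can end up as an NCR.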

By default, the KD normalization option will be selected, but it would be great to hear from users whether your ILS can handle the C normalization.  Ideally, the C notation is what I would like to standardize on -- but I could use the KC notation if that was required.
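
The practical difference between the C and KC notations can be checked with a short sketch, again assuming Python's standard unicodedata module rather than anything MarcEdit does internally.  The function name round_trips and the sample string are invented for illustration.

```python
import unicodedata


def round_trips(text: str, form: str) -> bool:
    """Return True if normalizing to `form` keeps text canonically equivalent.

    Two strings are canonically equivalent when their NFD forms are identical,
    so the check compares NFD(original) with NFD(normalized).
    """
    normalized = unicodedata.normalize(form, text)
    return unicodedata.normalize("NFD", normalized) == unicodedata.normalize("NFD", text)


# Sample with an "fi" ligature (U+FB01), i with macron (U+012B),
# and a superscript two (U+00B2).
sample = "\ufb01le \u012b x\u00b2"

print(round_trips(sample, "NFC"))   # True: C only recomposes, nothing is lost
print(round_trips(sample, "NFKC"))  # False: KC folds compatibility characters

# KC replaces the ligature and the superscript with plain letters and digits.
print(unicodedata.normalize("NFKC", sample) == "file \u012b x2")  # True
```

Data normalized to C can always be decomposed again before a MARC-8 conversion, while data normalized to KC has already dropped the distinction between, for example, a superscript two and a plain "2".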

--TR

> -----Original Message-----
> From: MarcEdit support in technical and instructional matters
> [mailto:[log in to unmask]] On Behalf Of KREYCHE, MICHAEL
> Sent: Monday, May 10, 2010 8:07 AM
> To: [log in to unmask]
> Subject: Re: [MARCEDIT-L] UTF8 changes a coming for international users
> 
> Hi, Terry,
> 
> Thanks for mentioning this enhancement. I've been meaning to write
> about a related issue and this is a good opportunity to do so. I don't
> claim to be a Unicode expert but I've worked with it quite a bit and
> would like to offer a perspective.
> 
> First of all, I'm not sure I understand exactly what you are proposing.
> Your comments about the C and KC forms seem to be lumping them together
> without acknowledging that KC is lossy, so I hope these would be two
> different options.
> 
> Secondly, while I agree that the D form is inconvenient for many
> purposes, it's not lossy and in a sense is the most fundamental form
> (the C form is defined as "Canonical Decomposition, followed by
> Canonical Composition"). Conversions between the D and C forms are
> lossless and fairly trivial using software libraries available in at
> least some programming languages, so neither of these forms is really
> problematic.
> 
> The problem I've run into with MARCedit is that when MARCXML UTF8 input
> files contain composed (C form) characters and I specify conversion to
> MARC8, the composed characters don't get converted, they are turned
> into hex entities (e.g., &#x12B; for i with macron). This causes
> serious problems for display and indexing when loading these records
> into our Innovative system.
> 
> My solution for now is to pull the files into the MARCedit editor
> without the MARC8 conversion, compile them, run the records through a
> perl script that decomposes the characters, pull these into the editor
> with the MARC8 option checked, then compile them again. I still end up
> with a few problems, but they may be caused by problems with the
> original data (in this case, harvested via OAI-PMH from the Hathi
> Trust). Of course, I'm ignoring the fact that there are many Unicode
> characters that can't be converted to MARC8, but that is probably a
> non-issue in this particular case.
> 
> Anyway, it would be really useful to me for MARCedit to offer separate
> output options for UTF8 NFD, UTF8 NFC, MARC8; NFKC would be fine, too,
> but I'm not sure I would use it.
> 
> Many thanks for all your good work!
> 
> Mike
> --
> Michael Kreyche
> Systems Librarian / Associate Professor
> Libraries and Media Services
> Kent State University
> 330-672-1918
> 
> 
> > Date:    Thu, 6 May 2010 10:48:49 -0700
> > From:    "Reese, Terry" <[log in to unmask]>
> > Subject: UTF8 changes a coming for international users
> >
> > I'm thinking that for most folks, this message won't be applicable -
> > but I wanted to throw this out to avoid any confusion.
> >
> > As many people know, the state of Unicode usage in MARC, well, sucks.
> > While many systems utilize UTF8 in their records, the UTF8 notation
> > that MARC uses isn't actually what is used in the real world.  In
> > order to allow for lossless conversion between UTF8 and MARC8, the
> > recommended UTF8 notation in MARC is to utilize combining glyphs, so
> > you have two unique characters that are placed side by side to create
> > a single composed character.  Within an ILS system, I'm fairly certain
> > that vendors normalize this data to what is called the "KC" or "C"
> > UTF8 notations to turn these multiple glyphs into a single UTF8
> > character for indexing purposes.  However, when dealing with XML data
> > or utilizing MARC UTF8 data outside of the system, data in this format
> > is getting to be problematic - specifically for international users
> > that tend to utilize the "KC" or "C" flavors.
> >
> > So what I'm going to be doing in the next update is adding a small
> > option in the preferences and an option in the character conversion
> > tool that will allow the user to select the current UTF-8 translation,
> > which represents the current recommended MARC21 UTF8 implementation
> > for lossless data conversion, and an additional option which will
> > allow you to normalize the text - to move from the current state of
> > things to the more standardized "KC" notation.  Obviously, data run
> > through this normalization process won't be able to be taken back to
> > MARC8 - but for my many international users, they simply don't care
> > about that.
> >
> > I'm going to try and make sure that this change is done so that it
> > doesn't cause any confusion - but I wanted to give folks a heads up
> > since I'm hoping to have an update for MarcEdit early next week.
> 
> ________________________________________________________________________
> 
> This message comes to you via MARCEDIT-L, a Listserv(R) list for
> technical and instructional support in MarcEdit.  If you wish to
> communicate directly with the list owners, write to MARCEDIT-L-
> [log in to unmask] To unsubscribe, send a message "SIGNOFF
> MARCEDIT-L" to [log in to unmask]
