Terry (and others),
FYI, we have been in dialogue with Serials Solutions over the 2nd half
of 2009 concerning this very problem in some of the e-journals files
they've supplied to us. The files claim to be UTF-8 (according to the
Leader), but every so often there's an html entity ref, sometimes
(always? - I forget the details right now) of the "old-fashioned" sort
(&eacute;).
Personally, it would be nice if the data were correct before it reaches
the library (!), but the next best thing would indeed be to have
MarcEdit take care of it. (I look forward to the day when there isn't a
single library problem that MarcEdit can't take care of - we're well on
the way already (or Terry is, rather!). But I suspect cleaning up floods
in the toilets may remain beyond the software's powers...)
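
For anyone who needs a stopgap outside MarcEdit, here is a minimal Python sketch of the kind of cleanup being discussed. It is an illustration only, not something MarcEdit does: it assumes the field text has already been extracted as a Unicode string, and the name decode_entities is made up for this example. The list of named entities comes from the html5 table in Python's standard html.entities module. Only references that end in a semicolon are converted, so a bare ampersand in ordinary text is left alone.

```python
import re
from html.entities import html5  # standard-library table of HTML5 named entities

# Matches &#xE9; (hex), &#233; (decimal) and &eacute; (named), semicolon required.
_ENTITY = re.compile(r"&(#[xX][0-9A-Fa-f]+|#[0-9]+|[A-Za-z][A-Za-z0-9]*);")


def _decode(match):
    """Return the character for one reference, or the original text if unknown."""
    body = match.group(1)
    if body.startswith("#"):
        is_hex = body[1] in "xX"
        code = int(body[2:], 16) if is_hex else int(body[1:])
        # Skip values that are not valid Unicode scalar values.
        if 0 < code <= 0x10FFFF and not 0xD800 <= code <= 0xDFFF:
            return chr(code)
        return match.group(0)
    # Keys in the html5 table include the trailing semicolon, e.g. "eacute;".
    return html5.get(body + ";", match.group(0))


def decode_entities(text):
    """Hypothetical helper: replace HTML character references in a string."""
    return _ENTITY.sub(_decode, text)


if __name__ == "__main__":
    print(decode_entities("Caf&eacute; r&#xE9;sum&#233;"))
```

Unrecognised names are passed through unchanged, so a run over a file should be safe to repeat. Note that &amp; is also converted to a plain ampersand, which may or may not be wanted depending on where the data goes next.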
Hugh
--
Hugh Taylor
Head, Collection Development and Description
Cambridge University Library
West Road, Cambridge CB3 9DR, England
email: [log in to unmask] fax: +44 (0)1223 333160
phone: +44 (0)1223 333069 (with voicemail) or
phone: +44 (0)1223 333000 (ask for pager 036)
Reese, Terry said - in whole or part - on 03/02/2010 21:38:
> Justin,
>
> MarcEdit actually doesn't support HTML escapes because the MARC documentation specifically discourages their use, preferring numeric notation instead. Because of that, I've never added this type of conversion to the MarcEdit character filter - and honestly, am not sure where I'd get a definitive list of available HTML encodings (since their application and support - even among web browsers - has been half-hearted). I suppose if someone could provide a list, I could take a look at seeing how difficult adding this to the conversion routine would be - though, I'd also be interested in knowing if this type of dirty data affected other folks as well.
>
> --TR
>
> From: MarcEdit support in technical and instructional matters [mailto:[log in to unmask]] On Behalf Of Justin Rittenhouse
> Sent: Wednesday, February 03, 2010 12:31 PM
> To: [log in to unmask]
> Subject: [MARCEDIT-L] HTML Escaped Entities
>
> Hello Terry and others,
>
>
> We at Notre Dame have been using MarcEdit to do a lot of record cleanup for batch record loading we've been working through. We've discovered in a few of the record sets we've been working with that in the "dirtier" sets, there are a number of escaped characters. We've found that when translating to UTF, MarcEdit already handles hex-encoded characters (&#xE9; or similar). However, we've also found some of these sets include HTML escaped characters (&eacute; or similar). From what we can tell, these aren't being handled on translation. Is it possible to do this with MarcEdit already (other than a straight find and replace), or can we get this on a list of fixes for the next version or so?
>
> Thanks!
> --
> Justin Rittenhouse
> Systems Support
> Library Systems Department
> Hesburgh Libraries
> http://www.library.nd.edu/
> (574) 631-3065
>
>
> ________________________________________________________________________
>
> This message comes to you via MARCEDIT-L, a Listserv(R) list for technical and instructional support in MarcEdit. If you wish to communicate directly with the list owners, write to [log in to unmask] To unsubscribe, send a message "SIGNOFF MARCEDIT-L" to [log in to unmask]