MARCEDIT-L Archives

October 2019

MARCEDIT-L@LISTSERV.GMU.EDU

Subject:
From:
Terry Reese <[log in to unmask]>
Reply To:
MarcEdit support in technical and instructional matters <[log in to unmask]>
Date:
Wed, 16 Oct 2019 16:04:43 -0400
It uses a coefficient equivalency match.  Honestly -- it was good for what
it was for (dealing with foreign languages where it was phrase data).  This
wouldn't be good for 5xx data elements.  I've considered allowing the
Levenshtein distance algorithm to be used.  I use this for clustering --
it measures the distance (# of character edits) between two words or
phrases.  I find it more useful in most generic use cases.

--tr
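
For readers unfamiliar with the Levenshtein distance mentioned above, here
is a minimal Python sketch of the textbook algorithm. It is an illustration
only, not MarcEdit's code; the function names levenshtein and similarity are
hypothetical, and scaling the distance by the longer string's length is just
one assumed way to turn it into a percentage-style score.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character insertions,
    deletions, or substitutions needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep a match)
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed scaling: 1.0 means identical, 0.0 means nothing in common."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest
```

Under this sketch, "Includes index ." and "Includes index." are one edit
apart, so a stray space before punctuation barely lowers the score.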

-----Original Message-----
From: MarcEdit support in technical and instructional matters
<[log in to unmask]> On Behalf Of Lisa Hatt
Sent: Wednesday, October 16, 2019 2:44 PM
To: [log in to unmask]
Subject: [MARCEDIT-L] Fuzzy matching (was Re: deduplication tool very slow)

On 10/16/2019 7:48 AM, Terry Reese wrote:

> Fuzzy matching was designed specifically to work for near term
> matches.  It was introduced for a library combining authority files
> from different languages.

This question now is merely academic since the project itself has been
completed (we determined most 500 fields in our WMS local bib data were
either duplicated or sufficiently otherwise represented in the master
records, and nuked and paved them), but --

Do you have any samples for data that would be considered match/no match at
the various confidence level settings? When I was working on my project
earlier this year to try to see whether the contents of 500 fields in one
file were matched by any 500, 505, or 520 fields in another file, I turned
it down as far as it would go, but the results seemed inconsistent.
Sometimes things were considered a match and discarded from the file when
the contents of the fields were so wildly different I didn't understand how
even 65% confidence would have matched them -- or even in some cases where
the field to be matched to was entirely absent, a "match" was found anyway
-- and in other cases data that was almost identical except for spaces
around punctuation marks (the sort of thing you advised me in the first
place to turn on fuzzy matching to handle) was considered unique.
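
The punctuation-spacing case described above can also be handled outside of
any fuzzy matcher by building a comparison key before comparing fields. The
Python sketch below is an assumption-laden illustration, not a MarcEdit
feature: the helper name comparison_key is hypothetical, and the set of
punctuation marks it covers is an arbitrary choice.

```python
import re


def comparison_key(text: str) -> str:
    """Build a key for comparing note fields (not for display)."""
    # Remove whitespace on either side of common punctuation marks.
    text = re.sub(r"\s*([.,;:!?/])\s*", r"\1", text)
    # Collapse remaining runs of whitespace and trim both ends.
    return re.sub(r"\s+", " ", text).strip()
```

Two fields whose keys are equal differ only in spacing around the listed
punctuation, so they could be treated as duplicates without relying on a
confidence setting.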


--
Lisa Hatt
Cataloging | De Anza College Library
[log in to unmask] | 408-864-8459

________________________________________________________________________

This message comes to you via MARCEDIT-L, a Listserv(R) list for technical
and instructional support in MarcEdit.  If you wish to communicate directly
with the list owners, write to [log in to unmask] To
unsubscribe, send a message "SIGNOFF MARCEDIT-L" to
[log in to unmask]

