I agree. If you're de-duping based on a simple criterion (say,
001 field or ISBN), then simply extracting the relevant field(s)
and uniq'ing the list would allow you to count the dups. E.g.
I routinely compare 170,000 ebook records from one vendor (converted
to .mrk format) in their current state with nearly the same records
as they were loaded three months ago, and identify which records
are new, which are obsolete, and which records appear in both the
new and the old lists. I use plain old tools like c:\uniq -d | wc
and c:\comm -13 file1 file2 > added.txt (yes, in Windows). Both expect
sorted input. These take about a second to run on 150,000+ records.
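
A minimal sketch of that workflow, in case it helps. The file names
(old.mrk, new.mrk, and the *_ids.txt outputs) and the tiny sample records
are made up for illustration, and it assumes the match key is the 001
field, written in the mnemonic file as "=001" followed by two spaces.
Adjust the grep/cut step for whatever field you are matching on.

```shell
# Use byte-order collation so sort and comm agree on ordering.
export LC_ALL=C

# Hypothetical sample files standing in for the old and new .mrk exports.
printf '=001  %s\n=245  10$aTitle\n\n' ocm001 ocm002 ocm002 ocm003 > old.mrk
printf '=001  %s\n=245  10$aTitle\n\n' ocm002 ocm003 ocm004 ocm004 ocm004 > new.mrk

# Pull out the 001 values (the value starts at column 7) and sort them.
grep '^=001' old.mrk | cut -c7- | sort > old_ids.txt
grep '^=001' new.mrk | cut -c7- | sort > new_ids.txt

# Count the values that occur more than once in the new file.
# uniq only compares adjacent lines, which is why the input is sorted.
dup_count=$(uniq -d new_ids.txt | wc -l | tr -d ' ')
echo "duplicated 001 values in new file: $dup_count"

# comm also needs sorted input; sort -u drops the repeats first.
sort -u old_ids.txt > old_sorted.txt
sort -u new_ids.txt > new_sorted.txt

comm -13 old_sorted.txt new_sorted.txt > added.txt     # only in the new list
comm -23 old_sorted.txt new_sorted.txt > obsolete.txt  # only in the old list
comm -12 old_sorted.txt new_sorted.txt > both.txt      # in both lists
```

Note that uniq -d prints each duplicated value once, so the count is the
number of distinct values that have duplicates, not the number of extra
records.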
pfs
On Wed, Oct 16, 2019, at 10:20, Edmunds, Jeffrey wrote:
>
> Hi Lloyd,
>
> Depending on what you're basing the de-duplication on, an idea that
> occurs to me is to use Export Tab Delimited Records to output the
> field(s) on which the de-duplication is based, then open the resulting
> txt file in Excel and use Excel's de-duplicate data feature to
> count/remove the duplicates.
>
> Good luck.
>
> Jeff
>
> *From:* MarcEdit support in technical and instructional matters
> <[log in to unmask]> on behalf of Lloyd Chittenden
> <[log in to unmask]>
> *Sent:* Wednesday, October 16, 2019 10:06 AM
> *To:* [log in to unmask] <[log in to unmask]>
> *Subject:* [MARCEDIT-L] deduplication tool very slow
> I'm trying to run the MarcEdit deduplication tool on a file of about
> 100,000 records. I think it's going to take about eight hours. Is there
> a way to speed it up? Maybe I can let MarcEdit use more memory or
> something? I don't actually need to remove the duplicates, I just want
> to count them. Is there a faster way?
>
> --
> Lloyd Chittenden
> Union Catalog Coordinator
> Marmot Library Network
> [log in to unmask]
> (970)312-8668
> ________________________________________________________________________ This message comes to you via MARCEDIT-L, a Listserv(R) list for technical and instructional support in MarcEdit. If you wish to communicate directly with the list owners, write to [log in to unmask] To unsubscribe, send a message "SIGNOFF MARCEDIT-L" to [log in to unmask]
--
Paul Schaffner UM Library : Digital Content & Collections
[log in to unmask] | http://www.umich.edu/~pfs/