MARCEDIT-L Archives

August 2011

MARCEDIT-L@LISTSERV.GMU.EDU

From: Andy Helck <[log in to unmask]>
Reply To: MarcEdit support in technical and instructional matters <[log in to unmask]>
Date: Fri, 26 Aug 2011 02:14:03 +0000
Content-Type: text/plain

Rosalie and Kumi,

I guess we are on opposite schedules, which makes this kind of fun.

I mostly use the general find/replace function in MarcEdit rather than the subfield edit function that Terry provides. Also, I am not in front of a computer with MarcEdit installed, so I won't be able to verify these strings until tomorrow, but you are welcome to give them a try. For starters, I am guessing you have fields that look sort of like this:

=260  \\$aNew York$bScribners$c9/1/2004
and what you want to end up with is
=260  \\$aNew York$bScribners$c2004



At the heart of the matter is recognizing the original date string (let's call this the search string) and saving the 3 fields within it for later use:

([0-9]*)/([0-9]*)/([0-9]*)

This is not the only string that would work, but it goes to the basics of regular expression language. "[0-9]" means "any character from 0 to 9", and the asterisk that follows means "the character we just discussed, repeated zero or more times". So [0-9]* is a regular expression pattern that can be matched by any number of search strings, including "9", "043", "11111111111111111", etc. It also matches the empty string, i.e. the string with zero characters.

Note 3 of these constructions in the example above, each parenthesized and separated by a slash. The parentheses tell the regular expression evaluator (MarcEdit) that whatever part of the search string matches the expression inside the parentheses needs to be saved verbatim for future use. So in the example above, we are going to match 3 numbers separated by slashes and save each one for future use. "04/23/1632" is an example of a string that would be matched. The slash itself needs no special treatment: unlike a handful of characters (such as the parenthesis and the asterisk), it has no special meaning in the regular expression. If you ever want one of those special characters as a plain old character, you have to work a bit harder and put a backslash in front of it.

Okay, so let's jump into the mind of MarcEdit after it has matched a search string like "04/21/2213" against the regular expression pattern "([0-9]*)/([0-9]*)/([0-9]*)". The date came from the file of records you are processing, and the regular expression is something that you specified to MarcEdit as the thing to find. You also specified to MarcEdit what the found thing should be replaced with. You could set the replace expression to something like "1984" and completely ignore what was in the search string. But that would only be correct a very small fraction of the time, so you want an expression that specifies which of the "captured" groups you want to use. Recall we used the parentheses 3 times, and the 3rd time they were matched to what we can reasonably expect to be the year. So our replace string consists of the 3rd capture group. In your case, I guess that's about all you want. More generally, you might add your own text ("published in the year of XXXX") or combine it with other capture groups.

So how do you specify a capture group? It's pretty specific. The dollar sign introduces the capture group, and you specify which one with a number:

find: ([0-9]*)/([0-9]*)/([0-9]*)
replace: $3
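
For anyone who wants to try the idea outside MarcEdit, here is a minimal sketch in Python rather than MarcEdit's own engine. It assumes single slashes between the groups, since a slash needs no escaping in a regular expression, and the function name is made up for illustration. Note that Python writes the third capture group as \3 in the replacement, where MarcEdit's replace box uses $3.

```python
import re

# Three runs of digits separated by single slashes; each run is captured.
DATE = re.compile(r"([0-9]*)/([0-9]*)/([0-9]*)")

def year_only(text):
    """Replace every month/day/year match with its third capture group."""
    # Python spells the third group \3; MarcEdit's replace box uses $3.
    return DATE.sub(r"\3", text)

m = DATE.search("04/21/2213")
print(m.groups())                 # the three saved pieces
print(year_only("04/21/2213"))    # only the year survives
```

Because the asterisk allows zero digits, this pattern also matches a bare "//" with three empty groups. Writing [0-9]+ (one or more digits) instead of [0-9]* is a stricter alternative.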

So again, I am not in front of a computer where I can test this, and what I give you is only the 'heart' of the regular expression. If you are using MarcEdit's most general Find/Replace function (the one I am most familiar with), then this operation would change every date string in the entire file. You want to limit it to subfield c of the 260 fields. I would probably stick with what I know and write a regular expression that recognizes the entire line starting with 260, grabbing and saving the $a and $b subfields and then grabbing the date string in $c. However...

...that's really not the best way. Terry wrote a wonderful function called Edit Subfield that goes right to the part of the 260 field that you want to modify. Unfortunately, I rarely use this feature (long story) so I don't know for sure if what I have outlined above will work verbatim in his Edit Subfield function.
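
For comparison, the whole-line approach described above (not Edit Subfield) might be sketched like this in Python. It assumes each field sits on its own line in MarcEdit's mnemonic text form beginning with =260, and that $c holds a single mm/dd/yyyy date. Everything up to and including $c is saved as group 1 and the year as group 2. The function name and sample lines are illustrative only, and Python writes \1\2 where MarcEdit's replace box would use $1$2.

```python
import re

# Group 1: the tag and everything up to and including "$c".
# Group 2: the four-digit year. re.MULTILINE lets ^ match at each line start.
FIELD_260C = re.compile(r"^(=260.*\$c)[0-9]+/[0-9]+/([0-9]{4})", re.MULTILINE)

def fix_260c(record_text):
    """Shorten the date in 260 $c to its year; leave other lines alone."""
    return FIELD_260C.sub(r"\1\2", record_text)

sample = (
    "=245  10$aA title mentioning 9/1/2004\n"
    "=260  \\\\$aNew York$bScribners$c9/1/2004\n"
)
print(fix_260c(sample))
```

The date in the 245 line is left untouched because the pattern is anchored to lines that start with =260.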

There are some videos to watch on YouTube, and I will look into this more closely in about 16 hours. But feel free to give it a try. There is a bit of a learning curve to this, so it's not a bad idea to give yourself an hour or two, start with a blank file, and just type in a few lines:

2/3/1134
5/38/2012

Then you can run the example I gave you under the general find/replace (Ctrl+H) and see what happens. Be sure to copy and paste in my find and replace examples above, and click on the checkbox called "use regular expressions". Then (if you get this to work) move on to a couple of sample 260 strings taken from a real file of real records, and see how well it translates to Edit Subfield.
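
The same blank-file experiment can be rehearsed in a few lines of Python. This is only a sketch: it assumes the single-slash pattern, uses a made-up function name, and adds a third line with no date to confirm that unrelated text is left alone.

```python
import re

DATE = re.compile(r"([0-9]*)/([0-9]*)/([0-9]*)")

def run_find_replace(lines):
    """Apply the find/replace to each line, as a replace-all would."""
    return [DATE.sub(r"\3", line) for line in lines]

test_lines = ["2/3/1134", "5/38/2012", "no date here"]
for before, after in zip(test_lines, run_find_replace(test_lines)):
    print(f"{before!r} -> {after!r}")
```

Note that "5/38/2012" is matched even though 38 is not a real day of the month: the pattern checks only the shape of the string, not whether the date is valid.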


Finally, I have said nothing about your more difficult problem, which is recognizing the difference between these two cases and putting each into its simplest form:

"01/02/2003 - 01/02/2004" which becomes "2003-2004"
and the more subtle:
"1/02/2003 - 12/30/2003" which becomes simply "2003"

Suffice it to say at this point that regular expressions allow you to refer to the capture groups right away within the search string itself, so that you can in effect write a pattern that says

(any month)/(any day)/(any year) - (any month)/(any day)/(the same year as capture group #3)

This is not a literal regular expression; it's kind of half regular expression and half English, because what I call 'any day' is just '([0-9]*)' and I am being deliberately vague about "the same year as capture group #3".
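
To make "the same year as the earlier capture group" concrete, here is a hedged sketch in Python, not a tested MarcEdit recipe. It captures only the years (so the year is group 1 rather than group #3), uses [0-9]+ and a four-digit year as simplifying assumptions, and applies a small ordered set of find/replace rules. Inside a pattern, \1 means "exactly the text that group 1 matched". The rule list and function name are illustrative.

```python
import re

D = r"[0-9]+/[0-9]+/([0-9]{4})"   # one date; only its year is captured

# Order matters: the most specific rule runs first.
RULES = [
    # Same year at both ends: \1 must repeat whatever group 1 captured.
    (re.compile(D + r"\s*-\s*[0-9]+/[0-9]+/\1"), r"\1"),
    # Two dates with different years: keep both years.
    (re.compile(D + r"\s*-\s*" + D), r"\1-\2"),
    # Any date left over, e.g. an open range such as "9/1/2004 -".
    (re.compile(D), r"\1"),
]

def simplify(text):
    """Run each find/replace rule in turn over the text."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text

for example in ["01/02/2003 - 01/02/2004", "1/02/2003 - 12/30/2003", "9/1/2004 -"]:
    print(f"{example!r} -> {simplify(example)!r}")
```

This mirrors the earlier suggestion that a set of find/replace strings, run in order, may be easier than a single expression. The spacing around the hyphen in the output is a choice made in the replacement strings and can be adjusted.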


So I hope some of this makes a little sense. The regular expression thing is much bigger than MarcEdit; it's what many computer programmers use to solve difficult text processing problems. As such, there is a lot of literature about these things, but most of it is written for the specialized audience of computer programmers rather than the specialized audience of technical librarians. Also, these things have been around for 40 or more years, so there are different implementations (think Windows vs Mac, but it's not that simple). You want to know about the Microsoft version. I found the O'Reilly book by Jeffrey Friedl, Mastering Regular Expressions, useful,
http://www.amazon.com/Mastering-Regular-Expressions-Jeffrey-Friedl/dp/0596528124/ref=sr_1_1?ie=UTF8&qid=1314324640&sr=8-1
but you can also find stuff on Microsoft's own websites under regular expression.

It will seem quite daunting, so when you get distressed just sit down and change the dates by hand for a few hours, realize you have 10,000 more records to do, and learn a little more about regular expressions. It's a subject at least as deep as the tax code, but you don't need to become a complete expert. If you approach it gently and give it a week, you will probably get the hang of it. Then again, if you want to give me a sample file, I can probably have a test solution for you in about the same time frame.

Good luck, and I will verify my examples tomorrow.




Yours,

Andy Helck
Wilkinson Public Library
Telluride CO 81435
________________________________________
From: MarcEdit support in technical and instructional matters [[log in to unmask]] on behalf of Rosalie Martin [[log in to unmask]]
Sent: Thursday, August 25, 2011 6:58 PM
To: [log in to unmask]
Subject: Re: [MARCEDIT-L] Changing date format with a regular expression

Thanks Andy,
Could you provide an example of a regular expression for a replacement string for just the year:
"it is not a big deal to extract the day, month and year from your existing format of mm/dd/yyyy. Any number of regular expressions can pull out the 3 fields and allow you to create a replacement string using just the year."

Then maybe we can take it from there? (We are both pretty much novices at applying regular expressions, so we just need a jump start.)
Thanks again,
Rosalie and Kumi

On 25 August 2011 13:15, Andy Helck <[log in to unmask]> wrote:
Hi Rosalie,

Maybe someone else will have the answer you need, but if I understand your posting properly you have an interesting problem. It is not a big deal to extract the day, month and year from your existing format of mm/dd/yyyy. Any number of regular expressions can pull out the 3 fields and allow you to create a replacement string using just the year. Based on this, it's not too big a stretch to also create a regular expression that recognizes a range of dates (mm/dd/yyyy - MM/DD/YYYY) in the same format. Again, once you pull in all the fields (month, day, year) you can easily enough compose a replacement string that uses just the fields that you want to retain.

The tough part comes from your second example, where a date range that happens to be contained in a single calendar year shows up not as the redundant range "2000 - 2000", but is magically shortened to "2000". This seems like the tricky part. Not impossible, just tricky. You are going to have to create a regular expression that recognizes those specific ranges where the year is the same at the beginning of the range as at the end of the range, and compose the replacement string accordingly. You might well need a set of find/replace strings (rather than just one) to accomplish your task.

Maybe someone else has a better solution than what I have outlined, so I won't post examples right now...but if no one has a "magic bullet" solution I will be happy to try to come up with some specific regular expressions that you might try.

Yours,

Andy Helck
Wilkinson Public Library
Telluride CO 81435
________________________________________
From: MarcEdit support in technical and instructional matters [[log in to unmask]] on behalf of Rosalie Martin [[log in to unmask]]
Sent: Wednesday, August 24, 2011 5:26 PM
To: [log in to unmask]
Subject: [MARCEDIT-L] Changing date format with a regular expression

Hello,
Just wondering if someone on the list knows offhand of a regular expression to change the date format for the 260 subfield c. We have a number of EBSCOhost records which have the date appear as:

9/1/2004 -  (for an annual which began in 2004) [change to 2004- ]

12/7/2000 - 12/7/2000 (for a book with date of publication of 2000) [change to 2000]

1/1/2005 - 2/1/2006 (for an annual published between 2005 and 2006) [change to 2005 - 2006]

Any help would be greatly appreciated,
Thanks,
Rosalie

--
Rosalie Martin
Cataloguing Librarian
Electronic Resources Unit
Information Resources Division
Monash University Library
Box 4
Monash University Vic. 3800
Ph. +61 3 99052627

________________________________________________________________________

This message comes to you via MARCEDIT-L, a Listserv(R) list for technical and instructional support in MarcEdit. If you wish to communicate directly with the list owners, write to [log in to unmask]. To unsubscribe, send a message "SIGNOFF MARCEDIT-L" to [log in to unmask].




--
Rosalie Martin
Cataloguing Librarian
Electronic Resources Unit
Information Resources Division
Monash University Library
Box 4
Monash University Vic. 3800
Ph. +61 3 99052627
