Cleaning up URLs with awk

Here’s my stupid awk trick of the day: using the field separator option to mess with URLs. I spent something like an hour trying to write regular expressions, and then reading other people’s solutions, for cleaning up URLs from log files and other sources.

For example, given a list of about a million urls like this:
http://bloggggggggy.com/path/to/the/post/2009/1/26/blahblahblah.html
http://www.bloggggggggy.com/morejunk.html
https://www.bloggggggggy.com
http://yetanotherblogomigod.blogspot.com/
http://yetanotherblogomigod.blogspot.com/somejunk.php?stuff&morestuff
I want to end up with a list that’s just
bloggggggggy.com
yetanotherblogomigod.blogspot.com
You can do this in PHP with some regular expressions (note the `https?` — with plain `http` the optional scheme group silently fails to match https URLs and you get `https:` back as the “domain”):

preg_match("/^(https?:\/\/)?([^\/]+)/i", $URLstring, $result);
$domain = $result[2];
(Though I saw a lot of other solutions that were much longer and more involved)
or, here’s one method in Perl:
$url =~ s!^https?://(?:www\.)?!!i;
$url =~ s!/.*!!;
$url =~ s/[\?\#\:].*//;
But for some reason I was trying to do it in one line in awk, because that’s how my brain is working these days, and I couldn’t get the regular expression right.

Suddenly I realized that if I split the lines on “/”, the domain name would always be the third field.

So,
awk -F"/" '{print $3}' hugelistofurls.txt > cleanlist.txt
gave me a nicer list of urls.
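To see why $3 is the right field: splitting on “/” makes $1 the scheme (“http:” or “https:”), $2 the empty field between the two slashes, and $3 the host. A quick sanity check on a couple of the sample lines (same awk command, with the input inlined via printf):

```shell
# Splitting on "/" puts the host in field 3 for both
# http:// and https:// URLs.
printf '%s\n' \
  'http://bloggggggggy.com/path/to/the/post/2009/1/26/blahblahblah.html' \
  'https://www.bloggggggggy.com' \
  | awk -F"/" '{print $3}'
# prints:
#   bloggggggggy.com
#   www.bloggggggggy.com
```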

and
awk -F"/" '{print $1 "//" $3}' hugelistofurls.txt | sort | uniq -c | sort -nr > counted-sorted-cleanlist.txt

gave me just about what I wanted.

After I did that and finished squeaking with happiness, wishing I could show someone who would care (which unfortunately I couldn’t, which is why I’m blogging it now), I realized I wanted the www stuff taken out. So I backed up and did it in two steps,

awk -F"/" '{print $1 "//" $3}' hugelistofurls.txt > cleanlistofurls.txt
awk -F"www\\." '{print $1 $2}' cleanlistofurls.txt | sort | uniq -c | sort -nr > reallyclean-sorted-listofurls.txt

which gave me something like this:

3 http://bloggggggggy.com
2 http://yetanotherblogomigod.blogspot.com

Exactly what I wanted!
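The two passes can also be collapsed into a single awk program by stripping the leading “www.” with sub() before printing (a sketch, reusing the hugelistofurls.txt filename from above; sub() is standard awk):

```shell
# One pass: rebuild scheme://host from fields 1 and 3,
# stripping a leading "www." from the host with sub().
awk -F"/" '{host = $3; sub(/^www\./, "", host); print $1 "//" host}' \
  hugelistofurls.txt | sort | uniq -c | sort -nr > counted-sorted-cleanlist.txt
```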

While I appreciate a nice regular expression, and it can be a fun challenge to figure them out, getting the job done with awk felt a lot simpler, and I’m more likely to remember how to do it off the cuff next time I have a giant list of URLs to wrestle with.
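For comparison, cut can do the same field split if all you want is the bare hostnames (a sketch; sed does the www-stripping here):

```shell
# Same idea with cut: field 3 of a "/"-delimited line is the host;
# sed then strips any leading "www.".
cut -d/ -f3 hugelistofurls.txt | sed 's/^www\.//' | sort | uniq -c | sort -nr
```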

How would you approach this same problem, either in awk or using another tool or language? Do you think one way or another is superior, and why?

Related posts:

Hard drive down!

The ominous clicking noise from my hard drive should have given me a clue. Backup was on my to-do list, but never a priority. That’s why I’m talking to you from August 8th, when I last copied my entire hard drive with Carbon Copy Cloner over to my glossy & beautiful Western Digital Passport 120GB USB drive. I’ve got amnesia in my exoskeleton. It’s horrible!

The sudden crash, and the fact that the drive wouldn’t come up from my laptop, from a bootup CD, or from another laptop in target disk mode, probably means my data is intact on the drive but the drive’s controllers are messed up. I got a quote over the phone from Drive Savers up in Novato: something like $600 to $3900, with the low estimate applying only if they could get just a little bit of garbled data off and didn’t have to disassemble the drive. IntelliRecovery is in Hunters Point and cheaper – $400 to $1600. Can I justify spending $1000 for the last two months of my scripts, work data, email, book editing project, and music? It’s a close call, because that’s probably about what my time is worth to reconstruct everything and re-do all the work I’ve lost.

The evening of the crash, I took my MacBook to the Apple Store. They said it would be around $300 and 5-7 business days to send my laptop out and put a new drive in it.

The PowerBook Guy office just around the corner from the San Francisco Apple Store replaced my hard drive and gave it back to me in 2 hours. So I’m up and running again.

I think my future backup plans will be to do a full backup to my pocket hard drive every week as part of my work routine. And every night I will back up the work and book-editing files.

It was interesting to see which bits of the computer are crucial for me to feel comfortable. My Firefox profile is way more key than I realized. My Thunderbird profile is also very useful. Adium contacts. The keychain. My various .rc files. Ecto. My Greasemonkey scripts and other Python and Perl stuff for work. But with just the Firefox profile and a term window I can be up and running at a basic level from my own shell accounts (on pair and dreamhost). So now I’m trying to come up with some super basic set of “junk that I need” that I could carry around on my tiny flash drive.

Go and back up your data right now, by the way!
