Cleaning up urls with awk
Here’s my stupid awk trick of the day: using the field separator option to mess with URLs. I spent something like an hour trying to write regular expressions and then reading other people’s solutions to cleaning up urls from log files and other sources.
For example, given a list of about a million urls like this:
http://bloggggggggy.com/path/to/the/post/2009/1/26/blahblahblah.html
http://www.bloggggggggy.com/morejunk.html
https://www.bloggggggggy.com
http://yetanotherblogomigod.blogspot.com/
http://yetanotherblogomigod.blogspot.com/somejunk.php?stuff&morestuff
I want to end up with a list that’s just
bloggggggggy.com
yetanotherblogomigod.blogspot.com
You can do this in php with some regular expressions:
preg_match("/^(https?:\/\/)?([^\/]+)/i", $URLstring, $result);
$domain = $result[2];
(Though I saw a lot of other solutions that were much longer and more involved)
or, here’s one method in Perl:
$url =~ s!^https?://(?:www\.)?!!i;
$url =~ s!/.*!!;
$url =~ s/[\?\#\:].*//;
But for some reason I was trying to do it in one line in awk, because that’s how my brain is working these days, and I couldn’t get the regular expression right.
Suddenly I realized that if I split the lines on “/”, the domain name would always be the third field.
So,
awk -F"/" '{print $3}' hugelistofurls.txt > cleanlist.txt
gave me a nicer list of urls.
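To see why the domain is always the third field: splitting on “/” turns the scheme’s two slashes into an empty second field, so the host lands in $3. A quick sketch with a made-up URL:

```shell
# Splitting "http://example.com/path/to/page.html" on "/" gives:
#   $1 = "http:"   $2 = ""   $3 = "example.com"   $4 = "path"  ...
echo "http://example.com/path/to/page.html" | awk -F"/" '{print $3}'
# prints: example.com
```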
and
awk -F"/" '{print $1 "//" $3}' hugelistofurls.txt | sort | uniq -c | sort -nr > counted-sorted-cleanlist.txt
gave me just about what I wanted.
After I did that and finished squeaking with happiness, wishing I could show someone who would care (which unfortunately I couldn’t, which is why I’m blogging it now), I realized I wanted the www stuff taken out. So I backed up and did it in two steps,
awk -F"/" '{print $1 "//" $3}' hugelistofurls.txt > cleanlistofurls.txt
awk -F"www[.]" '{print $1 $2}' cleanlistofurls.txt | sort | uniq -c | sort -nr > reallyclean-sorted-listofurls.txt
which gave me something like this:
3 http://bloggggggggy.com
2 http://yetanotherblogomigod.blogspot.com
Exactly what I wanted!
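(The two steps can also be collapsed into a single awk pass, if you don’t mind a slightly longer one-liner: sub() strips a leading www. from the host field before the line is reassembled. A sketch, reusing the filename from above:)

```shell
# One pass: split on "/", strip a leading "www." from the host field,
# then rebuild scheme://host before counting and sorting.
awk -F"/" '{sub(/^www\./, "", $3); print $1 "//" $3}' hugelistofurls.txt \
  | sort | uniq -c | sort -nr
```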
While I appreciate a nice regular expression, and it can be a fun challenge to figure one out, getting the job done with awk felt a lot simpler, and I’m more likely to remember how to do it off the cuff next time I have a giant list of urls to wrestle with.
How would you approach this same problem, either in awk or using another tool or language? Do you think one way or another is superior, and why?