Learning Python by writing a screen scraper

Just for fun I decided to write a tool for work in Python instead of Perl, and I thought I’d describe the process, partly because other people can be very opaque about how they learn things, especially how they learn technical things or approach something unfamiliar.

At work, I had a big list of URLs to look through, to check a particular detail of sidebar widget code on each blog.

Originally, I got kind of excited as I browsed this page of twill commands. I installed twill on my laptop and tried out the examples. Wooo! That was so easy! It was like screen scraping with pseudocode. So, without really looking into it any further I figured it would be easy to churn through the few thousand URLs I needed to check. It looked like maybe a day of work, or two if I was floundering around, which is often likely when learning something new.

When I actually sat down to do it, I realized that twill didn’t have any control flow commands. So there wasn’t a way within twill, fabulous as it seemed, to tell it to loop through a list of URLs. So I started to try to write it in Python. I had gone to a few of Seth’s informal Python lessons months ago and then paired with Danny a little bit. We wrote a thingie with Django to let people create random Culture names. (Result: I am the Human Sol-Terran Elizabeth Badgerina Karen da’Champions-West. I am currently travelling on a Ship, the ROU Knock Knock, Psychopath Class.) In other words, as of Monday, I could write Hello World in Python, but only if I looked at the manual first.

So I wrote pseudocode first, like this:
read in biglist.txt
for each line in biglist.txt
    split the line into $blogurl, $blogemail
    get that url's http dump
    if it has "OLDCODE" in it
        write $blogurl, $blogemail to hasoldcode.logfile
    if not
        if it has "NEWCODE" in it
            write $blogurl, $blogemail to hasnewcode.logfile

Pseudocode rocks.

The Python docs made me want to chew my laptop in half. I read them anyway, mostly the string functions but that barely helped. Instead I just brute force googled things like “how do i find a substring in a string with python” which was often helpful if only so that I felt less lonely. It took me a while just to figure out how to concatenate two strings. I kept trying to use a dot, which didn’t work!
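
For anyone googling the same two questions, here is a minimal sketch of the substring test and of string concatenation. The sample strings are made up for illustration and are not the real widget code.

```python
# Made-up sample data, just for illustration.
html = "<div class='sidebar'>OLD BIT OF CODE</div>"

# Substring test: the "in" operator gives True or False.
has_old = "OLD BIT OF CODE" in html
has_new = "NEW BIT OF CODE" in html

# str.find() gives the position instead, or -1 if the substring is absent.
position = html.find("OLD")
missing = html.find("NEW")

# Concatenation uses +, not the dot that Perl uses.
blog = "http://example.com"
email = "someone@example.com"
msg = blog + "," + email + "\n"
```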

The most helpful thing was turning up bits of other people’s code, the simpler the better, with brief explanations.

The next most helpful thing was cheating by asking Danny, who kindly just said “Oh, do this, import urllib2 and do f = urllib2.urlopen(blog) and then h = ''.join(f.readlines()) and print h.” Testing things out in the interpreter was useful too. The point isn’t whether you lift it out of a class, a book, someone else’s code, or another person. Just start out with a few lines of something that works. Then, fiddle with it and master it.
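
Danny's snippet is Python 2. A rough Python 3 equivalent might look like the sketch below, where urllib2 has become urllib.request. The helper name fetch_page, the 10 second timeout, and the choice to decode as UTF-8 are all assumptions for illustration, not part of the original advice.

```python
from urllib.request import urlopen


def fetch_page(url, timeout=10):
    """Return the body found at url as a single string (hypothetical helper)."""
    with urlopen(url, timeout=timeout) as f:
        raw = f.read()
    # Assumption: treat the page as UTF-8 and replace any bytes that don't decode.
    return raw.decode("utf-8", errors="replace")
```

In the interpreter, something like h = fetch_page("http://example.com") followed by print(h) is a quick way to see what comes back.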

But there was no way I could figure out from scratch what to do with the first giant horrible alpha-vomit error that popped up. Danny kindly IM-ed me “try: THINGIE except ValueError:” which I then googled and figured out how to use. Some googling of bits of the error message would have gotten me some useful examples. Here, I may have chickened out and asked for help too soon.
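
Here is a small sketch of the try/except ValueError pattern, applied to splitting a "url,email" line. The function name parse_line is made up for illustration; the idea is only that unpacking into two names raises ValueError when the line has too few or too many pieces.

```python
def parse_line(line):
    """Split a 'url,email' line into a tuple, or return None if it is malformed."""
    try:
        blog, email = line.strip().split(",")
    except ValueError:
        # Raised when the split produces anything other than exactly two pieces.
        return None
    return blog, email
```

A blank line or a line with an extra comma comes back as None, so the caller can skip it instead of crashing.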

I put my pseudocode into a file as comments. I wrote a file of a few test urls and emails. Then I wrote the program kind of half-assedly a couple of times, piece by piece, only trying to do one thing at a time. It mostly worked. That was the end of Day 1.

In the morning life was much better. Then I rewrote my pseudocode to include all the things I’d forgotten. I threw everything out and started over. Suddenly I felt like I knew what I was doing. A couple of hours later it all worked great.

While I was in that state of knowing-what-I-was-doing and being able to see it all clearly, it was exactly like the point in writing a poem where I know the map ahead. It is knowing the map and how to navigate and having not just the destination but having built a mental image of the entire trip. So, as with going down a road where I’ve never been before, but have imagined out the map, I feel a sense of the entire poem, all at once. It’s a holographic feeling. In that state of mind, I am very happy, and want to work without stopping.

It’s funny to talk about such a simple bit of programming that way but that’s how it felt. I also knew where I was doing something in an inelegant way, but that it was okay and I’d fix it later.

When it was all working I paused for a bit, then went back and fixed the inelegant bit.

After that, I put in some status messages so I could watch them scroll past with every URL. (A good idea since the error checking is not very thorough.)

Here is the code. I can see more ways to improve it and make it more general. It would be nicer to just use the filename as the scrolling status message, perhaps giving the files names that would look better as they scroll past. There are also stylistic questions like, I know many people would combine

outfile = open("gotsitedown.txt",'a')
outfile.writelines(msg)

into one line, but I couldn’t read that again and understand what it meant a month from now, so I tend not to write code that way. Maybe once I’m better at it.
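
For comparison, here is a sketch of the combined one-liner next to a with block, which is another common way to write the same two steps in current Python. The temporary directory is only there so the example doesn't scatter files around; the message is a made-up sample.

```python
import os
import tempfile

msg = "http://example.com,someone@example.com\n"
path = os.path.join(tempfile.mkdtemp(), "gotsitedown.txt")

# The combined one-liner: terse, and it leaves closing the file to the interpreter.
open(path, "a").writelines(msg)

# A with block keeps the two steps visible and closes the file when the block ends.
with open(path, "a") as outfile:
    outfile.write(msg)

# Read it back: both writes appended the same line.
with open(path) as infile:
    contents = infile.read()
```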


#! /usr/bin/python
# Python 2: uses urllib2 and the print statement.
import urllib2

# This reads in a list of urls & emails, comma separated.
# It checks each url for a specific phrase in its HTML
# and writes the url and email to a log file.
# The status print lines are for fun, to watch it scroll.

lines = open("biglist.txt", 'r').readlines()
for l in lines:
    line = l.strip()
    # skip any line that isn't exactly "url,email"
    try:
        (blog, email) = line.split(",")
    except ValueError:
        continue
    try:
        f = urllib2.urlopen(blog)
        h = ''.join(f.readlines())
        if 'NEW BIT OF CODE' in h:
            filename = "gotnewcode.txt"
            status = "New! "
            if 'OLD BIT OF CODE' in h:
                filename = "gotbothcodes.txt"  # replaces filename!
                status = "Mixed up codes: "
        elif 'OLD BIT OF CODE' in h:
            filename = "gotoldcode.txt"
            status = "Old code: "
        else:
            filename = "gotnocode.txt"
            status = "No code here: "
        msg = blog + "," + email + "\n"
        outfile = open(filename, 'a')
        outfile.writelines(msg)
        print status + msg
    # check for 404 or other not found error
    except (urllib2.HTTPError, urllib2.URLError):
        msg = blog + "," + email + "\n"
        outfile = open("gotsitedown.txt", 'a')
        outfile.writelines(msg)
        status = "Site down: "
        print status + msg
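
The script above is Python 2. As a hedged sketch of how the same logic might look in Python 3, here is one possible restructuring that separates the "which logfile?" decision from the fetching and writing. The function names classify, fetch_html and check_blog, the outdir parameter, and the UTF-8 decoding are assumptions added for illustration; the marker strings and filenames come from the script.

```python
import os
from urllib.error import URLError
from urllib.request import urlopen


def classify(html):
    """Pick the logfile name and status message for one page of HTML."""
    has_new = 'NEW BIT OF CODE' in html
    has_old = 'OLD BIT OF CODE' in html
    if has_new and has_old:
        return "gotbothcodes.txt", "Mixed up codes: "
    if has_new:
        return "gotnewcode.txt", "New! "
    if has_old:
        return "gotoldcode.txt", "Old code: "
    return "gotnocode.txt", "No code here: "


def fetch_html(url):
    """Download a page as text (assumes UTF-8, replacing undecodable bytes)."""
    with urlopen(url, timeout=10) as f:
        return f.read().decode("utf-8", errors="replace")


def check_blog(blog, email, fetch=fetch_html, outdir="."):
    """Check one blog, append it to the matching logfile, return the filename."""
    try:
        filename, status = classify(fetch(blog))
    except URLError:
        # HTTPError is a subclass of URLError, so 404s land here too.
        filename, status = "gotsitedown.txt", "Site down: "
    msg = blog + "," + email + "\n"
    with open(os.path.join(outdir, filename), "a") as outfile:
        outfile.write(msg)
    print(status + msg, end="")
    return filename
```

Passing the fetch function in as a parameter is a design choice that makes it possible to try check_blog with a fake page, without touching the network.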

My co-worker Julie was sitting across the table from me doing something maddeningly intricate with Drupal and at the end of the day she agreed with me that it is best to code something the wrong way at least twice in order to understand what you’re doing. “If I haven’t done it wrong three times, something is wrong.”

I wish now that I had written down all the wrong ways, or saved the wrong code, to compare how it improved. One wrong way went like this: instead of writing the 5 different logfiles of blogs with new code, old code, no code, site down, and mixed old and new code, I thought of making directories and writing all the url names to the directory. That was before I thought of putting the emails in a csv file with the urls. Why did it make sense at first? Who knows! It might be a good rule, though. The first thing you think of is probably wrong. You can’t see a more optimal way until you have walked around in the labyrinth down some possible yet wrong ways.

This WordPress Code Highlighter Plugin might be the inspiration that pushes me finally off of Blogspot and onto WordPress for this blog, so that I can do this super nicely in text and not in an image.


Joy of unit testing

(From about 2 weeks ago, late at night)

I was just vaguely napping and realized I was still thinking in my sleep about the PHP code I had just been writing. Though I barely even know PHP at all, it wasn’t that hard to just guess at it because it was mostly like Perl. My thinking in Perl is a bit stuck. Today with Oblomovka I wrote out what I wanted my program to do, then he started writing tests. At first I didn’t get that the tests didn’t actually run inside the program. My thinking was inside out. I thought I’d run a bit of code, then run something that tested if it did it right, or that error/die statements would be sprinkled around. But as I saw what Oblomovka was doing it was like a light went on and I felt like everything I’ve written has been incredibly sloppy! Works fine, tells you if it doesn’t work, but was like wearing shoes instead of making a road. Or the other way around.

It was really fun to write the very simple tests and then figure out how to send it the simplest possible thing to fake it out so that the tests would pass. So for example if you were writing a simulated ball game, you would not start by simulating a baseball game. Instead, you might vaguely sketch out what happens in a game. Then, you’d write a test that goes like, “Does a ball exist? If not, FAIL.” You would watch it fail. It’s supposed to. Then think of the smallest thing it needs to do to pass. Your program would then merely need to go, “Oh hai. I’m a baseball” and the test would pass. You’d write another test that goes, “Is there a bat?” and “Is a baseball coming at my bat?” As you write fake bats, balls, and ball-coming-at-you actions, the baseball game starts to take shape. All the tests have to keep passing. The structure of how to build it becomes more clear, in a weird way. This isn’t quite the right analogy. I can’t quite get into the way of thinking and end up just hacking quickly on ahead. But for a little while, I felt the rightness of this way of doing things.
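
The ball game idea can be sketched with Python's unittest module (the session described here was in PHP, so this is a translation of the idea, not the code we wrote). The class names Baseball and Bat and the greet method are invented for the example. Each class does only the smallest thing needed to make its test pass.

```python
import unittest


class Baseball:
    """The smallest possible ball: it exists and can say so."""

    def greet(self):
        return "Oh hai. I'm a baseball"


class Bat:
    """The smallest possible bat: it only has to exist for now."""


class TestBallGame(unittest.TestCase):
    def test_ball_exists(self):
        # "Does a ball exist? If not, FAIL."
        self.assertIsInstance(Baseball(), Baseball)
        self.assertEqual(Baseball().greet(), "Oh hai. I'm a baseball")

    def test_bat_exists(self):
        # "Is there a bat?"
        self.assertIsInstance(Bat(), Bat)
```

Saved in a file, this runs with python -m unittest followed by the file name. Deleting the Bat class and running it again shows the failure that is supposed to come first.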



Code that isn’t at all poetry, but that is structure & patterns

Happy Poetry Month! Rather than poeting, for the past few days I’ve been twiddling with code. It is much the same state of mind as translating, or basic composition, but for me at least, not quite poetry. It does require moving a bunch of words around, arranging them, and imagining their interpretation, organizing words in order to have an effect. For poetry or composition, an effect on a listener/reader, so you are imagining a logical and emotional state and the interpretation and effect of a person. For code obviously you are writing so that a machine will follow your orders perfectly; but less obviously you are writing for yourself in the future and for other future, human readers of your procedural pattern of thought. You are writing for your future (self or other) human, so they can modify and extend that code and put it to other uses. In other words, it has a bit in common with an oral or folk tradition. Repetition and patterns are good in poetry if you want to create structure for extension and improvisation.

So, just now I was doing some of my baby-Perl for some contract work. And the deal is, there are a bunch of users, and their accounts go through various bureaucratic steps, and through various work people and departments, some steps requiring others to happen first, for the account to become fully active. This is a fairly common situation for any institution. So, I had a Perl script that would take some command line options and then would do various things with the user and account data. As more people started realizing I could manipulate account stuff, and could generate reports, etc, they started asking me for new tasks. So, the hacky little script grew very quickly to a giant horrible tangly mess full of regular expressions that I did not understand anymore, 30 minutes after I wrote and tested them. A reg exp is a thing of beauty but it is not a joy forever. Instead, it makes my head hurt.

So I started about 4 times over this last month to rewrite that mess to make it easily understandable and extensible. I scribbled and thought on post-it notes so I could try to break down what needed to happen into chunks that I could move around & visualize, easier than in a 200 line text file.

It went kind of like this: use Getopt::Long to tell from the command line what kind of report or change is required. Log in to several systems. Iterate over a range of account IDs in a big loop. Then do some http page getting and parsing. Then a lot of if else statements to see what command line option is turned on. Mixed in with some more tests and if elsing. Then again depending on command line option, do some other junk, write to some other web pages and outfiles. At the end of all that looping, write some more outfiles.

Ugh! You can see that any new capability meant that I had to do more page parsing and more reg exping, as well as thinking through all the logic of the whole if-else mess.

Today I suddenly realized several things. Speed doesn’t matter for this. I can set it going and let it chug away.

So, number one, for each user ID, just read in all the possible pages that have info on that account. It is only 5 web pages on 3 different systems. Read them in and parse out all their fields.

Number two, think of each account as having a state. There are 8 different bits of information that change account state, out of the 50+ possible bits of info. So, after parsing all the pages, look at the 8 pieces of information I care about, to determine the account’s state.

The things I then want to do fall into two categories: reporting and state change. Reporting is easy. For changing account state, I can define, for each transition from one state to another, what actions it takes to change the account state. There are objects, and states, and transactions.
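
Here is a minimal sketch of that shape, in Python and not the Perl of the actual script. Everything specific in it is invented for illustration: the three states, the field names approved_by and password_set, and the action names. The real script had 8 deciding fields, which are not listed here.

```python
from enum import Enum


class State(Enum):
    # Hypothetical states; the real ones depend on the institution.
    REQUESTED = "requested"
    APPROVED = "approved"
    ACTIVE = "active"


def account_state(fields):
    """Work out an account's state from the few parsed fields that matter."""
    if fields.get("approved_by") and fields.get("password_set"):
        return State.ACTIVE
    if fields.get("approved_by"):
        return State.APPROVED
    return State.REQUESTED


# (from state, to state) -> the ordered actions that make the change happen.
TRANSITIONS = {
    (State.REQUESTED, State.APPROVED): ["record_approval"],
    (State.APPROVED, State.ACTIVE): ["set_password", "send_welcome_email"],
}


def actions_for(current, target):
    """Look up what it takes to move an account from one state to another."""
    try:
        return TRANSITIONS[(current, target)]
    except KeyError:
        raise ValueError(
            f"no defined transition from {current.name} to {target.name}"
        )
```

With this layout, a report is a loop that calls account_state on each account, and a new capability is one more entry in the table, not another branch in the if-else mess.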

I have never really understood object-orientedness no matter how many times I think about it, and use and write code that is, in theory, sort of object-oriented. It’s not like I get it now, but I get it more than I did.

Suddenly everything clicked into place and I understood how to write the code in a way that would be useful and elegant. I understood the root of the problem. It all fell together in a system. It looked like a pattern, like information that was beautiful. I know, it is a bunch of account data in a bureaucratic procedure. After years of being “programmer analyst” doing back end tools for university departments, I had to find beauty where I could. The “click” feeling means I look back on my month of sporadic attempts to write this program, and it looks like I was brain-deadishly trying to make something out of legos by gluing their corners together, when all the while I could have been snapping them together how they are supposed to go. But, before, I could not see the intersections.

So, just now I had the exhilarating (yet slightly shame-faced) feeling that I had just reinvented the wheel, or some basic principle of computer science that if I had any sense, I would have known from taking some classes. On the other hand, taking computer science classes doesn’t guarantee you know what you are doing or can build something that other people find useful & usable.
