Getting Lots and Lots of United States Congress Documents

Before I completely forget how I did it, I wanted to note down for myself what I did to scrape all the bills from the 111th Congress (the one prior to the one sitting now). My reason for doing this, since the reasons always come after the decision to act, was to get a body of text for the Topic Modeling for Humanities Research workshop at MITH next weekend. While I’m sure I’m not the first to consider this, it seemed to me an interesting angle to topic model a body of work of a kind that doesn’t seem to get mentioned much in DH conversations. What I’m hoping to ask of the corpus are questions about rhetoric that might then be put to previous Congresses’ work, either independently or comparatively.

In brief, I used GovTrack’s methods for accessing their raw data to download all of the bills. So far I haven’t manipulated the files at all, but I’m going to take another look before the workshop and see whether there are headers and footers worth removing, just to get rid of that noise now rather than during the workshop. My guess is I’ll extend Matthew Jockers’ Python tool for prepping Project Gutenberg texts for TEI to get this done.
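
For the header and footer cleanup, here is a minimal sketch of one possible approach. It is not Jockers’ tool and not anything GovTrack provides. It assumes the bills have already been read into memory as plain-text strings, and it treats any line that recurs in most of the documents as boilerplate, trimming such lines only from the top and bottom of each text. The function names and the 0.8 threshold are hypothetical choices made for illustration.

```python
from collections import Counter


def find_boilerplate(docs, threshold=0.8):
    """Return the non-blank lines that appear in at least `threshold` of the docs."""
    counts = Counter()
    for text in docs:
        # Use a set so each distinct line is counted once per document.
        lines = {line.strip() for line in text.splitlines() if line.strip()}
        counts.update(lines)
    cutoff = threshold * len(docs)
    return {line for line, n in counts.items() if n >= cutoff}


def strip_boilerplate(text, boilerplate):
    """Drop blank and boilerplate lines from the top and bottom of one document."""
    lines = text.splitlines()
    start, end = 0, len(lines)
    # Walk inward from the top past blank and boilerplate lines.
    while start < end and (
        not lines[start].strip() or lines[start].strip() in boilerplate
    ):
        start += 1
    # Walk inward from the bottom the same way.
    while end > start and (
        not lines[end - 1].strip() or lines[end - 1].strip() in boilerplate
    ):
        end -= 1
    return "\n".join(lines[start:end])
```

In use, the idea would be to call find_boilerplate once on the whole list of bill texts and then run strip_boilerplate over each text with the result. Because only the leading and trailing lines are trimmed, a frequently repeated line in the middle of a bill is left alone, which seems the safer default for a corpus that will be topic modeled.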