On March 19th, I went down to NYU for a half-day conference titled “Symposium on Data Mining and Visualization for the Humanities”. It was held at an interesting space, named the Open House because it’s an entire building on Macdougal Street open for anyone at NYU to use for events on a first-come, first-served basis. It was a very warm day and the space had insufficient air conditioning (read: none), so my headache was a bit distracting by the end of the day (read: it felt like someone was pulling my right eye out of my head via my hypothalamus). Notwithstanding, I think I saw some good things.
Zephyr Frank of Stanford, the first speaker, demonstrated his work on visualizing literature, or rather, visualizing aspects of literature. Starting with a quick rundown of Google NGram, Frank moved quickly to demonstrate what he’s been able to see using some tools created at Stanford for examining texts. In particular, Frank looked at some 19th-century Brazilian plays with sophisticated tools allowing him to redefine the notion of “window of action” and identify the development of characters’ relationships through the course of the text. He argued that the tools allowed him to draw out, for example, typologies of what was permissible speech (to whom, under what circumstances). In addition, he demonstrated an animation of the text that made evident a three-node typology in which two of the characters did not speak to each other. Finally, the tools enabled mapping of words onto a 4-box matrix of gender and publicness; certain words end up being marked as public and male, such as “truth”, and others private and female, such as “hope”. As with the other talks and tools I’ll relate here, Frank didn’t suggest that this evidence drew conclusions so much as suggested avenues for exploration.
Mark Hansen, of the UCLA Department of Statistics, described some of his public art projects involving large corpora of text. Though the projects were interesting, there wasn’t a lot that was replicable in any practical sense. Most of us don’t, for example, have access to the New York Times lobby or the complete textual corpus of same. That said, his projects did contain signposts for desanctifying text and manipulating it in innovative ways.
Lev Manovich of UCSD was the final speaker and the one with the greatest whiz-bang factor. He was also the boldest speaker of the three, edging into arrogance more than once. Manovich’s essential points, readable in part in the online version of his presentation*, were 1) that humanists must study not just Big Data but Big Cultural Data to discern patterns and 2) that numbers are better than words for describing culture. I’d disagree with the latter, thinking instead that it’s the back-and-forth between numbers and words (and wordless visuals, sound, or other sensory constructs) that is newly possible and powerful. The ability to re-present non-numeric artifacts digitally means that we can, as Manovich described, find new patterns at varying scales, but we still must have a human-comprehensible language not just to communicate our findings but also to look at those patterns in ways that numbers cannot reach.
Since the talk, Manovich’s software studies group has released more of the tools they use to take large visual datasets and research them for patterns. I gave one of them — ImagePlot — a test drive, using some images downloaded from the Yale Library’s Digital Collections Search. (If there’s a known way to bulk download images using the Yale Digital Collections tool, I’d love to hear it.) Since I intentionally downloaded only a handful of images to keep iterations brief, there’s not much to see there.
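For a small batch like this, the per-image measurements a tool like ImagePlot wants (mean brightness, and the hue and saturation I mention later) can also be computed with a short script rather than inside ImageJ. Here’s a rough sketch using Pillow; the `images/` directory name is a placeholder, not anything from my actual run:

```python
from pathlib import Path

from PIL import Image
from PIL.ImageStat import Stat


def measure(img):
    """Return (mean brightness, mean hue, mean saturation) for a PIL RGB image."""
    gray = img.convert("L")                  # grayscale: brightness 0 (black) to 255 (white)
    h, s, v = Stat(img.convert("HSV")).mean  # per-band means in HSV space
    return Stat(gray).mean[0], h, s


# "images/" is a placeholder for wherever the downloaded files live.
for path in sorted(Path("images").glob("*.jpg")):
    with Image.open(path) as img:
        brightness, hue, sat = measure(img.convert("RGB"))
        print(f"{path.name}\t{brightness:.1f}\t{hue:.1f}\t{sat:.1f}")
```

This is only a sketch of one way to get the numbers; I haven’t compared its output against what ImageJ reports for the same files.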
Once I had that down, I took on a larger dataset, some 1850+ photographs from a Yale faculty member’s recent project. I’m not just being coy about the dataset; what I did would violate its CC license if I were to release my work. In any case, with that set, I was able to start seeing some things of limited value that nonetheless demonstrated how this tool might be used. Some of those results were confirmations of what could be seen without too much trouble by mildly sophisticated spreadsheet manipulation: the number of photographs taken has changed over the life of the project, the brightness spread has changed, and the plurality of the photos cluster over time in one part of the brightness scale. It is no small thing, though, that presenting the work this way makes those conclusions easy to grasp and communicate. The results also speak to a researcher’s future needs or important questions to pose. Do I need to balance out my work with photos taken during this part of the year or that? Are there fewer photos as time progresses or more? Why? Is there a correlation between the number of photos taken and the frequency of appearance of my subject matter, or is this a workflow issue? Do I need to take more daytime photos or more low-light photos? In short, what are the biases in my work, and do they represent biases of the source material or biases in my methodology or process?
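The “mildly sophisticated spreadsheet manipulation” I have in mind looks roughly like this in pandas — counts per year and the brightness spread per year. The column names and the toy values here are made up for illustration, not drawn from the actual project metadata:

```python
import pandas as pd

# Hypothetical per-photo metadata: capture date and mean brightness (0-255).
photos = pd.DataFrame({
    "filename":   ["a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg", "f.jpg"],
    "date":       pd.to_datetime(["2009-03-01", "2009-07-15", "2010-02-10",
                                  "2010-06-05", "2011-01-20", "2011-08-30"]),
    "brightness": [120.5, 180.2, 95.0, 160.7, 88.3, 150.1],
})

year = photos["date"].dt.year

# Has the number of photographs taken changed over the life of the project?
per_year = photos.groupby(year).size()

# Where do the photos cluster on the brightness scale, year by year?
spread = photos.groupby(year)["brightness"].agg(["min", "max", "mean"])

print(per_year)
print(spread)
```

A scatterplot of these same columns is essentially what ImagePlot produces, with the obvious difference that ImagePlot draws the photos themselves as the points.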
Even though neither dataset could be considered true Big Data (Manovich recommended, if I recall correctly, 1000 objects as the floor and prefers to work with 10,000 and up), I feel reasonably confident articulating some problems I had getting it working. One was that I did not see in ImageJ, the underlying image processing software, anywhere to measure some of the aspects Manovich used in his examples, particularly hue and saturation. Also, it seems that you have to have your relevant columns (filename, date, and mean brightness) as the first, second, and third columns in the tab-delimited file. Or, rather, that’s the only way I could get it to work, but I wouldn’t swear that’s the only way to make it work. I can say with some certainty that when I used the same data matrix, having the relevant data in columns 1, 5, and 8 (or 1, 7, and 8, for that matter) failed to produce the expected result, while having the data in columns 1, 2, and 3 produced what I expected. Conversely, the order of the axial data didn’t seem to matter; that is, whether I had the x-axis in column 2 and the y-axis in column 3 or vice versa seemed to be immaterial. Another issue is that the software does not process empty values, so make sure that the tab-delimited file has an appropriate value (likely 0) for missing data, or supply a clear placeholder. A final issue, and one I confirmed with the developers, is that the software does not handle chronological data as date values but rather as numeric values. You must convert (at least for the purposes of the graph) your date or datetime values to integers of some sort.
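Putting those constraints together, here is a sketch of how I’d now prepare the tab-delimited file: filename, date, and mean brightness in columns 1–3, dates converted to plain integers (days since an arbitrary epoch, one choice among many), and missing measurements replaced with 0. The filenames and values are illustrative:

```python
import csv
from datetime import date

# Hypothetical measurements; None marks a missing brightness value.
rows = [
    ("a.jpg", date(2010, 3, 1), 120.5),
    ("b.jpg", date(2010, 7, 15), None),
    ("c.jpg", date(2011, 2, 10), 95.0),
]

# ImagePlot treats dates as numbers, so pick an epoch and count days from it.
EPOCH = date(1970, 1, 1)

with open("imageplot_data.txt", "w", newline="") as f:
    writer = csv.writer(f, delimiter="\t")
    writer.writerow(["filename", "date", "brightness"])
    for name, d, brightness in rows:
        writer.writerow([
            name,
            (d - EPOCH).days,                          # date as an integer
            brightness if brightness is not None else 0,  # no empty cells
        ])
```

The 0 placeholder keeps ImagePlot from choking, but it will also plot those photos as pure black, so a more distinctive sentinel value may be the better choice depending on your data.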