Data Cleaning and Project Ideas

This week (and a little last week), the readings are focused on digital humanities projects in museum spaces. This is a nice parallel to project work, as I’m getting started working with real data– though it seems like the week might be mostly focused on cleaning. There are a lot of questions that come along with cleaning the data that I hadn’t thought about in advance, since not all the fields are standardized or populated. The question ‘what do we do with the outliers’, or fields that aren’t completed the way we might hope they’d be, isn’t just a question of how to get the data into a workable form. Rather, it’s a pretty significant interprative act.

Update 10-24/25-19: Originally, my first small visualization project plan involved using the ‘culture’ or ‘period’ markers associated with each piece. Unfortunately, after looking at the data more closely, it turned out that these fields weren’t populated for nearly any of the pieces in the collection (out of 16243 objects, 14892 were missing values for the first category, and 16116 were missing for the second). My second, main project was supposed to use ULAN identifiers in order to gather more information about artists. While this field was more populated than the two discussed before, nearly half of the pieces were still missing this information (for 8565 pieces the ULAN id was marked ‘not found’). I expected a non-trivial number of pieces to be missing this data: since, for example, it is usually not possible to identify the creator of objects found from archeological digs, there could be no artist information associated with these sort of pieces. Still, I didn’t expect this many to be missing.

Additionally, many of the pieces that did have this information were by the same artists (as would be expected), but again, the scale was surprising. For example, the three artists most prominent in the collection are Rube Goldberg, Thomas Nast, and Andy Warhol. There are 549 of Goldberg’s pieces, 463 of Nast’s, and 326 of Warhol’s. It seems that WCMA houses a lot of cartoons! :)

Actually, it could itself be a fun small project to create a visualization of the most prominent artists in the collection that links ways to learn more about them. Though I would be a bit surprised if they don’t already have something similar!

As a result of these findings, I decided to change my project idea. The change for the smaller visualization component is of course necessary, considering the lack of data. It would still be possible to do the visualization based on ULAN, but it seems much less interesting knowing that the artists for whom this information would be available (and thus the artists that would be included in the visualization) would be so limited. Since these artists are already the most prominent in the collection (generally speaking), I don’t think it would be the most interesting focus to highlight them in yet another way, especially since I still have quite some time to be working on this!

Instead, drawing inspiration from this finding, an alternative project that might provide another interesting take on the collection is a visualization of the pieces we know the least about. These are the pieces that would fall through the cracks in most analylses, and I like the idea of visualizing them all together. Since the data is sparcely populated (both generally and, in this case, by virtue the pieces chosen), I think the most interesting approach is to use the images associated with the items. These are available for most pieces/ objects, so a visualization based on them should not have to leave too many pieces out, comparatively.

Using the thumbnail images, another visualization could be one based on the accession dates of all the pieces. This could show how the collection has grown over time. It could either be a moving graphic to show the changes in a more dynamic way, or a series separated by date ranges.

A last, more involved idea is an interactive visualization of the entire collection based on the dominant colors in the images. I know a couple different projects based on color have already been done with the collection, so those past projects should provide some insight into what types of color analyses make the most sense, as well as provide some sort of roadmap for working with the images and how to go about carrying out the analysis. The museum has a static visualization similar that sparked this idea, as I think it would be a cool piece to build upon by taking the general idea and making it interactive, and thus more accessable as a way for engaging with the pieces themselves. I don’t think this visualization is available online, but I will come back and link it if so. Alternatively, the size visualization would be another really good canditate for interactivity!

It would be especially fascinating to add an extra dimension to either of the visualizations above. For example, if we could have axes usign color, scale, or time, we might be able to see some fun relations between any combination of those.