February 2013 – Glorious Generalist

Reflections on Code4Lib 2013

This originally appeared on the ACRL TechConnect blog.

Disclaimer: I was on the planning committee for Code4Lib 2013, but this is my own opinion and does not reflect other organizers of the conference.

We have mentioned Code4Lib before on this blog, but for those who are unfamiliar, it is a loose collective of programmers working in libraries, librarians, and others interested in code and libraries. (You can read more about it on the website.) The Code4Lib conference has emerged as a venue to share very new technology and have discussions with a wide variety of people who might not attend conferences more geared to librarians. Presentations at the conference are decided by the votes of anyone interested in selecting the program, and additionally lightning talks and breakout sessions allow wide participation and exposure to extremely new projects that have not made it into the literature or to conferences with a longer lead time. The Code4Lib 2013 conference ran February 11-14 at University of Illinois Chicago. You can see a list of all programs here, which includes links to the video archive of the conference.

While there were many types of projects presented, I want to focus on those talks which illustrated what I saw as thread running through the conference–care and emotion. This is perhaps unexpected for a technical conference. Yet those themes underlie a great deal of the work that takes place in academic library technology and the types of projects presented at Code4Lib. We tend to work in academic libraries because we care about the collections and the people using those collections. That intrinsic motivation focuses our work.

Caring about the best way to display collections is central to successful projects. Most (though not all) the presenters and topics came out of academic libraries, and many of the presentations dealt with creating platforms for library and archival metadata and collections. To highlight a few: Penn State University has developed their own institutional repository application called ScholarSphere that provides a better user experience for researchers and managers of the repository. The libraries and archives of the Rock and Roll Hall of Fame dealt with the increasingly common problem of wanting to present digital content alongside more traditional finding aids, and so developed a system for doing so. Corey Harper from New York University presented an extremely interesting and still experimental project to use linked data to enrich interfaces for interacting with library collections. Note that all these projects combined various pieces of open source software and library/web standards to create solutions that solve a problem facing academic or research libraries for a particular setting. I think an important lesson for most academic librarians looking at descriptions of projects like this is that it takes more than development staff to make projects like this. It takes purpose, vision, and dedication to collecting and preserving content–in other words, emotion and care. A great example of this was the presentation about DIYHistory from the University of Iowa. This project started out initially as an extremely low-tech solution for crowdsourcing archival transcription, but got so popular that it required a more robust solution. They were able to adapt open source tools to meet their needs, still keeping the project very within the means of most libraries (the code is here).

Another view of emotion and care came from Mark Matienzo, who did a lightning talk (his blog post gives a longer version with more details). His talk discussed the difficulties of acknowledging and dealing with the emotional content of archives, even though emotion drives interactions with materials and collections. The records provided are emotionless and affectless, despite the fact that they represent important moments in history and lives. The type of sharing of what someone “likes” on Facebook does not satisfactorily answer the question of what they care about,or represent the emotion in their lives. Mark suggested that a tool like Twine, which allows writing interactive stories could approach the difficult question of bringing together the real with the emotional narrative that makes up experience.

One of the ways we express care for our work and for our colleagues is by taking time to be organized and consistent in code. Naomi Dushay of Stanford University Library presented best practices for code handoffs, which described some excellent practices for documenting and clarifying code and processes. One of the major takeaways is that being clear, concise, and straightforward is always preferable, even as much as we want to create cute names for our servers and classes. To preserve a spirit of fun, you can use the cute name and attach a description of what the item actually does.

Originally Bess Sadler, also from Stanford, was going to present with Naomi, but ended up presenting a different talk and the last one of the conference on Creating a Commons (the full text is available here). This was a very moving look at what motivates her to create open source software and how to create better open source software projects. She used the framework of the Creative Commons licenses to discuss open source software–that it needs to be “[m]achine readable, human readable, and lawyer readable.” Machine readable means that code needs to be properly structured and allow for contributions from multiple people without breaking, lawyer readable means that the project should have the correct structure and licensing to collaborate across institutions. Bess focused particularly on the “human readable” aspect of creating communities and understanding the “hacker epistemology,” as she so eloquently put it, “[t]he truth is what works.” Part of understanding that requires being willing to reshape default expectations–for instance, the Code4Lib community developed a Code of Conduct at Bess’s urging to underline the fact that the community aims at inclusion and creating a safe space. She encouraged everyone to keep working to do better and “file bug reports” about open source communities.

This year’s Code4Lib conference was a reminder to me about why I do the work I do as an academic librarian working in a technical role. Even though I may spend a lot of time sitting in front of a computer looking at code, or workflows, or processes, I know it makes access to the collections and exploration of those collections better.

This originally appeared on the ACRL TechConnect blog.

I started working on this project yesterday, but I wanted to write it up as quickly as possible so that I could see how others are approaching this issue. First of all, I should say that this approach was inspired by this article in the Code4Lib Journal: “Using XSLT and Google Scripts to Streamline Populating an Institutional Repository” by Stephen X. Flynn, Catalina Oyler, Marsha Miles.

The problem I had was a faculty member who sent a CV to a liaison for adding the items to the repository, but whose citations were not showing up in the citation databases–and I now work at an institution with all of the resources I need for this. So I wanted to go the other way, and start with the CV and turn that into something that I could use to query SHERPA/RoMEO.

It occurred to me that the best tool for this might be Google Refine (now OpenRefine, I guess), which I’ve always wanted to play around with. I am sure there are lots of other ways to do this, but I found this pretty easy to get set up. Here’s the approach I’m taking, with a version of my own CV.

Start with a CV, and identify the information you want–you could copy the whole thing or use screen scraping or what have you, but most people’s CVs are about 20 pages long and you only care about 1 or 2 pages of journal articles.
Copy this into a text editor to remove weird formatting or spacing. You want to have each citation in its own line, so if you had a CV with hanging indents or similar you would have to remove those.
Now (assuming you’ve installed and opened Google Refine), either import this text file, or copy in the text. Import it as line based text file, and don’t select anything else.
Click on Create Project>>, and it will bring it into Google Refine. Note that each line from the text file has become a row in the set of data, but now you have to turn it into something useful.
My tactic was to separate the author, date, title, journal title, and other bibliographic information into their own columns. The journal title is the only one that really matters for these purposes, but of course you want to hang on to all the information. There are probably any number of ways to accomplish this, but since citations all have a standard structure, it’s really easy to exploit that to make columns. The citations above are in APA style, and since I started with social work faculty to test this out, that’s what I am starting with, but I will adjust for Chicago or MLA in the future. Taking a look at one citation as example we see the following:

Heller, M. (2011). A Review of “Strategic Planning for Social Media in Libraries”. Journal of Electronic Resources Librarianship, 24 (4), 339-240)

Note that we always have a space after the name, an open parentheses, the date, a closed parentheses, and a period followed by a space (I’ve colored all the punctuation we want blue). So I can use this information to split the columns. To do so, select “Split into several columns” from the Edit Column menu.

Then in the menu, type in the separator you want to use, which in this case is space open parentheses. Split into two columns, and leave the rest alone. Note that you can also put a regular expression in here if necessary. Since dates are always the same length you could get away with field lengths, but this way works fine.

After this, you will end up with the following change to your data:

Now the author is in the first column, and the opening parenthesis is gone.
Following along with the same rationale for each additional field and renaming the columns we end up with (cooking show style) the following:
Each piece of information is in its own column so we can really start to do something with it.
I am sure there’s a better way to do this, but my next step was to use the journal title as the query term to the SHERPA/RoMEO API call. This was super easy once I watched the data augmentation screencast; there is documentation here as well. Open up the following option from the edit column menu:

You get a box to fill in the information about your API call. You have all kinds of options, but all you really need to do for this is format your URL in the way required by SHERPA/RoMEO. You should get an API key, and can read all about this in the article I linked to above. There are probably several ways to do this, but I found that what I have below works really well. Note that it will give you a preview to see if the URL is formatted in the way you expect. Give your column a name, and set the Throttle delay. I found 1000 worked fine.
In a copy and pastable format, here’s what I have in the box:'http://www.sherpa.ac.uk/romeo/api29.php?ak=[YOUR API KEY HERE]&qtype=starts&jtitle=' + escape(value,'url')
Now it will run the query and send back a column of XML that looks like this:
From this you can extract anything you want, but in my case I want to quickly get the pre and post-archiving allowances, plus any conditions. This took me awhile to figure out, but you can use the Googe Refine Expression Language parseHtml function to work on this. Click on Add column based on this column from the Edit Column menu, and you will get a menu to fill in an expression to do something to your data. After a bit of trial and error, I decided the following makes the most sense. This grabs the text out between the <prearchiving> element in the XML and shows you the text. You can see from the preview that you are getting the right information. Do the same thing for post-archiving or any other columns you want to isolate. The code you want is value.parseHtml().select("elementname")

Conditions are a little different, since there are multiple conditions for each journal. In that case, you would use the following syntax (after join you can put whatever separator you want): forEach(value.parseHtml().select("condition"),v,v.htmlText()).join(". ")"
Now you have your data neatly structured and know the conditions for archiving, hooray! Again, cooking show style, here’s what you end up with. You can certainly remove the SHERPA/RoMEO column at this point, and export the data as Excel or whatever format you want it in.
BUT WAIT, IT GETS BETTER. So that was a lot of work to do all that moving and renaming. Now you can save this for the future. Click on Undo/Redo and then the Extract option.

Make sure to unclick any mistakes you made! I entered the information wrong the first time for the API call, so that added an unnecessary step. Copy and paste the JSON into a text editor and save for later.
From now on when you have your CV data, you can click on the Undo/Redo tab and then choose Apply. It will run through the steps for you and automatically spit out your nicely formatted and researched publications. Well… realistically the first time it will spit out something with multiple errors, and you will see all the typos that are messing up your plan. But since the entire program is built to clean up messy data, you’re all set on that end. Here’s the APA format I described above for you to copy and paste if you like–make sure to fill in your API key: cvomatic

I hope this is useful to some people. I know even in this protean form this procedure will save me a ton of time and allow library liaisons to concentrate on outreach to faculty rather than look up a ton of things in a database. Let me know if you are doing something similar and the way you are doing it.