Categories: Libraries

Analyzing CVs for publisher copyrights and self-archiving with OpenRefine

This originally appeared on the ACRL TechConnect blog.

I only started working on this project yesterday, but I wanted to write it up as quickly as possible so that I could see how others are approaching this issue. First of all, I should say that this approach was inspired by an article in the Code4Lib Journal: “Using XSLT and Google Scripts to Streamline Populating an Institutional Repository” by Stephen X. Flynn, Catalina Oyler, and Marsha Miles.

The problem I had was a faculty member who sent a CV to a liaison to have the items added to the repository, but whose citations were not showing up in the citation databases, even though I now work at an institution with all of the database resources I would need for that approach. So I wanted to go the other way: start with the CV and turn it into something that I could use to query SHERPA/RoMEO.

It occurred to me that the best tool for this might be Google Refine (now OpenRefine, I guess), which I’ve always wanted to play around with. I am sure there are lots of other ways to do this, but I found this pretty easy to get set up. Here’s the approach I’m taking, with a version of my own CV.

  1. Start with a CV, and identify the information you want. You could copy the whole thing, or use screen scraping, or what have you, but most people’s CVs are about 20 pages long and you only care about the 1 or 2 pages of journal articles.
  2. Copy this into a text editor to remove weird formatting or spacing. You want each citation on its own line, so if you had a CV with hanging indents or similar you would have to remove those (a small script, sketched below, can help with that).
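    If the excerpt has hanging indents, a few lines of Python can do that cleanup. This is just a sketch under assumptions: a plain-text file, continuation lines marked by leading whitespace, and hypothetical file names.

        # Join hanging-indent continuation lines so each citation is one line.
        with open('cv_excerpt.txt', encoding='utf-8') as f:
            raw = f.read()

        citations = []
        for line in raw.splitlines():
            if not line.strip():
                continue  # skip blank lines
            if line[:1].isspace() and citations:
                # Indented line: continuation of the previous citation.
                citations[-1] += ' ' + line.strip()
            else:
                citations.append(line.strip())

        with open('citations.txt', 'w', encoding='utf-8') as f:
            f.write('\n'.join(citations))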
  3. Now (assuming you’ve installed and opened Google Refine), either import this text file or copy in the text. Import it as a line-based text file, and don’t select anything else.
  4. Click on Create Project>>, and it will bring your text into Google Refine. Note that each line from the text file has become a row in the data set, but now you have to turn it into something useful.
  5. My tactic was to separate the author, date, title, journal title, and other bibliographic information into their own columns. The journal title is the only one that really matters for these purposes, but of course you want to hang on to all the information. There are any number of ways to accomplish this, but since citations all follow a standard structure, it’s easy to exploit that to make columns. The citations above are in APA style; since I started with social work faculty to test this out, that’s the style I’m handling first, but I will adjust for Chicago or MLA in the future. Taking one citation as an example, we see the following:

    Heller, M. (2011). A Review of “Strategic Planning for Social Media in Libraries”. Journal of Electronic Resources Librarianship, 24 (4), 339-240.

    Note that we always have a space after the name, an opening parenthesis, the date, a closing parenthesis, and then a period followed by a space. That is all the punctuation we need to split the columns. To do so, select “Split into several columns” from the Edit Column menu.
    Then, in the dialog, type the separator you want to use, which in this case is a space followed by an opening parenthesis: “ (”. Split into two columns, and leave the rest alone. Note that you can also put a regular expression in here if necessary. Since dates are always the same length you could get away with fixed field lengths, but the separator works fine.
    After this, the author is in the first column and the opening parenthesis is gone. (The sketch below shows the same punctuation-based split as a regular expression outside Refine, if that helps make the logic concrete.)
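    As a rough sketch only (plain Python, not part of Refine; the field names are mine), here is the APA punctuation pattern described above expressed as a regular expression:

        import re

        citation = ('Heller, M. (2011). A Review of "Strategic Planning for '
                    'Social Media in Libraries". Journal of Electronic '
                    'Resources Librarianship, 24 (4), 339-240.')

        # author (date). title. journal, rest
        pattern = re.compile(
            r'^(?P<author>.+?) \((?P<date>\d{4})\)\. '
            r'(?P<title>.+?)\. '
            r'(?P<journal>[^,]+), (?P<rest>.+)$'
        )

        m = pattern.match(citation)
        if m:
            print(m.group('author'))   # Heller, M.
            print(m.group('journal'))  # Journal of Electronic Resources Librarianship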

  6. Following the same rationale for each additional field and renaming the columns, we end up (cooking show style) with each piece of information in its own column, so we can really start to do something with it.
  7. I am sure there’s a better way to do this, but my next step was to use the journal title as the query term for the SHERPA/RoMEO API. This was super easy once I watched the data augmentation screencast; there is documentation available as well. Open the “Add column by fetching URLs” option from the Edit Column menu.
    You get a box to fill in the information about your API call. You have all kinds of options, but all you really need to do for this is format your URL in the way required by SHERPA/RoMEO. You should get an API key, and you can read all about this in the article I linked to above. There are probably several ways to do this, but I found that what I have below works really well. Note that it will give you a preview so you can see whether the URL is formatted the way you expect. Give your column a name, and set the throttle delay; I found 1000 milliseconds worked fine.
    In a copy-and-pasteable format, here’s what I have in the box:

        'http://www.sherpa.ac.uk/romeo/api29.php?ak=[YOUR API KEY HERE]&qtype=starts&jtitle=' + escape(value,'url')
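    If you want to test the same request outside Refine, here is a minimal standalone sketch in Python. The endpoint and parameters are exactly those in the expression above; whether that version of the API is still live is worth checking.

        from urllib.parse import urlencode
        from urllib.request import urlopen

        def romeo_lookup(journal_title, api_key):
            # Build the same query the Refine expression builds.
            params = urlencode({
                'ak': api_key,           # your SHERPA/RoMEO API key
                'qtype': 'starts',       # match titles starting with the query
                'jtitle': journal_title,
            })
            url = 'http://www.sherpa.ac.uk/romeo/api29.php?' + params
            with urlopen(url) as response:
                return response.read().decode('utf-8')  # XML as a string

        # xml_text = romeo_lookup('Journal of Electronic Resources Librarianship', 'YOUR-KEY')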
  8. Now it will run the query and send back a column of raw XML from SHERPA/RoMEO.
  9. From this you can extract anything you want, but in my case I want to quickly get the pre- and post-archiving allowances, plus any conditions. This took me a while to figure out, but you can use the Google Refine Expression Language parseHtml function to work on this. Click on Add column based on this column from the Edit Column menu, and you will get a box to fill in an expression that does something to your data. After a bit of trial and error, I decided the following makes the most sense. It grabs the text inside the <prearchiving> element in the XML, and you can see from the preview that you are getting the right information. Do the same thing for post-archiving or any other columns you want to isolate. The code you want is value.parseHtml().select("elementname")[0].htmlText()
    Conditions are a little different, since there are multiple conditions for each journal. In that case, you would use the following syntax (after join you can put whatever separator you want): forEach(value.parseHtml().select("condition"),v,v.htmlText()).join(". ")
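    For the same extraction outside Refine, a sketch with the Python standard library looks like this. The <prearchiving> and <condition> element names come from the steps above; <postarchiving> is my assumption and worth verifying against a real response.

        import xml.etree.ElementTree as ET

        def archiving_policy(xml_text):
            root = ET.fromstring(xml_text)
            pre = root.find('.//prearchiving')    # pre-print allowance
            post = root.find('.//postarchiving')  # assumed element name
            conditions = [c.text for c in root.iter('condition') if c.text]
            return {
                'prearchiving': pre.text if pre is not None else None,
                'postarchiving': post.text if post is not None else None,
                'conditions': '. '.join(conditions),
            }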
  10. Now you have your data neatly structured and you know the conditions for archiving, hooray! You can certainly remove the raw SHERPA/RoMEO XML column at this point, and export the data as Excel or whatever format you want it in.
  11. BUT WAIT, IT GETS BETTER. So that was a lot of work to do all that moving and renaming. Now you can save this for the future. Click on Undo/Redo and then the Extract option.
    Make sure to uncheck any mistakes you made! I entered the information wrong the first time for the API call, so that added an unnecessary step. Copy and paste the JSON into a text editor and save it for later.
  12. From now on, when you have new CV data, you can click on the Undo/Redo tab and then choose Apply. It will run through the steps for you and automatically spit out your nicely formatted and researched publications. Well… realistically, the first time it will spit out something with multiple errors, and you will see all the typos that are messing up your plan. But since the entire program is built to clean up messy data, you’re all set on that end. I’ve shared the extracted JSON for the APA workflow I described above if you want to copy and paste it; just make sure to fill in your own API key.

I hope this is useful to some people. I know that even in this protean form, this procedure will save me a ton of time and allow library liaisons to concentrate on outreach to faculty rather than looking things up in a database one at a time. Let me know if you are doing something similar and how you are doing it.

Categories: Dreaming

Changing it all around

So… I now work at a different place with a different job title and a very different commute. What’s new with you?

My November writing more or less went by the wayside when I was offered, spent a week thinking about, and ultimately accepted a job as Digital Services Librarian at Loyola University Chicago. I’ve been at the new job for four days now. It’s been six days since my last day at Dominican. I think my head is still spinning from making all these changes so quickly. I didn’t start off 2012 with any plans to change jobs; in fact, this was the only job I’ve applied for in nearly three years. I’ve always wanted to work at Loyola. You would want to work here too if you saw the view from the library. Not to mention there are great academic programs and excellent faculty with whom I am looking forward to working.

And with this, I hope, comes a return to more blogging. It feels like I’ve been doing nothing but writing lately, just not much of it publicly. There are lots of new things to learn and explore, and so I’ll be back.

Categories: Scholarly Communication, Technology

What Should Technology Librarians Be Doing About Alternative Metrics?

This originally appeared on the ACRL TechConnect blog.

Bibliometrics, used here to mean the statistical analysis of the output and citation of periodical literature, is a huge and central field of library and information science. In this post, I want to address the general controversy surrounding these metrics when evaluating scholarship and introduce the emerging alternative metrics (often called altmetrics) that aim to address some of these controversies, as well as how they can be used in libraries. Librarians are increasingly focused on the publishing side of the scholarly communication cycle, as well as on supporting faculty in new ways (see, for instance, David Lankes’s thought experiment of the tenure librarian). What is a reasonable approach to these issues for technology-focused academic librarians? And what tools exist to help?

There have been many articles and blog posts expressing frustration with the practice of using journal impact factors to judge the quality of a journal or an individual researcher (see especially Seglen). One vivid illustration of this frustration is a recent blog post by Stephen Curry titled “Sick of Impact Factors”. Librarians have long used journal impact factors in making purchasing decisions, which is one of the less controversial uses of these metrics. 1 The essential message of all of this research about impact factors is that traditional methods of counting citations or determining journal impact do not answer questions about which articles are influential and how individual researchers contribute to the academy. For individual researchers looking to make a case for promotion and tenure, questions about the use of metrics can be all-or-nothing propositions, hence the slightly hysterical edge in some of the literature. Librarians, too, have become frustrated with attempting to prove the return on investment for decisions (see “How ROI Killed the Academic Library”), and going by metrics alone potentially makes the tools available to researchers more homogeneous and ignores niches. As the altmetrics manifesto suggests, the traditional “filters” in scholarly communication of peer review, citation metrics, and journal impact factors are becoming obsolete in their current forms.

Traditional Metrics

It would be of interest to determine, if possible, the part which men of different calibre [sic] contribute to the progress of science.

Alfred Lotka (a statistician at the Metropolitan Life Insurance Company, famous for his work in demography) wrote these words in reference to his 1926 statistical analysis of the journal output of chemists. 2 Given the tools available at the time, it was a fairly limited sample: he looked at just the first two letters of an author index for a 16-year period, compared with a slim 100-page volume of important works “from the beginning of history to 1900.” His analysis showed that the more articles that have been published in a field, the less likely it is for any individual author to have published more than one of them. As Per Seglen puts it, this showed the “skewness” of science. 3
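Lotka’s result is usually stated as his inverse square law of productivity (the general published form of the law, not a claim original to this post): the number of authors producing k papers falls off roughly as the square of k,

    \[ n(k) \approx \frac{C}{k^{2}} \]

where \(n(k)\) is the number of authors with k publications and \(C\) is approximately the number of authors with a single publication.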

The original journal impact factor was developed by Eugene Garfield in the 1970s, and used the “mean number of citations to articles published in two preceding years”. 4 Quite clearly, this is supposed to measure the general amount that a journal was cited, and hence be a guide to how likely a researcher was to read the journal and find its contents immediately useful in his or her own work. This is helpful for librarians trying to make decisions about how to stretch a budget, but the literature has not found that a journal’s impact has much to do with an individual article’s citedness and usefulness. 5 As one researcher suggests, using it for anything other than its intended original use constitutes pseudoscience. 6 Another issue, with which those at smaller institutions are very familiar, is the cost of accessing traditional metrics. The major resources that provide these are Thomson Reuters’ Journal Citation Reports and Web of Science and Elsevier’s Scopus, all of which are outside the price range of many schools.
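In symbols, the two-year impact factor quoted above works out, for a journal J in year y, to

    \[ \mathrm{IF}_{y}(J) = \frac{\text{citations received in } y \text{ by items } J \text{ published in } y-1 \text{ and } y-2}{\text{citable items } J \text{ published in } y-1 \text{ and } y-2} \]

so it is a mean citation rate over a two-year window, nothing more.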

Metrics that attempt to remedy some of these difficulties have been developed. At the journal level, the Eigenfactor® and Article Influence Score™ use network theory to estimate “the percentage of time that library users spend with that journal”, and the Article Influence Score tracks the influence of the journal over five years. 7 At the researcher level, the h-index tracks the impact of specific researchers (it was developed with physicists in mind). A researcher’s h-index is the largest number h such that he or she has published h papers that have each been cited at least h times, so it combines the number of papers published with how often they are cited. 8
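The h-index definition is compact enough to compute directly; here is a minimal sketch (plain Python, not any vendor’s implementation) that follows Hirsch’s definition:

    def h_index(citation_counts):
        # Largest h such that h papers have at least h citations each.
        counts = sorted(citation_counts, reverse=True)
        h = 0
        for rank, cites in enumerate(counts, start=1):
            if cites >= rank:
                h = rank
            else:
                break
        return h

    # Five papers cited [10, 8, 5, 4, 3] times give h = 4.
    print(h_index([10, 8, 5, 4, 3]))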

These are included under the rubric of alternative metrics since they are an alternative to the JCR, but rely on citations in traditional academic journals, something which the “altmetric” movement wants to move beyond.

Alt Metrics

In this discussion of alt metrics I will be referring to the arguments and tools suggested by Altmetrics.org. In the altmetrics manifesto, Priem et al. point to several manifestations of scholarly communication that are unlike traditional article publications, including raw data, “nanopublication”, and self-publishing via social media (which was predicted as so-called “scholarly skywriting” at the dawn of the World Wide Web 9). Combined with the readier sharing of traditional articles through open access journals and social media, these all create new possibilities for indicating impact. Yet the manifesto also cautions that we must be sure that the numbers alt metrics collect “really reflect impact, or just empty buzz.” The research done so far is equally cautious. A 2011 study suggests that tweets about articles (“tweetations”) do correlate with citations, but that we cannot say the number of tweets about an article really measures its impact. 10

A criticism expressed in the media about alt metrics is that alternative metrics are no more likely to be able to judge the quality or true impact of a scientific paper than traditional metrics. 11 As Per Seglen noted in 1992, “Once the field boundaries are broken there is virtually no limit to the number of citations an article may accrue.” 12 So an article that is interdisciplinary in nature is likely to do far better in the alternative metrics realm than a specialized article in a discipline, even though the latter may be very important. Mendeley’s list of top research papers demonstrates this: many (though not all) of the top articles are about scientific publication in general rather than about specific scientific results.

What can librarians use now?

Librarians are used to questions like “What is the impact factor of Journal X?” For librarians lucky enough to have access to Journal Citation Reports, this is a matter of looking up the journal and reporting the score. They can answer “How many times has my article been cited?” in Web of Science or Scopus, using some care in looking for typos. Alt metrics, however, remind us that these easy answers are not telling the whole story. So what should librarians be doing?

One thing librarians can start doing is helping their campus community sign up for the many different services that will promote their research and provide article-level citation information. Below is a small number of services (there are certainly others out there) that you may want to consider using yourself or having your campus community use. Some, like PubMed, won’t be relevant to all disciplines. Altmetrics.org suggests several tools beyond those listed below to provide additional ideas.

These tools offer various methods for sharing. PubMed allows one to embed “My Bibliography” in a webpage, as well as to create delegates who can help curate the bibliography. A developer can use the APIs provided by some of these services to embed data for individuals or institutions on a library website or in an institutional repository: ImpactStory has an API that makes this relatively easy, and Altmetric.com also has an API that is free for non-commercial use (a sketch of such a call appears below). Mendeley has many helpful apps that integrate with popular content management systems.
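As an illustration only, here is a sketch of fetching article-level data for one DOI. The endpoint shape follows Altmetric.com’s public v1 API as I understand it, and both it and the field name in the comment are assumptions to check against the current documentation:

    import json
    from urllib.request import urlopen

    def altmetric_for_doi(doi):
        # Public, keyless lookup by DOI (assumed v1 endpoint shape).
        url = 'https://api.altmetric.com/v1/doi/' + doi
        with urlopen(url) as response:
            return json.load(response)

    # data = altmetric_for_doi('10.1234/example-doi')  # hypothetical DOI
    # print(data.get('cited_by_tweeters_count'))       # field name: an assumption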

Since this is such a new field, it’s a great time to get involved. Altmetrics.org held a hackathon in November 2012 and has a Google Doc of ideas from the hackathon, which is an interesting overview of what is going on with open source hacking on alt metrics.

Conclusion

The altmetrics manifesto calls for a complete overhaul of scholarly communication; alternative research metrics are just one part of its critique. And yet, for librarians trying to help researchers, they are often the main concern. While science in general calls for a change in how these metrics are used, we can help to shape the discussion by educating our communities about alternative metrics and by using them ourselves.

 

Works Cited and Suggestions for Further Reading
Bourg, Chris. 2012. “How ROI Killed the Academic Library.” Feral Librarian. http://chrisbourg.wordpress.com/2012/12/18/how-roi-killed-the-academic-library/.
Cronin, Blaise, and Kara Overfelt. 1995. “E-Journals and Tenure.” Journal of the American Society for Information Science 46 (9) (October): 700-703.
Curry, Stephen. 2012. “Sick of Impact Factors.” Reciprocal Space. http://occamstypewriter.org/scurry/2012/08/13/sick-of-impact-factors/.
“Methods.” 2012. Eigenfactor.org.
Eysenbach, Gunther. 2011. “Can Tweets Predict Citations? Metrics of Social Impact Based on Twitter and Correlation with Traditional Metrics of Scientific Impact.” Journal of Medical Internet Research 13 (4) (December 19): e123.
Gisvold, Sven-Erik. 1999. “Citation Analysis and Journal Impact Factors – Is the Tail Wagging the Dog?” Acta Anaesthesiologica Scandinavica 43 (November): 971-973.
Hirsch, J. E. “An Index to Quantify an Individual’s Scientific Research Output.” Proceedings of the National Academy of Sciences of the United States of America 102, no. 46 (November 15, 2005): 16569–16572. doi:10.1073/pnas.0507655102.
Howard, Jennifer. 2012. “Scholars Seek Better Ways to Track Impact Online.” The Chronicle of Higher Education, January 29, sec. Technology. http://chronicle.com/article/As-Scholarship-Goes-Digital/130482/.
Jump, Paul. 2012. “Alt-metrics: Fairer, Faster Impact Data?” Times Higher Education, August 23, sec. Research Intelligence. http://www.timeshighereducation.co.uk/story.asp?storycode=420926.
Lotka, Alfred J. 1926. “The Frequency Distribution of Scientific Productivity.” Journal of the Washington Academy of Sciences 26 (12) (June 16): 317-324.
Mayor, Julien. 2010. “Are Scientists Nearsighted Gamblers? The Misleading Nature of Impact Factors.” Frontiers in Quantitative Psychology and Measurement: 215. doi:10.3389/fpsyg.2010.00215.
Oransky, Ivan. 2012. “Was Elsevier’s Peer Review System Hacked to Get More Citations?” Retraction Watch. http://retractionwatch.wordpress.com/2012/12/18/was-elseviers-peer-review-system-hacked-to-get-more-citations/.
Priem, J., D. Taraborelli, P. Groth, and C. Neylon. 2010. “Altmetrics: A Manifesto.” Altmetrics.org. http://altmetrics.org/manifesto/.
Seglen, Per O. 1992. “The Skewness of Science.” Journal of the American Society for Information Science 43 (9) (October): 628-638.
———. 1994. “Causal Relationship Between Article Citedness and Journal Impact.” Journal of the American Society for Information Science 45 (1) (January): 1-11.
Vanclay, Jerome K. 2011. “Impact Factor: Outdated Artefact or Stepping-stone to Journal Certification?” Scientometrics 92 (2) (November 24): 211-238. doi:10.1007/s11192-011-0561-0.
Notes
  1. Jerome K. Vanclay, “Impact Factor: Outdated Artefact or Stepping-stone to Journal Certification?” Scientometrics 92 (2) (2011): 212.
  2. Alfred Lotka, “The Frequency Distribution of Scientific Productivity.” Journal of the Washington Academy of Sciences 26 (12) (1926): 317.
  3. Per Seglen, “The Skewness of Science.” Journal of the American Society for Information Science 43 (9) (1992): 628.
  4. Vanclay, 212.
  5. Per Seglen, “Causal Relationship Between Article Citedness and Journal Impact.” Journal of the American Society for Information Science 45 (1) (1994): 1-11.
  6. Vanclay, 211.
  7. “Methods”, Eigenfactor.org, 2012.
  8. J.E. Hirsch, “An Index to Quantify an Individual’s Scientific Research Output.” Proceedings of the National Academy of Sciences of the United States of America 102, no. 46 (2005): 16569–16572.
  9. Blaise Cronin and Kara Overfelt, “E-Journals and Tenure.” Journal of the American Society for Information Science 46 (9) (1995): 700.
  10. Gunther Eysenbach, “Can Tweets Predict Citations? Metrics of Social Impact Based on Twitter and Correlation with Traditional Metrics of Scientific Impact.” Journal of Medical Internet Research 13 (4) (2011): e123.
  11. see in particular Jump.
  12. Seglen, 637.