Digital Content: Who May Publish? Who May Sell? Who May Access?

This originally appeared on the ACRL TechConnect blog.

Whether it is a small university press focused on niche markets or one of the Big Six giants looking for the next massive bestseller, the publishing industry has been struggling to come to terms with the reality of new distribution models. Those models tend to favor cheaper and faster production with a much lower threshold for access, which generally has been good news for consumers. Several recent rulings and statements have brought these issues to the forefront of conversation and perhaps indicate some common themes in publishing which are relevant to all libraries and their ability to purchase and/or provide digital content.

Academic Publishing: Dissertation == Monograph?

On July 22 the American Historical Association issued a “Statement on Policies Regarding the Embargoing of Completed History PhD Dissertations”, recommending that all libraries and graduate programs allow dissertations to be embargoed for up to six years. This is, in theory, to allow junior scholars enough time to publish a monograph based on the dissertation in order to receive tenure, on the assumption that academic publishers will not publish a book based on a dissertation freely available online. Reactions to this statement prompted the AHA to release a Q & A page to clarify and support its position, pointing out that publishers’ positions are too unclear to be sure there is no risk to an open access dissertation, and that, “like it or not”, junior faculty must produce a monograph to get tenure. The AHA claims that in some cases an embargo benefits junior scholars by giving them more time to revise their work before publication–while this is true, it also indicates that a dissertation is not equivalent to a published scholarly monograph. The argument from the publishers’ side appears to be that libraries (the main purchasers of scholarly monographs) will not buy books based on revised dissertations freely available online, a claim that has been widely debated. Libraries do purchase print copies of titles (both monographs and serials) which are freely available online.

From my personal experience as an institutional repository manager, I know that attitudes toward embargoing dissertations vary widely by advisor and department. Like most people making an argument about this topic, I have little more than anecdotes to offer. I checked the most commonly downloaded dissertations from the past year, and the most frequently downloaded title (over 2,000 downloads across 2012-2013) is also the only one that has been published as a book and purchased by at least one library. Clearly this does not control for all variables and warrants further study, but it is a useful clue that open access availability does not always affect publication and later purchase. Further, from the point of view of open access creating more equal access to resources across the world, Google Analytics for that dissertation indicates that the sessions over the past year with the most engaged users came from, in order, the UK, the United States, Mauritius, and Sri Lanka.

What Should a Digital Book Cost?

In mid-July Denise Cote, the judge in the Apple e-book price fixing case, issued an opinion stating that Apple did collude with the publishers to set prices on ebooks. Reading the story of the negotiations in the opinion is a thrilling behind-the-scenes look at companies trying to get a handle on a fairly new market and figure out how they will make money. Below I summarize the 160-page opinion, which is well worth reading in its entirety.

The problem with ebook pricing started with Amazon, which set a price of $9.99 for new releases that normally would have had list prices of $25-$30. This was frustrating to the major publishing houses, which worried (probably rightly) that consumers would be unwilling to pay more than $10 for books after getting used to this low price point, and that Amazon would effectively price everyone else out of the market. Even after publishers raised the wholesale price of new releases, Amazon would sell them at a loss to preserve the $9.99 price. The publishers spent 2009 developing strategies to combat Amazon, but it wasn’t until late that year, with the entry of Apple into the ebook market, that they saw a real opportunity.

Apple agreed with the Big Six publishers that setting all books at $9.99 was too low, but was unwilling to enter a market in which it could not compete with Amazon. To accomplish this, Apple wanted the publishers to agree to the same terms, which included lower wholesale prices for ebooks. The negotiations that followed over late 2009 and early 2010 started positively, but ended in dissatisfaction. Because Apple was unwilling to sell anything as a loss leader, it felt that a wholesale model would leave it too vulnerable to Amazon. To address that, Apple proposed selling books under an agency model (which several publishers had suggested), collecting a 30% commission on sales just as it did with the App Store. To ensure that publishers did not set unrealistically high prices, Apple would set pricing caps. The other crucial move Apple made was to insist that publishers move all retailers of ebooks to the agency model, to ensure Apple would be able to compete on price across the board.

Amazon had no interest in the agency model, and in early 2010 a series of meetings with the publishers made this clear. After all the agreements with Apple were signed (the only Big Six publisher that did not participate was Random House), the publishers needed to move Amazon to an agency model to fulfill the terms of their contracts. Macmillan was the first publisher to set up a meeting with Amazon to discuss this requirement. Amazon’s response was to remove the “buy” button from all Macmillan books, both print and Kindle editions. Amazon eventually had to capitulate to the publishers and move to an agency model, which was complete by mid-2010, but it submitted a complaint to the Federal Trade Commission. Random House finally agreed to an agency model with Apple in early 2011, thanks to a spot of blackmail on Apple’s part (it wouldn’t allow any Random House apps without an agency deal).

Ultimately the court determined that Apple violated the Sherman Act by conspiring with the publishers to force all their retailers to sell books at the same prices, thus removing competition. A glance at Amazon’s Kindle store bestsellers today shows books priced from $1.99 to $13.99 for the newest Stephanie Plum mystery (the same price as in the Apple bookstore). For all titles priced higher than $9.99, Amazon notes that the “price is set by the publisher.” Whether this means anything to the average consumer is debatable. Compare these negotiations to the ongoing struggle libraries have had with the availability of ebooks for lending–publishers have a lot to learn about libraries in addition to new models for digital sales, some of which was covered in the series of talks with the Big Six publishers that Maureen Sullivan held in early 2012. Over recent months publishers have made more ebooks available to libraries, but some libraries, most notably the Douglas County, Colorado libraries, are setting their own terms for purchasing and lending ebooks.

What Can You Do With a Digital File?

The last ruling I want to address concerns the music resale service ReDigi, about which Kevin Smith goes into detail. This was a service that provided a way for people to re-sell purchased MP3s, but ultimately the judge ruled that it was impossible to transfer the original file, and so the service did not fit under the first sale doctrine. The first sale doctrine (17 USC § 109) holds that “the owner of a particular copy or phonorecord lawfully made … is entitled, without the authority of the copyright owner, to sell or otherwise dispose of the possession of that copy or phonorecord.” Another case, decided in April by the Supreme Court, Kirtsaeng v. Wiley, upheld this doctrine for international sales of physical items, an important decision for libraries. But digital materials are more complicated. First sale applies to computer programs on physical media (except in certain circumstances), but does not cover material that has been licensed rather than sold, which is how most digital files are distributed. (For how the US Attorney’s Office approaches this in criminal investigations, see this document.) So when you “buy” that Kindle book from Amazon or load a book onto your iPad, you are licensing the product for limited use on a limited number of devices, with no legal recourse for lending or disposing of the content, even if you try hard to follow the law as ReDigi did. Librarians are well aware of this and its implications, and we license quite a bit of content that we can loan and/or distribute under limited circumstances. Libraries, like consumers, are safest in the long term if they can own content outright rather than licensing it. But it will be a long time before there is clarity about the legal way to transfer ownership of a digital file at the consumer level.

Conclusion

Librarians and publishers have a complicated relationship. We need each other if either is to succeed, but even if our ends are ultimately the same, our means are very different. These recent events indicate that there is still much in flux and plenty of room for constructive dialog with content creators and publishers.

Collecting Data: How Much Do We Really Need?

This originally appeared on the ACRL TechConnect blog.

Many of us have had conversations in the past few weeks about data collection due to the reports about the NSA’s PRISM program, but ever since April and the bombings at the Boston Marathon, there has been an increased awareness of how much data is being collected about people in an attempt to track down suspects–or, increasingly, stop potential terrorist events before they happen. A recent Nova episode about the manhunt for the Boston bombers showed one such example: the New York Police Department’s Domain Awareness System, which consists of live footage from almost every surveillance camera in New York City playing in one room, with the ability to search for features of individuals and even to detect people acting suspiciously. Add to that a demonstration of cutting-edge facial recognition software development at Carnegie Mellon University, and reality seems to be moving ever closer to science fiction movies.

Librarians focused on technical projects love to collect data and make decisions based on it. We work hard to get data collection systems as close to real-time as possible, and to collect and analyze as much data as we can. The idea of a series of cameras tracking exactly what our patrons are doing in the library in real time might seem very tempting. But as librarians, we value the ability of our patrons to access information with as much privacy as possible–like other professions with their clients, patients, congregants, or sources, we treat our interactions with patrons with care and discretion (see Item 3 of the Code of Ethics of the American Library Association). I will not address the national conversation about privacy versus security in this post–I want to address the issue of data collection right where most of us live on a daily basis: inside analytics programs, spreadsheets, and server logs.

What kind of data do you collect?

Let’s start with an exercise. Write a list of all the statistical reports you are expected to provide your library–for most of us, it’s probably a very long list. Now, make a list of all the tools you use to collect the data for those statistics.

Here are a few potential examples:

Website visitors and user experience

  • Google Analytics or some other web analytics tool
  • Heat map tool
  • Server logs
  • Surveys

Electronic resource access reports

  • Electronic resources management application
  • Vendor reports (COUNTER and other)
  • Link resolver click-through report
  • Proxy server logs

The next step may require a little digging. For library-created tools, do you have a privacy policy for this data? Has it gone through the Institutional Review Board? For third-party tools, is there a privacy policy? What are the terms of use or user license? (And how many people have ever read the entire terms of service?) We will return to this exercise in a moment.

How much is enough?

Think about what type of data you are collecting about your users with these tools. Some of it may be very private indeed. For instance, the heat map tool I’ve recently started using (Inspectlet) not only tracks clicks, but actually records sessions as patrons use the website. This is fascinating information–we had, for instance, one session in which a patron opened the library website, clicked the Facebook icon on the page, and came back to the website nearly 7 hours later. It was fun to see that people really do visit the library’s Facebook page, but it immediately raised the question of whether it was a visit from on campus. (It was–and it wouldn’t have taken long to figure out if it was a staff machine and who was working that day and time.) IP addresses from off campus are very easy to track, sometimes down to the block–again, easy enough to tie to an individual. We like to collect IP addresses for abusive or spamming behavior and block users based on IP address all the time. But what about in this case? During the screen recordings I can see exactly what the user types in the search boxes for the catalog and discovery system. Luckily, Inspectlet allows you to obscure the last two octets of the IP address (which is legally required in some places), so you can collect less information. All similar tools should allow you the same ability.
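To make that concrete, here is a minimal sketch (in Python, and not Inspectlet’s actual implementation) of what masking the last two octets of an IPv4 address looks like before an address is stored or logged:

```python
def mask_ip(ip_address, keep_octets=2):
    """Zero out the trailing octets of an IPv4 address before storing it.

    mask_ip("203.0.113.42") -> "203.0.0.0"
    """
    octets = ip_address.split(".")
    if len(octets) != 4:
        raise ValueError("expected a dotted-quad IPv4 address")
    return ".".join(octets[:keep_octets] + ["0"] * (4 - keep_octets))

print(mask_ip("203.0.113.42"))  # 203.0.0.0
```

The masked address still supports rough geographic or on-/off-campus analysis while no longer pointing at a single machine.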

Consider another case: proxy server logs. In the past when I did a lot of EZProxy troubleshooting, I found the logs extremely helpful in figuring out what went wrong when I got a report of trouble, particularly when the problem had occurred a day or two before. I could see the username, what time the user attempted to log in or succeeded in logging in, and which resources they accessed. Let’s say someone reported not being able to log in at midnight–I could check the failed logins at midnight, and then see that username successfully logging in at 1:30 AM. That was not an infrequent occurrence, as people usually don’t think to write back and say they figured out what they did wrong! But I could also see everyone else’s logins and which articles they were reading, so I could tell (if I wanted) which grad students were keeping up with their readings or who was probably sharing their login with a friend or an entire company. Where I currently work, we don’t keep the logs for more than a day, but I know a lot of people out there are holding on to EZProxy logs with the idea of doing “something” with them someday. Are you holding on to more than you really want to?
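As an illustration of this kind of troubleshooting, here is a rough sketch in Python. It assumes (and this is only an assumption; the actual layout depends on your EZProxy LogFormat directive) an Apache-style log line with the username in the third field:

```python
import re

# Assumed log line shape (set by EZProxy's LogFormat directive), e.g.:
# 10.1.2.3 - jsmith [05/Jul/2013:00:02:11 -0500] "GET /login HTTP/1.1" 403 512
LINE = re.compile(r'^(\S+) \S+ (\S+) \[([^\]]+)\] "([^"]*)" (\d{3})')

def login_attempts(logfile, username):
    """Return (timestamp, outcome, request) tuples for one user's requests."""
    attempts = []
    with open(logfile) as f:
        for line in f:
            m = LINE.match(line)
            if m and m.group(2) == username:
                timestamp, request, status = m.group(3), m.group(4), m.group(5)
                outcome = "FAILED" if status.startswith(("4", "5")) else "ok"
                attempts.append((timestamp, outcome, request))
    return attempts

for timestamp, outcome, request in login_attempts("ezproxy.log", "jsmith"):
    print(timestamp, outcome, request)
```

The same few lines that make midnight troubleshooting easy are exactly what make the privacy question so pointed: filtering an entire log on one username is trivial.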

Let’s continue our exercise. Go through your list of tools, and for each one make a list of all the potentially personally identifying information it collects, whether or not you use that information. Are you surprised by anything? Make a plan to obscure unused pieces of data on a regular basis if it can’t be done automatically. Consider also what you can reasonably do with the data under your current job requirements, rather than future study possibilities. If you do think the data will be useful for a future study, make sure you are saving anonymized data sets unless it is absolutely necessary to have personally identifying information. In the latter case, you should clear your study in advance with your Institutional Review Board and follow a data management plan.
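When you do need to link records in a saved data set without keeping real identities, one common technique (a sketch only; hashing alone is pseudonymization, not full anonymization) is to replace usernames with salted one-way hashes:

```python
import hashlib
import hmac

# A hypothetical secret salt: keep it out of the data set and destroy it
# when the study ends, so the tokens can never be mapped back to people.
STUDY_SALT = b"example-secret-rotate-me"

def pseudonymize(username):
    """Replace a username with a stable, salted one-way token."""
    return hmac.new(STUDY_SALT, username.encode("utf-8"),
                    hashlib.sha256).hexdigest()[:12]

print(pseudonymize("jsmith"))  # the same input always yields the same token
```

The same person gets the same token throughout the study, so usage patterns survive, but the data set itself no longer contains usernames.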

A privacy and data management policy should include at least these items:

  • A statement about what data you are collecting and why.
  • Where the data is stored and who has access to it.
  • A retention timeline.

For example, in the past I collected all virtual reference transaction logs for studying the effectiveness of a new set of virtual reference services. I knew I wanted at least a year’s worth of logs, and ideally three years, to track changes over time. I was able to save the logs with anonymized IP addresses, and once I had the data I needed I deleted the actual transcripts. The privacy policy described the process and where the data would be stored to ensure it was secure. In this case, I used the RUSA Guidelines for Implementing and Maintaining Virtual Reference Services as a guide to creating this policy. Read through the ALA Guidelines to Drafting a Library Privacy Policy for additional specific language and items you should include.

What we can do with data

In all this I don’t at all mean to imply that we shouldn’t be collecting data. In both the examples I gave above, the data is extremely useful in improving the patron experience even with identifying details obscured. Not collecting data has trade-offs too. For years, libraries have not retained patrons’ borrowing records in order to protect their privacy. But now patrons who want an online record of what they’ve borrowed from the library must use third-party services with (most likely) much less stringent privacy policies than libraries have. By not keeping records of what users have checked out or read through databases, we are unable to provide them personalized, automated suggestions about what to read next. Anyone who uses Amazon regularly knows that it will try to tempt you into purchases based on your past purchases or books whose previews you were reading–even if you would rather no one know you were reading that book, and certainly don’t want suggestions based on it popping up when you are doing a collection development project at work while logged in on your personal account. In all the decisions we make about collecting or not collecting data, we have to consider trade-offs like these. Is the service so important that the benefits of collecting the data outweigh the risks? Or is there another way to provide the service?

We can see examples of this trade-off in two similar projects coming out of Harvard Library Labs. One, Library Hose, was a Twitter stream with the name of every book being checked out. The service ran for part of 2010 and has been suspended since September of that year. In addition to hitting daily tweet limits, it was also a potential privacy violation–even if it was a fun idea (this blog post has some discussion about it). A newer project takes the opposite approach: books that a patron thinks are “awesome” can be returned to the Awesome Box at the circulation desk, and information about the book is collected on the Awesome Box website. This is a great tweak to the earlier project, since it advertises material that’s now available rather than checked out, and people have to opt in by putting the item in the box.

In terms of personal recommendations, librarians have the advantage of being able to form close working relationships with faculty and students, so they can make recommendations based on their knowledge of a person’s work and interests. But how to automate this without borrowing records? One example is a project by Ian Chan at California State University San Marcos that uses student enrollment data to personalize the library website based on a student’s field of study (slides). This provides a great deal of value for students, who need to log in to check their course reserves and access articles from off campus anyway. On top of that basic need, it adds a list of recommended resources, which students can choose to star as favorites.

Conclusion

In thinking about what type of data you collect, whether on purpose or accidentally, spend some time considering what is strictly necessary to accomplish the work you need to do. If you don’t need a piece of data but can’t avoid collecting it (such as full IP addresses or usernames), make sure you have a privacy policy and retention schedule, and ensure that the data is not accessible to more people than absolutely necessary.
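A retention schedule is easiest to keep when no human has to remember it. As a minimal sketch (the directory path and file pattern here are hypothetical), a script like this run from a daily scheduled job deletes anything past the retention window:

```python
import time
from pathlib import Path

LOG_DIR = Path("/var/log/ezproxy")  # hypothetical log location
RETENTION_DAYS = 1

def enforce_retention(log_dir=LOG_DIR, days=RETENTION_DAYS):
    """Delete log files whose modification time is older than the window."""
    cutoff = time.time() - days * 24 * 60 * 60
    if not log_dir.is_dir():
        return
    for logfile in log_dir.glob("*.log"):
        if logfile.stat().st_mtime < cutoff:
            logfile.unlink()

if __name__ == "__main__":
    enforce_retention()
```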

Work to educate your patrons about privacy, particularly online privacy. ALA has a Choose Privacy Week, which is always the first week in May; the site for it has a number of resources you might want to consult in planning programming. Academic librarians may find it easiest to address college students in terms of their presence on social media when it comes to future job hunting, but this is just an opening to larger conversations about data. When you ask patrons to use a third-party service (such as a social network) or recommend one (such as a book recommendation site), make sure they are aware of what information they are sharing.

We all know that Google’s slogan is “Don’t be evil”, but it’s not always clear if they are sticking to that. Make sure that you are not being evil in your own data collection.

Citation Manager Roundup

This originally appeared on the ACRL TechConnect blog.

In April of this year, the two most popular free citation managers–Mendeley and Zotero–both underwent some big changes. On April 8th, TechCrunch announced that Elsevier had purchased Mendeley, a deal that had been surmised since January.[1] Just a few days later, Zotero announced the release of version 4, with a number of new features.[2] Just as with the sunsetting of Google Reader, this has prompted many to reconsider which citation manager they use and think about switching or changing practices. I will not address subscription or paid products like RefWorks and EndNote specifically, though there are certainly many reasons you might prefer one of those products.

Mendeley: a new Star Wars movie in the making?

The rhetoric surrounding Elsevier’s acquisition of Mendeley was generally alarmist in nature, and the hashtag “#mendelete” that popped up immediately after the announcement suggests that many people’s first instinct was to abandon Mendeley. Elsevier has been held up as a model of anti-open access, and Mendeley as a model for open access. Yet Mendeley has always been a for-profit company, and, like Google, benefits itself and its users (particularly the science community) by knowing what they are reading and sharing. After all, the social features of Mendeley wouldn’t have any value if there were no public sharing. Institutional Mendeley accounts allow librarians to see what their users in aggregate are reading and saving, which helps them make collection development decisions–a service beyond what the average institutional citation manager product accomplishes. Victor Henning promises on the Mendeley blog that nothing will change, and that the acquisition will give the team more freedom to develop new features.[3] As for Elsevier, Oliver Dumon promises that Mendeley will remain independent and be allowed to follow its own course–and that bringing it together with ScienceDirect and Scopus will create a “central workflow and collaboration site for authors”.[4]

There are two questions to be answered here. First, is it realistic to assume that the Mendeley team will have the creative freedom they say they will have? And second, are users comfortable with their data being available to Elsevier? For many, the answers to both questions seem to be “no” and “no.” A more optimistic point of view is that if Elsevier must placate Mendeley users who are open access advocates, it will allow more openness than before.

It’s too early to say, but I remain hopeful that Mendeley can continue to foster a more open spirit in academic publishing. Jason Hoyt (a former employee of Mendeley and founder of PeerJ) suggests that much of the work he oversaw to open up Mendeley was being stymied by Elsevier specifically; for him, this went against his personal ethos and so he was unable to stay at Mendeley–but he is confident in the character and ability of the people remaining there.[5] I have never been a heavy user of Mendeley, but I have maintained a free account for the past few years. I use it mainly to create a list of my publications on my personal website, using a WordPress plug-in that talks to the Mendeley API.
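For the curious, the idea behind such a plug-in looks roughly like the sketch below. The endpoint, response shape, and field names are hypothetical placeholders, not the real plug-in or the documented Mendeley API; consult Mendeley’s API documentation for the actual OAuth flow and URLs:

```python
import requests  # third-party: pip install requests

API_URL = "https://api.mendeley.com/documents"  # assumed endpoint
TOKEN = "your-oauth-access-token"               # obtained separately via OAuth

def publication_list():
    """Fetch the account's documents and render them as an HTML list."""
    resp = requests.get(API_URL, headers={"Authorization": "Bearer " + TOKEN})
    resp.raise_for_status()
    items = ["<li>{} ({})</li>".format(doc.get("title", "Untitled"),
                                       doc.get("year", "n.d."))
             for doc in resp.json()]
    return "<ul>\n" + "\n".join(items) + "\n</ul>"

print(publication_list())
```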

What’s new with Zotero

Zotero is a very different product from Mendeley. First, it is open-source software, with lots of ways to participate in development. Zotero was developed by the Roy Rosenzweig Center for History and New Media at George Mason University, with foundation and user support, specifically to support the research work of humanists. Originally a Firefox plug-in, Zotero now works as a standalone piece of software that interacts with Firefox, Chrome, and Safari to recognize bibliographic data on websites and pull it into a database that can be synced across computers (and even some third-party mobile software). The newest version of Zotero includes several improvements. The one I am most excited about is the detailed download display, which tells you what folder you’re saving a reference into–crucial for my workflow. Zotero is the citation manager I use on a daily basis, and I rely on it for formatting the footnotes you see on ACRL TechConnect posts and other research articles I produce. Since much of my research involves the open web, books, or other non-journal-article resources, I find Zotero’s ability to pick up library catalog records and similar metadata more useful than the Mendeley import bookmarklet.

Both Zotero and Mendeley offer free storage for metadata and PDFs, with a cost for storage above the free level. (It is also possible to use a WebDAV server for syncing Zotero files.)

Zotero                Mendeley
300 MB   Free         –
2 GB     $20 / year   2 GB        Free
6 GB     $60 / year   5 GB        $55 / year
10 GB    $100 / year  10 GB       $110 / year
25 GB    $240 / year  Unlimited   $165 / year

Some concluding thoughts

Several graduate students in science have written blog posts about switching away from Mendeley to Zotero.[6] But the two aren’t the same thing at all: given the backgrounds of their creators, Mendeley skews to the sciences, and Zotero to the humanities.

Nor, as I like to point out, must they be mutually exclusive. I use Zotero for my daily citation management since I much prefer it for grabbing citations online, but sync my Zotero library with Mendeley to use the social and API features in Mendeley. I can choose to do this as an individual, but consider carefully the implications of your choice if you are considering an institutional subscription or requiring students or members of a research group to use a particular service.

  1. Lunden, Ingrid. “Confirmed: Elsevier Has Bought Mendeley For $69M-$100M To Expand Its Open, Social Education Data Efforts.” TechCrunch, April 8, 2013. http://techcrunch.com/2013/04/08/confirmed-elsevier-has-bought-mendeley-for-69m-100m-to-expand-open-social-education-data-efforts/.
  2. Takats, Sean. “Zotero 4.0 Launches.” Zotero, April 11, 2013. http://www.zotero.org/blog/zotero-4-0-launches/.
  3. Henning, Victor. “Mendeley and Elsevier – Here’s More Info.” Mendeley Blog, April 19, 2013. http://blog.mendeley.com/community-relations/mendeley-and-elsevier-heres-more-info/.
  4. Dumon, Oliver. “Elsevier Welcomes Mendeley.” Elsevier Connect, April 8, 2013. http://elsevierconnect.com/elsevier-welcomes-mendeley/.
  5. Hoyt, Jason. “My Thoughts on Mendeley/Elsevier & Why I Left to Start PeerJ,” April 9, 2013. http://enjoythedisruption.com/post/47527556151/my-thoughts-on-mendeley-elsevier-why-i-left-to-start.
  6. For one, see “Mendeley Sells Out; I’m Moving to Zotero.” LJ Villanueva’s Research Blog. Accessed May 20, 2013. http://research.coquipr.com/archives/492.