Wednesday, July 29, 2015

LOC's Twitter Archive In Limbo

In a perfect example of where libraries generally find themselves in the era of fast-paced technological innovation and Big Data, the Library of Congress is having trouble making the transition.

In the spring of 2010, the Library of Congress announced it was taking a big stride toward preserving the nation’s increasingly digital heritage — by acquiring Twitter’s entire archive of tweets and planning to make it all available to researchers. But more than five years later, the project is in limbo. The library is still grappling with how to manage an archive that amounts to something like half a trillion tweets. And the researchers are still waiting.

The archive’s fate is yet another example of the difficulty of safeguarding the historical records of an era when people communicate using easily deletable emails, websites that can be taken down in seconds and transient tweets, Vines and Snaps. But the library’s critics also see it as a cautionary tale from the 28-year tenure of retiring Librarian of Congress James Billington.

During Billington’s time in office, say critics, the library has espoused grand technological ambitions but hasn't backed them up with the planning, budget or nuts-and-bolts follow-through needed to turn them from buzzy news releases into tangible accomplishments. It has also repeatedly faced criticism for its management of the U.S. Copyright Office, which has been drawn into numerous controversies on issues involving software, cellphones and online music streaming.

This isn't a unique story. It's a very public example, but most libraries are going through something similar on a much smaller scale.

But in response to a deeply critical Government Accountability Office report in March about the library's tech shortcomings, Billington said he was taking steps to “fully realize the possibilities of the digital era,” including plans to hire a chief information officer by September; the library has not had a permanent CIO since 2012.

And this is a great example of the types of expertise that libraries need to stay at the forefront and successfully keep pace with technological advances.

Library spokeswoman Gayle Osterberg said the library is still making progress on the tweet archive — officially known as the Twitter Research Access project. “The Library has been working to index the collection and develop use policies,” while having to balance “the size and dynamic nature of the Twitter platform” and “the resource realities of a public institution,” she said.

And this sentence sums up the issue to a T.
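To put those "resource realities" in perspective, here is a quick back-of-envelope sketch in Python. The tweet count comes from the article; the per-tweet size is my own assumption, not a figure the Library has published:

```python
# Rough back-of-envelope estimate of the archive's raw size.
# ASSUMPTION: ~2,500 bytes per tweet once JSON metadata is included;
# the actual average is not published in the article.
tweet_count = 500_000_000_000      # "something like half a trillion tweets"
bytes_per_tweet = 2_500            # assumed: tweet text plus metadata

total_bytes = tweet_count * bytes_per_tweet
print(f"Estimated raw size: {total_bytes / 1e15:.2f} PB")  # ~1.25 PB

# And that's before building any full-text indexes, which can multiply
# the storage footprint several times over.
```

Even under generous assumptions, that is petabytes of data to index, store and serve to researchers on a public institution's budget.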

While some have accused the library of wasting time and money on preserving social media ephemera, the institution has argued that the huge stash of tweets could provide future generations an invaluable real-time record of how humans in the 21st century communicate.

“Archiving and preserving outlets such as Twitter will enable future researchers access to a fuller picture of today’s cultural norms, dialogue, trends and events to inform scholarship, the legislative process, new works of authorship, education and other purposes,” Osterberg wrote in a short 2013 white paper that was the last major update on the state of the tweet collection.

This is, again, where we have to ask ourselves whether we have too much collective knowledge and what is truly useful to archive and preserve. A recent IBM Watson discussion mostly convinced me that the goal is not less data but better ways to organize and retrieve data. It's futile to reject Big Data, and libraries and librarians need to develop the skills necessary to provide relevant services in the Big Data age. Coding camp, anyone?

Tuesday, July 28, 2015

New Repository For Dark Data

Do you ever wonder what happens to the vast treasure trove of data on which researchers rely for some of their most startling discoveries? Most of it goes "dark" and is never seen again after a research project is over.

The Chronicle of Higher Education is reporting that researchers at the University of North Carolina at Chapel Hill are leading an effort to create a one-stop shop for data sets that would otherwise be lost to the public after the papers they were produced for are published. The goal of the project, called DataBridge, is to extend the life cycle of so-called dark data. It will serve as an archive for data sets and metadata, and it will group them into clusters of information to make relevant data easier to find.

The hope is that eventually researchers from around the country will submit their data after publishing their findings.

This is a great way to share data that is often very time-consuming and expensive to extract.
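To make the clustering idea concrete, here is a minimal sketch of grouping data sets by the similarity of their metadata descriptions. The descriptions are invented, and this is not DataBridge's actual implementation:

```python
# Sketch: cluster data sets by TF-IDF similarity of their descriptions,
# so related data lands together and is easier to discover.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

descriptions = [
    "Survey responses on social media use during the Arab Spring",
    "Tweets collected during the 2011 Egyptian protests",
    "Ocean temperature readings from Atlantic buoys, 2005-2012",
    "Sea surface salinity measurements, North Atlantic cruise data",
]

# Represent each data set's description as a TF-IDF vector...
vectors = TfidfVectorizer(stop_words="english").fit_transform(descriptions)

# ...then cluster the vectors so similar data sets share a label.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

for label, desc in sorted(zip(labels, descriptions)):
    print(label, desc)
# The two social-media data sets should fall in one cluster and the two
# oceanographic data sets in the other.
```

Real metadata is far messier, of course, but the principle of letting description similarity drive discovery is the same.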

The researchers are also interested in including another type of “dark data”: archives of social-media posts. For example, the group has imagined creating algorithms to sort through tweets posted during the Arab Spring for researchers studying the role of social media in the movement.
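As a purely hypothetical illustration of such an algorithm (invented tweets, not the group's code), sorting could start with a simple date filter and hashtag counts:

```python
# Sketch: restrict tweets to a study window, then surface the dominant
# hashtags so researchers can see which topics led the conversation.
from collections import Counter
from datetime import datetime

tweets = [  # invented examples
    {"created_at": datetime(2011, 1, 28), "text": "Crowds in Tahrir Square #Jan25 #Egypt"},
    {"created_at": datetime(2011, 2, 1),  "text": "March of millions today #Jan25"},
    {"created_at": datetime(2012, 6, 30), "text": "Unrelated tweet #Elections"},
]

start, end = datetime(2010, 12, 17), datetime(2011, 12, 31)

# Keep only tweets inside the study window...
in_window = [t for t in tweets if start <= t["created_at"] <= end]

# ...and count hashtag frequency across them.
hashtags = Counter(
    word.lower()
    for t in in_window
    for word in t["text"].split()
    if word.startswith("#")
)

print(hashtags.most_common())   # [('#jan25', 2), ('#egypt', 1)]
```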

And in some cases, the project could serve as a model for libraries at research institutions that are looking to better track data in line with federal requirements.

But as commenters to the article noted, there are issues with reusing data sets. What about authenticity and ownership rights?

To that end, librarians have gotten involved with the MLA's beta repository, which allows members to post such data sets, as well as blog posts and conference papers, to assign a suitable license to them, and to receive a DOI for them (thereby going some way toward solving the authenticity and ownership issues). It was developed in collaboration with the librarians at Columbia's CDRS, and while only members can deposit their work, anyone can view and download it.

This is a good start toward organizing data in a searchable format. With disruptive technologies such as the Internet of Things capable of producing vast data sets, librarians should be at the forefront of sifting through them.

Monday, July 27, 2015

Disruptive Technologies & Libraries

Continuing with the theme of technology, in 2013, McKinsey released a report on the 12 disruptive technologies that have the greatest potential to drive substantial economic impact and disruption by 2025.

Important technologies can come in any field or emerge from any scientific discipline, but they share four characteristics: high rate of technology change, broad potential scope of impact, large economic value that could be affected, and substantial potential for disruptive economic impact. Many technologies have the potential to meet these criteria eventually, but leaders need to focus on technologies with potential impact that is near enough at hand to be meaningfully anticipated and prepared for. Therefore, we focused on technologies that we believe have significant potential to drive economic impact and disruption by 2025. 

Here is the list of 12:

  • Mobile Internet
  • Automation of knowledge work
  • The Internet of Things
  • Cloud technology
  • Advanced robotics
  • Autonomous and near-autonomous vehicles
  • Next-generation genomics
  • Energy storage
  • 3D printing
  • Advanced materials
  • Advanced oil and gas exploration and recovery
  • Renewable energy
And here is the projected impact:

[Exhibit from the McKinsey report charting each technology's estimated potential economic impact in 2025; chart not reproduced.]
Quite a few of the forecasted disruptive technologies will have a large impact on the way that people access information:
  • Automation of knowledge work: Intelligent software systems that can perform knowledge work tasks involving unstructured commands and subtle judgments
  • The Internet of Things: Networks of low-cost sensors and actuators for data collection, monitoring, decision making, and process optimization
  • Mobile Internet: Increasingly inexpensive and capable mobile computing devices and Internet connectivity
  • Cloud technology: Use of computer hardware and software resources delivered over a network or the Internet, often as a service
The report goes on to discuss other interesting observations and implications, and it is well worth the read. One thing that is specifically noted is that not all technologies live up to the hype.
    The link between hype and potential is not clear. Emerging technologies often receive a great deal of notice. News media know that the public is fascinated with gadgets and eager for information about how the future might unfold. The history of technology is littered with breathless stories of breakthroughs that never quite materialized. The hype machine can be equally misleading in what it chooses to ignore. As Exhibit E5 shows, with the exception of the mobile Internet, there is no clear relationship between the amount of talk a technology generates and its potential to create value.
    The lesson for leaders is to make sure that they and their advisers have the knowledge to make their own assessments based on a structured analysis involving multiple scenarios of technology advancement and potential impact. 

And it's very important that policymakers not make preemptive decisions based on where technology stands now and where it is projected to go. None of us knows exactly where it will go or what it will mean, specifically for the future of libraries. In 2025, I suspect that we will be closer to our own personal Watsons but still very far from computers completely replacing humans in the knowledge-work sector.

Friday, July 24, 2015

Innovation & Jobs

With so much chatter recently about technology killing jobs, it's hard not to notice.

A 2014 NYTimes article reviewed books with competing outlooks: one camp is optimistic about technology and jobs, while the other is much more pessimistic.

As the article notes, to see the effect technology can have on jobs, look no further than Kodak. "At its peak, Kodak employed 140,000 people; Instagram had only 13 employees when it was bought by Facebook (for $1 billion!) in 2012." This is the pessimistic view.

In addition, Erik Brynjolfsson and Andrew McAfee, two economists from the Massachusetts Institute of Technology, note that “[r]apid and accelerating digitization is likely to bring economic rather than environmental disruption, stemming from the fact that as computers get more powerful, companies have less need for some kinds of workers.” They believe that we are at a moment when technological innovation is about to accelerate and make the world much wealthier, just as the Industrial Revolution did 250 years ago. Yet buried in their sunny prose is a darker forecast: that while this digital revolution will be great for innovators, entrepreneurs and other creative people, not everyone will participate — especially those who do jobs that software can do better.

On the other side of the fence is Robert J. Gordon, a macroeconomist at Northwestern University. "In his view, the next 40 years of innovation is not going to look much different from the past 40 years, which he believes haven’t been nearly as transformative or wealth-creating as the discovery of electricity and the invention of the light bulb." When asked whether future innovation would cost jobs, he said he thought it would, but no more or less than has always been the case.

So, in essence, we are where we were 30 years ago. We have one side telling us that doom is imminent, and we have the other side telling us that the type of innovation that will kill jobs is still many decades away.

What is important is that we recognize now that we have the ability to take control of our destiny rather than letting technology take control of us. We need to stay ahead of the curve and make sure that there is a place for us (librarians) in the future.

Thursday, July 23, 2015

Intelligence Augmentation (IA) v. Artificial Intelligence (AI)

While at AALL, I watched Kyla Moran present on IBM's Watson. One thing struck me: the big difference between intelligence augmentation (IA) and artificial intelligence (AI). Kyla likened it to Iron Man's JARVIS v. the Terminator.

It's a long-running "joke" of sorts within the librarian profession that "they've" been predicting our demise in favor of artificial intelligence for at least 30 years. And it's gotten louder recently with books like Rise of the Robots.

Kyla commented that Watson is augmented intelligence. He makes us smarter. And IBM is not trying to overtake humans with machines.

According to Wikipedia:
    Intelligence amplification (IA) (also referred to as cognitive augmentation and machine augmented intelligence) refers to the effective use of information technology in augmenting human intelligence. The idea was first proposed in the 1950s and 1960s by cybernetics and early computer pioneers.

    IA is sometimes contrasted with AI (Artificial Intelligence), that is, the project of building a human-like intelligence in the form of an autonomous technological system such as a computer or robot. AI has encountered many fundamental obstacles, practical as well as theoretical, which for IA seem moot, as it needs technology merely as an extra support for an autonomous intelligence that has already proven to function.

Augmented intelligence will be our reality in the near future. We will use computers to aid our capability to retrieve relevant results in the age of big data.

The ethics are thorny in this area, and although IBM says that it doesn't want to replace humans, there's no guarantee that other entities will be so ethical. That said, we won't be in a position anytime soon to be completely replaced by computers, and it's important for the public to understand where computing power stands now and why librarians are still needed. If the public perception is that libraries are not needed, then budgets will be slashed. But if the public understands the need for libraries and librarians, we will continue to be supported and to offer our invaluable services.

Wednesday, July 22, 2015

Google Fares Better Than Proprietary Plagiarism Software

Expensive plagiarism detection software from vendors such as Turnitin and SafeAssign proves to be no better than Google at detecting plagiarism. In fact, in past studies, Google has done a better job.

InsideHigherEd recently reported on a study by Susan E. Schorn, a writing coordinator at the University of Texas at Austin. Schorn first ran a test to determine Turnitin’s efficacy back in 2007, when the university was considering paying for an institution-wide license. Her results initially dissuaded the university from paying a five-figure sum to license the software, she said. A follow-up test, conducted this March, produced similar results.

For the 2007 test, Schorn created six essays that copied and pasted text from 23 different sources, which were chosen after asking librarians and faculty members to give examples of commonly cited works. Examples included textbooks and syllabi, as well as websites such as Wikipedia and free essay repositories. Of the 23 sources, used in ways that faculty members would consider inappropriate in an assignment, Turnitin identified only eight, and produced six other matches that found some text, nonoriginal sources or unviewable content. That means the software missed the remaining nine sources, or almost two-fifths (39.13 percent) of them.

SafeAssign (the product UT-Austin ended up choosing, as it was bundled with the university's learning management system) fared even worse. It missed more than half, or 56.6 percent, of the sources used in the test. Mark Strassman, Blackboard's senior vice president of industry and product management, said the company has since "changed the match algorithms … changed web search providers" and "massively" grown the database of submissions SafeAssign uses.

Google -- which Schorn notes is free and worked the fastest -- trounced both proprietary products. By searching for a string of three to five nouns in the essays, the search engine missed only two sources. Neither Turnitin nor SafeAssign identified the sources Google missed.
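For the curious, here is roughly what that technique might look like in code. This is my own sketch, not Schorn's actual procedure, and the noun "detection" below is a crude stand-in for a real part-of-speech tagger:

```python
# Sketch of the noun-string technique: pull a few likely nouns from a
# passage and turn them into a web search query.
from urllib.parse import quote_plus

STOPWORDS = {"the", "a", "an", "of", "in", "and", "to", "is", "are",
             "was", "were", "that", "this", "with", "for", "on", "by"}

def candidate_nouns(sentence, limit=5):
    """Crudely approximate 'three to five nouns': keep longer non-stopwords."""
    words = [w.strip(".,;:!?").lower() for w in sentence.split()]
    picked = [w for w in words if w not in STOPWORDS and len(w) > 3]
    return picked[:limit]

sentence = "The mitochondria regulate cellular respiration in eukaryotic organisms"
query = " ".join(candidate_nouns(sentence))
print("https://www.google.com/search?q=" + quote_plus(query))
```

Pasting the resulting query into a search engine and scanning the top hits is, in essence, the free method that outperformed the paid tools.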

A more recent test shows that results have not improved much since 2007. As UT-Austin recently replaced its learning management system, it also needed to replace its plagiarism detection software. Schorn therefore conducted the Turnitin test again this March. Out of a total of 37 sources, the software fully identified 15, partially identified six and missed 16. That test featured some word deletions and sentence reshuffling -- common tricks students use to cover up plagiarism.

We must be cognizant of the limitations of these plagiarism detectors. While useful, they are a starting point, and we cannot rely on them blindly.

Monday, July 20, 2015

AALL Annual Conference 2015

The AALL Annual Conference 2015 is currently underway. Follow me on Twitter @gngrlibrarian for updates, or go to #AALL15 to see updates from all attendees.

Friday, July 17, 2015

.Law Domain Names Available Oct. 12

Last April, the ABA Journal reported on a new .law domain. Minds + Machines holds the exclusive license from the Internet Corporation for Assigned Names and Numbers to operate the new .law domain. Minds + Machines said in a press release that it was partnering with the Legal Marketing Association to allow its members to submit an expression of early interest in .law domain names.

But that doesn’t mean others can’t submit their own expressions of interest, says Minds + Machines' Lou Andreozzi. Those who submit an expression of interest aren’t obligated to buy the domain, but they will be allowed to purchase it when .law becomes generally available, if no one else expresses an interest. When more than one person is interested, an auction is held.

Standard names such as johnsmith.law will cost $200, while premium names such as ohiodivorce.law will start at $500. The cost will be based on factors such as the number of characters and the value of certain practice areas.

Anyone who applies for a .law domain will have to certify that he or she is a lawyer and submit to a verification process. Lawyers can apply on behalf of themselves, their law firms and their companies.

The company has announced that, beginning July 30, trademark holders will be able to register corresponding names. The names will go on sale to lawyers Oct. 12.

Other new domains of interest to lawyers will also become available, including .attorney, .esq and .lawyer.

Thursday, July 16, 2015

Libraries Matter More Than Ever

Salon had it right when it stated that libraries are more important than ever.
    In our heartfelt but naïve fondness for “quiet, inviting spaces” full of books and nothing else, we fail to realize that libraries are becoming more important, not less, to our communities and our democracy.

One of the main reasons that libraries are more important than ever is that libraries and librarians help sift through the mountains of data that humans are currently producing.
    Humans are producing such quantities of data—2.5 quintillion bytes of data daily, to be precise—and on such a steep curve, that 90 percent of all existing data is less than two years old. An overwhelming amount of information, access to which is marked by the same stark inequality that exists between economic classes, demands to be moderated for the public good, and libraries are the institutions that do that.

    The risk of a small number of technically savvy, for-profit companies determining the bulk of what we read and how we read it is enormous. The great beauty of the rich, diverse library system that has developed over the past century and a half has been the role of librarians in selecting and making available a range of material for people to consult and enjoy. No one pressing an ideology can co-opt this system; no single commercial entity can do an end run around the library system in the interest of profit.

Libraries and librarians help moderate this data in an age when we are really starting to question whether there is too much collective knowledge. It's not that libraries are becoming less important as the need for print materials wanes; it's that the public needs to adjust its notion of what it means to be a library.