Publications

DeepPatent2: A Large-Scale Benchmarking Corpus for Technical Drawing Understanding

by Kehinde Ajayi, Xin Wei, Martin Gryder, Winston Shields, Jian Wu, Shawn M. Jones, Michal Kucer, and Diane Oyen

Recent advances in computer vision (CV) and natural language processing have been driven by exploiting big data on practical applications. However, these research fields are still limited by the sheer volume, versatility, and diversity of the available datasets. CV tasks, such...

Synthesizing Web Archive Collections Into Big Data: Lessons From Mining Data From Web Archives

by Shawn M. Jones, Himarsha Jayanetti, Martin Klein, Michele C. Weigle, and Michael L. Nelson

Web archives are sources of big data. When presenting human visitors with archived web pages, or mementos, web archives often apply user interface augmentations to assist them. Unfortunately, these augmentations present challenges for natural language processing, computer visi...

Summarizing Web Archive Corpora Via Social Media Storytelling By Automatically Selecting and Visualizing Exemplars

by Shawn M. Jones, Martin Klein, Michele C. Weigle, and Michael L. Nelson

People often create themed collections to make sense of an ever-increasing number of archived web pages. Some of these collections contain hundreds of thousands of documents. Thousands of collections exist, many covering the same topic. Few collections include standardized met...

Discovering Image Usage Online: A Case Study With "Flatten the Curve"

by Shawn M. Jones and Diane Oyen

Understanding the spread of images across the web helps us understand the reuse of scientific visualizations and their relationship with the public. The “Flatten the Curve” graphic was heavily used during the COVID-19 pandemic to convey a complex concept in a simple form. It d...

Abstract Images Have Different Levels of Retrievability Per Reverse Image Search Engine

by Shawn M. Jones, Diane Oyen

Much computer vision research has focused on natural images, but technical documents typically consist of abstract images, such as charts, drawings, diagrams, and schematics. How well do general web search engines discover abstract images? Recent advancements in computer visio...

Creating Structure in Web Archives With Collections: Different Concepts From Web Archivists

by Himarsha R. Jayanetti, Shawn M. Jones, Martin Klein, Alex Osborne, Paul Koerbin, Michael L. Nelson, Michele C. Weigle

As web archives’ holdings grow, archivists subdivide them into collections so they are easier to understand and manage. In this work, we review the collection structures of eight web archive platforms. We note a plethora of different approaches to web archive collection struct...

Robustifying Links With Zotero

by Martin Klein, Shawn M. Jones

Referencing resources on the web has become an integral part of our digital scholarship. However, the long-term availability and accessibility of these resources has rarely been the focus of significant research and development efforts. In this paper we introduce the Zotero Ro...

The DSA Toolkit Shines Light Into Dark and Stormy Archives

by Shawn M. Jones, Himarsha R. Jayanetti, Alex Osborne, Paul Koerbin, Martin Klein, Michele C. Weigle, and Michael L. Nelson

The Dark and Stormy Archives (DSA) Project applies social media storytelling to a subset of a collection to facilitate collection understanding at a glance. As part of this work, we developed the DSA Toolkit, which helps archivists and visitors leverage this capability. As par...

Hypercane: Toolkit for Summarizing Large Collections of Archived Webpages

by Shawn M. Jones, Michele C. Weigle, Michael L. Nelson

In the Dark and Stormy Archives (DSA) project, we focus on storytelling techniques to summarize collections of archived web pages. Since collections can have hundreds or even thousands of seeds (initial URLs) and each seed can be recrawled many times, with each version separat...

Hypercane: Intelligent Sampling for Web Archive Collections

by Shawn M. Jones, Michele C. Weigle, Martin Klein, Michael L. Nelson

Humans can choose individual documents from a web archive collection, but doing so is difficult if they are unfamiliar with the collection. The issue is scale. Most web archive collections consist of thousands of documents. Hypercane is a tool that automates the selection of d...

It's All About The Cards: Sharing on Social Media Probably Encouraged HTML Metadata Growth

by Shawn M. Jones, Valentina Neblitt-Jones, Michele C. Weigle, Martin Klein, and Michael L. Nelson

In a perfect world, all articles consistently contain sufficient metadata to describe the resource. We know this is not the reality, so we are motivated to investigate the evolution of the metadata that is present when authors and publishers supply their own. Because applying ...

Improving Collection Understanding for Web Archives with Storytelling: Shining Light Into Dark and Stormy Archives

by Shawn M. Jones

Collections are the tools that people use to make sense of an ever-increasing number of archived web pages. As collections themselves grow, we need tools to make sense of them. Tools that work on the general web, like search engines, are not a good fit for these collections be...

Interoperability for Accessing Versions of Web Resources with the Memento Protocol

by Shawn M. Jones, Martin Klein, Herbert Van de Sompel, Michael L. Nelson, and Michele C. Weigle

Used by a variety of researchers, web archive collections have become invaluable sources of evidence. If a researcher is presented with a web archive collection that they did not create, how do they know what is inside so that they can use it for their own research? Search eng...

Automatically Selecting Striking Images for Social Cards

by Shawn M. Jones, Michele C. Weigle, Martin Klein, Michael L. Nelson

To allow previewing a web page, social media platforms have developed social cards: visualizations consisting of vital information about the underlying resource. At a minimum, social cards often include features such as the web resource’s title, text summary, striking image, a...

Robustifying Links To Combat Reference Rot

by Shawn M. Jones, Martin Klein, and Herbert Van de Sompel

Links to web resources frequently break, and linked content can change at unpredictable rates. These dynamics of the Web are detrimental when references to web resources provide evidence or supporting information. In this paper, we highlight the significance of reference rot, ...

SHARI -- An Integration of Tools to Visualize the Story of the Day

by Shawn M. Jones, Alexander C. Nwala, Martin Klein, Michele C. Weigle, Michael L. Nelson

Tools such as Google News and Flipboard exist to convey daily news, but what about the past? In this paper, we describe how to combine several existing tools with web archive holdings to perform news analysis and visualization of the “biggest story” for a given date. StoryGrap...

MementoEmbed and Raintale for Web Archive Storytelling

by Shawn M. Jones, Martin Klein, Michele C. Weigle, Michael L. Nelson

For traditional library collections, archivists can select a representative sample from a collection and display it in a featured physical or digital library space. Web archive collections may consist of thousands of archived pages, or mementos. How should an archivist display...

Social Cards Probably Provide For Better Understanding Of Web Archive Collections

by Shawn M. Jones, Michele C. Weigle, Michael L. Nelson

Used by a variety of researchers, web archive collections have become invaluable sources of evidence. If a researcher is presented with a web archive collection that they did not create, how do they know what is inside so that they can use it for their own research? Search eng...

Improving Collection Understanding in Web Archives

by Shawn M. Jones

Ever since the Internet Archive started large-scale web archiving in 1996, historians, sociologists, and journalists have found web archives to be an important source of information for their work. Archive-It, a service focused on creating collections, allows curators to gene...

The Off-Topic Memento Toolkit

by Shawn M. Jones, Michele C. Weigle, and Michael L. Nelson

Web archive collections are created with a particular purpose in mind. A curator selects seeds, or original resources, which are then captured by an archiving system and stored as archived web pages, or mementos. The systems that build web archive collections are often configu...

The Many Shapes of Archive-It

by Shawn M. Jones, Alexander Nwala, Michele C. Weigle, and Michael L. Nelson

Web archives, a key area of digital preservation, meet the needs of journalists, social scientists, historians, and government organizations. The use cases for these groups often require that they guide the archiving process themselves, selecting their own original resources...

Avoiding spoilers: wiki time travel with Sheldon Cooper

by Shawn M. Jones, Michael L. Nelson, and Herbert Van de Sompel

A variety of fan-based wikis about episodic fiction (e.g., television shows, novels, movies) exist on the World Wide Web. These wikis provide a wealth of information about complex stories, but if fans are behind in their viewing they run the risk of encountering “spoilers”—inf...

Uniform Access to Raw Mementos

by Herbert Van de Sompel, Michael L. Nelson, Lyudmila Balakireva, Martin Klein, Shawn M. Jones, and Harihar Shankar

Most web archives augment Mementos when presenting them to the user, often for usability or legal purposes. Research efforts and software projects need access to the original captured “raw” Mementos. So that users and software do not need to resort to archive-specific solutions, ...

Scholarly Context Adrift: Three out of Four URI References Lead to Changed Content

by Shawn M. Jones, Herbert Van de Sompel, Harihar Shankar, Martin Klein, Richard Tobin, and Claire Grover

Increasingly, scholarly articles contain URI references to “web at large” resources including project web sites, scholarly wikis, ontologies, online debates, presentations, blogs, and videos. Authors reference such resources to provide essential context for the research they r...

Persistent URIs Must Be Used To Be Persistent

by Herbert Van de Sompel, Martin Klein, Shawn M. Jones

We quantify the extent to which references to papers in scholarly literature use persistent HTTP URIs that leverage the Digital Object Identifier infrastructure. We find a significant number of references that do not, speculate why authors would use brittle URIs when persisten...

Rules of Acquisition for Mementos and Their Content

by Shawn M. Jones, Harihar Shankar

Text extraction from web pages has many applications, including web crawling optimization and document clustering. Though much has been written about the acquisition of content from live web pages, content acquisition of archived web pages, known as mementos, remains a relativ...

Avoiding Spoilers in Fan Wikis of Episodic Fiction

by Shawn M. Jones, Michael L. Nelson

A variety of fan-based wikis about episodic fiction (e.g., television shows, novels, movies) exist on the World Wide Web. These wikis provide a wealth of information about complex stories, but if readers are behind in their viewing they run the risk of encountering “spoilers” ...

Avoiding Spoilers on Mediawiki Fan Sites Using Memento

by Shawn M. Jones

A variety of fan-based wikis about episodic fiction (e.g., television shows, novels, movies) exist on the World Wide Web. These wikis provide a wealth of information about complex stories, but if readers are behind in their viewing they run the risk of encountering “spoilers” –...

Bringing Web Time Travel to MediaWiki: An Assessment of the Memento MediaWiki Extension

by Shawn M. Jones, Michael L. Nelson, Harihar Shankar, Herbert Van de Sompel

We have implemented the Memento MediaWiki Extension Version 2.0, which brings the Memento Protocol to MediaWiki, used by Wikipedia and the Wikimedia Foundation. Test results show that the extension has a negligible impact on performance. Two 302 status code datetime negotiatio...
