
Text Mining With the HathiTrust Research Center

About text mining

Text mining and text analysis are blanket terms for analyzing documents with software tools. Text mining uses automation to analyze collections of textual materials in order to capture key concepts and themes and to uncover hidden relationships and trends. A collection might be, for example, everything written by a certain author, all works in a subgenre, or all works by a certain set of authors.

Text mining can be used to address questions such as:

  • Which concepts occur together?
  • What else are they linked to?
  • What higher level categories can be made from extracted information?
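
As a rough illustration of the first question, the Python sketch below counts how often pairs of terms appear in the same document. It is a minimal example under stated assumptions, not an HTRC tool: the three-sentence corpus, the stopword list, and the function names tokenize and cooccurrence_counts are all hypothetical and chosen only for illustration.

```python
from collections import Counter
from itertools import combinations
import re

# Hypothetical toy corpus; in practice these would be the texts in a collection.
documents = [
    "The whale surfaced near the ship at dawn.",
    "The captain watched the whale from the ship.",
    "A storm drove the ship toward the coast.",
]

# Hypothetical, deliberately tiny stopword list for this example only.
STOPWORDS = {"the", "a", "at", "from", "near", "toward"}


def tokenize(text):
    """Lowercase the text and return its set of distinct non-stopword terms."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if w not in STOPWORDS}


def cooccurrence_counts(docs):
    """Count, for each pair of terms, how many documents contain both."""
    counts = Counter()
    for doc in docs:
        # Sorting gives each pair one canonical (alphabetical) order.
        terms = sorted(tokenize(doc))
        counts.update(combinations(terms, 2))
    return counts


pair_counts = cooccurrence_counts(documents)
for pair, n in pair_counts.most_common(3):
    print(pair, n)
```

In this toy corpus, "ship" and "whale" are the only terms that appear together in more than one document. Real projects usually add steps such as stemming, a fuller stopword list, and a statistical measure of association rather than raw counts.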

 

About the HTRC

The HathiTrust Research Center (HTRC), the research arm of HathiTrust, facilitates scholarly research by giving researchers mechanisms to access content and computational tools for text analysis.

Most HTRC services require an account. To register, go to the HTRC portal at analytics.hathitrust.org and choose "Sign up" from the menu. Anyone with an email address from a nonprofit institution of higher education may register, including those whose institutions are not HathiTrust members. (UNH is a HathiTrust member.)

You can create a workset of books in the HathiTrust Digital Library and import it into HTRC to run basic algorithms. It is also possible to work with HTRC to gain access to the entire HathiTrust corpus, including materials still in copyright, for use in non-consumptive research.

The 2010 Authors Guild v. Google amended settlement agreement defines the term as follows: "'Non-Consumptive Research' means research in which computational analysis is performed on one or more Books, but not research in which a researcher reads or displays substantial portions of a Book to understand the intellectual content presented within the Book." Non-consumptive analytics include image analysis, text extraction, textual analysis and information extraction, linguistic analysis, automated translation, and indexing and search. HathiTrust's Non-Consumptive Use Research Policy gives more detail.

   

Documentation

Getting Started Guide                
HTRC's documentation and FAQ to get you started.
 

HTRC provides extensive documentation on its tools, including instructional videos, tutorials, presentations, examples, and Getting Started FAQs.

 

Bookworm: HathiTrust

Graphically explore language trends over time in millions of volumes in the HathiTrust Digital Library.