Text Mining With the HathiTrust Research Center

Tools

  • Voyant

    Voyant is a free, easy-to-use web-based text analysis tool. Users upload or paste text, and the program computes word frequencies and collocates and displays them graphically.

  • Wordle

    Creates a word cloud from your own text.

  • MALLET

    MALLET (MAchine Learning for LanguagE Toolkit) is a collection of tools for document classification, sequence tagging, and topic modeling. There is also an add-on toolkit, GRMM (Graphical Models in MALLET), which supports inference in general graphical models.

  • WordSeer

    WordSeer is a collection of text analysis tools targeted at humanities scholars that includes side-by-side comparison, grammatical search, and document/sentence/word-set features.

  • Google Books Ngram Viewer   FREE 
    Charts the frequency of any word or short phrase using yearly counts of n-grams found in sources printed between 1500 and the present. If you are interested in performing a large-scale analysis of the underlying data, the corpora are available for download.

  • Google Books BYU View   FREE 
    Compares n-gram results across the Corpus of Historical American English (COHA), Google Books (Standard), and the Google Books (BYU / Advanced) corpus.

  • Culturomics Bookworm Viewer   FREE 
    Developed by Culturomics at Harvard, Bookworm is an interface for querying the Google Books corpus. Users can run queries in highly selective corpora based on subject (books on world history, American books on science, etc.), though these corpora are much smaller than the full Google Books collection.

 

  • JSTOR Data for Research

    Data for Research is a free data mining tool for journal content on JSTOR, available to the public. It offers a faceted search interface, online viewing of document-level data, and bulk downloads of datasets (including word frequencies, citations, key terms, and n-grams).
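
Several of the tools above report the same basic measures: word frequencies, collocates (words that appear near a chosen keyword), and n-gram counts. The following is a minimal Python sketch of those three measures using only the standard library. It is not how any of the listed tools is implemented; the function names, the simple regular-expression tokenizer, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into word tokens (a deliberately simple rule)."""
    return re.findall(r"[a-z']+", text.lower())


def word_frequencies(tokens):
    """Count how many times each token occurs."""
    return Counter(tokens)


def ngrams(tokens, n):
    """Count each run of n consecutive tokens, stored as a tuple."""
    return Counter(zip(*(tokens[i:] for i in range(n))))


def collocates(tokens, keyword, window=2):
    """Count tokens found within `window` positions of each occurrence of keyword."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token != keyword:
            continue
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                counts[tokens[j]] += 1
    return counts


# Hypothetical sample text, used only to exercise the functions.
sample = "The whale swam. The white whale dived, and the whale rose again."
tokens = tokenize(sample)

print(word_frequencies(tokens).most_common(3))
print(ngrams(tokens, 2).most_common(3))
print(collocates(tokens, "whale", window=1).most_common(3))
```

For real corpora, a tokenizer that handles punctuation, hyphenation, and non-English characters would be needed, and counts are usually normalized by corpus size (or by year) before they are compared.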

Open source text corpora