Text Mining With the HathiTrust Research Center

Worksets

HTRC worksets are user-created collections of HathiTrust volumes to be treated as data and analyzed using HTRC tools and services. Worksets are curated by researchers, and they can be shared and cited to improve reproducibility.

Create or find a workset


Creating a workset

You must be logged in to create HTRC worksets or to view existing ones.

Worksets are manifests of HathiTrust volume IDs with additional metadata and functionality. To create a workset, create a collection in HathiTrust, download the metadata, and upload the resulting comma-separated file. Worksets can be public (viewable by users signed in to HTRC Analytics) or private (viewable only by you).

Workset Format

Worksets start as lists of HathiTrust volume IDs (for example, hvd.hn5f64). If you upload a volume list file to create a workset, the file should be in CSV (comma-separated values) or TXT format. It may contain other columns, but the volume IDs must be in the first column, and the file should include a header row containing the text "volume" or "id".
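
As a rough illustration of the format described above, the following Python sketch writes a one-column volume list and checks an existing file before upload. The function names `write_volume_list` and `read_volume_list` are hypothetical, not part of any HTRC tool, and the header check (a case-insensitive exact match on the first column) is an assumption that may be stricter than what the HTRC uploader actually requires.

```python
import csv
import os
import tempfile

# Header text the guide says the file should contain. Treating it as a
# case-insensitive exact match on the first column is an assumption.
ACCEPTED_HEADERS = {"volume", "id"}


def write_volume_list(volume_ids, path):
    """Write volume IDs to a one-column CSV file with a 'volume' header row."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["volume"])
        writer.writerows([vid] for vid in volume_ids)


def read_volume_list(path):
    """Return the volume IDs from the first column after checking the header."""
    with open(path, newline="", encoding="utf-8") as f:
        # csv.reader yields an empty list for blank lines; skip those.
        rows = [row for row in csv.reader(f) if row]
    if not rows:
        raise ValueError("volume list file is empty")
    header = rows[0][0].strip().lower()
    if header not in ACCEPTED_HEADERS:
        raise ValueError(
            f"unexpected header {rows[0][0]!r}; expected 'volume' or 'id'"
        )
    # Only the first column matters; any other columns are ignored.
    return [row[0].strip() for row in rows[1:] if row[0].strip()]


if __name__ == "__main__":
    # Round-trip the sample ID from this guide through a temporary file.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "my_workset.csv")
        write_volume_list(["hvd.hn5f64"], path)
        print(read_volume_list(path))  # ['hvd.hn5f64']
```

Running a check like this locally can catch a missing header row or IDs placed in the wrong column before you upload the file.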

Using a workset

Run one of the supplied text analysis algorithms against an HTRC workset. You can also use the HathiTrust volume IDs to download HTRC Extracted Features, or access the volumes in your workset from the HTRC Data Capsule environment using the HTRC Data API.

Workset Builder

  • Use the Workset Builder in the HTRC interface to select public domain volumes for analysis, or upload your own workset. This tool is under development and does not currently include in-copyright works.
  • Use HTRC-designed preset algorithms for quick analysis and to explore your workset corpora.