Text Mining With the HathiTrust Research Center

Extracted Features

HTRC releases research datasets to facilitate text analysis of the HathiTrust Digital Library. Although copyright-protected texts cannot be downloaded from HathiTrust, useful research can still be done through non-consumptive analysis of features extracted from the full text. These features include volume-level metadata, page-level metadata, part-of-speech-tagged tokens, and token counts. Because the data record counts rather than word order, they do not support analysis at the level of syntax, but they do enable "bag-of-words" methods such as topic modeling. Additionally, HTRC has partnered with advanced researchers to release a derived dataset, Word Frequencies in English-Language Literature, 1700-1922.
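
To make the "bag-of-words" idea concrete, here is a minimal Python sketch that collapses page-level, part-of-speech-tagged token counts into a single word-count table for a volume. The small `volume` record is invented for illustration, and its field names (`pages`, `tokenPosCount`) are assumptions loosely modeled on the Extracted Features files; check the EF Dataset documentation for the actual layout before adapting it.

```python
from collections import Counter

# Hypothetical, hand-made record: each page maps a token to its
# counts per part-of-speech tag. Field names are assumptions.
volume = {
    "pages": [
        {"tokenPosCount": {"whale": {"NN": 2}, "the": {"DT": 3}}},
        {"tokenPosCount": {"whale": {"NN": 1, "VB": 1}, "sea": {"NN": 4}}},
    ]
}


def bag_of_words(volume):
    """Sum per-page, per-tag token counts into one volume-level Counter."""
    bag = Counter()
    for page in volume["pages"]:
        for token, pos_counts in page["tokenPosCount"].items():
            # Ignore the tag and page of origin; keep only how often
            # the (lowercased) word occurs in the volume as a whole.
            bag[token.lower()] += sum(pos_counts.values())
    return bag


bag = bag_of_words(volume)
print(bag.most_common(3))
```

A table like this holds what a topic model needs (which words occur, and how often) while discarding word order, which is why the full text cannot be reconstructed from it.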

 

A full explanation of the dataset's features, motivation, and creation is available on the EF Dataset documentation page.

 

A sample (sample.zip) is available for download through your browser, as are thematic collections: DocSouth (87 volumes), EEBO (355 volumes), and ECCO (505 volumes).