Research Guides: Text Mining With the Hathi Trust Research Center: Extracted features data set

Extracted Features

HTRC releases research datasets to facilitate text analysis of the HathiTrust Digital Library. Although copyright-protected texts cannot be downloaded from HathiTrust, useful research can still be done through non-consumptive analysis of features extracted from the full text. These features include volume-level metadata, page-level metadata, part-of-speech-tagged tokens, and token counts. Because the data do not preserve word order, they cannot support analysis at the level of syntax, but they do enable "bag-of-words" methods such as topic modeling. Additionally, HTRC has partnered with advanced researchers to release a derived dataset, Word Frequencies in English-Language Literature, 1700-1922.
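To make the "bag-of-words" idea concrete, here is a minimal Python sketch of how page-level, part-of-speech-tagged token counts could be collapsed into a single word-count table for a volume. The `sample_volume` structure, its field names (`pages`, `seq`, `tokenPosCount`), and the helper `volume_bag_of_words` are simplified assumptions made for illustration; consult the EF Dataset documentation for the actual file schema before adapting this.

```python
from collections import Counter

# Hypothetical, simplified stand-in for one volume's extracted features.
# Field names are assumptions for illustration, not the official schema.
sample_volume = {
    "pages": [
        {"seq": "00000001",
         "tokenPosCount": {"the": {"DT": 3}, "whale": {"NN": 2}}},
        {"seq": "00000002",
         "tokenPosCount": {"The": {"DT": 1}, "whale": {"NN": 1},
                           "sails": {"NNS": 1, "VBZ": 1}}},
    ]
}


def volume_bag_of_words(volume, lowercase=True):
    """Sum per-page token counts into one volume-level bag of words.

    Part-of-speech tags are collapsed, so each token maps to its total count.
    """
    bag = Counter()
    for page in volume["pages"]:
        for token, pos_counts in page["tokenPosCount"].items():
            key = token.lower() if lowercase else token
            bag[key] += sum(pos_counts.values())
    return bag


bag = volume_bag_of_words(sample_volume)
print(bag.most_common(3))  # [('the', 4), ('whale', 3), ('sails', 2)]
```

A table like this records how often each word occurs but not where, which is why it can feed a topic model but cannot be used to reconstruct sentences.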

A full explanation of the dataset's features, motivation, and creation is available on the EF Dataset documentation page.

A sample (sample.zip) is available for download through your browser, as are thematic collections: DocSouth (87 volumes), EEBO (355 volumes), and ECCO (505 volumes).