
Data Management Toolkit @ UNH

This toolkit provides information to help researchers develop data management plans and effectively manage their research data.

Choosing file formats

The file formats in which you record, store, and transmit your data are a primary factor in whether you and others will be able to use the data in the future.

Since technology continually changes, researchers should plan for both hardware and software obsolescence. How will your data be read if the software used to produce it becomes unavailable?

Formats more likely to be accessible in the future are:

  • Non-proprietary
  • Based on an open, documented standard
  • In common use by the research community
  • In a standard representation (ASCII, Unicode)
  • Unencrypted
  • Uncompressed

Consider migrating your data into a format with the above characteristics, in addition to keeping a copy in the original software format.
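One of the characteristics above, a standard character representation, can be checked mechanically. The following is a minimal sketch, not a full validation tool: the function name `classify_encoding` is hypothetical, and passing the check only shows that the bytes decode cleanly, not that the file is otherwise a good preservation format.

```python
def classify_encoding(data: bytes) -> str:
    """Return 'ascii', 'utf-8', or 'other' for a chunk of bytes.

    Hypothetical helper for illustration: it only tests whether the
    bytes decode cleanly, which is a necessary but not sufficient
    sign of a plain-text file.
    """
    try:
        data.decode("ascii")
        return "ascii"
    except UnicodeDecodeError:
        pass
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        return "other"


# Example inputs (made up for illustration).
plain = b"site,temp_c\nA,12.5\n"
accented = "caf\u00e9".encode("utf-8")
binary = b"\xff\xfe\x00"

print(classify_encoding(plain))
print(classify_encoding(accented))
print(classify_encoding(binary))
```

To check a real file, you could read it in binary mode, for example `open(path, "rb").read()`, and pass the bytes to the function.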

Examples of preferred format choices:

  • PDF/A, not Word
  • ASCII, not Excel
  • MPEG-4, not Quicktime
  • TIFF or JPEG2000, not GIF or JPG
  • XML or RDF, not RDBMS

For examples of how data archives treat different file formats, see the UK Data Archive page on data formats and software. Note that not all repositories are able to migrate data files to newer file formats for preservation.
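As an illustration of the "XML, not RDBMS" preference above, the sketch below exports one database table to XML using only the Python standard library. It assumes a SQLite database. The function name `table_to_xml`, the element names (`table`, `row`, `field`), and the sample table are all hypothetical choices for this example; a real project would follow whatever schema its community or repository recommends.

```python
import sqlite3
import xml.etree.ElementTree as ET


def table_to_xml(conn: sqlite3.Connection, table: str) -> str:
    """Serialize every row of one table as a simple XML document.

    The table name is placed directly into the SQL text, so only
    pass names you trust.
    """
    cursor = conn.execute(f'SELECT * FROM "{table}"')
    columns = [description[0] for description in cursor.description]

    root = ET.Element("table", name=table)
    for row in cursor:
        row_el = ET.SubElement(root, "row")
        for column, value in zip(columns, row):
            field = ET.SubElement(row_el, "field", name=column)
            # NULL values become empty elements in this sketch.
            field.text = "" if value is None else str(value)
    return ET.tostring(root, encoding="unicode")


# Hypothetical in-memory database standing in for a real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (site TEXT, temp_c REAL)")
conn.execute("INSERT INTO samples VALUES ('A', 12.5)")

xml_text = table_to_xml(conn, "samples")
print(xml_text)
```

In keeping with the advice above, the exported XML would be stored alongside the original database file rather than replacing it.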

Data identifiers

You'll want to put your datasets where other people can access them and give them identifiers that can be referenced easily. Many repositories assign identifiers to the data you deposit.

Data identifiers must be globally unique and persistent. That is to say, they must not be repeated elsewhere and they must not change over time.

There are many different schemes:

  • PURL -- A PURL is a Persistent Uniform Resource Locator. Functionally, a PURL is a URL. However, instead of pointing directly to the location of an Internet resource, a PURL points to an intermediate resolution service. The PURL resolution service associates the PURL with the actual URL and returns that URL to the client. Caltech CODA provides Persistent URLs.
  • DOI -- A DOI (Digital Object Identifier) is a name (not a location) for an entity on digital networks. It provides a system for persistent and actionable identification and interoperable exchange of managed information on digital networks.
  • ACCESSION -- Accession numbers used by the National Center for Biotechnology Information (NCBI) are unique and citable.
  • InChI -- The IUPAC International Chemical Identifier (InChI) is a non-proprietary identifier for chemical substances that can be used in printed and electronic data sources thus enabling easier linking of diverse data compilations.
  • URI -- A Uniform Resource Identifier (URI) is a string of characters used to identify or name a resource on the Internet. Such identification enables interaction with representations of the resource over a network, typically the World Wide Web, using specific protocols.
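The indirection behind PURLs and DOIs can be sketched in a few lines. This is a toy model, not how any real resolution service is implemented. The class `PurlResolver`, the function `doi_to_url`, the paths, and the example.org URLs are all hypothetical. The one outside fact it relies on is that a DOI can be turned into a link by appending it to `https://doi.org/`; the DOI string shown is used only for illustration.

```python
from urllib.parse import quote


class PurlResolver:
    """Toy resolution service: maps a stable identifier to a current URL."""

    def __init__(self):
        self._targets = {}

    def register(self, purl_path: str, target_url: str) -> None:
        # Registering the same path again updates where it points,
        # while the identifier itself stays unchanged.
        self._targets[purl_path] = target_url

    def resolve(self, purl_path: str) -> str:
        return self._targets[purl_path]


def doi_to_url(doi: str) -> str:
    """Build a resolvable link from a DOI name via the doi.org proxy."""
    return "https://doi.org/" + quote(doi, safe="/")


resolver = PurlResolver()
resolver.register("/unh/dataset-1", "https://old.example.org/data/1.csv")

# The data moves to a new server; only the mapping changes.
resolver.register("/unh/dataset-1", "https://new.example.org/archive/1.csv")

print(resolver.resolve("/unh/dataset-1"))
print(doi_to_url("10.1000/182"))
```

The point of the sketch is that citations keep using the same identifier while the location it maps to can be updated, which is what makes the identifier persistent.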