Documentation

Registry of file formats

Policy Statement

YUDL requires immediate identification of the file format of each submitted file in order to help mitigate the risk posed by format obsolescence. To this end, YUDL uses DROID, JHOVE, file utility, Exiftool, PRONOM, NLNZ Metadata Extractor, ffident, and Tika through the FITS software package.

While YUDL is not dependent on or restricted to any particular format or group of formats, it aims to use well-known, widely accepted formats that support long-term preservation. If a submitter wants to use a specific format not meeting these criteria, an agreement must be reached between the submitter and YUDL.

Implementation Examples

YUDL uses FITS for format identification during ingest, when a file format is associated with each file.

Example characterization and reference to format registry:

<?xml version="1.0" encoding="UTF-8"?>
<fits xmlns="http://hul.harvard.edu/ois/xml/ns/fits/fits_output" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://hul.harvard.edu/ois/xml/ns/fits/fits_output http://hul.harvard.edu/ois/xml/xsd/fits/fits_output.xsd" version="0.7.4 (fits-mcgath fork)" timestamp="02/07/13 4:26 PM">
  <identification>
    <identity format="Tagged Image File Format" mimetype="image/tiff" toolname="FITS" toolversion="0.7.4 (fits-mcgath fork)">
      <tool toolname="Jhove" toolversion="1.9" />
      <tool toolname="file utility" toolversion="5.09" />
      <tool toolname="Exiftool" toolversion="9.13" />
      <tool toolname="NLNZ Metadata Extractor" toolversion="3.4GA" />
      <tool toolname="ffident" toolversion="0.2" />
      <tool toolname="Tika" toolversion="1.3" />
      <version toolname="Jhove" toolversion="1.9">5.0</version>
    </identity>
  </identification>
  <fileinfo>
    <size toolname="Jhove" toolversion="1.9">33543972</size>
    <creatingApplicationName toolname="Exiftool" toolversion="9.13">Adobe Photoshop Elements 2.0</creatingApplicationName>
    <lastmodified toolname="Exiftool" toolversion="9.13" status="CONFLICT">2007:03:09 11:00:49-05:00</lastmodified>
    <lastmodified toolname="Tika" toolversion="1.3" status="CONFLICT">2007-03-09T11:00:48</lastmodified>
    <filepath toolname="OIS File Information" toolversion="0.1" status="SINGLE_RESULT">/mnt/DIY/Archives/ASC/tiffs/02000-02999/ASC02000.tif</filepath>
    <filename toolname="OIS File Information" toolversion="0.1" status="SINGLE_RESULT">/mnt/DIY/Archives/ASC/tiffs/02000-02999/ASC02000.tif</filename>
    <md5checksum toolname="OIS File Information" toolversion="0.1" status="SINGLE_RESULT">b2b263bf5207481e42ac5945538ec985</md5checksum>
    <fslastmodified toolname="OIS File Information" toolversion="0.1" status="SINGLE_RESULT">1173456049000</fslastmodified>
  </fileinfo>
  <filestatus>
    <well-formed toolname="Jhove" toolversion="1.9" status="SINGLE_RESULT">true</well-formed>
    <valid toolname="Jhove" toolversion="1.9" status="SINGLE_RESULT">true</valid>
  </filestatus>
  <metadata>
    <image>
      <byteOrder toolname="Jhove" toolversion="1.9" status="SINGLE_RESULT">little endian</byteOrder>
      <compressionScheme toolname="Jhove" toolversion="1.9">Uncompressed</compressionScheme>
      <imageWidth toolname="Jhove" toolversion="1.9">7108</imageWidth>
      <imageHeight toolname="Exiftool" toolversion="9.13">4716</imageHeight>
      <colorSpace toolname="Jhove" toolversion="1.9">BlackIsZero</colorSpace>
      <orientation toolname="Jhove" toolversion="1.9" status="SINGLE_RESULT">normal*</orientation>
      <samplingFrequencyUnit toolname="Jhove" toolversion="1.9" status="CONFLICT">in.</samplingFrequencyUnit>
      <samplingFrequencyUnit toolname="Tika" toolversion="1.3" status="CONFLICT">Inch</samplingFrequencyUnit>
      <xSamplingFrequency toolname="Jhove" toolversion="1.9" status="CONFLICT">6000000/10000</xSamplingFrequency>
      <xSamplingFrequency toolname="Exiftool" toolversion="9.13" status="CONFLICT">600</xSamplingFrequency>
      <xSamplingFrequency toolname="NLNZ Metadata Extractor" toolversion="3.4GA" status="CONFLICT">600.0</xSamplingFrequency>
      <ySamplingFrequency toolname="Jhove" toolversion="1.9" status="CONFLICT">6000000/10000</ySamplingFrequency>
      <ySamplingFrequency toolname="Exiftool" toolversion="9.13" status="CONFLICT">600</ySamplingFrequency>
      <ySamplingFrequency toolname="NLNZ Metadata Extractor" toolversion="3.4GA" status="CONFLICT">600.0</ySamplingFrequency>
      <bitsPerSample toolname="Jhove" toolversion="1.9" status="CONFLICT">integer</bitsPerSample>
      <bitsPerSample toolname="Exiftool" toolversion="9.13" status="CONFLICT">8</bitsPerSample>
      <samplesPerPixel toolname="Jhove" toolversion="1.9">1</samplesPerPixel>
      <scanningSoftwareName toolname="Jhove" toolversion="1.9">Adobe Photoshop Elements 2.0</scanningSoftwareName>
      <YSamplingFrequency toolname="Tika" toolversion="1.3" status="SINGLE_RESULT">600.0</YSamplingFrequency>
    </image>
  </metadata>
</fits>
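
The FITS output above can be read programmatically. The following is a minimal sketch, not YUDL's actual ingest code: it assumes Python's standard `xml.etree.ElementTree` module, and the function name `summarize_fits` and the shortened `SAMPLE` document are hypothetical. It extracts the identified format and lists the elements on which the tools disagreed (`status="CONFLICT"`).

```python
import xml.etree.ElementTree as ET

# Namespace taken from the FITS output shown above.
FITS_NS = {"fits": "http://hul.harvard.edu/ois/xml/ns/fits/fits_output"}

# Shortened, hypothetical sample modelled on the example above.
SAMPLE = """<fits xmlns="http://hul.harvard.edu/ois/xml/ns/fits/fits_output">
  <identification>
    <identity format="Tagged Image File Format" mimetype="image/tiff"/>
  </identification>
  <metadata>
    <image>
      <bitsPerSample toolname="Jhove" status="CONFLICT">integer</bitsPerSample>
      <bitsPerSample toolname="Exiftool" status="CONFLICT">8</bitsPerSample>
      <imageWidth toolname="Jhove">7108</imageWidth>
    </image>
  </metadata>
</fits>"""


def summarize_fits(xml_text):
    """Return the identified format, MIME type, and names of conflicting elements."""
    root = ET.fromstring(xml_text)
    identity = root.find("fits:identification/fits:identity", FITS_NS)
    # Collect local element names (namespace stripped) flagged as CONFLICT.
    conflicts = sorted({
        element.tag.split("}", 1)[-1]
        for element in root.iter()
        if element.get("status") == "CONFLICT"
    })
    return {
        "format": identity.get("format") if identity is not None else None,
        "mimetype": identity.get("mimetype") if identity is not None else None,
        "conflicts": conflicts,
    }


if __name__ == "__main__":
    print(summarize_fits(SAMPLE))
```

A conflict list like this could be used to flag files whose technical metadata needs manual review before ingest is completed.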

Acknowledgements

Adapted from and inspired by:

License

CC0

CC0 1.0 Universal (CC0 1.0) Public Domain Dedication

Definition of DIP

Dissemination Information Package (DIP)

  • OAIS describes a DIP as "the Information Package, derived from a part, or all, of one or more AIPs, received by the Consumer in response to a request to the OAIS."
  • York University Digital Library's (YUDL) DIPs are always generated from a single AIP.
  • User access to archival objects is provided through the [YUDL website](http://digital.library.yorku.ca).
  • Depending on their level of access, the user may see basic object metadata and an access version of the digital object.
  • Context information is provided in the form of links to other items in a given collection.
  • The DIP is retrieved using the URI for the corresponding AIP. In turn, the AIP contains metadata tying it back to the SIP.

Definition of AIP

Archival Information Package (AIP)

  • The information package consisting of the Content Information (CI), Preservation Description Information (PDI), Packaging Information (PI), and Descriptive Information (DI) that is archived by York University Libraries (YUL).
  • The level of content in a York University Digital Library (YUDL) AIP can vary, depending on the amount of content provided by the submitter.
  • This description will use the OAIS Information Model to illustrate completeness of our conceptual model, and will describe, in general terms, what a YUDL AIP looks like.

Content Information (CI)

  • The Content Data Object (CDO) is generally stored with the primary preservation metadata file, which is held in Fedora Commons.
  • Representation Information is maintained, and contains information on the CDO's file format, version, and a reference to a format registry in order to provide information on how to interpret the file. See: registry of file formats

Preservation Description Information (PDI)

  • Reference Information - Identifiers are stored for each object identifying it globally (e.g. YUDL PID) and locally (e.g. URI).
  • Provenance Information - Provenance metadata is maintained for each object that provides a history of preservation events in the object's lifetime, beginning at ingest into the YUDL repository and referencing any preservation activities taken on the object (e.g., replacement due to corruption, format migration, etc.).
  • Context Information - As appropriate, information is maintained on how a CDO relates to other CDOs or to other conceptual entities. An example of such a relationship is a newer version of an object that supersedes an older one.
  • Fixity Information - Fixity information is generated at the time of ingest in order to later determine whether or not the item remains in the same state as when it was ingested. This information can be used to determine integrity of an object being copied within the system (as in the case of a change in storage location), or for periodic integrity checks.

Packaging Information (PI)

  • YUDL preservation metadata packages both the descriptive and preservation metadata together.

Descriptive Information (DI)

  • Depending on the type of CDO, the format of this descriptive metadata can vary (MODS or Dublin Core), but is selected to maximize findability. In all cases, the descriptive metadata will be recreated within the preservation metadata.

Definition of SIP

Submission Information Package (SIP)

  • The information package that is delivered to York University Digital Library for use in the construction of one or more AIPs.
  • The format of the SIP may vary from submitter to submitter, based on the submitter's willingness and ability to provide the content and metadata in a specific format.
  • For a given Content Type, any requirements or restrictions on the type of content that can be contained in the SIP will be described in that Content Type's Preservation Action Plan.

Backup Plan

1. Policy Statement

As part of York University Libraries' (YUL) implementation of the Bit-stream Copying Preservation Strategy (as detailed in the Preservation Implementation Plan), YUL is committed to regular backup procedures of both data storage areas and its operational areas (e.g. databases, application files). These backups are intended to serve as the basis for restoration of York University Digital Library (YUDL) materials in the case of disaster recovery or corruption of data.

Data backup at YUL is coordinated through Library Computing Systems and University Information Technology. Since the data is stored on physical hardware located in the YUL data centre, it uses the same backup hardware and software as the general university systems.

2. Implementation

2.1 Database Backup

This backup strategy applies to content that is stored in a database. Primarily, this refers to objects located in YUDL's databases (MySQL).

2.1.1 MySQL Database Backup Strategy

  • Database dumps are backed up daily and taken off site. Each backup is kept for 60 days.

2.2 Application Backup

2.2.1 Fedora Commons Backup Strategy

  • Fedora Commons is backed up daily and taken off site. Each backup is kept for 60 days.

2.2.2 Drupal (Islandora)

  • Drupal is backed up daily and taken off site. Each backup is kept for 60 days.

2.2.3 Solr

  • Solr is backed up daily and taken off site. Each backup is kept for 60 days.

2.3 Objects

  • Fedora objects are backed up daily and taken off site. Each backup is kept for 60 days.
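
The 60-day retention rule used throughout this section can be expressed as a small check. This is an illustrative sketch only: the function name `expired_backups` is hypothetical, and it assumes that a backup taken exactly 60 days ago is still within the retention window, which the plan does not state explicitly.

```python
from datetime import date, timedelta

RETENTION_DAYS = 60  # retention period stated in the plan


def expired_backups(backup_dates, today):
    """Return, oldest first, the backup dates that fall outside the retention window."""
    cutoff = today - timedelta(days=RETENTION_DAYS)
    # Assumption: a backup dated exactly on the cutoff is still retained.
    return sorted(d for d in backup_dates if d < cutoff)


if __name__ == "__main__":
    backups = [date(2013, 1, 1), date(2012, 12, 31), date(2012, 12, 30)]
    print(expired_backups(backups, today=date(2013, 3, 1)))
```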

2.4 Verification

  • Quarterly disaster recovery drills are coordinated between the Digital Assets Librarian and YUL Library Computing Systems to test and verify system recovery.

Designated Community Definition

Digital Preservation Designated Community Definition

The purpose of York University Libraries' digital preservation efforts is to ensure long-term access to a critical mass of materials of significance to our community. We will emphasize Canadian scholarship, knowledge produced by our primary user community, unique library and archival holdings from our collections, and born digital university records (see Collection Policy). Our aim is to mobilize research and knowledge in an environment that allows for ease of searching, browsing, retrieval, and reuse. With this work, we intend to make a significant and ongoing contribution to the global digital library.

York University Libraries' primary user community consists of York University:

  • Faculty
  • Staff
  • Students
  • Affiliated researchers

Secondary user communities include:

  • Global community of scholarship
  • Elementary and secondary schools
  • Journalists
  • Data consumers
  • General public
  • Nonhuman users

Digital Preservation Rights Policy

YUL Digital Preservation: Rights Policy

As outlined in the Preservation Implementation Plan, York University Library’s digital preservation strategies are based around the transformation of ingested content into file formats conducive to long-term preservation, in addition to standard bit-level preservation. In order to enable the full realization of this strategy, there is one right that needs to be acquired from the content providers:

Transformation

Content providers granting this right have consented to having their content migrated to new formats for the purposes of long-term preservation. In the interest of long-term preservation, it is important that York University Library be allowed to transform files in its care from formats deemed to be in danger of becoming obsolete into ones more suited to long-term interpretability. This is the right necessary to enable the “full” preservation level.

Without this right having been assigned, York University Library can only ensure file integrity, not the preservation of the intellectual contents of an archival unit.

Digital Preservation Implementation Plan

Preservation Activities

YUL’s preservation strategies are based around the preservation of the intellectual content of the digital objects contained in YUL’s digital repositories (YorkSpace, YUDL) through the transformation of these objects to delay or prevent file obsolescence. In the course of these transformations, priority is given to maintaining the information contained in an individual content object, as opposed to preserving its appearance or a specific presentation.

To this end, YUL utilizes the following approaches to preservation:

Archival File Formats: YUL is committed to the use of file formats that support long-term sustainability. In general, the considerations for selecting file formats include the “openness” of the file format, its level of support as a preservation format in the academic/scholarly community, its uptake among YUL’s Designated Community, and its suitability for later format migration.

Normalization: As mentioned above, YUL works to identify file formats well-suited to its approach to preservation and access. Upon ingest, materials not conforming to YUL’s accepted standards will be converted to one of the previously identified formats. To the extent possible, YUL will attempt to preserve the essential characteristics of the object. In cases requiring compromise, transformations that maintain the content of the object will be prioritized over those that preserve the presentation.

Format Migration: When YUL perceives that a portion of its content is stored in a format that is at risk of obsolescence, a new version of this content will be created in a format more suited to long-term preservation and use. This transformation may consist of migration to a newer version of the content’s existing format, or transformation to a different format altogether. In all cases, preservation of the object’s intellectual content will be prioritized over the preservation of a specific presentation style.

Bit Stream Copying: YUL maintains regularly scheduled backups of all information contained in YUL’s digital repositories, for use in the event of data loss. In combination with regular fixity checks, which identify potentially damaged content, this process ensures the integrity of content in YUL’s digital repositories, and provides a foundation for its disaster recovery plans.

Fixity Checking: All materials in the repository are subject to regular fixity checks - comparisons of checksum values calculated at a given point in time with those generated at the material’s time of ingest. This activity, when combined with bit stream copying, mitigates the risk of objects becoming corrupt in the repository, as it enables the repository managers to identify damaged or corrupted content, and to revert to a valid version of the object from a previous point in time.
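
A fixity check of this kind can be sketched as follows. This is a hedged illustration rather than YUL's implementation: it assumes MD5 as the checksum algorithm (FITS records an `md5checksum` value at characterization, but the algorithm used for ongoing fixity checks is not stated here), and the function names `md5_of` and `fixity_ok` are hypothetical.

```python
import hashlib


def md5_of(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # Read fixed-size chunks so large files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fixity_ok(path, checksum_at_ingest):
    """Return True if the file's current checksum matches the one recorded at ingest."""
    return md5_of(path) == checksum_at_ingest.lower()
```

A file for which `fixity_ok` returns `False` would be a candidate for restoration from a backup copy, as described under Bit Stream Copying.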

Documentation of File Formats: Upon ingest, every file in the repository is subject to identification of its file format and other significant characteristics. Also generated is a reference to the file format’s entry (if it exists) in PRONOM, The National Archives’ online format registry. This association ensures that information is always available on the internal structure of the file, and can be further used to determine when the format migration activity should take place (if allowed by the object’s preservation level) in order to mitigate the risks posed by obsolete file formats.

Preservation Levels

These preservation activities are applied to materials in the repository according to the material’s designated Preservation Level, as described in the Digital Preservation Strategic Plan.

Bit-level Preservation: Items preserved at the Bit-level Preservation level will be subject to Bit Stream Copying, Fixity Checking, and Documentation of File Formats preservation activities. This is a baseline level of preservation activity which ensures that an object, once ingested into the repository, can be maintained in a valid and uncorrupted state. It also attempts to provide representation information for the object through documentation of its file format, though at this level no migration activities will take place. This preservation level should be considered less robust than the “Full Preservation” level, and should only be considered in situations where Full Preservation is not a viable strategy. Common issues of unsuitability include lack of privilege to perform migration activities on the material, the presence of material in unknown or unsupported file formats, or the material’s failure to conform to a valid format.

Full Preservation: Items preserved at this level will receive the benefit of all of the above-mentioned preservation activities, as appropriate. Upon ingest into the repository, the material will undergo file format identification and normalization/transformation to archival file formats. As time goes on, these formats will be monitored by YUL staff, and should the criteria for format migration be met, the files will be migrated to a new format. In addition, all activities associated with the “Bit-level Preservation” preservation level will be carried out.

No Preservation: In rare cases, the repository may contain material for which YUL is unable or unwilling to accept preservation responsibility. Although incidental preservation activities may take place upon this material, YUL accepts no responsibility for its long-term accessibility or validity. This level is an exception to YUL’s digital preservation activities, rather than a part of its preservation strategy, but is included here for completeness. Examples of materials at this level would include objects that are known to be corrupt or not authorized for preservation at any level but which, for some reason, cannot be deleted immediately.
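
The relationship between preservation levels and preservation activities described above can be summarized as a lookup table. This is only a sketch of the text as written: the names `ACTIVITIES_BY_LEVEL`, `activities_for`, and `may_migrate` are hypothetical, and "No Preservation" is modelled as having no guaranteed activities even though incidental ones may occur.

```python
# Activities named for the Bit-level Preservation level.
BIT_LEVEL = (
    "Bit Stream Copying",
    "Fixity Checking",
    "Documentation of File Formats",
)

# Activities that apply only at the Full Preservation level.
FULL_ONLY = (
    "Archival File Formats",
    "Normalization",
    "Format Migration",
)

ACTIVITIES_BY_LEVEL = {
    "No Preservation": (),  # assumption: no guaranteed activities
    "Bit-level Preservation": BIT_LEVEL,
    "Full Preservation": FULL_ONLY + BIT_LEVEL,
}


def activities_for(level):
    """Return the preservation activities applied at the given level."""
    return ACTIVITIES_BY_LEVEL[level]


def may_migrate(level):
    """Return True if format migration is carried out at the given level."""
    return "Format Migration" in activities_for(level)
```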

Digital Preservation Strategic Plan

YUL Digital Preservation: Strategic Plan

Preservation is not a place into which content is put for safe-keeping, but rather, it is a process in which content evolves proactively and reactively through the application of strategy-embodying services.

The purpose of the York University Library Digital Preservation Plan is to outline the digital preservation strategy used by York University Library to ensure continued access to its digital collections by the Designated Community.

Objectives: The primary focus of YUL’s digital preservation activities is on preserving the intellectual content of the materials digitized by the library, materials deposited into YorkSpace, and born digital materials acquired by Clara Thomas Archives and Special Collections. This means that YUL will prioritize the preservation of the content of all materials ingested, as opposed to the look and feel of the document.

The following properties are those which will be prioritized in all preservation activities:

  • The intellectual content of the object in the repository. This will be defined on a collection-level, type-by-type basis and includes all supplemental materials and the relationship between these objects, as can be determined from metadata or other context at the time of ingest.
  • Metadata included with the object at the time of ingest, especially that which relates it to other objects within the repository, or to the universe of its collection type overall.
  • The intellectual rights to the object held by YUL and members of its designated community. While these properties are used to control access to the content and to determine its preservation level, they are also preserved themselves.

Secondary considerations in preservation include the following items. While not strictly a part of the intellectual content of the preservation object, these properties are necessary to ensure its preservation and as such must be tracked as well:

  • The object’s chain of custody, starting as early as possible but at the very least from the time it entered the repository. This information is necessary in order to understand the history of the object, and to denote any transformations or changes that have occurred to the content.
  • Information on the object’s representation. For every digital object, some level of interpretation is necessary in order to transform the object from binary data into a human interpretable item.
  • Fixity information. The repository will keep sufficient metadata on the object to ensure at any point in the future that the object remains in a complete and uncorrupted state.

The preservation of the above properties will be carried out using a transformative approach. That is, the formats (both at the file level and the metadata level) used in the repository will be constantly monitored (per the Environmental Monitoring of Preservation Formats policy) in order to ensure their suitability to long-term preservation. In instances where a format is deemed to present an unacceptable level of risk to the long-term viability of the content, an appropriate successor format will be chosen, with input from the Designated Community, and all materials in the existing format in question will be migrated over. Given YUL’s Strategic Plan (Steward York’s research assets), additional transformations may be made on the material in order to increase its findability. Such transformations will never be made in such a way as to endanger the long-term preservation of the material, and in situations where this would occur, the material so transformed will not be considered as part of the preservation plan.

Scope: YUL commits to preserving the materials for which it has accepted responsibility to the greatest degree possible. However, there are a number of criteria necessary to the repository’s ability to carry out this mission. In order to provide some level of preservation on materials for which not every criterion is met, YUL has defined multiple preservation levels, which indicate a level of preservation behaviours that YUL will use upon the content in question. For additional information on preservation levels, see the Preservation Implementation Plan.

The criteria to be assessed when determining preservation level include:

  • Rights: YUL should have appropriate rights to preserve the material in a manner consistent with its Preservation Strategies. At a minimum, YUL should have the right to locally load the content for archival purposes. In most cases, this should also include the ability to transform the content into new formats in the event that an existing one should become obsolete.
  • Appropriate metadata: Content to be ingested into the repository should be accompanied by metadata sufficient to provide a meaningful context to the content, as understood by the Designated Community. This can include information situating the content within its universe (e.g., keywords, bibliographic metadata) or information contributing to the object’s usability (e.g., dataset codebooks). The criteria for acceptability under this measure will be defined on a Content Type basis.
  • Validity: The content object must be a well-formed and valid instance of the type of object it purports to be.
  • Format Appropriateness: YUL will, for each Content Type, maintain a list of formats which will be deemed as acceptable for long-term preservation. This list will be based on the needs of the Designated Community, as well as the format’s future prospects for migration.

Compliance with all of these criteria is necessary for an object to be subject to the full extent of preservation activities, as defined in the Preservation Implementation Plan. Failure or partial compliance does not necessarily mean that an object cannot be ingested into the repository, but an inability to meet these criteria will result in the use of a less robust preservation plan.
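
The rule that all four criteria must be met for full preservation can be sketched as a simple check. The key names in `CRITERIA` and both function names are hypothetical, and the sketch deliberately does not choose among the less robust levels, since that decision is left to the Preservation Implementation Plan.

```python
# The four criteria listed above, as hypothetical dictionary keys.
CRITERIA = (
    "rights",
    "appropriate_metadata",
    "validity",
    "format_appropriateness",
)


def unmet_criteria(assessment):
    """Return the criteria that the assessment does not mark as satisfied."""
    # A criterion missing from the assessment is treated as unmet.
    return [name for name in CRITERIA if not assessment.get(name, False)]


def qualifies_for_full_preservation(assessment):
    """Return True only when every criterion is satisfied."""
    return not unmet_criteria(assessment)
```

For example, an object with valid content and adequate metadata but no transformation rights would report `["rights"]` as unmet and would therefore be assigned a less robust level.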

Review and Revision: YUL commits to the review and revision of its preservation practices and the corresponding documentation, as outlined in the Review Cycle for Documentation Policy.
