Open Digital Preservation Glossary

Active Management
The continuous monitoring of digital objects for changes in their bitstream. The monitoring of a file on a file system for bit rot.
Ada Lovelace
Mathematician known for what is widely seen as the first computer program, written for Charles Babbage’s Analytical Engine. Ada immediately recognised the potential of computers and their applications beyond calculation.
Adobe
Not strictly an organisation involved in digital preservation per se, but their involvement in the standards for TIFF, XML, and PDF means that their name will come up often enough. Adobe Systems Incorporated are based in San Jose, California, and are also responsible for software such as Adobe Photoshop.
Agency
A governmental department that transfers records to a national/government archive.
Alan Turing
Mathematician and computer scientist. Formalised the fundamentals of computer science, and developed the rules (the Turing test) by which computers can be assessed as being artificially ‘intelligent’.
Algorithm
A formal, or codified, description of a set of rules for determining an outcome from one or more inputs. Any set of rules combined to generate an outcome can be described as an algorithm, for example, the set of rules for baking a certain type of cake. An algorithm could be created to sort a set of numbers as efficiently as possible. Algorithms are utilized in many of today’s online services, for example, YouTube, to determine the type of content that may be most interesting to its viewers.
Andrew W. Mellon Foundation
The Andrew W. Mellon Foundation endeavours to strengthen, promote, and, where necessary, defend the contributions of the humanities and the arts to human flourishing and to the well-being of diverse and democratic societies. To this end, it supports exemplary institutions of higher education and culture as they renew and provide access to an invaluable heritage of ambitious, path-breaking work. The Foundation makes grants in five core program areas: Higher Education and Scholarship in the Humanities; Arts and Cultural Heritage; Diversity; Scholarly Communications; and International Higher Education and Strategic Projects.
Apache Software Foundation
Provides support for the Apache community of open-source software projects, which provide software products for the public good. The projects are characterised by collaborative, consensus-based processes, an open, pragmatic software license, and a desire to create high-quality software. The foundation is not simply a group of projects sharing a server, but rather a community of developers and users.
Apache Tika
A tool maintained by the Apache Software Foundation capable of extracting metadata and content from a range of file formats including PDF, Microsoft Office, Rich Text Format, and XML.
API
Application Programming Interface. A description of a software library or web service and how users and software agents are expected to interact with it and retrieve, or contribute data.
Appraisal
Appraisal is the analysis of an organisation’s business context, business activities and risks. This will determine what information and records need to be created, what are of high risk/high value, and how long they need to be managed to meet business and community needs and expectations.
ARANZ
Archives and Records Association of New Zealand. An incorporated society, established in 1976, with the aim of promoting the understanding and importance of records and archives in New Zealand. ARANZ is administered nationally by a Council of elected members. Branches are established in Auckland, Otago/Southland, Central Districts, Waikato/Bay of Plenty, and Wellington.
Archival Silences
Voices, and memories, of individuals, groups, or communities, and their activities – their marginalization, that don’t appear in the archival/documentary record due to the biases, unwittingly or otherwise, of the collecting institution; the nature of custodial archives; and/or an inability to access, and make available, the content of the record in meaningful ways.
Archive Management System
An archive management system commonly wraps functions that are not part of a digital preservation system. An archive management system enables the description of digital records and the provision of context relevant to one or more archival models. A connection must exist between the archive management system and a digital preservation system to allow digital records to be retrieved by their catalogue references.
Archivematica
Open source digital preservation system maintained by Artefactual. Younger than Preservica and Rosetta, Archivematica has a growing user-base, and a different support model to the two mentioned.
Arkivum
A UK based data archiving solution with a high level of reported integrity. Arkivum offer digital preservation guidance and have collaborated with systems such as Archivematica to provide digital preservation solutions.
Artefactual
Artefactual Systems develops free and open software made available under the AGPLv3 open-source software license. The software and user community is supported through release management, public technical and user documentation, and community forum support. To pay the bills, and to continue to develop and update the software, paid services are also offered. Artefactual are responsible for the Archivematica (digital preservation) and Access to Memory (archival description) systems.
ASA
Australian Society of Archivists. The peak professional body for archivists in Australia. It was formed in response to the growth of archival, record keeping and heritage preservation services in Australia, and the increasing demand for archival and record keeping skills in community organisations, corporate entities and government. ASA is responsible for the journal Archives and Manuscripts.
ASCII
American Standard Code for Information Interchange is the mapping of computer control signals and Latin alphanumeric characters to the 128 numbers (0 to 127) that can be represented using seven bits; each character is conventionally stored in a single byte. ASCII is heavily biased toward western writing systems and as such Unicode was created to make it easier to work with other writing systems in a computing environment.
Assessment (AV Preserve definition)
An assessment looks at how a system is succeeding and where there are gaps. The outcome of an assessment is feedback for growth; identification of gaps in practice; and a spotlight on strengths. Assessments completed regularly may lead to a successful pass on an audit (toward Trusted Digital Repository status) in future.
Atomicity (Databases)
A group of operations that occur together in a database and are committed (saved) in one group as a single transaction. If one of the operations fails, the database record is not written and the database is reverted back to its state before the interaction began.
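As a minimal sketch of atomicity, the following uses Python's built-in `sqlite3` module with an in-memory database. The table name `records` and the helper `add_records` are made up for illustration; the point is only that a failed operation in the group leaves nothing from that group behind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE records (id INTEGER PRIMARY KEY, title TEXT NOT NULL)")
conn.commit()


def add_records(connection, titles):
    """Insert all titles as one transaction; return True only if all succeed."""
    try:
        # Using the connection as a context manager commits on success
        # and rolls back if an exception is raised inside the block.
        with connection:
            for title in titles:
                connection.execute(
                    "INSERT INTO records (title) VALUES (?)", (title,)
                )
        return True
    except sqlite3.IntegrityError:
        return False


ok = add_records(conn, ["Minutes 1976", "Annual report"])
# The second title violates NOT NULL, so "Letter" is rolled back as well.
failed = add_records(conn, ["Letter", None])
count = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
```

After both calls, only the two rows from the first, successful group remain in the table.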
Audit (AV Preserve)
A review of a system to ensure compliance. The outcome of an audit is binary – either you pass or fail.
Audit Trail
Related to the concept of provenance, an audit trail describes the agents and actors that have accessed a digital resource and the processes (read, write, transform) that have been applied to that resource.
Authenticity (UNESCO, 2003, definition)
Quality of genuineness and trustworthiness of some digital materials, as being what they purport to be, either as an original object or as a reliable copy derived by fully documented processes from an original.
Authenticity and Integrity
Checksums can prove data hasn’t changed, which can help us to prove a record’s authenticity and integrity from the point of transfer. In UNESCO Memory of the World terms, integrity is the quality of being ‘uncorrupted and free of unauthorized and undocumented changes’ (UNESCO 2003).
Automation
Checksums are, for practical purposes, unique to a data stream and thus can serve as fixed-length identifiers for files. We can keep track of our files through various automated workflows through the use of checksums.
AV Preserve
AV Preserve is a data management consulting and software development firm focused on leveraging a deep understanding of technology, information, business, and people to advance the ways in which data is used for the benefit of individuals, organizations, and causes. The AV Preserve team consists of internationally recognized experts with years of experience working with academic, media and entertainment, government, museum, broadcast, and corporate organizations. AV Preserve tools include Fixity, for continuous monitoring of digital checksums, and MediaSCORE, a media preservation prioritization application. AV Preserve also perform assessments of digital repositories to help them gauge how close they are to Trusted Digital Repository status, ISO 16363.
AV Preserve Fixity
AV Preserve Fixity is a software agent for scheduling the scanning and checking of checksums for a given directory or directories of files. If a comparison fails, that is, a file that is expected to match its recorded checksum doesn’t, then an email is sent alerting users to the error, enabling them to initiate procedures to restore the original data from backups. The tool is maintained by AV Preserve.
AV Preserve ISO 16363 Assessment
An assessment which determines how close a digital archive is to passing the benchmark for a Trusted Digital Repository (TDR). AV Preserve are one of the companies offering this service.
BagIt
BagIt is a hierarchical file packaging format designed to support disk-based storage and network transfer of arbitrary digital content. A "bag" consists of a "payload" (the arbitrary content) and "tags", which are metadata files intended to document the storage and transfer of the bag. A required tag file contains a manifest listing every file in the payload together with its corresponding checksum. A bag can potentially be used as a SIP (Submission Information Package) in a digital preservation workflow.
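The payload-plus-manifest layout can be sketched with the Python standard library alone. This is an illustration rather than a complete or validated implementation of the specification; the function name `make_minimal_bag` and the example file are hypothetical, and production workflows would normally use an established BagIt tool instead.

```python
import hashlib
import tempfile
from pathlib import Path


def make_minimal_bag(bag_dir: Path, payload: dict[str, bytes]) -> Path:
    """Write payload files under data/ plus the two required tag files."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True)
    manifest_lines = []
    for name, content in sorted(payload.items()):
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        # Each manifest line pairs a checksum with a payload path.
        manifest_lines.append(f"{digest}  data/{name}")
    # Bag declaration: version and tag file encoding.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    manifest = bag_dir / "manifest-sha256.txt"
    manifest.write_text("\n".join(manifest_lines) + "\n", encoding="utf-8")
    return manifest


with tempfile.TemporaryDirectory() as tmp:
    manifest_path = make_minimal_bag(
        Path(tmp) / "example_bag", {"letter.txt": b"Dear archivist"}
    )
    manifest_text = manifest_path.read_text(encoding="utf-8")
```

A receiving system can recompute each payload checksum and compare it with the manifest to confirm the bag arrived intact.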
BAVC
Bay Area Video Coalition. BAVC is a community hub and resource for media makers in the Bay Area and across the country, serving over 7,500 freelancers, film-makers, job-seekers, activists, and artists every year. BAVC provides access to media making technology, storytelling workshops, a diverse and engaged community of makers and producers, services and resources. BAVC advocates for those whose stories aren't being told, and provides the resources for anyone to create, share, and amplify their stories and those of their communities. BAVC's diverse, innovative programs lead the field in media training for youth and educators, technology and multimedia focused workforce development, visually-driven new media storytelling and audio-visual preservation. One of the nation’s longest-standing non-profit video and audio preservation organizations, BAVC is a leader in the field, developing the highest quality preservation standards and practices while working with individuals and cultural, academic, and media organizations to meet a range of needs for preserving historically and artistically important video and audio materials.
Big Data
A popular definition of big data is data that requires a certain amount of computational power to be able to ask research questions of it and draw sensible conclusions from it. A working definition of 'big data' seems to be a range from gigabytes (querying portions of your collection for keywords, for example), to the generation of a gigabyte of data a second (the amount of data produced by the Large Hadron Collider in Switzerland).
Binary (Base 2)
A number system of two digits, zero, and one through which all numbers can be represented. In computer systems binary numbers are collected into groups of 8-bits called a byte. In computer electronics binary can be created through signals that are either on, or off.
Bit
A single binary signal – on or off – represented as a one or a zero.
Bit-level preservation
The process of correcting a digital file that has suffered bit-rot, identified through the process of active management.
Bitcurator
The BitCurator initiative develops and supports open source digital forensics tools for use in libraries, archives, and museums. Their projects are: The BitCurator Access project, which focuses on technologies that simplify access to raw and forensically-packaged disk images, allowing collecting institutions to provide access environments that reflect as closely as possible the original order and environmental context of these materials. The BitCurator project, which was a joint effort led by the School of Information and Library Science at the University of North Carolina, Chapel Hill (SILS) and the Maryland Institute for Technology in the Humanities (MITH) to develop a system for collecting professionals that incorporates the functionality of many digital forensics tools. The project was originally (2011-2014) funded by the Andrew W. Mellon Foundation. Community support and ongoing software development are managed by the BitCurator Consortium. The BitCurator Consortium, supported by Educopia, provides ongoing support for BitCurator software products and the BitCurator community.
Bitrot
The loss of data through degradation of a carrier medium, e.g. the loss of magnetisation on a floppy disk leading to the file allocation table (file directory) becoming unreadable.
Bitstream
A contiguous stream of bytes that requires further interpretation by a user or computer.
BOF
Beginning of File (BOF). A file will often have a magic signature in its very first few bytes and so we’ll often be looking at beginning of file sequences.
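A simple sketch of looking at a beginning-of-file sequence: the function below compares the first bytes of some data against two well-known magic numbers (PDF and PNG). The table `MAGIC_NUMBERS` and the function `identify_by_bof` are illustrative names only; real identification tools use far richer signatures than this.

```python
# Two well-known beginning-of-file sequences.
MAGIC_NUMBERS = {
    b"%PDF-": "PDF",
    b"\x89PNG\r\n\x1a\n": "PNG",
}


def identify_by_bof(data: bytes) -> str:
    """Return a format name if the data starts with a known magic number."""
    for magic, name in MAGIC_NUMBERS.items():
        if data.startswith(magic):
            return name
    return "unknown"


pdf_guess = identify_by_bof(b"%PDF-1.7\n...")
png_guess = identify_by_bof(b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR")
other_guess = identify_by_bof(b"plain text with no signature")
```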
Brunnhilde
Brunnhilde is a reporting companion tool for Siegfried created by Tim Walsh. Brunnhilde is part of BitCurator implementations and also integrates reports from sources such as ClamAV virus checker.
Byte
An encoding of 8 binary signals – eight ones or zeros – represented as a single integer which then needs further interpreting by a computer, or user, by looking up an appropriate encoding scheme.
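To make the step from eight binary signals to an integer, and then to a character, concrete, here is a small sketch. The choice of ASCII as the encoding scheme used for the lookup is an assumption made for the example.

```python
bits = "01000001"                          # eight binary signals
value = int(bits, 2)                       # interpreted as a single integer
as_ascii = bytes([value]).decode("ascii")  # looked up in an encoding scheme
back_to_bits = format(value, "08b")        # and converted back again
```

Here the byte has the integer value 65, which ASCII maps to the letter "A".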
Cardigan
Knitwear. A preferred sartorial choice made by the discerning archivist.
Character Encoding
The mapping of binary numbers to a lexical or numerical character. Numerous character encodings exist including ASCII, and EBCDIC. The widest range of characters can be represented using a standard called Unicode.
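The same character maps to different numbers under different encodings. The sketch below uses Python's built-in codecs; `cp037` is one of Python's EBCDIC code pages and is chosen here purely as an example of an EBCDIC variant.

```python
letter = "A"
ascii_byte = letter.encode("ascii")    # ASCII value for "A"
ebcdic_byte = letter.encode("cp037")   # an EBCDIC code page value for "A"

word = "Māori"
utf8_bytes = word.encode("utf-8")      # Unicode (UTF-8) handles the macron
try:
    word.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    # ASCII has no mapping for "ā", so the word cannot be encoded.
    ascii_ok = False
```

The letter "A" is 0x41 in ASCII but 0xC1 in this EBCDIC code page, which is why a decoder must know which encoding was used.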
Characterization
Characterization is the process whereby metadata crucial to the preservation of the digital object is recorded. This information may describe the object itself or part of its technical environment.
Checksum
A string generated by a hash algorithm/hash function that can allow us to determine changes to a stream of data, i.e. by comparing the result of a hash algorithm after data transfer to one we generated before data transfer. In digital preservation we tend to use the term checksum interchangeably with the word hash – the fixed length string generated by something called a cryptographic hash function (MD5, SHA1, SHA256, etc.). Checksum may also refer to the process of comparing two checksum values – checking the sum – for changes in the data stream. A checksum will usually be made up of hexadecimal characters 0-9 and A-F, e.g. d41d8cd98f00b204e9800998ecf8427e
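A sketch of the before-and-after comparison described in this entry, using Python's `hashlib`. The helper `file_checksum`, the file names, and the use of a local copy to stand in for a "transfer" are all assumptions made for illustration.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def file_checksum(path, algorithm="sha256", chunk_size=65536):
    """Hash a file in chunks so large files need not fit in memory."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source.bin"
    source.write_bytes(b"record content" * 1000)
    before = file_checksum(source)          # generated before transfer

    copied = Path(tmp) / "transferred.bin"
    shutil.copyfile(source, copied)         # stand-in for a real transfer
    after = file_checksum(copied)           # generated after transfer

    transfer_ok = before == after
```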
Checksums are just large numbers
Checksums are just really big numbers. Computers are really good at working with numbers, which is why checksums suit automated processes and comparisons. If we convert the hexadecimal value d41d8cd98f00b204e9800998ecf8427e to a decimal number in Google we get 2.8194977e+38.
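The same conversion can be done without a search engine. This short sketch treats the hexadecimal digest as a base-16 integer; the variable names are illustrative.

```python
hex_digest = "d41d8cd98f00b204e9800998ecf8427e"

as_integer = int(hex_digest, 16)        # the checksum as one large number
scientific = f"{as_integer:.7e}"        # the same number in scientific notation
round_trip = format(as_integer, "032x") # and back to 32 hexadecimal characters
```

Because the digest is just an integer, comparing two checksums is a single numeric (or string) equality test.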
Checksums vs. Fixity
If a checksum should fail for any reason, archivists also have the concept of fixity: the concept of ‘remaining fixed in state’. We can observe file date ranges, e.g. modified and creation date. We can also look at the content and clues in the content for features that help us to prove a digital file is what it purports to be. There is only one Domesday Book – we have many ways of proving this is what it is without a checksum value per se.
Claude Shannon
Mathematician responsible for the creation of the field of studies known as Information Theory; the study of the quantification, storage, and communication of information.
CMS/ECMS
Content Management System/Enterprise Content Management System are contemporary methods of maintaining digital records in an organisation. Systems manage storage and retrieval of records across the organisation for all users. Systems will implement retention and disposal schedules, as well as wrapping records in suitable record keeping management and discovery metadata.
Collection Policy
A statement that describes the collecting principles of the archival institution which may include: legal obligations to collect; jurisdiction; geographical areas; chronological period; media-type; and collection methodology, plus other logical grouping of content relevant to the policy.
Collisions
A collision happens when two different data streams result in the same checksum value. This is a big concern when a checksum is used for security purposes (e.g. in password applications). A collision is computationally difficult to engineer but not impossible. Collisions could of course be accidental. An engineered collision for SHA1, published in 2017, took knowledge of the algorithm, plus 9,223,372,036,854,775,808 SHA-1 computations, 6,500 years of CPU (Central Processing Unit) time, and 110 years of GPU (Graphics Processing Unit) time, to create. Collisions are not a huge concern in digital preservation because multiple checksums may often be created for a single file to avoid such a situation. Archivists also have the concept of fixity. Collisions are a bigger concern when workflows rely on just a single checksum to align large amounts of data.
Compression (General)
The use of redundancy (low entropy) in a digital object to enable it to be re-encoded such that the resulting bitstream is smaller than the original, while the original file, or an approximation of the original file, can still be presented back to the user. A file that has been compressed must be uncompressed to be rendered or used.
Compression (Lossless)
A method of encoding data so that the resulting bitstream is smaller than the original, e.g. for transmission or storage, but when the data is uncompressed it is exactly the same as the original byte-for-byte.
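A quick way to see the byte-for-byte property is a round trip through a lossless compressor. The sketch below uses Python's `zlib` module; the repetitive sample data is contrived so that the size reduction is obvious.

```python
import zlib

original = b"archive " * 500            # highly repetitive sample data
compressed = zlib.compress(original)    # smaller bitstream for storage or transfer
restored = zlib.decompress(compressed)  # uncompress before use

is_identical = restored == original
is_smaller = len(compressed) < len(original)
```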
Compression (Lossy)
Lossy compression is a term usually applied to files that have been transformed into something smaller through the removal of information, but which can be replayed back to the user in a way that is approximately the same. The MP3 algorithm ‘compresses’ audio streams by removing high-frequency signals that, theoretically, human beings cannot hear, transforming the signal, and then re-encoding it. The loss of high-frequency signals equates to a loss of information, and is therefore lossy. Should a user attempt to then recompress a lossy file, the file will be compressed even further resulting in even more information loss – think photocopy of a photocopy. N.B. it is a myth that simply opening a lossy file, e.g. JPG, can make it lose even more information. The user must actively choose to resave the file, and even then, choose lossy options when doing so.
Container Signatures
Container signatures require the tool to first uncompress the file. Container signatures only exist for OLE2 type files (Microsoft family, plus a few others), and ZIP type files (Microsoft family, Open Office, plus a few others). First a trigger is discovered, and that trigger maps to a set of rules for identification in the container signature which may include the specification of files or folders that must exist, and optionally magic number byte sequences inside specific files.
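The idea of "open the container, then check which files exist inside" can be sketched for a ZIP-based format as follows. The rule used here (a ZIP that contains both `[Content_Types].xml` and `word/document.xml` looks like a Word DOCX file) is a simplified illustration, not an actual PRONOM container signature, and the helper names `looks_like_docx` and `build_zip` are hypothetical.

```python
import io
import zipfile


def looks_like_docx(data: bytes) -> bool:
    """Trigger on the ZIP magic number, then check for required member files."""
    if not data.startswith(b"PK\x03\x04"):   # the trigger: a ZIP-type file
        return False
    with zipfile.ZipFile(io.BytesIO(data)) as container:
        names = set(container.namelist())
    return "[Content_Types].xml" in names and "word/document.xml" in names


def build_zip(members: dict[str, str]) -> bytes:
    """Build a small ZIP in memory for demonstration purposes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as container:
        for name, content in members.items():
            container.writestr(name, content)
    return buffer.getvalue()


docx_like = build_zip(
    {"[Content_Types].xml": "<Types/>", "word/document.xml": "<document/>"}
)
plain_zip = build_zip({"readme.txt": "hello"})

docx_result = looks_like_docx(docx_like)
zip_result = looks_like_docx(plain_zip)
text_result = looks_like_docx(b"not a container at all")
```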
Continuum Model
The recognition that records are multi-dimensional and develop and gain new context and value across a continuous period of time (a continuum) across a number of organizations and activities as they are moved about and are reused. Records do not simply have a beginning and end of life with depreciating value, as may be the case in a life-cycle approach to records and information management.
COPTR
A technical registry that describes tools useful for long term digital preservation. Acts primarily as a finding and evaluation tool to help practitioners find the tools they need to preserve digital data. COPTR collates this knowledge in one place instead of organisations competing against each other with their own registries.
Create Maintain
The principle in records and information management that asks that records are created with a long-term view to their maintenance with consideration given to the record keeping obligations of the organization. Organizations will consider application of disposal classes at point of creation, as well as considering other legislative requirements, obligations to transparency and stakeholders, amongst other dimensions of the records and information management continuum.
Cryptographic Hash Function
A cryptographic hash function is a one way function such that the original data cannot be determined from the hash value itself – it is infeasible to invert the function. Cryptographic hashes are considered quick. The cryptographic hash functions employed in digital preservation have wide application as well and so are considerably well tested and there are many tools that can support their use in our workflows.
d41d8cd98f00b204e9800998ecf8427e
The MD5 checksum of a zero-byte file. Other hash algorithms also generate a fixed value from a zero-byte file. MD5: `d41d8cd98f00b204e9800998ecf8427e`; SHA1: `da39a3ee5e6b4b0d3255bfef95601890afd80709`; SHA256: `e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855`.
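These values are easy to reproduce by hashing empty input. The sketch below uses Python's `hashlib`; the dictionary name `empty_digests` is illustrative.

```python
import hashlib

# Hash zero bytes of input with each algorithm.
empty_digests = {
    name: hashlib.new(name, b"").hexdigest()
    for name in ("md5", "sha1", "sha256")
}
```

Spotting one of these values in a manifest is a useful hint that a file is empty, for example after a failed transfer.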
Dark Archive (for access)
A gamut of archival or preservation systems can be used to give users access to metadata and catalogue entries about records, but there is zero public access to the records themselves. A dark archive configuration like this may be used because information content demands it to be protected in such a way.
Dark Archive (for storage)
The use of offline, or near-line storage for digital archives. Little or no immediate access is provided. Files are preserved through bit-level preservation unless a concerted intervention is made by the archival institution.
Data
Signals that can be transmitted and can potentially be interpreted by a computer or a human. Data is commonly a stream of binary information that needs further processing to be understood.
Data Encoding
A method of structuring information in a way that can be processed further by user, or computer, e.g. Extensible Markup Language, JavaScript Object Notation (JSON) and Comma Separated Values (CSV). File formats such as Microsoft Excel, Microsoft Word, or JPEG are also data encodings, albeit quite a bit more complex.
Data Obfuscation
When information is hidden or modified in a way that cannot be easily read then it is said to be obfuscated. Redaction, password protection, and encryption are three such methods of obfuscation. The latter two pose risk to successful digital preservation as they impact the ability to read the information in a record. Encryption impacts the ability to read the binary content of a file entirely unless an encryption key, and a known algorithm is available to decrypt the file's contents.
Data vs. Filename
Checksums are calculated on the data inside a file. If a filename changes, the checksum of the file is still the same because the data inside hasn’t been changed. If a file is copied and given another filename, the checksums of the two files will be identical. Checksums only operate on the data inside the file.
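A short sketch of this behaviour: a file is hashed, renamed, and hashed again. The file names and content are invented for the example.

```python
import hashlib
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    first = Path(tmp) / "minutes_1976.txt"
    first.write_bytes(b"Council minutes")
    original_sum = hashlib.sha256(first.read_bytes()).hexdigest()

    # Rename the file; the bytes inside are untouched.
    renamed = first.rename(Path(tmp) / "renamed.txt")
    renamed_sum = hashlib.sha256(renamed.read_bytes()).hexdigest()

    same_checksum = original_sum == renamed_sum
```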
De-duplication
Because a data input will always output the same checksum value, checksums are great for de-duplication, that is, removal of duplicate files with the same information. In an archival context this may be more complicated where a duplicate record has multiple contexts. In some storage systems, checksums can be used to store no more than one copy of an object that can then be referenced from multiple contexts.
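One way to find candidate duplicates is to group files by checksum. The sketch below works on an in-memory mapping of made-up paths to content; the function `group_by_checksum` is a hypothetical helper, and whether a duplicate should actually be removed remains an archival decision.

```python
import hashlib
from collections import defaultdict


def group_by_checksum(files: dict[str, bytes]) -> dict[str, list[str]]:
    """Map each checksum to the list of file names that share it."""
    groups = defaultdict(list)
    for name, content in files.items():
        groups[hashlib.sha256(content).hexdigest()].append(name)
    return dict(groups)


files = {
    "a/report.pdf": b"same bytes",
    "b/copy of report.pdf": b"same bytes",
    "c/other.pdf": b"different bytes",
}

# Any checksum shared by more than one name marks a set of duplicates.
duplicates = [
    names for names in group_by_checksum(files).values() if len(names) > 1
]
```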
Dependencies
The components of our computer architectures that make it possible to achieve an outcome or result. For example, to run Microsoft Word, a dependency may be Microsoft Windows. Configured in Microsoft Windows may be a number of software libraries that enable it to interact with your computer’s hardware configuration. As we dissect a file, or piece of software, we begin to understand what other technology it depends on to be able to run.
Deterministic but Unpredictable
Cryptographic hashes are deterministic, meaning that for a given piece of data the same output, that is, the same checksum value, will always be generated. Output is, however, unpredictable between inputs, meaning that similar (not the same) input results in a radically different looking checksum value, so the original data cannot be predicted.
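Both properties can be observed directly. In this sketch the same input is hashed twice, and then an input differing by a single character is hashed for comparison; the sample strings are arbitrary.

```python
import hashlib

first = hashlib.sha256(b"digital preservation").hexdigest()
again = hashlib.sha256(b"digital preservation").hexdigest()
similar = hashlib.sha256(b"digital preservatioN").hexdigest()

is_deterministic = first == again
# Count how many of the 64 hexadecimal positions differ between the two digests.
differing_positions = sum(a != b for a, b in zip(first, similar))
```

Although the inputs differ by one character, most positions in the two digests differ.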
dfir.training
A registry of digital forensics tools and training courses developed in 2016 that will prove useful for finding tools for dissecting and interpreting digital files for preservation and access.
Digest
A fixed length string. The output of a hash function.
Digital Archive (DIN (German Institute for Standardization) Definition)
An organisation (consisting of people and technical systems) which has assumed responsibility for the long-term preservation and long-term availability of digital data and its provision for a specified designated community.
Digital Continuity
The idea that preservation of digital records happens in situ in the organisation creating the records. Digital continuity focuses on the value of creating and maintaining the digital records in your organization. Principles of digital preservation apply to this material but the custodianship of an archival institution is not needed to apply them.
Digital Humanities
The use of digital techniques to support the scholarly study of the humanities (Literature, Archaeology, Architecture etc.).
Digital POWRR
Preserving digital Objects with Restricted Resources. From 2012-2014, the Digital POWRR Project, an Institute of Museum and Library Services (IMLS)-funded study, investigated, evaluated, and recommended scalable, sustainable digital preservation solutions for libraries with smaller amounts of data and/or fewer resources. The project realized that many information professionals felt overwhelmed by the scope of the problem. Team members created a workshop curriculum based on the findings of the study and it has since been delivered to many institutions across the States. POWRR created the digital preservation tool grid, now maintained in collaboration with the COPTR tool registry.
Digital Preservation (AV Preserve definition)
Digital preservation is a function of digital curation, in which digital content is prepared and actively managed for long-term access. Digital content requires constant, active management. At the most basic level, this includes managing multiple copies in different geographic locations, and ongoing and consistent comparison of the same files in multiple locations to ensure that no changes have occurred to them (this is called fixity checking). It also involves performing healing procedures when files no longer match up, and maintaining audit logs from the time of ingest into the archival system that track all activities, like access and changes to the files over time.
Digital Preservation (Library of Congress definition)
Digital preservation is the active management of digital content over time to ensure ongoing access.
Digital Preservation System
A system, or set of systems and tools, that enable digital preservation. A system may be composed of components for ingest, storage, preservation management, and access, as well as other functions. Industry examples include RODA, Archivematica, Preservica, and Rosetta.
DIN
Deutsches Institut für Normung (German Institute for Standardization). Responsible, for example, for DIN 31644 'Criteria for trustworthy digital archives'.
DIN 31644
EN: Information and documentation - Criteria for trustworthy digital archives DE: Information und Dokumentation - Kriterien für vertrauenswürdige digitale Langzeitarchive
Disposal
A means of controlling records at an organization through: retention, destruction, transfer, and in some cases, the sale of records to another organization, providing there is authorization to do so. An example disposal action may be transfer from a government agency to a government archive.
Disposal Authority
Description of appraised classes of records and the disposal action that is supposed to be applied to them and after what time. A disposal authority is signed by a suitable jurisdictional authority.
Donor
An individual or organisation that donates records to an archival institution or library.
DPLA
Digital Public Library of America connects people to the riches held within America’s libraries, archives, museums, and other cultural heritage institutions. All of the materials found through DPLA (photographs, books, maps, news footage, oral histories, personal letters, museum objects, artwork, government documents, etc.) are free and immediately available in digital format. The cultural institutions participating in DPLA represent the richness and diversity of America, from the smallest local history museum to the nation’s largest cultural institutions. The DPLA’s core work includes bringing new collections and partners into DPLA, building technology, and managing projects that further the mission through curation, education, and community building.
DROID
DROID was the first client tool to make use of PRONOM signatures. DROID stands for Digital Record and Object Identification. It uses PRONOM signatures to return a unique identifier for files that contain binary patterns matching those described by PRONOM. DROID can be used via GUI, command line, or API, i.e. programmatically.
DROID Signature File
A DROID signature file is an XML file that contains a snapshot of PRONOM in its current state. Split into two, or three sections (for container signatures), the signature file’s two main components are a list of file formats and metadata, e.g. format MIMEType, and then a mapping to a list of signatures. A container signature file contains a third section of ‘trigger PUIDs’, that is, PUIDs that trigger container identification when a match is found.
DROID-list Google Group
An open community that is a good first place to start for discussing new file format signatures for PRONOM. Being open, folks are invited to contribute to others’ identification issues. Signatures can be shared, and the workload in fixing them shared too. PRONOM development is aided when there is as much information as possible about a file format and its potential signature; otherwise all of this work would have to be done by PRONOM’s own developers.
droidsfmin
A tool by Martin Hoppenheit to reduce the number of signatures in the DROID signature file, e.g. for the purpose of quicker identification in image format only digitization workflows.
Dublin Core
The Dublin Core Metadata Element Set is a vocabulary of fifteen properties for use in resource description. The name "Dublin" is due to its origin at a 1995 invitational workshop in Dublin, Ohio; "core" because its elements are broad and generic, usable for describing a wide range of resources. The fifteen element "Dublin Core" described in this standard is part of a larger set of metadata vocabularies and technical specifications maintained by the Dublin Core Metadata Initiative (DCMI). The full set of vocabularies, DCMI Metadata Terms [DCMI-TERMS], also includes sets of resource classes (including the DCMI Type Vocabulary [DCMI-TYPE]), vocabulary encoding schemes, and syntax encoding schemes. The terms in DCMI vocabularies are intended to be used in combination with terms from other, compatible vocabularies in the context of application profiles and on the basis of the DCMI Abstract Model [DCAM].
Dublin Core Metadata Initiative
The Dublin Core Metadata Initiative (DCMI) supports shared innovation in metadata design and best practices across a broad range of purposes and business models. DCMI does this by: managing long term curation and development of DCMI specifications and metadata terms namespaces; managing ongoing discussion of current DCMI-wide work themes; setting up and managing international and regional events; curation and open availability of meeting assets including proceedings, project reports and meeting minutes; creation and delivery of training resources in metadata best practices including tutorials, webinars and workshops; and coordinating the global community of DCMI volunteers. The Dublin in the name comes from Dublin, Ohio, following an early workshop about the initiative in 1995.
EBCDIC
EBCDIC (Extended Binary Coded Decimal Interchange Code) is a legacy character encoding introduced on IBM computers in the 1960s. EBCDIC could be used internationally through the use of code pages, each of which is in effect a separate EBCDIC-like character encoding. To decode an EBCDIC message written in Japanese, for example, one would need to look up the variant used, such as code page 930 (CCSID 930).
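As a rough illustration of why the code page matters, the sketch below encodes the same text as ASCII and as EBCDIC. It assumes Python's built-in `cp037` codec, one EBCDIC code page chosen here purely for the example; other code pages map bytes differently.

```python
# Encode the same text with ASCII and with an EBCDIC code page.
# cp037 is one of several EBCDIC code pages shipped with Python;
# it is picked here only for illustration.
text = "HELLO"

ascii_bytes = text.encode("ascii")
ebcdic_bytes = text.encode("cp037")

print(ascii_bytes.hex(" "))   # byte values under ASCII
print(ebcdic_bytes.hex(" "))  # byte values under EBCDIC (cp037)

# Decoding EBCDIC bytes with the wrong scheme gives nonsense,
# while the matching code page recovers the original text.
wrong = ebcdic_bytes.decode("latin-1")
right = ebcdic_bytes.decode("cp037")
print(repr(wrong), repr(right))
```

The bytes themselves carry no label saying which code page produced them, which is why the variant has to be known or looked up before decoding.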
Educopia
The Educopia Institute’s mission is to build networks and collaborative communities to help cultural, scientific, and scholarly institutions achieve greater impact. They believe in the power of connection and collaboration. Educopia encourages knowledge sharing and network building across institutions, communities, and sectors. Their strengths include training, neutral community facilitation, and administrative backbone support services for collaborative communities. Educopia also develops and manages applied research projects that benefit affiliated communities and the broader information fields of libraries, archives, and museums. Educopia helps information stakeholders including researchers, archivists, curators, publishers, and students to establish common ground, work toward shared goals, and ultimately achieve system-wide transformations.
Emulation
The recreation of legacy, or current, computer architecture in software (an 'emulator') such that it can then be used to run the operating system and software of said original hardware. Emulation is a potential method of delivery of ‘preserved’ content to users. Those who want to access content will do so by interacting with the computer system as-was, or as-is. A popular JavaScript emulator called JSMESS is used by the Internet Archive to enable full interaction with, and playability of, retro PC/DOS computer games archived by the service.
Enterprise Solution
A buzzword (jargon) for a piece of software, or a system, that has the potential to satisfy the needs of all, or a group of users, across an organization. An enterprise content management system is named as such as it is expected to be interacted with by all of a company’s employees.
EOF
End of File (EOF). A good file format signature will also be anchored to another piece of data in the file. This will often be the very end, e.g. PDF provides an end of file sequence that can be used. Programs, while not very efficient at reading every byte in a file, can easily look at the head and tail of an object within a certain threshold of bytes.
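A minimal sketch of reading only the head and tail of a file is given below. The function names, the 1024-byte threshold, and the stand-in file are assumptions made for illustration; the check is far looser than what a real identification tool does.

```python
import os
import tempfile

def read_head_and_tail(path, limit=1024):
    """Return up to `limit` bytes from the start and from the end of a file."""
    size = os.path.getsize(path)
    with open(path, "rb") as handle:
        head = handle.read(limit)
        # Seek to `limit` bytes before the end (or the start for small files).
        handle.seek(max(size - limit, 0))
        tail = handle.read(limit)
    return head, tail

def looks_like_pdf(path):
    """Hypothetical check: PDF header at the start, %%EOF marker near the end."""
    head, tail = read_head_and_tail(path)
    return head.startswith(b"%PDF-") and b"%%EOF" in tail

# Build a tiny stand-in file; it is not a valid PDF, it only carries the two markers.
with tempfile.NamedTemporaryFile(delete=False, suffix=".pdf") as tmp:
    tmp.write(b"%PDF-1.4\n" + b"0" * 5000 + b"\n%%EOF\n")
    sample_path = tmp.name

result = looks_like_pdf(sample_path)
os.remove(sample_path)
print(result)
```

Only about two kilobytes are read regardless of how large the file is, which is the efficiency the entry describes.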
Ex-Libris
Ex Libris is an Israel-based company responsible for the digital preservation system Rosetta, the library management system Alma, and the description and discovery layer Primo.
Executable
A synonym for program, a file that can be run, or executed by a user of a computer system.
explainshell.com
Not strictly for digital preservation, but useful nonetheless, explainshell.com will annotate Linux commands for users and enables those annotations to be shared.
False Positive
A false positive occurs when a format is matched incorrectly, or imprecisely, in DROID or Siegfried. This can happen when the amount of scanning done by the tool is limited and the format has some similarities to another, e.g. PDF/A files require more bytes to be scanned than regular PDF. A false positive can be hard to spot because a match is, after all, a match. False positives can impact workflow routing and future preservation planning.
ffmpeg
A free and open source tool for working with audio and video. ffmpeg can characterize multimedia, and even output visual analyses. It can transcode multimedia into other file formats, and perform many other manipulations. Developed and maintained by the ffmpeg team.
ffmprovisr
An online resource of community contributed ‘recipes’ (commands) for processing audio visual files through the open source audio visual transcode and characterization tool ffmpeg.
Fido
Fido was the second client tool to make use of a subset of the PRONOM signatures. Fido was created in Python and utilized traditional regular expressions to match file formats with signatures. This meant converting the PRONOM signatures into a format that could be understood by a standard regular expression matching engine. Fido is used in Archivematica and is still maintained as part of the Open Preservation Foundation’s stewardship.
File (Archival) (Society of American Archivists, definition)
A group of documents related by use or topic, typically housed in a folder (or a group of folders for a large file). The plural, files, is the whole of a collection of records.
File (Digital)
An information encoding that can be retrieved from digital storage and interpreted by a computer to be presented to the user in a suitable way.
File (Utility)
File is a Linux-based tool for identifying file formats. Unlike DROID and Siegfried it does not return unique identifiers for what it finds. File uses a different mechanism and a different corpus of information to identify the format of a digital object.
File Classification Scheme
Also called a Business Classification Scheme, this is the folder and directory layout adopted in a business, organized by function, client, activity, etc., such that records are contextualized, described appropriately, and retrievable.
File Format Extension
A file extension is part of a file’s name. The extension is commonly three characters in length and prefixed with a dot. For the file name ‘example.txt’, the extension is .txt. Registries of extensions exist that can be searched so that an ID can be assigned to a file. A file extension has no bearing on the content of a file; as such, a file that has the file extension .pdf is not guaranteed to be "Portable Document Format" (PDF). A file may not have the right extension for a number of reasons, including for circumventing information security measures (e.g. certain upload types on a website). Users may also adopt a temporary naming scheme, e.g. renaming a file .backup, or .tmp. Users may not know the appropriate extension and so might provide another, e.g. assigning .xls (Microsoft Excel) to a .csv (comma-separated-values table format).
File Format Signature
A file format signature is a sequence, or sequences of bytes inside a digital file. Bytes become human readable when looked at through a hex (hexadecimal) editing tool. One may find the hexadecimal values 0xD0 0xCF 0x11 0xE0 (DOC FILE) at the beginning of a Microsoft Word file – 0x denotes hexadecimal. Taken verbatim, these four bytes can be used by a tool to categorise any files that also begin with the same sequence. The skill in crafting a good file format signature is finding a set of sequences unique enough to group all files belonging to a single file format; broad enough so as not to miss a single file; and narrow enough not to falsely identify other files – a false positive. File format signatures are often described in file format specifications but they may still need crafting into something more useful that can be used by tools such as DROID.
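The verbatim matching described above can be sketched in a few lines. The function name and the stand-in data are hypothetical; the four signature bytes are the ones quoted in the entry.

```python
# The four bytes quoted above, written as a hex string.
OLE2_PREFIX = bytes.fromhex("D0CF11E0")

def starts_with_signature(data: bytes, signature: bytes = OLE2_PREFIX) -> bool:
    """Return True when the data begins with the signature bytes."""
    return data.startswith(signature)

word_like = OLE2_PREFIX + b"\x00" * 16   # stand-in content, not a real document
plain_text = b"Just some text"

is_word_like = starts_with_signature(word_like)
is_plain_text = starts_with_signature(plain_text)
print(is_word_like, is_plain_text)
```

Four bytes alone are a broad match: other file types built on the same container begin with the same sequence, which is the kind of false positive a well-crafted signature tries to rule out with further sequences.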
Fixity
Fixity is the property of remaining fixed. That is, the features of a record that we can use to determine that a record has not been changed. Context, date ranges, materials used, and in the case of digital records, checksums enable us to demonstrate mathematically that a record has not been modified. Fixity is key in demonstrating authenticity and integrity.
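For digital records, a fixity check amounts to recomputing a checksum and comparing it with the one recorded earlier. The sketch below assumes SHA-256 and uses made-up byte strings in place of real files.

```python
import hashlib

def checksum(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hexadecimal string."""
    return hashlib.sha256(data).hexdigest()

original = b"Minutes of the meeting, 12 March."
recorded = checksum(original)   # stored alongside the record at ingest

unchanged = b"Minutes of the meeting, 12 March."
altered = b"Minutes of the meeting, 13 March."

# A later check recomputes the checksum and compares it with the stored value.
fixity_ok = checksum(unchanged) == recorded
fixity_failed = checksum(altered) != recorded
print(fixity_ok, fixity_failed)
```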
Folder (Archival) (Society of American Archivists, definition)
A sheet of cardboard or heavy paper stock that is used as a loose cover to keep documents and other flat materials together, especially for the purposes of filing.
Folder (Digital)
A delineation of digital storage that organizes files into groups. Synonymous with the term directory. A directory inside another is often called a sub-directory.
Format Identification
Format identification has previously been accepted as the first step of digital preservation: ‘knowing what you’ve got’. The volume of material that some organisations are responsible for, however, makes this an ideal, but not necessarily a practicality. File format identification means looking at a digital file’s data (its binary content) for patterns that match the structures of specific file formats. Reading the pattern "%PDF-1.4" at the beginning of a file may, for example, be a good indication that the file is going to be a "Portable Document Format" file (PDF). Where a binary pattern cannot be ascribed to a digital file, either because one isn’t known or because the file doesn’t conform to one, then other clues may be used. File extension may be a clue as to a file format, e.g. CSV (Comma Separated Values). File name may be another, e.g. consistently named files such as .DS_Store or Thumbs.db.
Fuzzy Hashes
Having understood checksums, one might also be interested in fuzzy hashes. These are used in an alternative way to the checksums discussed here. Fuzzy hashes are used to determine the similarity of content – e.g. to determine when only small changes have been made to a data stream. This property of fuzzy hashes can be exploited to perform content sentencing, or to point users to similar content if there is a record available.
Git
An example of a version control system, well known in part because of GitHub, the cloud-based hosting service built around the tool.
GitHub
A cloud-based version control and storage mechanism for source code, datasets, and other forms of publishing. The command line tool Git allows users to interact with it. Users can use Git and GitHub to create, clone, branch, and contribute to open source projects.
Graph Databases
A NoSQL database option. That is, a database that doesn't rely on SQL (Structured Query Language). Graphs are connected networks of information. Vertices are connected by edges. Vertices are often called resources (an identifier for a person, place, record) whose edges then describe it; edges are the properties belonging to a resource. Edges can be resources in their own right as an edge may have its own meaning and semantic rules. A simple graph may be: Subject (Resource) -> Predicate (property name) -> Value (property value), e.g. The USA -> hasStates -> 50. An RDF-based graph database is queried through a language called SPARQL (SPARQL Protocol and RDF Query Language). Graph databases are extensible, meaning it is easy to add and connect more properties and resources.
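The subject, predicate, value pattern can be sketched with plain Python tuples. This is a toy in-memory stand-in, not a graph database or SPARQL; the `match` helper and the extra triples are invented for illustration.

```python
# Each triple is (subject, predicate, value). Names are illustrative only.
triples = {
    ("USA", "hasStates", 50),
    ("USA", "hasCapital", "Washington, D.C."),
    ("Washington, D.C.", "isA", "City"),
}

def match(subject=None, predicate=None, value=None):
    """Return triples matching the given parts; None acts as a wildcard."""
    found = [
        (s, p, v)
        for (s, p, v) in triples
        if subject in (None, s) and predicate in (None, p) and value in (None, v)
    ]
    return sorted(found, key=str)

about_usa = match(subject="USA")
states = match(subject="USA", predicate="hasStates")

# Extending the graph is just adding another triple.
triples.add(("USA", "hasCurrency", "US dollar"))
about_usa_extended = match(subject="USA")
print(states, len(about_usa), len(about_usa_extended))
```

Note that "Washington, D.C." appears both as a value and as a subject, which is how resources become connected into a network.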
Hash Function
A mapping of data of arbitrary length to a fixed length string. The output of a hash function can be called a hash value, hash code, digest, or simply a hash. A checksum in digital preservation is a hash of the data inside a file.
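The "arbitrary length in, fixed length out" property can be seen directly. SHA-256 from Python's standard `hashlib` module is used here as one example of a hash function; the inputs are arbitrary.

```python
import hashlib

# Inputs of very different sizes, including an empty one.
inputs = [b"", b"a", b"a much longer piece of data " * 1000]

for data in inputs:
    digest = hashlib.sha256(data).hexdigest()
    # Input length varies; digest length does not.
    print(len(data), len(digest), digest[:16])

digest_lengths = {len(hashlib.sha256(data).hexdigest()) for data in inputs}
```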
Heritrix
Heritrix is a web crawler created by the Internet Archive and was designed purposely for web archiving. The last stable release of Heritrix was in 2014.
Heuristic
A set of rules or principles that can be used to derive an outcome. For example, if asked to determine which direction is east or west, one might look at the time of day, and the position of the sun, and estimate accordingly. Heuristics are often employed in programming where a formal algorithm does not exist but an outcome still needs to be derived, e.g. in reverse engineering a file format from a sample corpus.
Hex Editor
Hex (hexadecimal) editors, e.g. HxD, are tools for representing the binary content of a file in hexadecimal form, usually in contiguous rows of bytes. A hex editor is often split into two view panes. The left pane shows the hexadecimal form of the binary content of a file. The right shows the characters that can be rendered using the ASCII encoding scheme, or another scheme supported by the tool such as EBCDIC. Hex editors are important tools for developing file format signatures.
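The two-pane layout can be approximated with a small function. This is a rough sketch of what a hex editor displays, not how any particular tool is implemented; the `hexdump` name and row layout are assumptions.

```python
def hexdump(data: bytes, width: int = 16) -> str:
    """Format bytes as rows of: offset, hex values, printable ASCII."""
    pad = width * 3 - 1  # width of a full row of "XX " groups, minus the last space
    rows = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{byte:02X}" for byte in chunk)
        # Show printable ASCII as-is and everything else as a dot.
        text_part = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        rows.append(f"{offset:08X}  {hex_part:<{pad}}  {text_part}")
    return "\n".join(rows)

dump = hexdump(b"%PDF-1.4\n\x00\xff")
print(dump)
```

The left column of hex values corresponds to the left pane, and the dotted ASCII rendering to the right pane.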
Hexadecimal (Base 16)
A number system of 16 characters, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F. Hexadecimal can represent all numbers. Its primary application is the representation of binary numbers in the form of two digit bytes. Hexadecimal makes binary easier to read, for example, the number 255, in binary is, 0b11111111, and in hexadecimal is 0xFF. A hexadecimal number is often prefixed with the number zero and letter ‘x’ to indicate that the following characters are hexadecimal.
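The conversions mentioned above can be tried with Python's built-in functions, which use the same `0b` and `0x` prefixes.

```python
value = 255

as_binary = bin(value)   # Python prefixes binary with 0b
as_hex = hex(value)      # and hexadecimal with 0x

# Parsing goes the other way; base 16 is given explicitly.
from_hex = int("FF", 16)
from_prefixed = int("0xFF", 16)

# One byte is always written as two hexadecimal digits.
two_digits = f"{10:02X}"

print(as_binary, as_hex, from_hex, from_prefixed, two_digits)
```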
Human Readable
A delineation in the understanding of a data structure where the core components are easily understandable without calculation or computation. For example, XML is commonly understood, despite complexity in its structure, to be human readable, because readers can look at its elements, attributes, and data values, and understand what the data is and how it is represented.
Hydra
A community grounded organisation (Stanford University, DuraSpace, plus archival and education institutions across the US and UK) that provides a repository solution. Now also working in collaboration with the Digital Public Library of America to release Hydra-in-a-Box.
Infeasible to Invert
Means it is computationally difficult and time consuming to reverse engineer the output of a cryptographic hash function. The one mechanism to do it would be to try all possible combinations of input, yet, original data size is not known, and there are no clues to the original data type or content.
Information
Information may be synonymous with data. It could also be argued that information is data that can be interpreted meaningfully, that is, it is more than just a nonsense stream of bytes. It has tangible meaning to the computer or human being interpreting it.
Information and Records Management
The management of information in an organization throughout its contexts.
Information Asset (The National Archives, UK)
An information asset is a body of information, defined and managed as a single unit so it can be understood, shared, protected and exploited efficiently. Information assets have recognisable and manageable value.
Information Asset Register
A register of information assets that describes what they are, their value, their life-cycle, and risk profile.
Information Theory
The study of the quantification, storage, and communication of information. The application of information theory is crucial to compression techniques used in various hardware and software applications such as the transmission of signals, or JPEG compression.
Inspect Container File Contents
Different from container ‘identification’. If DROID or Siegfried encounters a file that is legitimately a container or ‘archive’ file format, such as ZIP or TAR (Tape Archive File), then setting 'Inspect Container File Contents' can make the tool look inside the file and return PUIDs for the container's contents as well.
Integer
An integer is a whole number, that is, a number with no fractional part. It may be zero, positive, or negative, e.g. -2, -1, 0, 1, 2, 4, 8, 16, 32, 64, 128.
Integrity (UNESCO, 2003, definition)
The state of being whole, uncorrupted and free of unauthorised and undocumented changes.
Intellectual Entity
An item, or grouping of items, that constitutes a record in a digital repository, e.g. a book is an intellectual entity made up of many pages. Different mechanisms of displaying or looking after this book may be called representations.
Intellectual Entity (PREMIS definition)
An Intellectual Entity is a distinct intellectual or artistic creation that is considered relevant to a designated community. For example, a particular book, map, photograph, database, or hardware or software. An Intellectual Entity can include other Intellectual Entities; for example, a web site can include a web page and a web page can include an image. An Intellectual Entity may have one or more digital or non-digital Representations.
Internet Archive
The Internet Archive is a San Francisco–based non-profit digital library with the stated mission of "universal access to all knowledge". It provides free public access to collections of digitized materials, including websites, software applications/games, music, movies/videos, moving images, and nearly three million public-domain books. The Internet Archive is responsible for the Wayback Machine, a mechanism for retrieving and viewing snapshots (Mementos) of websites from the past.
Intrinsic Value
Value intrinsic to a record that cannot alone be captured through copying or transcription. According to the Society of American Archivists the value intrinsic to the record may come from: form, layout, materials, or process. It may also be based on an item's direct relationship to a significant person, activity, event, organization, or place.
ISAAR(CPF)
The International Standard Archival Authority Record for Corporate Bodies, Persons and Families ISAAR(CPF). A content model and a companion standard to General International Standard Archival Description (ISAD(G)). ISAAR(CPF) provides guidelines for recording authority data for entities associated with archival materials. The model defines 27 elements in four areas of an authority record (Identity, Description, Relationships, and Control).
ISAD(G)
An ICA standard. ISAD(G) (General International Standard Archival Description) defines the elements that should be included in an archival finding aid. There are 26 elements, of which 6 are mandatory: Reference code, Title, Name of Creator, Dates of Creation, Extent of the Unit of Description, Level of description
ISDF
International Standard for Describing Functions (ISDF). Provides guidance for preparing descriptions of functions of corporate bodies associated with the creation and maintenance of archives. Function definition includes subfunction, business process, activity, task, transaction or other term in international, national or local usage. The standard states that analysis of the functions of corporate bodies is important as the basis for many record keeping activities. Functions are recognised as generally being more stable than administrative structures. Description of functions plays a vital role in explaining the provenance of records.
ISDIAH
International Standard for Describing Institutions with Archival Holdings (ISDIAH). Provides general rules for the standardisation of descriptions of institutions with archival holdings to enable: the provision of practical guidance on identifying and contacting institutions with archival holdings, and accessing holdings and available services; the generation of directories of institutions with archival holdings and/or authority lists; the establishment of connections with authority lists of libraries and museums and/or developing common directories of cultural heritage institutions at a regional, national and international level; and the production of statistics on institutions with archival holdings, at a regional, national or international level.
ISO
International Organization for Standardization (ISO) is an independent, non-governmental international organization responsible for creating standards that support innovation and provide solutions to global challenges. Made up of 163 national standards bodies. ISO standards are specifications for products, services and systems, to ensure quality, safety and efficiency. They are instrumental in facilitating international trade. ISO headquarters are in Geneva, Switzerland.
ISO 14721:2012
OAIS. Space data and information transfer systems -- Open archival information system (OAIS) -- Reference model. A reference model for what is required for an archive to provide long-term preservation of digital information.
ISO 15489-1:2016
Information and documentation -- Records management – Part 1: Concepts and Principles. ISO 15489 is the first standard devoted specifically to records management; providing an outline for comprehensive assessment of full and partial records management programs.
ISO 15489-2:2016
Information and documentation -- Records management – Part 2: Guidelines. ISO 15489 is the first standard devoted specifically to records management; providing an outline for comprehensive assessment of full and partial records management programs.
ISO 16363:2013
Audit and certification of trustworthy digital repositories – sets out comprehensive metrics for what an archive must do, based on OAIS
ISO 16919:2014
Requirements for bodies providing audit and certification of candidate trustworthy digital repositories – specifies the competencies and requirements on auditing bodies
Iterative Development
The development of a project, product, or software in a series of steps, with each step developing a minimal viable product, and each step providing a functional or behavioural improvement on the last.
itforarchivists.com
A companion website for Siegfried that has drag and drop functionality for identifying individual files. Itforarchivists is pretty cool and retro and a great resource for introducing folks to the concept of format identification.
JHOVE
JHOVE uses the concepts of well-formedness and validity to return statistics about different file formats. JHOVE currently supports over 14 different file formats including PNG, TIFF, and WAVE Audio. JHOVE can tell users whether their file is well-formed; valid; both; or neither. Rules that determine this are encoded in the tool and have been elicited from the file format specifications for those file types. JHOVE is currently maintained by the Open Preservation Foundation. It was originally developed by Harvard University Libraries (HUL) and Gary McGath.
jpylyzer
A tool written in Python that characterizes JPEG2000 (JP2) files. Important in digitization workflows where JP2 is increasingly adopted for its savings in storage space over TIFF.
Just Solve the File Format Problem
A wiki style registry of file formats that can be edited by all users. It differs from PRONOM in the regard that anyone can add information, and so it is a good idea to submit something to this wiki first, or in concert with PRONOM, for the benefit of the community. Just Solve It is an initiative of the Internet Archive.
KryoFlux
USB controller and write blocker for legacy floppy disk drives. It allows us to use 3.5-inch and 5.25-inch disk drives on modern computer hardware. 8-inch floppy disk support is also feasible but more difficult to attain. KryoFlux reads the magnetic flux of a disk, that is, the magnetic signals that are then converted to binary information. Alternatives to KryoFlux are available such as the SuperCard Pro.
Levels of Preservation
NDSA Levels of Preservation (LOP), a rubric that can provide guidance for institutions that want to do digital preservation, or are doing digital preservation. The rubric sets out a minimal set of standards an institution can aim for and progressions that they can then seek to achieve. Levels of Preservation is an understandable and pragmatic guide for folks in the industry.
Library Carpentry
The development of practical digital skills, such as programming basics, for use in the GLAM sector. Library Carpentry is also a set of open source tutorials and lessons available on GitHub to help teach librarians and archivists digital literacy skills required in this era.
Linked Open Data
Data created and made available using the strengths of the web-technology stack. Four main principles are followed: Use URIs (Uniform Resource Identifiers) to name (identify) things. Use HTTP URIs so that these things can be looked up (interpreted, ‘dereferenced', e.g. via web browser). Provide useful information about what a name identifies when it's looked up, using open standards such as RDF, SPARQL, etc. Refer to other things using their HTTP URI-based names when publishing data on the Web.
Linux
A free and open source operating system (OS) developed in the 90s by Finnish computer scientist Linus Torvalds and based on Unix-like principles. Android smartphones are Linux based, as are a number of commodity devices such as digital video recorders. Linux is characterized by its ‘kernel’ which provides the core control of the underlying computer system. Distributions add features to the operating system and are as well known as the OS itself, e.g. Ubuntu, Raspbian, Debian, and Red Hat.
LOC
Library of Congress. A research library that officially serves Congress and is the de-facto National Library of the United States of America. Library of Congress Digital Preservation initiatives include the National Digital Stewardship Residency programme.
LOCKSS
Lots of Copies Keeps Stuff Safe. An idiom and a program, based at Stanford University Libraries, that provides libraries and publishers with low-cost, open source digital preservation tools to preserve and provide access to persistent and authoritative digital content.
Magic Number
✨Magic✨ number is often used as a synonym for file format signature. The etymology for the term dates back to the seventh version of the Unix operating system (1979). The use of magic numbers grew as requirements for them did. The use of the phrase file format signature seems to have come about through the maturing of the field of digital preservation.
Management of Uncertainty
The use of data about an uncertain topic or event to simulate a range of potential outcomes that can be used to manage projects, risks, and costs, by giving stakeholders an evidence based projection about what may happen.
MD5
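Message Digest 5. A hash function that produces a 128-bit digest, usually written as a 32-character hexadecimal string. Theoretically, around 21 quintillion files would be needed before an accidental collision becomes likely.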
Memento
A standard for accessing and interacting with various web archives across the globe. Memento is a project led by the Los Alamos National Laboratory and Old Dominion University. Rather than expecting people to know about the growing number of Web archives, and to guess which archive might hold an older version of the resource they’re looking for, Memento proposes to make archived content discoverable via the original URL that the searcher already knew about. Memento adds a time dimension to web archives, and perhaps its most well-known implementation is archive.org, aka the Internet Archive.
Metadata
Metadata describes the properties or context of another object, for example, the number of pages in a book, and the number of words. Metadata can be associated with physical or digital objects. Records may be self-describing, meaning that the metadata can be read from the file, e.g. author is sometimes encoded in a file separately from the content. Metadata can be derived from the digital object’s content, e.g. image resolution, audio length. The file system also has metadata which the operating system uses to describe a digital object; modification and creation dates are two such examples.
Metadata Extraction
The extraction of metadata from a digital object, often using tools that can read the file and export the information in a machine-readable form such as XML or JSON.
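As a small sketch of extraction to a machine-readable form, the example below reads file system metadata with Python's standard library and exports it as JSON. The function name, the choice of fields, and the temporary file are all assumptions for illustration; dedicated tools extract far richer, format-specific metadata.

```python
import json
import os
import tempfile
from datetime import datetime, timezone

def extract_file_metadata(path):
    """Collect basic file system metadata for a path as a plain dictionary."""
    info = os.stat(path)
    modified = datetime.fromtimestamp(info.st_mtime, tz=timezone.utc)
    return {
        "name": os.path.basename(path),
        "size_bytes": info.st_size,
        "modified_utc": modified.isoformat(),
    }

# Create a throwaway file so the example runs anywhere.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"example content")
    sample_path = tmp.name

metadata = extract_file_metadata(sample_path)
as_json = json.dumps(metadata, indent=2)
os.remove(sample_path)
print(as_json)
```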
Metadata Mapping
The process of selecting metadata about a digital object and encoding it into an alternative schema, e.g. for archival description, or preservation.
Migration
The transformation of a file format into another file format with the same, or similar properties, e.g. migration of a JPEG image to PNG. Migration is a potential method of preserving digital information. The key for many users is the measurement and quality analysis of properties apparent in the data before, and after migration, and an understanding of what may be lost along the way.
Nanite
The Nanite project builds on DROID and Apache Tika to provide a rich format identification and characterization system. It aims to make it easier to run identification and characterisation at scale, and helps compare and combine the results of different tools. Nanite provides an API (application programming interface) for DROID where DROID currently doesn’t have an easy to work with API. Nanite was developed by Andy Jackson of the British Library and the UK Web Archive.
NASA
National Aeronautics and Space Administration. Part of a committee of space organisations called Consultative Committee for Space Data Systems (CCSDS), responsible for the first draft of the OAIS (Open Archival Information System) model CCSDS 650.0-R-1.1
NDSA
National Digital Stewardship Alliance is a consortium of organizations (212, including, Digital Public Library of America, Yale University Library, and New York Public Library (NYPL)) committed to the long-term preservation of digital information.
NDSA Level 1: File Fixity and Data Integrity
Check file fixity on ingest if it has been provided with the content. Create fixity info if it wasn't provided with the content.
NDSA Level 1: File Formats
When you can give input into the creation of digital files encourage use of a limited set of known open formats and codecs.
NDSA Level 1: Information Security
Identify who has read, write, move, and delete authorization to individual files. Restrict who has those authorizations to individual files.
NDSA Level 1: Metadata
Inventory of content and its storage locations. Ensure backup and non-collocation of inventory.
NDSA Level 1: Storage and Geographic Location
Two complete copies that are not collocated. For data on heterogeneous media (optical disks, hard drives, etc.) get the content off the medium and into your storage system.
NDSA Level 2: File Fixity and Data Integrity
Check fixity on all ingests. Use write blockers when working with original media. Virus-check high risk content.
NDSA Level 2: File Formats
Inventory of file formats in use.
NDSA Level 2: Information Security
Document access restrictions for content.
NDSA Level 2: Metadata
Store administrative metadata. Store transformative metadata and log events.
NDSA Level 2: Storage and Geographic Location
At least three complete copies. At least one copy in a different geographic location. Document your storage system(s) and storage media and what you need to use them.
NDSA Level 3: File Fixity and Data Integrity
Check fixity of content at fixed intervals. Maintain logs of fixity info; supply audit on demand. Ability to detect corrupt data. Virus-check all content.
NDSA Level 3: File Formats
Monitor file format obsolescence issues.
NDSA Level 3: Information Security
Maintain logs of who performed what actions on files, including deletions and preservation actions.
NDSA Level 3: Metadata
Store standard technical and descriptive metadata.
NDSA Level 3: Storage and Geographic Location
At least one copy in a geographic location with a different disaster threat. Obsolescence monitoring process for your storage system(s) and media.
NDSA Level 4: File Fixity and Data Integrity
Check fixity of all content in response to specific events or activities. Ability to replace/repair corrupted data. Ensure no one person has write access to all copies.
NDSA Level 4: File Formats
Perform format migrations, emulation and similar activities as needed.
NDSA Level 4: Information Security
Perform audit of logs.
NDSA Level 4: Metadata
Store standard preservation metadata.
NDSA Level 4: Storage and Geographic Location
At least three copies in geographic locations with different disaster threats. Have a comprehensive plan in place that will keep files and metadata on currently accessible media or systems.
NDSA Levels of Preservation, Level 1
Protect your data
NDSA Levels of Preservation, Level 2
Know your data
NDSA Levels of Preservation, Level 3
Monitor your data
NDSA Levels of Preservation, Level 4
Repair your data
NDSR
An initiative created by the Library of Congress. The mission of the National Digital Stewardship Residency (NDSR) is to build a dedicated community of professionals who will advance the nation's capabilities in managing, preserving, and making accessible the digital record of human achievement. This will enable current and future generations to fully realize the potential of digital resources now and for years to come.
nestor Seal
An extended self-assessment process based on the standard DIN 31644, recognizing the trustworthiness of a digital archive. If a nestor assessment yields a positive result, the archive is entitled to publicise this by using the nestor Seal for Trustworthy Digital Archives.
Ngram
A sequence that describes the occurrence of N ‘terms’ (syllables, words, names, etc.) retrieved from a corpus, or corpora, of information for the purpose of research.
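A short sketch of generating and counting N-grams over words follows. The `ngrams` helper and the sample sentence are illustrative; the same idea applies to syllables, characters, or names.

```python
from collections import Counter

def ngrams(terms, n):
    """Return every run of n consecutive terms as a tuple."""
    return [tuple(terms[i:i + n]) for i in range(len(terms) - n + 1)]

words = "to be or not to be".split()

bigrams = ngrams(words, 2)   # N = 2
counts = Counter(bigrams)    # how often each bigram occurs in the text

print(bigrams)
print(counts.most_common(1))
```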
OAI-PMH
The Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). OAI-PMH is a protocol for exposing to the outside world what is held inside a digital repository, and has numerous applications such as enabling the transmission of metadata about items for an archival catalogue.
Obsolescence
Obsolescence is the process of becoming obsolete, and is identified as a cause of data potentially becoming unreadable. It is synonymous with the terms out-dated and no-longer-used. A piece of technology relies on a number of dependencies (operating system, memory type, mains power voltage, creating application, etc.), so there are a number of components that we monitor for obsolescence in digital preservation. We can mitigate obsolescence through a number of means, but there is no one-size-fits-all solution.
Offset
Offsets are important to the functionality of a signature: they describe where in a file certain byte patterns (signature patterns) are expected to be found. DROID and Siegfried both offer customisations which limit the size of an offset. These customisations can be used to speed up format identification, because a scan that reads less data can finish more quickly, but this has its trade-offs.
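The trade-off can be sketched in a few lines. This is not how DROID or Siegfried are implemented; the helper names `matches_at_offset` and `find_within`, and the `max_offset` parameter, are hypothetical. The byte pattern used is the well-known eight-byte PNG file signature.

```python
# The eight bytes that begin every PNG file.
PNG_SIGNATURE = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])

def matches_at_offset(data, pattern, offset):
    """True if pattern appears in data exactly at the given byte offset."""
    return data[offset:offset + len(pattern)] == pattern

def find_within(data, pattern, max_offset):
    """Return the position of pattern if it starts at or before max_offset, else -1."""
    return data.find(pattern, 0, max_offset + len(pattern))

sample = PNG_SIGNATURE + b"rest of the file"
shifted = b"junk" + sample  # the same signature, now starting at offset 4

print(matches_at_offset(sample, PNG_SIGNATURE, 0))
print(find_within(shifted, PNG_SIGNATURE, 2))
print(find_within(shifted, PNG_SIGNATURE, 4))
```

Limiting the search to the first two offsets is quicker but misses the signature that begins at offset 4, which illustrates the kind of trade-off a smaller scan window can introduce.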
One Way Function
A transformation of data such that the result cannot be transformed back into the original.
Open Preservation Foundation
Formerly Planets, and then the Open Planets Foundation (OPF). Founded in 2010 to sustain the results of EU-funded research and development, the OPF currently stewards a portfolio of open-source digital preservation software and enables the development of best practice through interest groups, community events, and training. Their vision is shared solutions for effective and efficient digital preservation.
Open Provenance Model
The Open Provenance Model is a model of provenance that is designed to meet the following requirements: to allow provenance information to be exchanged between systems, by means of a compatibility layer based on a shared provenance model; to allow developers to build and share tools that operate on such a provenance model; to define provenance in a precise, technology-agnostic manner; to support a digital representation of provenance for any 'thing', whether produced by computer systems or not; to allow multiple levels of description to coexist; and to define a core set of rules that identify the valid inferences that can be made on provenance representation.
Organizational Alignment (AV Preserve concept)
Beyond OAIS-like digital preservation (IT) systems, the coordination of digital preservation needs to happen 'outside-the-box' across the organization. Organizations may be complex and there is a recognition that content of value may not be managed consistently across it and in-line with digital preservation principles. Organizational alignment is the operationalization of digital preservation.
Original Order (Society American Archivists definition)
The organization and sequence of records established by the creator of the records. FR: l'ordre primitif, respect de l'ordre intérieur.
Other Cryptographic Hashes...
BLAKE-256; BLAKE-512; MD5; SHA-1; SHA-256; Whirlpool.
Participatory Archives
An archive or collection that seeks active engagement from the community, causes, organisations, and activities that it represents. Custodial control and maintenance of the archive may support the development of archival skills within that community to develop and maintain the collection further. A post-custodial view would see the control of the archive passed from the archival professionals to the community.
Physical Carrier
The medium on which digital records have been transferred to an organization.
PLANETS
Planets (Preservation and Long-term Access through Networked Services) project ended on 31 May 2010. Their work is now maintained by the Open Preservation Foundation. Planets was a four-year project co-funded by the European Union to address core digital preservation challenges. The primary goal for Planets was to build practical services and tools to help ensure long-term access to digital cultural and scientific assets.
PREMIS
The Preservation Metadata: Implementation Strategies (PREMIS) Working Group initially developed the PREMIS data dictionary as a specification with the goal of creating an implementable set of "core" preservation metadata elements, with broad applicability within the digital preservation community. The PREMIS Editorial Committee coordinates revisions and implementation of the standard. PREMIS is system- and encoding-agnostic. It defines "what a preservation repository needs to know". As such, a repository that implements PREMIS may do so in any way that is fit for that institution. PREMIS is designed to support the long-term preservation, and usability, of digital objects.
Preservation Watch
The monitoring of the technical aspects of a digital object for obsolescence, e.g. the current long-term support status of a piece of software, or the introduction of a newer specification of a file format or standard.
Preservica
Originally called Safety Deposit Box, Preservica is an OAIS compliant digital preservation system maintained by Preservica in Abingdon, Oxfordshire, UK.
Prioritization
The DROID signature file contains more semantics than the signatures alone. To avoid two PUIDs being returned for a single file as much as possible, prioritization of signatures has to take place, that is, if a signature matches with higher priority over another, then that is returned in favour of the other. When looking at the records of signatures in PRONOM prioritizations are listed on the front page of the record, not the signature page. All this information is included in the signature file, so that is how DROID finds it all in one place.
Programming Language
A set of instructions and rules that can be combined to perform a computational task or set of tasks. A programming language is just a flavour of instructions that all need to be boiled down to something that the processor can understand – usually machine language. Programming languages usually differ in terms of abstraction, meaning that low-level languages work much closer to the hardware (closer to machine code) than high-level languages.
PRONOM
PRONOM is a digital preservation technical registry. It is maintained by The National Archives, UK. PRONOM’s purpose in the community is to be a centralised service for file format signatures. File format signatures are consumed by tools such as DROID, Siegfried, and Fido. A unique identifier is assigned to every file format that can be identified through these tools.
PRONOM Release
A PRONOM release happens when a publishing job is run by The National Archives, UK. Importantly, the draft information in the database is published onto the web, and a signature file is created via database stored procedure and uploaded to a location where it can be accessed via web service.
PRONOM Release Notes
The PRONOM release notes are released in XML form and are available from the PRONOM index page on the web. Each release is summarised in terms of: New Records, new records for file formats that now have PUIDs; Updated Records, format records in PRONOM that have had their information updated in some way, including signature changes; and New Signatures, file formats that now have signatures associated with them and can be identified via PRONOM.
PRONOM web services
PRONOM delivers signature files to tools via web services. DROID for example will first use a web-service to check for new signatures. If they exist it will then communicate with a second web service to download those signatures in the form of a 'signature file'. A second type of signature file, Container Signatures, are downloaded via more traditional web based techniques utilizing a web-page’s Last-modified date, to seek new data.
PRONOM XML
PRONOM can be accessed via XML, making it possible to download and remix. The links look like: http://www.nationalarchives.gov.uk/PRONOM/fmt/{no}
pronom@nationalarchives.gsi.gov.uk
The email address to send format requests to at The National Archives, UK.
ProQuest
ProQuest LLC is a Michigan-based global information-content and technology company. It was founded in 1938 as University Microfilms by Eugene B. Power. ProQuest provides solutions, applications, and products for libraries. Its resources and tools support research and learning, publishing and dissemination, and the acquisition, management and discovery of library collections. Ex Libris currently sits within the ProQuest portfolio of companies.
Provenance
The background of a record that reveals its context, but which also allows the demonstration of the authenticity and integrity of the record’s content and meaning.
PTAB
Primary Trustworthy Digital Repository Authorisation Body Ltd. Responsible for three ISO standards related to establishing an internationally recognised and certified set of trustworthy digital repositories: ISO 14721:2012, also known as CCSDS 650.0-M-2 (OAIS, a reference model for what is required for an archive to provide long-term preservation of digital information); ISO 16363:2012, also known as CCSDS 652.0-M-1 (Audit and certification of trustworthy digital repositories, which sets out comprehensive metrics for what an archive must do, based on OAIS); and ISO 16919:2014, also known as CCSDS 652.1-M-2 (Requirements for bodies providing audit and certification of candidate trustworthy digital repositories, which specifies the competencies and requirements on auditing bodies).
PUID
PRONOM Unique Identifiers (PUIDs) are assigned to all formats in the PRONOM registry. There are two primary types, fmt and x-fmt. The latter is the result of a historical error when x-fmt identifiers were made available to the public. A subsequent decision to maintain x-fmt was made in favour of continuity as a standard. There is no longer a semantic difference between identifier types: the x- is no longer experimental, and it is equivalent to the other type.
Purpose of a Checksum
A checksum algorithm calculates a fixed length string based on the data in a file alone. A file with the letters USA has MD5 checksum f75d91cdd36b85cc4a8dfeca4f24fa14, and will always have the checksum f75d91cdd36b85cc4a8dfeca4f24fa14. If a single bit changes, the checksum will be unrecognisably different. A file with the letters USB (USA to USB, a change of two bits) has checksum 7aca5ec618f7317328dcd7014cf9bdcf. Checksums are great for spotting data integrity errors, the key to digital preservation. Bit level preservation is simply about checking the checksums, constantly.
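The behaviour described here can be tried with Python's standard `hashlib` module. This is a minimal sketch that hashes the three bytes `USA` and `USB` directly, with no trailing newline; a file saved with any extra bytes would produce different digests, so the printed values are not guaranteed to match those quoted in this entry.

```python
import hashlib

def md5_of(data):
    """Return the MD5 checksum of a byte string as a hexadecimal string."""
    return hashlib.md5(data).hexdigest()

first = md5_of(b"USA")
again = md5_of(b"USA")    # same input, so the same checksum every time
changed = md5_of(b"USB")  # two bits differ in the final byte

# Count the bits that differ between the two inputs.
bits_changed = sum(bin(a ^ b).count("1") for a, b in zip(b"USA", b"USB"))

print(first, again, changed)
print("bits changed in the input:", bits_changed)
```

A fixity check amounts to repeating this calculation later and comparing the result against the stored checksum.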
RDF
Resource Description Framework (RDF) is a family of World Wide Web Consortium (W3C) specifications originally designed as a metadata data model. It is also used as a general method for conceptual description or modelling of information that is implemented in web resources. It is used in knowledge management applications.
Record (ISO 15489-1:2016 definition)
Information created or received and maintained as evidence and as an asset by an organization or person, in pursuit of legal obligation or in the transaction of business.
Record Group
A record group is a collection of records, organized by transferring, or depositing agency, sometimes divided into organizational division.
Recursion
A method of repeating a process where the result of the process gets fed back into itself as an input. In some programming languages that lack loop constructs, recursion is the means by which data is processed over and over until a certain exit condition is met.
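A minimal sketch of recursion, using a made-up function `count_bytes` that totals the lengths of a list of byte strings without using a loop. The function calls itself on a shorter list each time until the exit condition (an empty list) is met:

```python
def count_bytes(chunks):
    """Recursively sum the lengths of a list of byte strings."""
    if not chunks:  # exit condition: nothing left to process
        return 0
    # Handle the first item, then feed the remainder back into the function.
    return len(chunks[0]) + count_bytes(chunks[1:])

print(count_bytes([b"ab", b"cde", b""]))
```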
Redaction
The obfuscation of sensitive or secret information in a document, e.g. so that access can be provided to satisfy a freedom of information request. The information redacted may be protected through various legislation, e.g. data protection acts. Redaction protects individuals and organisations in appropriate circumstances.
Representation (PREMIS definition)
A Representation is the set of files, including structural metadata, needed for a complete rendition of an Intellectual Entity. For example, a journal article may be complete in one PDF file; this single file constitutes the Representation. Another journal article may consist of one HTML file and two image files; these three files constitute the Representation. A third article may be represented by one TIFF image for each of 12 pages plus an XML file of structural metadata showing the order of the pages; these 13 files constitute the Representation.
Respect des Fonds
...or respect pour des fonds. Translated literally as respect of background (context). A principle for grouping records by the entirety of their context, by administration, organization, individual, entity, etc.
RESTful API
A method of retrieving data from a web service. The request is in the form of a suitable HTTP request. An HTTP response is sent back that contains the data requested by the user agent.
Retention and Disposal
Organizational awareness of the types of record being maintained and how long they need to be kept to meet business needs, legal requirements, and in the case of government, the obligation to open and transparent government.
RiC-CM
Records in Contexts Conceptual Model (RiC-CM) is a proposed standard from the International Council on Archives. It is a descriptive standard that incorporates the four existing ICA description standards, ISAD(G), ISAAR(CPF), ISDF, and ISDIAH. The standard recognises changing technologies and embraces graph technologies for description. The standard emphasises the preservation of the context in which records live. Overall, the standard exists to support the management of records; preservation of records and their context; and the reuse of records.
Risk
From risk management, the formal statement of a risk is as follows: "Because of x there is a risk that y which will result in z." The statement enables us to think about risk in terms of its impact and therefore steers us away from the concept of risk as in fear. Impacts should be measurable, and real.
Robots.txt
A mechanism that web-sites can employ to communicate with web-crawlers to prevent them from accessing them. Robots.txt can be employed to prevent spurious requests from non-altruistic bots, or for other practical reasons like the domain only having a limited amount of bandwidth available to it per month. Robots.txt can be configured for all, or parts, of a web site. Crawlers may not always cooperate with the protocol. The Internet Archive ignores Robots.txt for Government Archives.
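A cooperative crawler consults these rules before requesting a page. The sketch below uses Python's standard `urllib.robotparser` module; the rules, the crawler name `example-crawler`, and the `example.org` URLs are invented for illustration, and the rules are parsed from a string so that no network access is needed.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt that closes one part of a site to all crawlers.
rules = """\
User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

public_ok = parser.can_fetch("example-crawler", "https://example.org/index.html")
private_ok = parser.can_fetch("example-crawler", "https://example.org/private/report.pdf")
print(public_ok, private_ok)
```

Nothing in the protocol enforces the answer; a crawler that never asks `can_fetch` can still request the page.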
RODA
RODA is an open-source digital repository designed for preservation developed in Portugal. The repository supports all the main functional components of the OAIS model.
Rosetta
A large scale, OAIS (Open Archival Information System) compliant, system that implements large pieces of the digital preservation workflow from ingest to delivery. Rosetta is maintained by the company Ex Libris.
Roy
A companion tool for Siegfried that allows the data source (signature file) used by Siegfried to be customized. Customizing includes the option of not using the entire corpus of signatures available to it. An example might include creating a signature file to just identify images, e.g. as a result of a digitization workflow. A smaller signature file can, theoretically, be quicker than the entire corpus.
rsync
A command-line utility for transferring data across file systems while maintaining key file system properties such as last-modified date, and user's permissions. A good combination of flags to use in rsync to preserve important metadata may be: -rlptDv.
Safety Deposit Box
The first four implementations of the Preservica digital preservation system went under the name Safety Deposit Box. Organisations such as The National Archives, UK, and Swiss Federal Archive were some of the first to adopt this system.
Scan Web Archives
WARC (Web Archive) files are complex and can contain any number of any other file format. DROID and Siegfried can scan the contents of a WARC file returning PUIDs for every matching file inside.
Schema
Description of a data model, restrictions, and rules by which to validate against for translation into a data encoding, e.g. XML document, JSON, or database.
Scripting Language
A high level language that is compiled at run-time via a program called an interpreter. By storing a large number of more complex, yet common functions and procedures in an interpreter, the user can be free to call those functions using fewer commands in a 'script'. A user can interact with data, for example, without having to worry about the underlying memory models of the computer. A scripting language cannot be run in the absence of an interpreter, so a dependency of running such code, for example, Ruby or Python, is that the interpreter must be pre-installed on the host machine.
Semantic Versioning
A method of controlling the version numbers of software in a way that makes it clear to users what changes they can expect, and that also makes software developers more accountable for the breadth and depth of their changes in any one release. Increment the MAJOR version when you make incompatible API changes, the MINOR version when you add functionality in a backwards-compatible manner, and the PATCH version when you make backwards-compatible bug fixes.
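The three rules can be sketched as follows, assuming plain MAJOR.MINOR.PATCH strings. The helper names `parse_version` and `bump` are hypothetical, and pre-release or build suffixes are deliberately not handled.

```python
def parse_version(text):
    """Split 'MAJOR.MINOR.PATCH' into a tuple of three integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return (major, minor, patch)

def bump(version, change):
    """Return the next version string for a 'major', 'minor' or 'patch' change."""
    major, minor, patch = parse_version(version)
    if change == "major":   # incompatible API change
        return f"{major + 1}.0.0"
    if change == "minor":   # backwards-compatible new functionality
        return f"{major}.{minor + 1}.0"
    if change == "patch":   # backwards-compatible bug fix
        return f"{major}.{minor}.{patch + 1}"
    raise ValueError(f"unknown change type: {change}")

print(bump("1.4.2", "minor"))
# Comparing parsed tuples avoids the string-sorting trap where "1.10.0" < "1.9.3".
print(parse_version("1.10.0") > parse_version("1.9.3"))
```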
Sentencing
The application of disposal actions to a group of records in accordance with a suitable disposal authority.
Series
A functional grouping of records.
Series System
A collection of records organized by function, replacing the record group where the same function may be managed by different agencies over time. The series model was developed in Australia by Peter Scott in the 1960s. The system emphasises multiple and dynamic context for records and enables those contexts to be assigned to archival records. Contextual entities described by Cunningham (2012) include individuals, families, organisations, project teams, government agencies and portfolios, governments themselves, functions and activities.
SHA-1
Secure Hash Algorithm 1. 40 character string. Theoretically 1 septillion files needed for a collision.
SHA-256
Secure Hash Algorithm 256. 64 character string. Theoretically 400 undecillion files needed for a collision.
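The string lengths quoted for SHA-1 and SHA-256 follow from the digest sizes: 160 bits and 256 bits, written as hexadecimal at four bits per character. A short check with Python's standard `hashlib` module, using arbitrary example data:

```python
import hashlib

data = b"example content"  # any input gives digests of the same lengths
sha1_hex = hashlib.sha1(data).hexdigest()
sha256_hex = hashlib.sha256(data).hexdigest()

print(len(sha1_hex), len(sha256_hex))
```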
SHA1DEEP
A useful tool available for Linux and Windows for generating checksums recursively for a directory or directories of files. SHA1DEEP has compatriot tools MD5DEEP and SHA256DEEP.
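The general technique of recursive checksumming can be sketched in Python. This is not SHA1DEEP itself, and it does not reproduce that tool's options or output format. The function names are hypothetical, and the demonstration directory is created temporarily for the example.

```python
import hashlib
import os
import tempfile

def sha1_of_file(path, chunk_size=65536):
    """Return the SHA-1 checksum of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def sha1_tree(root):
    """Map every file path beneath root to its SHA-1 checksum."""
    checksums = {}
    for folder, _subfolders, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(folder, name)
            checksums[path] = sha1_of_file(path)
    return checksums

# Demonstration: one empty file inside a nested folder.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "sub"))
    with open(os.path.join(root, "sub", "empty.txt"), "wb"):
        pass
    demo = sha1_tree(root)

print(list(demo.values()))
```

Reading in chunks keeps memory use flat for large files, which matters when a directory holds multi-gigabyte objects.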
Shell
A command line or terminal (a text-based control mechanism of an operating system) available in Linux, as opposed to DOS in the Windows environment. Bash is an example of a shell available in Linux.
Shell Script
A method of chaining commands and variables (e.g. in Bash) into something called a ‘script’ to perform a set of operations that together meet a user’s processing needs.
Shine
A search engine for the UK web archive at the British Library that enables both trend analysis, and content search and retrieval.
Siegfried
Siegfried is a tool that performs a similar function to DROID. It was developed independently by Richard Lehane. Siegfried is free and open source. It incorporates a number of file format signature and identification mechanisms, including PRONOM.
Signature Development Utility
A website (ffdev.info) that enables folks outside of TNA to create individual signature files for testing. A signature developed and tested in anticipation of a submission to PRONOM may make the turnaround to it being published as part of PRONOM much quicker.
Significant Properties
Properties of individual records or groups of records that may be prioritised for preservation, and used as a measure of a successful 'preservation action', e.g. if the number of pages in a record is considered to be important, it is a significant property we need to monitor and measure. Examples of significant properties may be word count, colour profile, interactivity, etc. Significant properties are not universal. They are specific to the record and the community the record belongs to. Strategies for preservation should be developed on the basis of a full analysis of the user's requirements.
Skeleton File
A skeleton file is a mechanism for testing DROID or Siegfried when a file cannot be shared. A corpus of skeleton files can be created when other examples do not exist. Skeleton files contain only the bytes relevant to a potential signature match, and nonsense data in between to pad the format. The Skeleton Test Suite on GitHub is a good example of these files that enables developers to test for false positives and multiple matches, amongst other things.
SQL
Structured Query Language (SQL) is a standard mechanism for querying (getting results from) a relational database. Relational databases are made up of many tables, and SQL needs to be able to look at all of these and filter data at the same time to be able to answer a user's queries. Relational databases use a schema which is strict and fixed. An alternative is a graph database, where the structure is easily extended; that is, it is extensible.
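A small sketch of a query that looks across two tables at once, using Python's built-in `sqlite3` module and an in-memory database. The table names, columns, and identifiers such as `fmt/A` are invented for the example and are not real PRONOM records.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
connection.executescript("""
    CREATE TABLE formats (puid TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE files (path TEXT, puid TEXT REFERENCES formats(puid));
    INSERT INTO formats VALUES ('fmt/A', 'Format A'), ('fmt/B', 'Format B');
    INSERT INTO files VALUES ('a.dat', 'fmt/A'), ('b.dat', 'fmt/A'), ('c.dat', 'fmt/B');
""")

# Join the two tables and count how many files there are of each format.
rows = connection.execute("""
    SELECT formats.name, COUNT(*)
    FROM files JOIN formats ON files.puid = formats.puid
    GROUP BY formats.name
    ORDER BY formats.name
""").fetchall()
connection.close()

print(rows)
```

The fixed schema is visible in the `CREATE TABLE` statements: every row in `files` must fit the two declared columns.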
Standard Signatures
Standard signatures are signatures which look at the byte stream as read by the program, that is, without uncompressing it, or manipulating it in any other way first. What you see is what you get.
Standards
A standard is commonly a set of recommendations and principles, that may or may not require absolute compliance. From the International Organization for Standardization's (ISO) perspective, standards provide specifications for products, services, and systems, that help ensure quality, safety, and efficiency. In the digital preservation community, standards help to create a lingua franca as a platform to communicate upon.
State
A previous, current, or future representation of a computer program and its variables in memory.
Sub-series
A further grouping of records underneath a series, for example, by type, form, or content.
SuperCard Pro
A USB controller and write blocker for legacy floppy disk drives, specifically 3.5-inch and 5.25-inch disk drives. One of a handful of alternatives to KryoFlux.
Symbolic Link/Shortcut
The most common use of either a symbolic link (symlink) or shortcut, on Windows or Linux is to point to a file, or executable at some other location on the hard disk than where the symlink is positioned, e.g. to make it easier to run a given application from a particular location.
Tableau
Tableau is a digital forensics solution provider who provide a range of hardware based write blockers for the transfer of digital information.
Tessella
Now Preservica. Tessella is based in Abingdon, Oxfordshire, in the United Kingdom and is responsible for the Preservica System, formerly known as Safety Deposit Box. Preservica also created PRONOM and DROID in partnership with The National Archives, UK, in the early 00s.
The DPC
Digital Preservation Coalition. A UK based organisation that exists to make the digital memory accessible tomorrow. Enabling its members to deliver resilient long-term access to digital content and services, helping them to derive enduring value from digital collections and raising awareness of the attendant strategic, cultural and technological challenges they face. Achieving their aims through advocacy, workforce development, capacity-building and partnership.
The OPF Blog
A blog hosted by the Open Preservation Foundation (OPF) that invites free and open discussion of digital preservation issues and the tools we use in the community. The blog has a low barrier to authoring and is free to sign up to; as such it has a wide and varied range of contributors, and blogs to read through.
The Signal
The Library of Congress blog whose basic intent is to discuss digital stewardship. The blog covers other aspects of computer technology, most especially management, transmission and use of data. It covers new developments that have an impact on digital preservation and access. Contributors come from across the archives and digital preservation community.
TNA
The National Archives, UK. Formerly the Public Record Office. The National Archives is the UK's government archive. It is responsible for the PRONOM database and compatriot tool DROID.
Trove
A portal, search engine, and API that connects metadata about content at Australian GLAM institutions. Trove makes this information findable. Trove is a collaboration between the National Library and Australia's State and Territory libraries.
Trusted Repository
A repository certified as trusted following an audit using the measures defined in ISO standard 16363:2012. A trusted repository will conform to measures surrounding Organizational Infrastructure, Digital Object Management, and Infrastructure and Security Risk Management. Bodies providing audit and assessment must also conform to ISO standard 16919:2014.
TWARC
Twitter archiving (twarc) is a command line tool and Python library for archiving Twitter JSON data. Each tweet is represented as a JSON object that is exactly what was returned from the Twitter API. In addition to letting you collect tweets, twarc can also help you collect data on users and trends.
Twitter
A useful way for those in digital preservation to connect with the community. An active forum with lots of branches out to other resources.
UK Government Web Archive
The web archive of UK government maintained by The National Archives, UK. The archive is an exemplar of why we archive the web, and good case-studies appear, for example, during a machinery of government change. The UK Government Web Archive also archives UK Government Twitter feeds.
UK Web Archive (UKWA)
The UK Web Archive is hosted by the British Library and supported by a number of partners in the UK. Part of the collection is searchable and can be found online. Web archives collected under legal deposit law in the UK have their access restricted to various reading rooms in the UK.
Uncertainty
The state of being uncertain, for example, not knowing when a project is expected to be completed by.
UNESCO
UNESCO is responsible for coordinating international cooperation in education, science, culture and communication. It strengthens the ties between nations and societies, and mobilizes the wider public so that each child and citizen: has access to quality education, a basic human right and an indispensable prerequisite for sustainable development; may grow and live in a cultural environment rich in diversity and dialogue, where heritage serves as a bridge between generations and peoples; can fully benefit from scientific advances; and can enjoy full freedom of expression, the basis of democracy, development and human dignity. UNESCO creates initiatives, and provides guidance, central to maintaining digital archives as part of their mission.
UNESCO definition of integrity
The state of being whole, uncorrupted and free of unauthorised and undocumented changes. (UNESCO, 2003)
Unicode
Unicode (maintained by the Unicode Consortium) is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems. The Unicode Standard contains a repertoire of more than 128,000 characters covering 135 modern and historic scripts. Unicode can be implemented by different character encodings including UTF-8 and UTF-16.
Uniform Distribution
A feature of a cryptographic hash function that makes it difficult to reverse engineer. The range of outputs for any given input is uniformly distributed meaning every possible output has an equal chance of occurring – you won’t see chunks of similar checksums output for similar (not the same) chunks of data.
Unit Testing
The automated testing of source code by breaking it down into its smallest functional components – units. Testing is done by controlling inputs and testing the output and state of the program at various stages.
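A minimal sketch using Python's standard `unittest` module. The unit under test, `normalise_extension`, is a made-up helper; each test controls the input and checks the output.

```python
import unittest

def normalise_extension(filename):
    """Return the lower-cased extension of a filename, or '' if there is none."""
    if "." not in filename:
        return ""
    return filename.rsplit(".", 1)[1].lower()

class NormaliseExtensionTest(unittest.TestCase):
    def test_upper_case_is_lowered(self):
        self.assertEqual(normalise_extension("SCAN.TIF"), "tif")

    def test_no_extension(self):
        self.assertEqual(normalise_extension("README"), "")

# Load and run the tests programmatically so the outcome can be inspected.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(NormaliseExtensionTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())
```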
Unix
A precursor to Linux. Unix is an early operating system that was developed in the 1970s to provide higher-level control of a computing system, e.g. for programming, or for users to script, and run, various sets of commands.
UTF-16
UTF-16 is an encoding for Unicode and uses one 16-bit unit for the characters that were representable in a prior character encoding called UCS-2 and two 16-bit units (4 × 8 bits) to handle each of the additional characters in the Unicode standard.
UTF-8
UTF-8 is an encoding for Unicode and uses one byte for any ASCII character, all of which have the same code values in both UTF-8 and ASCII encoding, and up to four bytes for other characters.
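The byte counts described for UTF-8 and UTF-16 can be checked directly in Python. The four sample characters are arbitrary choices: an ASCII letter, an accented letter, the euro sign, and an emoji that lies outside the range UCS-2 could represent. The `utf-16-le` codec is used so that no byte order mark is added to the count.

```python
samples = {
    "A": "A",                   # ASCII
    "e-acute": "\u00e9",        # accented Latin letter
    "euro": "\u20ac",           # euro sign
    "emoji": "\U0001F600",      # outside the range covered by UCS-2
}

# Record (UTF-8 byte count, UTF-16 byte count) for each character.
lengths = {
    label: (len(char.encode("utf-8")), len(char.encode("utf-16-le")))
    for label, char in samples.items()
}

for label, (utf8_bytes, utf16_bytes) in lengths.items():
    print(label, utf8_bytes, utf16_bytes)
```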
Validity
A file format and its informational content that follows a set of rules defined in its specification is considered to be valid.
VAR
Variable sequence (VAR). Some file format signatures have a moveable sequence specified. These sequences can be anywhere in the file and often require the tool to scan every byte which is slow. A key optimization of file format signatures is trying to remove variable sequences to replace them with fixed byte sequences (BOF or EOF) with larger ranges in which to find them.
Vera PDF
A free and open source tool for the validation of PDF/A files. Vera PDF provides some support for other PDF variants.
Version Control
A mechanism for the storage of text based digital files and all subsequent changes made to them, literally, their versions. Version control systems such as Git, Subversion, and Mercurial are key to software development workflows. Version control enables users to create 'branches' on which to work, and create 'releases' to aid in the maintenance of software released to the public.
W3C
World Wide Web (W3) Consortium. Responsible for a large body of the Internet’s Standards, including HTML and PNG specifications, and RDF. The W3C also engages in education and outreach, develops software and serves as an open forum for discussion about the Web.
WARC
Also known as ISO 28500:2009. A standardised file format for storing the result of a web crawl, the output of a web archiving effort. WARC files may aggregate many WARC records. WARC can encode any other file format, as you'd expect of any potential digital object on the web.
Wayback Machine
A search engine, and API for the archived web. Hosted by the Internet Archive, based in San Francisco.
Web Archiving
Automation of the web-archiving process. A tool crawls a website by looking at all of the links stemming from it and then visiting those one-by-one, potentially doing the same at the next site - figuratively, crawling. Tools distributed with Linux such as Wget can crawl websites and a common tool used in the digital preservation community is called Web Recorder.
Well-Formed
A file format that conforms to a structure defined by its specification is considered to be well-formed.
Wget
GNU Wget is a free utility for downloading files from the web. It supports HTTP, HTTPS, and FTP protocols, as well as retrieval through HTTP proxies. Wget is useful for scripting the download of files on the web via shell scripting tools such as Bash.
What do checksums look like?
Fixed length strings. Hexadecimal characters 0-9, A-F. E.g. `d41d8cd98f00b204e9800998ecf8427e`
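The example string above is the MD5 checksum of empty input (zero bytes), which makes it a useful value to recognise. A short check with Python's standard `hashlib` module, which writes the hexadecimal digits a-f in lower case:

```python
import hashlib

empty_md5 = hashlib.md5(b"").hexdigest()

# Every character of the checksum is a hexadecimal digit.
is_hex = all(character in "0123456789abcdef" for character in empty_md5)

print(empty_md5, len(empty_md5), is_hex)
```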
WiebeTech
WiebeTech is a digital forensics brand that provides a range of hardware based write blockers for the transfer of digital information.
Write Blocker
Hardware or software based protection of the storage system such that content can be read but cannot be written to. Write-blocking is a forensics technique. Write blockers are central to digital forensics where the material collected from storage devices such as hard drives have important evidentiary value and must not have been tampered with.
XMP
Extensible Metadata Platform (XMP), created originally by Adobe, but now an ISO standard. The standard can encode any set of metadata properties. A common use for the encoding is to record the activities that have been performed on a file, for example, on an image, one could record post-digitization efforts to crop (remove excess parts of the image) and de-skew (straighten the image). The XMP can then be looked upon as an audit trail for the file.
Zero-byte Files
A zero-byte file is an entry in the filesystem recording that a file exists at a location in storage, but where the file at that location has not yet been written to, or has had its content cleared.
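Zero-byte files are straightforward to detect by size. A minimal sketch, with the hypothetical helper name `find_zero_byte_files` and a temporary directory created only for the demonstration:

```python
import os
import tempfile

def find_zero_byte_files(root):
    """Return the sorted paths of all files beneath root that contain no data."""
    empty = []
    for folder, _subfolders, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(folder, name)
            if os.path.getsize(path) == 0:
                empty.append(path)
    return sorted(empty)

# Demonstration: one empty file and one file with content.
with tempfile.TemporaryDirectory() as root:
    open(os.path.join(root, "empty.bin"), "wb").close()
    with open(os.path.join(root, "full.bin"), "wb") as handle:
        handle.write(b"data")
    found = [os.path.basename(path) for path in find_zero_byte_files(root)]

print(found)
```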