Preserving and Curating Software

by Scott Wilson on 5 November 2014

Introduction

The content of this article was originally published by the Software Sustainability Institute as two articles released under a CC-BY-NC license: “Digital Preservation And Curation - The Danger Of Overlooking Software” and “Sustainability and preservation framework”. It has been combined into a single briefing and updated by Scott Wilson of OSS Watch.

From preserving research results, to storing photos for the benefit of future generations, the importance of preserving data is gaining widespread acceptance. But what about software?

It’s easy to focus on the preservation of data and other digital objects, like images and music samples, because they are generally seen as end products. The software that is needed to access the preserved data is frequently overlooked in the preservation process. But without the right software, it could be impossible to access the preserved data - which undermines the reason for storing the data in the first place.

This briefing paper is aimed at people who are responsible for preserving data and digital objects on behalf of others. These people typically work for libraries, museums and archives.

Our goal in this paper is to explain why long-term software preservation is necessary, what needs to be understood before software can be preserved and how to get started with the preservation process.

When should you consider software preservation?

Software is used to create, interpret, present, manipulate and manage data. You should consider software preservation whenever one or more of the following statements is true:

1. The software can’t be separated from the data or digital object

In an ideal world, data can be isolated and preserved independently of the software used to create or access it. Sometimes this is not possible. For example, if the software and the data form an integrated model, the data by itself is meaningless. This means that the software must be preserved with the data.

If data is stored in a format that is open and human-readable, then any software that follows that format can be used to read the data. If the data is stored in a format that is closed and arcane, then you must also preserve the software that is used to access it.

2. The software is classified as a research output

The software could fall under a Research Council’s preservation policy. This means that the software must be preserved as a condition of its funding.

For example, the preservation policy of the Bodleian Libraries at the University of Oxford identifies “Digital research outputs such as publications, research data and software produced by researchers at the University of Oxford” as within scope of preservation.

3. The software has intrinsic value

Software can be a valuable historical resource. If the software was the first example of its type, or it was a fundamental part of a historically significant event, then the software has inherent heritage value and should be preserved.

In addition, you may find the following questions helpful when considering whether software should be preserved:

  • Is the software covered by a preservation policy / strategy?
  • Is there a clear purpose in preserving the software?
  • Is there a clear time period for preservation?
  • Do the predicted benefit(s) exceed the predicted cost(s)?
  • Is there motivation for preserving the software?
  • Is the necessary capability available?
  • Is the necessary capacity available?

What are the issues?

Software presents some challenges to those who curate, preserve and archive. In particular, software preservation is difficult, because software is sensitive to changes in its environment.

If there is a change to the computer or operating system on which the software runs, the software will often stop working properly. A catastrophic failure, although serious, is at least easy to spot. What’s more worrying is that a change to the computer or operating system might cause only a subtle, yet important, change in results. Expert knowledge is needed to fully understand how a software component works and the effect that a change may have.
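
One way to catch this kind of subtle drift is to keep a set of reference outputs produced by the software in its original environment and compare them with outputs from the preserved or migrated copy. The sketch below is a minimal illustration in Python, not a prescribed method: the function name `find_drift`, the result labels, the values and the tolerance are all assumptions made for this example.

```python
import math


def find_drift(reference, current, rel_tol=1e-9):
    """Return the labels whose values differ between two result sets.

    `reference` holds results recorded in the original environment and
    `current` holds results from the preserved copy. Both are plain
    dictionaries mapping a label to a number (a hypothetical structure).
    """
    drifted = []
    for label, expected in reference.items():
        actual = current.get(label)
        # A missing result is reported as drift rather than ignored.
        if actual is None or not math.isclose(expected, actual, rel_tol=rel_tol):
            drifted.append(label)
    return sorted(drifted)


# Illustrative values only: the second result has shifted slightly.
reference = {"mean_temperature": 14.2031, "peak_flow": 87.5}
current = {"mean_temperature": 14.2031, "peak_flow": 87.5004}

print(find_drift(reference, current))  # prints ['peak_flow']
```

How tight the tolerance should be depends on the software and on how authentic the preserved results need to be, which is a judgement for someone who understands the software.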

There is a lot of variation in software: it comes in many different forms, it is written in a bewildering range of languages and it can be licensed in many different ways. Further difficulties can arise from the increasing use of web services and the cloud, where software is hosted by external organisations - a practice that is becoming increasingly popular. It generally takes a team of experts to understand all the different facets of software and choose the best route for its preservation.

How should I approach software preservation?

Software preservation should be part of a broader preservation strategy. This strategy should set out what needs to be preserved, and for how long.

The considerations that apply to digital preservation in general (intellectual property, choice of media, backup and recovery, etc.) also apply to software, so the basics of software preservation will be familiar to anyone who already preserves other digital objects.

The approaches to preservation can be summarised as:

  • Technical preservation (techno-centric) - Preserve original hardware and software in same state
  • Emulation (data-centric) - Emulate original hardware and operating environment, keeping software in same state
  • Migration (functionality-centric) - Update software as required to maintain same functionality, porting/transferring before platform obsolescence
  • Cultivation (process-centric) - Keep software ‘alive’ by moving to a more open development model, bringing on board additional contributors and spreading knowledge of the process
  • Hibernation (knowledge-centric) - Preserve the knowledge of how to resuscitate/recreate the exact functionality of the software at a later date
  • Deprecation - Formally retire the software without leaving the option of resuscitation/recreation
  • Procrastination - Do nothing

At OSS Watch we often focus on the “cultivation” option - increasing the sustainability of software projects by engaging a community. However, this is not always possible or desirable.

Preserving the knowledge behind software is as critical as the software itself. Good documentation is important, as is having access to the developers of the software.

A project undertaken by the STFC identified a set of significant properties of software, which can be used as a structured framework to elicit key information from the development team. Developers are keen to contribute to this framework as it helps them organise their documentation and enables their software to live for longer.

Purposes, benefits and scenarios

A key challenge in digital preservation is being able to articulate, and ideally prove, the need for preservation. A clear framework of purposes and benefits facilitates making the case for preservation. The table below shows a range of scenarios for each purpose to give some illustrative examples of where the purpose and accompanying benefits might be relevant.

We recommend that these purposes and benefits be combined with preservation plans regarding data and hardware: digital preservation should be considered in an integrated manner. For example, media obsolescence and recovery is often as much a part of a software preservation project as a data preservation project.

Note that if at all possible, especially where the software is an enabler, it’s advisable to turn a software-preservation problem into a data-preservation problem. Data-preservation problems are generally easier to handle.

Encourage software reuse

Benefits:

  • Reduced development cost
  • Reduced development risk
  • Accelerated development
  • Increased quality and dependability
  • Focused use of specialists
  • Standards compliance
  • Reduced duplication
  • Learning from others
  • Opportunities for commercialisation

Scenarios:

  • Continuing operational use in institution
  • Increasing uptake elsewhere
  • Promoting good software

Achieve legal compliance and accountability

Benefits:

  • Reduced exposure to legal risks
  • Avoidance of liability actions
  • Easily demonstrable compliance lessens audit burden
  • Improved institutional governance
  • Enhanced reputation

Scenarios:

  • Maintaining records or audit trail
  • Demonstrating integrity and authenticity of data and systems
  • Addressing specific contractual requirements
  • Addressing specific regulatory requirements
  • Resolving copyright or patent disputes
  • Addressing the need to revert to earlier versions due to IP settlements
  • Publishing research openly for transparency
  • Publishing research openly as a condition of funding

Create heritage value

Benefits:

  • (Heritage value is generally considered to be of intrinsic value)

Scenarios:

  • Ensuring a complete record of research outputs where software is an intermediate or final output
  • Preserving computing capabilities (software with or without hardware) that are considered to have intrinsic value
  • Supporting the work of museums and archives

Enable continued access to data and services

Benefits for research data and business intelligence:

  • Fewer unintentional errors due to increased scrutiny
  • Reduced deliberate research fraud
  • New insight and knowledge
  • Increased assurance in results

Benefits for systems and services:

  • Current operations maintained
  • Opportunity for improved operations via corrective maintenance
  • Reduced vendor lock-in
  • Improved disaster recovery response
  • Increased organisational resilience
  • Increased reliability

Scenarios:

  • Reproducing and verifying research results
  • Repeating and verifying research results (using the same or similar setup)
  • Reanalysing data in the light of new theories
  • Reusing data in combination with future data
  • ‘Squeezing’ additional value from data
  • Verifying data integrity
  • Identifying new use cases from new questions
  • Maintaining legacy systems (including hardware)
  • Ensuring business continuity
  • Avoiding software obsolescence
  • Supporting forensic analysis (e.g. for security or data protection purposes)
  • Tracking down errors in results arising from flawed analysis

Software preservation and Open Source

For future researchers to be able to use the software, it’s critical that they have permission to access the code and to make any changes necessary to get it working (particularly as the rest of the software ecosystem will have moved on). It’s also important that they can do so without any sort of implied warranty from the original researchers. Making software available as Free Software or Open Source Software is a simple way of ensuring this is the case; otherwise, it may be unclear to future researchers whether or not they can actually make use of the software in a useful way.

In terms of licensing, using one of the most widely used licenses is likely to be the best strategy when considering long-term preservation.

If the software cannot be made open source, for whatever reason, the copyright owner still needs to be clearly identified.

Preserving software revision history

As well as preserving the source code of the software, it’s worth considering whether the history of changes to the software is also in scope for preservation. Most modern software is developed using a revision control system (such as Git or Subversion) that keeps track of every change made to the software over time. This means it’s possible to preserve not just the state of software at the time of preservation, but also its entire history up to that point. This may have future cultural heritage value, as it makes it possible to see the contributions of individuals, as well as the responses of the software developers to external events as expressed through changes to the code. It also makes it clearer when and why particular changes to the code were made.
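
As a rough sketch of what capturing the full history can look like in practice, the Python below wraps two standard Git commands: `git clone --mirror`, which copies every branch and tag, and `git bundle create ... --all`, which packs the whole history into a single file that can be archived. The function names, repository URL and paths are hypothetical, and the sketch assumes Git is installed; it is one possible approach rather than a recommended procedure.

```python
import subprocess
from pathlib import Path


def history_commands(repo_url, work_dir, bundle_path):
    """Build the Git commands that capture a repository's full history.

    `bundle_path` should be absolute, because the second command runs
    inside the mirror directory (via `git -C`).
    """
    mirror = str(Path(work_dir) / "mirror.git")
    return [
        # A mirror clone copies all branches, tags and other refs.
        ["git", "clone", "--mirror", repo_url, mirror],
        # A bundle packs every ref and its history into one archivable file.
        ["git", "-C", mirror, "bundle", "create", str(bundle_path), "--all"],
    ]


def preserve_history(repo_url, work_dir, bundle_path):
    """Run the commands in order, stopping at the first failure."""
    for command in history_commands(repo_url, work_dir, bundle_path):
        subprocess.run(command, check=True)


# Hypothetical repository and locations, printed here without running Git.
for command in history_commands(
    "https://example.org/project.git",
    "/tmp/preserve",
    "/tmp/preserve/project.bundle",
):
    print(" ".join(command))
```

A bundle created this way can later be checked with `git bundle verify` and restored with an ordinary `git clone` of the bundle file.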

Virtualisation and software containers

One approach to technical preservation is to preserve not only the software source code and documentation, but also a completely configured software stack that is ready for deployment. This should make running the software in the future much more reliable and involve less manual effort.

With Docker, for example, it’s now possible to share and preserve a complete environment, including all required dependencies and infrastructure, without the overhead of a complete Virtual Appliance.
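
A minimal sketch of how a container image might be exported for archiving is shown below. It uses `docker save`, which writes an image to a tar archive, and records a SHA-256 checksum alongside the archive so that its integrity can be checked later. The image name, file paths and function names are hypothetical, and the checksum step is an addition for illustration rather than something Docker requires.

```python
import hashlib
import subprocess
from pathlib import Path


def save_command(image, archive_path):
    """Build the `docker save` command that writes an image to a tar file."""
    return ["docker", "save", "--output", str(archive_path), image]


def sha256_of(path, chunk_size=1024 * 1024):
    """Compute a SHA-256 checksum, reading the file in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def export_image(image, archive_path):
    """Export the image and store its checksum next to the archive."""
    subprocess.run(save_command(image, archive_path), check=True)
    checksum = sha256_of(archive_path)
    # Same "checksum  filename" layout that common checksum tools read.
    Path(f"{archive_path}.sha256").write_text(
        f"{checksum}  {Path(archive_path).name}\n"
    )
    return checksum


# Hypothetical image name and destination, printed without running Docker.
print(" ".join(save_command("myproject:1.0", "/archive/myproject-1.0.tar")))
```

The archive can later be restored with `docker load`, assuming a compatible container runtime is still available at that point.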

Key questions to ask yourself

When considering software preservation, you should consider the following questions:

  • Is there still knowledge and expertise to handle and run the software?
  • How authentic does the preserved software need to be?
  • How adequate does the preserved software need to be: should it perform exactly as the original, the same but with only minor deviations, or perform the core functionality only?
  • How much access do you have? (Owner, developer, access to source code, access to hardware, user)
  • Do you have the necessary Intellectual Property Rights (IPR)?
  • What do you need to preserve? (A few major pieces of functionality; most of the functionality, but tolerant of minor deviations; all functionality, but fixing errors when found; or must perform exactly as the original)
  • What is your likely effort profile? (Something or nothing now, something or nothing in the future)
  • What is the maintainability of underlying hardware?
  • Is maintaining integrity and/or authenticity an important requirement?
  • How long do you want to preserve it for?
  • Can you afford it?
  • Are you also interested in further development or maintenance?
  • What development effort has been invested into the software so far?
  • Is the software open source? Could it be made open source?
  • Are there any barriers to making it open source?
  • Is the proposed approach appropriate to every purpose?
  • What are the relative advantages and disadvantages of each approach under consideration?

Further Reading

Links: