Why doesn’t your Organization already have an Integrated Framework for the
Publication and Preservation of all Data Products Necessary to support a Corporate Memory?
1. Introduction
An organization should provide a forum to explore the issues surrounding the preservation, identification and citation of data artifacts. These artifacts have become crucial as lessons learned, so that history does not repeat itself negatively on TTS projects, or on any other project within the corporation. Such a forum would be a small but critical step towards the ultimate goal of identifying the right practices, resources and incentives to ensure that the entire data lifecycle within TTS projects is properly captured and described in the coming era of data-intensive TTS projects.
2. The Missing Data
A look at the data published over the past decades uncovers several examples of teams putting together various data stores and giving out access to the datasets they have collected or created. While this shows a commendable willingness on the part of these teams to share their work and further disseminate their lessons learned, we have seen too many examples of data stores that disappear or fall into disrepair as people move on because of corporate realignment. In some cases we have seen entire data stores lost. Surely we can do better than this.
The first question to answer is: why are we as a corporation doing this? I believe that there is a desire within organizations to package and present their own work in a way that they feel is appropriate. Given the high level of technical savvy of our TTS engineers, there has never been a short-term barrier to putting up data stores, nor has anyone in the TTS community perceived that as being tricky or worthy of reward. This has probably made it difficult to persuade people that there is an unmet curatorial challenge that we need to confront, and a preservation need that is not currently met by our existing infrastructure.
Even when data is stored in what one would consider authoritative, trusted archives, there is currently no guarantee that its location will persist in the long run. Shifting technologies, economic realities and organizational changes often force resources to be moved or, even worse, mothballed if they are not considered essential in today’s environment. Thus, we can only realistically take implicit promises of long-term data archival as what they are: well-intentioned plans that are contingent on a number of factors, some of which are out of our control. At the same time, we should take steps to ensure that our system of archiving, sharing and linking resources is as resilient as it can be, while we keep a realistic view of the technological and economic environment supporting these efforts.
3. Preservation, Persistence and Versioning
One of the first steps we should take in organizing our network of project data is to future-proof our nomenclature system by assigning persistent data identifiers to data artifacts that we want to preserve but whose archival and curation arrangements are expected to change. Coming to concrete decisions on how this should be implemented is not straightforward. There are several questions surrounding the technical and social aspects of minting persistent data identifiers:
• If we take the broadest view of preservation as an essential step in support of the repeatability of the project process, then we should archive each data artifact and assign it a unique persistent data identifier. If the dataset is recreated as a result of an updated milestone or lessons learned, the result should be considered a new version of the data artifact and be assigned a different persistent data identifier. Under this scenario, an archive would need to freeze and uniquely identify every version of every data artifact it stores, a very costly and unlikely scenario for most archives.
• Another option is to freeze only the versions that are cited. Preserving every possible version of a data artifact puts a serious burden on archives and is not necessarily a realistic model, but freezing a particular version when we know that it is being cited seems possible. The issue then becomes: how do we know that data artifact X, downloaded from archive Y at time T, should be frozen if the group or software that will cite it will not do so for another two years?
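The second option can be illustrated with a minimal sketch. It assumes an in-memory archive that keeps only the latest version of each artifact and freezes a copy at the moment a citation is requested. The `Archive` class, the `pdi:` prefix and the identifier layout are hypothetical choices made for illustration, not an existing corporate standard.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Archive:
    """Toy in-memory archive; every name here is a hypothetical example."""
    prefix: str = "pdi"
    latest: dict = field(default_factory=dict)  # artifact name -> current bytes
    frozen: dict = field(default_factory=dict)  # identifier -> frozen bytes

    def mint(self, name: str, content: bytes) -> str:
        """Derive an identifier from the artifact name and a content digest."""
        digest = hashlib.sha256(content).hexdigest()[:12]
        return f"{self.prefix}:{name}@{digest}"

    def store(self, name: str, content: bytes) -> str:
        """Store the latest version; an older unfrozen version is overwritten."""
        self.latest[name] = content
        return self.mint(name, content)

    def cite(self, name: str) -> str:
        """Freeze the current version at citation time and return its identifier."""
        content = self.latest[name]
        identifier = self.mint(name, content)
        self.frozen.setdefault(identifier, content)
        return identifier
```

Because the identifier includes a digest of the content, a recreated dataset automatically receives a different identifier. The sketch does not solve the timing problem raised above: a version that is downloaded but only cited two years later would still be lost unless `cite` is called at download time.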
Archive managers and curators of large datasets have repeatedly mentioned the practical difficulty of the first approach, and have pointed out that requests for older versions of data products are few and far between. On the other hand, derived datasets such as catalogs and data mashups are updated and versioned on a regular basis.
One additional issue when minting persistent data identifiers is the right level of granularity for an identifier. Should we assign a persistent data identifier to a group of artifacts, or to individual artifacts? To a metadata data store, or to the individual data store? To an aggregation of the data artifacts used, or to the hundreds of individual files in it?
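One way to keep both levels of granularity available is to identify individual files by their content and to derive the identifier of an aggregation from the identifiers of its members. The sketch below assumes this approach; the function names and the `pdi:file:` and `pdi:set:` prefixes are hypothetical.

```python
import hashlib


def artifact_id(content: bytes) -> str:
    """Identifier for a single file, derived from its content."""
    return "pdi:file:" + hashlib.sha256(content).hexdigest()[:12]


def collection_id(member_ids) -> str:
    """Identifier for an aggregation, derived from its sorted member identifiers."""
    joined = "\n".join(sorted(member_ids)).encode("utf-8")
    return "pdi:set:" + hashlib.sha256(joined).hexdigest()[:12]
```

With this design a project can cite the aggregation with a single identifier, while each file in it remains individually addressable. Changing any member changes the identifier of the aggregation, which ties granularity back to the versioning question.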
There is probably no one-size-fits-all answer to the questions of granularity, versioning and preservation. For the time being, a mechanism that supports the citation of complete data stores is good enough to deploy. As long as it can be refined later, when experience reveals concrete demand, we should not delay its use.
4. Citing and Linking Data
What should data citations look like? There are several options:
• Cite data as we cite articles: assign basic metadata (project idnum, tspr num) and a persistent identifier to each data store, and list the data stores in the reference section.
• Cite data stores as we cite websites: find out what their (hopefully persistent) URI is and mention it in your own data store (UNC name, etc.).
• Have a data references section: similar to a bibliographic reference list, such a section would list, unambiguously and with standard formatting, all the data artifacts that were used in the project. This option would make it easy for publishers, curators and aggregators to identify data citations in much the same way as we identify bibliographic citations today.
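A data references section with standard formatting could be generated from structured records. The following is a minimal sketch under stated assumptions: the field names follow the metadata mentioned above (project idnum, tspr num), while the `DataReference` class, the `[D1]` label style and the output layout are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataReference:
    """One entry in a data references section (hypothetical layout)."""
    project_idnum: str
    tspr_num: str
    title: str
    identifier: str  # persistent data identifier
    location: str    # current URI or UNC name

    def format(self, index: int) -> str:
        """Render the entry in a fixed, machine-recognizable format."""
        return (f"[D{index}] Project {self.project_idnum}, TSPR {self.tspr_num}. "
                f"{self.title}. {self.identifier}. {self.location}")


def data_references(refs):
    """Return formatted entries in a stable order, numbered from 1."""
    ordered = sorted(refs, key=lambda r: (r.project_idnum, r.tspr_num, r.identifier))
    return [ref.format(i) for i, ref in enumerate(ordered, start=1)]
```

A fixed layout of this kind is what would let curators and aggregators pick out data citations automatically, in the same way that bibliographic entries are recognized today.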
If we establish a consistent nomenclature for referring and linking to data artifacts, then recognizing and identifying citations to them will be a much easier task. Hence it is essential that we as a corporation agree on well-defined standards for data identifiers and enable the creation of a corresponding registry to support this effort.
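The role of such a registry can be sketched as a lookup from a persistent identifier to the current location of the data, so that citations keep working when a data store is moved. This is an illustrative in-memory sketch only; the `IdentifierRegistry` class and its method names are assumptions, not a description of an existing system.

```python
class IdentifierRegistry:
    """Maps persistent identifiers to their current location (hypothetical)."""

    def __init__(self):
        # identifier -> list of locations, oldest first
        self._locations = {}

    def register(self, identifier: str, location: str) -> None:
        """Record a new identifier; an identifier is never reassigned."""
        if identifier in self._locations:
            raise ValueError(f"identifier already registered: {identifier}")
        self._locations[identifier] = [location]

    def relocate(self, identifier: str, new_location: str) -> None:
        """Record that the data has moved; the identifier stays the same."""
        self._locations[identifier].append(new_location)

    def resolve(self, identifier: str) -> str:
        """Return the most recent known location for the identifier."""
        return self._locations[identifier][-1]

    def history(self, identifier: str) -> list:
        """Return a copy of all known locations, oldest first."""
        return list(self._locations[identifier])
```

Citations would record only the identifier. When a corporate realignment forces a data store to move, a single call to `relocate` keeps every existing citation resolvable.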
5. Conclusions
There is a need for curated datasets, preserved indefinitely for corporate memory purposes. The details of a fully fledged data preservation environment are difficult to discern at this point, but there are sensible places to begin, and from such places the mechanisms can evolve. There are many fruitful opportunities for interested parties to engage in discussions about the promises and the challenges of broadly scaled data preservation.