The Macrosite for News, Analysis and Opinion about the Future of the Internet
AllenFromMinneapolis
IQ Crew
Friday December 9, 2011 10:34:04 AM

Why doesn’t your Organization already have an Integrated Framework for the Publication and Preservation of all Data Products Necessary to support a Corporate Memory?

 

1. Introduction

An organization should provide a forum to explore the issues surrounding the preservation, identification and citation of data artifacts. These artifacts have become crucial as lessons learned, so that history does not repeat itself negatively on TTS projects, or on any other projects within the corporation. This would be a small but critical step towards the ultimate goal of identifying the right practices, resources, and incentives to ensure that the entire data lifecycle within TTS projects is properly captured and described in the coming era of data-intensive TTS projects.

 

2. The Missing Data

Looking at data published over the past decades, I find several examples of teams putting together various data stores and giving out access to the datasets they have collected or created.  While this shows a commendable willingness on the part of these teams to share their work and further disseminate their lessons learned, we have seen too many examples of data stores that disappear or fall into disrepair as people move on because of corporate realignment. In some cases we have seen entire data stores lost. Surely we can do better than this.

 

The first question to answer is: why are we as a corporation doing this?  I believe that there is a desire within organizations to package and present their work in a way that they feel is appropriate.  Given the high level of technical savvy of our TTS engineers, there has never been a short-term barrier to putting up data stores, nor has anyone in the TTS community perceived doing so as tricky or worthy of reward. This has probably made it difficult to persuade people that there is an unmet curatorial challenge that we need to confront, and a preservation need which is not currently matched by our existing infrastructure.

 

Even when data is stored in what one would consider authoritative, trusted archives, there is currently no guarantee that its location will persist in the long run.  Shifting technologies, economic realities and organizational changes often force resources to be moved or, even worse, mothballed if they are not considered essential in today’s environment. Thus, we can only realistically take implicit promises of long-term data archival as what they are: well-intentioned plans which are contingent on a number of factors, some of which are out of our control. At the same time, we should take steps to ensure that our system of archiving, sharing and linking resources is as resilient as it can be, while we keep a realistic view of the technological and economic environment supporting these efforts.

 

3. Preservation, Persistence and Versioning

One of the first steps we should take in organizing our network of project data is to future-proof our nomenclature system by assigning persistent data identifiers to data artifacts that we want to preserve but whose archival and curation are expected to change.  Coming to concrete decisions on how this should be implemented is not as straightforward. There are several questions surrounding the technical and social aspects of minting persistent data identifiers:

 

• If we take the broadest view of preservation as an essential step in support of the repeatability of the project process, then we should archive and assign a unique persistent data identifier to each data artifact. If the dataset is recreated as a result of an updated milestone or lessons learned, this should be considered a new version of the data artifact and be assigned a different persistent data identifier. Thus, under this scenario, an archive would need to freeze and uniquely identify all versions of the data artifacts it stores, a very costly and unlikely scenario for most archives.

 

 

• Preserving every possible version of a data artifact puts serious burdens on archives and is not necessarily a realistic model. Another option, freezing a particular version when we know that it is being cited, seems possible. The issue then becomes: how do we know that data artifact X, downloaded from archive Y at time T, should be frozen if the group or software that accessed it will not cite it for another two years?
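
As a rough illustration of the second option, the sketch below shows one way an archive could mint a distinct persistent identifier for each version of a data artifact and freeze only the versions that have been cited. It is a minimal sketch, not a description of any existing system: the `Archive` class, its method names, and the `pid:` identifier scheme based on a content hash are all assumptions made for the example.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Archive:
    """Toy in-memory archive (hypothetical; for illustration only)."""

    versions: dict = field(default_factory=dict)  # identifier -> stored bytes
    frozen: set = field(default_factory=set)      # identifiers that have been cited

    def deposit(self, name: str, content: bytes) -> str:
        """Store one version and return its persistent identifier.

        The identifier is derived from the content, so a recreated
        dataset with different content gets a different identifier.
        """
        digest = hashlib.sha256(content).hexdigest()[:12]
        pid = f"pid:{name}:{digest}"  # assumed identifier scheme
        self.versions[pid] = content
        return pid

    def cite(self, pid: str) -> str:
        """Record a citation, which freezes that exact version."""
        if pid not in self.versions:
            raise KeyError(f"unknown identifier: {pid}")
        self.frozen.add(pid)
        return pid

    def purge_uncited(self) -> list:
        """Drop versions nobody has cited; frozen versions are kept."""
        removed = [pid for pid in self.versions if pid not in self.frozen]
        for pid in removed:
            del self.versions[pid]
        return removed


archive = Archive()
v1 = archive.deposit("lessons-learned", b"milestone 1 data")
v2 = archive.deposit("lessons-learned", b"milestone 2 data")
archive.cite(v1)                   # v1 is now frozen
removed = archive.purge_uncited()  # only the uncited v2 is dropped
```

The sketch does not solve the timing problem raised above: a version that will only be cited two years from now looks uncited today, so a real policy would need a grace period or an explicit request to freeze at download time.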

 

Archive managers and curators of large datasets have repeatedly mentioned the practical difficulty of the first approach, and have pointed out that requests for older versions of data products are few and far between. On the other hand, derived datasets such as catalogs and data mashups are updated and versioned on a regular basis.

 

One additional issue when considering minting persistent data identifiers is what the right level of granularity should be for an identifier. Should we assign a persistent data identifier to a group of artifacts, or to individual artifacts?  To a metadata data store or to the individual data store? To an aggregation of the data artifacts used, or to the hundreds of individual files in it?

 

It is probably the case that there is no one-size-fits-all answer to the questions of granularity, versioning and preservation.  For the time being, as long as there is a mechanism which supports the citation of complete data stores, it is deployable. And as long as it can be refined later, when experience provides concrete demand, we should not delay its use.
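
One way to start with complete data stores and refine later is to give the store a single identifier derived from the identifiers of its members. The following is only a sketch under assumed names (`artifact_pid`, `store_pid`, and the `pid:` prefixes are hypothetical): a citation can point at the whole store today, while the individual artifacts remain addressable if finer granularity is needed later.

```python
import hashlib


def artifact_pid(content: bytes) -> str:
    """Identifier for a single artifact, derived from its content."""
    return "pid:artifact:" + hashlib.sha256(content).hexdigest()[:12]


def store_pid(member_pids) -> str:
    """Identifier for a whole data store, derived from its members.

    Members are sorted first, so the result does not depend on the
    order in which the artifacts are listed.
    """
    joined = "\n".join(sorted(member_pids)).encode("utf-8")
    return "pid:store:" + hashlib.sha256(joined).hexdigest()[:12]


# Invented sample contents standing in for the files in a data store.
files = [b"requirements.csv", b"test-results.csv", b"lessons.txt"]
members = [artifact_pid(f) for f in files]
whole_store = store_pid(members)
```

Because the store identifier depends on its members, adding or changing a file yields a new store identifier, which matches the idea that a recreated dataset is a new version.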

 

4. Citing and Linking Data

What should data citations look like? There are several options:

 

• Cite data as we cite articles: assign basic metadata to data stores (project idnum, tspr num) and a persistent identifier, and list them in the reference section.

 

• Cite data stores as we cite websites: find out what their (hopefully persistent) URI is and mention it in your data store (unc name, etc.).

 

• Have a data references section: similar to a bibliographic reference list, such a section would list in an unambiguous way (and using standard formatting) all the data artifacts that were used in the project.  This option would make it easy for publishers, curators and aggregators to identify data citations in a way similar to how we identify bibliographic citations today.

 

If we establish a consistent nomenclature for referring to and linking to data artifacts, then citing data will be a much easier task. Hence it is essential that we as a corporation agree on well-defined standards for data identifiers and enable the creation of a corresponding registry to support this effort. This will enable citations to data artifacts to be recognized and identified.
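
To make the idea of a registry concrete, here is a minimal sketch, assuming an in-memory registry and invented sample values. The `DataRecord` fields mirror the metadata mentioned above (project idnum, tspr num); the class names, the `[D1]` reference format, and the example locations are assumptions, not an agreed standard.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DataRecord:
    pid: str            # persistent identifier; never changes
    title: str
    project_idnum: str
    tspr_num: str
    location: str       # current location; expected to change over time


class Registry:
    """Maps persistent identifiers to records (hypothetical sketch)."""

    def __init__(self):
        self._records = {}

    def register(self, record: DataRecord) -> None:
        if record.pid in self._records:
            raise ValueError(f"identifier already registered: {record.pid}")
        self._records[record.pid] = record

    def relocate(self, pid: str, new_location: str) -> None:
        """Update where the data lives without changing its identifier."""
        self._records[pid] = replace(self._records[pid], location=new_location)

    def resolve(self, pid: str) -> str:
        """Return the current location for an identifier."""
        return self._records[pid].location

    def reference(self, number: int, pid: str) -> str:
        """Format one entry for a data references section."""
        r = self._records[pid]
        return (f"[D{number}] {r.title}. Project {r.project_idnum}, "
                f"TSPR {r.tspr_num}. {r.pid}")


registry = Registry()
registry.register(DataRecord(
    pid="pid:store:example01",           # invented sample values
    title="Integration test results",
    project_idnum="P-001",
    tspr_num="T-042",
    location=r"\\oldserver\projects\p001",
))
registry.relocate("pid:store:example01", r"\\newserver\archive\p001")
entry = registry.reference(1, "pid:store:example01")
```

The point of the design is that a citation carries only the identifier; when a data store moves during a realignment, the registry entry is updated and existing citations keep resolving.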

 

 

 

5. Conclusions

There is a need for curated datasets, preserved indefinitely for corporate memory purposes.  The details of a fully-fledged data preservation environment are difficult to discern at this point, but there are sensible places to begin, and from such places the mechanisms can evolve.  There are many fruitful opportunities for interested parties to engage in discussions about the promises and the challenges of broadly-scaled data preservation.



The ThinkerNet does not reflect the views of TechWeb. The ThinkerNet is an informal means of communication to members and visitors of the Internet Evolution site. Individual authors are chosen by Internet Evolution to blog. Neither Internet Evolution nor TechWeb assume responsibility for comments, claims, or opinions made by authors and ThinkerNet bloggers. They are no substitute for your own research and should not be relied upon for trading or any other purpose.