Tuesday, February 7, 2012

dSRCI - Citations

Part 1 - Introduction
Part 2 - .sci Top Level Domain
Part 3 - Citations

In my past two posts I have introduced the distributed Scientific Research Collaboration Infrastructure (dSRCI) and then discussed my proposal for a new, perpetual Top Level Domain (TLD), called .sci, for scientists to use to uniquely identify their contributions on the internet. The second piece of this puzzle (arguably more important than the .sci TLD) is a consistent data standard so posts, papers, and articles – what I am calling “Artifacts of Collaboration” (AoCs) – written by scientists can be crawled, indexed, traced, analyzed, and rated, regardless of where they may be created, stored, moved, or distributed. This data standard will consist of the following parts:
  • A consistent, universal citation system
  • A standardized vocabulary for
    • Relationships between people
    • Relationships between artifacts
    • Relationships between the people and the artifacts
    • Topics discussed

In the rest of this post, I will discuss citation systems, how they apply to what we are trying to accomplish here, and a simplified citation system which I think will make it much easier for scientists to use regardless of where they may be posting.

Citation Encoding:
Ahh, a consistent, universal, usable citation system. I have been stalking this mythical beast for years now. Everyone I ask inevitably says, “Just use Dublin Core” (DC) as if DC actually were a consistent citation standard. To my mind – and for the purposes of this project – DC is meaningless because it can mean anything. DC is so “flexible” that anyone can use any of its tags and attributes for just about anything they want. I mean, what the heck does “creator” mean anyway? The author? The publisher? The producer of a film? What? For some people, every citation could reasonably have the same three-letter value for this attribute. I have hunted and searched for any explicit definition of how to use DC in a consistent manner that has been widely accepted, but to no avail. It seems everyone who is designing a system to use DC just makes up their own interpretation of what goes where and what it means. So, in the end, DC is about as useful as simply saying
Random garbledy gook that we let every individual program parse in its own way
If someone out there can prove me wrong, then please do so. I would be so ecstatic that I would ride my scooter all the way out to where you are and hug your neck.

With all the Library Scientists out there who know so much about what is necessary for a good citation, I would think someone would have done all this by now. And, with how darned cooperative librarians tend to be, I would have thought that any good system would have become widely accepted by now as well. However, again, I haven’t found anything. If I have to, I can make up something myself. If people don’t like it they can suggest something different. But I am not in much of a mood for bickering back and forth on esoterica or theories. Some may suggest that I should at least base my new system upon Dublin Core. However, I now believe that DC has gone the way of Unix. There are so many deviations that it has now simply become a deviant.

I am also aware of the Zotero project and how they use special RDF formatted citation information (often using DC) to download citations from websites that provide it. However, I would prefer that the citations for this system not require an entire paragraph of RDF to cite a ten-word sentence. And, I would prefer that they also be relatively human readable when cut and pasted into a text file or other document.

Both of these existing types of citation systems are certainly in wide use but I still feel they fall short of what will be necessary to facilitate the kind of collaboration we desire. DC is too flexible - thus too ethereal - and current RDF standards are too verbose. So, I propose a new citation system be created. A system that does an end run around the problems of inconsistent citation standards and verbosity. In order for this new citation system to be successful, I believe it must meet the following requirements:
  • There must be a unique identification for each contributor and each of their individual contributions.
  • These citations should be embedded within the documents, contributions, and other AoCs (Artifacts of Collaboration).

I have covered the issue of creating a unique, permanent identifier for each scientist. Now I will address the issue of creating a unique identifier for each contribution from each scientist. Remember, all we are looking for is a clean and simple way to differentiate between all the different contributions made by a particular scientist. It seems to me that the simplest, easiest means to do this is to simply apply a date and time stamp with one-second resolution. (If you know of any scientists who can make more than one significant contribution in a single second let me know. I would really like to meet them.) So, the URI for a contribution would be something like “scientistnameYYYY.sci/YYYY-MM-DD_HH-MM-SS” or perhaps “”

Now, the first impulse of many is to try to imagine using either of these URIs as URLs, which then leads one to imagine all those thousands of subdirectories, one for each contribution. But remember, a URI is merely an identifier. It does not have to resolve to a URL. In other words, there does not need to be an actual web page for each of these URIs. It is just a label. Also remember, I expect there to be copies of these documents/contributions/AoCs spread throughout the internet. The identifier is merely a means for search engines to … well … identify each copy of each contribution, index it, and make it available to researchers.
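As a rough illustration of how little machinery this needs, here is a minimal Python sketch that builds an identifier in the “scientistnameYYYY.sci/YYYY-MM-DD_HH-MM-SS” form. The function name contribution_id, the example domain janedoe1970.sci, and the decision to normalize every timestamp to UTC are all assumptions made for the example; nothing above fixes a time zone or a particular implementation.

```python
from datetime import datetime, timezone


def contribution_id(domain: str, when: datetime) -> str:
    """Build an identifier such as 'janedoe1970.sci/2012-02-07_14-30-05'."""
    # Assumption: timestamps are converted to UTC so that the same moment
    # always produces the same identifier, wherever the author happens to be.
    stamp = when.astimezone(timezone.utc).strftime("%Y-%m-%d_%H-%M-%S")
    return f"{domain}/{stamp}"


example = contribution_id(
    "janedoe1970.sci",  # hypothetical scientist domain
    datetime(2012, 2, 7, 14, 30, 5, tzinfo=timezone.utc),
)
print(example)  # janedoe1970.sci/2012-02-07_14-30-05
```

Note that the result is only a label. Nothing in the sketch creates a page or a directory for it.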

Sometimes people would like to refer to a specific section within a document. This can be accomplished using this simplified citation system simply by appending a URI “fragment” to the citation URI. A “fragment” is really not much more than an additional string that starts with a pound/hash/number symbol (#). So a citation indicating a specific paragraph of a contribution might look like this: “scientistnameYYYY.sci/YYYY-MM-DD_HH-MM-SS#paraXXX”. I will have to do some research to see if there is already a standard system for designating these types of within-document locations. I know Adobe uses something similar for indicating the locations of annotations within the text of .PDF files. I will have to see how that works and if it is available to use.
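To show that a citation with an optional fragment stays easy for software to take apart, here is a small Python sketch. The pattern is an assumption on my part: it supposes lowercase .sci domains and fragments made of letters, digits, underscores, and hyphens, and the names Citation and parse_citation are invented for the example. The actual fragment scheme is still undecided.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional


class Citation(NamedTuple):
    domain: str
    timestamp: datetime
    fragment: Optional[str]  # None when the citation names the whole contribution


# Assumed shape: <name>.sci/YYYY-MM-DD_HH-MM-SS with an optional #fragment.
CITATION_RE = re.compile(
    r"^(?P<domain>[a-z0-9-]+\.sci)/"
    r"(?P<stamp>\d{4}-\d{2}-\d{2}_\d{2}-\d{2}-\d{2})"
    r"(?:#(?P<fragment>[A-Za-z0-9_-]+))?$"
)


def parse_citation(text: str) -> Citation:
    """Split a citation into its domain, timestamp, and optional fragment."""
    match = CITATION_RE.match(text)
    if match is None:
        raise ValueError(f"not a recognizable citation: {text!r}")
    # strptime also rejects impossible dates such as month 13.
    stamp = datetime.strptime(match["stamp"], "%Y-%m-%d_%H-%M-%S")
    return Citation(match["domain"], stamp, match["fragment"])


cited = parse_citation("janedoe1970.sci/2012-02-07_14-30-05#para12")
print(cited.domain, cited.timestamp.isoformat(), cited.fragment)
```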

Embedded Citations:
Another problem with current citation systems is that most of them do not embed the citation directly into the item being cited. The citation is applied as an external label. RDF points to a document and says, “That document has this citation.” Of course anyone can create another RDF tag that claims the document has yet a different citation. And if the document is moved or its server goes down, then all those RDF tags become worthless. When the citation is embedded within the artifact itself (whether it be a .PDF document, word processing document, web page, or just a comment on a blog) then that artifact can be moved or copied almost anywhere and it can still be found and indexed by search engines. (Naturally, if the only copy lies behind a paywall then we have a problem.) Before search engines began digging into the actual document contents, this would not have been a viable solution. But now that Google and Bing index every word in nearly every document posted on the internet, there is no excuse to still rely on labels that are metaphorically merely laid down near to documents rather than being permanently attached.

Pictures have EXIF data, PDF files have embedded metadata, and so do most word processing file formats. Currently, there is no means to easily embed the appropriate unique citation within these documents other than manually going to the metadata dialog within the software for each and every individual file. And, sure, people can manually type out one of these citations in their blog or forum posts, but who wants to go to that trouble? That is what computers are for. Later, in my post on software, I will address a means to make this process easier to do without even really thinking about it.
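Because an embedded citation is just text inside the artifact, anything that can read the text can find it. The Python sketch below shows one way an indexer might pull citations out of a blob of text. The regular expression, the function name find_citations, and the sample sentence with its two scientist domains are all hypothetical, and this is certainly not how Google or Bing work internally.

```python
import re

# Assumed shape of an embedded citation; see the identifier format above.
EMBEDDED_RE = re.compile(
    r"\b[a-z0-9-]+\.sci/\d{4}-\d{2}-\d{2}_\d{2}-\d{2}-\d{2}(?:#[A-Za-z0-9_-]+)?"
)


def find_citations(text: str) -> list[str]:
    """Return every embedded citation found in the text, in order of appearance."""
    # The pattern has no capturing groups, so findall returns whole matches.
    return EMBEDDED_RE.findall(text)


sample = (
    "As argued in janedoe1970.sci/2012-02-07_14-30-05#para12, and again in "
    "johnroe1965.sci/2011-11-30_08-15-00."
)
found = find_citations(sample)
print(found)
# ['janedoe1970.sci/2012-02-07_14-30-05#para12', 'johnroe1965.sci/2011-11-30_08-15-00']
```

The same scan works whether the text came from a web page, a blog comment, or the extracted contents of a PDF, which is the whole point of embedding the citation in the artifact.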

What about all that other citation information?
That is a good question. Do we really need it in the internet age? Regular citation information is primarily designed to make it possible – though not necessarily easy – for people to find that document in a regular library. Have you been in a library lately? Even at the library, everyone uses computers to look things up. But they have to type in the various bits and pieces of the “legacy” citation, sort through all the false hits, try FirstName, LastName, then LastName, FirstName, then see if there was a middle initial, then hope the index they are using indexed that document, but never be sure if the document was there but they just didn’t use the right search terms. And this is when they already know exactly which document they are looking for. What a relief it would be to just type in “scientistnameYYYY.sci/YYYY-MM-DD_HH-MM-SS” – or better yet, scan a QR code – and go right to the desired document!

I am not claiming that we should do without “legacy” citations altogether. Merely that they should be supplementary. I also feel strongly that these “legacy” citations should be consistent, easy to read, and embedded within the document just like the abbreviated citations I proposed above. There is still the problem of a consistent standard. I will work on that some other time, perhaps.

I believe this new citation system, as simple as it is, can really go a long way toward creating the web of interconnected collaboration we are looking for. But it can only do that if it is consistently inserted in every Artifact of Collaboration as it is created. I will discuss how to ensure this happens in my post on the software necessary to make all this happen. First, however, I need to discuss the terms – or vocabulary – that can be used to describe the relationships between all the collaborators and these “artifacts.” This is what I will cover in my next post.

The contents of this post are Copyright © 2012 by Grant Sheridan Robertson.
