Monday, September 24, 2012

As-Built Document Flow

Here is a Microsoft Visio diagram I created to illustrate the convoluted processes by which documents move around within my department and finally end up in what is called an "As-Built Packet." This is a more visually appealing - and more visually accessible - version of a classic 'swimlane' diagram. Each horizontal bar is a 'lane' which represents either a department within the organization or a specific role within a department. Most of the shapes were found in the selection provided with the software. However, some of them I had to create myself, either from scratch or by modifying shapes provided. Specifically, I created the filing cabinets, the blueprint, the clock, the notebook, the brown accordion file, and I added the hard-hat to the contractor. I also created the custom conveyor-belt line type. I strongly believe that visual metaphors such as these make it much easier for people to understand and internalize the message of a diagram as opposed to just a bunch of text in boxes.

(Click on the image to go to a full-size .PDF file of the diagram.)

The content of this post is Copyright © 2012 by Grant Sheridan Robertson.
This diagram is posted with the permission of my boss.

Wednesday, June 27, 2012

Wood Carving & OneNote

Way back in 2003-2004 I fell in love with Microsoft OneNote. The first version, OneNote 2003, was a little tricky to use when using handwriting. There were lots of questions about it in the newsgroups. After many hours of trial and error, spread out over several weeks, I finally figured out how to use the handwriting feature in a manner that would produce consistent results. Being the helpful newsgroup maven that I was, I sat down to write up an explanation for all of this. When I was finished, I realized that I had enough for a magazine article. I hunted around for a magazine that would publish my article and found one which would take it for no pay, but at least I would then be a "published author." All they asked is that I add some photographs to illustrate my techniques. So, I sat down and took a bunch of screen shots, took a picture of my hand holding a stylus, and used Photoshop to put them together to make it look as if I had taken a perfect picture of my hand over the screen. The publisher loved it.

An example illustration from the article.

Unfortunately, the magazine went out of business before they had a chance to publish my article. And soon after, Microsoft came out with an update that completely changed the way the handwriting features worked, so my article was moot. But it is still a really great article, and a wonderful example of my skills as a technical writer, photographer, and illustrator. Rather than posting the article here and letting all my beautiful formatting get munged up by Blogger's HTML handling, I have posted the article up on here.

The content of this post is Copyright © 2012 by Grant Sheridan Robertson.

Tuesday, February 7, 2012

dSRCI - Citations

Part 1 - Introduction
Part 2 - .sci Top Level Domain
Part 3 - Citations

In my past two posts I have introduced the distributed Scientific Research Collaboration Infrastructure (dSRCI) and then discussed my proposal for a new, perpetual Top Level Domain (TLD), called .sci, for scientists to use to uniquely identify their contributions on the internet. The second piece of this puzzle (arguably more important than the .sci TLD) is a consistent data standard so posts, papers, and articles – what I am calling “Artifacts of Collaboration” (AoCs) – written by scientists can be crawled, indexed, traced, analyzed, and rated, regardless of where they may be created, stored, moved, or distributed. This data standard will consist of the following parts:
  • A consistent, universal citation system
  • A standardized vocabulary for
    • Relationships between people
    • Relationships between artifacts
    • Relationships between the people and the artifacts
    • Topics discussed

In the rest of this post, I will discuss citation systems, how they apply to what we are trying to accomplish here, and a simplified citation system which I think will make it much easier for scientists to use regardless of where they may be posting.

Citation Encoding:
Ahh, a consistent, universal, usable citation system. I have been stalking this mythical beast for years now. Everyone I ask inevitably says, “Just use Dublin Core” (DC) as if DC actually was a consistent citation standard. To my mind – and for the purposes of this project – DC is meaningless because it can mean anything. DC is so “flexible” that anyone can use any of its tags and attributes for just about anything they want. I mean, what the heck does “creator” mean anyway? The author? The publisher? The producer of a film? What? For some people, every citation could reasonably have the same three-letter value for this attribute. I have hunted and searched for any explicit definition of how to use DC in a consistent manner that has been widely accepted, but to no avail. It seems everyone who is designing a system to use DC just makes up their own interpretation of what goes where and what it means. So, in the end, DC is about as useful as simply saying:
Random gobbledygook that we let every individual program parse in its own way
If someone out there can prove me wrong, then please do so. I would be so ecstatic that I would ride my scooter all the way out to where you are and hug your neck.

With all the Library Scientists out there who know so much about what is necessary for a good citation, I would think someone would have done all this by now. And, with how darned cooperative librarians tend to be, I would have thought that any good system would have become widely accepted by now as well. However, again, I haven’t found anything. If I have to, I can make up something myself. If people don’t like it they can suggest something different. But I am not in much of a mood for bickering back and forth on esoterica or theories. Some may suggest that I should at least base my new system upon Dublin Core. However, I now believe that DC has gone the way of Unix. There are so many deviations that it has now simply become a deviant.

I am also aware of the Zotero project and how they use special RDF formatted citation information (often using DC) to download citations from websites that provide it. However, I would prefer that the citations for this system not require an entire paragraph of RDF to cite a ten-word sentence. And, I would prefer that they also be relatively human readable when cut and pasted into a text file or other document.

Both of these existing types of citation systems are certainly in wide use but I still feel they fall short of what will be necessary to facilitate the kind of collaboration we desire. DC is too flexible - thus too ethereal - and current RDF standards are too verbose. So, I propose a new citation system be created. A system that does an end run around the problems of inconsistent citation standards and verbosity. In order for this new citation system to be successful, I believe it must meet the following requirements:
  • There must be a unique identification for each contributor and each of their individual contributions.
  • These citations should be embedded within the documents, contributions, and other AoCs (Artifacts of Collaboration).
I have covered the issue of creating a unique, permanent identifier for each scientist. Now I will address the issue of creating a unique identifier for each contribution from each scientist. Remember, all we are looking for is a clean and simple way to differentiate between all the different contributions made by a particular scientist. It seems to me that the simplest, easiest means to do this is to simply apply a date and time stamp with one-second resolution. (If you know of any scientists who can make more than one significant contribution in a single second let me know. I would really like to meet them.) So, the URI for a contribution would be something like “scientistnameYYYY.sci/YYYY-MM-DD_HH-MM-SS” or perhaps “” Now, the first impulse of many is to try to imagine using either of these URIs as URLs, which then leads one to imagine all those thousands of subdirectories, one for each contribution. But remember, a URI is merely an identifier. It does not have to resolve to a URL. In other words, there does not need to be an actual web page for each of these URIs. It is just a label. Also remember, I expect there to be copies of these documents/contributions/AoCs spread throughout the internet. The identifier is merely a means for search engines to … well … identify each copy of each contribution, index it, and make it available to researchers.
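
To make the format concrete, here is a minimal Python sketch of how such an identifier could be assembled. The function name and the example domain "janedoe1970.sci" are made up for illustration, and the time zone is my assumption: the scheme above does not specify one, so the sketch simply formats whatever timestamp it is handed (UTC would be a sensible convention).

```python
from datetime import datetime


def contribution_uri(scientist_domain: str, when: datetime) -> str:
    """Build an identifier of the form name+year.sci/YYYY-MM-DD_HH-MM-SS.

    Assumption: the caller passes a timestamp already in the agreed-upon
    time zone (for example UTC); the proposal itself does not name one.
    """
    stamp = when.strftime("%Y-%m-%d_%H-%M-%S")  # one-second resolution
    return f"{scientist_domain}/{stamp}"


# Hypothetical scientist and timestamp, for illustration only.
uri = contribution_uri("janedoe1970.sci", datetime(2012, 2, 7, 14, 30, 5))
print(uri)  # janedoe1970.sci/2012-02-07_14-30-05
```

Note that nothing here touches the network. The result is only a label, which is the whole point.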

Sometimes people would like to refer to a specific section within a document. This can be accomplished using this simplified citation system simply by appending a URI “fragment” to the citation URI. A “fragment” is really not much more than an additional string that starts with a pound/hash/number symbol (#). So a citation indicating a specific paragraph of a contribution might look like this: “scientistnameYYYY.sci/YYYY-MM-DD_HH-MM-SS#paraXXX.” I will have to do some research to see if there is already a standard system for designating these types of within-document locations. I know Adobe uses something similar for indicating the locations of annotations within the text of .PDF files. I will have to see how that works and if it is available to use.
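
As a rough illustration, Python's standard library already knows how to separate a fragment from the rest of an identifier. The helper name and the "para12" label below are hypothetical; the naming of within-document locations is still an open question, as noted above.

```python
from urllib.parse import urldefrag


def split_citation(citation: str) -> tuple[str, str]:
    """Split a citation into (contribution identifier, fragment).

    The fragment is returned without the leading '#'; it is an empty
    string when the citation refers to the whole contribution.
    """
    base, fragment = urldefrag(citation)
    return base, fragment


# Hypothetical citation pointing at one paragraph of a contribution.
print(split_citation("janedoe1970.sci/2012-02-07_14-30-05#para12"))
# ('janedoe1970.sci/2012-02-07_14-30-05', 'para12')
```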

Embedded Citations:
Another problem with current citation systems is that most of them do not embed the citation directly into the item being cited. The citation is applied as an external label. RDF points to a document and says, “That document has this citation.” Of course anyone can create another RDF tag that claims the document has yet a different citation. And if the document is moved or its server goes down, then all those RDF tags become worthless. When the citation is embedded within the artifact itself (whether it be a .PDF document, word processing document, web page, or just a comment on a blog) then that artifact can be moved or copied almost anywhere and it can still be found and indexed by search engines. (Naturally, if the only copy lies behind a paywall then we have a problem.) Before search engines began digging into the actual document contents, this would not have been a viable solution. But now that Google and Bing index every word in nearly every document posted on the internet, there is no excuse to still rely on labels that are metaphorically merely laid down near to documents rather than being permanently attached. Pictures have EXIF data, PDF files have embedded metadata, and so do most word processing file formats. Currently, there is no means to easily embed the appropriate unique citation within these documents other than manually going to the metadata dialog within the software for each and every individual file. And, sure, people can manually type out one of these citations in their blog or forum posts, but who wants to go to that trouble? That is what computers are for. Later, in my post on software, I will address a means to make this process easier to do without even really thinking about it.
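
To show why embedding matters, here is a small Python sketch of what an indexer might do with the plain text of any artifact: scan it for embedded identifiers. The regular expression is only my guess at the identifier shape described above (letters, a four-digit year, ".sci", a timestamp, and an optional fragment), and the names in the sample text are invented.

```python
import re

# Assumed identifier shape: <letters><4-digit year>.sci/<timestamp>[#fragment]
CITATION_RE = re.compile(
    r"\b([a-z]+\d{4}\.sci)/(\d{4}-\d{2}-\d{2}_\d{2}-\d{2}-\d{2})(?:#([\w-]+))?",
    re.IGNORECASE,
)


def find_citations(text: str) -> list[tuple[str, str, str]]:
    """Return (scientist domain, timestamp, fragment) for each identifier found.

    The fragment is an empty string when the identifier has none.
    """
    return [
        (m.group(1), m.group(2), m.group(3) or "")
        for m in CITATION_RE.finditer(text)
    ]


# Hypothetical blog comment containing two embedded citations.
sample = (
    "As argued in janedoe1970.sci/2012-02-07_14-30-05#para12, and later "
    "in johnroe1965.sci/2011-11-01_09-00-00."
)
for hit in find_citations(sample):
    print(hit)
```

Because the identifier travels inside the text, the same scan works on a copy of the artifact no matter which server it ends up on.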

What about all that other citation information?
That is a good question. Do we really need it in the internet age? Regular citation information is primarily designed to make it possible – though not necessarily easy – for people to find that document in a regular library. Have you been in a library lately? Even at the library, everyone uses computers to look things up. But they have to type in the various bits and pieces of the “legacy” citation, sort through all the false hits, try FirstName, LastName, then LastName, FirstName, then see if there was a middle initial, then hope the index they are using indexed that document, but never be sure if the document was there but they just didn’t use the right search terms. And this is when they already know exactly which document they are looking for. What a relief it would be to just type in “scientistnameYYYY.sci/YYYY-MM-DD_HH-MM-SS” – or better yet, scan a QR code – and go right to the desired document!

I am not claiming that we should do without “legacy” citations altogether. Merely that they should be supplementary. I also feel strongly that these “legacy” citations should be consistent, easy to read, and embedded within the document just like the abbreviated citations I proposed above. There is still the problem of a consistent standard. I will work on that some other time, perhaps.

I believe this new citation system, as simple as it is, can really go a long way toward creating the web of interconnected collaboration we are looking for. But it can only do that if it is consistently inserted in every Artifact of Collaboration as it is created. I will discuss how to ensure this happens in my post on the software necessary to make all this happen. First, however, I need to discuss the terms – or vocabulary – that can be used to describe the relationships between all the collaborators and these “artifacts.” This is what I will cover in my next post.

The content of this post is Copyright © 2012 by Grant Sheridan Robertson.

dSRCI - .sci Top Level Domain

Part 1 - Introduction
Part 2 - .sci Top Level Domain
Part 3 - Citations

In my last post, I outlined a new infrastructure (distributed Scientific Research Collaboration Infrastructure) which I believe will help facilitate a rich and wonderful new way for scientists to collaborate over the internet. In addition, this infrastructure will enable future employers, granting agencies and connectome researchers to analyze the patterns of collaboration by bringing the metadata about these collaborations to the surface for easy indexing and searching. One of the requirements for that infrastructure is that each scientist have a unique identifier that they can use to tag all of their work and “Artifacts of Collaboration” (AoCs). This unique identifier will be based on one simple idea: a new .sci top-level domain, under which unique domain names will be issued to scientists. These domain names will exist in perpetuity, even after the death of the scientist.

A New Domain:

Each scientist will be given their own perpetual domain name under a new .sci top-level-domain. This domain name would be provided either at no cost or for a single, one-time fee and would last forever, even after the death of the scientist. This will provide that single, unique, perpetual identifier for each and every scientist in existence. While most, if not all, current scientists now have a “home page” on the web site of their current institution, the URL associated with that home page is subject to the whims of every web-master who will ever work on that site now or in the future. In addition, most scientists do not work at one institution all of their lives, and many are now forced to work at more than one just to earn a decent income.

Many may say that the Semantic Web allows for multiple different ways to refer to the same resource – even if one has to go through many RDF linkages to arrive at the original or primary URI for that resource – and therefore a special, assigned URI is not necessary. However, the ethereal nature of RDF provides no means to prevent inappropriate duplication. Two different scientists may choose to call themselves JoeSmith and then RDF mining software would need to incorporate extraordinary measures to differentiate the two – or the hundreds of – duplicate URIs. I believe it is this very ethereal nature of RDF that causes scientists, and most others, to shy away from using it. What good does it do to assign “my URI” to something if that URI may change in just a couple of years or may be accidentally mistaken for someone else’s URI? Yes, software is becoming more and more powerful, but do we really want to give it ten-thousand times more work to do just to avoid the “restrictiveness” of an assigned domain for use with URIs? Actors live with this “restriction” all the time; many even have to legally change their names in order to avoid duplication with any other actor who has ever been a member of The Actors Guild. If actors can change their names, then scientists can register for a unique domain name. In the future, I expect it to become a badge of honor. Something bestowed upon a scientist when they receive their PhD or other credentials.

Of course there would have to be some rules to ensure that the domain names were actually meaningful and easily discoverable. I don’t think “joethedinosaurhunter.sci” would be appropriate. Though many female scientists may not like this idea, I think the best system would be to simply use someone’s full, given name along with the year they were born. This should provide enough uniqueness (within the narrow scope of scientists) that there would only need to be a few alternates to avoid duplication. These alternatives could consist of using the scientist’s middle name, or appending the month or even day of their birth. I would like to avoid things like simply tacking on an A or B to the end of their names as this leads to ambiguity. This naming scheme would also provide valuable information indicating the era in which a scientist has lived. 300 years from now, it will be important to be able to easily spot the difference between alberteinstein1879.sci and alberteinstein2275.sci. Hey, it could happen.
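
Here is a minimal Python sketch of that naming rule, with the fallbacks tried in order. The function name, the order of the alternates, and the exact way the month and day are appended are all my assumptions for illustration; the real rules would still need to be worked out.

```python
import re


def candidate_domains(given, family, birth_year,
                      middle=None, birth_month=None, birth_day=None):
    """List candidate .sci names, most preferred first.

    Assumed order: name + year, then with the middle name added, then
    with the birth month appended, then with the month and day appended.
    """
    def clean(part):
        # Keep only the letters a-z; handling of other alphabets is not
        # addressed in this sketch.
        return re.sub(r"[^a-z]", "", part.lower())

    base = clean(given) + clean(family)
    candidates = [f"{base}{birth_year}.sci"]
    if middle:
        candidates.append(
            f"{clean(given)}{clean(middle)}{clean(family)}{birth_year}.sci"
        )
    if birth_month is not None:
        candidates.append(f"{base}{birth_year}{birth_month:02d}.sci")
        if birth_day is not None:
            candidates.append(
                f"{base}{birth_year}{birth_month:02d}{birth_day:02d}.sci"
            )
    return candidates


print(candidate_domains("Albert", "Einstein", 1879, birth_month=3, birth_day=14))
# ['alberteinstein1879.sci', 'alberteinstein187903.sci', 'alberteinstein18790314.sci']
```

A registrar would walk this list and hand out the first name not already taken.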

Mentioning Albert Einstein brings me to another point: All past scientists will be assigned their own domain names as well, following the same naming convention as for living scientists. Then, every time someone mentions a scientist, living or dead, they can insert the URI for that scientist within a metadata tag. Then search engines can index that reference so anyone looking for any references to that particular scientist anywhere on the internet can have one single, unique search term to look for. (I will address possible abuses of the system in yet another post.)

URIs, of course, can also be used as URLs. URLs under the .sci TLD will be the perfect place for scientists to place web pages about themselves and their work. Here, too, it would be helpful to have some consistent structure. So I propose a basic hierarchy of directory names to contain some basic info about a scientist. For instance ScientistNameYear.sci/cv or ScientistNameYear.sci/bio, ScientistNameYear.sci/currentwork, et cetera. I / we can work out a full structure later. Sure, scientists could follow any structure they want, but consistency makes them easily discoverable. Plus why reinvent the wheel? Everyone can just download and copy the standard template and away they go. And, there is no need for anyone to design their web page to look just like anyone else’s. All that is necessary is to embed the proper RDF tags on the proper pages for people and search engines to find. Everything else is gravy.

Just as any other domain name can be hosted on any server, these .sci domain names can be hosted anywhere the scientist chooses. They can be on the scientist’s university’s or company’s server or on a personally maintained server. The “site” can then be moved to any server in the world, as necessary, and the infrastructure will remain undisturbed. The question now arises as to who would host the domains for scientists who are no longer “with us,” either dead or retired. I expect that certain famous scientists will have many institutions clamoring to host those domains, if only for the recognition. Therefore I propose a bidding process. Institutions would bid against each other for the privilege of hosting the sites of these famous scientists. However, rather than bidding money, they will offer to host the sites of less popular scientists. So, an institution that wants to host AlbertEinstein1879.sci may need to host the sites of tens of thousands of other dead scientists in exchange. Remember, it is not as if these “charity” sites will take up a lot of space or bandwidth, so it shouldn’t really be much of a problem.

I understand that other, non-scientist, people may want to collaborate with scientists as well. However, I do not think it would be appropriate for just anyone to be allowed to register for a .sci domain name. Only individuals with a certain level of bona fides should be allowed to register. Whether that should include only those with PhDs or also allow others established in their fields, I cannot say. I will leave it up to the scientists to hash out the particulars of what qualifies as a real scientist within their particular fields. There is one thing I am adamant about here: Corporations are not people and, therefore, they cannot be scientists. Even though a corporation may own the intellectual property of the scientists who work for them, it is the individual scientists who have made the contributions, that is what we want to track, and so only the scientists should be able to get a .sci domain name.

I understand that this new top-level domain, with its special considerations, would require both an act of Congress as well as international treaties. However, the potential value gained from it would make it worth the trouble. Some may argue that the cost of maintaining such a long list of domain names would be too expensive. Seriously?! Just keeping a domain name in a list on a few servers would cost too much? The importance of the advancement of science is enshrined in our Constitution. The USPTO and Library of Congress cost billions per year. A little bit of bandwidth on a few servers spread out throughout the world would amount to less than a Higgs Boson within an atom in a molecule in a drop in that bucket. Besides, the revenues from the exponentially growing ranks of new scientists registering for their domains will easily pay for the exponentially shrinking costs of maintaining the lists of all the previous scientists.

Now, the entire dSRCI system is not utterly dependent upon the approval of this new top-level domain, though it would certainly make things much easier. Scientists could register domains under the .name TLD. Or simply choose any domain name they, personally, control. The problem with this is the impermanent nature of these registrations. If the registrants or their heirs do not keep up the yearly payments, then the domain name is up for grabs by anyone who wants to capitalize on the scientists’ good names. Perhaps some registrars could be persuaded to offer perpetual registrations for a large enough up-front fee. Unfortunately, without an adequate legal contract, I would still be suspicious as to the actual longevity of said domain name registration. This is an issue for another blog post, but perhaps we could get some lawyers to figure out the proper language to ensure that a registrar – and any entity that ever receives their assets – will be required to maintain said contracted registrations in perpetuity. Perhaps something similar to liens on property. Heck, if corporations can be people, and simple, obvious ideas can be inviolable property, then domain names can be property to be protected in perpetuity too, by gum!

Another alternative to the new .sci TLD would be for scientists to simply start using these scientistname.sci URIs in their citations and in the metadata on their web sites. The DNS system would not resolve these URIs to actual URLs until the .sci TLD was approved, but search engines would still be able to index the citations. If it turns out there are legal issues with using the .sci suffix in these temporarily imaginary URIs, then it would also be possible to use instead. If “dSRCI” were trademarked then the dSRCI organization would be able to deal with abusers within the regular legal system. I would recommend against the dSRCI organization hosting any web pages pointed to by these URIs, however. I would not want any one organization to have that much control or to become a potential choke-hold for oppressive governments to use for censorship. In this context, I believe a search engine based “replacement” for DNS may be more robust and more resilient to change than the current DNS system. But that is yet another topic for yet another separate post.

Yet another alternative, though my least favorite, would be for scientists to take the string which would be used as their domain name under this system and start using it as the parent folder for their professional web site. For instance: If the university where they work provides them with a folder such as then the scientist could create a folder called and place all their content under there. The file would simply redirect to . This way, the scientist could move that folder anywhere he or she wanted and search engines would still be able to find it when people search on the “scientistNameYear.sci” string.

So, I guess all I need to do now is form a non-profit to lobby for a new law creating the perpetual .sci TLD as well as the treaties necessary to make it international. Anyone want to help with that?

In my next post I will discuss the data standards and citation format necessary to bring all this data to the surface for ease of analysis.

The content of this post is Copyright © 2012 by Grant Sheridan Robertson.

dSRCI - distributed Scientific Research Collaboration Infrastructure - Introduction

Part 1 - Introduction
Part 2 - .sci Top Level Domain
Part 3 - Citations

I recently had the privilege to watch a TED talk by Michael Nielsen about what people are now calling “open science.” I have been acutely aware of the problem of scientists hoarding data and ideas for some time now. Nielsen’s talk, however, drove home the point that the primary reason for this hoarding and secrecy was basic academic survival. Nielsen made clear that scientists do not share their work and – more importantly – their ideas freely because they are afraid they will not get credit for any of these shared ideas. That some other scientist will scoop them by rushing to complete research on the same topic and publishing first.

This well-known “publish or perish” culture is based on the unfortunate fact that publishing in established journals is the one and only means scientists now have for establishing their reputation. As explained by Nielsen, a count of published papers [along with the number of citations to those papers] is currently all future employers have to go on when determining a scientist’s skill. For the last few hundred years, all of a scientist’s skills, innovation, and yes, ability to successfully collaborate have been boiled down to two numbers: published papers and citations. The result of this overly simplistic ranking system, in my opinion, is that the pace of science is limited to what each scientist can do one major – and, preferably, well-funded – project at a time. So, scientific progress is held up by the very scientists who are supposed to be advancing it because they are hoarding their data and ideas.

There are many well-known problems with the publish or perish system. One is that scientists tend to focus their research on projects that are likely to get published. If there is a trend then many scientists will chase that trend, consciously or unconsciously. If it appears that the publishing organizations are shying away from publishing papers on certain controversial topics, then many scientists will simply avoid those topics altogether. Another problem, brought to light by recent research, is that journals have a strong bias for “positive” results. Only articles that prove a hypothesis tend to get published. If a scientist has a hypothesis but his research proves him wrong, no one is likely ever to see that research. This means other scientists are doomed to repeat the same research over and over again.

As Nielsen pointed out in his talk, there have been many attempts to solve this hoarding problem. The National Institutes of Health (NIH) has implemented policies wherein all research they fund must be registered before the research begins (so no failed research can be hidden in a desk drawer) and – once completed – all research must be made available to the public along with all the data. Unfortunately, I cannot see even this dramatic measure producing the true, free-flowing collaboration desired by many, including Nielsen and myself. Under the NIH policies, scientists will still keep all their ideas and research data secret until they publish. Yes, more data will become available, but only after either publication or the end of the grant period, which can take years, or even decades. No, the NIH policies are only the first jostle in a major transition that needs to take place.

Michael Nielsen also pointed out that many web sites have sprung up in an attempt to solve this problem, but, as he observed, all of them have failed miserably. All these sites are virtual ghost towns now (pun intended). Where Nielsen and I diverge is in the reasons for these failures. Nielsen speculates that the problem is merely one of the culture of secrecy among scientists. He claims that policy changes, such as those instituted by the NIH, and … well … talking to your friends about the issue will solve the problem. But I believe the reasons these sites failed are deeper and more multi-faceted than that.

Yes, a big reason scientists have chosen to stay away from these sites is the “credit problem.” But we must look deeper if we are to find a solution that will work in the world where these scientists are forced to live. Even though posts on these web-sites are usually public, each site remains an island unto itself. Depending on user name and post formatting differences, it is very difficult to hunt down all the contributions made by a particular scientist on every blog or forum to which they contribute, should a future employer even choose to do so. I have specifically chosen to use the exact same user name on all the forums I contribute to just so all my contributions can be easily found. However, that goal has been stymied by the various user name requirements on each different site. So, someone would have to know to search for “GrantRobertson” on some sites, “Grant S Robertson” on others, “Grant Sheridan Robertson” on others, and “GrantSR” on still others. In some cases, for more important “contributions” I will repost that contribution to my personal blog. Other people may not go to that trouble.

There are other reasons for the failure of these “science collaboration” or “scientific social networking” web sites that Nielsen never even touched on. One problem is that all of these sites are trying to be the “one site to rule them all,” (with apologies to the Tolkien estate). Everyone wants to be the next Mark Zuckerberg or Jimmy Wales and own or control the one site where ALL scientists and/or academics go to collaborate. Yeah, that’s never gonna happen. For one thing, scientists are likely hesitant to commit to a single site that may go away in just a few years. Scientists may also be hesitant to hand over all their data to one single corporation which likely holds the interests of its stockholders above those of scientific endeavor.

On the other hand, there are also many problems with a proliferation of many different forums and blog sites, each covering a small niche of the scientific environment. First, there is the inconsistency of having hundreds of different user interfaces and data standards. Most blogs and forums are only concerned with enhancing minute-by-minute traffic, in order to enhance advertising revenue, so they often could not care less about anyone’s ability to extract data from the site after the fact. This means data standards are essentially nonexistent in this realm. Yes, blogs are often packaged for an RSS feed. But, within that package, there are no standards for the information or topics that may be discussed within that blog post. Another problem is redundancy. The same topic may be discussed on several different blogs/forums. There is no easy, machine-based means to check for this redundancy because the topics may be referred to by different terms on the different sites. So, two or more different groups of people may be discussing the exact same thing and never know the other groups exist. This can stymie good collaboration.

Ask any business manager and they will tell you: If you want people to do something, you have to reward that behavior. But you can only reward what you can measure. The ability to measure either a behavior or an output is key. As discussed earlier, for the past few hundred years our only metrics for scientific skill (which only peripherally includes collaboration) have been total number of papers and total number of citations. And, as we have seen, all our current attempts at promoting collaboration have failed miserably at measuring what they were trying to promote, if they even attempted to measure it at all. In addition, almost all recent attempts have failed to enable true collaboration for one reason or another. Therefore, if we want to promote scientific collaboration we must do three things: We must make it possible. We must make it simple and easy to do (preferably so easy that it is almost more difficult to not do it). And we must make it measurable, thus making it more desirable.

What can help us get from where we are to where we want to be? To use an analogy, we currently have a situation where no one wants to pile all their innovative Easter eggs into one proprietary basket which they don’t trust. Scientists could spread their Easter eggs around by placing them under various bushes. But hiding them under lots of different bushes makes them harder to keep track of and harder to lay claim to if someone else should “find” them. What we need is a system that allows us to put our precious eggs anywhere we want, but to quickly and easily identify each and every one of them as our own, when necessary. To stretch the analogy further: What we need is a bunch of strings to tie all of our eggs together in one big “web” so all we have to do is pull on one string and all our carefully crafted eggs come out for everyone to see how beautiful they are.

Tim Berners-Lee has a name for this bundle of strings: He calls it the “Semantic Web.” The idea of the Semantic Web has been around for a long time. There are several protocols that have been devised to implement it. So, why haven’t scientists made use of this wonderful technology to enable them to collaborate and share their ideas in a manner that still allows them to get the credit they deserve? I have four words for you: “Pain In The Ass.” Yes, I’ll say it: The Semantic Web is a total pain in the ass for individuals to use. First, people have to learn about the concept itself. Then they have to learn about the protocols, find and select taxonomies and ontologies (after they learn what the heck those words mean), and finally – most of the time – they would have to type these arcane tags into their blog and forum posts by hand. After all that work, these collaboration-hopefuls must still hold their breath and hope those blogs and forums don’t filter out the tags they have so carefully inserted. Holy Insane Unnecessary Minefield, Batman!

I know Tim Berners-Lee is a legend. And, from what I have seen, he is the sweetest guy, who only wants the best for the internet and the world that uses it. But I have to say, he kind of dropped the ball when it comes to implementation here. (Perhaps his plan was to allow others to pick up that ball and run with it.) Currently, creating content for the Semantic Web is like learning vi just so you can develop an AJAX web site that uses JSON to behave like a dynamic application, just so you can say “Hi” to your friends. Whether you know what I just said or not, you can see this is why current Semantic Web use is limited to the few people who can handle all the technical details. What we need is a way to make using the Semantic Web as easy as posting on Facebook or just clicking a “Like” button.

In the rest of this post I will outline – not just a new standard – but an entire new infrastructure that I am developing which I think will help facilitate the kind of deep, culturally-ingrained, natural collaboration that needs to take place among scientists (and interested amateurs) to help scientific progress keep pace with the needs of this century – and the next. I am currently calling this system the “distributed Scientific Research Collaboration Infrastructure” (dSRCI, pronounced dee-sear-see. I figured making it pronounceable would help people remember and easily talk about it).

dSRCI has the following goals:
  • To provide credit for all contributions no matter how minor or where they may be made, thus promoting collaboration and public sharing by measuring and rewarding said collaboration.
  • To be able to be applied anywhere and everywhere rather than on just one – or even a few – web sites.
  • To be consistent, even across various web sites and types of sites.
  • To be almost impossible to censor or shut down.
  • To be so easy to use it is almost more difficult to NOT use it.
Of course, as I discussed earlier, lofty goals do not a functional system make. To be successful, dSRCI must include the following specific features (which will all be discussed more thoroughly in subsequent posts):
  • A means to easily check and track contributions regardless of where they are made. This will require…
    • A single, unique identifier for each contributor which …
      • Can be indexed and searched for to find all contributions by that person.
      • Preferably, actually points to a reference about that contributor.
    • A unique identifier for each contribution which …
      • Can be easily searched for to find all copies of that contribution that may have been placed on other sites (either for archival or redundancy purposes or for further discussion).
      • Points to the original specific post within a blog or forum so it can be viewed in context if possible.
      • Does not rely on the original location to still be available.
      • … And, thus, allows for redundant copies to exist anywhere, preventing censorship of ideas or scientists.
    • A consistent and easy to use citation system which includes:
      • A consistent citation encoding scheme.
      • A consistent means for quoting and citation across all web sites (similar to the “Like” and “Share” buttons that festoon many blogs today). 
      • Automatic citations where possible (such as when a user replies to or quotes another post within a forum).
    • A means to indicate the topic of the conversation in a universally searchable manner, regardless of language.
    • A consistent and intuitive vocabulary to describe the multifarious relationships between all the people participating in the collaboration as well as each individual “Artifact of Collaboration” (AoC).
      • I am still in the initial stages of working out this vocabulary. Though I refuse to be limited by the current restrictions of existing vocabularies, I also see no need to reinvent the wheel or to make the semantic web even more cluttered with sameAs declarations. So, I will need time to study existing vocabularies and hash out ideas with others who have more experience designing RDF vocabularies. This part of the dSRCI infrastructure is guaranteed to be an ongoing project.
      • If you have ideas or suggestions: Please go to the blog post about the vocabularies and comment there. Thank you.
  • An easy to use and consistent means to create or format this content which could consist of:
    • Additional software on blog or forum servers.
    • Browser extensions for use on sites that do not have the requisite features.
  • Naturally, all of this should use existing technologies, when possible.
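
As a rough illustration of the identifier requirements above, here is a minimal Python sketch. It is not the dSRCI format, which later posts will define. The names `aoc_id` and `ContributionIndex`, the choice of a truncated SHA-256 hash, the `jane-doe.sci` contributor identifier, and the example URLs are all assumptions made purely for illustration. Deriving the contribution identifier from the text itself is just one way to let copies on any site resolve to the same identifier without relying on the original location.

```python
import hashlib
from collections import defaultdict


def aoc_id(contributor: str, text: str) -> str:
    """Hypothetical AoC identifier: hash of contributor + whitespace-normalized text."""
    normalized = " ".join(text.split())
    payload = f"{contributor}\n{normalized}".encode("utf-8")
    # Truncated to 16 hex characters for readability (an assumption, not a standard).
    return hashlib.sha256(payload).hexdigest()[:16]


class ContributionIndex:
    """Toy in-memory index: contributor -> contributions -> known copies."""

    def __init__(self) -> None:
        self._by_contributor = defaultdict(set)
        self._locations = defaultdict(set)

    def record(self, contributor: str, text: str, url: str) -> str:
        """Register one copy of a contribution and return its identifier."""
        cid = aoc_id(contributor, text)
        self._by_contributor[contributor].add(cid)
        self._locations[cid].add(url)
        return cid

    def contributions_by(self, contributor: str) -> set:
        """All contribution identifiers known for one contributor."""
        return set(self._by_contributor.get(contributor, set()))

    def copies_of(self, cid: str) -> set:
        """Every location where a copy of this contribution has been seen."""
        return set(self._locations.get(cid, set()))


# The same idea posted on two sites (with slightly different spacing)
# ends up under a single identifier with two known locations.
index = ContributionIndex()
first = index.record("jane-doe.sci", "Protein X folds  faster at low pH.",
                     "https://forum.example/post/17")
second = index.record("jane-doe.sci", "Protein X folds faster at low pH.",
                      "https://blog.example/archive/3")
```

Because the identifier depends only on the contributor and the text, either copy can disappear and the other can still be found by searching for the same identifier.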

In order to meet these goals and provide these features, dSRCI consists of the following parts:
  • A new .sci top-level-domain under which all bona fide scientists will be assigned their own perpetual domain name for use as a unique identifier that will never (ever) expire.
  • A simplified AoC (Artifact of Collaboration) identifier format so each and every individual contribution made by a scientist can be indexed and searched for regardless of where it is posted, transferred, quoted, or archived.
  • A citation standard that encompasses both the scientist and the AoC identifiers as well as all the standard citation information but in a format that is more concise than current Resource Description Framework (RDF) and far more precise than Dublin Core (DC).
  • A new vocabulary to supplement existing RDF standards and vocabularies.
  • Software to make all this so easy to use that people, including busy / distracted scientists, will actually use it.
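
To make the citation part a little more concrete, here is a small Python sketch of what a record combining scientist identifiers and AoC identifiers might carry. The actual citation standard is left to later posts, so everything here is hypothetical: the `Citation` class, its field names, the `replies-to` relation label, the `|`-separated encoding, and the sample identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Citation:
    """Hypothetical link from one Artifact of Collaboration to another."""

    citing_author: str   # e.g. a per-scientist .sci name (assumed form)
    citing_aoc: str      # identifier of the contribution doing the citing
    cited_author: str
    cited_aoc: str       # identifier of the contribution being cited
    relation: str = "cites"

    def encode(self) -> str:
        """Serialize to a single compact line (illustrative format only)."""
        return "|".join((self.citing_author, self.citing_aoc, self.relation,
                         self.cited_author, self.cited_aoc))

    @classmethod
    def decode(cls, line: str) -> "Citation":
        """Parse a line produced by encode()."""
        citing_author, citing_aoc, relation, cited_author, cited_aoc = line.split("|")
        return cls(citing_author, citing_aoc, cited_author, cited_aoc, relation)


# A forum reply could generate this record automatically when the user
# clicks "reply" or "quote", with no tags typed by hand.
reply = Citation("jane-doe.sci", "a1b2", "john-roe.sci", "c3d4", relation="replies-to")
line = reply.encode()
restored = Citation.decode(line)
```

The point of the sketch is only that a citation can carry who said what about whose contribution in one short, machine-readable line.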
In subsequent posts, I will elaborate on each of these subsystems within the infrastructure. Please keep in mind that this system is not meant to be the be-all-end-all of scientific collaboration. It does not address the data access issues created by the lack of flexible yet extensible standards for data storage. I plan to tackle that problem at a later date. Nor does dSRCI provide the actual collaboration tools, such as forums or wikis. What dSRCI does is bring all the collaboration metadata to the surface so it can be examined and analyzed. dSRCI provides a means for employers and “connectome” researchers to easily pull on that bundle of strings and get a good look at all of a scientist’s beautiful Easter eggs. These “connectome” researchers will be able to analyze all the connections between scientists and their ideas. It will be possible to trace the evolution of an idea no matter where it sprouts up and follow that idea throughout its life, watching how it takes a little “DNA” from other ideas and finally grows into yet another solution for our world, all in real time. In addition - and perhaps more importantly - it will be possible to highlight ideas that may be at risk of “extinction” and show them to people who may have just the right expertise to bring them back to life. We won’t have to just hope that the right person happens to read the right blog post at just the right time. Software will be able to automatically spot potential matches between people and ideas and bring them together regardless of how scattered they may be. Now that is when scientific progress will really take off.

The content of this post is Copyright © 2012 by Grant Sheridan Robertson.