Tuesday, February 7, 2012

dSRCI - distributed Scientific Research Collaboration Infrastructure - Introduction

Part 1 - Introduction
Part 2 - .sci Top Level Domain
Part 3 - Citations



I recently had the privilege to watch a TED talk by Michael Nielsen about what people are now calling “open science.” I have been acutely aware of the problem of scientists hoarding data and ideas for some time now. Nielsen’s talk, however, drove home the point that the primary reason for this hoarding and secrecy is basic academic survival. Nielsen made clear that scientists do not freely share their work and, more importantly, their ideas because they are afraid they will not get credit for any of those shared ideas: some other scientist may scoop them by rushing to complete research on the same topic and publishing first.

This well-known “publish or perish” culture is based on the unfortunate fact that publishing in established journals is the one and only means scientists now have for establishing their reputation. As explained by Nielsen, a count of published papers (along with the number of citations to those papers) is currently all future employers have to go on when determining a scientist’s skill. For the last few hundred years, all of a scientist’s skills, innovation, and yes, ability to successfully collaborate have been boiled down to two numbers: published papers and citations. The result of this overly simplistic ranking system, in my opinion, is that the pace of science is limited to what each scientist can do alone, one major (and, preferably, well-funded) project at a time. So, scientific progress is held up by the very scientists who are supposed to be advancing it, because they are hoarding their data and ideas.

There are many well-known problems with the publish or perish system. One is that scientists tend to focus their research on projects that are likely to get published. If there is a trend, then many scientists will chase that trend, consciously or unconsciously. If it appears that the publishing organizations are shying away from publishing papers on certain controversial topics, then many scientists will simply avoid those topics altogether. Another problem, brought to light by recent research, is that journals have a strong bias for “positive” results. Only articles that confirm a hypothesis tend to get published. If a scientist has a hypothesis but his research proves him wrong, no one is ever likely to see that research. This means other scientists are doomed to repeat the same research over and over again.

As Nielsen pointed out in his talk, there have been many attempts to solve this hoarding problem. The National Institutes of Health (NIH) has implemented policies wherein all research they fund must be registered before the research begins (so no failed research can be hidden in a desk drawer) and, once completed, all research must be made available to the public along with all the data. Unfortunately, I cannot see even this dramatic measure producing the true, free-flowing collaboration desired by many, including Nielsen and myself. Under the NIH policies, scientists will still keep all their ideas and research data secret until they publish. Yes, more data will become available, but only after either publication or the end of the grant period, which can take years, or even decades. No, the NIH policies are only the first jostle in a major transition that needs to take place.

Michael Nielsen also pointed out that many web sites have sprung up in an attempt to solve this problem, but, as he observed, all of them have failed miserably. All these sites are virtual ghost towns now (pun intended). Where Nielsen and I diverge is in the reasons for these failures. Nielsen speculates that the problem is merely one of the culture of secrecy among scientists. He claims that policy changes, such as those instituted by the NIH, and … well … talking to your friends about the issue will solve the problem. But I believe the reasons these sites failed are deeper and more multi-faceted than that.

Yes, a big reason scientists have chosen to stay away from these sites is the “credit problem.” But we must look deeper if we are to find a solution that will work in the world where these scientists are forced to live. Even though posts on these web sites are usually public, each site remains an island unto itself. Because user names and post formatting differ from site to site, it is very difficult to hunt down all the contributions made by a particular scientist on every blog or forum to which they contribute, should a future employer even choose to do so. I have specifically chosen to use the exact same user name on all the forums I contribute to just so all my contributions can be easily found. However, that goal has been stymied by the various user name requirements on each different site. So, someone would have to know to search for “GrantRobertson” on some sites, “Grant S Robertson” on others, “Grant Sheridan Robertson” on others, and “GrantSR” on still others. In some cases, for more important “contributions,” I will repost that contribution to my personal blog. Other people may not go to that trouble.

There are other reasons for the failure of these “science collaboration” or “scientific social networking” web sites that Nielsen never even touched on. One problem is that all of these sites are trying to be the “one site to rule them all” (with apologies to the Tolkien estate). Everyone wants to be the next Mark Zuckerberg or Jimmy Wales and own or control the one site where ALL scientists and/or academics go to collaborate. Yeah, that’s never gonna happen. For one thing, scientists are likely hesitant to commit to a single site that may go away in just a few years. Scientists may also be hesitant to hand over all their data to one single corporation which likely holds the interests of its stockholders above those of scientific endeavor.

On the other hand, there are also many problems with a proliferation of many different forums and blog sites, each covering a small niche of the scientific environment. First, there is the inconsistency of having hundreds of different user interfaces and data standards. Most blogs and forums are only concerned with enhancing minute-by-minute traffic, in order to enhance advertising revenue, so they often couldn’t care less about anyone’s ability to extract data from the site after the fact. This means data standards are essentially nonexistent in this realm. Yes, blogs are often packaged for an RSS feed. But, within that package, there are no standards for the information or topics that may be discussed within that blog post. Another problem is redundancy. The same topic may be discussed on several different blogs/forums. There is no easy, machine-based means to check for this redundancy because the topics may be referred to by different terms on the different sites. So, two or more different groups of people may be discussing the exact same thing and never know the other groups exist. This can stymie good collaboration.

Ask any business manager and they will tell you: If you want people to do something you have to reward that behavior. But you can only reward what you can measure. The ability to measure either a behavior or an output is key. As discussed earlier, for the past few hundred years our only metrics for scientific skill (which only peripherally includes collaboration) have been total number of papers and total number of citations. And, as we have seen, all our current attempts at promoting collaboration have failed miserably at measuring what they were trying to promote, if they even attempted to measure it at all. In addition, almost all recent attempts have failed to enable true collaboration for one reason or another. Therefore, if we want to promote scientific collaboration we must do three things: We must make it possible. We must make it simple and easy to do (preferably so easy that it is almost more difficult to not do it). And we must make it measurable, thus making it more desirable.

What can help us get from where we are to where we want to be? To use an analogy, we currently have a situation where no one wants to pile all their innovative Easter eggs into one proprietary basket which they don’t trust. Scientists could spread their Easter eggs around by placing them under various bushes. But hiding them under lots of different bushes makes them harder to keep track of and harder to lay claim to if someone else should “find” them. What we need is a system that allows us to put our precious eggs anywhere we want, but to quickly and easily identify each and every one of them as our own, when necessary. To stretch the analogy further: What we need is a bunch of strings to tie all of our eggs together in one big “web” so all we have to do is pull on one string and all our carefully crafted eggs come out for everyone to see how beautiful they are.

Tim Berners-Lee has a name for this bundle of strings: He calls it the “Semantic Web.” The idea of the Semantic Web has been around for a long time. There are several protocols that have been devised to implement it. So, why haven’t scientists made use of this wonderful technology to enable them to collaborate and share their ideas in a manner that still allows them to get the credit they deserve? I have four words for you: “Pain In The Ass.” Yes, I’ll say it: The Semantic Web is a total pain in the ass for individuals to use. First, people have to learn about the concept itself. Then they have to learn about the protocols, find and select taxonomies and ontologies (after they learn what the heck those words mean), and finally, most of the time, they would have to type these arcane tags into their blog and forum posts by hand. After all that work, these collaboration-hopefuls must still hold their breath and hope those blogs and forums don’t filter out the tags they have so carefully inserted. Holy Insane Unnecessary Minefield, Batman!

I know Tim Berners-Lee is a legend. And, from what I have seen, he is the sweetest guy, who only wants the best for the internet and the world that uses it. But I have to say, he kind of dropped the ball when it comes to implementation here. (Perhaps his plan was to allow others to pick up that ball and run with it.) Currently, creating content for the Semantic Web is like learning vi so you can hand-code a dynamic, application-like AJAX web site that shuttles JSON back and forth, just so you can say “Hi” to your friends. Whether you know what I just said or not, you can see this is why current Semantic Web use is limited to the few people who can handle all the technical details. What we need is a way to make using the Semantic Web as easy as posting on Facebook or just clicking a “Like” button.

In the rest of this post I will outline not just a new standard but an entire new infrastructure that I am developing, which I think will help facilitate the kind of deep, culturally-ingrained, natural collaboration that needs to take place among scientists (and interested amateurs) to help scientific progress keep pace with the needs of this century and the next. I am currently calling this system the “distributed Scientific Research Collaboration Infrastructure” (dSRCI, pronounced dee-sear-see; I figured making it pronounceable would help people remember and easily talk about it).

dSRCI has the following goals:
  • To provide credit for all contributions no matter how minor or where they may be made, thus promoting collaboration and public sharing by measuring and rewarding said collaboration.
  • To be able to be applied anywhere and everywhere rather than on just one – or even a few – web sites.
  • To be consistent, even across various web sites and types of sites.
  • To be almost impossible to censor or shut down.
  • To be so easy to use it is almost more difficult to NOT use it.
Of course, as I discussed earlier, lofty goals do not a functional system make. To be successful, dSRCI must include the following specific features (which will all be discussed more thoroughly in subsequent posts):
  • A means to easily check and track contributions regardless of where they are made. This will require…
    • A single, unique identifier for each contributor which …
      • Can be indexed and searched for to find all contributions by that person.
      • Preferably, actually points to a reference about that contributor.
    • A unique identifier for each contribution which …
      • Can be easily searched for to find all copies of that contribution that may have been placed on other sites (either for archival or redundancy purposes or for further discussion).
      • Points to the original specific post within a blog or forum so it can be viewed in context if possible.
      • Does not rely on the original location to still be available.
      • … And, thus, allows for redundant copies to exist anywhere, preventing censorship of ideas or scientists.
    • A consistent and easy to use citation system which includes:
      • A consistent citation encoding scheme.
      • A consistent means for quoting and citation across all web sites (similar to the “Like” and “Share” buttons that festoon many blogs today). 
      • Automatic citations where possible (such as when a user replies to or quotes another post within a forum).
    • A means to indicate the topic of the conversation in a universally searchable manner, regardless of language.
    • A consistent and intuitive vocabulary to describe the multifarious relationships between all the people participating in the collaboration as well as each individual “Artifact of Collaboration” (AoC).
      • I am still in the initial stages of working out this vocabulary. Though I refuse to be limited by the current restrictions of existing vocabularies, I also see no need to reinvent the wheel or to make the semantic web even more cluttered with sameAs declarations. So, I will need time to study existing vocabularies and hash out ideas with others who have more experience designing RDF vocabularies. This part of the dSRCI infrastructure is guaranteed to be an ongoing project.
      • If you have ideas or suggestions: Please go to the blog post about the vocabularies and comment there. Thank you.
  • An easy to use and consistent means to create or format this content which could consist of:
    • Additional software on blog or forum servers.
    • Browser extensions for use on sites that do not have the requisite features.
  • Naturally, all of this should use existing technologies, when possible.
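
The identifier requirements in the list above can be illustrated with a rough sketch. Everything in it is an assumption made for illustration only: the domain-style contributor name ("jane-doe.sci"), the idea of deriving the contribution identifier from a hash of the author and the normalized text, and the class and function names are all hypothetical, not the actual dSRCI formats. The point is simply that an identifier computed from the content itself does not depend on the original location staying online, so redundant copies anywhere can be matched up.

```python
import hashlib
from collections import defaultdict
from dataclasses import dataclass


def contribution_id(contributor: str, text: str) -> str:
    """Derive a location-independent ID from who wrote it and what was written.

    Hypothetical scheme: a truncated SHA-256 over the contributor name and the
    whitespace-normalized text. Not the real dSRCI identifier format.
    """
    # Collapse whitespace so a reflowed copy on another site hashes the same.
    normalized = " ".join(text.split())
    payload = f"{contributor}\n{normalized}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


@dataclass(frozen=True)
class Copy:
    contribution: str   # ID returned by contribution_id()
    contributor: str    # unique contributor identifier
    url: str            # where this particular copy lives
    is_original: bool   # True for the post that should be viewed in context


class ContributionIndex:
    """Toy in-memory index: contributor -> contributions -> known copies."""

    def __init__(self) -> None:
        self._by_contributor = defaultdict(set)
        self._copies = defaultdict(list)

    def record(self, contributor: str, text: str, url: str,
               is_original: bool = False) -> str:
        cid = contribution_id(contributor, text)
        self._by_contributor[contributor].add(cid)
        self._copies[cid].append(Copy(cid, contributor, url, is_original))
        return cid

    def contributions_by(self, contributor: str) -> list:
        # .get() avoids creating empty entries for unknown contributors.
        return sorted(self._by_contributor.get(contributor, set()))

    def copies_of(self, cid: str) -> list:
        return list(self._copies.get(cid, []))
```

With an index like this, recording the same text from a forum and from an archive yields one contribution with two known copies, and looking up a contributor returns every contribution regardless of which site hosted it. A real system would also need to handle edited or partially quoted text, which a plain hash cannot match.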

In order to meet these goals and provide these features, dSRCI consists of the following parts:
  • A new .sci top-level-domain under which all bona fide scientists will be assigned their own perpetual domain name for use as a unique identifier that will never (ever) expire.
  • A simplified AoC (Artifact of Collaboration) identifier format so each and every individual contribution made by a scientist can be indexed and searched for regardless of where it is posted, transferred, quoted, or archived.
  • A citation standard that encompasses both the scientist and the AoC identifiers as well as all the standard citation information but in a format that is more concise than current Resource Description Framework (RDF) and far more precise than Dublin Core (DC).
  • A new vocabulary to supplement existing RDF standards and vocabularies.
  • Software to make all this so easy to use that people, including busy or distracted scientists, will actually use it.
In subsequent posts, I will elaborate on each of these subsystems within the infrastructure. Please keep in mind that this system is not meant to be the be-all and end-all of scientific collaboration. It does not address the data access issues created by the lack of flexible yet extensible standards for data storage. I plan to tackle that problem at a later date. Nor does dSRCI provide the actual collaboration tools, such as forums or wikis. What dSRCI does is bring all the collaboration metadata to the surface so it can be examined and analyzed. dSRCI provides a means for employers and “connectome” researchers to easily pull on that bundle of strings and get a good look at all of a scientist’s beautiful Easter eggs. These “connectome” researchers will be able to analyze all the connections between scientists and their ideas. It will be possible to trace the evolution of an idea no matter where it sprouts up and follow that idea throughout its life, watching how it takes a little “DNA” from other ideas and finally grows into yet another solution for our world, all in real time. In addition, and perhaps more importantly, it will be possible to highlight ideas that may be at risk of “extinction” and show them to people who may have just the right expertise to bring them back to life. We won’t have to just hope that the right person happens to read the right blog post at just the right time. Software will be able to automatically spot potential matches between people and ideas and bring them together regardless of how scattered they may be. Now that is when scientific progress will really take off.
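
As a rough illustration of the kind of analysis described above, here is a minimal sketch that assumes the citation metadata has already been collected into a simple mapping from each artifact to the artifacts it cites. The mapping, the "aoc-N" labels, and both function names are hypothetical. Tracing where an idea got its “DNA” then becomes a breadth-first walk over citations, and ideas that nobody has cited yet are one crude stand-in for those at risk of “extinction.”

```python
from collections import deque


def lineage(cites: dict, start: str) -> list:
    """Return every artifact that `start` builds on, nearest ancestors first."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for parent in cites.get(current, []):
            if parent not in seen:   # skip artifacts reached by another path
                seen.add(parent)
                order.append(parent)
                queue.append(parent)
    return order


def uncited(cites: dict) -> list:
    """Artifacts that no other artifact cites (a crude 'at risk' signal)."""
    cited = {parent for parents in cites.values() for parent in parents}
    return sorted(artifact for artifact in cites if artifact not in cited)


# Hypothetical data: each key cites the artifacts in its list.
example = {
    "aoc-1": [],
    "aoc-2": ["aoc-1"],
    "aoc-3": ["aoc-1"],
    "aoc-4": ["aoc-2", "aoc-3"],
    "aoc-5": [],
}
print(lineage(example, "aoc-4"))  # ['aoc-2', 'aoc-3', 'aoc-1']
print(uncited(example))           # ['aoc-4', 'aoc-5']
```

Note that "uncited" also flags the newest idea in a chain, so a real analysis would need to weigh age, topic, and activity before calling anything endangered.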

The contents of this post are Copyright © 2012 by Grant Sheridan Robertson.
