Sunday, September 20, 2009

Self Healing Hyperlinks

I'm an avid user of Microsoft OneNote 2007. I keep all my notes in it. I even wrote the outline and first draft for this post using it. I upgraded to the 2007 version specifically because it allowed the creation of hyperlinks between documents. Unfortunately those hyperlinks aren't worth a darn because if you move a page then all the links to anything on that page get broken, even links from within the same page. Many links don't even work correctly when you first make them. It is incredibly frustrating.

So, I have been stewing on a way to create a note-taking application based on HTML rather than Microsoft's proprietary format. I quickly realized that links created within this new application would also break as soon as the user moved a page in the collection. Sure, I could require the user to always and only use the application to move the pages then have the app update all the links to a page whenever it is moved. However, this would only work if the user made sure to use the application to move the pages and never forgot and simply moved them manually. It would also make moving pages pretty darn slow because it would have to search through every page in the system to find links to update. Therefore, I have been trying to think of a way to quickly find the new page location and update the links as necessary.

In what seems like a separate issue, I have noticed that academic papers often exist in multiple different locations all over the internet. Sometimes the file is named appropriately but oftentimes it is not. Sometimes there is good descriptive text surrounding the link to the file but oftentimes not. Sometimes the original file can still be found exactly where you first referenced it five years ago but usually not. This means that finding a current copy of an original academic paper for which you only have old citation information can be quite daunting. So I have also been trying to think of a way to use a single link to refer to any one of the multiple identical copies of a document, no matter where it is actually located on the internet, and instantly retrieve that document even if the original is no longer in place.

I had been thinking about using some kind of indexing system to enable one (or one's browser) to find these moved web pages. This morning, as I was waking up it finally hit me how to solve both of these problems and eliminate the vast majority of 404 errors at the same time. I call this system "Self Healing Hyperlinks."

The basis of the system is to insert additional information into the URL in a link so that either the target web server or the user's browser can find that target even if it has been moved. This additional information consists of domain-unique and/or globally unique HTML element ID values, which are included as attributes in the elements of the link target. The system also requires an indexing engine to be installed as a plug-in for the web-server software in order to index and look up these element IDs. When a broken link sends a browser to the target web site, that web server can look up the new location in its index rather than return a 404 error. One or more global indexing servers would also be set up to crawl the internet looking for documents that contain these special element IDs. Then, when a browser cannot find a target that was linked to using this additional information and the target web server did not return a replacement page, the browser can query the global link database and still find the document. The system does not require any additional scripting in the web pages or on the server. The web server and browser plug-ins would do all the work.

Identification of the Link Target:

This section describes how the linked-to web-page will need to be modified in order to accommodate this system.

Each link target gets a unique element ID. Now, we are all used to element IDs being unique within a web page. For this system the IDs must be unique within either a domain or (optionally) globally unique. Because many links are simply to a specific web page, the body tag of each web page would need to have a unique ID. Once assigned, that ID shouldn't change even if the name of the page is changed or it is moved to a different location.
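
To make this a little more concrete, here is a minimal sketch of how an indexer might collect those element IDs from a page and remember where each one currently lives. The function name `index_page`, the dictionary layout, and the sample IDs are all my own assumptions for illustration, not part of any existing web-server plug-in.

```python
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collects the id attribute of every element, along with its tag name."""

    def __init__(self):
        super().__init__()
        self.ids = {}  # element ID -> tag name

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "id" and value:
                self.ids[value] = tag


def index_page(index, path, html):
    """Record the current location of every ID in one page (hypothetical layout)."""
    collector = IdCollector()
    collector.feed(html)
    for element_id, tag in collector.ids.items():
        # The body ID stands for the whole page, so it needs no fragment.
        index[element_id] = path if tag == "body" else f"{path}#{element_id}"
    return index


page = '<html><body id="notes-0001"><p id="notes-0002">Hello</p></body></html>'
index = index_page({}, "/notes/today.html", page)
```

Because the IDs never change, re-running `index_page` on the same page after it has been moved simply overwrites the old locations with the new ones.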

A domain-unique ID can be made globally-unique simply by prefacing it with the current domain name of the target. For instance: A link to a domain-unique ID could still be broken if the file were moved to an entirely different server. However, if the file also has a globally-unique ID then it could still be located via a global indexing server.

The globally-unique element IDs present a particular problem. How can we specify that an ID is intended to be globally-unique and then ensure that it truly is unique within the entire world? Two possibilities exist: A global registrar (we have seen how well this works) or simply prefacing a domain-unique ID with the current domain name of the target and an additional code to indicate that the ID should be indexed globally. Something like may suffice depending on how many domains in the world have a server named "global." Perhaps some other term could be chosen that doesn't currently conflict with any known server name. Once that globally-unique ID has been assigned to that element it should not be changed even if the element (or page) is moved to a different server.

Now, about those unique element IDs: These could be human readable text (without spaces, naturally) or simply a sequential alphanumeric code. Thirty case-insensitive alphanumeric characters could encode almost 5 x 10^46 different unique elements. Web designers could simply use sequential codes for most of the elements in their web pages and human-readable codes for elements that they expect to have a lot of links to. It might even be prudent to use a domain-unique code for each and every element in each and every page in an entire site. This would guarantee that whatever link anyone created to any part of any page on your entire site could be healed using this system. It would also make it easier for people to refer to specific parts of your web pages for citations and such. (I know I personally am getting really tired of seeing citations that only point to the home page of a huge web site.) Again, once these codes are assigned and published to the web, they should not be changed even if the HTML element is moved to a different location.
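
The size of that ID space is easy to check. Assuming the alphabet is the 26 letters (ignoring case) plus the 10 digits, and that every ID is exactly thirty characters long, the count is 36 raised to the 30th power, which comes out a little under 5 x 10^46:

```python
# 26 case-insensitive letters plus 10 digits (an assumption about the alphabet).
ALPHABET_SIZE = 26 + 10
ID_LENGTH = 30

# Number of distinct IDs of exactly ID_LENGTH characters.
total_ids = ALPHABET_SIZE ** ID_LENGTH

print(f"{total_ids:.2e}")
```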

There are several commercial services which offer to assign a globally unique number to a document for a fee. The only thing special about these numbers is that they are guaranteed to be globally unique and can therefore be searched for within databases and on the web. Prefixing a code number with a domain name allows anyone to create their own list of globally unique codes, obviating the need to pay the exorbitant fees. (This paragraph added 10/14/2010.)

Specification of link target in the linking document:

This section explains how to properly construct links to the targets marked with these unique element IDs.

All links would continue to have the standard href attribute with "domain name / file path # fragment" structure. However, additional link information would be included within the URL as a CGI name/value pair. Naturally, a standard name for the CGI parameter would have to be chosen. I would not presume to know what that should be. I'll leave that up to the W3C. The value would be the domain-unique or globally unique element ID of the specific element that is the target of the link. If the whole page is the target then the element ID of the body tag is used. Optionally, the unique element ID given in the CGI name/value pair could be replaced with the entire XPath within the document to get to the target element. Then, if the web server plug-in cannot find the exact element, it might be able to find the parent element where the target used to reside. At the very least it will be able to find the original web page, even if it has moved.
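
As a rough illustration of what such a link might look like, the sketch below adds the name/value pair to an ordinary URL and reads it back out again. The parameter name `shl-id` is purely a placeholder I made up for this example (the real name would be up to a standards body), and the URL and ID are invented as well.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

HEAL_PARAM = "shl-id"  # hypothetical parameter name, not a standard


def add_heal_id(url, element_id):
    """Append the unique element ID to the query string, keeping the fragment."""
    parts = urlsplit(url)
    query = parse_qsl(parts.query, keep_blank_values=True)
    query.append((HEAL_PARAM, element_id))
    return urlunsplit(parts._replace(query=urlencode(query)))


def read_heal_id(url):
    """Return the unique element ID carried in the URL, or None if absent."""
    query = dict(parse_qsl(urlsplit(url).query))
    return query.get(HEAL_PARAM)


link = add_heal_id("http://example.com/papers/draft.html#intro", "paper-0042")
```

The standard path and fragment are left untouched, so a server or browser that knows nothing about this system would still follow the link the usual way.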

Resolving broken links:

Remember, these procedures do not come into play unless the standard link is broken.

Web servers will incorporate an indexing engine that will index all the ID codes in all the elements on the site and store them in a database. Then, rather than simply returning a 404 Error page, the target web server would follow the procedure below:

  1. First, the target web server would search its index of domain-unique element-IDs to find a match and return the correct page.
    • The returned header could indicate that the link was broken but healed and indicate the new target location. Browsers or plug-ins could read this header and offer to update any links that may be in the user's bookmark collection.
  2. If the specified target ID could not be found then the server would look for the parent element's ID in the index and recurse up to find whatever parent was available.
  3. If no parent element is available then a parent document could be substituted.
    • This would require the database to maintain a site map history so it knows what used to be where. This is far better than an HTML redirect because there is no need to maintain the old folder structure just to keep a huge collection of identical .htaccess files. In addition, it would work even if the site was transferred to another server without all those pesky .htaccess files. The indexing plug-in would simply reindex the site and away you go.
  4. If no appropriate parent can be found and the link URL contained a globally-unique ID then the web server has two choices:
    1. Simply send a 404 error and let the browser deal with it.
    2. Consult one of many global-search-engines to locate the appropriate page and redirect the browser to that page.
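
The fallback order above can be sketched in a few lines. This is only an outline under my own assumptions: the three dictionaries stand in for whatever database the indexing plug-in would really keep, and the status strings are invented labels rather than anything a real server returns.

```python
def resolve(target_id, id_index, parent_of, old_page_of):
    """Return (status, location) using the fallback order from the list above.

    id_index:    element ID -> current location (the site's index)
    parent_of:   element ID -> parent element ID (remembered from earlier crawls)
    old_page_of: element ID -> body ID of the document that used to hold it
    All three mappings are hypothetical stand-ins for the plug-in's database.
    """
    # Step 1: exact match on the requested element ID.
    if target_id in id_index:
        return "healed", id_index[target_id]

    # Step 2: walk up the remembered parents until one is still indexed.
    seen = set()
    current = parent_of.get(target_id)
    while current is not None and current not in seen:
        if current in id_index:
            return "healed-parent", id_index[current]
        seen.add(current)  # guards against a cycle in bad history data
        current = parent_of.get(current)

    # Step 3: fall back to the document that used to contain the element.
    page_id = old_page_of.get(target_id)
    if page_id in id_index:
        return "healed-document", id_index[page_id]

    # Step 4: nothing local; send a 404 or consult a global index.
    return "not-found", None
```

For example, if a paragraph has been deleted but the section that held it still exists, `resolve` returns the section's current location with the status `"healed-parent"`.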

If the web server returns a 404 error or is down then the browser can check for a globally-unique ID in the URL and look that up in a global-search-engine. The browser would then retrieve that page instead.
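
On the browser side, the same idea might look something like the sketch below. The `fetch` and `global_lookup` callables are hypothetical: `fetch(url)` is assumed to return a `(status_code, body)` pair or raise `OSError` when the server is down, and `global_lookup(global_id)` is assumed to return a list of candidate URLs from a global index.

```python
def fetch_with_fallback(url, global_id, fetch, global_lookup):
    """Try the normal link first, then any copies known to a global index.

    Returns (url_used, body); body is None if nothing could be retrieved.
    """
    try:
        status, body = fetch(url)
    except OSError:
        status, body = None, None  # the target server is down

    if status == 200:
        return url, body
    if global_id is None:
        return url, None  # ordinary link with nothing to heal it

    # The standard link failed, so try each copy the global index knows about.
    for candidate in global_lookup(global_id):
        try:
            status, body = fetch(candidate)
        except OSError:
            continue
        if status == 200:
            return candidate, body
    return url, None
```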

Uses beyond simply eliminating 404 Error pages:

Automatically repairing broken links in your own web site.

Many web-site management applications have a feature to check all the links on a site for broken links. However, all they can do is report those broken links. It is up to the web designer to manually find the new link location and repair the links. By reading the link-update info returned in the response header, the link checker software could instead repair the links with no interaction by the web designer, or present a list of repairs and let the designer select which ones to apply automatically.
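
Here is a rough sketch of how a link checker might do that. The header name `X-Healed-Location` is something I invented for this example, and `check(url)` is a hypothetical callable assumed to return the response headers for a URL as a dictionary.

```python
HEALED_HEADER = "X-Healed-Location"  # hypothetical header name


def plan_repairs(links, check):
    """Return {old_url: new_url} for every link the server reports as healed."""
    repairs = {}
    for url in links:
        new_location = check(url).get(HEALED_HEADER)
        if new_location and new_location != url:
            repairs[url] = new_location
    return repairs


def apply_repairs(html, repairs):
    """Rewrite double-quoted href attributes that match a planned repair."""
    for old, new in repairs.items():
        html = html.replace(f'href="{old}"', f'href="{new}"')
    return html
```

Keeping the planning step separate from the rewriting step is what would let the software show the designer a list of proposed repairs before touching any files.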

Creating a globally unique URI for important documents such as academic papers, books, and laws so they can always be automatically found even if the original file is moved or deleted, or even if the original web server is down.

Any instance of that document on the internet would be tagged with the same globally-unique ID. Then, even if that document was deleted from one web site, the global-search-engine could return the location of all other instances of that same document. It is important to note that the globally-unique IDs would be assigned by the creator of the document or the institution where they were created. Once assigned they would not be changed. Persons posting that same document in another location should use that same element ID so that it can be found and cross referenced by the global indexing engine.

Adobe Acrobat .PDF files could incorporate the globally-unique ID within the title page of the document so that the global indexing servers could find and index the document even if it is not referenced by an element that contains the correct globally-unique element ID. This ID could also be included in the XMP metadata embedded within the document itself. In fact, these unique IDs could even be embedded in the metadata of image files or sound files so that they too could be located regardless of where they have been moved to.

Can even be used within an application that maintains multiple HTML documents on a user's personal computer (for note-taking or what-have-you).

Currently, if a user has a large collection of HTML documents on their computer hard drive that have extensive links between the documents, moving a document could potentially break a lot of links. Some web design software offers to update all the links within a site if the user moves a page. However, this still leaves a lot to be desired. It only works if the user uses the software to move the page. If the user uses other means to move the page then the software loses track of the links. Updating all the links requires searching throughout the entire collection, which can take quite a bit of time. It can only work within a single site. If the user has designated various folders on their hard drive as various sites then the software will not be able to accommodate changes across multiple sites.

Creative Commons License
Self Healing Hyperlinks by Grant Sheridan Robertson is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States License.
However, anyone is welcome to implement this idea in open source software. I will likely never be able to write plug-ins for web servers.
