Persistent Citation Resolver Service
Motivation
Although persistent identifiers solve a range of problems to do with data management and accessibility, the current default web infrastructure privileges the immediate http: URI of a resource over other identifiers, whether that http: URI is persistent or not. For instance:
User behaviour: even if a resource presents a persistent identifier to the user, users of current-generation browsers will normally cite the resource by the http: URI through which the browser retrieved it, whether or not that http: URI is persistent. Such citation includes bookmarking in the browser, and cutting and pasting the contents of the address bar.
Application behaviour: if the persistent identifier is not an http: URI, normal browsers and environments tend not to be able to action it (custom plug-ins notwithstanding), so such identifiers are dispreferred for citation in a web context.
Redirection behaviour: when a persistent http: URI redirects to a non-persistent http: URI, the non-persistent http: URI is what the browser displays, and that is what users will continue to cite.
This means that non-persistent identifiers are still being cited, because they are web-resolvable. While it is preferable to prevent them from being cited when a persistent alternative is available, that is not always practical.
Several solutions are possible for this issue:
Modify user behaviour so that users cite identifiers only in recommended best-practice persistent citation formats.
Prevent the non-persistent URI from being exposed to the user, e.g. through a proxy pre-fetching the content to be displayed. (Note that this would disrupt any relative hyperlinks in the retrieved resource.)
Provide universal browser support for persistent non-http: URI identifier schemes, and educate users accordingly.
These solutions may be desirable, but are certainly impractical in the short term. The service described here offers a distinct alternative:
Allow users to continue citing non-persistent URIs as they currently do, but provide a persistent citation service which resolves those non-persistent identifiers to their persistent counterparts. Once that resolution takes place, the user will be able to access the resource through its persistent identifier, even if the non-persistent identifier is no longer resolvable.
The solution as realised here uses Handle as the persistent identifier for the resource. It exploits the fact that Handle has a historical record of which non-persistent URLs the persistent identifier resolves to. This allows the solution to map back from the non-persistent URL to the persistent identifier. Because Handle as a protocol does not readily expose this historical information to outside consumers, the mapping is made possible within the Handle protocol by creating a second, “reverse lookup” Handle for each Handle–URL pair. This reverse lookup Handle uses the non-persistent URL as its label, and is aliased to the preferred persistent identifier.
Other realisations of this solution are possible, as are other solutions.
Description
Assume the normal use case: a URL points to an object on a repository, and a Handle resolves to that URL. So both the Handle and the URL allow retrieval of the object.
If we now move the object to a new URL, and update the Handle, the new URL and the Handle still allow retrieval of the object. But the old URL, which is likely what people have bookmarked, does not.
However, since we know that the Handle was formerly associated with the old URL, we should be able to map the old URL to the current Handle through a persistent citation service.
The persistent citation service therefore interrogates a database of historical Handle–URL pairings. Given the old URL, it retrieves the Handle associated with it.
The database of historical Handle–URL pairings is realised as a Handle itself. If Handle A/X has been resolvable to URL u, then we create a new, “reverse lookup” Handle RLS/U, aliased to the Handle A/X. The Handle server RLS is a Handle server containing these reverse lookup Handles: it may be the same Handle server as A, or it may be a dedicated Handle server (see discussion below). U is an encoding of the URL u which respects the label constraints of the Handle server RLS. In this solution, we do not require reverse lookup Handles to be URL-safe, so U can be identical to u.
This solution avoids the presumption that a Handle server can be queried to return a Handle given a URL: this is not directly possible in the default implementation, although it is possible if the server is deployed with a relational database backend for its Handle records.
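The naming scheme can be sketched in a few lines of Python. This is a minimal illustration only: the function names and the in-memory dictionary are hypothetical stand-ins for a real reverse lookup Handle server, and the prefix 102.rls and the Handle 1159/312 are the illustrative values used elsewhere in this document. The URL is used verbatim as the label, i.e. U is identical to u.

```python
from typing import Dict, Optional


def reverse_lookup_handle(url: str, rls_prefix: str = "102.rls") -> str:
    """Return the reverse lookup Handle name RLS/U for a URL u (here U == u)."""
    if not url:
        raise ValueError("url must be non-empty")
    return f"{rls_prefix}/{url}"


# Hypothetical stand-in for the reverse lookup Handle server:
# each reverse lookup Handle is aliased to one persistent Handle.
reverse_lookup_records: Dict[str, str] = {
    reverse_lookup_handle("http://example.com/a.pdf"): "1159/312",
}


def handle_for_url(url: str) -> Optional[str]:
    """Map a URL back to its persistent Handle with a direct key lookup."""
    return reverse_lookup_records.get(reverse_lookup_handle(url))
```

Because the URL is part of the Handle name, the reverse mapping is an ordinary Handle resolution rather than a search over the values stored in other Handle records.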
To illustrate:
hdl:1159/312 used to redirect to http://example.com/a.pdf
hdl:1159/312 will now point to http://example1.com/x/a.pdf
The Reverse Lookup Handle server has the naming authority 102.rls.
The reverse lookup Handle hdl:102.rls/http://example.com/a.pdf is created, redirecting to hdl:1159/312
This handle may be created simultaneously with creating or updating hdl:1159/312.
Alternatively, the initial installation of the persistent citation service for the Handle server 1159 walks through all registered hdl:1159 Handles and creates corresponding hdl:102.rls/ records.
The Handle hdl:1159/312 is updated to point to http://example1.com/x/a.pdf
The Handle hdl:102.rls/http://example1.com/x/a.pdf is created, redirecting to hdl:1159/312
The instance of the object at http://example.com/a.pdf is deleted.
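The sequence above can be modelled with a toy in-memory registry. This is a sketch under stated assumptions, not a Handle server client: the class HandleRegistry and its methods are hypothetical, and a real deployment would create these records through its Handle administration tooling. The point it shows is that every create or update of a Handle also writes a reverse lookup record for the URL involved.

```python
from typing import Dict, Optional


class HandleRegistry:
    """Toy model of a Handle server together with its reverse lookup server."""

    def __init__(self, rls_prefix: str = "102.rls") -> None:
        self.rls_prefix = rls_prefix
        self.current_url: Dict[str, str] = {}  # persistent Handle -> current URL
        self.aliases: Dict[str, str] = {}      # reverse lookup Handle -> persistent Handle

    def set_url(self, handle: str, url: str) -> None:
        """Create or update a Handle and record the pairing for reverse lookup."""
        self.current_url[handle] = url
        self.aliases[f"{self.rls_prefix}/{url}"] = handle

    def lookup(self, url: str) -> Optional[str]:
        """Resolve any URL ever paired with a Handle to that Handle's current URL."""
        handle = self.aliases.get(f"{self.rls_prefix}/{url}")
        return None if handle is None else self.current_url[handle]


registry = HandleRegistry()
registry.set_url("1159/312", "http://example.com/a.pdf")      # original location
registry.set_url("1159/312", "http://example1.com/x/a.pdf")   # object moved
print(registry.lookup("http://example.com/a.pdf"))
```

After the second call, both the old and the new URL have reverse lookup records aliased to 1159/312, so the old URL still leads to the object's current location.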
Each reverse lookup Handle server defines its own instance of a persistent citation service. We can presume that each administrator of a normal persistent Handle service will define a corresponding reverse lookup Handle server to store the history of Handle–URL pairs on their server.
Resolving an obsolete URL to its current location involves the following steps:
(Optional) Attempt to resolve the obsolete URL and fail.
Determine the persistent citation service instance appropriate for the URL. Formulate a persistent citation query.
(Alternative) Intercept the attempt to resolve the obsolete URL. Redirect it to a service call to resolve the reverse lookup Handle based on the URL, through the persistent citation service instance appropriate for the URL.
This involves prefixing the reverse lookup Handle naming authority and Handle resolver URL to the obsolete URL. It may also involve encoding the URL into a form compatible with the reverse lookup Handle server.
The reverse lookup Handle resolves to its alias Handle, which is the persistent Handle identifier for the resource.
The persistent Handle resolves in turn to the current URL for the resource.
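The query formulation step can be sketched as follows. The function name citation_query and its parameters are hypothetical; the resolver URL and naming authority are the illustrative values from this document. Percent-encoding is shown as one possible encoding, on the assumption that the reverse lookup Handles were registered under the same encoded labels; whichever form is chosen must match the form used at registration.

```python
from urllib.parse import quote


def citation_query(
    obsolete_url: str,
    rls_prefix: str = "102.rls",
    resolver: str = "http://hdl.handle.net/",
    encode: bool = False,
) -> str:
    """Prefix the resolver URL and naming authority to the obsolete URL."""
    # Optionally percent-encode the whole URL so it forms a single URL-safe label.
    label = quote(obsolete_url, safe="") if encode else obsolete_url
    return f"{resolver.rstrip('/')}/{rls_prefix}/{label}"


print(citation_query("http://example.com/a.pdf"))
print(citation_query("http://example.com/a.pdf", encode=True))
```

The first call yields the unencoded query used in the workflows in this document; the second shows the same query with the URL encoded into a single path segment.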
The interception of the obsolete URL may take the form of an Apache rewrite rule, so long as it is known that an entire class of URLs on the server is obsolete and is covered by a particular persistent citation service instance. For example, the following would be added to the httpd.conf file in an Apache installation, in order to redirect all requests for /newrep/… on the server to a persistent citation lookup on Handle server hdl:102.rls:
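# Server-level rule: redirect requests under /newrep/ to the reverse lookup Handle for the requested URL
RewriteEngine on
RewriteRule ^/newrep/(.*) http://hdl.handle.net/102.rls/http://%{HTTP_HOST}%{REQUEST_URI} [R,L]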
Note that if URLs use HTTPS rather than HTTP, that information cannot be read from the SERVER_PROTOCOL variable in Apache, which reports the protocol version of the request (e.g. HTTP/1.1) rather than the URL scheme. Implementers may instead need to base their rewrite rules on the HTTPS variable.
The following workflow illustrates the use of the persistent citation service as described, with an end user manually triggering the appropriate persistent citation service:
A user tries to access http://example.com/a.pdf
example.com fails to retrieve an object at the URL http://example.com/a.pdf
The user somehow determines that the persistent citation service instance corresponding to http://example.com/a.pdf is at http://hdl.handle.net/102.rls/
The user formulates a query for the persistent citation service as http://hdl.handle.net/102.rls/http://example.com/a.pdf . (This may involve the user typing the query directly, or else it may involve some intermediate interface.)
The following workflow illustrates the use of the persistent citation service as described, in a way transparent to the user:
A user tries to access http://example.com/a.pdf
example.com intercepts the request for http://example.com/a.pdf , and redirects it to a persistent citation service instance, as the Handle resolution query http://hdl.handle.net/102.rls/http://example.com/a.pdf
Handle resolves this to hdl:1159/312 , and resolves that in turn to http://example1.com/x/a.pdf
The following alternative flows are possible if the redirection is not successful:
Alt 1. example.com is down. Query is not processed.
Alt 2. example.com is not enabled to intercept URLs. Query fails with ERROR 404.
Alt 3. The reverse lookup Handle http://hdl.handle.net/102.rls/http://example.com/a.pdf has not been created. User gets a Handle-branded error page from http://hdl.handle.net/102.rls. If hdl:102.rls is a dedicated reverse lookup service, the error page can be enhanced to indicate why the lookup failed (e.g. “This resource used to be here, but example.com hasn’t updated its location.”).
Alt 4. example1.com times out or fails. Since the redirection has succeeded, the user should realise that any queries on where the object is should be directed to example1.com and not example.com (or hdl:102.rls).
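The transparent workflow and its four alternative flows can be summarised as a small decision function. This is a simulation only, with no network access: the function name, its parameters, and the outcome labels are hypothetical, and the sets and dictionaries stand in for the state of the hosts and Handle servers involved.

```python
from typing import Dict, Optional, Set, Tuple
from urllib.parse import urlsplit


def resolve_transparently(
    url: str,
    hosts_up: Set[str],            # hosts currently answering requests
    intercepting_hosts: Set[str],  # hosts configured to redirect obsolete URLs
    aliases: Dict[str, str],       # reverse lookup Handle -> persistent Handle
    handles: Dict[str, str],       # persistent Handle -> current URL
    rls_prefix: str = "102.rls",
) -> Tuple[str, Optional[str]]:
    """Return (outcome, current URL if one was found)."""
    host = urlsplit(url).netloc
    if host not in hosts_up:
        return ("alt1: original host is down", None)
    if host not in intercepting_hosts:
        return ("alt2: original host returns 404", None)
    handle = aliases.get(f"{rls_prefix}/{url}")
    if handle is None:
        return ("alt3: reverse lookup Handle not created", None)
    target = handles[handle]
    if urlsplit(target).netloc not in hosts_up:
        return ("alt4: current host unavailable", target)
    return ("ok", target)


aliases = {"102.rls/http://example.com/a.pdf": "1159/312"}
handles = {"1159/312": "http://example1.com/x/a.pdf"}
both_up = {"example.com", "example1.com"}
print(resolve_transparently("http://example.com/a.pdf", both_up, {"example.com"}, aliases, handles))
```

Note that in the Alt 4 case the function still returns the current URL: the redirection itself has succeeded, which is what tells the user where to direct further queries.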
Applicability
This solution presumes that any URL in its scope is only ever associated with one Handle (identifier universality within its context); that is, the mapping from Handles to URLs has a functional inverse (it is one-to-one or one-to-many). If more than one persistent identifier has resolved to the same URL over its lifespan, then there is ambiguity about which persistent identifier should be returned by the persistent citation service.
This solution presumes that all relevant historical data about URLs associated with Handles can be retrieved from the service. This means that the reverse lookup Handles must be populated throughout the lifespan of the non-persistent URL.
This solution presumes that reverse lookup Handles are to be maintained more persistently than the non-persistent URLs they operate on. As a utility service, there is a high expectation of persistence and reliability on the reverse lookup Handle server, and on any persistent citation resolver service based on it.
This solution presumes that the persistent citation service instance specific to the given URL can be determined uniquely. If there are multiple reverse lookup Handle servers corresponding to the URL, and each records a different corresponding Handle alias, then the user will not be able to decide which persistent identifier to prefer. If there is no way for a user to determine which instance corresponds to the given URL, then the persistent citation service cannot be triggered.
The fewer persistent citation service instances there are to choose from, the more realisable the solution: either one instance per institutional Handle server (so the instance is predictable given the repository), or else one instance nationally or internationally, as a catch-all. The latter alternative imposes a significant administrative burden on the service administrator: all participating Handle administrators need to be given write access to the centralised reverse lookup Handle server.
If it is to be used transparently to the end user, this solution presumes that requests for an obsolete URL can be intercepted, and redirected to a persistent citation service. This is impossible if a host is no longer maintained at the URL site, or if the host is incapable of selective redirection.
This solution presupposes that the reverse lookup Handle server will be updated promptly whenever a Handle in a client Handle server is associated with a new URL. This means that creating reverse lookup Handles must be integrated with the normal workflows of a persistent identifier-enabled repository. The integration is easiest if the same authority manages both Handle servers.
This solution uncouples the reverse lookup Handle server from the normal Handle server populating it. As a result, this solution allows that a reverse lookup Handle server might be shared between several normal Handle servers: it can be managed centrally or else by a federation of identifier providers. Uncoupling the two servers also minimises administrative confusion.
This solution also allows the reverse lookup Handle server to be bound to the normal Handle server populating it. The two may be the same Handle server, so long as there is no risk of collision between the normal Handles hosted on the server and the reverse lookup Handles. The easiest way of ensuring this is imposing a label format policy on normal Handles which the reverse lookup Handles cannot satisfy; e.g. banning http:// prefixes from labels or making labels URL-safe.
This solution remains applicable if the Handle ends up replaced by another Handle (e.g. when the resource is transferred to a different institution), so long as all institutions involved in the chain of transfer of Handles maintain aliases for each other. In other words, the solution allows transitivity. For example:
A resource at institution A has Handle A/X, resolving to URL u1.
The resource is transferred to institution B, where it has Handle B/Y, resolving to URL u2.
As part of the transfer, A/X is aliased to B/Y.
The reverse lookup service maps URL u1 to Handle A/X.
Handle A/X resolves to B/Y, which resolves in turn to u2.
Therefore, given u1, the persistent citation service can resolve to u2, despite the change in Handle.
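The transitivity argument amounts to following a chain of aliases until a URL is reached. A minimal sketch, assuming a hypothetical record format in which each Handle holds either an alias to another Handle or a URL; the names RLS/u1, A/X, B/Y and u2 are the placeholders from the example above, and the loop guard is an added precaution rather than part of the described service.

```python
from typing import Dict, Tuple

# Each Handle maps to ("alias", another Handle) or ("url", a URL).
Records = Dict[str, Tuple[str, str]]


def resolve_handle(handle: str, records: Records, max_hops: int = 10) -> str:
    """Follow aliases from a Handle until a URL record is found."""
    seen = set()
    for _ in range(max_hops):
        if handle in seen:
            raise ValueError(f"alias loop at {handle}")
        seen.add(handle)
        kind, value = records[handle]  # KeyError if an institution dropped its alias
        if kind == "url":
            return value
        handle = value                 # follow the alias to the next Handle
    raise ValueError("too many alias hops")


records: Records = {
    "RLS/u1": ("alias", "A/X"),  # reverse lookup Handle for the old URL u1
    "A/X": ("alias", "B/Y"),     # institution A aliases its Handle to B's
    "B/Y": ("url", "u2"),        # institution B's Handle points at the current URL
}
print(resolve_handle("RLS/u1", records))
```

If any institution in the chain stops maintaining its alias, the lookup fails at that hop, which is why the text requires every institution in the chain of transfer to keep its aliases in place.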




