A resolution service for IATI Organisational Identifiers

I've been prototyping a service to extend the data available about any organisation identifier used in IATI activity documents.  This includes data about the 'realm' or namespace of the identifier as well as about the referenced organisation itself.

http://opencirce.org/org

For example GB-COH-7676886 is a structured code used in IATI.  The URL

http://opencirce.org/org/code/GB-COH-7676886

or

http://opencirce.org/org/realm/GB-COH/code/7676886

provides information about the realm (GB Company Numbers), the sites which provide information about this realm, and the specific organisation, "Publish What You Fund".  This includes the name (taken from the XML feed from OpenCorporates) and links to the relevant pages on Companies House and OpenCorporates.

The application itself contains no organisation data. Instead it contains descriptors of the realms in an XML document, with templates to derive the URLs of related pages. Currently there are six 'realms' described with varying degrees of detail.
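
As a rough illustration of the idea of a realm descriptor with URL templates, here is a minimal sketch in plain XQuery. It is not the service's actual descriptor format: the element and attribute names (realm, link, template), the [code] placeholder and the example.org URLs are all assumptions made up for this example.

```xquery
xquery version "1.0";

(: Hypothetical realm descriptors; names, placeholder syntax and URLs are assumptions :)
declare variable $local:realms :=
  <realms>
    <realm code="GB-COH">
      <name>GB Company Numbers</name>
      <link title="Registry page" template="http://example.org/registry/[code]"/>
      <link title="Aggregator page" template="http://example.org/companies/gb/[code]"/>
    </realm>
  </realms>;

(: Split an identifier into realm and code, then expand each URL template :)
declare function local:resolve($id as xs:string) as element(organisation)* {
  for $realm in $local:realms/realm[starts-with($id, concat(@code, "-"))]
  let $code := substring-after($id, concat($realm/@code, "-"))
  return
    <organisation realm="{$realm/@code}" code="{$code}">
      {
        for $link in $realm/link
        let $href := replace($link/@template, "\[code\]", $code)
        return <link title="{$link/@title}" href="{$href}"/>
      }
    </organisation>
};

local:resolve("GB-COH-7676886")
```

The point of the design is that adding a new realm only means adding a descriptor, not changing code.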

UK and US companies and charities can be linked quite well. However DAC identifiers are more difficult.  These codes are defined by the Development Assistance Committee of the OECD for their report and have been adopted for use in IATI. They identify over 400 government departments and NGOs but the OECD data provides only the name and abbreviation of the organisation.  I've appealed for further information on these codes on getthedata.org but have also started my own project to gather at least websites.  I'd be pleased to hear from anyone who could help with this task.

Of course none of this would be possible without the wonderful work of the data warriors behind OpenCorporates and OpenCharities.

Milestones and fragments in TEI with XQuery

Hierarchical structures are excellent for organising complex information, but a specific hierarchy is a design choice.  Often there are multiple ways in which the same information can be structured. A typical example is encountered in document structures.  One might represent the content as a hierarchy of chapters, sections, sub-sections and paragraphs.  One might also want to represent the physical structure of pages and lines.  The two structures conflict since a paragraph may be split across a page boundary.  In Jackson Structured Programming, a program design method developed by Michael Jackson which influenced me greatly in the '70s, this is called a structure clash.

How to represent both views of the content in a single XML document is a well-known problem. Jeni Tennison discusses the alternatives in her excellent blog post.

In TEI (Text Encoding Initiative), the practice is to represent the chapter/section structure as the main XML structure and use milestones to mark the page boundaries using page-breaks (pb elements).  In addition to providing the page number, these elements also provide a place to link to facsimile images. The convention is for page breaks to refer to the following page, but sometimes it's not clear, for example in the Punch corpus converted from Project Gutenberg text to TEI by the Oxford University Computing Services.

I first encountered this problem last year when working on the Virginia Secession project with Chris Kemp at the University of Richmond in Virginia. I searched for prior work on the problem but, not knowing the right terminology, missed the algorithm written by David Sewell which was used for the Java function util:get-fragment-between() in eXist.

So I set about writing my own function. In addition to extracting a subtree of the full document, I needed to expand a couple of domain-specific attributes. I also thought it would be necessary to mark nodes which had been truncated so continuation marks could be inserted into the rendered page.
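
To make the task concrete, here is a minimal sketch of extracting the fragment between two milestones in plain XQuery. It is not the algorithm documented in this post, nor David Sewell's: it is a naive recursive copy using the document-order operators, it does not mark truncated nodes, and the function names local:in-range and local:copy-between are made up for the example.

```xquery
xquery version "1.0";

(: True if $n lies strictly between the two milestones in document order :)
declare function local:in-range($n as node(), $start as node(), $end as node()) as xs:boolean {
  ($n >> $start) and ($n << $end)
};

(: Copy $n, keeping only content between $start and $end.
   Elements that straddle a milestone are kept but truncated. :)
declare function local:copy-between($n as node(), $start as node(), $end as node()) as node()* {
  typeswitch ($n)
    case element() return
      let $kids := for $c in $n/node() return local:copy-between($c, $start, $end)
      return
        if (local:in-range($n, $start, $end) or exists($kids))
        then element {node-name($n)} {$n/@*, $kids}
        else ()
    default return
      if (local:in-range($n, $start, $end)) then $n else ()
};

let $doc :=
  <text>
    <div><p>one <pb n="1"/>two</p><p>three</p></div>
    <div><p>four <pb n="2"/>five</p></div>
  </text>
return local:copy-between($doc, ($doc//pb)[1], ($doc//pb)[2])
```

For this small document the result should be the text element containing only "two", "three" and "four ", with the enclosing div and p elements preserved. One known weakness of the sketch: an element whose first child is the end milestone is copied as an empty element.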

When I later discovered the eXist function, although it was faster, it did not allow the customisation necessary in this application.  David Sewell's algorithm seemed somewhat slower and I wasn't clear how to mark truncated nodes.

I've finally got round to documenting my algorithm.  To document the algorithms and compare timings, I've added this problem to a comparison application I'm playing with.

http://kitwallace.co.uk/Book/set/fragment-between

Timings need to be treated with caution but they seem to indicate performance ratios between the Java function, my XQuery algorithm and David Sewell's algorithm of 1 : 4 : 12.  I need a full set of tests for these algorithms to ensure that my algorithm is sound. It is used in my version of the Punch example, with the zip of the full application.

An XQuery / TEI example - Punch 1914-1916 - revisited

Joe Wicentowski developed a very nice example of TEI processing using XQuery/eXist-db which he has used in teaching. I've just resurrected my version. This was originally written to use with the eXist URL-rewriting architecture. I've used it here as another test of the approach to URL-rewriting I'm working on.

 Punch

Zipped Resources

It's all too easy to get Apache into infinite rewriting loops.  I found it useful to adopt the convention of a capitalised name for the application, uncapitalised for its collection:

   RewriteRule ^/Punch/(.*)$  /punch/xquery/content.xq?_path=${escape:$1} [QSA,P]

The general function in the URL library creates the Context object:

declare function url:parse-path($steps) {
     if (count($steps) = 0)
     then ()
     else if (count($steps) = 1)
     then element {$steps[1]} {()}
     else  (element {$steps[1]} {$steps[2]}, url:parse-path(subsequence($steps,3)))  
};

declare function url:path-to-sig($steps) {
     if (count($steps) = 0)
     then ()
     else if (count($steps) = 1)
     then $steps[1]
     else  ($steps[1],"*",url:path-to-sig(subsequence($steps,3)))  
};

declare function url:get-context() as element(context) { 
   let $path := request:get-parameter("_path",())
   let $path := if (ends-with($path,"/")) then substring($path, 1, string-length($path) - 1) else $path
   let $steps := tokenize($path,"/")
   let $signature := string-join(url:path-to-sig($steps),"/")
   return
     element context {
       for $param in request:get-parameter-names()
       let $value := request:get-parameter($param,())
       return element {$param} {$value},
       element _signature {$signature},
       url:parse-path($steps)
     }
};

which is used to guide the construction of the page body:

declare function phtml:page($context) as element(div) {
let $sig := $context/_signature
return
<div>
  {phtml:search-form($context/q)}
  {if (exists($context/q))
   then
      let $hits := punch:section-selection($context/q)
      return
        (phtml:breadcrumbs(()), phtml:hits-in-context($hits))
   else
   if ($sig = ("","issue"))
   then
        (phtml:breadcrumbs(()), phtml:corpus-toc())
   else
   if ($sig = "issue/*")
   then
      let $issue := punch:issue($context/issue)
      return
           (phtml:breadcrumbs($issue), phtml:issue-toc($issue))
   else if ($sig = "issue/*/section/*")
   then
      let $issue := punch:issue($context/issue)
      let $section := punch:section($issue,$context/section)
      return
            (phtml:breadcrumbs($section), phtml:section($section))
   else ()
   }
</div>
};

Further functions build parts of the page, e.g. to build the table of contents:

declare function phtml:corpus-toc() as element(div){
   <div class="body"  id="corpus-toc">
       <ul>
            {
            for $issue in punch:all-issues()
            let $title := punch:issue-title($issue)
            let $issue-id := punch:issue-id($issue)
            order by $issue/@xml:id
            return
                <li>{$title}</li>
            }
        </ul>
   </div>
};

The main script builds the HTML, using ids in the returned page to distribute parts of the page around the web page:

import module namespace phtml = "http://kitwallace.me/punchhtml" at "../lib/punchhtml.xqm";
import module namespace url = "http://kitwallace.me/url" at "/db/lib/url.xqm";

declare option exist:serialize 'method=xhtml media-type=text/html indent=yes';

let $context := url:get-context()
let $page := phtml:page($context)
return
        <html>
            <head>
                <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/>
                <title>Punch, or the London Charivari</title>
                <link rel="stylesheet" href="/punch/css/blueprint/screen.css" type="text/css" media="screen, projection"/>
                <link rel="stylesheet" href="/punch/css/blueprint/print.css" type="text/css" media="print"/>
                <!--[if IE ]><link rel="stylesheet" href="/punch/css/blueprint/ie.css" type="text/css" media="screen, projection" /><![endif]-->
                <link rel="stylesheet" href="/punch/css/screen.css" type="text/css" media="screen"/>
            </head>
            <body>
                 <div class="container">
                    <div class="span-24 last">
                        <div class="span-16">
                            <div class="banner">
                                <img alt="Punch" width="358" height="114" src="/punch/data/images/banner.png"/>
                            </div>
                        </div>
                        <div class="span-8 last">
                            {$page/div[@id='search-form']}
                        </div>
                    </div>
                    <div class="span-24 last">
                        <hr/>
                    </div>
                    <div class="span-24 last">
                        <div class="inner">
                            {$page/div[@id='breadcrumbs']}
                        </div>
                    </div>
                    <div class="span-24 last">
                        <hr/>
                    </div>
                    <div class="span-24 last">
                        <div class="inner">
                            {$page/div[@class='body']}
                        </div>
                    </div>
                    <div class="span-24 last">
                        <div class="inner">
                            <div class="footer bordered">
                                <div id="footerlinks">
                                    TEI Consortium |
                                    TEI@Oxford 2010 |
                                    Kit Wallace |
                                    eXist
                                </div>
                            </div>
                        </div>
                    </div>
                </div>
            </body>
        </html>


A Periodic Table of Visualization Methods - an XQuery, linked version

Back in 2007, Visual-Literacy.org  published a great Periodic Table of methods of visualisation. This displays around 100 diagram types arranged in the layout of the Periodic Table of Elements.

The web page uses a Javascript library to display an example of a diagram type when you mouse-over its box. A neat trick but perhaps not very accessible, so I took the liberty (with Ralph Lengler's subsequent approval) of scraping this table to create a full listing of all the diagram types in alphabetical order.  The resultant application was used in teaching, allowing students to create their own classifications.  I see that the application and my original blog post are still referenced on their site.

Recently there has been a flurry of tweets about the Periodic Table and coincidentally I was moving the original application to a new server and changing to REST-style URLs as part of my work on URL-rewriting.

Here is the new version of that visualisation browser. There are only minor differences in the URL-rewriting approach to that used on the Maths site. Here the application is in a subdirectory of the kitwallace.co.uk virtual host, so a rewrite rule needed to be added to the Apache configuration of that domain and paths to CSS and Javascript files go one level further up.

The module code, particularly the dispatch function, is more readable than for the original query-string URLs. The main menu is also generated using the path 'signature'. The use of the path signatures makes it much clearer how pages are linked.  In this case I didn't parameterize the root directory.  I feel there has to be a better way to handle the problem of multiple versions.

URL-rewriting and dispatching in XQuery

I've always wanted to create websites with those cool REST-style URIs but have been stuck with HTML-form style query strings which also expose the script name.  High time to do something about this weakness.  I have an associated problem with dispatching, calling the appropriate XQuery function depending on the incoming parameters which gets messy.

Background

eXist provides a powerful framework for URL re-writing which also provides pipelining. Rewriting is configured in the eXist configuration files and by controllers which can reside in the database. I have used this approach in the past but of itself it doesn't address the problem of matching resource paths to functions in the code and I have found it a bit tricky to configure. At XML Prague this year, Adam Retter showed an alternative approach using the XQuery 3.0 feature of annotations.

This blog post describes the approach I'm experimenting with, just for HTTP GET operations at present. No doubt other developers have their own styles which it would be interesting to compare.

Example Site

Here is a toy example site developed by Maia (age 6) to improve her maths.  (ignore the funny domain - it was just one I had hanging around)

http://ourstreet.info/math

The application is intended to provide multiple sets of exercises with randomly-generated variants and some attempt at diagnosing mistakes. It is tailored to Maia's family and her favourite colour!

Apache

I thought I'd look at a solution using Apache which I'm already using for virtual hosting (Apache2 on Ubuntu) so the initial rewriting will be done with mod_rewrite.  Since the site uses a number of different paths, it would be tedious and inflexible to rewrite all the possible URLs this way, so I decided to pass the whole path as a parameter to a main script, appending the original query string.

The Apache virtual host file created for the domain contains the Proxy directives and rewrite rules:



   ......


   ProxyPass / http://localhost:8080/exist/rest/db/apps/Maths/
      ...
   RewriteMap escape int:escape
   RewriteRule ^/$ /math/  [R]
   RewriteRule ^/math$ /math/ [R]
   RewriteRule ^/math/(.*) /xquery/home.xq?_path=${escape:$1} [QSA,P]
 

The application will reside in the collection /db/apps/Maths with a main script called home.xq in the xquery subdirectory. The first two rules normalise short paths to /math/ . The third gets the path after /math/ and passes it, after escaping, as the parameter _path.  The query string on the original URL is appended (as directed by the QSA flag).

So a url like

http://ourstreet.info/math/set/1/exercise/2

will be rewritten as

http://localhost... Math/xquery/home.xq?_path=set/1/exercise/2

and

http://ourstreet.info/math/set/1/exercise/2/variant/7,4/response?answer=7

will be rewritten as

http://localhost... ..Math/xquery/home.xq?_path=set/1/exercise/2/variant/7,4/response&answer=7

XQuery database structure 

These paths are mapped into a sub-collection Maths of the apps collection in the eXist database.

/db/apps/Maths/

                     xquery -- xquery scripts including home.xq

                     lib       --xquery modules

                     data    -- data files such as the exercises

                     system  -- configuration files etc

                     css, jscript etc.

/db/apps/Maths/math is a virtual collection on the same level as the other Maths collections.  Common XQuery libraries are placed in /db/lib/ 

The context object

A common pattern I use in XQuery applications is to gather all the parameters and any other environment variables of interest into a context object (sorry, node) to pass into functions.  This may seem a bit heavy but it's very flexible. Query parameters are converted to child elements of the context. Now we also need to parse the _path string. To simplify parsing, I've assumed that types and type values alternate in the path. So



set/1  => <set>1</set>


set    =>  <set/>


set/1/exercise/2/variant/12,4/response  =>


    <set>1</set>
    <exercise>2</exercise>
    <variant>12,4</variant>
    <response/>


Dispatching

The appropriate function to generate an HTML page can be selected on the basis of the 'signature' of the path, derived by replacing the value parts of the path with "*":

set/1/exercise/2  => set/*/exercise/*
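
The two mappings above, path to context children and path to signature, can be sketched in a few lines of plain XQuery. This is a minimal, non-recursive sketch; the function names local:path-children and local:signature are made up for illustration, and it assumes, as above, that types and values strictly alternate and that every type is a valid element name.

```xquery
xquery version "1.0";

(: Odd-numbered steps become element names, the following step (if any) their content :)
declare function local:path-children($path as xs:string) as element()* {
  let $steps := tokenize($path, "/")[. ne ""]
  for $name at $i in $steps
  where $i mod 2 = 1
  return element {$name} {$steps[$i + 1]}
};

(: Even-numbered steps (the values) are replaced by "*" :)
declare function local:signature($path as xs:string) as xs:string {
  let $steps := tokenize($path, "/")[. ne ""]
  return
    string-join(
      (for $step at $i in $steps
       return if ($i mod 2 = 1) then $step else "*"),
      "/")
};

<context>
  <_signature>{local:signature("set/1/exercise/2/variant/12,4/response")}</_signature>
  {local:path-children("set/1/exercise/2/variant/12,4/response")}
</context>
```

Here the signature should come out as set/*/exercise/*/variant/*/response, and the trailing response step, having no value, becomes an empty element.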

An XQuery dispatch function calls the appropriate function:



declare function tm:dispatch($signature,$context) {
if ($signature eq '') then tm:home() 
 else if ($signature eq 'set') then tm:home() 
 else if ($signature eq 'doc/*') then tm:doc($context) 
 else if ($signature eq 'set/*') then tm:set($context) 
 else if ($signature eq 'set/*/exercise') then tm:set($context) 
 else if ($signature eq 'set/*/exercise/*') then tm:exercise($context) 
 else if ($signature eq 'set/*/exercise/*/variant/*/response') then tm:answer($context) 
 else if ($signature eq 'set/*/worksheet-form') then tm:worksheet-form($context) 
 else if ($signature eq 'set/*/worksheet') then tm:worksheet($context)
 else ()
};



The dispatcher can contain multiple signatures for the same function to support alternative paths to the same endpoint.

Thus the function tm:exercise($context) could be associated with  the signature exercise/*/set/* as well as the signature set/*/exercise/* 

[Aside] I originally used an XML table of signatures and functions, and either used selection followed by util:eval() or generated the text of the function above from the table. On reflection, the cost of the machinery of these approaches doesn't seem to be worthwhile. The switch statement of XQuery 3.0 will improve the dispatch code.
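
For comparison, here is a sketch of how the dispatcher might look with the XQuery 3.0 switch expression. It needs an XQuery 3.0 processor, and to stay self-contained it returns the name of a handler as a string rather than calling the tm: functions; local:handler and the handler names are placeholders, not part of the application.

```xquery
xquery version "3.0";

(: Returns the name of a handler for a path signature.
   The names stand in for calls such as tm:exercise($context). :)
declare function local:handler($signature as xs:string) as xs:string {
  switch ($signature)
    case "" case "set" return "home"
    case "doc/*" return "doc"
    case "set/*" case "set/*/exercise" return "set"
    case "set/*/exercise/*" case "exercise/*/set/*" return "exercise"
    case "set/*/exercise/*/variant/*/response" return "answer"
    default return "not-found"
};

local:handler("exercise/*/set/*")
```

Several case clauses can share one return, which makes alternative paths to the same endpoint easy to see at a glance.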

Absolute or Relative URIs

The straightforward approach is to use absolute URIs when generating links to related resources in the application.  For example a breadcrumb-style menu:



let $set := $tm:sets[@id=$context/set]
 return 
 <div class="menu">
   Sets
   {$set/title/string()}
   Another?
 </div>


Absolute URIs are also used in the HTML to link to CSS and JavaScript:

    <link rel="stylesheet" type="text/css" href="/css/screen.css" media="screen" ></link>

Alternatively, we could use relative URLs but I found these require much more care to get right.

However, neither approach meets the need to run alternative versions of the script on the same server for testing. So I append a _root parameter to the rewritten URL and then prefix all URIs with /{$context/_root}.  The modified mod_rewrite rules become:



RewriteRule ^/math/(.*) /xquery/home.xq?_root=math&_path=${escape:$1} [QSA,P]
RewriteRule ^/test/(.*) /xquery/home-2.xq?_root=test&_path=${escape:$1} [QSA,P]


The full configuration file is here

HTML forms and REST-style URIs

HTML forms produce query strings and not resource paths. Forms are needed to gather inputs, for example the child's answer to a question. The action part of the form is the absolute resource path whilst the form creates the additional query string:




<form action="/{$context/_root}/set/{$model/set}/exercise/{$model/exercise}/variant/{$model/variant}/response" 
      method="get">  
    ...
  <input type="text" name="answer" id="answer" size="4"/>


Thus the path to a specific answer to a question will look like

/math/set/1/exercise/3/variant/10,3/response?answer=5

Code

The code is browsable and the full application is available as a zip file. 

A brief explanation of the application itself

Each exercise in a set is parameterised by expressions at points in the exercise definition. When a question is selected, the var expressions defining variables are executed, typically to generate random values. These values are used to complete the question and the sequence of values defines a variant of the exercise. When a response is returned, the value elements are computed using the variant's values and an appropriate response created. The transformation of the exercise XML to HTML is performed by a recursive function which is guided by the current context, which includes the sequence of variables which define a variant of an exercise. var and value elements are evaluated with util:eval().

Here is an example exercise:



 <exercise id="2" use-words="true">
         <title>The runaway chickens</title>
         <question>There are <var>tm:random(5,13)</var>  chickens in the yard and <var>tm:random(1,5)</var> chickens  ran away. How many chickens are now in the yard?</question>
         <answer>
             <correct>Yes would you believe it, there are now  <value>$var[1] - $var[2]</value> chickens left in the yard!  </correct>
             <wrong>Actually there are really only  <value>$var[1] - $var[2]</value> chickens left in the  yard.</wrong>
             <alternative>
                 <value>$var[1] + $var[2]</value>
                 <diagnostic>You have added the numbers rather than subtracted them</diagnostic>
             </alternative>
         </answer>
         <hint>This is a subtraction problem. You have to subtract  the number of chickens who ran away from the number in the  yard.</hint>
     </exercise>
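
To show how a response to this exercise might be assessed for one variant, here is a hand-written sketch in plain XQuery. The real application evaluates the value expressions with util:eval(); here they are simply hard-coded for this one exercise, and the function name local:assess and the response element are assumptions made for the example.

```xquery
xquery version "1.0";

(: $var holds the variant's values, e.g. (10, 3); $answer is the child's response :)
declare function local:assess($var as xs:integer+, $answer as xs:integer) as element(response) {
  let $correct := $var[1] - $var[2]   (: the value expression of correct and wrong :)
  let $added := $var[1] + $var[2]     (: the value expression of the alternative :)
  return
    if ($answer = $correct)
    then <response status="correct">There are now {$correct} chickens left in the yard.</response>
    else if ($answer = $added)
    then <response status="diagnosed">You have added the numbers rather than subtracted them</response>
    else <response status="wrong">There are really only {$correct} chickens left in the yard.</response>
};

local:assess((10, 3), 13)
```

For the variant 10,3 an answer of 13 matches the alternative, so the diagnostic is returned rather than a plain "wrong".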


So what did Maia think of it? 

Maia liked the generated worksheets which she could complete and take to school. However the interactive version didn't compare very well with her current obsession -  Moshi Monsters. So my next improvement is to add an intensely annoying, repetitive and addictive soundtrack. 

Converting an indented list to a tree - the power of util:parse #existdb

Joe Wicentowski posted a nice example of the power of XQuery to do text processing.  The problem concerns converting an indented list to a tree.

I find that eXist's util:parse() can be helpful with this kind of transformation. I used this approach in a function to convert from JSON to XML: first generate the XML as a string and then parse the string to XML.

As Joe does in his approach, the indented list is first converted to a sequence of lines, each of which has a level attribute  ( $lines ) e.g.

<line level="0">The President left at 8:48 am</line>
<line level="1">Administration recommendations on Capitol Hill</line>
<line level="1">Improvements</line>
<line level="1">Richardson’s trip to New York</line>
<line level="1">Health programs</line>
<line level="2">Goals</line>
 

A recursive function generates nested lists and items as a string using one item look-ahead:

declare function local:nest($lines) {
  let $this := $lines [1]
  let $next := $lines [2]
  return
      if (empty($next))
      then concat("<item>",$this,"</item>", string-pad("</list></item>",$this/@level))
      else 
         if ($this/@level = $next/@level)  (:in the same list :)
         then concat("<item>",$this,"</item>", local:nest (subsequence($lines,2)))
         else if ($this/@level < $next/@level) (: going down :)
         then concat("<item>",$this,
                     string-pad("<list>",$next/@level - $this/@level),
                     local:nest(subsequence($lines,2))
                     )
         else  (: ($this/@level > $next/@level) so  going up :)
            concat("<item>",$this,"</item>",
                     string-pad("</list></item>",$this/@level - $next/@level),
                     local:nest(subsequence($lines,2))
                    )
};

Finally the string is parsed to XML to create  the tree:

   util:parse(local:nest($lines))

So Joe's text converted looks like this

The function (deprecated in eXist's XPath namespace) string-pad($s, $n) creates a string of $n $s's. You need to be a bit careful to ensure that the string is well-formed XML (I forgot that level changes may not be just +/- 1) so it's a bit tricky to build the string correctly but at least this kind of processing is very fast in eXist.
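
Since string-pad() is not part of the final XQuery 1.0 function library, a portable replacement may be useful on other processors. This is a minimal sketch; the name local:repeat is made up for the example.

```xquery
xquery version "1.0";

(: Returns $n copies of $s joined together; the empty string when $n is 0 :)
declare function local:repeat($s as xs:string, $n as xs:integer) as xs:string {
  string-join(for $i in 1 to $n return $s, "")
};

local:repeat("</list></item>", 2)
```

The call should return the closing tags for two levels, </list></item></list></item>. Note that, unlike the calls in local:nest, the count must be an xs:integer, so an attribute difference would need wrapping in xs:integer().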

Making TheGloucesterRoadStory location-aware

I've been thinking for a while about a location-aware page for the Gloucester Road Story.  I've now knocked up a simple page using the W3C Geolocation API with jQuery. I've  been guided by the Opera help page.

When the home page is loaded, the API is initialised. If no location is available, fall back to a default location in the road:

$(document).ready(function() {   
    if (navigator.geolocation) {
       navigator.geolocation.watchPosition(get_premise_latlong, errorFunction,{maximumAge:100000});
    } else {
       get_premise_number(200);
    }
  });

get_premise_latlong uses AJAX to request the page for the nearest premise from the server:

function get_premise_latlong(position) {
    var lat = position.coords.latitude;
    var long = position.coords.longitude;
    var url = "xq/mobile.xq?lat="+lat+"&long="+long;
//    alert (url);
    $('#info').load(url);
}

and the mobile.xq script finds the nearest premise using the following function:

declare variable $glm:range := xs:double(0.00001);

declare function glm:nearest-premise($lat as xs:double,$long as xs:double) as element(premise)? {
    (for $premise in $gl:premises[latitude][longitude] (:only geo-coded premises :)
    let $dlat := xs:double($premise/latitude) - $lat
    let $dlong := xs:double($premise/longitude) - $long
    let $distance := $dlat * $dlat + $dlong * $dlong
       (:not correct since $dlong size depends on $lat but good enough :)
    where $distance < $glm:range  (: only if within range of user :)
    order by $distance
    return 
       $premise
    )[1] (:get the first i.e. nearest if any :)
};
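
The comment in the function notes that the size of a degree of longitude depends on the latitude. If that ever matters, one common correction is to scale the longitude difference by the cosine of the latitude. This is a sketch only, using the math functions of XPath 3.0; whether these are available under this namespace in a given eXist version is an assumption, and local:distance-sq is a made-up name.

```xquery
xquery version "3.0";
declare namespace math = "http://www.w3.org/2005/xpath-functions/math";

(: Squared distance in (latitude) degrees between two points given in degrees,
   with the longitude difference scaled by cos of the mean latitude :)
declare function local:distance-sq(
    $lat1 as xs:double, $long1 as xs:double,
    $lat2 as xs:double, $long2 as xs:double) as xs:double {
  let $mean-lat := ($lat1 + $lat2) div 2 * math:pi() div 180
  let $dlat := $lat2 - $lat1
  let $dlong := ($long2 - $long1) * math:cos($mean-lat)
  return $dlat * $dlat + $dlong * $dlong
};

local:distance-sq(51.47, -2.59, 51.47, -2.58)
```

At Bristol's latitude the scaling factor is roughly 0.6, so without it east-west distances are overestimated relative to north-south ones. For picking the nearest premise along a single road this rarely changes the answer, which is why the simple version is good enough.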

If no location is found, the script selects a default premise. The premise data is then rendered as an HTML fragment and returned to the client where it replaces the #info div on the page.

Testing was done using the Opera Mini Simulator but now I'm a bit stuck. I'm ashamed to admit that I don't have a location-aware phone so I can't actually test it on the ground.  If anyone suitably equipped happens to be walking down Gloucester Road (and where better to shop?), I'd love to know if it works. Better still, pop in for a cuppa and we can test it together.

The Gloucester Road Story

 My Gloucester Road project has come some way since I first wrote about this idea in April last year. It now has its own domain and the number of transcribed records from the Kelly's Street Directories is about 20% of the total.

Today I added some photographs of a section of the road known as Pigsty Hill.  The properties have been empty for years so I wasn't in a hurry, but quite suddenly (to me) boarding has gone up and the buildings are being demolished to make way for housing. All the properties had a long history before their closure. At number 154, the upholsterer and his wife collected Bristol pottery, now in the Bristol Museum. At number 156, there has been a fish and chip shop for over 100 years.

The last occupants of number 156 were a very nice Chinese family who ran the Diamond Wok. Despite the fact that the business closed some years ago, Google finds about 40 pages on the internet confidently enticing customers to this Chinese takeaway. One of the worst is http://www.placesto-eat.com/diamond-wok-bishopston-bs7-8nt/   Note the inappropriate images (for a Chinese takeaway) and the not-so-helpful location in Beckenham, London.

These are business directory sites which seem mostly to harvest address lists. Entries seem rarely to be removed.  Neither the source of the original data nor the date of acquisition is given so it is impossible for the viewer to rate the accuracy of the data for themselves. Given the date and an estimate of the churn in a locality, we could deduce the likelihood of an entry being still correct after N years.

The internet often seems to be an idiot savant that never forgets but never checks. That's a good question for internet science I think: in typical online business directories, what are typical rates for both businesses listed but no longer active (false positives) and businesses active but not listed (false negatives)?  One suspects that online directories have much higher rates for both than the published street directories from which our historical data is being extracted. In addition, both kinds contain simple errors, a problem for transcribers.

The local business directories, such as gloucester-rd.co.uk, Love Gloucester Road and Bishopston Matters, manage better. False positives seem lower whilst false negatives seem higher, perhaps because businesses have to submit their own details.

I think of the Gloucester Road Story site as a prototype for a different kind of site, a site where time and place are core, the main dimensions of the data space.

Time must be explicit not just so that historical data can be included, but so that data about current occupants can become history as new occupants appear.

Place must be explicit because it allows location browsing.  Addresses are not just strings, they are located geographically.  This means that a visitor can see what is next door, what is across the road (although you can't do that yet!).  Using both dimensions, we can imagine a virtual street view at any time past.  However we need to collect a lot more data to achieve this goal.

A third, fuzzier dimension is the type of business enacted. Since the street directories lack a formal naming convention some recoding will be needed, but the goal of being able to visualise the decline in boot-menders and the rise of coffee shops will be worth the effort.

Provenance is important because visitors should be able to check the validity of data themselves.  This is straightforward for transcriptions from street directories since the photocopied pages are on the site. Personal observation and interview are more problematic and need attribution to individuals.

I now understand why addresses are treated as strings and not full entities.  I'd naively assumed that numbering of addresses was monotonic  - the house next door going up the street has a higher number. Not so - there is a whole terrace of houses which have the same number as houses further down the street. I'd assumed all premises had numbers. Not so: churches and public buildings typically don't.  Added to this are the complications which I had anticipated such as the change of address with time due to road renaming (Gloucester Road was formerly called Horfield Road), re-numbering or rebuilding.  I don't imagine ever solving this problem: it will just be a hard, ongoing fight.

The sustainability of the site will depend on the ability of the team of local enthusiasts to keep it going, but I think this model puts us in with a chance. The challenge is to complete the historical sample and then lay down another, say, 10 years of history. Well, you have to take the long view.

Virtual Hosts with Apache and eXist-db

Being a sole developer is so much harder than working within the supportive framework of, say, a university. Whereas in the past I could pop along to my mates in IT support and ask them to set up a new server, and grizzle at them when it went down, now it's all up to me. Both paid consultancy and hobby projects have really stretched my limited UNIX skills over the past few months.

Today however I feel elated. After a few days of struggle, I can now set up virtual hosts with Apache, Jetty and eXist-db. This means that I can at long last provide a clean URL for The Gloucester Road Story and a few others.

Here for the record is a summary of my approach.

Choose a VPS host

There is a confusing number of companies offering UNIX VPS now. I've tried Amazon EC2 (thanks guys for the year's free micro-instance), ElasticHost (very helpful people but a tad expensive) and now I'm trialling BitFolk on a friend's recommendation. The server has about 1 GB of memory, 20 GB disk, 1 IP address and costs about ...

Choose a UNIX distro

I've been using CentOS on servers but Ubuntu on desktops. So this time I installed Ubuntu Lucid Lynx to reduce confusion.

Choose Software

Apache2 (even if the tide is going Nginx's way), Java OpenJDK, eXist-db 1.4.2

Configure eXist

eXist is installed from the .jar into /usr/local/eXist. The only problem is in setting passwords for the guest and admin users. It seems almost impossible to get these set right using the web admin screens, so I had to resort to using the Java client.

The only changes I made were to enable (in /usr/local/eXist/conf.xml) some additional modules I use, for example math and compression. I also created a new database user for each application.

The resources for each site are all stored in the database. So the gloucesterroadstory site is stored in the collection /db/apps/theroad.

Configure Apache

In addition to the default enabled modules, the following also need to be enabled: 

  • proxy.conf
  • proxy.load
  • proxy_http.load
  • rewrite.load

I created files for each site in /etc/apache2/sites-available and made them live with symbolic links in sites-enabled. Here is the configuration I created for the Gloucester Road site:

 

 <VirtualHost *:80>
    ServerAdmin kit.wallace@gmail.com
    ProxyRequests off
    ServerName thegloucesterroadstory.org
    ServerAlias www.thegloucesterroadstory.org
    <Proxy *>
         Allow from all
    </Proxy>
    ProxyPass / http://localhost:8080/exist/rest/db/apps/theroad/ 
    ProxyPassReverse / http://localhost:8080/exist/rest/db/apps/theroad/ 
    ProxyPassReverseCookieDomain localhost thegloucesterroadstory.org 
    ProxyPassReverseCookiePath / / 
    RewriteEngine on 
    RewriteRule ^/$   /home.xq [P] 
    RewriteRule ^/system  -  [F]  
 </VirtualHost> 

ServerName thegloucesterroadstory.org     the site's domain name, with a DNS entry pointing to the server's IP address. I'm using 123-reg for DNS management.

<Proxy *> Allow from all </Proxy>   the proxy.conf file denies proxying to all hosts so that must be overridden here

ProxyPass / http://localhost:8080/exist/rest/db/apps/theroad/   this is the host, port and path to the application in the eXist database via the REST interface

ProxyPassReverse / http://localhost:8080/exist/rest/db/apps/theroad/   URLs in headers in HTTP responses are rewritten using this rule (this command would make more sense if the arguments were reversed, since that's how they are used)

ProxyPassReverseCookieDomain localhost thegloucesterroadstory.org The domain under which cookies need to be stored on the client needs to be the site domain name, not localhost. In my applications cookies are used for the session identifier because sessions are needed for user login.

ProxyPassReverseCookiePath / /  The path attached to the cookie - just root here (not /exist which is the default)

RewriteRule ^/$ /home.xq [P] The domain name alone invokes the main page, home.xq.

RewriteRule ^/system - [F]   Forbid access to the system subcollection. All other paths are passed unchanged
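
As a rough illustration of how these rules combine, here is a minimal sketch in Python (not something the site itself uses) of the mapping from a public path to the URL requested from eXist. The function name to_backend is made up for this example, and the sketch ignores query strings and the details of how Apache orders its processing phases.

```python
# Backend prefix taken from the ProxyPass line above.
BACKEND = "http://localhost:8080/exist/rest/db/apps/theroad/"


def to_backend(path):
    """Return the backend URL for a public path, or None if access is forbidden.

    Hypothetical helper: it mimics the effect of the three rules, not
    Apache's actual implementation.
    """
    # RewriteRule ^/system - [F] : anything starting with /system is refused.
    if path.startswith("/system"):
        return None
    # RewriteRule ^/$ /home.xq : the bare domain serves the main page.
    if path == "/":
        path = "/home.xq"
    # ProxyPass / BACKEND : every other path is appended to the backend prefix.
    return BACKEND + path[1:]


print(to_backend("/"))
print(to_backend("/system/anything"))
```

Note that the pattern ^/system has no trailing slash, so it also blocks any other path that merely begins with /system.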

XQuery coding

Redirects

Use request:get-uri() to get the internal URI (e.g. http://localhost:8080/exist/rest/db/apps/theroad/home.xq), which ProxyPassReverse will then rewrite to thegloucesterroadstory.org/home.xq. I use it in this construction to transfer to a different page:

response:redirect-to(xs:anyURI(concat(request:get-uri(),"?action=login-form")))
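
To make the rewriting step concrete, here is a minimal Python sketch of what ProxyPassReverse does to the Location header of such a redirect. It is an approximation for illustration only: the function name rewrite_location is made up, the public prefix is assumed to be plain http, and the real mod_proxy logic handles more cases.

```python
# Prefixes taken from the virtual host configuration; the http scheme of the
# public prefix is an assumption.
BACKEND = "http://localhost:8080/exist/rest/db/apps/theroad/"
PUBLIC = "http://thegloucesterroadstory.org/"


def rewrite_location(location):
    """Swap the backend prefix of a Location header for the public prefix."""
    if location.startswith(BACKEND):
        return PUBLIC + location[len(BACKEND):]
    # URLs pointing anywhere else are passed through untouched.
    return location


internal = BACKEND + "home.xq?action=login-form"
print(rewrite_location(internal))
```

The browser therefore only ever sees the public domain, even though the XQuery script built the redirect from the internal URI.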

Resource locations

Documents referenced by the HTML page, such as CSS, JavaScript and image files, need to be held in the application collection since any path above this will fail. This was not the case when the application was called with the full URL. XQuery library modules, for example common library functions, can be placed anywhere.

Logging transactions

I usually monitor access to sites from within scripts so that application data such as elapsed time can be recorded along with the query string. Before proxying, I logged the host with

request:get-host()

but now I have to log the X-Forwarded-For IP address. I'm only interested in the first in the chain so now I use

tokenize(request:get-header("X-Forwarded-For"),", ")[1]
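
The same extraction, sketched in Python for anyone logging outside XQuery. The function name first_forwarded_for is made up for this example. Unlike the XQuery one-liner, which splits on a comma followed by a space, this version splits on the comma alone and strips whitespace, so it also copes with proxies that omit the space.

```python
def first_forwarded_for(header_value):
    """Return the first address in an X-Forwarded-For header, or None."""
    if not header_value:
        # Header missing or empty, e.g. when the request was not proxied.
        return None
    first = header_value.split(",")[0].strip()
    return first or None


# Addresses below are documentation examples, not real clients.
print(first_forwarded_for("203.0.113.7, 10.0.0.1"))
```

Bear in mind that the header is supplied by the client and intermediate proxies, so the value is only as trustworthy as they are.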

Reflection

I used a lot of sources to get this working. It's difficult to know where to invest time: there are a couple of pages of documentation on the eXist site (here and here) but they don't deal with applications in the database and are incomplete; there is a ton of documentation online; mailing lists to ask; friends to badger; Google to query. All helpful but in the end, careful experimentation is vital. The Apache error log, the Firefox Live HTTP Headers add-on and the Firefox cookie view were valuable tools in debugging.

More work to do, especially on access control, but I'm a happy man today. 


Twitter Photo wall - server-side caching

The Twitter photo wall needs server-side caching for several reasons:

  • improved performance when multiple browsers are looking at the same search, both on the local server and on Twitter's
  • moderation can be interposed: a moderator's screen would present the same wall but with a check box to authorise an image, and the public wall would then only show accepted photos
  • integration of images coming from other sources, such as email or a Flickr GeoRSS feed
  • better de-duping: duplicates in a batch can be removed, but for the incremental update they need de-duping over the full set
  • a permanent record of photos for the event would be desirable

In version 2 of the photo wall, an authorised user can create, modify and delete photo walls. A wall may be moderated or unmoderated. If moderated, a moderation screen shows all unmoderated photos, the oldest at the top. The moderator can accept or reject a photo by clicking the appropriate button. The moderation element of the photo in the cache is then updated and the photo removed from the moderation screen. The public page of a moderated wall will show only accepted photos. Each photo is timestamped so that screens can be updated with those photos acquired since the last refresh.

This architecture is more efficient since the Twitter search is done only once and each public page only has to access the cached data. However there are now three lags in the system: the interval between searches of the Twitter stream, the pause whilst a human operator moderates the photos, and the interval between refreshes of the wall.

You can view a few walls which have been created but are most likely currently stopped. If anyone wants a login to use the prototype, drop me a line.

Implementation

The data for each wall is the query description and a sequence of photo descriptors. This structure is held in an XML file in the database. It is updated in situ using the eXist-db XQuery update extensions when new photos are acquired, when photos are moderated and when the query parameters are updated.

When a wall is created, the XML file is created containing the initial query. The Twitter stream is searched for the first time using the query parameters and matching photos are added to the XML file, ignoring duplicates. In addition, a task is scheduled to re-run the search at the defined refresh rate.

The moderation page uses AJAX both to fetch the set of all unmoderated photos repeatedly and, as each moderation decision is made, to update the moderation status of the photo in the database. The public page also uses AJAX to fetch newly acquired or moderated photos.
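
The logic of the cache can be sketched independently of eXist. The following Python is only an illustration: the real wall keeps this data in an XML file and updates it with XQuery, whereas here an in-memory dictionary keyed by photo URL stands in for it. The class and method names are made up, and using the moderation time to decide what is new on a moderated wall is my assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class Photo:
    url: str
    acquired: float                       # timestamp when first seen
    moderation: Optional[str] = None      # None, "accepted" or "rejected"
    moderated_at: Optional[float] = None  # timestamp of the decision


@dataclass
class Wall:
    moderated: bool
    photos: Dict[str, Photo] = field(default_factory=dict)

    def add(self, urls, now):
        """Add search results, ignoring duplicates over the full set."""
        added = 0
        for url in urls:
            if url not in self.photos:
                self.photos[url] = Photo(url, now)
                added += 1
        return added

    def pending(self):
        """Unmoderated photos, oldest first, for the moderation screen."""
        waiting = [p for p in self.photos.values() if p.moderation is None]
        return sorted(waiting, key=lambda p: p.acquired)

    def moderate(self, url, accept, now):
        """Record the moderator's decision on one photo."""
        photo = self.photos[url]
        photo.moderation = "accepted" if accept else "rejected"
        photo.moderated_at = now

    def public_since(self, since):
        """Photos the public page should add since its last refresh."""
        if self.moderated:
            return [p for p in self.photos.values()
                    if p.moderation == "accepted" and p.moderated_at > since]
        return [p for p in self.photos.values() if p.acquired > since]
```

Each scheduled search would call add with its results, the moderation page would poll pending and call moderate, and the public page would poll public_since with the time of its last refresh.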

The current line count is about 600 lines of XQuery and 30 lines of JavaScript/jQuery.

To do

The wall display is very basic. New photos are inserted at the top of the page which results in a jerky appearance. One idea would be for new photos to appear at the centre of the screen and drift outwards, reducing in size as new photos arrive - well beyond my JavaScript skills.

Short URLs like Twitter's own t.co need to be converted to their unabbreviated form to detect which photo service is being used, if any. The usual approach with curl is to tell it not to follow the Location header of a redirect, but the httpclient module in eXist does not have this option. This means that the page has to be fetched and then analysed to see whether it is an image service and if so which one: messy.
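
For comparison, this is roughly what the non-following approach looks like outside eXist. It is a minimal Python sketch, not part of the photo wall: the standard http.client module never follows redirects by itself, so the Location header of the first response can be read directly. The function names are made up, and it assumes the short URL service answers HEAD requests with a redirect status.

```python
import http.client
from urllib.parse import urlsplit

# Status codes that announce a redirect with a Location header.
REDIRECT_STATUSES = {301, 302, 303, 307, 308}


def redirect_target(status, location):
    """Return the Location value if the response is a redirect, else None."""
    if status in REDIRECT_STATUSES and location:
        return location
    return None


def expand_once(short_url, timeout=10.0):
    """Ask for short_url without following the redirect; return its target."""
    parts = urlsplit(short_url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.netloc, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.netloc, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        # HEAD avoids downloading the body; only the headers are needed.
        conn.request("HEAD", path)
        response = conn.getresponse()
        return redirect_target(response.status, response.getheader("Location"))
    finally:
        conn.close()
```

The host name of the returned URL would then be enough to tell which photo service, if any, is behind the short link. Chains of shorteners would need the call repeated until it returns None.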