Project: PlurioParser

PlurioParser is a utility that converts data from a range of selectable data sources, such as Semantic MediaWiki (SMW) or an SQL database, into a predefined XML format for import into the plurio.net database of cultural events in the Greater Region. Data from the plurio.net platform is used on several national and international websites, such as culture.lu and grrrrr.eu, and on an ever-increasing number of smaller, institutional websites that retrieve and display data from plurio.net’s database.

David, and later TenTwentyFour1024, developed the utility and maintain it in close cooperation with the Agence luxembourgeoise d’action culturelle, the organisation that coordinates and runs plurio.net.

Development

The first version of this script was written before TenTwentyFour1024 even existed. It was developed to export data about events and their locations from the Semantic MediaWiki used by syn2cat, the organisation that nowadays runs the level2 hackerspace in Bonnevoie, Luxembourg.

A second, evolved version of the script was created to support not only the Semantic MediaWiki API and data structures used by syn2cat, but also data retrieval from a Microsoft SQL Server database.

This release is in use by the Musée national d’histoire naturelle (MNHN), generating an export file from about 100 events and almost as many locations.

Internals

The PlurioParser utility extracts and transforms data from the SQL Server database, generates an XML file representing the complete published information, and finally validates that file against an .xsd schema.

Data from the SMW and MS SQL Server data sources is retrieved and represented by the same objects. This allows us not only to use the same interface when handling events, locations, people and categories, but also to transform and normalize values, such as dates, that may have idiosyncratic formats in the various sources.
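
As an illustration of this idea, here is a minimal sketch of a shared event object with per-source date normalization. It is written in Python for brevity (the utility itself is PHP), and the names `Event`, `normalize_date`, `event_from_source` as well as both date formats are assumptions made for the example, not taken from the actual code.

```python
from dataclasses import dataclass
from datetime import datetime

# Hypothetical date formats per source; the real formats are not documented here.
SOURCE_DATE_FORMATS = {
    "smw": "%d.%m.%Y %H:%M",
    "sqlserver": "%Y-%m-%d %H:%M:%S",
}


@dataclass
class Event:
    """Source-independent representation of an event."""
    title: str
    start: datetime


def normalize_date(raw: str, source: str) -> datetime:
    """Parse a source-specific date string into a datetime."""
    return datetime.strptime(raw.strip(), SOURCE_DATE_FORMATS[source])


def event_from_source(record: dict, source: str) -> Event:
    """Build the same Event object regardless of where the record came from."""
    return Event(
        title=record["title"].strip(),
        start=normalize_date(record["start"], source),
    )
```

With this shape, the code that builds the XML only ever sees `Event` objects and does not need to know which back-end a record came from.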

In doing so, several checks are applied at different stages, each of which can fall back to default values if the given values do not validate. For instance, if a location linked to an event has incomplete address data or cannot be identified based on the predefined plurio.net categories, the utility first checks for a meeting point instead and, failing to identify one, falls back to a common location. In some cases, when events fail to validate and the fall-back location is not pertinent, events are omitted from the export.
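
The fall-back chain described above could be sketched roughly as follows. This is a Python illustration under stated assumptions, not the utility's PHP code: the field names, the `REQUIRED_FIELDS` tuple and the convention of passing `None` when the common location is not pertinent are all hypothetical.

```python
from typing import Optional

# Assumed minimum for a usable place; the real validation rules may differ.
REQUIRED_FIELDS = ("street", "city", "category")


def is_usable(place: Optional[dict]) -> bool:
    """A place is usable if it exists and every required field is non-empty."""
    return place is not None and all(place.get(f) for f in REQUIRED_FIELDS)


def resolve_location(event: dict, default_location: Optional[dict]) -> Optional[dict]:
    """Try the linked location, then the meeting point, then the common location.

    Returns None when nothing usable is found, meaning the event is omitted.
    """
    candidates = (event.get("location"), event.get("meeting_point"), default_location)
    for candidate in candidates:
        if is_usable(candidate):
            return candidate
    return None
```

A caller would skip any event for which `resolve_location` returns `None` and record the reason for the report.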

The utility relies heavily on caching to avoid making the same request to either data back-end more than once. Locations and people are retrieved once, added to the XML’s guide section once, then cached and re-used whenever an event refers to them again.
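
A minimal sketch of such a cache, again in Python rather than the utility's PHP, might look like this. The class name `GuideCache`, the `fetch` callable and the `(kind, id)` key are assumptions chosen for the example.

```python
class GuideCache:
    """Fetch each location or person once and remember it by (kind, id)."""

    def __init__(self, fetch):
        self._fetch = fetch  # callable(kind, entity_id) -> dict; queries the back-end
        self._seen = {}      # (kind, entity_id) -> entity already fetched
        self.guide = []      # entities to write to the XML guide section, once each

    def get(self, kind, entity_id):
        key = (kind, entity_id)
        if key not in self._seen:
            # First reference: hit the back-end and register the guide entry.
            entity = self._fetch(kind, entity_id)
            self._seen[key] = entity
            self.guide.append(entity)
        # Later references are served from memory.
        return self._seen[key]
```

Keying on both the kind and the identifier keeps a location and a person with the same numeric ID from colliding.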

Validation and reporting

The PHP utility is controlled by a bash script which is – in turn – executed nightly by a cron job. Before executing the PHP utility that retrieves event data and builds the XML, the bash script retrieves updated categorization data from the plurio.net website.

If the generated XML validates using the .xsd file, a generation report is sent to people at the MNHN and TenTwentyFour1024, detailing which events had to be omitted from the export, and for what reason.

Besides the bash script, which logs information about each run to the system’s syslog, the PHP utility itself logs any problems to a rotating log file. A debug flag increases the verbosity of the utility whenever required.
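
For readers unfamiliar with the pattern, the following Python sketch shows a rotating log file combined with a debug flag, using the standard `logging` module. The logger name, size limit and backup count are assumptions for illustration; the PHP utility's actual logging setup is not shown on this page.

```python
import logging
from logging.handlers import RotatingFileHandler


def build_logger(path, debug=False, max_bytes=1_000_000, backups=5):
    """Return a logger that writes to a size-rotated file.

    With debug=False only warnings and errors are recorded;
    debug=True raises the verbosity to include debug messages.
    """
    logger = logging.getLogger("plurioparser")  # hypothetical logger name
    logger.setLevel(logging.DEBUG if debug else logging.WARNING)
    if not logger.handlers:
        # Attach the file handler only once, even if called repeatedly.
        handler = RotatingFileHandler(path, maxBytes=max_bytes, backupCount=backups)
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger
```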

The XML file that is thus created is also imported by the plurio.net system on a nightly basis, generating a detailed import log (translated here from the German original).

    +++ plurio.net XML import for group "Musée national d'histoire naturelle, Luxembourg" successful +++
    ************************************************************************************

    Info: No action is required on your part!


    +++ General information +++
    ************************************************************************************

    Import time:     2016-08-31 04:40:14 - 2016-08-31 04:41:52
    Import group:    Musée national d'histoire naturelle, Luxembourg (ID: xxx)
    Import user:     xxxxxxxxxx (ID: xxxxx)


    +++ Statistics for your XML import +++
    ************************************************************************************

    82   records found in your XML file, of which:
         -----------------------------------------------------------------------------
         82  records were written to the plurio.net database, of which:
             -----------------------------------------------------------------------
             82  records complete and published directly.
             0   records incomplete
         -----------------------------------------------------------------------------
         0   records skipped, of which:
             -----------------------------------------------------------------------
             0   records already exist in the plurio.net database.
             0   records faulty
             0   records (events) already expired.


    [...]

    Kind regards
    plurio.net (http://www.plurio.net/XML/)

Pipe your data into plurio.net!

The PlurioParser utility can be adapted to work with your data, generating XML output that is as complete as possible and 100% compliant with plurio.net’s requirements.

Contact us if you have considerable amounts of data that you would like to see on plurio.net and its sibling websites without having to manually insert each and every item.

Technologies used in building this application

PHP

PHP is a popular general-purpose scripting language that is especially suited to web development. Fast, flexible and pragmatic, PHP powers everything from your blog to the most popular websites in the world.

Bash

Bash – the Bourne Again SHell – is a Unix shell and command language written by Brian Fox for the GNU Project as a free software replacement for the Bourne shell. It is a default shell on the major Linux distributions and OS X.