Wednesday, March 22, 2017

The Haskell Tool Stack for Multi-Package Projects

I organize my coding in projects, which include several packages, but are managed in a single git repository; a single repository coordinates changes in different packages to be able to reproduce consistent states of the code. In this blog I want to document methods to deal with such arrangements using the new Haskell Stack build methods and postpone a discussion of such Multi-Package Project organization for another time.

The Stack project is an effort to overcome one of the major impediments of complex Haskell projects. Haskell packages are versioned and cabal and the linker check

  • the dependencies between packages and their versions, assuring that an acceptable version of a dependent (used) packages is present
  • from each package only one version is present in the build.

It may be difficult to find a set of versions of the required packages, which are consistent; finding such consistent sets is partially automated by cabal install. Two major issues are observed:
  • solutions are not really reproducible and may change by actions outside of the programmers control (e.g. new package version appear)
  • the algorithm may not find an acceptable solution; with some tricks and detective work, a solution can be found, sometimes not - the frustration is known as "cabal hell"

Haskell Stack overcomes these impediments by proposing curated sets of package version which are known to fit together; these are known as snapshots or "lts" and are imported from  Stackage (similar to importing data about versions found on Hackage in cabal).

Using the Haskell Tool Stack starts following the installation guidelins   is installing a complete Haskell environment in locations different than what GHC and cabal usually uses (i.e. you get a new copy of ghc in ~/.stack)

Directories used
  • ghc and other global things go into ~/.stack
  • binaries go into ~/.local/bin  (the result of getAppUserDataDirectory)

The guidelines    to install stack and to run a simple new project are easy to follow and work; in the following, I assume an installed stack.

The example project used here consists of two packages, a and b, a is a library (and a test suite), b is an executable using some of the functions in a. 

Stack relies on a stack.yaml file, which for this project is

flags: {}
extra-package-dbs: []
- a
- b
extra-deps: []
resolver: lts-8.2

This fixes the snapshot to lts-8.2 (newest in march 2017 would be 8.5), which uses ghc 8.0.2 (lts-7.20 is the latest for ghc 8.0.1, for others see on stackage). As long as the resolver for a project is not changed, the same versions of packages are used and the build is repeatable.

The difference to a single package project stack.yaml file is to replace for packages the entry 
- "." 
with the names of the package directories.

Extending the multi-project test with a subdirectory with libraries m and n (in the librarySubdir branch) requires additions to stack.yaml
flags: {}
extra-package-dbs: []
- a
- b
- libs/m
- libs/n
extra-deps: []
resolver: lts-8.2
Stack builds the project and updates the required parts after updates automatically.

Autocompletion for stack requires
    eval "$(stack --bash-completion-script stack)" 

which can be included in .bashrc

Sunday, January 8, 2017

A short tutorial to use Virtuoso RDF database to store some RDF data and execute a SPARQL query

I have used 4store, which is easy to install ("apt-get install 4store") and very easy to use. For 4store exist a short but reasonable documentation Unfortunately, the code has not been much developed lately on github ( and new constructs in SPARQL are not implemented (e.g. subselect queries). My task required queries, where select clauses are nested and I was forced to look for another RDF database and SPARQL endpoint. 


My understanding of "easy to install" is "apt-get install" or very close. Virtuoso was the next candidate I found for Debian stretch it ready to install. The installation is smooth but it is necessary to enter a password during installation - write it down, it is required later! 

sudo apt-get install virtuoso-opensource 

which gives version 6.1.6 at the time of writing. 


The Virtuoso tutorial does not start at the beginning and points to various places not connected to my task and sometimes not even working, but you can see in it, that you should point your browser at http://localhost:8890/conductor (obvious!), where the default user is "dba" but the default password is not what is stated in the tutorial but the password you set during installation. "the rest is self-explanatory" says the "tutorial".

The initial screen, after login,  invites you to 5 major themes (from Business Process Integration to Enterprise Collaboration) none of them relevant for wanting just to load RDF data and execute a query. Bewildering! 

On top of the screen a list of 9 tabs includes "LinkedData", which opens a second list of tabs and a screen for "SPARQL Execution" - closer. But how to load data? 

Load data

Check the other 8 tabs... perhaps: "QuadStore Upload"? Correct: with the browse tab find the file (in my case Ntriples with .nt extension) and extend the proposed "Named Graph IRI" to with some name (e.g. xx) to "http://localhost:8890/DAVxx". The value will be required for the query. Click on upload. The filename will disappear and in the upper right you can read - if you look there - "upload finished". Done. 


Now click on the SPARQL tab again, paste the "Named Graph IRI" value into the "Default Graph IRI" field and the query into the query field. Test with 

SELECT * WHERE { ?s ?p ?o } LIMIT 10

and you should see underneath the list of s, p and o value. Neat. 

A more complex query can be pasted in, e.g. 

SELECT ?filename ?fpath ?count ?md5group 
    ?unit dove:md5 ?md5group . 
    ?unit dove:filename ?filename . 
    ?unit dove:filepath ?fpath . 
        { SELECT  ?md5group (Count(?unitgroup) as ?count) 
             WHERE { 
                   ?unitgroup dove:md5 ?md5group . 
             } GROUP BY ?md5group Having (count(?unitgroup) > 1) 
   } order by ?md5group ?fpath 

 and produces the expected results. 

The above complex query relies on the extension of the "Namespace" in its tab with the prefix value for "dove" with "". 

Queries can be saved (clicking on "Save") by giving a "File of XML template value" starting with "/DAV/", e.g. "/DAV/groupsWC" and a description text. In the SPARQL Execution tab, selecting saved queries shows a list of the stored ones and one can click "Edit" to get them back into the query window and execute them as above or click the query name, which executes them and opens the XML file in the browser (the URL, in my case http://localhost:8890/DAV/groupsWC can be opened in any browser and produces the XML doc - could be useful). 


My purpose, loading data and executing a SPARQL query was satisfied and I did, for now, not venture deeper into Virtuoso and conductor. I hope this short tutorial helps to get started quickly and easily with Virtuoso and it shows, implicitly,  what is wrong with (a) tutorials which do not start at the beginning, (b) inconsistent terminology and (c) software which is not modular but includes "all but the kitchen sink". 


I had some trouble with loading RDF and found:
  1. The RDF triples can be provided as gziped file (no other compression method is currently supported) - this saves a lot of disk space!
  2. There is a bulk loader process which can be used; it took me a while to understand that the directory from which the file is loaded must be included in the /etc/virtuoso/virtuoso.ini file: 
    DirsAllowed              = ., /usr/share/virtuoso-opensource-6.1/vad, /home/frank/virtuosoData, /home/frank/.Dove
    You can then use the isql-vt console (in debian, may have a diffferent name in other distributions) as 
    sudo isql-vt localhost usernaem password 
    ld_dir('/home/frank/virtuosoData/', '*.*', 'http://graphname');

Thursday, December 1, 2016

Digitale Literaturwissenschaft - eine Außensicht aus der Perspektive des Informatikers

Digitale Literaturwissenschaft - eine Außensicht aus der Perspektive des Informatikers

1 Einleitung

Die Informationstechnologie verändert die Produktion von Literatur, die Form der entstehenden Literatur und das Lesen; ebenfalls beeinflusst Informationstechnologie die Arbeitsweisen der Literaturwissenschaft. Nicht nur, dass literarische Texte elektronisch erfasst, übertragen und bearbeitet werden und dass literaturwissenschaftliche Arbeiten mit dem Computer statt von Hand oder mit der Schreibmaschine geschrieben werden, sondern besonders weil die Informationstechnologie das Potential hat neue Arbeitsweisen und neue Erkenntnismethoden in die Literaturwissenschaft einzuführen.
Es scheint, dass die Anwendung der Informationstechnologie sowohl
  • den Gegenstand der literaturwissenschaftlichen Untersuchung verändert; für einen als Korpus beschriebene Textsammlung sind kategorische Urteile („immer“, „nie“) möglich.
als auch
  • die Methode beeinflusst; algorithmische Interpretation beruht auf einer definierten Sammlung von Sprach- und Sachwissen und ist so nachvollziehbar.

2 Gegenstand einer digitalen Literaturwissenschaft

Viele literaturwissenschaftliche Arbeiten beschäftigen sich mit einem literarischen Text oder einer unscharf bezeichneten Textmenge (z.B. „Der Europäische Roman des 19. Jahrhunderts“). Mittels Informationstechnologie werden digitale literarische Korpora zusammengestellt. Texte, die nach bestimmten Kriterien ausgewählt sind, werden in einer algorithmisch verarbeitbaren Form aufbereitet und als Korpus bereitgestellt. Der Aufwand größere Korpora zu schaffen ist beträchtlich, verteilt sich aber durch die mehrfache Verwendung durch verschiedene Wissenschaftler und für unterschiedliche Forschungsarbeiten. Der Urheberschutz und unterschiedliche Interpretationen des Verhältnisses von Urheberrecht und Freiheit der Wissenschaft zueinander, schränken manchmal den Zugang erheblich ein [11]. Aufbereitete Texte sind oft bereits nach einem bestimmten Standard (z.B. TEI [9]) „ausgezeichnet“, das heißt, dass auch Seiteneinteilung, Lesarten oder Differenzen zwischen Ausgaben etc. kodiert sind.
Liegt einer literaturwissenschaftlichen Arbeit ein Textkorpus zugrunde, so wird zumindest der algorithmisch bestimmte Teil der Arbeit wiederholbar. Andere Forscher können im Prinzip nachprüfen, ob sie mit dem gleichen Korpus und den gleichen Methoden gleiche Ergebnisse erhalten; wesentlicher ist aber die Möglichkeit, zu prüfen, wie sich die Ergebnisse verändern, wenn der Korpus erweitert, eingeschränkt oder ein ganz anderer Korpus mit den gleichen Methoden evaluiert wird. Literaturwissenschaftliche Forschungsergebnisse sind dann schärfer umrissen, auf bestimmte Sammlungen von literarischen Werke bezogen und mit anderen Ergebnissen vergleichbar. Die Interpretation der algorithmisch gewonnenen Ergebnisse bleibt, wie bisher, dem Wissenschaftler vorbehalten.
Die Verwendung von Korpora bringt auch einen methodischen Gewinn: es ist, mit Bezug auf einen fixierten Korpus möglich, Aussagen der Form „in diesem Korpus gibt es keinen Fall, dass ...“ oder „in diesem Korpus ist immer der Fall, dass ..“ zu machen. Solche kategorischen Urteile waren bisher nur in Bezug auf kleine, überblickbare Textmengen möglich - mit der Festlegung eines Korpus und der algorithmischen Untersuchung der darin enthaltenen Texte sind sie auch bezüglich großer Textsammlungen möglich.

3 Methodik einer digitalen Literaturwissenschaft

Die Interpretation eines literarischen Textes entsteht beim Lesen durch die Verbindung der Zeichen im Text mit dem Wissen des Lesers. Ähnlich wie der Gegenstand für die digitale Untersuchung als Korpus festgelegt wird, muss das Wissen, das die Interpretation erlaubt, beschrieben werden. Zu diesem Wissen gehört die Kenntnis der verwendeten Sprache, aber auch Allgemeinwissen und schließlich Spezialwissen, die für das vertiefte Verständnis notwendig sind. Eine digitale Literaturanalyse muss nicht nur den Gegenstand der Untersuchung, d.h. den Korpus der untersuchten Texte, sondern auch das für die Analyse verwendete Wissen und die verwendeten logischen Regeln bezeichnen.
Eine digitale Analyse eines natürlich-sprachlichen Textes zerfällt in verschiedene Phasen; in einem ersten Schritt erfolgt meist die Verarbeitung der Sprach\SpecialChar softhyphenoberfläche mit den Mitteln der Computerlinguistik (z.B. Stanford CoreNLP [A]  [A], für verschiedene Sprachen verfügbar [12]); der Text wird dabei in Wörter aufgelöst, diese auf Wortstämme reduziert und die grammatischen Konstruktionen analysiert, Referenzen und Verweise im Text kodiert und Eigennamen erkannt. Das dabei eingesetzte Wissen kann vereinfachend aber nachvollziehbar mit dem Verweis auf das verwendete Programm (bzw. auf das Programm und dem für das Training verwendete Korpus) erfolgen.
Je nach Fragestellung kann die Verbindung des Textes mit anderem Wissen anschließen; beispielsweise eine Analyse des Raumbezuges durch Verbindung von Ortsbezeichnungen mit geographischem Wissen oder der Beziehungen der Personen durch Verbindung mit historischem Wissen oder Bibel- und Mythologiekenntnisse. Es kann mit Werkzeugen der Computerlinguistik auch nach anderssprachigen Einschüben gesucht und diese analysiert werden; digitale Methoden können solche Hinweise systematisch und für mehr Sprachen als ein einzelner Leser beherrschen kann, sichtbar machen [B]  [B] Beispielsweise sind in Gedichten Celans zumindest russische (rot), französische (neige) und japanische (i-i-e) Wörter auffindbar. Die Untersuchung der Handlungen nach narratologischen Schemata wie von Propp initiiert [15], und zu computational narratology [C]  [C] zu entwickeln, benötigt Generalisierungen, die Linguisten in taxonomisch organisierten Wortlisten bereitstellen (wordnet [D]  [D] [2] u.ä).
Entscheidend ist, dass das bei einer algorithmischen Interpretation angewandte Wissen und die verwendeten logischen Schlussregeln nachvollziehbar beschrieben werden. Praktisch kann das Wissen durch die Angabe der bei der Analyse eingesetzten Werkzeuge beschrieben werden, also z.B. die verschiedenen Analysen die Datenbanken (z.B. dbpedia [E]  [E] - eine logisch strukturierte Form großer Teile des Inhaltes von Wikipedia als RDF kodiert), Taxonomien und Schlussregeln (z.B. OWL [14]). Damit ist deren Einfluss auf die Analyse dokumentiert, überprüfbar und einer Kritik zugänglich.
Die technische Verarbeitung wird erleichtert, wenn die Verarbeitung in Schritte unterteilt wird und die Ein- und Ausgaben in standardisierten Formaten erfolgen. Für die sprachliche Analyse werden Treebanks (mit leicht unterschiedlichen Kodierungen) verwendet, wobei auch bereits sprachunabhängige Lösungen vorgeschlagen wurden[10, 17, 16], die für die Komparatistik wahrscheinlich besonders fruchtbar sind [6, 7]. Das für die literaturwissenschaftliche Analyse heranzuziehende Sachwissen ist zu einem großen Teil im Semantic Web [1, 5] bereits in in RDF Form[13] vorhanden; es könnte nützlich sein, Korpora ebenfalls in RDF Form zu kodieren um die Verbindung mit dem Semantic Web zu vereinfachen[8].

4 Coda

Durch eine konsequente Beschreibung von Gegenstand - als Korpus der maschinellen Verarbeitung zugänglich - und dem darauf bezogenen Sachwissen - in Form von Programmen und Semantic Web - , das für die Analyse verwendet wird, ist es möglich, Interpretationen algorithmisch zu überprüfen und zu objektivieren.
Die digitale Form von Texten (und ähnlichen Materialien) erlaubt den Einsatz von algorithmischen Verarbeitungen; dies verändert Literaturwissenschaft insofern, als Hypothesen überprüfbar werden. Nötig dazu ist die Beschreibung des Gegenstandes in Form der im Korpus eingeschlossenen Texte und ein Modell des Lesens und des dabei eingebrachten Sachwissens. Es kann dann entschieden werden, ob eine Interpretation eines Textes mit dem angegebenen Wissen möglich ist, welche Interpretationen aus einem limitierten Wissen resultieren (z.B. fehlende Fremdsprachkenntnisse), oder welches Sachwissen notwendig für eine bestimmte Interpretation ist.
Eine „automatische“ Interpretation liegt zwar in weiter Ferne, aber der Wissenschaftler könnte, durch eine automatische Produktion aller möglicher Assoziationen (taxonomisches und Faktenwissen) und deren Gruppierung auf vielleicht sonst übersehene Hypothesen zur Interpretation hingewiesen werden, die dann kritisch beurteilt werden müssen.


[1] Tim Berners-Lee, James Hendler, Ora Lassila, others: “The semantic web”, Scientific american, pp. 28—37, 2001.
[2] Christiane Fellbaum: WordNet: An Electronic Lexical Database. The MIT Press, 1998.
[3] Hanno Biber, Evelyn Breiteneder, Karlheinz Mörth: “Words in Contexts: Digital Editions of Literary Journals in the "AAC - Austrian Academy Corpus" ”, Proceedings of the International Conference on Language Resources and Evaluation, LREC 2008, 26 May - 1 June 2008, Marrakech, Morocco, 2008. URL
[4] Hanno Biber, Evelyn Breiteneder: “Fivehundredmillionandone Tokens. Loading the AAC Container with Text Resources for Text Studies”, Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012), Istanbul, Turkey, May 23-25, 2012, pp. 1067—1070, 2012. URL
[5] Pascal Hitzler, Markus Krötzsch, Sebastian Rudolph, York Sure: “Semantic Web”, Berlin, Heidelberg, 2008.
[6] Christine Ivanovic, Andrew U Frank: “Corpus-based Research in Computational Comparative Literature”, Corpus-Based Research in the Humanities (CRH), pp. 69, 2015.
[7] Christine Ivanovic, Andrew U Frank: “Korpusanalyse in der computergestützten Komparatistik”, Digital Humanities deutsch (DHd), 2016.
[8] Christine Ivanovic, Andrew U Frank: Viennavigator: Digitale Formalisierung literarischer Topographien am Beispiel des Gesamtwerks von Ilse Aichinger in Nach Wien! Sehnsucht, Distanzierung, Suche. Literarische Darstellungen Wiens aus komparatistischer Perspektive (Bachleitner, Norbert and Ivanovic, Christine, ed.). Peter Lang, Frankfurt a.M., 2015.
[9] Fotis Jannidis: “TEI in a crystal ball”, Literary and linguistic computing, pp. 253—265, 2009.
[10] L. Banarescu, C. Bonial, M S. Cai, Georgescu, K. Griffitt, U. Hermjakob, K. Knight, P. Koehn, M. Palmer, N. Schneider: “Abstract Meaning Representation for Sembanking”, , 2013.
[11] Lawrence Lessig: Code and Other Laws of Cyberspace. Basic Books, 1999.
[12] Christopher D Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, David McClosky: “The Stanford CoreNLP Natural Language Processing Toolkit.”, ACL (System Demonstrations), pp. 55—60, 2014.
[13] Frank Manola, Eric Miller, Brian McBride, others: “RDF primer”, W3C recommendation, pp. 6, 2004.
[14] D.L. McGuinness, F. Van Harmelen, others: “OWL web ontology language overview”, W3C recommendation, pp. 10, 2004.
[15] Vladimir Jakovlevič Propp, EM Meletinskij, Christel Wendt: Morphologie des Märchens. Carl Hanser Verlag, 1972.
[16] Lucy Vanderwende, Arul Menezes, Chris Quirk: “An AMR parser for English, French, German, Spanish and Japanese and a new AMR-annotated corpus”, Proceedings of NAACL-HLT, pp. 26—30, 2015.
[17] Nianwen Xue, Ondrej Bojar, Jan Hajic, Martha Palmer, Zdenka Uresova, Xiuhong Zhang: “Not an Interlingua, But Close: Comparison of English AMRs to Chinese and Czech.”, LREC, pp. 1765—1772, 2014.

Saturday, April 2, 2016

Debian with XFCE4 on Lenovo Thinkpad Yoga14

I bought a Lenovo Thinkpad Yoga14 because I wanted a new device with handwriting input for writing and drawing; i know that it would work with Ubuntu and assumed that it would also work with a regular Debian installation (I dislike Canonical's approach to force users to their views).

Installation of Debian was somewhat more difficult; here the short description what worked for me in the end.

1. I observed that my working installation of Ubuntu used libwacom-bin, which is only in Debian stretch; thus I decided to install stretch (aka testing). Installations of jessie were not supporting the pointer.

2. To install Debian stretch (alpha 5 release) requires the non-free firmware and additionally the driver for the Intel Wireless 7265 rev 61 (aka 7265D), which is iwlwifi--7260-17.ucode (download from a git...), which i put on a second usb stick (on the first i had the inofficial stretch release including the firmware).

3. Install regularly and add the USB stick with the firmware when requested.

 4. Restart and produce /etc/apt/sources.list (not automatic in alpha 5)

deb stretch main contrib non-free 
deb stretch/updates main contrib non-free 

and /etc/network/interfaces

auto lo 
iface lo inet loopback 

auto wlp2s0 
iface wlp2s0 inet dhcp 
     wpa-ssid xxxx 
     wpa-psk "pppppp"

 5. restart and login as root; check that network is ok.

Then apt-get update, apt-get upgrade (some problem with iwlwifi?? - it needed a apt-get upgrade -f). install xfce4, xfce4-goodies, xournal (for drawing) startx to test install lightdm

it works, when you restart, the regular xfce login greeter is there!

Wednesday, December 4, 2013

Too Much Data!

Too much data!

Too much data!

“Die Guten ins Töpfchen, die Schlechten ins Kröpfchen” (the good into the pot, the bad into the crop, in Aschenputtel (Cinderella) by Grimm)

1 Situation and Problem

In the past — say 1990s — we had small data files and small disks. I had a Macintosh with 128k! Today we have a great number of large files, photos have typically 4 MB and movies even 4 GB, and much larger hard disks. Gone are magnetic tapes and similar external media (who remembers Magneto-Optical disks?), replaced by external hard disks with USB connectors.
Regular backup is still necessary - even if you use cloud services! The past years I bought about one hard disk per year, with capacity increasing from 300 GB to 3 TB, to hold (differential) backups of all my computers. The utility rdiff-backup did serve me well; files were easy to recover and it never failed me.
The questions for which I did not have an answer: How long to keep the backups? How to use the backups? How to decide that a backup is not relevant anymore and can be discarded. How to know whether a backup contain a valuable file not kept anywhere else. Even if all files are stored, finding a “lost” file is practically impossible.
This year, I decided that I should not simply buy “yet another hard disk” but clean up. A very cursory analysis showed that I had approximatively 10 Terabyte of date on disks; it was mostly copies of the same files. The copies were produced because all my 4 synchronized computers contain more or less the same files and I started a fresh backups every year. The count of duplicates is approximatively the number of years times the number of computers.
The task is simply (1) to extract one copy of each different valuable file and keep it in a safe place and then discard the remainder and (2) invent a policy to avoid creating the same mess again. Step one will be discussed here, step two discussed in a following blog. To start, two difficult definitions are required:
  1. When are two files equal?
  2. What is a valuable file?
The solution will produce a set of valuable files to keep in a “keepFolder”, to replace all the backups. The keepFolder should contain only one copy of each valuable files.

2 Doves: Design for a Cleanup Tool

Tools like Unison and rdiff-backup characterize files first by path and filename (short “filepath”). The effect is that any reorganisation of the file system (e.g., moving or renaming directories) is reflected as delete and addition - bloating the backup and adding each file twice (once as a deleted file, once as a new addition) and two files are only identified as “the same” if they have the same content and the same filepath.
A backup must reflect the current file structure, but for the store of preserved copies of files longtime ago deleted only one copy of each file is sufficient. Equality of files is here defined by content only and independent of filepath.
The cleanup does
  1. create an inventory of files.
  2. identify a set of unique files.
  3. move the unique files to a safe place (the keepFolder).
Two files are equal, if they have the same content, independent of name or location in a directory. Practically, a file is characterized by it MD5 digest, a 128 bit checksum obtained from the file contents. The more secure and collision resistant SHAL1, which gives a 256 bit checksum, could be used, but the goal here is only detection of files with the same content.
Valuable files are defined negatively by a list of file extensions and a list of directories; files with extensions in the list and all files in directories in the list are non-valuable (i.e. files in directories called “Trash”, “tmp” and files with extensions “bak”, “tmp” etc. ). Unfortunately, programmers are inventive with new names for ephemeral files; the XDG ( proposal to select specific places for ephemeral files is an important step in the right direction. Files which are “special” and not readable to compute an MD5 value, e.g. pipes, links, broken pipes and corrupted non-readable files, are also considered non-valuable.

3 Operations of Doves program

The program to collect the unique files is called “doves”. The operations are:
collect (-c): collect the description of files and directories for all files in a given directory and store it in a doves file (extension .doves)
base (-b): extract the MD5 digests from a doves file (i.e. all the file descriptors for a directory) into a md5 file
process (-p): compare a given doves file (i.e. all files in a directory) with the files in base, i.e. a md5 file; the files not in base are listed in a keep file.
keep (-k): copy all the files in the keep file into a keep directory and update the base md5 file with the MD5 digests of the files in the keep file (to avoid these files in the next directory processed).
One starts with collecting the information for all files currently in the keep folder and producing a base md5 file. The sequence of steps: collect, process against base and keep extracts from a directory all files, which are not already in the keep folder and puts them there. It is repeated for each folder of interest.

4 Programming

Most of the tools necessary, in particular functions to computer MD5 digest or SHA1 checksums are available in Haskell from
The difficulty with programming (in Haskell) was my lack of experience with programming for “large” datasets. The “naive” approaches, which rely on the “natural” lazyness of Haskell, work only for reasonable size tasks, but collecting MD5 digests for the million plus files in 500 GB of backup data tends to exhaust resources. A lazy approach opens all files as quickly as possible (breadth first) and crashes when the maximum number of open files is reached (this is in linux typically limited to 1000 open files per process). The Pipes ( package gives a consistent, generic way to sequence actions to avoid exhaustion of resources and leads to a construction of a program as a sequence of smaller actions which can be composed as a pipe. A sequential (sweeping) approach means dealing with each file individually and excludes approaches which want to process all data at once.
I was also not prepared to deal with the many different ways a file can be “special”, i.e. not readable to produce an MD5 value. It was necessary to test for such exceptions at the first point possible and exclude “special files” from further processing

5 Experience

Handling directories with hundreds of gigabytes is time consuming, especially if stored on external hard disks connected by USB 2.0. Just copying data between two disks takes about 4 minutes per GB, reading and computing the MD5 digest is a bit faster. The time it takes to delete 100 GB from a disk can be 10 minutes (and over an hour for 1 TB)
The reduction is substantial. Overall, from the total 10 TB of backup data only 1 TB is left. This is still more than what I expect to keep longterm, which is perhaps 200 GB; my music collection is about 50 GB and photos fill less than 100 GB, the rest is much less.

6 Open Questions

Pictures and music make a very large part of the files we collect today; keeping only one copy of each makes for me about 100 GB of disk space. Unfortunately, it is hard to get rid of all copies, as different music and picture managing programs stick some additional bits of information into the file: tags for pictures and information about music genre for music, and probably other things. The result is that the file is slightly different and has a different fingerprint; it is not recognized as a copy and thus the different near-copies are all stored separately; this is left for another effort in reduction with a different approach, in this pass I did not consider different file types, except to exclude “non-valuable” files.

Wednesday, April 4, 2012

Spatial planning ability as precursor for human language

Since the NCGIA (1989) research initiative 2 “Language of Spatial Relations” (Mark et al., 1989) and the meeting in Las Navas (Mark and Frank, 1991) we have argued that analyzing the ways humans think about space is fundamental for other – more abstract – kinds of reasoning. I found a most surprising connection recently in an article by Steedman, in which he argues: “Thus it [λ-calculus] is a theory that makes language look as if it has been built on a pre-existing system for planning action in the world, and thereby seem less unique as a cognitive faculty than is usually assumed.”(2002, , p. 4) and “the language faculty in its syntactic aspect is directly hung onto a more primitive set of prelinguistic operations including these combinators, originally developed for motor planning” (p. 5). Some background to the argument:

Planning an optimal path in space is a well established, basic human ability which requires the construction of network knowledge of the environment, combined from multiple trips. Planning an optimal sequence of actions can be computed with the same methods, representing the actions in a State – Transition – Diagram, which is structurally equivalent to a street network. I have discovered recently, that planning actions and executing actions can be seen in a category and in the corresponding co-category (Asperti and Longo, 1991). In robotics, the planning but also the recognition of plans of others is very important. Geib and Steedman (2007) show the structural similarity in the processes used for plan recognition and natural language processing. Producing an explanation for a plan is the same operation as parsing a sentence. This then connects to the initial quote above, which establishes a direct link between the planning of an optimal spatial path and human language, based on a categorical (λ-calculus) argument that these abilities all use the same fundamental process. Developmental arguments demonstrate that the spatial ability is primal and the other are “hung onto” these.

The insight I gain from the connection between path planning, action planning and human language production is first, the correspondences:
  • path/action planning — sentence production,
  • plan recognition — sentence parsing,
  • location/state — concept expressed in sentence,
  • target location/state — concept to communicate,
  • starting location/state — current context of conversation.
The correspondences point out that language production is always based on the “current context” and whether a produced sentence is felicitous (or not) depends on the context (as can be seen in linguistic discussions, for example, Moens and Steedman, 1987). The correspondence further indicates that “locations” in space correspond to concepts (typically complex concepts expressing situations) and we need a representation of concepts and context; here I am looking at the representation by Aerts and Gabora (2005b); Aerts and Gabora (2005a), and compare it with a lattice scheme based on distinctions (Frank, 2006).

The second important insight found in this and other papers by Steedman, but also extensively argued by Carpenter (1997), is the advantage of using λ-calculus over first order predicate logic: “unlike first-order logic, we can in addition provide a term corresponding to the meaning of the verb phrase” (p. 39).


Diederik Aerts and Liane Gabora, “A Theory of Concepts and Their Combinations II: A Hilbert Space Representation”, Kybernetes 34 (2005), pp. 0402205.

Diederik Aerts and Liane Gabora, “A theory of concepts and their combinations I: The structure of the sets of contexts and properties”, Kybernetes 34 (2005), pp. 167–191.

Asperti, Andrea and Longo, Giuseppe, Categories, Types and Structures – An Introduction to Category Theory for the Working Computer Scientist1 (The MIT Press, 1991).

Carpenter, Bob, Type-Logical Semantics (MIT, 1997).

Frank, Andrew U., “Distinctions Produce a Taxonomic Lattice: Are These the Units of Mentalese?”, in Bennett, Brandon and Fellbaum, Christiane, ed., Formal Ontology in Information Systems (Amsterdam: IOS Press, 2006), pp. 27–38.

Geib, C.W. and Steedman, M., “On natural language processing and plan recognition”, in Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI) (, 2007), pp. 1612–1617.

Mark, David M. and Frank, Andrew U., ed., Cognitive and Linguistic Aspects of Geographic Space vol. 63, (Kluwer Academic Publishers, 1991).

Mark, David M. and Frank, Andrew U. and Egenhofer, Max J. and Freundschuh, Scott M. and McGranaghan, Matthew and White, R. Michael, “Languages of Spatial Relations: Initiative Two Specialist Meeting Report”, National Center for Geographic Information and Analysis (1989).

Moens, M. and Steedman, M., “Temporal ontology in natural language”, in Proceedings of the 25th annual meeting on Association for Computational Linguistics (, 1987), pp. 1–7.

NCGIA,, “The U.S. National Center for Geographic Information and Analysis: An Overview of the Agenda for Research and Education”, IJGIS 2, 3 (1989), pp. 117-136.

Steedman, M., “Formalizing affordance”, in Proceedings of the 24th Annual Meeting of the Cognitive Science Society (, 2002), pp. 834–839.

Sunday, January 30, 2011

Optimisms in Software Development

Software development is notoriously behind schedule and above budget. Why are we so unable to estimate the time it takes to produce an application?

Development of hardware is always underestimated and the development of software overestimated; reality is different. For example compare the current multi-core CPU in your laptop with the CPUs built from discrete components 40 years ago and then contrast this stunning advancement with the similarity of FORTRAN and ALGOL used on these CPU's and Java used today!

“It is only a small matter of programming” was originally a 'game people play' (see the book by Eric Berne) and is now the title of a book by Nardi – the game is played as IT projects in most organization in the western world and we all suffer from late, inadequate and error-producing software we are forced to use in our work environment.

I had recently a student proposing to write software to allow end-users to create and adapt graphical user interfaces. He started with a good idea, observing the connection between database schema and GUI, and had quickly wiped together rudimentary graphical editors for GUI and database schema. This must have been emotionally rewarding, i.e. fun, and it seemed a minor step to complete the rest.

Having often started simple looking projects myself, to observe later the hidden complexity that made it hard to produce at the end, something coping with the complexity of a real world application, I start identifying some common sources for the optimism. We do not address the hard questions first and start to program early. The hard questions can be found by:
- Identify a compact but not trivial use case and describe it from the user's perspective in detail.
- Select a formal (or pseudo-formal) language to describe the representation for the information used in the program.
- Describe the functions (transformations) used to achieve the goal in a (pseudo-)formal language.

In my effort to exploit the data structure information to produce a GUI I failed eventually when I started considering exceptions: special situations in the underlying application operations and problems with user input has to be dealt with in parallel with the ordinary application logic. I found the current exception handling violating the principle of locality in my code and confuse me to the point I gave up; I will return to the question when I have a good idea what I could do differently.

Some advice at the end:
Use existing tools and do not duplicate them unless you have a compelling reason; try first to patch together a prototype using existing tools.