Thursday, December 1, 2016

Digital Literary Studies - An Outside View from the Perspective of a Computer Scientist

1 Introduction

Information technology is changing the production of literature, the form of the resulting literature, and the act of reading; it also influences the working methods of literary studies. This is not only because literary texts are captured, transmitted, and edited electronically, and because works of literary scholarship are written on the computer rather than by hand or on a typewriter, but above all because information technology has the potential to introduce new working methods and new ways of gaining insight into literary studies.
It appears that the application of information technology
  • changes the object of literary-scholarly investigation; for a text collection specified as a corpus, categorical judgments ("always", "never") become possible,
and also
  • influences the method; algorithmic interpretation rests on a defined collection of linguistic and factual knowledge and is therefore traceable.

2 The Object of Digital Literary Studies

Many works of literary scholarship deal with a single literary text or with a vaguely delimited set of texts (e.g. "the European novel of the 19th century"). With information technology, digital literary corpora are compiled: texts selected according to specified criteria are prepared in an algorithmically processable form and made available as a corpus. The effort of creating larger corpora is considerable, but it is spread over repeated use by different scholars and for different research projects. Copyright, and differing interpretations of the relationship between copyright and the freedom of science, sometimes restrict access considerably [11]. Prepared texts are often already "marked up" according to a particular standard (e.g. TEI [9]), which means that pagination, variant readings, differences between editions, etc. are also encoded.
If a work of literary scholarship is based on a text corpus, at least the algorithmically determined part of the work becomes repeatable. Other researchers can, in principle, check whether they obtain the same results with the same corpus and the same methods; more important, however, is the possibility of examining how the results change when the corpus is extended or restricted, or when an entirely different corpus is evaluated with the same methods. Research results in literary studies are then more sharply delineated, related to specific collections of literary works, and comparable with other results. The interpretation of the algorithmically obtained results remains, as before, the task of the scholar.
The use of corpora also brings a methodological gain: with respect to a fixed corpus it is possible to make statements of the form "in this corpus there is no case in which ..." or "in this corpus it is always the case that ...". Such categorical judgments were previously possible only for small, surveyable sets of texts; with the definition of a corpus and the algorithmic examination of the texts it contains, they become possible for large text collections as well.
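Formally, such corpus-level judgments are just universal and existential quantification over the texts of a fixed corpus; all the philological weight lies in the predicate. A minimal sketch in Haskell (my illustration; the function names and the toy predicate are hypothetical):

import Data.List (isInfixOf)

type Text = String
type Corpus = [Text]

-- "in this corpus there is no case in which p holds"
never :: (Text -> Bool) -> Corpus -> Bool
never p = not . any p

-- "in this corpus it is always the case that p holds"
always :: (Text -> Bool) -> Corpus -> Bool
always = all

-- hypothetical toy predicate: does the text mention "Wien"?
mentionsVienna :: Text -> Bool
mentionsVienna = isInfixOf "Wien"

The judgment never mentionsVienna corpus is exact, but only relative to the chosen corpus and the chosen predicate.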

3 Methodology of Digital Literary Studies

The interpretation of a literary text arises in reading, through the connection of the signs in the text with the knowledge of the reader. Just as the object of a digital investigation is fixed as a corpus, the knowledge that makes the interpretation possible must also be described. This knowledge includes knowledge of the language used, but also general knowledge and, finally, the specialized knowledge required for a deeper understanding. A digital analysis of literature must therefore specify not only the object of investigation, i.e. the corpus of texts examined, but also the knowledge and the logical rules used in the analysis.
A digital analysis of a natural-language text falls into several phases. A first step is usually the processing of the linguistic surface with the tools of computational linguistics (e.g. Stanford CoreNLP, https://stanfordnlp.github.io/CoreNLP/, available for several languages [12]): the text is split into words, the words are reduced to stems, the grammatical constructions are analysed, references and cross-references in the text are encoded, and proper names are recognized. The knowledge employed in this step can be documented, in a simplified but traceable way, by a reference to the program used (or to the program and the corpus used for its training).
Depending on the research question, the connection of the text with other knowledge can follow: for example, an analysis of spatial reference by linking place names with geographical knowledge, or of the relationships between characters by linking them with historical knowledge or with knowledge of the Bible and of mythology. Tools of computational linguistics can also be used to find and analyse passages in other languages; digital methods can make such clues visible systematically and for more languages than a single reader can master (for example, at least Russian ("rot"), French ("neige"), and Japanese ("i-i-e") words can be found in poems by Celan). The investigation of plots according to narratological schemata, as initiated by Propp [15] and to be developed into a computational narratology (http://www.lhn.uni-hamburg.de), requires generalizations that linguists provide in taxonomically organized word lists (WordNet, http://wordnet.princeton.edu/ [2], and the like).
What is decisive is that the knowledge applied in an algorithmic interpretation and the logical inference rules used are described in a traceable way. In practice, the knowledge can be described by naming the tools employed in the analysis: the individual analysis steps, the databases (e.g. DBpedia, http://dbpedia.org/ - a logically structured form of large parts of the content of Wikipedia, encoded as RDF), the taxonomies, and the inference rules (e.g. OWL [14]). Their influence on the analysis is thereby documented, verifiable, and open to criticism.
The technical processing is made easier when it is divided into steps and the inputs and outputs use standardized formats. For the linguistic analysis, treebanks (with slightly different encodings) are used; language-independent solutions have also been proposed [10, 17, 16], which are likely to be particularly fruitful for comparative literature [6, 7]. A large part of the factual knowledge to be drawn on for literary analysis is already available in the Semantic Web [1, 5] in RDF form [13]; it could be useful to encode corpora in RDF as well, in order to simplify the connection with the Semantic Web [8].

4 Coda

Through a consistent description of the object of study - as a corpus accessible to machine processing - and of the related factual knowledge used in the analysis - in the form of programs and the Semantic Web - it becomes possible to check and objectify interpretations algorithmically.
The digital form of texts (and similar materials) permits algorithmic processing; this changes literary studies insofar as hypotheses become testable. What is needed is a description of the object of study, in the form of the texts included in the corpus, and a model of reading and of the factual knowledge brought to it. It can then be decided whether an interpretation of a text is possible with the stated knowledge, which interpretations result from limited knowledge (e.g. missing foreign-language skills), and which factual knowledge is necessary for a particular interpretation.
A fully "automatic" interpretation remains a distant prospect, but through the automatic production of all possible associations (taxonomic and factual knowledge) and their grouping, the scholar could be pointed to interpretive hypotheses that might otherwise be overlooked and that must then be assessed critically.

References

[1] Tim Berners-Lee, James Hendler, Ora Lassila: “The Semantic Web”, Scientific American, pp. 28—37, 2001.
[2] Christiane Fellbaum: WordNet: An Electronic Lexical Database. The MIT Press, 1998.
[3] Hanno Biber, Evelyn Breiteneder, Karlheinz Mörth: “Words in Contexts: Digital Editions of Literary Journals in the "AAC - Austrian Academy Corpus" ”, Proceedings of the International Conference on Language Resources and Evaluation, LREC 2008, 26 May - 1 June 2008, Marrakech, Morocco, 2008. URL http://www.lrec-conf.org/proceedings/lrec2008/summaries/92.html.
[4] Hanno Biber, Evelyn Breiteneder: “Fivehundredmillionandone Tokens. Loading the AAC Container with Text Resources for Text Studies”, Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012), Istanbul, Turkey, May 23-25, 2012, pp. 1067—1070, 2012. URL http://www.lrec-conf.org/proceedings/lrec2012/summaries/857.html.
[5] Pascal Hitzler, Markus Krötzsch, Sebastian Rudolph, York Sure: “Semantic Web”, Berlin, Heidelberg, 2008.
[6] Christine Ivanovic, Andrew U Frank: “Corpus-based Research in Computational Comparative Literature”, Corpus-Based Research in the Humanities (CRH), pp. 69, 2015.
[7] Christine Ivanovic, Andrew U Frank: “Korpusanalyse in der computergestützten Komparatistik”, Digital Humanities deutsch (DHd), 2016.
[8] Christine Ivanovic, Andrew U Frank: Viennavigator: Digitale Formalisierung literarischer Topographien am Beispiel des Gesamtwerks von Ilse Aichinger in Nach Wien! Sehnsucht, Distanzierung, Suche. Literarische Darstellungen Wiens aus komparatistischer Perspektive (Bachleitner, Norbert and Ivanovic, Christine, ed.). Peter Lang, Frankfurt a.M., 2015.
[9] Fotis Jannidis: “TEI in a crystal ball”, Literary and linguistic computing, pp. 253—265, 2009.
[10] L. Banarescu, C. Bonial, S. Cai, M. Georgescu, K. Griffitt, U. Hermjakob, K. Knight, P. Koehn, M. Palmer, N. Schneider: “Abstract Meaning Representation for Sembanking”, Proceedings of the 7th Linguistic Annotation Workshop and Interoperability with Discourse, 2013.
[11] Lawrence Lessig: Code and Other Laws of Cyberspace. Basic Books, 1999.
[12] Christopher D Manning, Mihai Surdeanu, John Bauer, Jenny Rose Finkel, Steven Bethard, David McClosky: “The Stanford CoreNLP Natural Language Processing Toolkit.”, ACL (System Demonstrations), pp. 55—60, 2014.
[13] Frank Manola, Eric Miller, Brian McBride, others: “RDF primer”, W3C recommendation, pp. 6, 2004.
[14] D.L. McGuinness, F. Van Harmelen, others: “OWL web ontology language overview”, W3C recommendation, pp. 10, 2004.
[15] Vladimir Jakovlevič Propp, EM Meletinskij, Christel Wendt: Morphologie des Märchens. Carl Hanser Verlag, 1972.
[16] Lucy Vanderwende, Arul Menezes, Chris Quirk: “An AMR parser for English, French, German, Spanish and Japanese and a new AMR-annotated corpus”, Proceedings of NAACL-HLT, pp. 26—30, 2015.
[17] Nianwen Xue, Ondrej Bojar, Jan Hajic, Martha Palmer, Zdenka Uresova, Xiuhong Zhang: “Not an Interlingua, But Close: Comparison of English AMRs to Chinese and Czech.”, LREC, pp. 1765—1772, 2014.

Saturday, April 2, 2016

Debian with XFCE4 on Lenovo Thinkpad Yoga14

I bought a Lenovo Thinkpad Yoga14 because I wanted a new device with handwriting input for writing and drawing; I knew that it would work with Ubuntu and assumed that it would also work with a regular Debian installation (I dislike Canonical's approach of forcing its views on users).

Installation of Debian was somewhat more difficult; here is a short description of what worked for me in the end.

1. I observed that my working installation of Ubuntu used libwacom-bin, which is only in Debian stretch; thus I decided to install stretch (aka testing). Installations of jessie did not support the pointer.

2. Installing Debian stretch (alpha 5 release) requires the non-free firmware and additionally the firmware for the Intel Wireless 7265 rev 61 (aka 7265D), which is iwlwifi-7260-17.ucode (download from a git...); I put it on a second USB stick (on the first I had the unofficial stretch release including the firmware).

3. Install as usual and insert the USB stick with the firmware when requested.

4. Restart and create /etc/apt/sources.list (not created automatically in alpha 5):

deb http://ftp.at.debian.org/debian/ stretch main contrib non-free 
deb http://security.debian.org/debian stretch/updates main contrib non-free 

and /etc/network/interfaces

auto lo 
iface lo inet loopback 

auto wlp2s0 
iface wlp2s0 inet dhcp 
     wpa-ssid xxxx 
     wpa-psk "pppppp"

5. Restart and log in as root; check that the network is OK.

Then:

apt-get update 
apt-get upgrade                                  (some problem with iwlwifi? - it needed an apt-get upgrade -f) 
apt-get install xfce4 xfce4-goodies xournal      (xournal for drawing) 
startx                                           (to test) 
apt-get install lightdm

It works: when you restart, the regular xfce login greeter is there!

Wednesday, December 4, 2013

Too Much Data!

“Die Guten ins Töpfchen, die Schlechten ins Kröpfchen” (the good into the pot, the bad into the crop, in Aschenputtel (Cinderella) by Grimm)

1 Situation and Problem

In the past, say the 1990s, we had small data files and small disks. I had a Macintosh with 128k! Today we have a great number of large files (photos are typically 4 MB and movies even 4 GB) and much larger hard disks. Gone are magnetic tapes and similar external media (who remembers magneto-optical disks?), replaced by external hard disks with USB connectors.
Regular backup is still necessary - even if you use cloud services! Over the past years I bought about one hard disk per year, with capacities increasing from 300 GB to 3 TB, to hold (differential) backups of all my computers. The utility rdiff-backup served me well; files were easy to recover and it never failed me.
The questions for which I did not have an answer: How long to keep the backups? How to use the backups? How to decide that a backup is no longer relevant and can be discarded? How to know whether a backup contains a valuable file not kept anywhere else? Even if all files are stored, finding a “lost” file is practically impossible.
This year, I decided that I should not simply buy “yet another hard disk” but clean up. A very cursory analysis showed that I had approximately 10 terabytes of data on disks; it was mostly copies of the same files. The copies were produced because all my 4 synchronized computers contain more or less the same files and I started a fresh backup every year. The number of duplicates is approximately the number of years times the number of computers.
The task is simply (1) to extract one copy of each different valuable file, keep it in a safe place, and discard the remainder, and (2) to invent a policy that avoids creating the same mess again. Step one is discussed here; step two will be discussed in a following blog post. To start, two difficult definitions are required:
  1. When are two files equal?
  2. What is a valuable file?
The solution will produce a set of valuable files to keep in a “keepFolder”, replacing all the backups. The keepFolder should contain only one copy of each valuable file.

2 Doves: Design for a Cleanup Tool

Tools like Unison and rdiff-backup characterize files first by path and filename (“filepath” for short). The effect is that any reorganisation of the file system (e.g., moving or renaming directories) is reflected as a deletion plus an addition - bloating the backup and storing each affected file twice (once as a deleted file, once as a new addition) - and two files are only identified as “the same” if they have the same content and the same filepath.
A backup must reflect the current file structure, but for a store of preserved copies of files deleted long ago, one copy of each file is sufficient. Equality of files is here defined by content only, independent of the filepath.
The cleanup does
  1. create an inventory of files.
  2. identify a set of unique files.
  3. move the unique files to a safe place (the keepFolder).
Two files are equal if they have the same content, independent of name or location in a directory. Practically, a file is characterized by its MD5 digest, a 128-bit checksum computed from the file contents. The more collision-resistant SHA-1, which gives a 160-bit checksum, could be used instead, but the goal here is only the detection of files with the same content.
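As a sketch of this content-based notion of equality (my illustration, not the doves code itself; it assumes the pureMD5 package from Hackage, and the function names are made up):

import qualified Data.ByteString.Lazy as BL
import Data.Digest.Pure.MD5 (md5)

-- Digest of a file's content; the path plays no role.
fileDigest :: FilePath -> IO String
fileDigest path = show . md5 <$> BL.readFile path

-- Content equality, independent of name and directory.
sameContent :: FilePath -> FilePath -> IO Bool
sameContent a b = (==) <$> fileDigest a <*> fileDigest b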
Valuable files are defined negatively, by a list of file extensions and a list of directory names: files with extensions on the list and all files in directories on the list are non-valuable (i.e. files in directories called “Trash” or “tmp”, files with extensions “bak”, “tmp”, etc.). Unfortunately, programmers are inventive with new names for ephemeral files; the XDG proposal (http://standards.freedesktop.org/basedir-spec/basedir-spec-latest.html) to select specific places for ephemeral files is an important step in the right direction. Files which are “special” and cannot be read to compute an MD5 value, e.g. pipes, links, broken pipes and corrupted non-readable files, are also considered non-valuable. A sketch of such a filter follows.
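A minimal sketch of this negative definition, again my own illustration; the concrete lists are only examples beyond the “Trash”/“tmp”/“bak” names mentioned above:

import System.FilePath (takeExtension, takeDirectory, splitDirectories)

-- Example ignore lists (hypothetical values).
ignoredExtensions :: [String]
ignoredExtensions = [".bak", ".tmp"]

ignoredDirectories :: [String]
ignoredDirectories = ["Trash", "tmp"]

-- A file is valuable unless its extension, or any directory on its path,
-- appears on one of the ignore lists.
isValuable :: FilePath -> Bool
isValuable path =
  takeExtension path `notElem` ignoredExtensions
  && all (`notElem` ignoredDirectories) (splitDirectories (takeDirectory path))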

3 Operations of the Doves Program

The program to collect the unique files is called “doves”. The operations are:
collect (-c): collect the descriptions of files and directories for all files in a given directory and store them in a doves file (extension .doves).
base (-b): extract the MD5 digests from a doves file (i.e. all the file descriptors for a directory) into an md5 file.
process (-p): compare a given doves file (i.e. all files in a directory) with the files in the base, i.e. an md5 file; the files not in the base are listed in a keep file.
keep (-k): copy all the files in the keep file into a keep directory and update the base md5 file with the MD5 digests of the files in the keep file (so that these files are skipped when the next directory is processed).
One starts by collecting the information for all files currently in the keep folder and producing a base md5 file. The sequence of steps collect, process against the base, and keep then extracts from a directory all files which are not already in the keep folder and puts them there. This is repeated for each folder of interest.

4 Programming

Most of the tools necessary, in particular functions to compute MD5 digests or SHA-1 checksums, are available in Haskell from www.haskell.org/hackage.
The difficulty with programming (in Haskell) was my lack of experience with programming for “large” datasets. The “naive” approaches, which rely on the “natural” laziness of Haskell, work only for tasks of reasonable size; collecting MD5 digests for the million-plus files in 500 GB of backup data tends to exhaust resources. A lazy approach opens all files as quickly as possible (breadth first) and crashes when the maximum number of open files is reached (in Linux this is typically limited to about 1000 open files per process). The Pipes package (http://hackage.haskell.org/package/pipes) gives a consistent, generic way to sequence actions so that resources are not exhausted, and leads to a construction of the program as a sequence of smaller actions which can be composed as a pipe. A sequential (sweeping) approach means dealing with each file individually and excludes approaches which want to process all data at once.
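A rough sketch of that style, assuming the pipes and pureMD5 packages (this is my reconstruction, not the actual doves code, and it ignores the “special” files discussed below): a Producer walks the directory tree and yields one path at a time, and the downstream stages hash and print, so files are handled one after another instead of all at once.

import Pipes
import qualified Pipes.Prelude as P
import Control.Monad (forM_)
import System.Directory (listDirectory, doesDirectoryExist)
import System.FilePath ((</>))
import qualified Data.ByteString.Lazy as BL
import Data.Digest.Pure.MD5 (md5)

-- Yield the files below a directory one at a time (depth first),
-- so downstream stages see a stream instead of a huge list.
walk :: FilePath -> Producer FilePath IO ()
walk dir = do
  entries <- lift (listDirectory dir)
  forM_ entries $ \entry -> do
    let path = dir </> entry
    isDir <- lift (doesDirectoryExist path)
    if isDir then walk path else yield path

-- Hash one file; printing the digest downstream forces the read,
-- so each file is finished before the next path is pulled.
digestLine :: FilePath -> IO String
digestLine path = do
  content <- BL.readFile path
  return (show (md5 content) ++ "  " ++ path)

main :: IO ()
main = runEffect $ walk "." >-> P.mapM digestLine >-> P.stdoutLn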
I was also not prepared to deal with the many different ways a file can be “special”, i.e. not readable for producing an MD5 value. It was necessary to test for such exceptions at the earliest possible point and exclude “special” files from further processing.

5 Experience

Handling directories with hundreds of gigabytes is time consuming, especially if they are stored on external hard disks connected by USB 2.0. Just copying data between two disks takes about 4 minutes per GB; reading and computing the MD5 digest is a bit faster. Deleting 100 GB from a disk can take 10 minutes (and over an hour for 1 TB).
The reduction is substantial: of the total 10 TB of backup data, only 1 TB is left. This is still more than what I expect to keep long-term, which is perhaps 200 GB; my music collection is about 50 GB, photos fill less than 100 GB, and the rest is much less.

6 Open Questions

Pictures and music make up a very large part of the files we collect today; keeping only one copy of each amounts, for me, to about 100 GB of disk space. Unfortunately, it is hard to get rid of all copies, because different music- and picture-managing programs stick additional bits of information into the files: tags for pictures, genre information for music, and probably other things. The result is that the file is slightly different and has a different fingerprint; it is not recognized as a copy, and the different near-copies are all stored separately. This is left for another effort in reduction with a different approach; in this pass I did not consider different file types, except to exclude “non-valuable” files.

Wednesday, April 4, 2012

Spatial planning ability as precursor for human language


Since the NCGIA (1989) research initiative 2 “Languages of Spatial Relations” (Mark et al., 1989) and the meeting in Las Navas (Mark and Frank, 1991) we have argued that analyzing the ways humans think about space is fundamental for other - more abstract - kinds of reasoning. I recently found a most surprising connection in an article by Steedman, in which he argues: “Thus it [λ-calculus] is a theory that makes language look as if it has been built on a pre-existing system for planning action in the world, and thereby seem less unique as a cognitive faculty than is usually assumed.” (2002, p. 4) and “the language faculty in its syntactic aspect is directly hung onto a more primitive set of prelinguistic operations including these combinators, originally developed for motor planning” (p. 5). Some background to the argument:

Planning an optimal path in space is a well-established, basic human ability, which requires the construction of network knowledge of the environment, combined from multiple trips. Planning an optimal sequence of actions can be computed with the same methods, by representing the actions in a state-transition diagram, which is structurally equivalent to a street network. I have discovered recently that planning actions and executing actions can be seen in a category and in the corresponding co-category (Asperti and Longo, 1991). In robotics, not only the planning but also the recognition of the plans of others is very important. Geib and Steedman (2007) show the structural similarity of the processes used for plan recognition and natural language processing: producing an explanation for a plan is the same operation as parsing a sentence. This connects back to the initial quote above, which establishes a direct link between the planning of an optimal spatial path and human language, based on a categorical (λ-calculus) argument that these abilities all use the same fundamental process. Developmental arguments indicate that the spatial ability is primal and the others are “hung onto” it.
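The claimed structural equivalence can be made concrete: the same search procedure plans a route through a street network and a sequence of actions through a state-transition diagram, because both are graphs. A minimal Haskell sketch (my illustration; the node names and the two toy successor functions are made up), a breadth-first search parameterized only by a successor function:

import qualified Data.Set as Set

-- The same planner works on any graph, whether the nodes are street
-- intersections or states in a state-transition diagram.
plan :: Ord node => (node -> [node]) -> node -> node -> Maybe [node]
plan successors start goal = go Set.empty [[start]]
  where
    go _ [] = Nothing
    go visited (path@(current:_) : rest)
      | current == goal              = Just (reverse path)
      | current `Set.member` visited = go visited rest
      | otherwise =
          go (Set.insert current visited)
             (rest ++ [next : path | next <- successors current])
    go visited ([] : rest) = go visited rest

-- Toy street network: intersections and the streets connecting them.
streets :: String -> [String]
streets "A" = ["B", "C"]
streets "B" = ["D"]
streets "C" = ["D"]
streets _   = []

-- Toy action plan: states and the states reachable by one action.
actions :: String -> [String]
actions "hungry"    = ["at market"]
actions "at market" = ["have food"]
actions "have food" = ["fed"]
actions _           = []

-- plan streets "A" "D"        == Just ["A","B","D"]
-- plan actions "hungry" "fed" == Just ["hungry","at market","have food","fed"]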

The insight I gain from the connection between path planning, action planning and human language production is, first, the correspondences:
  • path/action planning — sentence production,
  • plan recognition — sentence parsing,
  • location/state — concept expressed in sentence,
  • target location/state — concept to communicate,
  • starting location/state — current context of conversation.
The correspondences point out that language production is always based on the “current context” and that whether a produced sentence is felicitous (or not) depends on the context (as can be seen in linguistic discussions, for example Moens and Steedman, 1987). The correspondences further indicate that “locations” in space correspond to concepts (typically complex concepts expressing situations) and that we need a representation of concepts and context; here I am looking at the representation by Aerts and Gabora (2005a; 2005b) and comparing it with a lattice scheme based on distinctions (Frank, 2006).

The second important insight, found in this and other papers by Steedman but also extensively argued by Carpenter (1997), is the advantage of using λ-calculus over first-order predicate logic: “unlike first-order logic, we can in addition provide a term corresponding to the meaning of the verb phrase” (p. 39).

References

Diederik Aerts and Liane Gabora, “A Theory of Concepts and Their Combinations II: A Hilbert Space Representation”, Kybernetes 34 (2005).

Diederik Aerts and Liane Gabora, “A theory of concepts and their combinations I: The structure of the sets of contexts and properties”, Kybernetes 34 (2005), pp. 167–191.

Asperti, Andrea and Longo, Giuseppe, Categories, Types and Structures – An Introduction to Category Theory for the Working Computer Scientist (The MIT Press, 1991).

Carpenter, Bob, Type-Logical Semantics (MIT, 1997).

Frank, Andrew U., “Distinctions Produce a Taxonomic Lattice: Are These the Units of Mentalese?”, in Bennett, Brandon and Fellbaum, Christiane, ed., Formal Ontology in Information Systems (Amsterdam: IOS Press, 2006), pp. 27–38.

Geib, C.W. and Steedman, M., “On natural language processing and plan recognition”, in Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI) (2007), pp. 1612–1617.

Mark, David M. and Frank, Andrew U., ed., Cognitive and Linguistic Aspects of Geographic Space vol. 63, (Kluwer Academic Publishers, 1991).

Mark, David M. and Frank, Andrew U. and Egenhofer, Max J. and Freundschuh, Scott M. and McGranaghan, Matthew and White, R. Michael, “Languages of Spatial Relations: Initiative Two Specialist Meeting Report”, National Center for Geographic Information and Analysis (1989).

Moens, M. and Steedman, M., “Temporal ontology in natural language”, in Proceedings of the 25th Annual Meeting of the Association for Computational Linguistics (1987), pp. 1–7.

NCGIA, “The U.S. National Center for Geographic Information and Analysis: An Overview of the Agenda for Research and Education”, IJGIS 2, 3 (1989), pp. 117–136.

Steedman, M., “Formalizing affordance”, in Proceedings of the 24th Annual Meeting of the Cognitive Science Society (2002), pp. 834–839.

Sunday, January 30, 2011

Optimisms in Software Development

Software development is notoriously behind schedule and above budget. Why are we so unable to estimate the time it takes to produce an application?

The progress of hardware is habitually underestimated and the progress of software overestimated; reality turns out differently. For example, compare the current multi-core CPU in your laptop with the CPUs built from discrete components 40 years ago, and then contrast this stunning advance with the similarity between the FORTRAN and ALGOL used on those CPUs and the Java used today!

“It is only a small matter of programming” was originally a ‘game people play’ (see the book by Eric Berne) and is now the title of a book by Nardi. The game is played as IT projects in most organizations in the western world, and we all suffer from the late, inadequate and error-producing software we are forced to use in our work environment.

I recently had a student proposing to write software that allows end-users to create and adapt graphical user interfaces. He started with a good idea, observing the connection between database schema and GUI, and had quickly whipped together rudimentary graphical editors for the GUI and the database schema. This must have been emotionally rewarding, i.e. fun, and it seemed a minor step to complete the rest.

Having often started simple-looking projects myself, only to observe later the hidden complexity that made it hard to produce, in the end, something that copes with the complexity of a real-world application, I have started to identify some common sources of this optimism. We do not address the hard questions first, and we start to program too early. The hard questions can be found by the following steps (a small sketch follows the list):
- Identify a compact but not trivial use case and describe it from the user's perspective in detail.
- Select a formal (or pseudo-formal) language to describe the representation for the information used in the program.
- Describe the functions (transformations) used to achieve the goal in a (pseudo-)formal language.
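To illustrate the second and third steps on the GUI example above (purely my sketch, in Haskell; all names are hypothetical): the information representation and the transformation can both be stated in the same (pseudo-)formal language before any GUI code is written.

-- A formal description of the information used by the program:
-- a toy database schema ...
data FieldType = IntField | TextField | DateField

data Field = Field { fieldName :: String, fieldType :: FieldType }

type Schema = [Field]

-- ... and the transformation toward the goal, stated in the same
-- language: derive a GUI widget description from the schema.
data Widget = SpinBox String | TextBox String | DatePicker String
  deriving (Show)

widgetFor :: Field -> Widget
widgetFor (Field name IntField)  = SpinBox name
widgetFor (Field name TextField) = TextBox name
widgetFor (Field name DateField) = DatePicker name

guiFor :: Schema -> [Widget]
guiFor = map widgetFor

Writing even this much down forces the hard questions (what about exceptions, validation, user errors?) to surface before programming starts.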

In my own effort to exploit the data structure information to produce a GUI, I eventually failed when I started considering exceptions: special situations in the underlying application operations and problems with user input have to be dealt with in parallel with the ordinary application logic. I found that the current exception handling violated the principle of locality in my code and confused me to the point that I gave up; I will return to the question when I have a good idea of what I could do differently.

Some advice at the end:
Use existing tools and do not duplicate them unless you have a compelling reason; try first to patch together a prototype using existing tools.

Thursday, December 23, 2010

Finding potential applications for technology, e.g. the Smartphone

In his thesis “Potential use of GPS on the Smartphone”, Jürgen Brandstetter investigated whether there are methods to find new application ideas systematically. I suggested that Maslow's hierarchy of needs could help to identify human needs for which new technology could lead to more satisfaction. Maslow's pyramid orders needs and suggests that more basic needs must be satisfied before higher-level needs are addressed.

The lowest level consists of physiological needs, like drinking, eating, sleeping, clothing and shelter. The second level consists of safety needs - avoiding and being protected from dangers. Then come “love and belonging”, esteem, and finally self-actualization.

The student's thesis started mostly as a review of iPhone applications for finding hotels, restaurants and friends, and for tracking. To a degree, smartphone applications are “solutions in search of a problem” - often a very particular problem, suggesting a niche for one of the thousands of specialized applications. They seem to follow a rule that only very specialized applications can be made simple enough to be usable without lengthy instructions: a restaurant finder for a city (and each city its own), a finder for tennis courts, etc. In the long run, this is an inefficient method, requiring much learning from users.

Thinking through the student's thesis, I realized that Maslow could indeed serve as guidance:

The first-level needs are physical processes; the smartphone can help to find places offering the pertinent affordance (e.g. a place where one can eat or sleep). Common to all such applications are: which affordance? how urgent? how close? criteria to identify the optimal solution, and finally guidance to move to the selected place. The commonality is confirmed technically: the thousands of different “find the best x” applications mostly use the Google Maps API for showing locations and providing wayfinding instructions. Maslow suggests that a general “find an affordance to satisfy a primary need” application is possible. This would avoid the problem of, say, a hotel finder which claims nothing is found, even though a local youth hostel could satisfy the need for sleep and shelter.

Applications that inform us about dangers - both physical dangers, from inclement weather to landslides and tsunamis, and social dangers in areas with high criminality - seem equally possible, helping us to avoid them and to protect ourselves.

Smartphones and social networks are currently experimenting with methods to satisfy “love and esteem” needs: finding friends nearby, finding locations where social contacts are facilitated, but also telling others “what is here” to gain esteem in a community.

Numerous computerized helpers exist to increase our self-actualization; smartphones with GPS can, for example, track our movements and document to ourselves the running or bicycling we have done, localize the photographs we have taken, etc.

In summary, Maslow's hierarchy of needs seems to provide a useful generalization and classification of human needs. It could be used to build more general applications than what is currently provided. The levels of needs indicate commonalities between needs, which can be used to reduce the myriad of particular applications to a few more generalized ones. The approach follows the path from the need to the solution, and not the often unsuccessful reverse.

Three Ages of Geographic Information Users:


Substantial changes are occurring in how GIS are used. Roughly three ages can be differentiated:
past: public administration,
current: commercial, and
future: personal.

Past: In the 1970s, Geographic Information Systems were proposed to use the then-new electronic data processing machines to reduce the duplication of collection and maintenance costs of spatial information in public administration. Multi-purpose cadastres were designed to integrate available data based on location, to improve urban planning, to maintain urban infrastructure and to avoid accidents. The Harvard Graphics Lab at that time included researchers like Jack Dangermond and Nick Chrisman, to name but two who are still influencing GIS today.

Present: In 1990 the U.S. Bureau of the Census put street centerlines online, and topography became available from the USGS. These datasets allowed commercial users to geocode their client data with street addresses and to use spatial analysis tools. GPS receivers and mobile communication devices helped logistics businesses manage their fleets and improve the dispatching of vehicles. MapQuest, Google and car navigation systems now hold improved street network and traffic data and sell navigation guidance in different forms, often paid for indirectly by advertisement. An increasing number of ordinary people today use location-related information on a day-to-day basis.

Future: People manage their personal information increasingly in electronic form on multiple devices: calendars in the Internet “cloud”, address lists and phone books on PCs or mobile phones. Personal collections of digital photographs grow quickly. My new digital camera includes a GPS receiver and all photographs are geocoded, but I cannot ask “How did this place look when I last visited?” My smartphone allows me to track my movements during the day with GPS. The different tools producing and managing my personal data are mostly location-blind and non-cooperative. Location could in the future serve as an organizing principle. I expect the next generation of personal information management systems (PIM) to be spatially aware PIMs (sPIM), which amounts to a GIS for personal use (pGIS).

Two concluding observations:

- The market for systems for public administration is mostly saturated, and the much larger commercial market is still growing. Imagine the market for pGIS - virtually everybody is a potential user!

- The time it takes from research to widespread application is much longer than expected: it took 10 to 20 years for GIS in public administration and commercial use. The research required to make pGIS a reality, done today, will be the foundation of the killer application of the 2020s!