Wednesday, December 4, 2013

Too Much Data!

“Die Guten ins Töpfchen, die Schlechten ins Kröpfchen” (“the good ones into the pot, the bad ones into the crop”, from Aschenputtel (Cinderella) by the Brothers Grimm)

1 Situation and Problem

In the past — say the 1990s — we had small data files and small disks; I had a Macintosh with 128k! Today we have a great number of large files (a photo is typically 4 MB and a movie can reach 4 GB) and much larger hard disks. Gone are magnetic tapes and similar external media (who remembers Magneto-Optical disks?), replaced by external hard disks with USB connectors.
Regular backup is still necessary - even if you use cloud services! In the past years I bought about one hard disk per year, with capacities increasing from 300 GB to 3 TB, to hold (differential) backups of all my computers. The utility rdiff-backup served me well; files were easy to recover and it never failed me.
The questions for which I did not have an answer: How long to keep the backups? How to use the backups? How to decide that a backup is no longer relevant and can be discarded? How to know whether a backup contains a valuable file not kept anywhere else? Even if all files are stored, finding a “lost” file is practically impossible.
This year, I decided that I should not simply buy “yet another hard disk” but clean up. A very cursory analysis showed that I had approximately 10 terabytes of data on disks; most of it was copies of the same files. The copies were produced because all my 4 synchronized computers contain more or less the same files and I started a fresh backup every year. The number of duplicates of each file is therefore approximately the number of years times the number of computers.
The task is simply (1) to extract one copy of each different valuable file, keep it in a safe place, and discard the remainder, and (2) to invent a policy that avoids creating the same mess again. Step one is discussed here; step two will be discussed in a following blog post. To start, two difficult definitions are required:
  1. When are two files equal?
  2. What is a valuable file?
The solution will produce a set of valuable files to keep in a “keepFolder”, replacing all the backups. The keepFolder should contain only one copy of each valuable file.

2 Doves: Design for a Cleanup Tool

Tools like Unison and rdiff-backup characterize files first by path and filename (“filepath” for short). The effect is that any reorganisation of the file system (e.g., moving or renaming directories) is recorded as a deletion plus an addition, bloating the backup and storing each affected file twice (once as a deleted file, once as a new addition); two files are only identified as “the same” if they have the same content and the same filepath.
A backup must reflect the current file structure, but for a store of preserved copies of files deleted long ago, one copy of each file is sufficient. Equality of files is here defined by content only, independent of filepath.
The cleanup consists of three steps:
  1. create an inventory of files.
  2. identify a set of unique files.
  3. move the unique files to a safe place (the keepFolder).
Two files are equal if they have the same content, independent of name or location in a directory. Practically, a file is characterized by its MD5 digest, a 128 bit checksum computed from the file contents. The more secure and collision-resistant SHA-1 (160 bit checksum) or SHA-256 (256 bit checksum) could be used instead, but the goal here is only the detection of files with the same content.
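As an illustration, here is a minimal sketch of computing such a digest in Haskell, assuming the pureMD5 package from Hackage (doves may well use a different library; the file name is only a placeholder):

  import qualified Data.ByteString.Lazy as BL
  import Data.Digest.Pure.MD5 (MD5Digest, md5)

  -- Compute the 128 bit MD5 digest of a file's content; the digest,
  -- not the name or path, identifies the file.
  fileDigest :: FilePath -> IO MD5Digest
  fileDigest path = fmap md5 (BL.readFile path)

  main :: IO ()
  main = fileDigest "example.jpg" >>= print  -- "example.jpg" is a placeholder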
Valuable files are defined negatively, by a list of file extensions and a list of directory names: files with an extension on the list and all files in directories on the list are non-valuable (e.g. files in directories called “Trash” or “tmp”, and files with extensions “bak”, “tmp”, etc.). Unfortunately, programmers are inventive with new names for ephemeral files; the XDG base directory proposal (http://standards.freedesktop.org/basedir-spec/basedir-spec-latest.html) to select specific places for ephemeral files is an important step in the right direction. Files which are “special” and cannot be read to compute an MD5 value, e.g. pipes, links (possibly broken) and corrupted non-readable files, are also considered non-valuable.
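A sketch of how such a negative definition can be expressed in Haskell; the exclusion lists below are examples of my own, not the actual doves configuration:

  import Data.Char (toLower)
  import System.FilePath (splitDirectories, takeExtension)

  -- Example exclusion lists; a real configuration would be longer.
  badExtensions :: [String]
  badExtensions = [".bak", ".tmp", ".o", ".hi"]

  badDirectories :: [String]
  badDirectories = ["Trash", ".Trash", "tmp", ".cache"]

  -- A file is non-valuable if its extension or any directory on its
  -- path appears on one of the exclusion lists.
  isValuable :: FilePath -> Bool
  isValuable path =
       map toLower (takeExtension path) `notElem` badExtensions
    && all (`notElem` badDirectories) (splitDirectories path)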

3 Operations of Doves program

The program to collect the unique files is called “doves”. The operations are:
collect (-c): collect the descriptions of files and directories for all files in a given directory and store them in a doves file (extension .doves)
base (-b): extract the MD5 digests from a doves file (i.e. all the file descriptors for a directory) into an md5 file
process (-p): compare a given doves file (i.e. all files in a directory) with the files in the base, i.e. an md5 file; the files not in the base are listed in a keep file.
keep (-k): copy all the files in the keep file into a keep directory and update the base md5 file with the MD5 digests of the files in the keep file (so that these files are not selected again when the next directory is processed).
One starts by collecting the information for all files currently in the keep folder and producing a base md5 file. The sequence of steps collect, process against the base, and keep extracts from a directory all files which are not already in the keep folder and puts them there; it is repeated for each folder of interest. The core of the process/keep steps is sketched below.
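The comparison itself can be sketched as a set operation on digests. The types and names below are illustrative only (digests are represented simply as hex strings), not the actual doves code:

  import qualified Data.Set as Set

  -- A file description as collected into a .doves file: the filepath plus
  -- the MD5 digest of the content (represented here as its hex string).
  data FileDesc = FileDesc { fdPath :: FilePath, fdDigest :: String }

  -- process/keep: retain only files whose digest is not yet in the base,
  -- and extend the base with the digests of the retained files.
  process :: Set.Set String -> [FileDesc] -> ([FileDesc], Set.Set String)
  process base = foldl step ([], base)
    where
      step (keep, seen) f
        | fdDigest f `Set.member` seen = (keep, seen)  -- duplicate, skip
        | otherwise = (f : keep, Set.insert (fdDigest f) seen)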

4 Programming

Most of the tools necessary, in particular functions to compute MD5 digests or SHA1 checksums, are available in Haskell from www.haskell.org/hackage.
The difficulty with programming (in Haskell) was my lack of experience with programming for “large” datasets. The “naive” approaches, which rely on the “natural” laziness of Haskell, work only for tasks of reasonable size; collecting MD5 digests for the million-plus files in 500 GB of backup data tends to exhaust resources. A lazy approach opens files as quickly as possible (breadth first) and crashes when the maximum number of open files is reached (on Linux this is typically limited to about 1024 open files per process). The Pipes package (http://hackage.haskell.org/package/pipes) gives a consistent, generic way to sequence actions so that resources are not exhausted, and it leads to a program constructed as a sequence of smaller actions composed as a pipe. A sequential (sweeping) approach means dealing with each file individually and excludes approaches which try to process all data at once.
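A minimal sketch of this sequential style with Pipes: a producer walks the directory tree and yields one file path at a time, and the consumer finishes with each file before the next one is touched, so at most one file is open at any moment. The walker is my own simplification (symbolic links and special files are ignored here; see below) and the backup path is a placeholder:

  import Control.Monad (forM_)
  import Control.Monad.IO.Class (liftIO)
  import Pipes
  import System.Directory (doesDirectoryExist, getDirectoryContents)
  import System.FilePath ((</>))

  -- Yield every file below 'dir', one path at a time, depth first.
  walk :: FilePath -> Producer FilePath IO ()
  walk dir = do
    entries <- liftIO (getDirectoryContents dir)
    forM_ [dir </> e | e <- entries, e `notElem` [".", ".."]] $ \path -> do
      isDir <- liftIO (doesDirectoryExist path)
      if isDir then walk path else yield path

  -- The body of 'for' runs to completion for each yielded path before the
  -- producer continues, so files are processed strictly one after another.
  main :: IO ()
  main = runEffect $ for (walk "/path/to/backup") $ \path ->
    liftIO (putStrLn path)  -- in doves: compute the digest here instead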
I was also not prepared to deal with the many different ways a file can be “special”, i.e. not readable to produce an MD5 value. It was necessary to test for such exceptions at the first point possible and exclude “special” files from further processing.
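A hedged sketch of that first test point: check that the path is a regular file and wrap the read in an exception handler, so that anything special or unreadable simply becomes non-valuable instead of crashing the run. The helper name and structure are mine, not from doves:

  import Control.Exception (IOException, try)
  import qualified Data.ByteString as BS
  import qualified Data.ByteString.Lazy as BL
  import Data.Digest.Pure.MD5 (MD5Digest, md5)
  import System.Posix.Files (getSymbolicLinkStatus, isRegularFile)

  -- Nothing marks a "special" or unreadable file, which is then excluded
  -- from all further processing.
  safeDigest :: FilePath -> IO (Maybe MD5Digest)
  safeDigest path = do
    status <- getSymbolicLinkStatus path
    if not (isRegularFile status)  -- pipes, sockets, devices, symbolic links
      then return Nothing
      else do
        result <- try (BS.readFile path) :: IO (Either IOException BS.ByteString)
        return $ case result of
          Left _   -> Nothing      -- unreadable or corrupted
          Right bs -> Just (md5 (BL.fromStrict bs))

Note that this sketch reads each file completely into memory; for multi-gigabyte files a chunked (streaming) digest computation would be preferable.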

5 Experience

Handling directories with hundreds of gigabytes is time consuming, especially if they are stored on external hard disks connected by USB 2.0. Just copying data between two disks takes about 4 minutes per GB; reading and computing the MD5 digest is a bit faster. Deleting 100 GB from a disk can take 10 minutes (and over an hour for 1 TB).
The reduction is substantial. Overall, from the total 10 TB of backup data only 1 TB is left. This is still more than what I expect to keep long term, which is perhaps 200 GB: my music collection is about 50 GB, photos fill less than 100 GB, and the rest is much less.

6 Open Questions

Pictures and music make up a very large part of the files we collect today; keeping only one copy of each requires, in my case, about 100 GB of disk space. Unfortunately, it is hard to get rid of all copies, as different music and picture managing programs stick additional bits of information into the file: tags for pictures, genre information for music, and probably other things. The result is a file that is slightly different and has a different fingerprint; it is not recognized as a copy, and the near-copies are therefore all stored separately. This is left for another effort in reduction with a different approach; in this pass I did not consider different file types, except to exclude “non-valuable” files.

Wednesday, April 4, 2012

Spatial planning ability as precursor for human language


Since the NCGIA (1989) research initiative 2 “Languages of Spatial Relations” (Mark et al., 1989) and the meeting in Las Navas (Mark and Frank, 1991) we have argued that analyzing the ways humans think about space is fundamental for other – more abstract – kinds of reasoning. I found a most surprising connection recently in an article by Steedman, in which he argues: “Thus it [λ-calculus] is a theory that makes language look as if it has been built on a pre-existing system for planning action in the world, and thereby seem less unique as a cognitive faculty than is usually assumed.” (Steedman, 2002, p. 4) and “the language faculty in its syntactic aspect is directly hung onto a more primitive set of prelinguistic operations including these combinators, originally developed for motor planning” (p. 5). Some background to the argument:

Planning an optimal path in space is a well-established, basic human ability which requires the construction of network knowledge of the environment, combined from multiple trips. Planning an optimal sequence of actions can be computed with the same methods, representing the actions in a state-transition diagram, which is structurally equivalent to a street network. I have recently discovered that planning actions and executing actions can be seen in a category and in the corresponding co-category (Asperti and Longo, 1991). In robotics, not only the planning but also the recognition of the plans of others is very important. Geib and Steedman (2007) show the structural similarity between the processes used for plan recognition and natural language processing: producing an explanation for a plan is the same operation as parsing a sentence. This then connects to the initial quote above, which establishes a direct link between the planning of an optimal spatial path and human language, based on a categorical (λ-calculus) argument that these abilities all use the same fundamental process. Developmental arguments indicate that the spatial ability is primal and the others are “hung onto” it.
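To make the structural equivalence concrete, here is a small illustrative sketch of my own (not from the cited papers), in Haskell: one breadth-first search works unchanged whether the nodes are locations in a street network or states in a state-transition diagram, because both are simply graphs given by a successor function.

  import qualified Data.Set as Set

  -- Breadth-first search over any graph given by a successor function.
  -- With nodes = street intersections it plans a path through space;
  -- with nodes = states and successors = applicable actions it plans a
  -- sequence of actions.
  bfs :: Ord a => (a -> [a]) -> a -> a -> Maybe [a]
  bfs next start goal = go Set.empty [[start]]
    where
      go _ [] = Nothing
      go seen (path@(node : _) : queue)
        | node == goal           = Just (reverse path)
        | node `Set.member` seen = go seen queue
        | otherwise              = go (Set.insert node seen)
                                      (queue ++ [n : path | n <- next node])
      go seen ([] : queue)       = go seen queue  -- cannot occur; keeps patterns total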

The insight I gain from the connection between path planning, action planning and human language production is, first, the correspondences:
  • path/action planning — sentence production,
  • plan recognition — sentence parsing,
  • location/state — concept expressed in sentence,
  • target location/state — concept to communicate,
  • starting location/state — current context of conversation.
The correspondences point out that language production is always based on the “current context” and that whether a produced sentence is felicitous (or not) depends on the context (as can be seen in linguistic discussions, for example Moens and Steedman, 1987). The correspondences further indicate that “locations” in space correspond to concepts (typically complex concepts expressing situations) and that we need a representation of concepts and context; here I am looking at the representation by Aerts and Gabora (2005a, 2005b) and comparing it with a lattice scheme based on distinctions (Frank, 2006).

The second important insight, found in this and other papers by Steedman but also extensively argued by Carpenter (1997), is the advantage of using λ-calculus over first-order predicate logic: “unlike first-order logic, we can in addition provide a term corresponding to the meaning of the verb phrase” (p. 39).
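A standard textbook illustration of this point (my example, not Carpenter's): in λ-calculus the verb phrase of “John walks” receives its own term,

  walks       ↦  λx. walk(x)
  John walks  ↦  (λx. walk(x))(john)  =  walk(john)

whereas first-order logic provides only the closed formula walk(john); there is no term denoting the verb phrase alone.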

References

Aerts, Diederik and Gabora, Liane, “A Theory of Concepts and Their Combinations II: A Hilbert Space Representation”, Kybernetes 34 (2005), pp. 192–221.

Aerts, Diederik and Gabora, Liane, “A theory of concepts and their combinations I: The structure of the sets of contexts and properties”, Kybernetes 34 (2005), pp. 167–191.

Asperti, Andrea and Longo, Giuseppe, Categories, Types and Structures – An Introduction to Category Theory for the Working Computer Scientist (The MIT Press, 1991).

Carpenter, Bob, Type-Logical Semantics (MIT, 1997).

Frank, Andrew U., “Distinctions Produce a Taxonomic Lattice: Are These the Units of Mentalese?”, in Bennett, Brandon and Fellbaum, Christiane, ed., Formal Ontology in Information Systems (Amsterdam: IOS Press, 2006), pp. 27–38.

Geib, C. W. and Steedman, M., “On natural language processing and plan recognition”, in Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI, 2007), pp. 1612–1617.

Mark, David M. and Frank, Andrew U., ed., Cognitive and Linguistic Aspects of Geographic Space vol. 63, (Kluwer Academic Publishers, 1991).

Mark, David M. and Frank, Andrew U. and Egenhofer, Max J. and Freundschuh, Scott M. and McGranaghan, Matthew and White, R. Michael, “Languages of Spatial Relations: Initiative Two Specialist Meeting Report”, National Center for Geographic Information and Analysis (1989).

Moens, M. and Steedman, M., “Temporal ontology in natural language”, in Proceedings of the 25th Annual Meeting of the Association for Computational Linguistics (1987), pp. 1–7.

NCGIA, “The U.S. National Center for Geographic Information and Analysis: An Overview of the Agenda for Research and Education”, IJGIS 2, 3 (1989), pp. 117–136.

Steedman, M., “Formalizing affordance”, in Proceedings of the 24th Annual Meeting of the Cognitive Science Society (2002), pp. 834–839.

Sunday, January 30, 2011

Optimisms in Software Development

Software development is notoriously behind schedule and above budget. Why are we so unable to estimate the time it takes to produce an application?

The progress of hardware is regularly underestimated and the progress of software overestimated; reality keeps proving both estimates wrong. For example, compare the current multi-core CPU in your laptop with the CPUs built from discrete components 40 years ago, and then contrast this stunning advancement with the similarity between the FORTRAN and ALGOL used on those CPUs and the Java used today!

“It is only a small matter of programming” was originally one of the 'games people play' (see the book by Eric Berne) and is now the title of a book by Nardi – the game is played out as IT projects in most organizations of the western world, and we all suffer from the late, inadequate and error-prone software we are forced to use in our work environment.

I recently had a student who proposed to write software allowing end-users to create and adapt graphical user interfaces. He started with a good idea, observing the connection between database schema and GUI, and had quickly whipped together rudimentary graphical editors for GUI and database schema. This must have been emotionally rewarding, i.e. fun, and it seemed a minor step to complete the rest.

Having often started simple-looking projects myself, only to observe later the hidden complexity that made it hard to produce, in the end, something coping with the demands of a real-world application, I have started to identify some common sources of this optimism. We do not address the hard questions first and start to program too early. The hard questions can be found by:
- Identify a compact but not trivial use case and describe it from the user's perspective in detail.
- Select a formal (or pseudo-formal) language to describe the representation for the information used in the program.
- Describe the functions (transformations) used to achieve the goal in a (pseudo-)formal language.

In my effort to exploit the data structure information to produce a GUI, I eventually failed when I started considering exceptions: special situations in the underlying application operations and problems with user input have to be dealt with in parallel with the ordinary application logic. I found that the current exception handling violated the principle of locality in my code and confused me to the point that I gave up; I will return to the question when I have a good idea of what I could do differently.

Some advice at the end:
Use existing tools and do not duplicate them unless you have a compelling reason; try first to patch together a prototype using existing tools.

Thursday, December 23, 2010

Finding potential applications for technology, e.g. the Smartphone

In his thesis “Potential use of GPS on the Smartphone”, Jürgen Brandstetter investigated whether there are methods to systematically find new application ideas. I suggested that Maslow's hierarchy of needs could help to identify human needs for which new technology could lead to more satisfaction. Maslow's pyramid orders needs and suggests that more basic needs must be satisfied before higher-level needs are addressed.

The lowest level comprises physiological needs, like drinking, eating, sleeping, clothing and shelter. The second level consists of safety needs - avoiding and being protected from dangers. Then come “love and belonging”, esteem and finally self-actualization.

The student's thesis started mostly as a review of iPhone applications for finding hotels, restaurants and friends, and for using trackers. To a degree, smartphone applications are “solutions in search of a problem” - often of a very particular problem, suggesting a niche for one of the thousands of specialized applications. They seem to follow a rule that only very specialized applications can be made simple enough to be usable without lengthy instructions: a restaurant finder for a city (and each city its own), a finder for tennis courts, etc. In the long run this is an inefficient method, requiring much learning from users.

Thinking through the student's thesis, I realized that Maslow could indeed serve as guidance:

The first-level needs are physical processes; the smartphone can help to find places offering the pertinent affordance (e.g. a place where one can eat or sleep). The commonalities of all such applications are: which affordance? how urgent? how close? then criteria to identify the optimal solution, and finally guidance to move to the selected place. The commonality is confirmed technically: the thousands of different “find best x” applications mostly use the Google Maps API for showing locations and providing wayfinding instructions. Maslow suggests that a general “find an affordance to satisfy a primary need” application is possible. This would avoid the problem of, say, a hotel finder which claims nothing is found, even though a local youth hostel could satisfy the need for sleep and shelter.

Applications to inform us about dangers - both physical dangers, from inclement weather to landslides and tsunamis, and social dangers in areas with high criminality - and to help us avoid them or protect ourselves seem equally possible.

Smartphones and social networks are currently experimenting with methods to satisfy “love and esteem” needs: finding friends nearby, finding locations where social contacts are facilitated, but also telling others “what is here” to gain esteem in a community.

Numerous computerized helpers exist to increase our self-actualization; smartphones with GPS can, for example, track our movements and document to ourselves the running or bicycling we have done, localize photographs taken, etc.

In summary, Maslow's hierarchy of needs seems to provide a useful generalization and classification of human needs. It could be used to build more general applications than what is currently provided. The levels of needs indicate generalities between needs and could be used to reduce the myriad of particular applications to a few more general ones. The approach follows the “need to the solution” path rather than the often unsuccessful reverse.

Three Ages of Geographic Information Users:


Substantial changes are occurring in how GIS are used. Roughly three ages can be differentiated:
past: public administration,
current: commercial, and
future: personal.

Past: In the 1970s, Geographic Information Systems were proposed to use the then-new electronic data processing machines to reduce the duplication of collection and maintenance costs of spatial information in public administration. Multi-purpose cadastres were designed to integrate available data based on location, to improve urban planning and the maintenance of urban infrastructure, and to avoid accidents. The Harvard Graphics Lab at that time included researchers like Jack Dangermond and Nick Chrisman, to name but two who are still influencing GIS today.

Present: In 1990 the U.S. Bureau of the Census put street centerlines online, and topography became available from the USGS. These datasets allowed commercial users to geo-code their client data with street addresses and to use spatial analysis tools. GPS receivers and mobile communication devices helped logistics businesses manage their fleets and improve the dispatch of vehicles. Mapquest, Google, and car navigation systems now hold improved street network and traffic data to sell navigation guidance in different forms, often paid for indirectly by advertisement. An increasing number of ordinary people use location-related information today on a day-to-day basis.

Future: People manage their personal information increasingly in electronic form on multiple devices: calendars in the Internet “cloud”, address lists and phone books on PCs or mobile phones. Personal collections of digital photographs grow quickly. My new digital camera includes a GPS receiver and all photographs are geo-coded, but I cannot ask “How did this place look when I last visited?” My smartphone allows me to track my movements during the day with GPS. The different tools producing and managing my personal data are mostly location-blind and non-cooperative. Location could in the future serve as an organizing principle. I expect the next generation of personal information management systems (PIM) to be spatially aware PIMs (sPIM), which amounts to a GIS for personal use (pGIS).

Two concluding observations:

- The market for systems for public administration is mostly saturated; the commercial market, which is much larger, is still growing. Imagine the market for pGIS - virtually everybody is a potential user!

- The time it takes from research to widespread application is much longer than expected: it took 10 to 20 years for GIS in public administration and commercial use. The research done today to make pGIS a reality will be the foundation of the killer application of the 2020s!

Sunday, September 12, 2010

Ecological Social Science


Most of us understand that human actions are a major influence on ecology, and discussions of climate control, air pollution and other ecological topics of global and local importance should not be conducted without considering the impact of humans on nature. Moran's book of the same title gives a very complete overview of this new subject: the combination of ecology and the social sciences. The book reviews the differences in viewpoints and methods; it mentions many of the studies carried out and the models they use.

I missed a substantive insight into how natural and social systems interact, and I was surprised that political influence is dealt with in only about 4 pages on political economy and political ecology. The book forced me to think about my personal understanding of 'social ecology':

Taking global warming as the most pressing problem, I take from Ostrom the insight that trust among participants is necessary. It is hard to see how trust can be built if the major players (U.S.A. and China) do not accept rules to play by and one of them (U.S.A.) reserves for itself the right to a much larger appropriation than its proportional share. It is hard to see how to establish rules acceptable to all from this starting point.

It seems that we live in a world with one superpower. The developed countries, under U.S. leadership, try to maintain their high level of resource appropriation. I see cooperation with the elites in the less developed countries, elites who maintain an equally high level of resource utilization. I imagine an economic model with only 3 players: (1) the population in the developed countries, (2) the elites and (3) the poor populations in the less developed countries. If each player serves his own interest, development will not happen and resource utilization will remain skewed. The majority of the world population is in the third group (the poor population in less developed countries), making for a dangerous, non-equilibrium situation.

Thursday, September 9, 2010

Avoiding the "Tragedy of the Commons" (Elinor Ostrom: Governing the Commons)

The book by the 2009 Nobel Prize laureate Elinor Ostrom is one of the most positive books I have read recently; it gives rise to optimism: first, for the management of common resources - which include the climate and natural habitats - and second, for a realistic and empirically grounded social science producing applicable theories.

A commons was the communal pasture in a village, where every farmer could herd as much cattle as he wanted. More cows on the commons means more milk for the farmer, thus every farmer, in his own best interest, increased the number of cows, even though the aggregate action of all farmers would destroy the pasture and in the end all would lose. The effect of 'best self-interest action' by the individuals leads seemingly inevitably to the loss of the resource for all. This was described by Garrett Hardin as "The Tragedy of the Commons" (Science, Vol. 162, No. 3859, December 13, 1968, pp. 1243-1248) and it is empirically observable in many instances - the fishing industry and groundwater pumping are current examples in the public eye.

Fortunately, not all Common Pool Resources (CPR) follow this route to destruction: alpine meadows on several continents have been managed communally for centuries, and irrigation systems in many traditional communities have functioned since 'time immemorial'. Elinor Ostrom has carefully collected empirical studies of CPR management that have worked and distilled a set of principles which help to avoid the tragedy:
1. clearly defined boundary of the resource and the community it serves,
2. congruence between the rules of using the resource (appropriation rules) and the contributions (provision rules),
3. institutions to allow collective decision making,
4. monitoring the resource and its use,
5. sanctions for violation are graduated (going from very small to substantial for repeated offenders),
6. conflict-resolution mechanism,
7. the right of the community to organize is recognized by higher level authorities.

If the monitoring methods are simple and infractions are easily observable by others, trust among the participants emerges. Ostrom analyzes cases where the 'tragedy was played out' and identifies as causes: (1) a lack of political will to enforce rules, (2) complex rules based on facts that are not easily observed, requiring a costly staff, which is then likely not functioning, and (3) general interference of higher levels of government with the local people who depend on the resource for their livelihood.

The book is an example of a new kind of social science, avoiding hypothetical constructs which cannot exist in reality (e.g. perfect markets) and instead starting with careful empirical studies, collecting qualitative and not only quantitative data, to analyze institutions and to deduce the conditions under which the tragedy can be avoided.

For GIScience, the book is relevant as it points out that (1) studies of the mechanisms of a CPR based on natural science alone are not sufficient; the interaction with the community exploiting the resource must be included, and (2) the cost of the management rules must be considered: information collection, processing and later enforcement of rules can be expensive and can ruin in practice a theoretically nice solution.