Dan Olsen’s “Evaluating User Interface Systems Research” review
“Evaluating User Interface Systems Research” (Olsen, 2007) is a paper I consider fundamental to the critique of HCI evaluation, despite its age. It highlights the caveats of evaluating interactive systems and the claims that can be made for complex systems and interactive innovations. These observations still hold and remain applicable as we progress through new interactive media technologies and computing paradigms.

The fundamental question asked in this paper is “How to evaluate new user interface systems research and ensure true progress is made?” This question addresses what Olsen describes as a “decline in new systems ideas”, which was particularly apparent in the period when the paper was written (i.e., after the maturation of the desktop paradigm and just before the mobile paradigm boom). He puts forward the following three reasons:
- in the early days, there was fragmentation of tools with many competing toolkits, but with the stabilisation of the main OSes and consolidation of frameworks, the user interface architecture is now largely dictated by the dominant platforms
- this level of consolidation can lead to a lack of researchers’ skills in architecture and design of toolkits or windowing systems
- lack of appropriate criteria for evaluating systems architectures
Olsen attempts to demonstrate the value of user interface systems research, particularly now that, given the continuous evolution and forces of change, some earlier assumptions no longer hold:
- the memory and CPU constraints of early systems (relaxed by fast-paced innovation in hardware)
- the general level of users’ expertise (nowadays, users are much more comfortable with different technological forms and with fast-paced innovation)
- new broader forms of interaction that surpass the WIMP model
Olsen notes how UI systems architecture and toolkits research can offer different ways to provide value to new ideas around interactive solutions and general interactive innovations. For instance:
- “Reducing development viscosity” — UI toolkits reduce time to create a solution, by supporting fast prototyping, which leads to more alternative solutions, and therefore a more effective design process.
- “Least resistance to good solutions” — for Apple, providing a free standard UI widget set was more effective at facilitating and accelerating the adoption of a common look and feel than a style manual: “…toolkits can encapsulate and simplify expertise. When exploration of a space of possibilities finally settles on a few good solutions, these can be packaged into a toolkit to simplify development of future systems.” (a short sketch after this list illustrates the idea)
- “Lower skills barriers”—with modern toolkits, more people with different skillsets (e.g., designers, rather than software engineers only) are able to participate in the creation of new solutions.
- “Power in common infrastructure” — common infrastructure can be leveraged across applications (e.g., mouse input event system, HTTP requests)
- “Enabling scale”— aggregating all of the above, toolkits lay the foundations for achieving more solutions with more powerful results.
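To make this concrete, here is a minimal TypeScript sketch (my own illustration, not from Olsen’s paper) of how a hypothetical createToolkitButton helper could encapsulate a shared look and feel and the input plumbing, so application code only states intent; the names and API are assumptions for illustration.

```typescript
// Hypothetical sketch: a toolkit-provided button that encapsulates the
// look-and-feel and input handling an application author would otherwise
// reimplement by hand (styling, mapping raw events to an abstract action).
interface ButtonOptions {
  label: string;
  onActivate: () => void; // fired on click (and Enter/Space for native buttons)
}

function createToolkitButton(opts: ButtonOptions): HTMLButtonElement {
  const button = document.createElement("button");
  button.textContent = opts.label;
  button.className = "toolkit-button"; // shared stylesheet = common look and feel
  // The toolkit, not the application, decides how raw input maps to the
  // abstract "activate" action, so every application behaves consistently.
  button.addEventListener("click", () => opts.onActivate());
  return button;
}

// Application code only states intent; the infrastructure is shared.
document.body.appendChild(
  createToolkitButton({ label: "Save", onActivate: () => console.log("saved") })
);
```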
On the other hand, Olsen claims that misapplied evaluation methods and evaluation errors can be very harmful for interactive system architectures and techniques that are potentially useful but still under development. He shows how evaluation can undermine and limit change and progress, restricting attention to artefacts that lend themselves to easy and robust evaluation.
a) Usability trap—Olsen questions the benefits of usability measurement in certain cases, despite its past importance and the prominence the research community has placed on it as a validation instrument for research and publications. He claims that toolkit and UI architecture work rarely meets the assumptions of usability testing. Usability metrics, such as task completion time, time to proficiency and error minimisation, have proven useful for their easily comparable and well-supported results, but can be built upon erroneous assumptions. For instance:
- the ‘walk up and use’ assumption (that users need only minimal training or widely shared expertise to use a given system) fails in domains where specialised knowledge is required.
- the ‘standardised task’ assumption, which fails to consider the unsuitability of tasks with a high degree of variability or complexity for usability testing (e.g., data entry vs drawing, the latter being very difficult to assess objectively given the variance of possible techniques).
- the scale of the problem and the amount of time needed to run a ‘complete’ usability test, especially for a toolkit, can incur high costs while yielding very low statistical significance.
b) Fatal flaw fallacy—Olsen critiques how the search for fatal flaws in new interactive techniques, and the thorough examination of their validity, constricts development and innovation with new systems. This is presented as one of the main bottlenecks for research in this area, especially since work in progress almost certainly has flaws, gaps or compromises that could lead to an unsuccessful evaluation.
c) Legacy code—the price of providing support for legacy code or of rewriting certain applications is sometimes so high that it becomes detrimental to the emergence of new UI architectures and interfaces
All of this leads to the question: if not with usability measurement, how should we evaluate the effectiveness of interactive systems and tools? Olsen proposes a set of the most important claims that can be made about the value provided by toolkits and interactive innovations, which I summarise as follows:
1. Situation, tasks and users (STU) — for Olsen, these entities comprise an ontology and the context that needs to be set for assessing the quality of interactive systems innovation. How clearly is the interactive innovation set in an STU context? Olsen illustrates this with the case of toolkits, in which the users are developers whose task is to produce applications and interfaces for another STU context, i.e., for specific end users to fulfil end tasks in end situations.
2. Importance—Olsen considers the importance that any system, toolkit or technique demonstrates to be its most valuable attribute. Importance seems a rather ill-defined and subjective attribute in this context, but Olsen points to analysing importance within the STU context of the interactive innovation. For instance, for users, what is the importance of a certain population in terms of size, performance and societal role, and what is its degree of need? For tasks, how important are the target tasks to the target population, what is their level of difficulty, and what are the consequences of their not being fulfilled? For situations, how frequently do the situations in which these tasks need to be performed occur for these populations? Claims about the importance of any interactive system innovation need support from the importance of its STU components.
3. Problem not previously solved—Olsen argues that usefulness is more relevant than usability, particularly when the solution is effective at solving a problem for a given STU. What problems does the interactive solution solve that were not solved before?
4. Generality—in line with the previous point, there is a strong claim for a general tool if a solution can solve a different set of tasks for different populations. “If one has used the tool to solve three diverse problems then one can argue that the tool solves most of the problems lying in the space between the demonstrated solutions.” However, Olsen calls attention to the difficulty of proving this claim, given the need to demonstrate the range of solutions for which the tool is useful.
5. Reduce solution viscosity—viscosity is detrimental to design processes, as they require fast iteration over many possible solutions. Claiming that a tool reduces viscosity, or resistance to change, is an important claim. Olsen explains how a tool can reduce viscosity in three ways:
- Flexibility—the tool enables users to rapidly iterate on and evaluate design changes. Claiming that a tool is flexible is easy to support by defining a set of possible design changes and showing how those changes take less effort than with a comparable tool (e.g., graphic vs code-based layout tools);
- Expressive leverage—the tool reduces the number of choices and repetitions needed to express a specific solution. This is enabled by mechanisms such as abstraction and modularisation, a generalise-and-reuse strategy. For instance, a module that encapsulates the expression of a certain choice required across a large set of STU contexts makes a tool more generalisable and more expressive (e.g., the tabular data widget generalises to many different contexts; see the sketch after this list). As technology improves, with the introduction of more automation, abstraction and choice reduction become more powerful. To support a claim of expressive leverage, the reduction of choices must be demonstrated, and the implications of this reduction must be made clear to users.
- Expressive match— the tool effectively reduces the distance between the design choices it enables and the requirements of the problem being solved. In other words, the tool enables the alignment of the design space with the problem space. For instance, visual tools can provide a better match than textual programming tools for certain tasks (e.g., selecting a colour with a colour picker widget vs typing hexadecimal codes, or dragging widgets on a canvas to create a layout vs specifying coordinates);
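As a concrete, hypothetical illustration of expressive leverage, the TypeScript sketch below shows a generic renderTable widget of the kind the tabular-data example alludes to: the table logic is expressed once and reused across many STU contexts, so each application only declares its columns and data. The function names and types are my own assumptions, not from the paper.

```typescript
// Hypothetical sketch of expressive leverage: a generic table widget that
// encapsulates header rendering and row iteration once, so each application
// states *what* to show, not *how* to draw a table.
interface Column<Row> {
  header: string;
  cell: (row: Row) => string;
}

function renderTable<Row>(rows: Row[], columns: Column<Row>[]): HTMLTableElement {
  const table = document.createElement("table");
  const headerRow = table.createTHead().insertRow();
  for (const col of columns) {
    const th = document.createElement("th");
    th.textContent = col.header;
    headerRow.appendChild(th);
  }
  const body = table.createTBody();
  for (const row of rows) {
    const tr = body.insertRow();
    for (const col of columns) {
      tr.insertCell().textContent = col.cell(row);
    }
  }
  return table;
}

// The same widget generalises across contexts: contacts, invoices, logs...
type Contact = { name: string; email: string };
const contacts: Contact[] = [{ name: "Ada", email: "ada@example.org" }];
document.body.appendChild(
  renderTable(contacts, [
    { header: "Name", cell: (c) => c.name },
    { header: "Email", cell: (c) => c.email },
  ])
);
```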
6. Empowering design participants—the tool enables the involvement and empowerment of “new design participants”, some population that benefits from being closely involved in the design process (e.g., empowering artists to improve perceptual aspects of the solution, or involving workers and end users through participatory design to align the solution more closely with real-world needs). This claim requires a description of the population, of the reason for its involvement in the design process, and of why existing solutions are not acceptable (e.g., lack of training, different language, dissonance between design goals and the affordances provided). Furthermore, it requires demonstrating that the new tools are more accessible, easier and more effective for the given population.
7. Power in combination— the tool enables the combination of different building blocks to provide an effective solution. There are different claims that can be made regarding the combination power of tools:
- Inductive combination—Olsen suggests that, firstly, it has to be shown that the tool provides a small set of primitives that work as basic building blocks and that can be extended with new primitives. Secondly, there has to be a mechanism by which these primitives are combined into more complex constructs (see the sketch after this list). Finally, the coverage of the tool needs to be assessed by demonstrating the set of interesting design solutions within the system (the design space), but also its limitations, i.e., designs that are unsupported and how they can be tackled outside the system.
- Simplifying interconnection—to support this claim, it must be shown that the choices in the domains connected by the tool are diverse, interesting and non-trivial, and that the tool provides infrastructure that reduces the cost of connecting the different components.
- Ease of combination—the tool should provide abstraction over the complexities of combination and interconnection, so that combining components becomes simple and straightforward.
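The following hypothetical TypeScript sketch illustrates inductive combination: a handful of primitives (label, row, column) plus one combination mechanism (nesting) cover a large space of layouts. The primitives and their names are assumptions made for illustration, not an API from Olsen’s paper.

```typescript
// Hypothetical sketch of inductive combination: a few primitives plus one
// combination mechanism (nesting) generate many screens; designs outside
// this space fall back to raw DOM code, marking the tool's limitations.
type Widget = HTMLElement;

const label = (text: string): Widget => {
  const span = document.createElement("span");
  span.textContent = text;
  return span;
};

// One combination mechanism: a flex container that nests child widgets.
const stack = (direction: "row" | "column") => (...children: Widget[]): Widget => {
  const div = document.createElement("div");
  div.style.display = "flex";
  div.style.flexDirection = direction;
  children.forEach((child) => div.appendChild(child));
  return div;
};

const row = stack("row");
const column = stack("column");

// Complex constructs arise purely from combining the primitives.
document.body.appendChild(
  column(
    label("Search results"),
    row(label("1."), label("Evaluating User Interface Systems Research")),
  ),
);
```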
8. Scaling up—“Any new UI system must either show that it can scale up to the size of realistic problems or that such scaling is irrelevant because there is an important class of smaller problems that the new system addresses. To evaluate this criteria one must try the system on a reasonably large problem and show that the advantages of the new model still hold”
To briefly sum up, this paper provides an overview of different aspects of evaluation, one of the most important stages of HCI/UCD research. It covers the value that UI systems innovation and toolkits research can generate, how to make claims about that value, and the different caveats that can arise in evaluation. Although it is grounded in a specific technological context and scope, its principles appear to be generalisable, especially when compared with other generic toolkits research.
Reference:
Olsen Jr., D. R. (2007). Evaluating user interface systems research. Proceedings of the 20th Annual ACM Symposium on User Interface Software and Technology (UIST ’07), 251–258. http://doi.org/10.1145/1294211.1294256