Aquila project

Latest published version:
http://mozilla.hoa-project.net/Aquila/literature/html/
Previous versions:
-
Editor:
Ivan Enderlin, Mozilla's intern, developper of Hoa Framework

Abstract

The Aquila project aims to be a new experimental perform find system for Firefox.

This is the ideas manual, i.e. all ideas, sketches, early drafts etc., are written here.

Table of contents

  1. mark element: prestissimo
    1. Needs and usages
    2. Gecko implementation status
  2. Core: allegro
    1. AquilaUtil, core utility
      1. Get the current window
      2. Get the current document
    2. AquilaInterface, interface manager
    3. AquilaFinder, finder
      1. Find a word
      2. Move in matches
    4. Prelude and bootstrap
      1. XPCOM constants and log
      2. Bootstrap
  3. User experience
    1. Matching search
      1. Pattern matching algorithms
      2. Phonetic algorithms
    2. Relevant search
      1. Weight and distance
      2. Group and section
  4. User interface


Introduction: preludio

Presentation

This section is non-normative.

The Aquila project aims to be a new experimental perform find system for Firefox, distributed as a Firefox extension (downloadable as a .xpi file).

Background

This section is non-normative.

The current find system in Firefox dates from the time of Netscape. No new feature was added since this time. Nevertheless, user experience and user interface are increasing all the days: Web documents are longer, denser and contain more informations than ever.

No research about a new way of finding informations on a Web document has been recently exposed. At present, the majority of browsers uses a simple findbar (accessibles via Accel + F) but it could be the wrong way. For example, if a Web document is heigher than the window, user is not aware if some results was found somewhere else in the document.

The current system is incremental (it means: previous, next) and not very visual. The Aquila project attempts to rectify this, while proposing many experimentals features.

Audience

This section is non-normative.

This document is intended for all people that are interesting by the project, have ideas, fancies etc.

Yet, this document is probably not suited to readers who do not already have at least a passing familiarity with Web technologies, users experiences or users interfaces.

Scope

This section is non-normative.

This document provides all ideas, sketches, early drafts etc., comes with a semantic tour using new HTML 5 tags.

mark element: prestissimo

Needs and usages

HTML 5 introduces a new HTML semantic element: mark.

The HTML 5 specification describes it like:

The mark element represents a run of text in one document marked of highlighted for reference purposes, due to its relevance in another context. When used in a quotation or other block of text referred to from the prose, it indicates a highlight that was not originally present but which has been added to bring the reader's attention to a part of the text that might not have been considered important by the original author when the block was originally written, but which is now under previously unexpected scrutiny. When used in the main prose of a document, it indicates a part of the document that has been highlighted due to its likely relevance to the user's current activity.

This element meets the needs of a search. Indeed, the scope of the mark element is to mark found matches.

For example, 42 words of a lipsum:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Proin egestas sodales nulla id imperdiet. Aenean facilisis varius nulla, in feugiat arcu consequat vel. Suspendisse potenti. Mauris vitae nisi orci. Praesent tincidunt consectetur condimentum. In laoreet sodales nibh id facilisis. Donec porta faucibus.

We search the in text, then:

Lorem ipsum dolor sit amet, consectetur adipisc<mark>in</mark>g elit. Pro<mark>in</mark> egestas sodales nulla id imperdiet. Aenean facilisis varius nulla, <mark>in</mark> feugiat arcu consequat vel. Suspendisse potenti. Mauris vitae nisi orci. Praesent t<mark>in</mark>cidunt consectetur condimentum. In laoreet sodales nibh id facilisis. Donec porta faucibus.

Gecko implementation status

A bug is currently open on Bugzilla: Bug 485377, Implement HTML5's mark tag.

An early draft was proposed but it is not as relevant as we hoped it would be. It calls into question the general semantic tag declaration in Gecko; typically all elements that are span like. We are precisely focused on this point.

We are working on a new patch.

Core: allegro

We try to describe the Aquila core.

AquilaUtil, core utility

The prototype AquilaUtil allows user to get some usefull datas, like object, values, context etc.

Get the current browser

A first getter is browser that returns the current browser — through XPCOM —. Thus:

var aquilaUtil = new AquilaUtil();
var browser    = aquilaUtil.browser;

Get the current window

A first function is getCurrentWindow that returns the current selected window. It takes care about the selected tab. Thus:

var aquilaUtil    = new AquilaUtil();
var currentWindow = aquilaUtil.getCurrentWindow();

Get the current document

We also have the getCurrentDocument that returns the current selected document. Like its twin, it takes care about the selected tab. Thus:

var aquilaUtil      = new AquilaUtil();
var currentDocument = aquilaUtil.getCurrentDocument();

AquilaInterface, interface manager

AquilaFinder, finder

Find a word

Move in matches

Prelude and bootstrap

XPCOM constants and log

The prelude defines two usefull constants: Cc and Ci, respectively to get the Components.classes and Components.interfaces values.

The log function is added to help you to debug your application, making test etc. It is easily hackable. Example given:

log("= Ready to go!");

Bootstrap

The global variable aquila is declared on the top of the application and contains all aquila's prototypes (previously saw), like AquilaUtil, AquilaInterface and AquilaFinder, through a simple associative array. Thus:

aquila.util.getCurrentWindow(); // will return the current window.
aquila.finder.find();           // will find a word in the current document.
// …

User experience

Cette section décrit la manière dont on doit chercher les éléments pour répondre au besoin de l'utilisateur. On ne se concentre pas sur la manière de présenter visuellement le résultat mais plutôt sur la manière dont on doit trouver le résultat.

On va tout d'abord commencer par présenter les diverses techniques pour trouver des occurences à partir d'une séquence, puis on va se concentrer sur la façon la plus adéquate pour rassembler ces occurences, afin de créer des groupes pour avoir une plus grande pertinence dans le résultat.

Matching search

Le but de cette partie est de trouver textuellement ce qui correspond au mieux à ce que l'utilisateur cherche. On n'a pas encore réellement cette notion de pertinence. On cherche à trouver toutes les occurences qui se rapproches au mieux de ce que souhaite l'utilisateur. Soit on cherche toutes les occurences in extenso, soit on cherche à quelques différences près, soit toutes les occurences qui ont la même prononciation.

Une fois toute les occurences trouvées, cela nous constituera une première base. Toutefois, les occurences seront ordonnées selon l'ordre du document. Cet ordre n'est pas nécessairement le plus pertinent. Cette liste d'occurences va donc être remaniées pour être triées de façon plus pertinente.

Pattern matching algorithms

On peut constater deux manières de chercher des occurences dans un texte. La première — présentée ici — est textuelle. La seconde — présentée dans la section suivante — est phonétique.

Parmis ces techniques textuelles, on peut en considérer trois : une recherche triviale, ou in extenso, une recherche par expressions régulières ou enfin une recherche approchée.

In extenso search

Une première approche triviale serait de chercher une séquence dans un document in extenso, c'est à dire on cherche la séquence complète, sans différence, sans se soucier du sens de la séquence etc.

C'est de cette façon que procède Firefox actuellement.

De cette manière, on récupère tous les éléments correspondants à notre recherche. Tous les éléments seront identiques (à une sensibilité de casse près).

Regular expressions

Les expressions régulières sont un mécanisme puissant exprimant la théorie des langages réguliers.

Il existe plusieurs moteurs, plus ou moins puissants, plus ou moins rapides, plus ou moins expressifs etc. On conseille d'utiliser le moteur PCRE, pour Perl Compatible Regular Expressions. Une implémentation en C existe et est disponible sur le site PCRE.org.

Approximates search

Les recherches approchées peuvent être réalisées à l'aide d'expression régulières mais au détriment d'une lourde syntaxe peu lisible, et d'une lenteur à l'exécution.

Il existe des algorithmes spécialés dans la reconnaissance de séquences approchées, notamment dans la génétique pour la recherche de chaînes ADN.

Un exemple significatif :

On cherche GATAA dans CAGATAAGAGAA avec au plus une différence :

Ce type de recherche permet de trouver des occurences même si l'utilisateur fait des fautes de frappes. Ce n'est pas le seul cas intéressant car il peut trouver des mots avec un sens légèrement différent, ou avec une conjugaison, un accord, une orthographe légèrement différente.




















This section describes the user experience, i.e. finder experience, not the presentation, but only the way of how to find words in the document.

A first trivial approach is to search by purely textual matching, i.e. we search all matches that are likely to approach the searched sequence, either by pattern matching, with or without difference, or by phonetic. It does not combine the words and give them a weight, it is purely textual and aural. It does not concern the meaning of words, their semantics.

Pattern matching algorithms

There are differents ways, theories and algorithms to match a pattern on a document. We make a non-exaustive list.

Trivial search

A trivial search whould have the same behavior like the findbar has at present. Each character in the searched sequence is interpreted like a normal character.

For example, we search the sequence dummy va in a text:

This is a paragraph with a <mark>dummy va</mark>lue!
Regular expression

The search input should understand regular expressions.

Considere this text as example:

Here is simply a simple example, very simple but not simplistic about PCRE assertions that actually are not that simple!

Search \bsimpl.\b will match:

Here is <mark>simply</mark> a <mark>simple</mark> example, very <mark>simple</mark> but not simplistic about PCRE assertions that actually are not that <mark>simple</mark>!

We wonder if the PCRE engine would be used. It is based on a Nondeterministic Finite Automation, like Perl, Python, Java, .NET etc. used. But, a Deterministic Finite Automation would be faster, even if it is powerless, like MySQL, awk, egrep etc. used.

Approximate search

An other way to find better is to detect if the user would not have made typos. We can make an approximate pattern matching with differences.

Take this example — woefully simplistic —:

Try to match GATAA in CAGATAAGAGAA with at most one difference:

Many algorithms exist; let take us a look at the Levenshtein distance.

The Levenshtein distance is a metric for measuring the amount of difference between two sequences. The Levenshtein distance between two strings is given by the minimum of operations needed to transform one string into the other, where an operation is an insertion, deletion or substitution of a single character.

For example, the Levenshtein distance between kitten and sitting is 3:

Since the following three edits change one into the other, and there is no way to do it with fewer than three edits:

In fine, if the distance between two words is lesser than n, we could recognized this word as matching.

If Levenshtein distance is not much as relevant as we hoped, we can take a look to the Hamming distance, or many other algorithm.

Else, an algorithm is proposed in appendix: Search with an approximated patterns, with differences.

Phonetic algorithms

A phonetic algorithm is an algorithm for indexing of words by their pronunciation. Most phonetic algorithms were developed for use with the English language; consequently, applying the rules to words in other languages might not give a meaningful result.

Among the best-known phonetic algorithms are:

It should be hard to develop new algorithms for all Firefox supported languages (about 70 languages). A solution may exists with phonemes. It needs relfexions.

A second approach, more experimental this time, is to search by relevance, i.e. taking into account the meaning of words. Either we can assigned a weight from a distance (textual, aural, spatial) or can grouped them (example given by using headings or semantic).

Actually, this approach is based on the matching search (please, see the previous section), but it aims to order found matches to clarify the result.

Weight and distance

Group and section

Words weight

Each word has a weight, that could be determined with the Hamming weight algorithm. It would be great to match all words that have a similar weight that the original word or sequence.

Euclidean distance

Phonetic algorithms

Smartcase

User interface

Matches arity

List of excerpts

List of relevant sections

Bubbled scrollbar

Smooth transition when jumping

Focus the current matched element

Page preview

Highlight viewable matches in a list

Darken the document except matches

Appendix

Search with an approximated patterns, with differences

Here is a algorithm of a search with an approximated pattern, with differences, based upon the diagonal monotony, in PHP.

<?php

header('content-type: text/plain');

/*  Search by approximated patterns, with differences */
/*     based upon the principle diagonal monotony.    */


function approximatePatternMatchingWithDifferences ( $x, $y, $k ) {

    $m      = strlen($x);
    $n      = strlen($y);
    $offset = array();

    for($d = -1; $d <= ($n-$m+$k+1); $d++)
        $L[-1][$d] = -2;

    for($q = 0; $q <= ($k-1); $q++) {

        $L[$q][-$q-1] = $q-1;
        $L[$q][-$q-2] = $q-1;
    }

    for($q = 0; $q <= $k; $q++) {

        for($d = -$q; $d <= ($n-$m+$k-$q); $d++) {

            $l         = max($L[$q-1][$d-1],
                             $L[$q-1][$d]  +1,
                             $L[$q-1][$d+1]+1);
            $l         = min($l, $m-1);
            $a         = substr($x, $l+1   , -($l)   +($m-1));
            $b         = substr($y, $d+$l+1, -($d+$l)+($n-1));
            $L[$q][$d] = $l+lcp($a, $b);

            if(   $L[$q][$d]    == $m-1
               || $d+$L[$q][$d] == $n-1) {
                $j            = $m+$d;
                $i            = $j-$m >= 0 ? $j-$m : 0;
                $offset[$q][] = array('i' => $i, 'j' => $j);
            }
        }
    }

    return $offset;
}

function lcp ( $x, $y ) {

    $max = min(strlen($x), strlen($y));
    $i   = 0;

    while($i < $max && $x{$i} == $y{$i})
        $i++;

    return $i;
}

$x      = 'GATAA';
$y      = 'CAGATAAGAGAA';
$k      = 1;
$search = approximatePatternMatchingWithDifferences($x, $y, $k);

$n      = count($search);
echo 'Try to match ' . $x . ' in ' . $y . ' with at most ' . $k . ' difference(s): ' . "\n";
foreach($search as $q => $n) {
    echo '  - with ' . $q . ' difference(s), ' . count($n) . ' match(es) found:' . "\n";

    foreach($n as $i => $pos)
        echo '    - ' . substr($y, $pos['i'], -$pos['i']+$pos['j']) . ' ;' . "\n";
}

Auto-complétion

Proposition de mots clés