Aquila project

Latest published version:
5th august 2009
Previous versions:
4th august 2009
20th may 2009
18th may 2009
15th may 2009
14th may 2009
13th may 2009
12th may 2009
Editor:
Ivan Enderlin, Mozilla's intern, developper of Hoa Framework

Abstract

Aquila project aims to be a platform to experiment new ways to search through a document in Firefox.

This specification shows all ideas, sketches, early drafts etc.

Table of contents

  1. mark element: prestissimo
    1. Needs and usages
    2. Gecko implementation status
  2. Core: allegro
    1. nsIAquila interface
    2. nsIAquilaFinder interface
      1. nsIAquilaFinderMatch interface
      2. nsIAquilaFinderPositions interface
      3. nsIAquilaFinderPosition interface
    3. nsIAquilaDocument interface
    4. nsIAquilaWindow interface
  3. User experience
    1. Matching search
      1. Pattern matching algorithms
        1. In extenso search
        2. Regular expressions
        3. Approximated search
      2. Phonetic search
    2. Relevance of search results
      1. Weight and distance
      2. Group and section
      3. Euclidean distance
    3. Smartcase
  4. User interface
    1. Findbar
      1. Layout of the finbar
      2. Central search
      3. Next and previous
      4. Matches arity
    2. Results listings
      1. List of excerpts
      2. List of sections or groups
      3. List of excerpts grouped by sections
    3. Matches localizing…
      1. …with a list
      2. …with a scrollbar
        1. Place points
        2. Bubbles and moving
    4. Document state when searching
      1. Highlight and darken
      2. Smooth transition


Introduction: preludio

Presentation

This section is non-normative.

Aquila project aims to be a platform to experiment new ways to search through a document. Aquila is mainly developped for Firefox, distributed as a Firefox extension (downloadable as a .xpi file). Just type Accel + Shift + F to open the search bar.

Videos can be found in the video/ directory.

Background

This section is non-normative.

The current find system in Firefox is old. No exciting feature was added since a long time. Nevertheless, user experience and user interface are increasing all the days: Web documents are longer, denser and contain more informations than ever.

At present, the majority of browsers uses a simple findbar (accessibles via Accel + F) but it could be the wrong way. For example, if a Web document is heigher than the window, user is not aware if some results was found somewhere else in the document.

The current system is incremental (it means: previous, next) and not very visual. Aquila project attempts to rectify this, while proposing many experimentals features. It also attempts to be an experimental platform because of its modularity and light-dependences components.

Audience

This section is non-normative.

This document is intended for all people that are interesting by the project, have ideas, fancies etc.

Yet, this document is probably not suited to readers who do not already have at least a passing familiarity with Web technologies, users experiences or users interfaces.

Scope

This section is non-normative.

This document provides all ideas, sketches, early drafts etc., and comes with a semantic tour using new HTML5 tags.

mark element: prestissimo

Needs and usages

HTML5 introduces a new HTML semantic element: mark.

The HTML5 specification describes it like:

The mark element represents a run of text in one document marked of highlighted for reference purposes, due to its relevance in another context. When used in a quotation or other block of text referred to from the prose, it indicates a highlight that was not originally present but which has been added to bring the reader's attention to a part of the text that might not have been considered important by the original author when the block was originally written, but which is now under previously unexpected scrutiny. When used in the main prose of a document, it indicates a part of the document that has been highlighted due to its likely relevance to the user's current activity.

This element meets the needs of a search. Indeed, the scope of the mark element is to mark found matches.

For example, 42 words of a lipsum:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Proin egestas sodales nulla id imperdiet. Aenean facilisis varius nulla, in feugiat arcu consequat vel. Suspendisse potenti. Mauris vitae nisi orci. Praesent tincidunt consectetur condimentum. In laoreet sodales nibh id facilisis. Donec porta faucibus.

We search the in text, then:

Lorem ipsum dolor sit amet, consectetur adipisc<mark>in</mark>g elit. Pro<mark>in</mark> egestas sodales nulla id imperdiet. Aenean facilisis varius nulla, <mark>in</mark> feugiat arcu consequat vel. Suspendisse potenti. Mauris vitae nisi orci. Praesent t<mark>in</mark>cidunt consectetur condimentum. In laoreet sodales nibh id facilisis. Donec porta faucibus.

Gecko implementation status

This section is non-normative.

A bug is currently open on Bugzilla: Bug 485377, Implement HTML5's mark tag.

An early draft was proposed but it is not as relevant as we hoped it would be. It calls into question the general semantic tag declaration in Gecko; typically all elements that are span like. I am precisely focused on this point.

I am working on a new patch.

Core: allegro

The core is delivered as a XPCOM component, written in Javascript. We can see the aquilaFinder.idl interface and aquilaFinder.js associated component in the components/ directory.

This component presents many interfaces.

nsIAquila interface

The nsIAquila is a dispatcher to reach every important interfaces of the component. In the chrome context, it is often available as the aquila global variable.

It holds the finder (the main part), the document, the window and the dispatcher itself.

There is an init method to initialize the object. This method returns the object itself, it allows user to make a fluid interface.

Finally, the refresh method allows user to refresh the component when a special action is performed (like changing the tabulation or toggling the aquila bar).

Using example (in a chrome context); we write the bootstrap:

var aquila     = null;
var Aquila     = Components.Constructor(
    "@hoa-project.net/aquila;1",
    "nsIAquila"
);
var dispatcher = (new Aquila()).init(window);
aquila         = {
    finder:     dispatcher.finder,
    document:   dispatcher.document,
    window:     dispatcher.window,
    dispatcher: dispatcher.dispatcher
};

nsIAquilaFinder interface

The nsIAquilaFinder interface is the core of the core, understand that it implements the find method and its siblings (please, see the IDL to know more).

A trivial approach to find the word hoa would be:

aquila.finder.find("hoa");

Now, the finder stores the result in memory. Each match is a nsIAquilaMatch object. An iterator is available as a nsIAquilaFinderIterator object (accessibles with the getIterator and __iterator__ methods). The iterator could be reset with the help of the reset method.

Control methods are also available, through previous, next, change (seek to a new position), getMatch, getCurrentIndex and getCurrentMatch methods.

Finally, the number of matches are present on the count attribute.

Thus, we log all the matches' extracts:

for(let it = aquila.finder.getIterator(); it.valid(); it.next())
    log(it.key() + " " + it.current().toString());

nsIAquilaFinderMatch interface

Each found match is a nsIAquilaFinderMatch object. It allows user to mark or unmark the match (through a very simple CSS class), delete the match (it deletes the inserted <mark> tag), scrollTo the match, get an extract arround the match through the toString method or get all positions through the getPositions and __iterator__ methods (please, see the IDL to know more).

To mark a match:

aquila.finder
    .getMatch(42) // We assume that it exists.
    .mark();      // Mark the match, i.e. highlight it.

Try to scroll to a match:

aquila.finder
    .getMatch(42) // We assume that it exists.
    .scrollTo();  // If the match is not visible, it will be!

Actually, we cannot get a text position, so we surround all matches by <mark> tag. It is the main cause of a possible lag if a lot of matches exists. Knowing text positions will solve this problem: if the match is visible, we surround it, else we do nothing. In this way, we do not make a lot of DOM modifications and we can avoid the lag. It will be more smooth and nice to use.

nsIAquilaFinderPositions interface

Every match contains a collection of positions. Actually, it is a collection of nsIAquilaFinderMatchPosition objects.

We will not give details on this part, the IDL is meaningful.

nsIAquilaFinderPosition interface

We must show a position of a search result as a rectangle —which is obviously in a 2D space—.

May be more explicit? The rectangle has to kind of coordinates: document related and window related. (In the project terminology, we talk about window to imply the window viewport, i.e. what user can see of the document through its window). If we want match's positions in the document, we are talking about scrolls positions, i.e. the position added up the scrolls. Else, if we want match's positions in the window or viewport, we are talking about without scrolls positions.

The consideration of the scroll should be enable or disable through the considereScroll method (with a boolean).

For all positions, the unit is a float pixel. We speak of absolute positions. Relative positions exist: the unit is percentage. It is based on the absolute positions, so the scrolls consideration matters.

Finally, we can get the width and the height of each rectangle.

Thus, we get x and y positions, and the width and height of a match:

var match     = aquila.finder.getMatch(42); // We assume that it exists.
var positions = match.getPositions();       // Get all positions.
var position  = positions.getPosition(      // Get the middle position.
                    Math.floor(positions.count() / 2)
                );
position.considereScroll(true);             // Considere scrolls offset.
log(
    "x     : " + position.getAbsoluteTop()  + " px\n" +
    "either: " + position.getRelativeTop()  + " %\n"  +
    "y     : " + position.getAbsoluteLeft() + " px\n" +
    "either: " + position.getRelativeLeft() + " %\n"  +
    "width : " + position.getWidth()        + " px\n" +
    "height: " + position.getHeight()       + " px"
);

To be more efficient, we propose some methods to locate the match in a 2D space: isAbove, isRight, isBelow and isLeft methods. A combo-method is isVisible.

Check if the mark is above the viewport:

log(
    "Above? " + position.isAbove()
);

nsIAquilaDocument interface

The nsIAquilaDocument interface wrappes the nsIDOMDocument one and adding some methods, like getWidth and getHeight. To reach the nsIDOMDocument object, we should use the parent attribute.

Thus, we manipulate the document and get all <p> tags:

log(
    "width:  " + aquila.document.getWidth()  + " px\n" +
    "height: " + aquila.document.getHeight() + " px"
);

var ps = aquila.document.parent.querySelector("p");

nsIAquilaWindow interface

The nsIAquilaWindow interface wrappes the nsIDOMWindow one and adding some methods, like getWidth, getHeight, getTopShift and getLeftShift. The two lasts ones concern the scroll offset/shift.

And, as the nsIAquilaDocument, we have the parent attribute that enables user to reach the nsIDOMWindow object.

User experience

This section describes the manner whose we should search elements to answer as better as possible to the user's need. We do not focus on the user interface, i.e. the way to present the result, but instead the manner whose we should find the result.

First, we start with a presentation of several ways to find matches from a sequence. Next, we will focus on the more relevant manner to gather round matches, in order to create groups to get a better relevance in the result.

This part aims to find textually which match better to what user are searching. We do not really need the notion of relevance; we try to find all matches which are close to what user needs. Either we search all matches in extenso, either we search with the help of regular expressions, either we search with an approximated search (with differences), either all matches which have the same pronunciation.

Once all matches found, it will build up a first base. However, the matches will be ordered according to the natural sort of the document. This sort is not necessary the most relevant. This list of matches will be reshuffled to be sorted in a more relevant manner.

Pattern matching algorithms

We can ascertain two ways for searching matches in a document. The first —developped here— is textual. The second —developped in the next section— is phonetic.

Among these textual techniques, let's consider three of them: a trivial search or in extenso search, a search based on regular expressions, or an approximated search.

Status: implemented A first trivial approach will be to search a sequence in a document in extenso, i.e. we search a complete sequence, without difference, without care the sequence meaning etc.

Firefox uses this process now.

In this manner, we retrieve all elements which match the user's search. All elements are identicals (with only one sensitivity case).

Considere this text as example:

Here is simply a simple example, very simple!

Search simpl will match:

Here is <mark>simpl</mark>y a <mark>simpl</mark>e example, very <mark>simpl</mark>e!

A relevant manner to search a match will be using a binary-vector model.

If we use a system based on regular expressions, we just need to escape all wildcards and reserved characters in the expression.

Regular expressions

Status: not implemented Regular expressions are a powerful mechanism expressing the regulars languages theory.

Many engines exist, more or less powerfull, more or less fast, more or less expressives etc. Among this multitude of engines, we advice to use the PCRE engine, for Perl Compatible Regular Expressions. A C implementation exists and is available on the site PCRE.org.

Considere this text as example:

Here is simply a simple example, very simple but not simplistic about PCRE assertions that actually are not that simple!

Search simpl.*\b will match:

Here is <mark>simply</mark> a <mark>simple</mark> example, very <mark>simple</mark> but not <mark>simplistic</mark> about PCRE assertions that actually are not that <mark>simple</mark>!

Status: not implemented The approximated searches can be released with the help of regular expressions but to the detriment of a readable syntax and a speed execution.

There are specialized algorithms in the recognition sequence approached, particularly in genetics for the DNA chains matching.

A significant example:

We search GATAA in CAGATAAGAGAA with at most one difference:

This kind of search allows to find matches even if user makes typo. It not the only interesting case because it can find words with a gently different meaning, conjugation, agreement or spelling.

A relevant manner to deploy such a search mechanism will be to use the algorithm of approximated pattern matching with differences, based on the diagonal monotony principle.

An other simple manner will be to use the Levenshtein distance, and to select all words which have a distance smaller than a given n.

If the Levenshtein distance is not sufficient, we can also use the Hamming distance, which is more global.

Status: not implemented The phonetic algorithms basis is to make a connection between words from their pronunciation.

The problem with these algorithms is that they work only for one language, and generally English.

We specially notice:

Be carefull of phonemes.

A priori, we will focus more on Double Metaphone algorithm, which seems to be more relevant and support more languages.

Number of supported languages is a real problem for the spread of Firefox, because the software is deployed in 70 languages. The used algoritm must understand those languages.

Relevance of search results

The previous section introduced the manners to find results, understand the matches, in a document. This section continues the work by trying to sort the results in order to propose a better relevance.

Weight and distance

Status: not implemented A first manner to sort the list of results is to affect a weight to each result. Heavies weights will be the more relevant, and small weights will be the less relevant.

We propose many ideas to attribute a weight to a result.

TO DO!

Group and section

Status: not implemented An interesting approach should be to not display the matches but groups or sections that hold the maximum of matches. For example, if a section holds three matches et another only holds two, the first will get a heavier weight than the second, and will be more relevant.

We will reason with section for the rest of this part.

Sections are build on h1 to h6 tags, from HTML5. We do not considere nesting levels (h2 necessary in a h1), nor grouping (with section or hgroup tags). For now, the found matches that are out of a headings stream are ignored (before that we will find a better way to manage this case).

Let the following example:

<article>
  <hgroup>
    <h1>heading 1</h1>
    <p>dummy value</p>

    <h2>heading 2</h2>
    <p>dummy & dummy</p>
  </hgroup>

  <p>A paragraph with a dummy value.</p>

  <section>
    <h1>heading 1'</h1>
    <p>dummy value</p>
  </section>

  <section>
    <h1>dummy 1''</h1>
    <p>dummy dummy dummy & so on!</p>
  </section>
</article>

Here is a “list-view” (or a “DOM-view”), with matches number:

We attempt a first decreasing sort: 4 (E), 2 (B), 1 (A), 1 (C) and 1 (D). The paragraph (C) is out of the headings stream, so we do count it. It comes to: 4 (E), 2 (B), 1 (A) and 1 (D).

In our example, we note that the relevance is not complete. Actually, there is an equality between (A) and (D). To get a complete relevance, it is necessary to decide between. At this point, the notion of distance takes place.

We will naively start with a DOM distance. For example, in our case, (D) is closer to (E) (the more relevant section) than (A). So, we suppose that (D) will be more relevant than (A).

Thus, the final result will be: 4 (E), 2 (B), 1 (D) and 1 (A).

If another equality exists after this, we can suppose that the order will not add a better relevance, and we can skip this part.

Think to use the hyperbolic geometry to make a spacial-sort and not a DOM-sort with distances. And above all: hypertrees. We can define hypertrees with 3 or 4 levels, and we use all levels that are close to the center to decide between equalities during the first sort.

Euclidean distance

TO DO!

Smartcase

Status: implemented Case sensitivity is generally putted right by a checkbox. However, we may have a “smart” sensitivity handling.

If all characters are in lowercase, we can suppose that the case is not important, and thus the sensitivity is null. If at least one character is in uppercase, thus the case will be sensitive.

Let the text:

Aquila project: aquila means eagle in latin.

We search Aquila (it is case sensitive):

<mark>Aquila</mark> project: aquila means eagle in latin.

We search aquila (it is case unsensitive):

<mark>Aquila</mark> project: <mark>aquila</mark> means eagle in latin.

User interface

The user interface describes this time the way of present the result, i.e. the way of user could interact with the find system.

Findbar

The search bar must be big enough to show all graphics elements (like inputs, checkboxes, trees etc.) and well-placed to open new graphics elements without disturbing the user. It should allow to place all the necessary buttons.

Layout of the findbar

Status: implemented The search bar could be placed in many spots but we propose: either horizontally in the bottom of the window, either vertically in sidebar.

If we choose a place instead of an other, it will be only middlegrounds solutions. Indeed, a sidebar allows to show more element but it is very big in the screen (and force the page to resize), it takes up a lot of space. A horizontal bar is thiner than a sidebar but the various lists and trees of results will be difficult to place.

Status: not implemented We should use the previous searches used on other sites, on other Web search engines or even on operating system (we think notably at Spotlight on Mac OS X), to propose better results.

Next and previous

Status: implemented It is necessary to seek to the previous and the next match in the document. We could use texts or arrows to indicate the possible actions to user. Arrows are preferables because there are more visuals and thinners than texts (saving space).

If we press the Enter key, it should seek to the next match, and Shift + Enter to seek to the previous match. Loops should be allowed.

If we use lists or trees, we could seek to any match represented in the list or in the tree by a simple click on an item. If we use the points bar, so it does.

Matches arity

Status: implemented An interesting functionnality for the user is to indicate how many matches were found in the document.

This information could dress many styles: either we simply indicate if one or more matches were found (very binary information), either we indicate the exact number of found matches, either we could transform this information as a colour, but the information may lose precision.

Results listings

We try to explain how to present results.

List of excerpts

Status: implemented A first list could be a list of all found matches, with some previous and next words so as to create a list of excerpts.

List of sections or groups

Status: not implemented A second list could be the same as a list of excerpts but with grouped matches. It ensues a tree and not a list. Thus, we will see two columns: a progress bar that indicates the relevance of the section, and the section name.

List of excerpts grouped by sections

Status: not implemented We also could have a list of excerpts but folded into sections.

Matches localizing…

Actual search systems really provide not enough informations. An information that cruelly miss is the position of matches in the document. Without scrolling into its document in the current window, the user must be able to locate the global position of all matches. For example, know if there is a match in the very bottom of the document.

…with a list

Status: under implementation In the case of a list, we can darken all the items which are not visibles, i.e. out of the window (or the viewport).

…with a scrollbar

If we hope to know visually all the matches' positions in the document, either we can reduce the document to fit with the window, to see it in totally, either use scrollbars. We decide to chose the last idea.

Place markers

Status: implemented The idea is as follow: a scrollbar with markers, where each marker is set up proporitionnaly at the match position in the page. For example, if a match is at the very bottom of the page, its proportionnal position (or relative position in the document) is 100%, so the marker will be placed at the very bottom of the scrollbar. Same example with a match in the middle of the document where the corresponding marker will be placed at the middle of the scrollbar.

Marker can recover many shapes: a simple point, an arrow, a triangle etc. We decide to chose these conventions: a point for the visibles matches and an arrow for the not visibles matches.

Arrows are useful to locate matches in a 2D space. We considere only 4 positions for arrows: top, left, bottom and right. Having full 360 positions would be useless and could make confuse the user (and adding a lag).

Bubbles and moving

Status: not implemented When hover a marker, we should see an extract of the page, like a balloon. And clicking on a marker should scroll the window to the match position.

Document state when searching

Highlight and darken

Status: implemented When user is in searching mode, the document could be darken and only found matches could be with the same style as before, and the selected/current match could be highlighted.

Smooth transition

Status: not implemented When seeking to a previous or a next match, if we need to scroll to a new position of the window, it could be by appling a smooth transition.


Appendix

Search with an approximated patterns, with differences

Here is a algorithm of a search with an approximated pattern, with differences, based upon the diagonal monotony, in PHP.

<?php

header('content-type: text/plain');

/*  Search by approximated patterns, with differences */
/*     based upon the principle diagonal monotony.    */


function approximatePatternMatchingWithDifferences ( $x, $y, $k ) {

    $m      = strlen($x);
    $n      = strlen($y);
    $offset = array();

    for($d = -1; $d <= ($n-$m+$k+1); $d++)
        $L[-1][$d] = -2;

    for($q = 0; $q <= ($k-1); $q++) {

        $L[$q][-$q-1] = $q-1;
        $L[$q][-$q-2] = $q-1;
    }

    for($q = 0; $q <= $k; $q++) {

        for($d = -$q; $d <= ($n-$m+$k-$q); $d++) {

            $l         = max($L[$q-1][$d-1],
                             $L[$q-1][$d]  +1,
                             $L[$q-1][$d+1]+1);
            $l         = min($l, $m-1);
            $a         = substr($x, $l+1   , -($l)   +($m-1));
            $b         = substr($y, $d+$l+1, -($d+$l)+($n-1));
            $L[$q][$d] = $l+lcp($a, $b);

            if(   $L[$q][$d]    == $m-1
               || $d+$L[$q][$d] == $n-1) {
                $j            = $m+$d;
                $i            = $j-$m >= 0 ? $j-$m : 0;
                $offset[$q][] = array('i' => $i, 'j' => $j);
            }
        }
    }

    return $offset;
}

function lcp ( $x, $y ) {

    $max = min(strlen($x), strlen($y));
    $i   = 0;

    while($i < $max && $x{$i} == $y{$i})
        $i++;

    return $i;
}

$x      = 'GATAA';
$y      = 'CAGATAAGAGAA';
$k      = 1;
$search = approximatePatternMatchingWithDifferences($x, $y, $k);

$n      = count($search);
echo 'Try to match ' . $x . ' in ' . $y . ' with at most ' . $k . ' difference(s): ' . "\n";
foreach($search as $q => $n) {
    echo '  - with ' . $q . ' difference(s), ' . count($n) . ' match(es) found:' . "\n";

    foreach($n as $i => $pos)
        echo '    - ' . substr($y, $pos['i'], -$pos['i']+$pos['j']) . ' ;' . "\n";
}

Auto-complétion

Proposition de mots clés