OPIC Details

HITS algorithm

Introduction

Since OPIC is just a way of computing the HITS score for each page we must first understand what the HITS score is. HITS stands for Hyperlink-Induced Topic Search, originally described in:

Authoritative sources in a hyperlinked environment
Kleinberg J.
1998

The idea is to compute for each page a pair of numbers called the hub and authority scores. Intuitively, we say a page is a hub when it points to a lot of pages with high authority, and we say that a page has high authority if it is pointed to by many hubs. Mathematically this is expressed as:

\[\begin{split}h_i &= \sum_{j: i \to j} a_j \\ a_i &= \sum_{j: j \to i} h_j\end{split}\]

Where \(h_i\) represents the hub score of page \(i\) and \(a_i\) represents its authority score. \(j: i \to j\) represents the set of pages \(j\) pointed to by \(i\) and \(j: j \to i\) represents the set of pages \(j\) that point to \(i\). For example, consider the following set of four pages, where page 1 points to page 2, page 2 points to page 4, and page 3 points to pages 1, 2 and 4:

The equations to compute the scores would be:

\[\begin{split}h_1 &= a_2 &\quad a_1 &= h_3\\ h_2 &= a_4 &\quad a_2 &= h_1 + h_3\\ h_3 &= a_1 + a_2 + a_4 &\quad a_3 &= 0\\ h_4 &= 0 &\quad a_4 &= h_2 + h_3\end{split}\]

The above equations can be rewritten in matrix form. Let’s define:

\[\begin{split}L = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 1 & 1 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{bmatrix}\end{split}\]

\(L\) is called the link matrix because it contains all the link information available in the graph. Notice that \(L_{ij}\) is 1 if there is a link from page \(i\) to page \(j\) and 0 otherwise. If we group the hub and authority scores in the vectors \(h\) and \(a\) we can write the HITS equations as:

\[\begin{split}h &= La &= (LL^T)h &= Hh\\ a &= L^Th &= (L^TL)a &= Aa\end{split}\]

The above equations always have the trivial solutions \(h=a=0\). The other non-zero solutions are given by the eigenvectors of the matrices \(H\) and \(A\).

Power iteration

A very simple and efficient way of getting the eigenvector with the largest eigenvalue is the power method: to compute the dominant eigenvector \(v\) of a matrix \(M\), simply iterate long enough starting from a random vector \(v^0\):

\[v^k = Mv^{k-1}\]

Notice that unless the associated eigenvalue is exactly 1 you will need to re-normalize after each iteration.

Let’s apply this method to our toy problem. Using numpy the code for the power method is very simple:

import numpy as np


def power(M, eps=1e-6, max_iter=1000):
    """Compute the largest eigenvector of M.

    Returns: the eigenvector and the eigenvalue
    """
    v0 = np.random.rand(M.shape[0])

    converged = False
    for i in xrange(max_iter):
        v1 = M.dot(v0)
        v1 /= np.linalg.norm(v1)
        err = np.max(np.abs(v1 - v0))
        if err < eps:
            converged = True
            break

        v0 = v1
    if not converged:
        print "Warning: didn't converge"

    return v1, M[0, :].dot(v1)/v1[0]

And we can apply it to our toy problem:

L = np.array([
    0, 1, 0, 0,
    0, 0, 0, 1,
    1, 1, 0, 1,
    0, 0, 0, 0]).astype(float).reshape(4,4)

H = L.dot(L.transpose())
A = L.transpose().dot(L)

h, h_eigv = power(H)
a, a_eigv = power(A)

print "Hub scores: ", h
# Hub scores:  [ 0.32505736  0.3250578   0.88807383  0.        ]
print "Authority scores: ", a
# Authority scores:  [ 0.45970084  0.62796324  0.          0.62796282]

If we place the scores directly on the graph we can see that they match our intuitive notion of what a hub and an authoritative page are: page 3, which points to the most pages, gets the highest hub score, while pages 2 and 4, which receive the most links, get the highest authority scores.

Notice that the power method can fail. Take for example the matrix \(L\): no matter what initial value we start iterating from, it will always converge to the zero vector. To see why, notice the following identity, which gives the power method its name:

\[v^k = Mv^{k-1} = M^2v^{k-2} = \ldots = M^kv^0\]

Now let’s compute the different powers of \(L\)

\[\begin{split}L^2 &= \begin{bmatrix} 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \\ 0 & 1 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{bmatrix} \\ L^3 &= \begin{bmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 0 & 0 \end{bmatrix} \\ L^4 &= L^5 = \ldots = 0\end{split}\]

The reason for this behavior is that the only eigenvalue of \(L\) is 0. There is however a non-zero eigenvector associated to the 0 eigenvalue:

\[\begin{split}\begin{pmatrix} 0 \\ 0 \\ 1 \\ 0 \end{pmatrix}\end{split}\]

But the power method fails to find it. A trick to solve this problem is to apply a small perturbation \(\epsilon\) to \(L\) and then apply the power method; it should converge to a small perturbation of the original eigenvector. Let's try:

print power(L + 1e-5)
# (array([  5.52652292e-02,   3.23859608e-03,   9.98466441e-01,
#          1.79833785e-04]), 0.058792257444600315)

As you can see it returns a result very close to the true answer. The perturbed eigenvector is \((0.055, 0.003, 0.998, 0.000)\) and the perturbed eigenvalue is 0.06.

We end with some final comments:

  1. The power method is efficient because the link matrix is sparse, meaning that most of its elements are zero. However we need to have it fully loaded in memory to perform an iteration of the algorithm.

  2. We don’t need to compute the \(H\) and \(A\) matrices and instead we can perform the following iteration:

    \[\begin{split}h_{k+1} &= La_k \\ a_{k+1} &= L^Th_k\end{split}\]

    Which can also be rewritten as:

    \[\begin{split}\begin{pmatrix} h \\ a \end{pmatrix}^{k+1} = J \begin{pmatrix} h \\ a \end{pmatrix}^{k}\end{split}\]

Where:

\[\begin{split}J = \begin{bmatrix} 0 & L \\ L^T & 0 \end{bmatrix}\end{split}\]

And we can check that the power method applied to the matrix \(J\) gives the same answer as before:

N = L.shape[0]
J = np.zeros((2*N, 2*N))
J[:N, N:] = L
J[N:, :N] = L.transpose()
h, a = np.split(power(J, max_iter=1000)[0], 2)
h /= np.linalg.norm(h)
a /= np.linalg.norm(a)

print "Hub scores: ", h
# Hub scores:  [ 0.32505758  0.32505758  0.88807383  0.        ]

print "Authority scores: ",a
# Authority scores:  [ 0.45970084  0.62796303  0.          0.62796303]

Actually, it gives a warning about convergence. The reason is that the vector built by stacking together \(h\) and \(a\) oscillates between the following two vectors:

\[\begin{split}\begin{matrix} (0.200 & 0.200 & 0.545 & 0 & | & 0.363 & 0.496 & 0 & 0.496) \\ (0.257 & 0.257 & 0.701 & 0 & | & 0.282 & 0.386 & 0 & 0.386) \end{matrix}\end{split}\]

However, the two halves of the above vectors are equal up to a scaling constant, so the method converges to the correct scores.

OPIC algorithm

Introduction

OPIC is just an alternative to the power method for computing the largest eigenvector, and so it can be applied to compute the HITS scores. It is described in:

Adaptive On-Line Page Importance Computation
Abiteboul S., Preda M., Cobena G.
2003

Given a page graph, the idea is to maintain for each page two quantities: the cash and the history. When we update a page, its cash is distributed to its children and the total quantity is added to the page history. For example, returning to our toy example, let's suppose that at a given instant in time we have the current state of cash and history associated to each page:

And we update page 3. Since it has 6 units of cash it will distribute 2 units to each one of its 3 children. It will also increment its history by 6 units, since this is the total amount of cash it has distributed and its cash will be reset to 0. The following image shows the change in the graph state after a single update.
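In code, a single update of page 3 could look like this (a sketch; the cash and history values of the other pages are hypothetical, since the figure is not reproduced here):

# state: page -> (cash, history); only page 3's cash (6 units) is given above
state = {1: (1.0, 0.0), 2: (1.0, 0.0), 3: (6.0, 0.0), 4: (1.0, 0.0)}
children = {1: [2], 2: [4], 3: [1, 2, 4], 4: []}

cash, history = state[3]
for child in children[3]:
    child_cash, child_history = state[child]
    state[child] = (child_cash + cash / len(children[3]), child_history)
state[3] = (0.0, history + cash)
# page 3 ends with cash 0 and history increased by 6;
# pages 1, 2 and 4 each gain 2 units of cash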

If we keep performing these updates, making sure that we update all pages, the history of the pages will converge, after normalization, to the same values as the power method applied to the following matrix:

\[\begin{split}\bar{L} = \begin{bmatrix} 0 & 0 & \frac{1}{3} & 0 \\ 1 & 0 & \frac{1}{3} & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 1 & \frac{1}{3} & 0 \end{bmatrix}\end{split}\]

You can see that \(\bar{L}\) is just \(L^T\) where each column has been divided by a normalizing factor so that each column adds to 1 (columns corresponding to pages without outgoing links, like page 4, remain zero).

The utility of the OPIC method lies in the fact that the order in which we update the pages does not matter, nor does it matter if we update some pages more often than others, as long as all pages are updated periodically.

However, we must remember that the power method sometimes fails to converge, and we solved that by adding a small perturbation to the link matrix. We make an equivalent modification to the OPIC algorithm by adding a small modification to the graph: we add a virtual page that links to, and is linked from, all other pages. For our toy problem it would look like this:

Let’s check our claims with some code. The following code will apply the OPIC algorithm to an arbitrary graph. For simplicity the graph is represented as a dictionary where the key is the node name and the value is a list of outgoing nodes.

def add_virtual_page(graph):
    new_graph = {node: children + ['V']
                 for node, children in graph.iteritems()}
    new_graph['V'] = graph.keys()
    return new_graph

example = add_virtual_page(
    {1: [2],
     2: [4],
     3: [1, 2, 4],
     4: []}
)

We now implement the algorithm, where the next page to update is randomly selected.

import random


def opic(graph, n_iter=100):
    """Simple implementation of OPIC where all children get the
        same cash"""

    # get the list of nodes
    nodes = graph.keys()
    # build a dictionary to hold cash and history
    # we give each page an initial cash of 1
    state = {node: (1.0, 0.0) for node in nodes}
    for i in xrange(n_iter):
        node = random.choice(nodes)
        cash, history = state[node]
        children = graph[node]
        # distribute cash to children
        n_children = len(children)
        for child in children:
            child_cash, child_history = state[child]
            state[child] = (child_cash + cash/n_children,
                            child_history)
        # update node state
        state[node] = (0.0, history + cash)

    # return normalized history
    history = [state[n][1] for n in nodes]
    total = sum(history)
    return zip(nodes, [h/total for h in history])

To compare against the power method we build the equivalent link matrix, where the last row/column corresponds to the virtual page. First without normalization:

\[\begin{split}L = \begin{bmatrix} 0 & 1 & 0 & 0 & 1\\ 0 & 0 & 0 & 1 & 1\\ 1 & 1 & 0 & 1 & 1\\ 0 & 0 & 0 & 0 & 1\\ 1 & 1 & 1 & 1 & 0 \end{bmatrix}\end{split}\]

And after transposition and normalization:

\[\begin{split}\bar{L} = \begin{bmatrix} 0 & 0 & \frac{1}{4} & 0 & \frac{1}{4} \\ \frac{1}{2} & 0 & \frac{1}{4} & 0 & \frac{1}{4} \\ 0 & 0 & 0 & 0 & \frac{1}{4} \\ 0 & \frac{1}{2} & \frac{1}{4} & 0 & \frac{1}{4} \\ \frac{1}{2} & \frac{1}{2} & \frac{1}{4} & 1 & 0 \end{bmatrix}\end{split}\]

We can now compare the two methods:

for node, score in opic(example, n_iter=1000):
    print node, score

# 1 0.198500025862
# 2 0.298524016644
# 3 0.157148867939
# 4 0.345827089555
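The helper make_stochastic used in the snippet below is not defined in this document; a minimal sketch consistent with \(\bar{L}\) above (divide each column by its sum so that every column adds to 1, leaving all-zero columns untouched) could be:

import numpy as np


def make_stochastic(M):
    # Normalize each column of M so that it sums to 1.
    # Columns that sum to zero (pages without outgoing links) are left as-is.
    M = np.array(M, dtype=float)
    col_sums = M.sum(axis=0)
    for j in range(M.shape[1]):
        if col_sums[j] > 0:
            M[:, j] /= col_sums[j]
    return M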

L = np.array([
    0, 1, 0, 0, 1,
    0, 0, 0, 1, 1,
    1, 1, 0, 1, 1,
    0, 0, 0, 0, 1,
    1, 1, 1, 1, 0]).astype(float).reshape(5, 5)


p, k = power(make_stochastic(L.transpose()))
p = p[:4]
p /= np.linalg.norm(p, ord=1)
print p
# [ 0.19801978  0.29702981  0.15841572  0.34653469]

So, to summarize, if we have a matrix \(M\) where each column sums to 1, we can compute its largest eigenvector either by using the power method on that matrix or by applying the OPIC algorithm on an associated graph which has an edge from \(i\) to \(j\) whenever \(M_{ji} \neq 0\). When node \(i\), holding cash \(c_i\), is updated it gives \(M_{ji}\cdot c_i\) cash to node \(j\).

Application to HITS

Notice that we cannot apply OPIC directly to HITS without modifying the problem, since we have the requirement of all columns (or rows, depending on the convention) summing to 1. This requirement is necessary for the OPIC algorithm to work, since it ensures that the total amount of cash inside the graph remains constant. Otherwise the cash would go to 0 or to \(\infty\).

We will now use 4 scores for each page: hub cash and history, and authority cash and history. When we update a page, its hub cash is distributed among its children's authority cash and added to its hub history. Its authority cash is distributed among its predecessors' hub cash and added to its authority history.
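A minimal sketch of a single page update under this scheme (ignoring the virtual page; the helper below and its state layout are illustrative, not the actual implementation):

def update_hits_page(node, successors, predecessors, state):
    """state maps node -> [hub_cash, hub_history, auth_cash, auth_history]."""
    h_cash, h_history, a_cash, a_history = state[node]
    # hub cash flows to the authority cash of the pages this node points to
    if successors:
        share = h_cash / len(successors)
        for s in successors:
            state[s][2] += share
    # authority cash flows back to the hub cash of the pages pointing to this node
    if predecessors:
        share = a_cash / len(predecessors)
        for p in predecessors:
            state[p][0] += share
    # accumulate the distributed cash into the histories and reset the cash
    # (assumes no self-loops, so the node did not receive cash from itself)
    state[node] = [0.0, h_history + h_cash, 0.0, a_history + a_cash]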

The following figure shows a single step of the algorithm applied to our toy problem, where we have ignored the virtual page to simplify things. It is actually not necessary to update both scores in the same step; however, we show here both the hub and authority cash flows.

Notice that with this algorithm what remains constant after each step is the total amount of cash, that is, the sum of the hub and authority cash.

If you want to see how OPIC compares against the power method the following document shows a comparison between the two:

OPIC vs power method

Adaptive OPIC

The adaptive version of the OPIC algorithm takes into account the case where the graph is changing. In this case the pages that were inside the graph from the beginning have accumulated more history than the newer ones, even if the latter are more important. Of course, if we freeze the graph this imbalance will be corrected in the long term, but it could take a lot of time.

One possible solution would be to discard all the pages' history and restart again, but this is overkill since we are probably already near the correct solution. The adaptive version of OPIC proposes a middle ground: discard all history older than a certain amount of time. To do this exactly we would need to save the history of each page at every point in time. A more efficient alternative is to estimate how much history we should discard. This is called history interpolation in the original paper and is the approach taken in this implementation.

Notice that the adaptive version has introduced a new parameter: a time window outside of which we should ignore history.
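A possible way to estimate how much history to discard, given such a time window, is the following sketch (the actual implementation may interpolate differently):

def interpolate_history(history, first_seen, now, time_window):
    # Assume the history accumulated at a roughly constant rate since the
    # page was first seen, and keep only the fraction inside the time window.
    elapsed = now - first_seen
    if not time_window or elapsed <= time_window:
        return history
    return history * time_window / elapsed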

Adding external relevance measure

Although one of the nice properties of HITS scoring is that it only requires link information, it has been demonstrated that adding content-based information can greatly improve the effectiveness of topic-focused crawlers. There are lots of ways in which we could combine an external relevance score with the HITS scores to give a global score of page relevance; we now describe the one developed here.

We will consider that the external relevance score is another measure of the page authority (a good hub can be empty of content, as long as it points to relevant pages) and ranges between 0 (not relevant at all) and 1 (completely relevant).

With this in mind we modify the distribution of authority cash when we update a page. Let's call \(r_i\) the relevance score for page \(i\), which has \(P_i\) parent pages pointing to it. If page \(i\) is relevant we want to distribute all its authority cash back to its parents, but if it's not relevant we want to avoid distributing any cash back. Since we don't want to destroy cash, what we do instead is give that cash to the virtual page.

Mathematically, if we have \(a_i\) authority cash:

  1. Give \(a_i\cdot z(r_i)\) cash to each parent page.
  2. Give \(a_i[1 - P_i\cdot z(r_i)]\) cash to the virtual page.

Where \(z\) is a function of the page relevance and must satisfy the following conditions:

\[\begin{split}z(0) &= 0 &\quad \text{No cash goes back to the parents} \\ z(1) &= \frac{1}{P} &\quad \text{All cash goes back to parents} \\ z(0.5) &= 1 - P\cdot z(0.5) &\quad \text{The virtual page counts as another parent}\end{split}\]

Notice that the third condition is equivalent to saying that when the relevance is 0.5 the algorithm is equivalent to OPIC without external relevance score.

One possible function \(z(r)\) can be built by considering the second-order polynomial that fits all of the above points:

\[z(r) = \frac{2r}{P}\left( \frac{2P}{P+1}(1 - r) + (r - 0.5)\right)\]
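As a quick numerical sanity check (a sketch, not part of the implementation), we can verify that this polynomial satisfies the three conditions above:

def z(r, P):
    # Second-order polynomial satisfying z(0)=0, z(1)=1/P, z(0.5)=1/(P+1)
    return 2.0 * r / P * (2.0 * P / (P + 1) * (1 - r) + (r - 0.5))

P = 3
assert abs(z(0.0, P)) < 1e-12                  # no cash goes back to the parents
assert abs(z(1.0, P) - 1.0 / P) < 1e-12        # all cash goes back to the parents
assert abs(z(0.5, P) - 1.0 / (P + 1)) < 1e-12  # virtual page counts as another parent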

Implementation

The algorithm described in the previous sections is implemented inside this class:

class crawlfrontier.contrib.backends.opic.opichits.OpicHits(db_graph=None, db_scores=None, time_window=None, db_relevance=None)

Implements the OPIC algorithm applied to the HITS scores problem

Parameters:
  • db_graph (GraphInterface) – Read only graph database. If None create a new one using SQLite
  • db_scores (HitsDBInterface) – Scores database. If None create a new one using SQLite
  • time_window (float) – Ignore cash flow out of this time window. Set to False/None to ignore.
  • db_relevance (RelevanceDBInterface) – Read only relevance database. If None create a new one using SQLite
a_mean

Mean of authority scores

add_page(page_id)

Add a new page

Parameters:page_id (str) – Page identification
Returns:HitsScore – The new score assigned to the page
close()

Close any associated database

get_scores(page_id)

Normalized hub and authority score

Parameters:page_id (str) – Page identification
Returns:(float, float) – A tuple (hub score, authority score) for the given page_id
h_mean

Mean of hub scores

iscores()

Iterate over (page id, hub score, authority score)

mark_update(page_id)

Add this to the list of pages to update

To decide which pages we should update next we use a heuristic: we select the pages with the highest accumulated authority or hub cash. This function makes it possible to externally add a given page to the set of pages to be updated, irrespective of its accumulated cash.

Parameters:page_id (str) – Page identification
update(n_iter=1)

Run a full iteration of the OPIC-HITS algorithm

Parameters:n_iter (int) – number of iterations
Returns:pair of lists – The first one contains pages with an updated hub score and the second one contains pages with an updated authority score.
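A hypothetical usage sketch, assuming the default in-memory SQLite databases and that links are added through the graph database (the exact wiring may differ in practice):

from crawlfrontier.contrib.backends.opic import graphdb, opichits

graph = graphdb.SQLite()                 # in-memory graph database
graph.add_node('page_1')
graph.add_node('page_2')
graph.add_edge('page_1', 'page_2')

opic = opichits.OpicHits(db_graph=graph)
opic.add_page('page_1')
opic.add_page('page_2')
opic.mark_update('page_1')
updated_hubs, updated_auths = opic.update(n_iter=10)
print opic.get_scores('page_1')          # (hub score, authority score)
opic.close()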

As you can see it has one optional parameter, the time window of the adaptive version, and three external database components: the graph database, the relevance database and the HITS scores database.

To facilitate adding different database backends, their interfaces have been extracted into separate abstract classes.

Graph database

Every class implementing the following interface can be a drop-in replacement for the included SQLite backend. It is not a requirement to inherit from this class, but it is recommended so that the interface can be checked.

Notice that this database is only accessed from opichits.OpicHits via read operations.

class crawlfrontier.contrib.backends.opic.graphdb.GraphInterface

Interface definition for a Graph database

The graph does not store any additional data inside the nodes or edges, only the link structure. All the functions below represent nodes as unique identifier strings and edges as pairs of nodes (Python tuples).

add_edge(start, end)

Add a new edge to the graph, from start to end

add_node(node)

Add a new node

clear()

Delete all contents

close()

Close connection and commit changes

delete_edge(node)

Delete edge

delete_node(node)

Delete node, and all edges connecting to this node

end_batch()

Commit pending changes

has_node(node)

True if node present inside graph

iedges()

An iterator for all the edges

inodes()

An iterator for all the nodes

nedges()

Number of edges

neighbours(node)

A set of nodes going to or from ‘node’

nnodes()

Number of nodes

predecessors(node)

A list of the predecessors for the given node

start_batch()

Do not commit changes to the graph until this batch ends

successors(node)

A list of the successors for the given node
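For illustration, a partial in-memory sketch of such a drop-in replacement (not part of the package, and implementing only a subset of the methods) could look like this:

class MemoryGraph(object):
    """Partial in-memory sketch of GraphInterface (illustrative only)."""

    def __init__(self):
        self._succ = {}  # node -> set of successor nodes
        self._pred = {}  # node -> set of predecessor nodes

    def add_node(self, node):
        self._succ.setdefault(node, set())
        self._pred.setdefault(node, set())

    def add_edge(self, start, end):
        self.add_node(start)
        self.add_node(end)
        self._succ[start].add(end)
        self._pred[end].add(start)

    def has_node(self, node):
        return node in self._succ

    def successors(self, node):
        return list(self._succ[node])

    def predecessors(self, node):
        return list(self._pred[node])

    def neighbours(self, node):
        return self._succ[node] | self._pred[node]

    def nnodes(self):
        return len(self._succ)

    def nedges(self):
        return sum(len(s) for s in self._succ.values())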

The following class is a simple implementation of the above interface using SQLite. It maintains a table with edges and a table with nodes. The tables are fully indexed for performance.

class crawlfrontier.contrib.backends.opic.graphdb.SQLite(db=None)

SQLite implementation of GraphInterface

Relevance database

If this database is provided, opichits.OpicHits will consult the score associated with each page to decide how to distribute the authority cash.

Notice that this database is only accessed from opichits.OpicHits via read operations.

As usual, there is an interface which all backends must satisfy:

class crawlfrontier.contrib.backends.opic.relevancedb.RelevanceDBInterface

Interface definition for relevance databases.

Page ID’s are unique string identifiers for pages and page scores are float numbers between 0 and 1.

add(page_id, page_score)

Add a new association

clear()

Delete all contents

delete(page_id)

Delete page

get(page_id)

Get score for the given page

get_best_scores(n)

Get the n highest scores as a list of tuples of type (page_id, page_score)

iscores()

An iterator over all tuples (page_id, page_score)

set(page_id, page_score)

Change score

And there is also a batteries included SQLite implementation of the interface:

class crawlfrontier.contrib.backends.opic.relevancedb.SQLite(db=None)

A SQLite implementation for RelevanceDBInterface

HITS scores database

The HITS database contains the state of the OPIC algorithm. Since there are several values to maintain for each page a new class has been created for storing them:

class crawlfrontier.contrib.backends.opic.hitsdb.HitsScore(h_history, h_cash, h_last, a_history, a_cash, a_last)

Just a container for the following (modifiable) fields:

Parameters:
  • h_history – Accumulated hub cash
  • h_cash – Non-spent hub cash
  • h_last – Total hub cash the last time it was updated
  • a_history – Accumulated authority cash
  • a_cash – Non-spent authority cash
  • a_last – Total authority cash the last time it was updated

First, the interface.

class crawlfrontier.contrib.backends.opic.hitsdb.HitsDBInterface

Interface definition for every HITS database.

The keys of this database are page id’s, unique string identifiers for each page, and the values are HitsScore instances.

add(page_id, page_score)

Associate page_score with page_id

clear()

Delete all contents

close()

Close connection and commit changes

delete(page_id)

Delete association

get(page_id)

Get the HitsScore associated with page_id

get_a_total()

Total accumulated authority score

get_count()

Get number of scores in database

get_h_total()

Total accumulated hub score

get_highest_a_cash(n=1)

Get the n pages with the highest authority cash

get_highest_h_cash(n=1)

Get the n pages with the highest hub cash

increase_a_cash(page_id_list, a_cash)

Increase the authority cash of the given pages by this amount

increase_all_cash(h_cash, a_cash)

Increase the hub and authority cash of all pages by these amounts

increase_h_cash(page_id_list, h_cash)

Increase the hub cash of the given pages by this amount

iteritems()

Iterate over hits scores

Returns:iterator over pairs – Each pair has the form (page_id, page_score)
set(page_id, page_score)

Change the HitsScore associated with page_id

And of course, the already provided SQLite implementation:

class crawlfrontier.contrib.backends.opic.hitsdb.SQLite(db=None)

SQLite based implementation for HitsDBInterface

Make a new connection to a HITS scores database or, if None is provided, create a new in-memory one.

Scheduler

Introduction

Having a measure of page importance is only the first step. Now we need to decide which page we should crawl next. In order for the scheduler to take that decision it must have some information about the pages. We will assume that it has two items of information about each page: the change rate of the page and a measure of the page value. With this information it must give back to us the next pages to crawl. With these considerations we define the interface that all schedulers must satisfy:

class crawlfrontier.contrib.backends.opic.scheduler.SchedulerInterface

Interface that must be satisfied by all schedulers

close()

Close all databases

delete(page_id)

Remove page from scheduler

get_next_pages(n_pages)

Return next pages to crawl

set_rate(page_id, rate_new)

Set change rate for given page

set_value(page_id, value_new)

Set page value for given page

BestFirst

One possible strategy for a scheduler is just to be greedy: visit the next page in the frontier that gives the most value. This type of crawler is implemented in:

class crawlfrontier.contrib.backends.opic.scheduler.BestFirst(rate_value_db=None)

A BestFirst crawler always returns the next page with the highest value.

To be truly best-first, get_next_pages should be called with n_pages=1. However, the crawler runs fine if it is asked for the next best n_pages.

Parameters:rate_value_db (schedulerdb.SchedulerDBInterface) – database to store (page_id, rate, value) triplets
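The greedy idea can be sketched with a priority queue (an illustration only; the real BestFirst scheduler keeps its state in a scheduler database):

import heapq


class GreedyFrontier(object):
    """Illustrative greedy frontier: always return the pages with highest value."""

    def __init__(self):
        self._heap = []

    def set_value(self, page_id, value):
        # heapq is a min-heap, so store the negated value
        heapq.heappush(self._heap, (-value, page_id))

    def get_next_pages(self, n_pages):
        # pop the n_pages entries with the highest value
        pages = []
        while self._heap and len(pages) < n_pages:
            _, page_id = heapq.heappop(self._heap)
            pages.append(page_id)
        return pages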

Revisiting scheduler

BestFirst as is has one obvious limitation: it doesn't revisit web pages, since once a page is crawled it is deleted from the frontier. We could modify it by adding an estimation of each web page's change rate and then reconsidering a page after some time has passed, depending on that rate.

Another possibility is to use a scheduler which is not greedy, but optimal in the sense that it maximizes the long-term effectiveness of the crawler if some assumptions about page change rates hold. This scheduler is implemented in:

class crawlfrontier.contrib.backends.opic.scheduler.Optimal(n_clusters=100, rate_value_db=None, freq_db=None)

Compute the optimal refresh frequency, with a fixed computational cost

Parameters:
  • n_clusters (int) – number of clusters to use for the approximation
  • rate_value_db (schedulerdb.SchedulerDBInterface) – database to store (page_id, rate, value) triplets
  • freq_db (freqdb.FreqDBInterface) – database to store the scheduler solution
crawl_rate

Get crawl rate

frequency(page_id)

Get the optimal refresh frequency for a page

As you can see there are two databases associated with the scheduler:

The scheduler database just maintains an association between pages and their change rates and values. As usual, it can be anything that adheres to the following interface:

class crawlfrontier.contrib.backends.opic.schedulerdb.SchedulerDBInterface
add(page_id, page_rate, page_value)

Add a new association

clear()

Delete all contents

get(page_id)

Get (page_rate, page_value) for the given page

get_best_value(n_pages=1, delete=False)

Get the pages with highest value

Parameters:
  • n_pages (int) – number of pages to retrieve
  • delete (bool) – if True, remove the retrieved pages from the database
iter()

An iterator over all tuples (rate, value)

set(page_id, page_rate, page_value)

Change association

Of course there is an already provided SQLite implementation:

class crawlfrontier.contrib.backends.opic.schedulerdb.SQLite(db=None)

A SQLite implementation for the SchedulerDBInterface

The “freqs” database associates pages with the optimal frequencies computed by the scheduler. However, it not only stores the association but also allows querying which page should be crawled next.

The interface is defined in:

class crawlfrontier.contrib.backends.opic.freqdb.FreqDBInterface

Interface definition for every FreqDB database

add(page_id, page_freq)

Associate page_freq with page_id, where:

Parameters:
  • page_id (str) – a string which identifies the page
  • page_freq (float) – refresh frequency of the page (Hz)
clear()

Delete all contents

close()

Close connection and commit changes

delete(page_id)

Delete association

get_next_pages(n=1)

Return the next pages to crawl in order to maintain the desired frequencies

set(page_id, page_freq)

Change the frequency associated with page_id

And the SQLite implementation:

class crawlfrontier.contrib.backends.opic.freqdb.SQLite(db=None)

SQLite based implementation for FreqDBInterface

Make a new connection to a frequency database or, if None provided make a new in-memory

The justification of the optimal scheduler algorithm is quite lengthy and so is not included here. If you are interested, the details are described in the Optimal revisiting algorithm document.

Page change rate estimator

Right now it is assumed that page changes follow a Poisson distribution. The details are given in the description of the scheduler algorithm, Optimal revisiting algorithm.

As always any implementation satisfying the following interface can be used:

class crawlfrontier.contrib.backends.opic.freqest.FreqEstimatorInterface
add(page_id)

Add initial frequency estimation for page_id

page_id – A string which identifies the page

close()

Persist or flush all necessary information

delete(page_id)

Stop tracking page_id

frequency(page_id)

Return the estimated refresh frequency for the page. If the page is not being tracked return None

refresh(page_id, updated)

Add new refresh information for page_id

updated – A boolean indicating if the page has changed

And we already have a working SQLite implementation, which just counts the number of updates in a given time interval:

class crawlfrontier.contrib.backends.opic.freqest.Simple(db=None, clock=None, default_freq=0.0)

The simple estimator just computes the frequency as the total number of updates divided by the total observation time

Initialize estimator

Parameters:
  • db – updates database to use. If None is provided, create a new in-memory one.
  • clock – A function that returns elapsed time in seconds from a fixed origin in time.
  • default_freq – Return this frequency if not enough information is available to compute an estimate.
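The underlying computation is roughly the following (a sketch with hypothetical helper names, assuming we track when a page was first seen and how many changes were observed):

def simple_frequency(n_updates, first_seen, now, default_freq=0.0):
    # frequency = number of observed changes / total observation time
    elapsed = now - first_seen
    if elapsed <= 0 or n_updates == 0:
        return default_freq
    return n_updates / float(elapsed)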

The frequency estimator expects as input whether or not a page has changed, but it does not make any assumption about what constitutes a change. That is the responsibility of a page change detector, which has a very simple interface: given a page and its body, it reports whether the page has changed:

class crawlfrontier.contrib.backends.opic.pagechange.PageChangeInterface

Interface for all page change detectors

update(page_id, page_body)

Returns one of the change status for the page

The value returned is one of the fields of Status

The current implementation is very simple: it just tests for any change in the body of the page by computing its hash. Notice that very minor changes, which don't really add new content, will also be detected as changes.

class crawlfrontier.contrib.backends.opic.pagechange.BodySHA1(db=None)
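The idea behind this detector can be sketched as follows (a simplified illustration returning a boolean; the actual class returns one of the fields of Status):

import hashlib


def body_changed(stored_hashes, page_id, page_body):
    # Hash the page body and compare it with the previously stored hash, if any.
    digest = hashlib.sha1(page_body).hexdigest()
    previous = stored_hashes.get(page_id)
    stored_hashes[page_id] = digest
    return previous is not None and previous != digest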