OPIC Details¶
HITS algorithm¶
Introduction¶
Since OPIC is just a way of computing the HITS score for each page we must first understand what the HITS score is. HITS stands for Hyperlink-Induced Topic Search, originally described in:
Authoritative sources in a hyperlinked environment
Kleinber J.
1998
The idea is to compute for each page a pair of numbers called the Hub and Authority scores. Intutively we say a page is a hub when it points to lot of pages with high authority, and we say that a page has high authority if it is pointed by many hubs. Mathematically this is expressed as:
Where \(h_i\) represents the hub score of page \(i\) and \(a_i\) represents its authority score. \(j: i \to j\) represent the set of pages \(j\) that are pointed by \(i\) and \(j: j \to i\) represents the set of pages \(j\) that point to \(i\). For example, consider the following set of pages:
The equations to compute the scores would be:
The above equations can be rewritten in matrix form. Let’s define:
\(L\) is called the link matrix because it contains all the link information available in the graph. Notice that \(L_{ij}\) is 1 if there is a link between nodes \(i\) and \(j\) and 0 otherwise. If we group the hub and authorities scores in the vectors \(h\) and \(a\) we can write the HITS equations as:
The above equations always have the trivial solutions \(h=a=0\). The other non-zero solutions are given by the eigenvectors of the matrices \(H\) and \(A\).
Power iteration¶
A very simple and efficient way of getting the eigenvector with the largest eigenvalue is using the power method: to compute the largest eigenvector \(v\) of a matrix \(M\) simple iterate long enough starting from a random vector \(v^0\)
Notice that unless the associated eigenvalue is exactly 1 you will need to re-normalize after each iteration.
Let’s apply this method to our toy problem. Using numpy the code for the power method is very simple:
import numpy as np
def power(M, eps=1e-6, max_iter=1000):
"""Compute the largest eigenvector of M.
Returns: the eigenvector and the eigenvalue
"""
v0 = np.random.rand(M.shape[0])
converged = False
for i in xrange(max_iter):
v1 = M.dot(v0)
v1 /= np.linalg.norm(v1)
err = np.max(np.abs(v1 - v0))
if err < eps:
converged = True
break
v0 = v1
if not converged:
print "Warning: didn't converge"
return v1, M[0, :].dot(v1)/v1[0]
And we can apply it to our toy problem:
L = np.array([
0, 1, 0, 0,
0, 0, 0, 1,
1, 1, 0, 1,
0, 0, 0, 0]).astype(float).reshape(4,4)
H = L.dot(L.transpose())
A = L.transpose().dot(L)
h, h_eigv = power(H)
a, a_eigv = power(A)
print "Hub scores: ", h
# Hub scores: [ 0.32505736 0.3250578 0.88807383 0. ]
print "Authority scores: ", a
# Authority scores: [ 0.45970084 0.62796324 0. 0.62796282]
If we represent the scores directly on the graph we see how the scores correctly represent our intuitive notion of what a hub or an authoritative page represents:
Notice that the power method can fail. Take for example the matrix \(L\). No matter what initial value we start iterating from, it will always converge to a zero vector. To see why notice the following identity which gives name to the power method:
Now let’s compute the different powers of \(L\)
The reason for this behavior is that the only eigenvalue of \(L\) is 0. There is however a non-zero eigenvector associated to the 0 eigenvalue:
But the power method fails to find it. A trick to solve this problem is to apply a small perturbation \(\epsilon\) to \(L\) and apply the power method. It should converge to an small perturbation of the original eigenvector. Let’s try:
print power(L + 1e-5)
# (array([ 5.52652292e-02, 3.23859608e-03, 9.98466441e-01,
# 1.79833785e-04]), 0.058792257444600315)
As you can see it returns a result very close to the true answer. The perturbated eigenvector is \((0.055, 0.003, 0.998, 0.000)\) and the perturbated eigenvalue is 0.06.
We finally end making some comments:
The power method is efficient because the link matrix is sparse, meaning that most of its elements are zero. However we need to have it fully loaded in memory to perform an iteration of the algorithm.
We don’t need to compute the \(H\) and \(A\) matrices and instead we can perform the following iteration:
\[\begin{split}h_{k+1} &= La_k \\ a_{k+1} &= L^Th_k\end{split}\]Which can also be rewritten as:
\[\begin{split}\begin{pmatrix} h \\ a \end{pmatrix}^{k+1} = J \begin{pmatrix} h \\ a \end{pmatrix}^{k}\end{split}\]
Where:
\[\begin{split}J = \begin{bmatrix} 0 & L \\ L^T & 0 \end{bmatrix}\end{split}\]And we can check that the power method applied to matrix \(J\) gives the same answer as before
N = L.shape[0] J = np.zeros((2*N, 2*N)) J[:N, N:] = L J[N:, :N] = L.transpose() h, a = np.split(power(J, max_iter=1000)[0], 2) h /= np.linalg.norm(h) a /= np.linalg.norm(a) print "Hub scores: ", h # Hub scores: [ 0.32505758 0.32505758 0.88807383 0. ] print "Authority scores: ",a # Authority scores: [ 0.45970084 0.62796303 0. 0.62796303]Actually, it gives a warning about convergence. The reason is that the vector built by stacking together \(h\) and \(a\) oscillates between the following two vectors:
\[\begin{split}\begin{matrix} (0.200 & 0.200 & 0.545 & 0 & | & 0.363 & 0.496 & 0 & 0.496) \\ (0.257 & 0.257 & 0.701 & 0 & | & 0.282 & 0.386 & 0 & 0.386) \end{matrix}\end{split}\]However, each half part of the above vectors are equal up to a scaling constant, and so the method converge to the correct scores.
OPIC algorithm¶
Introduction¶
OPIC is just an alternative to the power method to compute the largest eigenvector and so it can be applied to compute the HITS score. It is described here:
Adaptive On-Line Page Importance Computation
Abiteboul S., Preda M., Cobena G.
2003
Given a page graph, the idea is to maintain for each page two quantities: the cash and the history. When we update a page the cash is distributed to its children and the the total quantitiy is added to the page history. For example, returning to our toy example let’s suppose that at a given instant in time we have the current state of cash and history associated to each page:
And we update page 3. Since it has 6 units of cash it will distribute 2 units to each one of its 3 children. It will also increment its history by 6 units, since this is the total amount of cash it has distributed and its cash will be reset to 0. The following image shows the change in the graph state after a single update.
If we keep performing these updates, making sure that we update all pages, the history of the pages will converge, after normalization, to same value of the power method applied to the following matrix:
You can see that \(\bar{L}\) is very similar to \(L^T\) since its actually the same matrix transposed, where each column has been divided by a normalizing factor, so as to make each column add to 1.
The utility of the OPIC method lies in the fact that it does not matters the order in which we update the pages, nor it matters if we update some pages more than others, as long as all pages are updated periodically.
However, we must remember that the power method sometimes fail to converge and we solved it by adding an small perturbation to the link matrix. We make an equivalente modification to the OPIC algorithm by adding an small modification to the graph: we add a virtual page that links and is linked by all other pages. For our toy problem it would look like this:
Let’s check our claims with some code. The following code will apply the OPIC algorithm to an arbitrary graph. For simplicity the graph is represented as a dictionary where the key is the node name and the value is a list of outgoing nodes.
def add_virtual_page(graph):
new_graph = {node: children + ['V']
for node, children in graph.iteritems()}
new_graph['V'] = graph.keys()
return new_graph
example = add_virtual_page(
{1: [2],
2: [4],
3: [1, 2, 4],
4: []}
)
We now implement the algorithm, where the next page to update is randomly selected.
import random
def opic(graph, n_iter=100):
"""Simple implementation of OPIC where all children get the
same cash"""
# get the list of nodes
nodes = graph.keys()
# build a dictionary to hold cash and history
# we give each page an initial cash of 1
state = {node: (1.0, 0.0) for node in nodes}
for i in xrange(n_iter):
node = random.choice(nodes)
cash, history = state[node]
children = graph[node]
# distribute cash to children
n_children = len(children)
for child in children:
child_cash, child_history = state[child]
state[child] = (child_cash + cash/n_children,
child_history)
# update node state
state[node] = (0.0, history + cash)
# return normalized history
history = [state[n][1] for n in nodes]
total = sum(history)
return zip(nodes, [h/total for h in history])
To compare against the power method we build the equivalent link matrix, where the last row/column corresponds to the virtual page. First without normalization:
And after transposition and normalization:
We can compare now the two methods:
for node, score in opic(example, n_iter=1000):
print node, score
# 1 0.198500025862
# 2 0.298524016644
# 3 0.157148867939
# 4 0.345827089555
L = np.array([
0, 1, 0, 0, 1,
0, 0, 0, 1, 1,
1, 1, 0, 1, 1,
0, 0, 0, 0, 1,
1, 1, 1, 1, 0]).astype(float).reshape(5, 5)
p, k = power(make_stochastic(L.transpose()))
p = p[:4]
p /= np.linalg.norm(p, ord=1)
print p
# [ 0.19801978 0.29702981 0.15841572 0.34653469]
So, to summarize, if we have a matrix \(M\), where each rows sums to 1, we can compute its largest eigenvector either by using the power method on that matrix or by applying the OPIC algorithm in an associated graph which has an edge from \(i\) to \(j\) if \(M_{ji} \neq 0\). When node \(i\) with cash \(c_i\) inside the node is updated it will give \(M_{ji}\cdot c_i\) cash to node \(j\).
Application to HITS¶
Notice that we really cannot apply OPIC as HITS without modifying the problem, since we have the requirement of all columns (or rows, depending what you consider) summing to 1. This is a requirement for the OPIC algorithm to work, since it assures that the amount of cash inside the graph remains constant. Otherwise the cash would go to 0 or to \(\infty\).
We will now use 4 scores for each page: hub cash and history and authority cash and history. When we update a page it will send its hub cash distributed to all its predecessors’ authority cash, and add it to its hub history. It will send also its authority cash distributed to all its children’s hub cash and add it to its authority history.
The following figure shows a single step of the algorithm applied to our toy problem, where we have ignored the virtual page to simplify things. Actually it’s not necessary to update in the same step the two scores however we show here both the hub and authority cash flow.
Notice that with this algorithm what remains constant after each step is the total amount of cash, that is, the sum of the hub and authority cash.
If you want to see how OPIC compares against the power method the following document shows a comparison between the two:
Adaptive OPIC¶
The adaptive version of the OPIC algorithm takes into account the case where the graph is changing. In this case the pages that were from the beginning inside the graph have accumulated more history than the newer ones, even if the later are more important. Of course, if we freeze the graph this unbalance will in the long term be corrected but it could take lot of time.
One possible solution would be to discard all the pages history and restart again, but this is overkill since probably we are already near to the correct solution. The adaptive version of OPIC proposes a middle ground: to discard the history before a certain time starting from now. To do this exactly we would need to save the history of each page at every point of time. A more efficient alternative is to estimate how much history we should discard. This is called in the original paper history interpolation and is the approach that has been taken in this implementation.
Notice that the adaptive version has introduced a new parameter: a time window outside of which we should ignore history.
Adding external relevance measure¶
Although once of the nice properties of HITS scoring is that it only requires link information it has been demonstrated that adding content based information can greatly improve the effectivenes of topic focused crawlers. There are lots of ways in which we could fusion an external relevance score and the HITS scores to give a global score of page relevance and we will describe now the one developed in this case.
We will consider that the external relevance score is another measure of the page authority (a good hub can be empty of content, as long as it points to relevant one) and ranges between 0 (not relevant at all) and 1 (completely relevant).
With this in mind we modify the distribution of authority cash when we update a page. Let’s call \(r_i\) the relevance score for page \(i\), which has \(P_i\) parent pages pointing to it. If page \(i\) is relevant we want to distribute all its authority cash back to its parents, but if it’s not relevant we want to avoid distributing any cash back. Since we don’t want to destroy cash what we will do instead is to give that cash to the virtual page.
Mathematically, if we have \(a_i\) authority cash:
- Give \(a_i\cdot z(r_i)\) cash to each parent page.
- Give \(a_i[1 - P_i\cdot z(r_i)]\) cash to the virtual page.
Where \(z\) is a function of the page relevance and must satisfy the following conditions:
Notice that the third condition is equivalent to saying that when the relevance is 0.5 the algorithm is equivalent to OPIC without external relevance score.
One possible function \(z(r)\) can be build by considern the second order polynomial fitting all the above points:
Implementation¶
The algorithm described in the previous sections is implemented inside this class:
- class crawlfrontier.contrib.backends.opic.opichits.OpicHits(db_graph=None, db_scores=None, time_window=None, db_relevance=None)¶
Implements the OPIC algorithm applied to the HITS scores problem
Parameters: - db_graph (GraphInterface) – Read only graph database. If None create a new one using SQLite
- db_scores (HitsDBInterface) – Scores database. If None create a new one using SQLite
- time_window (float) – Ignore cash flow out of this time window. Set to False/None to ignore.
- db_relevance (RelevanceDBInterface) – Read only relevance database. If None create a new one using SQLite
- a_mean¶
Mean of authority scores
- add_page(page_id)¶
Add a new page
Parameters: page_id (str) – Page identification Returns: HitsScore – The new score assigned to the page
- close()¶
Close any associated database
- get_scores(page_id)¶
Normalized hub and authority score
Parameters: page_id (str) – Page identification Returns: (float, float) – A tuple (hub score, authority score) for the given page_id
- h_mean¶
Mean of hub scores
- iscores()¶
Iterate over (page id, hub score, authority score)
- mark_update(page_id)¶
Add this to the list of pages to update
To decide which pages we should update next we uuse an heuristic: we select the pages with the highest accumulated authority or hub cash. This function makes possible to externally add a given page to the set of pages to be updated, irrespective of its accumulated cash.
Parameters: page_id (str) – Page identification
- update(n_iter=1)¶
Run a full iteration of the OPIC-HITS algorithm
Parameters: n_iter (int) – number of iterations Returns: pair of lists – The first one contains pages with an updated hub score and the second ones with update authority score.
As you can see it has one optional paramter, the time window of the adaptive version, and 3 external database components:
The facilitate adding different database backends their interface has been extracted in different abstract metaclasses.
Graph database¶
Every class implementing the following interface can be a drop-in replacement for the included SQLite backend. It is not a requirement to inherit from this class, but is recommended for checking the interface at compile time.
Notice that this database is only accessed from opichits.OpicHits via read operations.
- class crawlfrontier.contrib.backends.opic.graphdb.GraphInterface¶
Interface definition for a Graph database
The graph does not store any aditional data inside the nodes or edges, only the link structure. All the function below just represent nodes as unique identifier strings and edges as pairs of nodes (python tuples).
- add_edge(start, end)¶
Add a new edge to the graph, from start to end
- add_node(node)¶
Add a new node
- clear()¶
Delete all contents
- close()¶
Close connection and commit changes
- delete_edge(node)¶
Delete edge
- delete_node(node)¶
Delete node, and all edges connecting to this node
- end_batch()¶
Commit pending changes
- has_node(node)¶
True if node present inside graph
- iedges()¶
An iterator for all the edges
- inodes()¶
An iterator for all the nodes
- nedges()¶
Number of edges
- neighbours(node)¶
A set of nodes going to or from ‘node’
- nnodes()¶
Number of nodes
- predecessors(node)¶
A list of the predecessors for the given node
- start_batch()¶
Do not commit changes to the graph until this batch ends
- successors(node)¶
A list of the successors for the given node
The following class is a simple implementation of the above interface using SQLite. It maintains a table with edges and a table with nodes. The table is fully indexed for performance.
- class crawlfrontier.contrib.backends.opic.graphdb.SQLite(db=None)¶
SQLite implementation of GraphInterface
Relevance database¶
If this database is provided opichits.OpicHits will consult the score associated to each page to decide how to distribute the authority score.
Notice that this database is only accessed from opichits.OpicHits via read operations.
As usual, there is an interface which all backends must satisfy:
- class crawlfrontier.contrib.backends.opic.relevancedb.RelevanceDBInterface¶
Interface definition for relevance databases.
Page ID’s are unique string identifiers for pages and page scores are float numbers between 0 and 1.
- add(page_id, page_score)¶
Add a new association
- clear()¶
Delete all contents
- delete(page_id)¶
Delete page
- get(page_id)¶
Get score for the given page
- get_best_scores(n)¶
Get the n highest scores as a list of tuples of type (page_id, page_score)
- iscores()¶
An iterator over all tuples (page_id, page_score)
- set(page_id, page_score)¶
Change score
And there is also a batteries included SQLite implementation of the interface:
- class crawlfrontier.contrib.backends.opic.relevancedb.SQLite(db=None)¶
A SQLite implementation for RelevanceDBInterface
HITS scores database¶
The HITS database contains the state of the OPIC algorithm. Since there are several values to maintain for each page a new class has been created for storing them:
- class crawlfrontier.contrib.backends.opic.hitsdb.HitsScore(h_history, h_cash, h_last, a_history, a_cash, a_last)¶
Just a container for the following (modifiable) fields:
Parameters: - h_history – Accumulated hub cash
- h_cash – Non-spent hub cash
- h_last – Total hub cash the last time it was updated
- a_history – Accumulated authority cash
- a_cash – Non-spent authority cash
- a_last – Total authority cash the last time it was updated
First, the interface.
- class crawlfrontier.contrib.backends.opic.hitsdb.HitsDBInterface¶
Interface definition for every HITS database.
The keys of this database are page id’s, unique string identifiers for each page, and the values are HitsScore instances.
- add(page_id, page_score)¶
Associate page_score with page_id
- clear()¶
Delete all contents
- close()¶
Close connection and commit changes
- delete(page_id)¶
Delete association
- get_a_total()¶
Total accumulated authority score
- get_count()¶
Get number of scores in database
- get_h_total()¶
Total accumulated hub score
- get_highest_a_cash(n=1)¶
Get the highest authority cash
- get_highest_h_cash(n=1)¶
Get the highest hub cash
- increase_a_cash(page_id_list, a_cash)¶
Increase the cash in the given pages in this amount
- increase_all_cash(h_cash, a_cash)¶
Increase the cash in all pages in this amount
- increase_h_cash(page_id_list, h_cash)¶
Increase the cash in the given pages in this amount
- iteritems()¶
Iterate over hits scores
Returns: iterator over pairs – Each pair has the form (page_id, page_score)
And of course, the SQLite already provided implementation:
- class crawlfrontier.contrib.backends.opic.hitsdb.SQLite(db=None)¶
SQLite based implementation for HitsDBInterface
Make a new connection to a HITS scores database or, if None provided make a new in-memory
Scheduler¶
Introduction¶
Having a measure of page importance is only the first step. Now we need to consider which page should we be crawling next. In order for the scheduler to take that decision it must have some information about the pages. We will assume that it has two items of information about each page: the change rate of the page and a measure of the page value. With this information it must give back to us the next pages to crawl. With these considerations we define the interface that all schedulers must satisfy:
- class crawlfrontier.contrib.backends.opic.scheduler.SchedulerInterface¶
Interface that must be satisfied by all schedulers
- close()¶
Close all databases
- delete(page_id)¶
Remove page from scheduler
- get_next_pages(n_pages)¶
Return next pages to crawl
- set_rate(page_id, rate_new)¶
Set change rate for given page
- set_value(page_id, value_new)¶
Set page value for given page
BestFirst¶
One possible strategy for an scheduler is just to be greedy: visit the next page in the frontier that gives the most value. This type of crawler is implemented in:
- class crawlfrontier.contrib.backends.opic.scheduler.BestFirst(rate_value_db=None)¶
A BestFirst crawler always return the next page with highest value.
To be really BestFirst get_next_pages should be called with n_pages=1. However, the crawler runs OK if its asked for the next best n_pages.
Parameters: rate_value_db (SchedulerDBInterface .schedulerdb.SchedulerDBInterface) – database to store page_id, rate, value triplets
Revisiting scheduler¶
BestFirst as is has one obvious limitation: it doesn’t revisit web pages since once a page is crawled it is deleted from the frontier. We could modify it, adding an estimation of each web page change rate and then reconsidering a page after some time has passed depending.
Another possibility is to use an scheduler which is not greedy, but optimal in the sense that maximizes the long term effectiviness of the crawler if some assumptions about page change rate hold. This scheduler is implemented inside:
- class crawlfrontier.contrib.backends.opic.scheduler.Optimal(n_clusters=100, rate_value_db=None, freq_db=None)¶
Compute the optimal refresh frequency, with a fixed computational cost
Parameters: - n_clusters (int) – number of clusters to use for the approximation
- rate_value_db (SchedulerDBInterface .schedulerdb.SchedulerDBInterface) – database to store page_id, rate, value triplets
- freq_db (FreqDBInterface .freqdb.FreqDBInterface) – database to store the scheduler solution
- crawl_rate¶
Get crawl rate
- frequency(page_id)¶
Get the optimal refresh frequency for a page
As you can see there are two databases associated to the scheduler:
The scheduler database is just for maintaining an association between pages and page change rates and values. As usual it can be anything that adheres to the following interface:
- class crawlfrontier.contrib.backends.opic.schedulerdb.SchedulerDBInterface¶
- add(page_id, page_rate, page_value)¶
Add a new association
- clear()¶
Delete all contents
- get(page_id)¶
Get (page_rate, page_value) for the given page
- get_best_value(n_pages=1, delete=False)¶
Get the pages with highest value
Parameters: - n_pages (int) – number of pages to retrieve
- delete (bool) – if True remove the retrieves pages from the database
- iter()¶
An iterator over all tuples (rate, value)
- set(page_id, page_rate, page_value)¶
Change association
Of course there is an already provided SQLite implementation:
- class crawlfrontier.contrib.backends.opic.schedulerdb.SQLite(db=None)¶
A SQLite implementation for the SchedulerDBInterface
The “freqs” database associates pages with optimal frequencies as computed by the scheduler. However it not only takes stores the association but also allows to query what is the next page to be crawled.
The interface is defined in:
- class crawlfrontier.contrib.backends.opic.freqdb.FreqDBInterface¶
Interface definition for every FreqDB database
- add(page_id, page_freq)¶
Associate page_freq with page_id, where:
Parameters: - page_id (str) – an string which identifies the page
- page_freq (float) – refresh frequency of the page (Hz)
- clear()¶
Delete all contents
- close()¶
Close connection and commit changes
- delete(page_id)¶
Delete association
- get_next_pages(n=1)¶
Return the next pages, to maintain the desired frequency
- set(page_id, page_freq)¶
Change the frequency associated with page_id
And the SQLite implementation:
- class crawlfrontier.contrib.backends.opic.freqdb.SQLite(db=None)¶
SQLite based implementation for FreqDBInterface
Make a new connection to a frequency database or, if None provided make a new in-memory
The justification of the optimal scheduler algorithm is quite lengthy and so is not included here. If you are interested in the details the it is described here:
Page change rate estimator¶
Right now it is assumed that the pages probability of change follow a Poisson distribution. The details are given in the description of the scheduler algorithm Optimal revisiting algorithm.
As always any implementation satisfying the following interface can be used:
- class crawlfrontier.contrib.backends.opic.freqest.FreqEstimatorInterface¶
- add(page_id)¶
Add initial frequency estimation for page_id
page_id – An string which identifies the page
- close()¶
Persist or flush all necesary information
- delete(page_id)¶
Stop tracking page_id
- frequency(page_id)¶
Return the estimated refresh frequency for the page. If the page is not being tracked return None
- refresh(page_id, updated)¶
Add new refresh information for page_id
updated – A boolean indicating if the page has changed
And we already have a working SQLite implementation, which just counts the number of updates in a given time interval:
- class crawlfrontier.contrib.backends.opic.freqest.Simple(db=None, clock=None, default_freq=0.0)¶
The simple estimator just computes the frequency as the total number of updates divided by the total observation time
Initialize estimator
- Arguments:
- db – updates database to use. If None provided create a new
- in-memory one.
- clock – A function that returns elapsed time in seconds from a
- fixed origin in time.
- default_freq – Return this frequency if not enough info available to
- compute an estimate
The frequency estimator expects as input whether or not a page has changed but does not make any assumption of what constitutes a change since this is the responsability of a page change estimator, which has a very simple interface: given a page and its body returns True if the change has changed:
- class crawlfrontier.contrib.backends.opic.pagechange.PageChangeInterface¶
Interface for all page change detectors
- update(page_id, page_body)¶
Returns one of the change status for the page
The value returned is one of the fields of Status
The current implementation is very simple since it just tests for any change in the body of the page by computing its hash. Notice that very simple changes, which don’t really add new content, will be detected as a change.
- class crawlfrontier.contrib.backends.opic.pagechange.BodySHA1(db=None)¶