Backends

The Frontier Backend is where the crawling logic and policies lie. It is responsible for receiving all the crawl info and selecting the next pages to be crawled. It is called by the FrontierManager after the Middleware, using hooks for Request and Response processing according to the frontier data flow.

Unlike Middleware, of which many different instances can be activated, only one Backend can be used per frontier.

Depending on the logic implemented, some backends require persistent storage to manage Request and Response object data.

Activating a backend

To activate a frontier backend component, set it through the BACKEND setting.

Here’s an example:

BACKEND = 'crawlfrontier.contrib.backends.memory.FIFO'

Keep in mind that some backends may need to be enabled through a particular setting. See each backend documentation for more info.

Writing your own backend

Writing your own frontier backend is easy. Each Backend component is a single Python class that inherits from Component.

The FrontierManager will communicate with the active Backend through the methods described below.

class crawlfrontier.core.components.Backend

Interface definition for a Frontier Backend

Methods

frontier_start()

Called when the frontier starts, see starting/stopping the frontier.

Returns: None.
frontier_stop()

Called when the frontier stops, see starting/stopping the frontier.

Returns: None.
add_seeds(seeds)

This method is called when new seeds are added to the frontier.

Parameters: seeds (list) – A list of Request objects.
Returns: None.
get_next_requests(max_n_requests)

Returns a list of the next requests to be crawled.

Parameters: max_n_requests (int) – Maximum number of requests to be returned by this method.
Returns: list of Request objects.
page_crawled(response, links)

This method is called each time a page has been crawled.

Parameters:
  • response (object) – The Response object for the crawled page.
  • links (list) – A list of Request objects generated from the links extracted from the crawled page.
Returns: None.

request_error(page, error)

This method is called each time an error occurs when crawling a page.

Parameters:
  • page (object) – The Request object for the page that failed to crawl.
  • error (string) – A string identifier for the error.
Returns: None.

Class Methods

classmethod from_manager(manager)

Class method called from FrontierManager passing the manager itself.

Example of usage:

@classmethod
def from_manager(cls, manager):
    return cls(settings=manager.settings)
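
As an illustration, a minimal in-memory backend implementing this interface could look like the sketch below. The deque-based queue and the class name are illustrative only and not part of Crawl Frontier:

from collections import deque

from crawlfrontier.core.components import Backend


class SimpleFIFOBackend(Backend):
    """Illustrative backend that crawls requests in arrival order."""

    def __init__(self, settings=None):
        self.settings = settings
        self.queue = deque()

    @classmethod
    def from_manager(cls, manager):
        return cls(settings=manager.settings)

    def frontier_start(self):
        pass

    def frontier_stop(self):
        pass

    def add_seeds(self, seeds):
        # Seeds are Request objects; enqueue them in arrival order.
        self.queue.extend(seeds)

    def get_next_requests(self, max_n_requests):
        requests = []
        while self.queue and len(requests) < max_n_requests:
            requests.append(self.queue.popleft())
        return requests

    def page_crawled(self, response, links):
        # Newly extracted links are scheduled after everything already queued.
        self.queue.extend(links)

    def request_error(self, page, error):
        # This sketch simply drops failed requests.
        pass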

Built-in backend reference

This page describes all the backend components that come with Crawl Frontier. For information on how to use them and how to write your own backend, see the backend usage guide.

To know which Backend is activated by default, check the BACKEND setting.

Basic algorithms

Some of the built-in Backend objects implement basic algorithms such as FIFO/LIFO or DFS/BFS for page visit ordering.

The differences between them lie in the storage engine used. For instance, memory.FIFO and sqlalchemy.FIFO use the same logic but different storage engines, as the example below shows.
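
For example, keeping the same FIFO visit order while switching storage engines is just a matter of changing the BACKEND setting:

BACKEND = 'crawlfrontier.contrib.backends.memory.FIFO'        # volatile in-memory heapq
# BACKEND = 'crawlfrontier.contrib.backends.sqlalchemy.FIFO'  # persisted through SQLAlchemy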

Memory backends

This set of Backend objects will use a heapq object as storage for basic algorithms.

class crawlfrontier.contrib.backends.memory.BASE

Base class for in-memory heapq Backend objects.

class crawlfrontier.contrib.backends.memory.FIFO

In-memory heapq Backend implementation of FIFO algorithm.

class crawlfrontier.contrib.backends.memory.LIFO

In-memory heapq Backend implementation of LIFO algorithm.

class crawlfrontier.contrib.backends.memory.BFS

In-memory heapq Backend implementation of BFS algorithm.

class crawlfrontier.contrib.backends.memory.DFS

In-memory heapq Backend implementation of DFS algorithm.

class crawlfrontier.contrib.backends.memory.RANDOM

In-memory heapq Backend implementation of a random selection algorithm.

SQLAlchemy backends

This set of Backend objects will use SQLAlchemy as storage for basic algorithms.

By default it uses an in-memory SQLite database as the storage engine, but any database supported by SQLAlchemy can be used.
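
As a hint of what that flexibility looks like, below are typical SQLAlchemy engine URLs; the backend setting that receives the URL is covered in the settings section, so treat these values as illustrative only:

'sqlite:///:memory:'                              # the default: in-memory SQLite
'sqlite:////tmp/frontier.db'                      # file-based SQLite (illustrative path)
'postgresql://user:password@localhost/frontier'   # PostgreSQL (illustrative credentials)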

Request and Response are represented by a declarative SQLAlchemy model:

class Page(Base):
    __tablename__ = 'pages'
    __table_args__ = (
        UniqueConstraint('url'),
    )
    class State:
        NOT_CRAWLED = 'NOT CRAWLED'
        QUEUED = 'QUEUED'
        CRAWLED = 'CRAWLED'
        ERROR = 'ERROR'

    url = Column(String(1000), nullable=False)
    fingerprint = Column(String(40), primary_key=True, nullable=False, index=True, unique=True)
    depth = Column(Integer, nullable=False)
    created_at = Column(TIMESTAMP, nullable=False)
    status_code = Column(String(20))
    state = Column(String(10))
    error = Column(String(20))

If you need to create your own models, you can do it by using the DEFAULT_MODELS setting:

DEFAULT_MODELS = {
    'Page': 'crawlfrontier.contrib.backends.sqlalchemy.models.Page',
}

This setting uses a dictionary where each key is the name of the model to define and each value is the path to the model class to use. If, for instance, you want to create a model to represent domains:

DEFAULT_MODELS = {
    'Page': 'crawlfrontier.contrib.backends.sqlalchemy.models.Page',
    'Domain': 'myproject.backends.sqlalchemy.models.Domain',
}
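
A hypothetical myproject.backends.sqlalchemy.models.Domain could follow the same declarative style as the Page model above. The fields and the standalone Base below are illustrative; in a real project you would register the model on the same declarative Base the built-in models use:

from sqlalchemy import Column, Integer, String, UniqueConstraint
from sqlalchemy.ext.declarative import declarative_base

# A fresh Base keeps this sketch self-contained; reuse the library's Base in practice.
Base = declarative_base()


class Domain(Base):
    __tablename__ = 'domains'
    __table_args__ = (
        UniqueConstraint('name'),
    )

    name = Column(String(255), primary_key=True, nullable=False)
    pages_crawled = Column(Integer, nullable=False, default=0)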

Models can be accessed through the Backend's models dictionary attribute.

For a complete list of the settings used by the SQLAlchemy backends, check the settings section.

class crawlfrontier.contrib.backends.sqlalchemy.BASE

Base class for SQLAlchemy Backend objects.

class crawlfrontier.contrib.backends.sqlalchemy.FIFO

SQLAlchemy Backend implementation of FIFO algorithm.

class crawlfrontier.contrib.backends.sqlalchemy.LIFO

SQLAlchemy Backend implementation of LIFO algorithm.

class crawlfrontier.contrib.backends.sqlalchemy.BFS

SQLAlchemy Backend implementation of BFS algorithm.

class crawlfrontier.contrib.backends.sqlalchemy.DFS

SQLAlchemy Backend implementation of DFS algorithm.

class crawlfrontier.contrib.backends.sqlalchemy.RANDOM

SQLAlchemy Backend implementation of a random selection algorithm.

OPIC backend

The OPIC backend takes its name from the “Online Page Importance Computation” algorithm, described in:

Adaptive On-Line Page Importance Computation
Abiteboul S., Preda M., Cobena G.
2003

The main idea is that we want to crawl pages which are important, and this importance depends on which pages the current page links to and which pages link to it, in a manner similar to PageRank. The implementation is in:

class crawlfrontier.contrib.backends.opic.backend.OpicHitsBackend(manager, db_graph=None, db_pages=None, db_hits=None, scheduler=None, freq_estimator=None, change_detector=None, test=False)

Frontier backend implementation based on the OPIC algorithm adapted to HITS scoring.

Parameters:
  • manager (FrontierManager) – Frontier manager.
  • db_graph (GraphInterface) – Graph database. If None, use a new instance of graphdb.SQLite.
  • db_pages (PageDBInterface) – Page database. If None, use a new instance of pagedb.SQLite.
  • db_hits (HitsDBInterface) – HITS database. If None, use a new instance of hitsdb.SQLite.
  • scheduler (SchedulerInterface) – Decides which page to crawl next.
  • freq_estimator (FreqEstimatorInterface) – Frequency estimator.
  • change_detector (PageChangeInterface) – Change detector.
  • test (bool) – If True compute h_scores and a_scores prior to closing.

For more information about the implementation details, read the OPIC details annex.

Configuration

The constructor has a lot of arguments, but they will be filled automatically from the global settings. Apart from the settings it shares with other backends, this backend uses the following additional settings (a combined example follows the list):

  • BACKEND_OPIC_IN_MEMORY (default False)

    If True, all information will be kept in memory. This will make it run faster, but it will also consume more memory and, more importantly, you will not be able to resume the crawl after you shut down the spider.

  • BACKEND_OPIC_WORKDIR

    If BACKEND_OPIC_IN_MEMORY is False, then all the state information necessary to resume the crawl will be kept inside this directory. If this setting is not set, a default directory will be generated in the current directory following the pattern:

    crawl-opic-DYYYY.MM.DD-THH.mm.SS
    

    Where YYYY is the year, MM the month, DD the day, HH the hour, mm the minutes and SS the seconds.

  • BACKEND_OPIC_SCHEDULER

    ‘optimal’ will use the Optimal scheduler. Any other value, or leaving it unset, will use BestFirst.

  • BACKEND_TEST

    If True, the backend will save additional information that can be inspected after the databases are closed, in order to test the backend's performance.
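
Put together, a settings module enabling the OPIC backend might look like the following; the values are illustrative only:

BACKEND = 'crawlfrontier.contrib.backends.opic.backend.OpicHitsBackend'
BACKEND_OPIC_IN_MEMORY = False
BACKEND_OPIC_WORKDIR = 'crawl-opic-example'   # illustrative directory name
BACKEND_OPIC_SCHEDULER = 'optimal'            # anything else falls back to BestFirst
BACKEND_TEST = False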

Persistence

The following SQLite databases are generated inside the BACKEND_OPIC_WORKDIR:

  • graph.sqlite: the graph database. See Graph database.
  • freqs.sqlite: the output of the scheduler. How frequently each page should be crawled. See Revisiting scheduler.
  • hash.sqlite: a hash of the body of each page the last time it was visited. It allows tracking whether a page has changed. See Page change rate estimator.
  • hits.sqlite: the HITS score information and additional information for the OPIC algorithm. See HITS scores database.
  • pages.sqlite: additional info about pages, like URL and domain.
  • scheduler.sqlite: page value and page change rate used by the scheduler algorithm. See Revisiting scheduler.
  • updates.sqlite: number of updates in a given interval of time. Used for page change rate estimation. See Page change rate estimator.

Since they are standard SQLite databases, they can be accessed using any tool of your choice (for example, sqlitebrowser), which is useful for debugging or interfacing with other tools. An example would be accessing the data inside the databases to compare the precision of OPIC and the power method, as explained in OPIC vs power method.
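
As a quick illustration, the tables of one of these databases can be listed from Python with the standard sqlite3 module; the workdir path below is illustrative and the table names are not documented here:

import sqlite3

# List the tables of one of the generated databases (path is illustrative).
conn = sqlite3.connect('crawl-opic-D2015.01.01-T00.00.00/pages.sqlite')
for (name,) in conn.execute("SELECT name FROM sqlite_master WHERE type='table'"):
    print(name)
conn.close()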