
Basic Site Crawler

Search engines have always interested me, and I’ve often wondered how they’re put together. They consist of a few simple parts:

  • crawler,
  • scraper,
  • query engine, and
  • ranking system.

I decided to set myself a simple project to build one that could scrape a domain and store content for all the pages. This post will be focused on the crawler/scraper aspect of it.

Scrape A Page

Initially, all we want to do is download a single page and scrape the text from it. This can be done quite easily with a few lines of Python; all you’ll need to install are the requests and lxml packages:

from lxml import html
import requests
import sys

# Take the target URL as a command-line argument and download the page
url = sys.argv[1]
page = requests.get(url)

# Parse the HTML and pull out every text node inside the body
tree = html.fromstring(page.content)
text = tree.xpath('//body//text()')

print('Text:', str(text))

All this does is download the page, build an HTML tree from it, find all the text nodes (not including the HTML tags themselves), and print them out.
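Because //text() returns every text node, the output also includes plenty of whitespace-only strings from between tags. If you want something more readable, a quick optional clean-up might look like this (this step is just a sketch, not part of the script above):

# Drop whitespace-only nodes and join the rest into one readable string
cleaned = ' '.join(chunk.strip() for chunk in text if chunk.strip())
print('Text:', cleaned)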

Scrape A Domain

A bit more complex is recursively scraping a domain: following every link on the site while making sure you’re not following the same links twice (deduplication). For this, you’ll need to install RabbitMQ, Redis, and Celery.
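The deduplication side of this uses Redis as a shared "already seen" set: check for a key before scraping a URL, and set the key once the page has been scraped. Stripped down to just that pattern, it might look something like this (the full crawler below does the same thing per URL):

import redis

r = redis.Redis(host='localhost', port=6379, db=0)

url = 'https://adamogrady.id.au/'
if not r.exists(url):
    # ... fetch and scrape the page here ...
    r.set(url, 1)  # mark the URL as visited so it isn't scraped again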

Now let’s modify the previous code to look like below:

from celery import Celery
from lxml import html
import redis
import requests

# Redis keeps track of which URLs have already been scraped;
# RabbitMQ (via Celery) queues up the pages still to be scraped
r = redis.Redis(host='localhost', port=6379, db=0)
app = Celery('tasks', broker='pyamqp://guest@localhost//')

@app.task
def scrape(url):
    # Skip pages we've already seen
    if r.exists(url):
        return str(url)

    page = requests.get(url)
    tree = html.fromstring(page.content)
    r.set(url, 1)

    links = tree.xpath('//a/@href')
    text = tree.xpath('//body//text()')

    # Queue up any links on this page that belong to the domain
    # and haven't been scraped yet
    for link in links:
        if 'https://adamogrady.id.au' in link and r.exists(link) == 0:
            scrape.delay(link)
    return str(url)

scrape.delay('https://adamogrady.id.au/')

When the worker is started with celery -A [file name without .py] worker --loglevel=info, the project sets up a Celery task and queues it with the starting URL https://adamogrady.id.au/. Each task checks whether the page has already been scraped (exiting early if so), then requests and scrapes the page, and finally walks through all the links on the page, queuing each one up if it’s under the right domain (although a stricter check that the link actually starts with the domain would be better, see the sketch below) and hasn’t already been scraped.
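Since the substring check will also match external URLs that merely contain the domain somewhere in them, here’s a minimal sketch of what that stricter check could look like using the standard library’s urllib.parse (this helper is just an illustration, not part of the code above):

from urllib.parse import urlparse

def same_domain(link, domain='adamogrady.id.au'):
    # Only follow links whose host is exactly the target domain
    return urlparse(link).netloc == domain

# Inside the loop, the condition would then become:
# if same_domain(link) and r.exists(link) == 0:
#     scrape.delay(link)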

Store The Data

Lastly we need to store the data somewhere. In this case I’m using Elasticsearch, which will also be our query engine and provide our ranking system later on. Once you have Elasticsearch installed and the appropriate Python module ready, let’s modify our code a bit more:

from celery import Celery
from elasticsearch import Elasticsearch
from lxml import html
import redis
import requests

r = redis.Redis(host='localhost', port=6379, db=0)
app = Celery('tasks', broker='pyamqp://guest@localhost//')
es = Elasticsearch()

@app.task
def scrape(link):
    # Skip pages we've already seen
    if r.exists(link):
        return str(link)

    page = requests.get(link)
    tree = html.fromstring(page.content)

    links = tree.xpath('//a/@href')
    text = tree.xpath('//body//text()')

    # Store the page's URL and text in Elasticsearch
    doc = {
        'link': link,
        'text': ','.join(text)
    }
    es.index(index="test-search", doc_type='page', body=doc)
    r.set(link, 1)

    # Queue up any unscraped links on the same domain
    for single_link in links:
        if 'https://adamogrady.id.au' in single_link and r.exists(single_link) == 0:
            scrape.delay(single_link)
    return str(link)

scrape.delay('https://adamogrady.id.au/')

You’ll notice we’re now storing data in an Elasticsearch index (test-search) so it can later be queried. This code should work as-is, and you can replace the domain with another URL to scrape whatever site you like.
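As a quick sanity check that pages are actually landing in the index, a simple match query against test-search might look something like this (the search term is just an example, and the exact call can vary between Elasticsearch client versions):

from elasticsearch import Elasticsearch

es = Elasticsearch()

# Search the 'text' field of the pages we've indexed so far
res = es.search(index="test-search", body={
    "query": {"match": {"text": "crawler"}}
})

for hit in res['hits']['hits']:
    print(hit['_score'], hit['_source']['link'])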

Future Improvements

There’s a bunch of small improvements that could still be made to this search:

  • Moving to Beautiful Soup instead of lxml for better text scraping
  • Indexing headers as a separate key in the Elasticsearch doc
  • Indexing the entire page, including HTML tags
  • Separating the XPath/Beautiful Soup work into its own task
  • Expanding any shortened forms of URLs ('/about') to the full form ('https://adamogrady.id.au/about'), then checking the domain and removing any anchors for the same page (thanks @__eater__!); a rough sketch of this is below.
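For that last point, a minimal sketch of how relative links could be expanded and page anchors stripped using the standard library’s urllib.parse (the helper name and domain check are just illustrative):

from urllib.parse import urljoin, urldefrag, urlparse

def normalise_link(base_url, href):
    # Expand relative links like '/about' against the page they came from
    absolute = urljoin(base_url, href)
    # Drop any '#anchor' so the same page isn't queued more than once
    absolute, _fragment = urldefrag(absolute)
    # Only keep links that belong to the target domain
    if urlparse(absolute).netloc == 'adamogrady.id.au':
        return absolute
    return None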