27-29 November

Conference about Big Data, High Load, Data Science, Machine Learning & AI









DigitalPebble, UK




Julien Nioche runs DigitalPebble Ltd, a consultancy based in Bristol, UK, specialising in open source solutions for text engineering. He is a member of the Apache Software Foundation and a committer on Apache Nutch and various other projects. His expertise covers web crawling, natural language processing, machine learning and search.


Crawling the Web at Scale with Elasticsearch

Search systems such as Elasticsearch or Apache Solr are often used by web crawlers to index and search the content of web pages. However, they can also serve as a backend for storing information about the URLs known to the crawler. By doing so, we benefit from the scalability of these tools and from the insights they provide into our data, e.g. by being able to search our crawl data in real time and monitor the content of the crawl with tools like Kibana. Web crawling also comes with its own requirements and challenges, e.g. politeness, timeliness, scalability and the rate at which new content is discovered.

In this talk, we’ll look at the approaches used by StormCrawler, a distributed, low-latency web crawler based on Apache Storm, when using Elasticsearch as a backend. After a quick overview of StormCrawler, we’ll explore its Elasticsearch module and see what strategies are used to provide scalability while fulfilling the requirements of a streaming crawl.

Finally, we will compare with other systems and see in which cases one is preferable to the other.

This talk does not require any pre-existing knowledge of web crawling but some familiarity with Elasticsearch is required.


Introduction to web crawling with StormCrawler (and Elasticsearch)

In this course, we will explore [StormCrawler](http://stormcrawler.net), a collection of resources for building low-latency, large-scale web crawlers on Apache Storm. After a short introduction to Apache Storm and an overview of what StormCrawler provides, we’ll put it to use straight away for a simple crawl before moving on to the deployed mode of Storm.

In the second part of the session, we will introduce metrics, index documents with Elasticsearch and Kibana, and dive into data extraction. Finally, we’ll cover recursive crawls and scalability. This course will be hands-on: attendees will run the code on their own machines.

This course will suit Java developers with an interest in big data, stream processing, web crawling and search. It will provide a practical introduction to Apache Storm and Elasticsearch, as well as StormCrawler itself, and should not require advanced programming skills.