27-29 November, Vilnius

Conference about Big Data, High Load, Data Science, Machine Learning & AI

JULIEN NIOCHE

DigitalPebble, UK

Biography

Julien is Director of DigitalPebble Ltd and a member of the Apache Software Foundation. His expertise is in document engineering, with a strong focus on open source tools. Julien has designed and implemented solutions for Information Retrieval, Text Analysis, Information Extraction, Machine Learning, Web Crawling and Big Data for DigitalPebble’s clients. He contributes to several open source projects, including Apache Nutch, Tika, GATE, UIMA and CrawlerCommons, and is the author of projects such as Behemoth and StormCrawler, which are used by numerous organisations worldwide. He also speaks regularly at conferences and has reviewed several books.

Talk

Crawling the Web at Scale with Elasticsearch

Search systems such as Elasticsearch or Apache Solr are often used by web crawlers to index and search the content of web pages. However, they can also serve as a backend for storing information about the URLs known to the crawler. By doing so, we benefit from the scalability of these tools and also from the insights they provide into our data, e.g. by being able to search our crawl data in real time and monitor the content of the crawl with tools like Kibana. Web crawling also comes with its own requirements and challenges, e.g. politeness, timeliness, scalability and the rate at which new content is discovered.

In this talk, we’ll look at the approaches used by StormCrawler, a distributed, low-latency web crawler based on Apache Storm, when using Elasticsearch as a backend. After a quick overview of StormCrawler, we’ll explore its Elasticsearch module and see what strategies are used to provide scalability while fulfilling the requirements of a streaming crawl.

Finally, we will compare this approach with other systems and see in which cases one is preferable to the other.

This talk does not require any prior knowledge of web crawling, but some familiarity with Elasticsearch is required.

Workshop

Introduction to web crawling with StormCrawler (and Elasticsearch)

In this workshop, we will explore StormCrawler, a collection of resources for building low-latency, large-scale web crawlers on Apache Storm. After a short introduction to Apache Storm and an overview of what StormCrawler provides, we’ll put it to use for a simple crawl before moving on to the deployed mode of Storm.
In the second part of the session, we will introduce metrics and index documents with Elasticsearch and Kibana and dive into data extraction. Finally, we’ll cover recursive crawls and scalability. This course will be hands-on: attendees will run the code on their own machines.