HTML scraping with Page.REST and Foopipes 2017-09-28

Page.REST with Foopipes and Elasticsearch

HTML scraping as a service? What a simple but powerful idea!

Fortunately, REST APIs are becoming more and more common for just about everything, but I've lost count of how many times I've wanted to programmatically fetch information from a website that doesn't have a public API.

I recently found a new service, Page.REST, which makes HTML scraping simple. It took only a few minutes to write a small Page.REST add-in task for Foopipes to make it easier to use.

I can't begin to imagine everything you could do with this, so I'll let a couple of examples speak for themselves.

Newspaper email alerts

The first example is to poll a Swedish newspaper and send an email to me when the top headline changes.

Create a file named foopipes.yml in the current directory with this content:

addins:
  - url: "https://raw.githubusercontent.com/AreteraAB/Foopipes.Addins/master/Page.REST/Page.REST.csx"
  - url: "https://raw.githubusercontent.com/AreteraAB/Foopipes.Addins/master/Mailgun/mailgun.csx"

services: 
  scheduler: 
    type: scheduler
    interval: "00:05:00"

  mailgun:
    type: mailgun
    apiBaseUrl: "https://api.mailgun.net/v3/sandbox5xxxxxx8.mailgun.org"
    apiKey: key-3xxxxxxc
    
pipelines: 
  - 
    when: 
      - scheduler
    do: 
      - page.rest: "http://www.dn.se/"
        selectors: "article.lp_puff1 h2"
        token: "${ACCESS_TOKEN}"

      - select: "$..text"

      - where: "$.text"
        not: "#{file:current_headline}"

      - set: "file:current_headline"
        value: "#{text}"
      
    to:
      - mailgun.send:
        to: "me@myemail.se"
        from: "info@myemail.se" 
        subject: "#{text}" 

Every five minutes, Foopipes polls http://www.dn.se/ via Page.REST and uses a CSS selector to find the top headline. When the headline changes, Foopipes sends an email using Mailgun.

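The #{...} expressions are JSONPath data binding expressions that are evaluated against the response from Page.REST. The ${ACCESS_TOKEN} placeholder takes its value from the ACCESS_TOKEN environment variable.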
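To make the select: "$..text" step a bit more concrete, here is a minimal Python sketch of what a recursive-descent JSONPath lookup does. It is not how Foopipes implements it. The helper name find_all and the sample response are made up for illustration; the response shape (selectors, then the selector string, then a list of matches with a text field) is only inferred from the expressions used in this post.

```python
def find_all(node, key):
    """Collect every value stored under `key` at any depth, like JSONPath `$..key`."""
    found = []
    if isinstance(node, dict):
        for name, value in node.items():
            if name == key:
                found.append(value)
            # Keep descending, matches can be nested at any level.
            found.extend(find_all(value, key))
    elif isinstance(node, list):
        for item in node:
            found.extend(find_all(item, key))
    return found


# Hypothetical response, shaped like the expressions in this post suggest.
sample_response = {
    "selectors": {
        "article.lp_puff1 h2": [
            {"text": "Example headline"},
        ]
    }
}

headlines = find_all(sample_response, "text")
print(headlines)  # ['Example headline']
```

The point is that "$..text" does not care where in the response the text field lives, which is handy when the selector string itself is an awkward key.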

We keep state by storing the latest headline in a file named current_headline.
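The where/not/set steps boil down to "compare with the stored value, and only continue if it differs". A rough Python sketch of that idea, assuming nothing about Foopipes internals: the function name headline_changed is hypothetical, and treating a missing state file as "changed" is my assumption, not documented behavior.

```python
import tempfile
from pathlib import Path


def headline_changed(state_file: Path, headline: str) -> bool:
    """Return True (and remember the headline) only when it differs from the stored one."""
    previous = state_file.read_text(encoding="utf-8") if state_file.exists() else None
    if headline == previous:
        return False  # same headline as last poll, nothing to send
    state_file.write_text(headline, encoding="utf-8")
    return True


# Simulate three polls: a new headline, the same one again, then a different one.
with tempfile.TemporaryDirectory() as tmp:
    state = Path(tmp) / "current_headline"
    results = [headline_changed(state, h) for h in ["A", "A", "B"]]

print(results)  # [True, False, True]
```

Only the polls that return True would go on to the email step.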

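If you use Docker, start this example with the command below. %CD% is the Windows Command Prompt syntax for the current directory; on Linux or macOS, use $(pwd) instead.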

docker run -it --rm -v %CD%:/project -e ACCESS_TOKEN=<Page.REST access token> aretera/foopipes 

Blog to Elasticsearch

Here we'll scrape all Foopipes blog articles and store them in Elasticsearch.

addins:
  - url: "https://raw.githubusercontent.com/AreteraAB/Foopipes.Addins/master/Page.REST/Page.REST.csx"

plugins: 
  - Elasticsearch

services: 
  scheduler: 
    type: scheduler
    interval: "00:05:00"

  elasticsearch:
    type: elasticsearch
    url: http://elasticsearch:9200/

pipelines: 
  - 
    when: 
      - scheduler
    do: 
      - page.rest: "https://foopipes.com/blog"
        selectors: ".row div.col-lg-8 a"
        token: "${ACCESS_TOKEN}"

      - select: "$..href"
    to:
      - queue: href

  - 
    when: 
      - queue: href
    do:
      - set: "metadata:url"
        value: "https://foopipes.com/blog/#{href}"

      - page.rest: "#{metadata:url}"
        selectors: ".blogpost-text;.blogpost-text h1"
        token: "${ACCESS_TOKEN}"

      - map:
        url: "#{metadata:url}"
        headline: "#{selectors.['.blogpost\\-text h1'][0].text}"
        body: "#{selectors.['.blogpost\\-text'][0].text}"
    to:
      - store: elasticsearch
        index: blog
        dataType: article
        key: "#{url}"

The first pipeline finds all <a href=".." /> links on the blog index page and puts each URL on a queue. The second pipeline takes each URL from the queue, scrapes the headline and the body of that article, and stores them in Elasticsearch.
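The two pipelines form a simple fan-out: one step produces links, the other consumes them one at a time. Here is a small Python sketch of that flow with the scraping and storage stubbed out. Everything in it (run_pipelines, the fake_ functions, the dict standing in for Elasticsearch) is hypothetical and only mirrors the structure of the config above.

```python
from queue import Queue


def run_pipelines(list_links, scrape_article, store):
    """First stage queues every link; second stage scrapes and stores each one."""
    hrefs = Queue()

    # Pipeline 1: find the links and put them on the queue.
    for href in list_links():
        hrefs.put(href)

    # Pipeline 2: for each queued link, build the URL, scrape, and store by URL.
    while not hrefs.empty():
        url = "https://foopipes.com/blog/" + hrefs.get()
        article = scrape_article(url)
        store(article["url"], article)


# Stubs standing in for Page.REST and Elasticsearch.
def fake_list_links():
    return ["first-post", "second-post"]


def fake_scrape_article(url):
    return {"url": url, "headline": "Headline for " + url, "body": "..."}


index = {}
run_pipelines(fake_list_links, fake_scrape_article, index.__setitem__)
print(sorted(index))
```

Because each article is stored under its URL, running the flow again overwrites the existing entries instead of adding duplicates.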

Again, the expressions inside #{...} are JSONPath data binding expressions. Here they look a bit complicated because Page.REST uses the selector strings as keys in its response, so the . and - characters in those keys need to be escaped.
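If the escaping looks odd, it may help to see the same map step written as plain dictionary lookups in Python. This is just a sketch: to_article and the sample response are hypothetical, with the response shape inferred from the expressions above.

```python
def to_article(url, response):
    """Pick the headline and body out of a response keyed by selector string."""
    selectors = response["selectors"]
    return {
        "url": url,
        "headline": selectors[".blogpost-text h1"][0]["text"],
        "body": selectors[".blogpost-text"][0]["text"],
    }


# Hypothetical response: the keys are the CSS selectors themselves.
sample = {
    "selectors": {
        ".blogpost-text": [{"text": "Title Body text"}],
        ".blogpost-text h1": [{"text": "Title"}],
    }
}

article = to_article("https://foopipes.com/blog/example", sample)
print(article["headline"])

# A naive dotted path cannot tell the dot inside the key from a path separator.
naive_parts = "selectors..blogpost-text h1.0.text".split(".")
print(naive_parts)
```

With bracket lookups the dot in ".blogpost-text" is just part of the key. In a dotted path expression it would be read as a separator, which is why the key has to be quoted and escaped.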

Start Elasticsearch:

docker run -p 9200:9200 --name elasticsearch elasticsearch

Start Foopipes:

docker run -it --rm --link elasticsearch -v %CD%:/project -e ACCESS_TOKEN=<Page.REST access token> aretera/foopipes 

Now you can search the blog with: http://localhost:9200/blog/_search?q=foopipes
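If you'd rather search from code than from the browser, something like the following Python sketch could work. The helper names search_url and headlines_from are made up, and the sample response is hand-written to follow the usual Elasticsearch layout (hits.hits[]._source) rather than copied from a real run.

```python
from urllib.parse import urlencode


def search_url(term, base="http://localhost:9200", index="blog"):
    """Build the same kind of URI search as the link above, with the term URL-encoded."""
    return f"{base}/{index}/_search?{urlencode({'q': term})}"


def headlines_from(search_response):
    """Pull the stored headline out of each hit in a search response."""
    return [hit["_source"]["headline"] for hit in search_response["hits"]["hits"]]


# Hand-written stand-in for a search response, not real output.
sample_search_response = {
    "hits": {
        "hits": [
            {"_source": {"url": "https://foopipes.com/blog/example", "headline": "Title", "body": "..."}},
        ]
    }
}

print(search_url("foopipes"))
print(headlines_from(sample_search_response))
```

To run it for real, fetch search_url("foopipes") with any HTTP client, parse the JSON, and pass the result to headlines_from.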

NOTE: Don't publish content from other people's websites without permission!