
Build a Web Scraper in Rust and Deploy to Wasmer Edge

Learn how to build a powerful web scraper with Rust, and deploy it seamlessly to the cloud with Wasmer Edge! Follow this step-by-step tutorial to scrape news from Hacker News, create a web server, and run your app as a WebAssembly/WASIX application.

By Rudra (dynamite-bud)

August 14, 2023


In this article, we will build a web scraper that scrapes news from Hacker News and prints it to the console. At the end of the tutorial, we will deploy the scraper to Wasmer Edge and test it.

This tutorial is built on our pre-existing example of wasix-reqwest-proxy.

The application built in this tutorial is available on the Wasmer Registry as wasmer/news-scraper, and its source code is available here.

Let's get started 🚀

Design of the application

The application shows news from the Hacker News website. The tutorial is divided into two parts. The first part is a scraper that extracts the news from the website. The second part is a web server that serves the scraped news so it can be displayed in the user's console.

[Figure: News Scraper Design]

Building the application

Prerequisites

Creating the project

Let's set up a basic Rust project with cargo.

cargo new news-scraper

Moving on to the next step, we will add the dependencies to the Cargo.toml file.

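[dependencies]
anyhow = "1.0.72"
reqwest = "0.11.18"
# `#[tokio::main]` needs the "macros" and "rt-multi-thread" features
tokio = { version = "1", features = ["macros", "rt-multi-thread"] }
scraper = "0.17.0"
colored = "2"

  • anyhow is an error-handling library for Rust.
  • reqwest is an HTTP client for Rust.
  • tokio is an asynchronous runtime for Rust.
  • scraper is an HTML parsing and querying library for Rust.
  • colored is a library for coloring the console output.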

Scraping the news

Let's start by scraping the news from the Hacker News website. We will create a new file, src/news_scraper.rs, and add the following code to it.

// src/news_scraper.rs
/// Structs for storing the scraped news
struct NewsHeadline {
    headline: String,
    link: String,
    time: String,
    num_points: Option<u32>,
    num_comments: Option<String>,
    author: Option<String>,
}

/// Struct for scraping the news
pub struct NewsScraper {
    news: Vec<NewsHeadline>,
}

We will now implement the NewsScraper struct.

// src/news_scraper.rs
use scraper::Html;

impl NewsScraper {
    pub fn new() -> Self {
        NewsScraper { news: Vec::new() }
    }

    /// For now, parse the page and return its HTML unchanged
    pub fn scrape(&mut self, page: String) -> anyhow::Result<String> {
        // Parse the page into a DOM tree
        let document = Html::parse_document(&page);

        Ok(document.html().to_string())
    }
}

Now, let's check if the scraper is working. We will add the following code to the main.rs file.

// src/main.rs
mod news_scraper;
use news_scraper::NewsScraper;

#[tokio::main]
async fn main() {
    let url = "https://news.ycombinator.com/";
    // Fetch the front page; unwrap is acceptable for this quick check
    let page = async { reqwest::get(url).await?.text().await }.await.unwrap();

    let mut news_scraper = NewsScraper::new();
    let news = news_scraper.scrape(page).unwrap();
    println!("{}", news);
}

Running the application with cargo run will print the HTML of the Hacker News website.

$ cargo run

<!doctype html>
<html lang="en" op="news"><head><meta name="referrer"...
...
...
</html>

Parsing the news

Now, we will parse the news from the HTML. We will use the scraper crate for this.

// src/news_scraper.rs
/// in the NewsScraper impl
pub fn scrape(&mut self, page: String) -> anyhow::Result<()> {
    // Parse the page into a DOM tree
    let document = Html::parse_document(&page);

    // Get all the elements with class "athing"
    let athing_selector = scraper::Selector::parse(".athing").unwrap();
    let athing_elements = document.select(&athing_selector);

    // traverse the elements
    for athing_element in athing_elements {
        ...
    }

    Ok(())
}

The full code for the scrape function is available here
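
The loop body is elided above. Whatever it does, text pulled from the page (for example "114 points") has to be converted into the field types of NewsHeadline, such as num_points: Option<u32>. Here is a minimal sketch of that conversion. The helper name parse_leading_number is hypothetical and is not part of the original code, and the sample strings are assumptions about what the scraped text looks like.

```rust
/// Hypothetical helper (not in the original source): returns the number at
/// the start of text such as "114 points", or None when the first word is
/// not a number (for example "discuss").
fn parse_leading_number(text: &str) -> Option<u32> {
    // split_whitespace also splits on non-breaking spaces (U+00A0)
    text.split_whitespace().next()?.parse().ok()
}

fn main() {
    for sample in ["114 points", "39\u{a0}comments", "discuss"] {
        println!("{:?} -> {:?}", sample, parse_leading_number(sample));
    }
}
```

Returning an Option instead of panicking lets the scraper keep going when an entry has no score or no comment count.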

Next, add another function to the NewsScraper impl that returns the news as a formatted string.

// requires `use colored::Colorize;` at the top of src/news_scraper.rs
pub fn get_news(&self) -> String {
    let mut news = String::new();
    news.push_str(&format!(
        "\n{}\n",
        " Hacker News ".bold().on_bright_green().black()
    ));
    for (i, news_headline) in self.news.iter().enumerate() {
        news.push_str(&format!("\n{}. {}", i + 1, news_headline));
    }
    news
}

Formatting news_headline with {} works because we also implemented the Display trait for the NewsHeadline struct. You can find the full code here
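
The Display implementation itself is not shown in this article. The sketch below is one way it could look. The three-line layout is an assumption modeled on the sample output later in this tutorial, and it leaves out the coloring done with the colored crate so that it only depends on the standard library. It also assumes num_comments holds text such as "39 comments".

```rust
use std::fmt;

/// Same fields as the `NewsHeadline` struct defined earlier.
struct NewsHeadline {
    headline: String,
    link: String,
    time: String,
    num_points: Option<u32>,
    num_comments: Option<String>,
    author: Option<String>,
}

impl fmt::Display for NewsHeadline {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        // First line: the headline itself
        writeln!(f, "{}", self.headline)?;
        // Second line: indented metadata; optional parts are skipped when absent
        write!(f, "        ")?;
        if let Some(points) = self.num_points {
            write!(f, "{points} points ")?;
        }
        if let Some(author) = &self.author {
            write!(f, "by {author} ")?;
        }
        write!(f, "{}", self.time)?;
        if let Some(comments) = &self.num_comments {
            write!(f, " | {comments}")?;
        }
        // Third line: indented link
        write!(f, "\n        {}", self.link)
    }
}

/// Sample data copied from the output shown later in this tutorial.
fn sample_headline() -> NewsHeadline {
    NewsHeadline {
        headline: "Show HN: LLMs can generate valid JSON 100% of the time".to_string(),
        link: "https://github.com/normal-computing/outlines".to_string(),
        time: "1 hour ago".to_string(),
        num_points: Some(114),
        num_comments: Some("39 comments".to_string()),
        author: Some("remilouf".to_string()),
    }
}

fn main() {
    // get_news prefixes each entry with its position in the list
    println!("1. {}", sample_headline());
}
```

Because the optional fields are checked one by one, an entry without a score, author or comment count still prints its headline, time and link.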

Serving the news

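Next, we add a basic hyper server to serve the scraped news. Add hyper to the dependencies:

# Cargo.toml
# `Server::bind` needs the "server", "http1" and "tcp" features
hyper = { version = "0.14.27", features = ["server", "http1", "tcp"] }

Then add the following code to the main.rs file, replacing the earlier main function.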

// src/main.rs
use hyper::service::{make_service_fn, service_fn};
use hyper::{Body, Request, Response, Server, StatusCode};
use std::convert::Infallible;
use std::net::SocketAddr;

async fn handle(_req: Request<Body>) -> Result<Response<Body>, Infallible> {
    let url = "https://news.ycombinator.com/";
    let mut status = StatusCode::OK;

    let page = match async { reqwest::get(url).await?.text().await }.await {
        Ok(b) => b,
        Err(err) => {
            status = err.status().unwrap_or(StatusCode::BAD_REQUEST);
            format!("{err}")
        }
    };

    let mut news = news_scraper::NewsScraper::new();
    news.scrape(page);
    let response = news.get_news();
    let body = String::from_utf8_lossy(response.as_bytes()).to_string();

    let mut res = Response::new(Body::from(body));
    *res.status_mut() = status;
    Ok(res)
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // check if there's an environment variable for the port
    let port = std::env::var("PORT").unwrap_or_else(|_| "80".to_string());
    // parse the port into a u16
    let port = port.parse::<u16>()?;

    let addr = SocketAddr::from(([127, 0, 0, 1], port));

    println!("Listening on {}", addr);

    // And a MakeService to handle each connection...
    let make_service = make_service_fn(|_conn| async { Ok::<_, Infallible>(service_fn(handle)) });

    // Then bind and serve...
    let server = Server::bind(&addr).serve(make_service);

    // And run forever...
    Ok(server.await?)
}

The above code starts a hyper server that serves the scraped news. It listens on port 80 by default. Set the PORT environment variable to use a different port.

Note: The above code is standard hyper server code. You can find more information about it in the hyper documentation.
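
The port handling in main can be looked at in isolation. The sketch below pulls it into a hypothetical helper, listen_addr, which is not part of the original code. It takes the value of PORT as an Option so that the fallback to 80 and the parse error are easy to exercise without touching the real environment.

```rust
use std::net::SocketAddr;
use std::num::ParseIntError;

/// Hypothetical helper mirroring the logic in `main`: fall back to port 80
/// when no value is given, and fail when the value is not a valid u16.
fn listen_addr(port_var: Option<&str>) -> Result<SocketAddr, ParseIntError> {
    let port = port_var.unwrap_or("80").parse::<u16>()?;
    Ok(SocketAddr::from(([127, 0, 0, 1], port)))
}

fn main() {
    // Read PORT from the environment, treating an unset variable as None
    let port_var = std::env::var("PORT").ok();
    match listen_addr(port_var.as_deref()) {
        Ok(addr) => println!("Listening on {addr}"),
        Err(err) => eprintln!("Invalid PORT value: {err}"),
    }
}
```

Parsing into u16 rejects values outside the valid port range, such as 70000, as well as non-numeric input.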

Running the application

$ PORT=3000 cargo run
Listening on 127.0.0.1:3000

Now, in another terminal:

$ curl localhost:3000

 Hacker News

1. Show HN: LLMs can generate valid JSON 100% of the time
        114 points by remilouf 1 hour ago | 39 comments
        https://github.com/normal-computing/outlines
2. Bezos Earth Fund Grants $400M for Greening Underserved Urban U.S. Communities
        28 points by myroon5 44 minutes ago | 5 comments
        https://www.bezosearthfund.org/news-and-insights/announcing-400-million-greening-americas-cities
3. JWST spots giant black holes all over the early universe
        90 points by Brajeshwar 2 hours ago | 23 comments
        https://www.quantamagazine.org/jwst-spots-giant-black-holes-all-over-the-early-universe-20230814/

Compiling the application to WebAssembly/WASIX

Pinning dependencies

Some of the dependencies we are using are not yet compatible with WASIX, so they need to be patched and pinned to specific versions.

Change your Cargo.toml file to the following:

[dependencies]
anyhow = "1.0.72"
reqwest = { git = "https://github.com/wasix-org/reqwest.git", default-features = false, features = [
    "blocking",
    "rustls-tls",
] } # 👈🏼 Changed here
tokio = { version = "=1.24.2", git = "https://github.com/wasix-org/tokio.git", branch = "epoll", default-features = false, features = [
    "rt-multi-thread",
    "macros",
    "fs",
    "io-util",
    "net",
    "signal",
] } # 👈🏼 Changed here
hyper = { git = "https://github.com/wasix-org/hyper.git", branch = "v0.14.27", features = [
    "server",
] } # 👈🏼 Changed here
scraper = "0.17.1"
colored = "2"

[patch.crates-io]
socket2 = { git = "https://github.com/wasix-org/socket2.git", branch = "v0.4.9" } # 👈🏼 Added here
tokio = { git = "https://github.com/wasix-org/tokio.git", branch = "epoll" } # 👈🏼 Added here
rustls = { git = "https://github.com/wasix-org/rustls.git", branch = "v0.21.5" } # 👈🏼 Added here
hyper = { git = "https://github.com/wasix-org/hyper.git", branch = "v0.14.27" } # 👈🏼 Added here

This is a temporary solution until the dependencies are updated to support WASIX. You can learn more about this here.

Compiling to WASIX

We will compile our application to WASIX using cargo-wasix:

cargo wasix build --release

This will create a target/wasm32-wasmer-wasi/release/news-scraper.wasm file.

Running the application with Wasmer

wasmer run target/wasm32-wasmer-wasi/release/news-scraper.wasm --env PORT=3000

Now, in another terminal:

curl localhost:3000

 Hacker News

1. Show HN: LLMs can generate valid JSON 100% of the time
        114 points by remilouf 1 hour ago | 39 comments
        https://github.com/normal-computing/outlines
2. Bezos Earth Fund Grants $400M for Greening Underserved Urban U.S. Communities
        28 points by myroon5 44 minutes ago | 5 comments
        https://www.bezosearthfund.org/news-and-insights/announcing-400-million-greening-americas-cities
3. JWST spots giant black holes all over the early universe
        90 points by Brajeshwar 2 hours ago | 23 comments
        https://www.quantamagazine.org/jwst-spots-giant-black-holes-all-over-the-early-universe-20230814/

It's all the same 😎. But now we have a WebAssembly/WASIX application.

NOTE: You can test the package locally with wasmer run wasmer/news-scraper --env PORT=3000.

Deploying to Wasmer Edge

For this, we first need to create our configuration files.

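We need two files:

  • wasmer.toml, the package configuration used to publish the project to the Wasmer Registry
  • app.yaml, the app configuration used to deploy the application to Wasmer Edge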

Wasmer configuration

[package]
name = "wasmer/news-scraper" # 👈🏼 Change this to republish on your user
version = "0.1.1"
description = "Package to showcase a news scraper on Wasmer Edge"
wasmer-extra-flags = "--net --enable-threads --enable-bulk-memory"

[dependencies]

[[module]]
name = "news-scraper"
source = "target/wasm32-wasmer-wasi/release/news-scraper.wasm"

[[command]]
name = "run"
module = "news-scraper"
runner = "wasi@unstable_"  # ℹ️ Runner for WASIX

ℹ️ Note: The wasmer.toml configuration file is for publishing your project to the Wasmer Registry.

App configuration

kind: wasmer.io/App.v0
name: news-scraper
package: wasmer/news-scraper # 👈🏼 The package name

ℹ️ Note: The app.yaml configuration file is for deploying your application to Wasmer Edge.

Deploying the application

$ wasmer deploy

Loaded app from: /Volumes/Work/Projects/Rust/news-scrapper/app.yaml

Publish new version of package 'wasmer/news-scraper'? yes
Publishing package...
[1/2] ⬆️   Uploading...
[2/2] 📦  Publishing...
Successfully published package `wasmer/news-scraper@0.1.1`
Waiting for package to become available.......
Package 'wasmer/news-scraper@0.1.1' published successfully!

Deploying app news-scraper...

 ✅ App news-scraper was successfully deployed!

> App URL: https://news-scraper.wasmer.app
> Versioned URL: https://xd1iltdfy9nx.id.wasmer.app
> Admin dashboard: https://wasmer.io/apps/wasmer/news-scraper

Waiting for the app to become reachable...
..
App is now reachable!

Now our application is deployed to Wasmer Edge. Let's test it out.

Running the application on Wasmer Edge

$ curl https://news-scraper.wasmer.app

 Hacker News

1. Show HN: LLMs can generate valid JSON 100% of the time
        114 points by remilouf 1 hour ago | 39 comments
        https://github.com/normal-computing/outlines
2. Bezos Earth Fund Grants $400M for Greening Underserved Urban U.S. Communities
        28 points by myroon5 44 minutes ago | 5 comments
        https://www.bezosearthfund.org/news-and-insights/announcing-400-million-greening-americas-cities
3. JWST spots giant black holes all over the early universe
        90 points by Brajeshwar 2 hours ago | 23 comments
        https://www.quantamagazine.org/jwst-spots-giant-black-holes-all-over-the-early-universe-20230814/

Here's the output:

[Figure: News Scraper Output]

This package is also available on the Wasmer Registry as news-scraper.

You can try it for yourself by running the following command:

$ wasmer run wasmer/news-scraper --net --env PORT=3000

💡 Interested in trying Wasmer Edge? Join the waitlist

📖 You can learn more about Wasmer Edge in its documentation here for its architecture, examples and tutorials.

Conclusion

In this tutorial, we learned:

  • how to create a web scraper using Rust
  • how to use the reqwest and scraper crates
  • how to create a simple HTTP server using hyper
  • how to compile the application to WebAssembly/WASIX and deploy it to Wasmer Edge

Resources

You can find the source code for this tutorial here
