Build a Web Scraper in Rust and Deploy to Wasmer Edge
Learn how to build a powerful web scraper with Rust, and deploy it seamlessly to the cloud with Wasmer Edge! Follow this step-by-step tutorial to scrape news from Hacker News, create a web server, and run your app as a WebAssembly/WASIX application.
Rudra
August 14, 2023
In this article, we will build a web scraper that scrapes news from Hacker News and prints it to the console. At the end of the tutorial, we will deploy the scraper to Wasmer Edge and test it.
This tutorial is built on our pre-existing example of wasix-reqwest-proxy.
The application built in this tutorial is available on the Wasmer registry as wasmer/news-scraper, and its source code is available here.
Let's get started 🚀
Design of the application
The application's purpose is to show news from the Hacker News website. The tutorial is divided into two parts: first, a scraper that scrapes the news from the website; second, a web server that serves the scraped news and shows it in the user's console.
Building the application
Prerequisites
To follow along, you will need Rust and cargo installed, the cargo-wasix subcommand (installed with cargo install cargo-wasix), and the Wasmer CLI for running and deploying the application.
Creating the project
Let's setup a basic Rust project with cargo.
cargo new news-scraper
Next, we will add the following dependencies to the Cargo.toml file:
# Cargo.toml
[dependencies]
anyhow = "1.0.72"
reqwest = "0.11.18"
tokio = { version = "1", features = ["full"] }
scraper = "0.17.0"
colored = "2"
- reqwest is an HTTP client for Rust.
- tokio is an asynchronous runtime for Rust.
- scraper is an HTML scraper for Rust.
- colored is a library for coloring the console output.
Scraping the news
Let's start by scraping the news from the Hacker News website. We will create a new file src/news_scraper.rs and add the following code to it.
// src/news_scraper.rs

/// Struct for storing the scraped news
struct NewsHeadline {
    headline: String,
    link: String,
    time: String,
    num_points: Option<u32>,
    num_comments: Option<String>,
    author: Option<String>,
}

/// Struct for scraping the news
pub struct NewsScraper {
    news: Vec<NewsHeadline>,
}
We will now implement the NewsScraper struct.
// src/news_scraper.rs
use scraper::Html;

impl NewsScraper {
    pub fn new() -> Self {
        NewsScraper { news: Vec::new() }
    }

    pub fn scrape(&mut self, page: String) -> String {
        // Parse the page into a DOM tree and, for now, return it as a string
        let document = Html::parse_document(&page);
        document.html().to_string()
    }
}
Now, let's check if the scraper is working. We will add the following code to the main.rs file.
// src/main.rs
mod news_scraper;

use news_scraper::NewsScraper;

#[tokio::main]
async fn main() {
    let url = "https://news.ycombinator.com/";
    // Fetch the Hacker News front page as a String
    let page = async { reqwest::get(url).await?.text().await }
        .await
        .unwrap();
    let mut news_scraper = NewsScraper::new();
    let news = news_scraper.scrape(page);
    println!("{}", news);
}
Running the application with cargo run will print the HTML of the Hacker News website.
$ cargo run
<!doctype html>
<html lang="en" op="news"><head><meta name="referrer"...
...
...
</html>
Parsing the news
Now, we will parse the news from the HTML. We will use the scraper crate for this.
// src/news_scraper.rs
// in the NewsScraper impl: replace scrape with the full parser
pub fn scrape(&mut self, page: String) {
    // Parse the page into a DOM tree
    let document = Html::parse_document(&page);
    // Get all the elements with class "athing"
    let athing_selector = scraper::Selector::parse(".athing").unwrap();
    let athing_elements = document.select(&athing_selector);
    // traverse the elements
    for athing_element in athing_elements {
        ...
    }
}
The full code for the scrape function is available here.
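To give a flavor of what it does, here is a rough sketch of how the loop can fill self.news. The ".titleline > a" selector and the placeholder fields are assumptions based on Hacker News's current markup; the full version in the repository also extracts the points, author, time, and comment count.
// src/news_scraper.rs (sketch of the scrape loop; selectors are assumptions)
let title_selector = scraper::Selector::parse(".titleline > a").unwrap();
for athing_element in athing_elements {
    // Each ".athing" row contains the headline link
    if let Some(title) = athing_element.select(&title_selector).next() {
        self.news.push(NewsHeadline {
            headline: title.text().collect::<String>(),
            link: title.value().attr("href").unwrap_or_default().to_string(),
            time: String::new(), // parsed from the sibling "subtext" row in the full code
            num_points: None,
            num_comments: None,
            author: None,
        });
    }
}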
Add another function to the NewsScraper impl to get the news as a formatted string.
// src/news_scraper.rs (inside the NewsScraper impl; requires `use colored::Colorize;` at the top of the file)
pub fn get_news(&self) -> String {
    let mut news = String::new();
    news.push_str(&format!(
        "\n{}\n",
        " Hacker News ".bold().on_bright_green().black()
    ));
    for (i, news_headline) in self.news.iter().enumerate() {
        news.push_str(&format!("\n{}. {}", i + 1, news_headline));
    }
    news
}
We also implemented the Display trait for the NewsHeadline struct. You can find the full code here.
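For reference, here is a minimal sketch of such a Display implementation. The exact formatting in the repository may differ; the layout here simply mirrors the console output shown later in this tutorial.
// src/news_scraper.rs (sketch; the repository's Display impl may differ)
use std::fmt;

impl fmt::Display for NewsHeadline {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        writeln!(f, "{}", self.headline)?;
        writeln!(
            f,
            "   {} points by {} {} | {} comments",
            self.num_points.unwrap_or(0),
            self.author.as_deref().unwrap_or("unknown"),
            self.time,
            self.num_comments.as_deref().unwrap_or("0")
        )?;
        write!(f, "   {}", self.link)
    }
}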
Serving the news
Let's add a basic hyper server to serve the scraped news.
# Cargo.toml
hyper = { version = "0.14.27", features = ["full"] }
Add the following code to the main.rs file.
// src/main.rs
use hyper::service::{make_service_fn, service_fn};
use hyper::{Body, Request, Response, Server, StatusCode};
use std::convert::Infallible;
use std::net::SocketAddr;

async fn handle(_req: Request<Body>) -> Result<Response<Body>, Infallible> {
    let url = "https://news.ycombinator.com/";
    let mut status = StatusCode::OK;
    let page = match async { reqwest::get(url).await?.text().await }.await {
        Ok(b) => b,
        Err(err) => {
            status = err.status().unwrap_or(StatusCode::BAD_REQUEST);
            format!("{err}")
        }
    };
    let mut news = news_scraper::NewsScraper::new();
    news.scrape(page);
    let response = news.get_news();
    let body = String::from_utf8_lossy(response.as_bytes()).to_string();
    let mut res = Response::new(Body::from(body));
    *res.status_mut() = status;
    Ok(res)
}
#[tokio::main]
async fn main() -> anyhow::Result<()> {
    // Check if there's an environment variable for the port
    let port = std::env::var("PORT").unwrap_or_else(|_| "80".to_string());
    // Parse the port into a u16
    let port = port.parse::<u16>()?;
    let addr = SocketAddr::from(([127, 0, 0, 1], port));
    println!("Listening on {}", addr);
    // And a MakeService to handle each connection...
    let make_service = make_service_fn(|_conn| async { Ok::<_, Infallible>(service_fn(handle)) });
    // Then bind and serve...
    let server = Server::bind(&addr).serve(make_service);
    // And run forever...
    Ok(server.await?)
}
The above code starts a hyper server on port 80; you can set the PORT environment variable to change the port.
Note: The above code is standard hyper server code. You can find more information about it in the hyper documentation.
Running the application
$ PORT=3000 cargo run
Listening on 127.0.0.1:3000
Now, in another terminal:
$ curl localhost:3000
Hacker News
1. Show HN: LLMs can generate valid JSON 100% of the time
114 points by remilouf 1 hour ago | 39 comments
https://github.com/normal-computing/outlines
2. Bezos Earth Fund Grants $400M for Greening Underserved Urban U.S. Communities
28 points by myroon5 44 minutes ago | 5 comments
https://www.bezosearthfund.org/news-and-insights/announcing-400-million-greening-americas-cities
3. JWST spots giant black holes all over the early universe
90 points by Brajeshwar 2 hours ago | 23 comments
https://www.quantamagazine.org/jwst-spots-giant-black-holes-all-over-the-early-universe-20230814/
Compiling the application to WebAssembly/WASIX
Pinning dependencies
Some of the dependencies we are using are not compatible with WASIX, so they need to be patched and pinned to specific versions.
Change your Cargo.toml file to the following:
[dependencies]
anyhow = "1.0.72"
reqwest = { git = "https://github.com/wasix-org/reqwest.git", default-features = false, features = [
"blocking",
"rustls-tls",
] } # 👈🏼 Changed here
tokio = { version = "=1.24.2", git = "https://github.com/wasix-org/tokio.git", branch = "epoll", default-features = false, features = [
"rt-multi-thread",
"macros",
"fs",
"io-util",
"net",
"signal",
] } # 👈🏼 Changed here
hyper = { git = "https://github.com/wasix-org/hyper.git", branch = "v0.14.27", features = [
"server",
] } # 👈🏼 Changed here
scraper = "0.17.1"
colored = "2"
[patch.crates-io]
socket2 = { git = "https://github.com/wasix-org/socket2.git", branch = "v0.4.9" } # 👈🏼 Added here
tokio = { git = "https://github.com/wasix-org/tokio.git", branch = "epoll" } # 👈🏼 Added here
rustls = { git = "https://github.com/wasix-org/rustls.git", branch = "v0.21.5" } # 👈🏼 Added here
hyper = { git = "https://github.com/wasix-org/hyper.git", branch = "v0.14.27" } # 👈🏼 Added here
This is a temporary solution until the dependencies are updated to support WASIX. You can learn more about this here.
Compiling to WASIX
We will compile our application to WASIX using cargo-wasix
cargo wasix build --release
This will create a target/wasm32-wasmer-wasi/release/news-scraper.wasm file.
Running the application with Wasmer
wasmer run target/wasm32-wasmer-wasi/release/news-scraper.wasm --env PORT=3000
Now, in another terminal:
curl localhost:3000
Hacker News
1. Show HN: LLMs can generate valid JSON 100% of the time
114 points by remilouf 1 hour ago | 39 comments
https://github.com/normal-computing/outlines
2. Bezos Earth Fund Grants $400M for Greening Underserved Urban U.S. Communities
28 points by myroon5 44 minutes ago | 5 comments
https://www.bezosearthfund.org/news-and-insights/announcing-400-million-greening-americas-cities
3. JWST spots giant black holes all over the early universe
90 points by Brajeshwar 2 hours ago | 23 comments
https://www.quantamagazine.org/jwst-spots-giant-black-holes-all-over-the-early-universe-20230814/
It's all the same 😎. But now we have a WebAssembly/WASIX application.
NOTE: You can test the package locally with wasmer run wasmer/news-scraper --env PORT=3000.
Deploying to Wasmer Edge
For this, we first need to create our configuration files.
We need two files: wasmer.toml (the package manifest) and app.yaml (the app configuration). You can learn more about each of them in the Wasmer documentation.
Wasmer configuration
[package]
name = "wasmer/news-scraper" # 👈🏼 Change this to republish on your user
version = "0.1.1"
description = "Package to showcase a news scraper on Wasmer Edge"
wasmer-extra-flags = "--net --enable-threads --enable-bulk-memory"
[dependencies]
[[module]]
name = "news-scraper"
source = "target/wasm32-wasmer-wasi/release/news-scraper.wasm"
[[command]]
name = "run"
module = "news-scraper"
runner = "wasi@unstable_" # ℹ️ Runner for WASIX
ℹ️ Note: The wasmer.toml configuration file is for publishing your project to the Wasmer Registry.
App configuration
kind: wasmer.io/App.v0
name: news-scraper
package: wasmer/news-scraper # 👈🏼 The package name
ℹ️ Note: The app.yaml configuration file is for deploying your application to Wasmer Edge.
Deploying the application
$ wasmer deploy
Loaded app from: /Volumes/Work/Projects/Rust/news-scrapper/app.yaml
Publish new version of package 'wasmer/news-scraper'? yes
Publishing package...
[1/2] ⬆️ Uploading...
[2/2] 📦 Publishing...
Successfully published package `wasmer/news-scraper@0.1.1`
Waiting for package to become available.......
Package 'wasmer/news-scraper@0.1.1' published successfully!
Deploying app news-scraper...
✅ App news-scraper was successfully deployed!
> App URL: https://news-scraper.wasmer.app
> Versioned URL: https://xd1iltdfy9nx.id.wasmer.app
> Admin dashboard: https://wasmer.io/apps/wasmer/news-scraper
Waiting for the app to become reachable...
..
App is now reachable!
Now our application is deployed to Wasmer Edge. Let's test it out.
Running the application on Wasmer Edge
$ curl https://news-scraper.wasmer.app
Hacker News
1. Show HN: LLMs can generate valid JSON 100% of the time
114 points by remilouf 1 hour ago | 39 comments
https://github.com/normal-computing/outlines
2. Bezos Earth Fund Grants $400M for Greening Underserved Urban U.S. Communities
28 points by myroon5 44 minutes ago | 5 comments
https://www.bezosearthfund.org/news-and-insights/announcing-400-million-greening-americas-cities
3. JWST spots giant black holes all over the early universe
90 points by Brajeshwar 2 hours ago | 23 comments
https://www.quantamagazine.org/jwst-spots-giant-black-holes-all-over-the-early-universe-20230814/
This package is also available on the Wasmer Registry as news-scraper.
You can try it for yourself by running the following command:
$ wasmer run wasmer/news-scraper --net --env PORT=3000
💡 Interested in trying Wasmer Edge? Join the waitlist.
📖 You can learn more about Wasmer Edge's architecture, examples, and tutorials in its documentation here.
Conclusion
In this tutorial, we learned:
- how to create a web scraper using Rust
- how to use the reqwest and scraper crates
- how to create a simple HTTP server using hyper
- how to compile it to WebAssembly/WASIX and deploy it to Wasmer Edge
Resources
You can find the source code for this tutorial here
About the Author
Rudra