yamt/toywasm v20.0.0: A WebAssembly interpreter

yamt/toywasm

wasi

Public

wasmer run yamt/toywasm

What's this?

A WebAssembly interpreter.

On-browser demo

You can try it out on webassembly.sh:

$ curl -o cowsay.wasm https://registry-cdn.wapm.io/contents/liftm/cowsay/0.2.2/target/wasm32-wasi/release/cowsay.wasm
$ toywasm --wasi cowsay.wasm hello
 _______
< hello >
 -------
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
               ||----w |
                ||     ||
$

Note: the binary published to wapm.io is built with an ancient wasi-sdk to workaround an webassembly.sh issue.

Goals

Learn the spec by implementing it
Have a fun
Correctness
Clean code
Small footprint

Non-goals

Performance
Stable API/ABI

What are implemented?

Feature	Status
WebAssembly	Complete except `simd`
extended-const	Complete (`TOYWASM_ENABLE_WASM_EXTENDED_CONST`)
multi-memory	Complete (`TOYWASM_ENABLE_WASM_MULTI_MEMORY`)
tail-call	Complete (`TOYWASM_ENABLE_WASM_TAILCALL`)
threads	Complete (`TOYWASM_ENABLE_WASM_THREADS`)
wasi_snapshot_preview1	See the top comment in lib/wasi.c
wasi-threads	Complete (`TOYWASM_ENABLE_WASI_THREADS`)

Note: As this runtime is relatively new, all proposals which had finished when I started this implementation are just included in the WebAssembly in the above table. It includes mutable-global, nontrapping-float-to-int-conversions, sign-extension-ops, multi-value, reference-types, bulk-memory-operations.

Where can this run?

The most tested

macOS/amd64
wasm32-wasi (on toywasm itself)

Tested on CI

Natively

macOS/amd64
Ubuntu/amd64

With qemu (less coverage because of slowness)

Ubuntu/arm64
Ubuntu/armhf (Note: 32-bit)
Ubuntu/s390x (Note: big endian)

with even less coverage

wasm32-wasi

Occasionally tested manually

wasm32-wasi-threads
NuttX/xtensa
NuttX/sim on macOS/amd64
NetBSD/amd64

How slow is this?

A few observations with a simple benchmark with ffmpeg:

toywasm performs reasonably well for an interpreter. It's slower than the fastest interpreters though.
toywasm with annotations disabled is very slow.
toywasm's memory consumption is reasonable as well. (See "maximum resident set size" and "peak memory footprint". I don't know what exactly these metrics reported by macOS's time(1) mean though. Also, for some reasons, these numbers seem a bit unstable and vary among runs. If you know better metrics to show memory footprint, preferably available on macOS, please let us know.)
toywasm is far slower than JIT-based engines as expected.
Many of JIT-based engines use multiple threads to hide its compilation time. (You can see their "user" time is often longer than "real" because of this.)

+++++++++++ Interpreters +++++++++++
===== toywasm v0.0.16-3-ga2d6bc9 (default configuration)
----- ./b/toywasm --wasi --wasi-dir .video --
       86.71 real        86.58 user         0.08 sys
            86257664  maximum resident set size
        929670392407  instructions retired
            66359296  peak memory footprint
===== toywasm v0.0.16-3-ga2d6bc9 (-DTOYWASM_USE_SMALL_CELLS=OFF, likely faster)
----- ./b.fix/toywasm --wasi --wasi-dir .video --
       50.40 real        50.33 user         0.04 sys
            84967424  maximum resident set size
        479797274314  instructions retired
            65101824  peak memory footprint
===== toywasm v0.0.16-3-ga2d6bc9 (annotations disabled, very slow)
----- ./b/toywasm --wasi --wasi-dir .video --disable-jump-table --disable-localtype-cellidx --disable-resulttype-cellidx --
      682.84 real       680.84 user         0.64 sys
            81661952  maximum resident set size
       4898710336993  instructions retired
            61763584  peak memory footprint
===== Wasm3 v0.5.0 on x86_64
----- wasm3 --dir .video --
       17.61 real        17.57 user         0.02 sys
            85032960  maximum resident set size
        119194907960  instructions retired
            84410368  peak memory footprint
===== iwasm 1.1.2 (fast interpreter)
----- iwasm.fast --dir=.video
       21.39 real        21.33 user         0.04 sys
           142237696  maximum resident set size
        144849415636  instructions retired
           141557760  peak memory footprint
===== iwasm 1.1.2 (classic interpreter)
----- iwasm.classic --dir=.video
      202.76 real       202.70 user         0.05 sys
            80588800  maximum resident set size
        957134541184  instructions retired
            79937536  peak memory footprint
===== wasmedge version 0.12.0-alpha.1-18-g656ffd1c (interpreter)
----- wasmedge --dir .video --
      189.95 real       189.61 user         0.23 sys
           656748544  maximum resident set size
       1849723828297  instructions retired
           629514240  peak memory footprint
===== wasmi_cli 0.27.0
----- wasmi_cli --dir .video --
      100.43 real       100.32 user         0.07 sys
           135770112  maximum resident set size
        779646884133  instructions retired
           114585600  peak memory footprint
===== wazero v1.0.0-pre.9 (interpreter)
----- wazero run -interpreter -mount .video --
      156.02 real       159.32 user         0.57 sys
          1653088256  maximum resident set size
       2013758295761  instructions retired
          1645486080  peak memory footprint
+++++++++++ JIT ++++++++++++++++++++
===== iwasm 1.1.2 (fast jit)
----- iwasm.fast-jit --dir=.video --jit-codecache-size=100000000
        3.39 real         5.94 user         0.13 sys
           149753856  maximum resident set size
         55884564376  instructions retired
           139005952  peak memory footprint
===== wasmer 2.3.0
----- wasmer run --dir .video --
        7.05 real        49.11 user         2.37 sys
           791056384  maximum resident set size
        215496279124  instructions retired
           567320576  peak memory footprint
===== wasmtime-cli 3.0.1
----- wasmtime run --dir .video --
        6.49 real        40.89 user         1.42 sys
           437530624  maximum resident set size
        179230894816  instructions retired
           361287680  peak memory footprint
===== wazero v1.0.0-pre.9
----- wazero run -mount .video --
       14.54 real        19.36 user         0.45 sys
          1641086976  maximum resident set size
        164555653646  instructions retired
          1637302272  peak memory footprint

Why is this slow?

Unlike many of interpreters, toywasm aims to execute wasm bytecode directly where possible. That is, it doesn't "compile" wasm bytecode into intermediate code.

Unfortunately, wasm bytecode is simply not efficient to execute that way. It's basically designed to be somehow "compiled" at the load time.

Many of interpreters out there translate wasm bytecode to their internal bytecode for performance reasons. Wasm3 and WAMR "fast" interpreter work exactly that way. Even WAMR "classic" interpreter replaces some critical instructions in-place.

While toywasm maps wasm modules read-only and never modifies them in-place, it still generates a few types of offline annotations on the bytecode to avoid being too slow. While they are smaller than a full translation, you might consider them a kind of translation:
- Jump table.
  
  This is to speed up branching. Wasm branch instruction don't have a cheap way to know the destination address to jump to. Without this table, whenever we execute a forward-branching instruction, we need to parse every instructions the branch would skip over.
  
  This is optional and can be disabled by the --disable-jump-table runtime option.
- Local offset tables.
  
  This is to speed up access to locals (E.g. local.get) in case toywasm is built with variable-sized values, which is the default. (-D TOYWASM_USE_SMALL_CELLS=ON) Without this table, an access to a local is O(x) where x is the number of locals in the function, including function arguments.
  
  You can disable them by --disable-localtype-cellidx and the --disable-resulttype-cellidx runtime options.
  
  When toywasm is built to use fixed-sized values, (-D TOYWASM_USE_SMALL_CELLS=OFF) an access to a local is O(1). In that case, this table is not necessary or used, regardless of the above mentioned cli options. An implementation with fixed-sized values is probably more cpu-efficient especially on a 64-bit host. It's probably more memory-efficient as well because it doesn't involve the static overhead. (this table) The situation might change when we implement larger values. (v128 used by SIMD.) Or, when a module is using many 32-bit values on the operand stack. Or, when you are using many instances.
- Type annotations for value-polymorphic instructions.
  
  Some wasm instructions like drop works on a value of any types and there is no cheap way to know the type at runtime. While validating the bytecode, toywasm annotates these instructions with the sizes of the values so that the necessary infomation is available when executing the code later. While it's theoretically possible to calculate them at the execution time, it would be something like repeating the validation step. An alternative is to annotate each values on the stack. But it's likely even more expensive.
  
  This annotation is unconditionally enabled if and only if toywasm is built with variable-sized values, which is the default. (-D TOYWASM_USE_SMALL_CELLS=ON)
I don't like to use huge-switch statements or labels as values, which are well-known techniques to implement efficient interpreters.

Overhead of the annotations

The memory consumption for the above mentioned annotations depends on the wasm bytecode to annotate. The following is a few examples taken with wasm modules I happened to have.

toywasm

 wasm instructions to annotate       368444 bytes
           jump table overhead        80192 bytes
      type annotation overhead        16568 bytes
  local type cell idx overhead        16480 bytes
 result type cell idx overhead         1006 bytes

spidermonkey

 wasm instructions to annotate      4143921 bytes
           jump table overhead      1166528 bytes
      type annotation overhead       142304 bytes
  local type cell idx overhead       109880 bytes
 result type cell idx overhead         2448 bytes

ffmpeg

 wasm instructions to annotate     15442793 bytes
           jump table overhead      2715248 bytes
      type annotation overhead       335344 bytes
  local type cell idx overhead       388730 bytes
 result type cell idx overhead         5882 bytes

A WebAssembly interpreter