I love Markdown with all my heart. It's a markup language so simple to understand that even people who are not software engineers can use it in a few minutes.
The flip side of that coin is that Markdown is limited. It lets you create various title levels, bold, italics, strikethrough, tables, links, and a bit more, but not much beyond that.
When it comes to more complex documents, most people resort to a full-fledged office suite like Microsoft Office or LibreOffice. Both have their merits, but office file formats are brittle and heavy.
The alternative is to use another, more capable markup language. Academics used to be into LaTeX, but it's often tedious to use. Typst emerged more recently as a simpler yet useful markup language for creating well-formatted documents.
Tinymist is a language server for Typst. It provides the usual language server features: semantic highlighting, code actions, formatting, and so on.
But it really stands out with its live preview feature, which keeps your cursor in sync in Helix as you click around the live preview!
I only had to install it with:

```
$ brew install tinymist
```

I then configured Helix to use tinymist for Typst documents, enabling live preview along the way. This happens, of course, in ~/.config/helix/languages.toml:
```toml
[language-server.tinymist]
command = "tinymist"
config = { preview.background.enabled = true, preview.background.args = ["--data-plane-host=127.0.0.1:23635", "--invert-colors=never", "--open"] }

[[language]]
name = "typst"
language-servers = ["tinymist"]
```

A warm thank you to my lovely friend Felix for showing me the live preview mode of tinymist!
Over on his excellent blog, Matt Keeter posts some results from having ported a bytecode virtual machine to tail-calling style. He finds that his tail-calling interpreter written in Rust beats his switch-based interpreter, and even beats hand-coded assembly on some platforms.
He also compares tail-calling versus switch-based interpreters on WebAssembly, and concludes that performance of tail-calling interpreters in Wasm is terrible:
1.2× slower on Firefox, 3.7× slower on Chrome, and 4.6× slower in wasmtime. I guess patterns which generate good assembly don't map well to the WASM stack machine, and the JITs aren't smart enough to lower it to optimal machine code.

In this article, I would like to argue the opposite: patterns that generate good assembly map just fine to the Wasm stack machine, and the underperformance of V8, SpiderMonkey, and Wasmtime is an accident.
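For readers who haven't seen the two styles side by side, here is a minimal sketch of the difference. It is in C++ with Clang's [[clang::musttail]] attribute rather than Matt's Rust, and the toy opcodes and handler names are mine, not Raven's:

```cpp
// Switch-based vs tail-calling dispatch for a two-opcode toy VM.
// Compile with Clang; [[clang::musttail]] guarantees the tail call.
#include <cstdint>
#include <cstdio>

enum Op : uint8_t { OP_INC, OP_HALT };

// Switch-based: one big loop; every opcode shares a single indirect branch.
int run_switch(const uint8_t* code) {
    int acc = 0;
    for (;;) {
        switch (*code++) {
        case OP_INC:  acc++; break;
        case OP_HALT: return acc;
        }
    }
}

// Tail-calling: one small function per opcode, each ending in a guaranteed
// tail call, so each handler gets its own indirect branch.
using Handler = int (*)(const uint8_t* pc, int acc);
extern const Handler table[];

static int op_inc(const uint8_t* pc, int acc) {
    [[clang::musttail]] return table[*pc](pc + 1, acc + 1);
}
static int op_halt(const uint8_t*, int acc) { return acc; }

const Handler table[] = { op_inc, op_halt };

int run_tailcall(const uint8_t* code) {
    return table[*code](code + 1, 0);
}

int main() {
    const uint8_t prog[] = { OP_INC, OP_INC, OP_INC, OP_HALT };
    std::printf("%d %d\n", run_switch(prog), run_tailcall(prog));  // 3 3
}
```

The payoff of the tail-calling shape is that each handler's closing indirect branch gets its own branch-predictor slot; that is the property a compiler has to preserve when lowering this pattern, whether to machine code or to Wasm.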
some numbers

I re-ran Matt’s experiment locally on my x86-64 machine (AMD Ryzen Threadripper PRO 5955WX). I tested three toolchains:
Compiled natively via cargo / rustc
Compiled to WebAssembly, then run with Wasmtime
Compiled to WebAssembly, then run with Wastrel
For each of these toolchains, I tested Raven as implemented in Rust in both “switch-based” and “tail-calling” modes. Additionally, Matt has a Raven implementation written directly in assembly; I test this as well, for the native toolchain. All results use nightly/git toolchains from 7 April 2026.
My results confirm Matt’s for the native and wasmtime toolchains, but wastrel puts them in context:
We can read this chart from left to right: a switch-based interpreter written in Rust is 1.5× slower than a tail-calling interpreter, and the tail-calling interpreter just about reaches the speed of hand-written assembler. (Testing on AArch64, Matt even sees the tail-calling interpreter beating his hand-written assembler.)
Then moving to WebAssembly run using Wasmtime, we see that Wasmtime takes 4.3× as much time to run the switch-based interpreter, compared to the fastest run from the hand-written assembler, and worse, actually shows 6.5× overhead for the tail-calling interpreter. Hence Matt’s conclusions: there must be something wrong with WebAssembly.
But if we compare to Wastrel, we see a different story: Wastrel runs the basic interpreter with 2.4× overhead, and the tail-calling interpreter improves on this marginally with a 2.3× overhead. Now, granted, two-point-whatever-x is not one; Matt’s Raven VM still runs slower in Wasm than when compiled natively. Still, a tail-calling interpreter is inherently a pretty good idea.
where does the time go

When I think about it, there’s no reason that the switch-based interpreter should be slower when compiled via Wastrel than when compiled via rustc. Memory accesses via Wasm should actually be cheaper due to 32-bit pointers, and all the rest of it should be pretty much the same. I looked at the assembly that Wastrel produces and I see most of the patterns that I would expect.
I do see, however, that Wastrel repeatedly reloads a struct field containing the address (and size) of main memory. I need to figure out a way to keep this value in registers. I don’t know what’s up with the other Wasm implementations here; for Wastrel, I get 98% of time spent in the single interpreter function, and surely this is bread-and-butter for an optimizing compiler such as Cranelift. I tried pre-compilation in Wasmtime but it didn’t help. It could be that there is a different Wasmtime configuration that allows for higher performance.
Things are more nuanced for the tail-calling VM. When compiling natively, Matt is careful to use a preserve_none calling convention for the opcode-implementing functions, which allows LLVM to allocate more registers to function parameters; this is just as well, as it seems that his opcodes have around 9 parameters. Wastrel currently uses GCC’s default calling convention, which only has 6 registers for non-floating-point arguments on x86-64, leaving three values to be passed via global variables (described here); this obviously will be slower than the native build. Perhaps Wastrel should add the equivalent annotation to tail-calling functions.
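To make the calling-convention point concrete, here is a hedged sketch of the same trick in C++/Clang terms (my illustration, not Raven's code). __attribute__((preserve_none)), available since Clang 19 on x86-64 and AArch64, stops reserving callee-saved registers, so far more than the SysV ABI's six integer argument registers become available, and all nine parameters below can travel in registers:

```cpp
#include <cstdint>
#include <cstdio>

using Word = uint64_t;

// preserve_none handlers: the whole VM state rides in registers.
__attribute__((preserve_none))
Word op_inc(const uint8_t* pc, Word r0, Word r1, Word r2, Word r3,
            Word r4, Word r5, Word r6, Word r7);
__attribute__((preserve_none))
Word op_halt(const uint8_t* pc, Word r0, Word r1, Word r2, Word r3,
             Word r4, Word r5, Word r6, Word r7);

using Handler = decltype(&op_inc);  // the pointer type carries the convention
const Handler dispatch[] = { op_inc, op_halt };

__attribute__((preserve_none))
Word op_inc(const uint8_t* pc, Word r0, Word r1, Word r2, Word r3,
            Word r4, Word r5, Word r6, Word r7) {
    // Under the default SysV ABI only six of these nine arguments would
    // arrive in registers; the rest would be passed through memory.
    [[clang::musttail]] return dispatch[*pc](pc + 1, r0 + 1, r1, r2, r3,
                                             r4, r5, r6, r7);
}

__attribute__((preserve_none))
Word op_halt(const uint8_t*, Word r0, Word, Word, Word,
             Word, Word, Word, Word) {
    return r0;
}

int main() {
    const uint8_t prog[] = { 0, 0, 1 };  // inc, inc, halt
    Word r = dispatch[prog[0]](prog + 1, 0, 0, 0, 0, 0, 0, 0, 0);
    std::printf("%llu\n", (unsigned long long)r);  // prints 2
}
```

A Wasm engine makes the analogous choice when it picks the internal ABI for its compiled functions, which is why the number of register arguments it offers matters so much here.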
On the one hand, Cranelift (and V8) are a bit more constrained than Wastrel by their function-at-a-time compilation model that privileges latency over throughput; and as they allow Wasm modules to be instantiated at run-time, functions are effectively closures, in which the “instance” is an additional hidden dynamic parameter. On the other hand, these compilers get to choose an ABI; last I looked into it, SpiderMonkey used the equivalent of preserve_none, which would allow it to allocate more registers to function parameters. But it doesn’t: you only get 6 register arguments on x86-64, and only 8 on AArch64. Something to fix, perhaps, in the Wasm engines, but also something to keep in mind when making tail-calling virtual machines: there are only so many registers available for VM state.
the value of time

Well friends, you know us compiler types: we walk a line between collegial and catty. In that regard, I won’t deny that I was delighted when I saw the Wastrel numbers coming in better than Wasmtime! Of course, most of the credit goes to GCC; Wastrel is a relatively small wrapper on top.
But my message is not about the relative worth of different Wasm implementations. Rather, it is that performance oracles are a public good: a fast implementation of a particular algorithm is of use to everyone who uses that algorithm, whether they use that implementation or not.
This happens in two ways. Firstly, faster implementations advance the state of the art, and through competition-driven convergence will in time result in better performance for all implementations. Someone at Google will see these benchmarks, turn them into an OKR, and golf their way to a faster web and also hopefully a bonus.
Secondly, there is a dialectic between the state of the art and our collective imagination of what is possible, and advancing one will eventually ratchet the other forward. We can forgive the conclusion that “patterns which generate good assembly don’t map well to the WASM stack machine” as long as Wasm implementations fall short; but having shown that good performance is possible, our toolkit of applicable patterns in source languages also expands to new horizons.
Well, that is all for today. Until next time, happy hacking!
In an earlier blog post we found out that Pystd's simple sorting algorithm implementations were 5-10% slower than their stdlibc++ counterparts. The obvious follow up nerd snipe is to ask "can we make the Pystd implementation faster than stdlibc++?"
For all tests below, the data set used was 10 million consecutive 64-bit integers shuffled in a random order. The order was the same for all algorithms.
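As a sketch, input like that could be generated along these lines; the fixed seed is my addition, the post only says the order was the same across algorithms:

```cpp
#include <algorithm>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

std::vector<int64_t> make_input() {
    std::vector<int64_t> v(10'000'000);
    std::iota(v.begin(), v.end(), int64_t{0});  // 0, 1, 2, ...
    std::mt19937_64 rng(42);                    // fixed seed: identical order for every run
    std::shuffle(v.begin(), v.end(), rng);
    return v;
}

int main() {
    auto v = make_input();
    std::printf("first value: %lld of %zu\n", (long long)v[0], v.size());
}
```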
Stable sort

It turns out that the answer for stable sorting is "yes, surprisingly easily". I made a few obvious tweaks (whose details I don't even remember any more) and got the runtime down to 0.86 seconds. This is approximately 5% faster than std::stable_sort. Done. Onwards to unstable sort.
Unstable sort

This one was not, as they say, a picnic. I suspect that stdlib developers have spent more time optimizing std::sort than std::stable_sort simply because it is used a lot more.
After all the improvements I could think of were done, Pystd's implementation was consistently 5-10% slower. At this point I started cheating and examined how stdlibc++'s implementation worked to see if there were any optimization ideas to steal. Indeed there were, but they did not help.
Pystd's insertion sort moves elements by pairwise swaps. Stdlibc++ does it by moving the last item to a temporary, shifting the array elements onwards and then moving the stored item to its final location. I implemented that. It made things slower.
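In code, the two inner-loop styles look roughly like this (my sketch, not the actual Pystd or stdlibc++ sources):

```cpp
#include <cstddef>
#include <cstdio>
#include <utility>

// Swap-based: bubble a[i] leftward with pairwise swaps (three moves per step).
template <typename T>
void insert_by_swaps(T* a, size_t i) {
    while (i > 0 && a[i] < a[i - 1]) {
        std::swap(a[i], a[i - 1]);
        --i;
    }
}

// Shift-based (the stdlibc++ style): hold a[i] in a temporary, shift the
// larger elements right one slot each (one move per step), then drop the
// temporary into the hole.
template <typename T>
void insert_by_shift(T* a, size_t i) {
    T tmp = std::move(a[i]);
    size_t j = i;
    while (j > 0 && tmp < a[j - 1]) {
        a[j] = std::move(a[j - 1]);
        --j;
    }
    a[j] = std::move(tmp);
}

int main() {
    int a[] = { 3, 1, 2 }, b[] = { 3, 1, 2 };
    for (size_t i = 1; i < 3; ++i) { insert_by_swaps(a, i); insert_by_shift(b, i); }
    std::printf("%d %d %d / %d %d %d\n", a[0], a[1], a[2],
                b[0], b[1], b[2]);  // 1 2 3 / 1 2 3
}
```

The shift version does one move per step instead of a three-move swap, which is why it looks faster on paper, the measurement above notwithstanding.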
Stdlibc++'s moves use memmove instead of copying (at least according to code comments). I implemented that. It made things slower.
Then I implemented shell sort to see if it made things faster. It didn't. It made them a lot slower. So did radix sort.
Then I reworked the way pivot selection is done and realized that if you do it in a specific way, some elements move to their correct partitions as a side effect of median selection. I implemented that and it did not make things faster. It did not make them slower either, but the end result should be more resistant to bad pivot selection, so I left it in.
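The post doesn't spell out the exact scheme, but one classic way to get that side effect is to sort the three sampled elements in place while picking the median; this sketch is my guess at the shape, not Pystd's code:

```cpp
#include <cstddef>
#include <cstdio>
#include <utility>

// Pick the median of a[lo], a[mid], a[hi] by sorting the three samples in
// place. Afterwards a[lo] <= a[mid] <= a[hi], so the first and last elements
// already sit in their correct partitions and the partition loop can skip them.
template <typename T>
size_t median_of_three(T* a, size_t lo, size_t hi) {
    size_t mid = lo + (hi - lo) / 2;
    if (a[mid] < a[lo]) std::swap(a[mid], a[lo]);
    if (a[hi] < a[mid]) std::swap(a[hi], a[mid]);
    if (a[mid] < a[lo]) std::swap(a[mid], a[lo]);
    return mid;  // a[mid] is now the median of the three samples
}

int main() {
    int a[] = { 9, 4, 1 };
    size_t m = median_of_three(a, 0, 2);
    std::printf("%zu: %d %d %d\n", m, a[0], a[1], a[2]);  // 1: 1 4 9
}
```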
At some point the implementation grew a bug which only appeared with very large data sets. For debugging purposes I reduced the limit where introsort switches from quicksort to insertion sort from 16 to 8. I got the bug fixed, but the change made sorting a lot slower. As it should.
But this raised a question: would increasing the limit from 16 to 32 make things faster? It turns out that it did. A lot. Of all the performance improvements I implemented, this one yielded the biggest gain by a fairly wide margin. Going to 64 elements made it even faster, but that made other algorithms using insertion sort slower, so 32 it is. For now at least.
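To show where that constant lives, here is a toy introsort of my own (again not Pystd's code); the cutoff decides which ranges are left nearly sorted for the single insertion-sort pass at the end:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

constexpr std::ptrdiff_t kInsertionCutoff = 32;  // was 16; 32 measured fastest

// Quicksort while ranges are big; heapsort when the depth budget runs out;
// leave ranges of <= kInsertionCutoff elements for the final insertion pass.
void introsort_loop(int* lo, int* hi, int depth) {
    while (hi - lo > kInsertionCutoff) {
        if (depth-- == 0) {
            std::make_heap(lo, hi);
            std::sort_heap(lo, hi);
            return;
        }
        int pivot = lo[(hi - lo) / 2];
        int* p = std::partition(lo, hi, [&](int x) { return x < pivot; });
        introsort_loop(p, hi, depth);
        hi = p;  // iterate into the left half instead of recursing
    }
}

void introsort(std::vector<int>& v) {
    introsort_loop(v.data(), v.data() + v.size(), 64 /* ~2*log2(n), generous */);
    // Final pass: plain insertion sort; each element is already close to
    // its final position, so this is one cheap linear-ish sweep.
    for (size_t i = 1; i < v.size(); ++i) {
        int tmp = v[i];
        size_t j = i;
        for (; j > 0 && tmp < v[j - 1]; --j) v[j] = v[j - 1];
        v[j] = tmp;
    }
}

int main() {
    std::vector<int> v(1000);
    for (size_t i = 0; i < v.size(); ++i) v[i] = (int)((i * 2654435761u) % 1000);
    introsort(v);
    std::printf("%s\n", std::is_sorted(v.begin(), v.end()) ? "sorted" : "BUG");
}
```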
After a few final tweaks I managed to finally beat stdlibc++. By how much, you ask? Pystd's best observed time was 0.754 seconds while stdlibc++'s was 0.755 seconds. And it happened only once. But that's enough for me.