01 November 2023
Some Benchmarks and Issues with Parallel Processing in Crystal
This is all some stuff I’ve been considering for a few weeks. The post itself came about due to my own Mastodon thread.
Benben, my VGM player, does just two things: it plays VGM files, and it bulk-converts them to WAV or AU. That’s it. During playback, it uses three processes to help split up the work, and to help further isolate the actual playback code from the user interface code. The three processes it uses are:
- The main process, which handles the actual rendering to PCM data and sending it to the driver.
- A process that reads keystrokes and sends them to whatever is listening via Channels.
- A process that draws the user interface. This gets its data from the main process using Atomics.
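The communication pattern above can be sketched roughly like this (illustrative only; this is not Benben’s actual code, and the names are made up):

```crystal
# Illustrative sketch of the pattern described above, NOT Benben's actual
# code. One Fiber plays the role of the keystroke reader and sends keys
# over a Channel; an Atomic holds state shared with the UI drawing code.
keys = Channel(Char).new
samples_rendered = Atomic(Int64).new(0_i64) # hypothetical shared counter

spawn do # stand-in for the keystroke reader
  keys.send('q')
end

samples_rendered.add(44_100_i64) # the renderer would update this as it works

key = keys.receive
puts key                  # => q
puts samples_rendered.get # => 44100
```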
Benben is written in Crystal, which provides a Fiber class to help you split up your work. Fibers are very similar to operating system threads, except they are cooperatively scheduled by the Crystal runtime itself[1, 2]. At the time of writing, Crystal also defaults to scheduling these on a single operating system thread[2]. However, the compile-time flag -Dpreview_mt allows the scheduler to use multiple worker threads, where each worker thread has its own scheduler and a queue of runnable Fibers[3]. Which worker thread gets each new Fiber is decided in a round-robin way[3]. Given this, I think it’s safe to say that Crystal uses an M:1 model by default, and a hybrid M:N threading model when a program is compiled with the -Dpreview_mt flag[4].
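As a quick illustration of the Fiber model (a minimal sketch, not taken from Benben):

```crystal
# Minimal Fiber example: spawn four Fibers, each sends a result back over
# a Channel, and the main Fiber collects them. With -Dpreview_mt these may
# run on multiple worker threads; without it, they all share one thread.
channel = Channel(Int32).new

4.times do |i|
  spawn do
    channel.send(i * 10) # stand-in for real work
  end
end

results = Array(Int32).new(4) { channel.receive }
puts results.sum # => 60
```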
Benben currently enforces the use of -Dpreview_mt when compiling it. If you use the included Rakefile, this is taken care of for you. There is an experimental branch called allow-build-without-preview_mt that lets you build without -Dpreview_mt, but the results leave a lot to be desired. For example, normal playback of a VGM works[5], but the user interface flickers heavily, and the program will deadlock if you press “q” to quit. Rendering works, but is very slow.
Still, the design of Benben is such that -Dpreview_mt is the only model that really makes sense. Rendering VGMs to WAV or AU files is one of those problems that are “embarrassingly parallel”[6]. Each song is self-contained, so no song depends on any other song during rendering, meaning that one song can be rendered per thread of execution. This is exactly what Benben does. With its default options, Benben will spawn X Fibers, where each Fiber renders a single song; X is equal to the number of logical cores or the total number of songs, whichever is less. X can also be controlled with the --jobs command line option, which lets a user request a specific number of Fibers, though the actual count will still be lower if there are fewer songs than requested jobs.
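The Fiber-count rule described above boils down to something like this (a simplified, hypothetical helper; Benben’s real code differs):

```crystal
# Hypothetical helper showing Benben's Fiber-count rule: use --jobs if
# given, else the logical core count, but never spawn more Fibers than
# there are songs to render.
def job_count(songs : Int32, logical_cores : Int32, requested_jobs : Int32? = nil) : Int32
  jobs = requested_jobs || logical_cores
  jobs > songs ? songs : jobs
end

puts job_count(61, 20)    # defaults: min(cores, songs) => 20
puts job_count(10, 20)    # fewer songs than cores      => 10
puts job_count(61, 20, 4) # user passed --jobs 4        => 4
```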
However, remember that Fibers are not threads: M Fibers get scheduled over N operating system threads. Programs compiled with Crystal default to using only four operating system threads when -Dpreview_mt is used, but this can be changed at runtime through the CRYSTAL_WORKERS environment variable. So if you want all your Fibers distributed over 12 operating system threads, you’d build your program with -Dpreview_mt, then run it like this:
CRYSTAL_WORKERS=12 ./myprogram
This is pretty neat and reminds me of Go and its goroutines. But this is also where my problems start.
A Few Quick Definitions
- 1:1 = One thread is one operating system thread. No Fibers or such involved.
- 1F:1T = 1 Fiber scheduled on 1 operating system thread. In other words, one thread per Fiber.
- MF:NT = M Fibers scheduled over N operating system threads.
- logical cores = The number of hardware threads a CPU can handle at once (usually 2x the cores when hyperthreading is in play).
- First-Class Thread = The ability to spawn and control a thread directly within a program.
A Hypothetical Example
Let’s take my desktop as an example. It has a Core i9-10850K processor, which makes it a 10-core/20-thread machine. Let’s just call this “20 logical cores” for short. In my ~/.bashrc I have CRYSTAL_WORKERS=20 because I want Crystal programs to use up to all my cores by default.
So: 20 logical cores, and CRYSTAL_WORKERS=20. Now let’s also assume I’m rendering 61 VGM files to WAV. Benben knows I have 20 logical cores, so unless I specify the --jobs option when I run it, Benben will spawn 20 Fibers to render 20 VGM files to WAV in parallel. In this situation, Benben is effectively running one Fiber per scheduler thread because I have CRYSTAL_WORKERS set to 20. And, since the operating system threads are preemptively scheduled, my Fibers are effectively preemptively scheduled. This is exactly the behavior I want: N songs = N preemptive threads, a 1F:1T model.
If I reduce CRYSTAL_WORKERS (e.g. to four, Crystal’s default when the environment variable is never set), then Benben tries to render 20 songs at once over 20 Fibers scheduled across fewer than 20 operating system threads. The MF:NT model absolutely works here, though it is slower than when Benben uses a 1F:1T model.
But I’m a rather technical user. I not only know what an environment variable is and how to use one, but I also know how the CRYSTAL_WORKERS environment variable works.
Let’s assume a non-technical user with a machine identical to mine downloads a pre-compiled binary of Benben and tries to render 61 VGM files to WAV. CRYSTAL_WORKERS won’t be set for them, so the Crystal runtime will only launch four worker threads when Benben starts. They are already getting a reduced experience. Taking this further, let’s assume they are at least technical enough to open their system monitoring program, and upon doing so, they see that only four of their CPU cores are doing work. “What? I thought Benben was supposed to use all my cores to render songs? Is it broken?” Now not only do they have a subpar experience, but it looks like my software is misbehaving.
Now, let’s assume they have a way of contacting me (Mastodon, email, Fossil ticket, whatever) and they ask me what’s up with this behavior. I could explain to them that it’s due to how the language Benben is written in (Crystal) works. Unfortunately, this not only makes my program look kinda subpar, but makes Crystal look bad as well. I could also tell them to use fewer --jobs, but then I’m back in a situation where Benben looks like crap software. After all, they have 20 cores and they want to fully use them, damnit! Without messing with “environment variables” or whatever the heck those are!
Overall, a workable but underwhelming experience.
Of course, it just gets even worse if the program was compiled without -Dpreview_mt.
The Problem
The problem basically boils down to Crystal’s MF:NT model of multitasking. While this is an excellent model for many problems (servers that need to answer requests immediately come to mind), it’s not a one-size-fits-all solution. And unfortunately, Crystal does not have true First-Class Threads that I can spawn myself[7]. This manifests as performance issues, which are not only frustrating in nature, but could potentially make Benben look bad, and possibly Crystal as well if push came to shove. Crystal does seem to have a natural drive to be optimized for backend applications that run on servers, but it also seems to want to be a general-purpose programming language. If it’s limited to an MF:NT model (or an M:1 model, when -Dpreview_mt isn’t used), then it simply misses out on being a good option for a whole set of programs.
While looking into this problem, I have come across a few threads[11, 12] that touch on issues similar to my own. In them, the question of what the default should be tends to come up. Should Crystal default to four worker threads? Or should it default to spawning X worker threads, where X equals the number of logical CPU cores? If it did the latter, what would that mean for RAM usage? How about contention? No two programs are the same, after all. There was also a suggested solution: a compile-time tuning knob that would set the default number of worker threads.
But, to all of these questions, I feel the best solution is for the language not to limit the programs written in it. A good program should offer the tuning knobs itself; the language should not impose them. This is why Benben has a --jobs option. I feel this is especially true for programs that run on servers, where tuning is vital. If the language imposes a limit itself, that’s one less knob for the end user (or at least a more troublesome one).
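A program-level knob like this is cheap to expose with Crystal’s standard OptionParser. Here is a sketch (Benben’s actual flag handling differs, and the banner text is made up):

```crystal
require "option_parser"

# Sketch of a program-level --jobs tuning knob (not Benben's actual
# parser). Defaults to the logical core count; the user can override it.
jobs = System.cpu_count.to_i

OptionParser.parse do |parser|
  parser.banner = "Usage: render [options] FILES..."
  parser.on("--jobs N", "Number of parallel render jobs") { |n| jobs = n.to_i }
end

puts "Using #{jobs} jobs"
```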
As for the suggested compile-time knob for the default number of workers: this seems like an even worse idea. I have no way to predict the number of cores an end user will have. Of the computers I own, a few have four logical cores (some in big.LITTLE configurations), one has 12, one has six (in a big.LITTLE configuration), and one has 20. A compile-time knob just moves the goal posts and does nothing to solve the problems I discussed in the previous section.
There are a few stopgap solutions, though.
Stopgap Solutions
One possible solution is to detect the number of CPU cores very early at program launch, then modify the number of worker threads the runtime spawns. This would require redefining the main function in Crystal:

fun main(argc : Int32, argv : UInt8**) : Int32
  LibC.setenv("CRYSTAL_WORKERS", System.cpu_count.to_s, 1)
  Crystal.main(argc, argv)
end
The problem with this solution is that it runs WAY before Benben processes command line arguments, so --jobs still gets wonky. I’m also not 100% sure whether System.cpu_count is available at that point. But it’s a start.
A second possible solution is to monkey patch the Crystal runtime’s scheduler:
class Crystal::Scheduler
  private def self.worker_count
    env_workers = ENV["CRYSTAL_WORKERS"]?

    if env_workers && !env_workers.empty?
      workers = env_workers.to_i?
      if !workers || workers < 1
        Crystal::System.print_error "FATAL: Invalid value for CRYSTAL_WORKERS: #{env_workers}\n"
        exit 1
      end
      workers
    else
      System.cpu_count
    end
  end
end
Unfortunately, this runs into the same problem as any monkey patch: you have to track the upstream code to ensure the patch continues to work properly with the rest of the program. It does seem like a good short-term solution, however (though be sure to check out the Benchmarks section below).
The Ideal Solution (In My Opinion)
The solution I would prefer long-term is for Crystal to give us a proper Thread API. It can keep its Fibers and the scheduler, but it should expose the scheduler as a public API so we can disable or tweak it to our needs. Most importantly: allow us to spawn native First-Class Threads (operating system threads). After all, sometimes you just want a real operating system thread, not a Fiber.
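For what it’s worth, the undocumented internal Thread class mentioned in footnote 7 already hints at what such an API could look like. The sketch below leans on that internal, undocumented API, so it may break between Crystal versions; treat it as hypothetical:

```crystal
# HYPOTHETICAL: relies on Crystal's undocumented, internal Thread class
# (see footnote 7), which may change between releases. Each value is
# computed on a real operating system thread, then the threads are joined.
results = Array(Int32).new(4, 0)

threads = Array(Thread).new
4.times do |i|
  threads << Thread.new { results[i] = i * i }
end

threads.each(&.join)
puts results.sum
```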
The Alternatives
One thing to note is that Benben did not start life as a Crystal program. My usual MO is to prototype things in Common Lisp, then examine whether it would be beneficial to write the actual production code in another language (read: Crystal). Benben, and its underlying VGM library YunoSynth, were no different. They started life in Common Lisp, and I still have that code. In fact, I’ve even expanded it a bit out of curiosity and have found it to perform almost identically to the Crystal version as far as CPU usage goes[8]. More importantly: the Common Lisp implementation I target (SBCL) gives me easy-to-use, true First-Class Threads. There are also libraries[9] that let me use Green Threads (which are very similar to a Fiber in Crystal) alongside normal First-Class Threads if I really, really want some cooperative multitasking (highly unlikely for Benben). There’s also the SB-SIMD package that comes with SBCL, which gives me a very nice interface for SIMD work, potentially making the Common Lisp version faster than the Crystal version currently is.
Basically, I could always just drop Crystal and go back to writing Benben[10] in Common Lisp.
I really don’t want to do this, though. Not because of difficulty - it wouldn’t be at all difficult, and only mildly time consuming, since all the hard parts are already solved. I don’t want to do this because I don’t want to hurt Crystal. It’s still a new language, and I really hope to bring (positive) attention to it with Benben. Not that Benben is the only program I have written in Crystal… I have many others, including the Gemini server that served you this page. But Benben and its cousin, midi123, are probably the most publicly visible of my programs.
But, if push came to shove, the mother in me will come out and protect my own child first and foremost.
Benchmarks
So, I’ve talked about how Benben works in regards to Fibers, -Dpreview_mt, and the CRYSTAL_WORKERS environment variable. How about some actual numbers?
These benchmarks were done on my desktop computer, some of the specs of which are:
- Intel Core i9-10850K CPU
- 64GB RAM (DDR4 3200)
- A bunch of HDDs
- SSD (Samsung 860 EVO, files were rendered onto it)
- Slackware Linux 15.0 with a custom 6.5.9 kernel.
Benben was compiled using Crystal v1.10.1. The compiler was built on my system from source with LLVM 13.0.0. Each run of Benben was performed on a system with just a terminal and Emacs (for record keeping) running in a Sawfish X11 session to keep other programs from interfering as much as possible. There were at least 10 seconds of pause time between each run to allow the CPU to cool to a “normal” base temperature.
The numbers below are geometric means taken from three runs for each configuration. Three things were measured: the time taken, the average samples rendered per second, and the memory used. The time taken and average samples per second are reported directly by Benben and do NOT include the time it takes to start up and load the XSPF playlist. The memory measurements were taken using a program called xtime and DO take the startup and XSPF load into account. The source code (Crystal) for xtime is included above the footnotes in its own section.
The configuration permutations aren’t exhaustive, but I think I got the important ones. If someone wants me to add a few more on, I will.
These runs all rendered 61 songs to 44.1 kHz 16-bit stereo WAV. The songs were referenced in an XSPF playlist and consisted of songs from the Sharp X68000 computer and the NeoGeo, both of which require somewhat more involved emulation cores (YM2151 and YM2610, respectively).
These benchmarks were done using two checkouts from Benben’s repository:
- aebbe73c2eeec1e091e7a9834311b7749bf1e426 for the ones with Fiber.yield added to the source code.
- f2a7ff0c20cdad6be3fb2566718ffca010a4c148 for the ones without Fiber.yield added.
If you wish to examine the source repository, it’s available here.
The YunoSynth code was at checkout 5f351d5e3c31782ef1e56dcaf0b0c8664f79af82. Its repository is here.
Benben was built using this command, with -Dpreview_mt added/removed as needed:
shards build -p -Dpreview_mt -Dremiconf_no_hjson --release --no-debug -Dyunosynth_wd40
Some issues to note:
- With -Dpreview_mt: when CRYSTAL_WORKERS is 2 or 1, the program has problems updating the display (lots of flicker), and it deadlocks on exit. 3 workers or more is fine.
- Without -Dpreview_mt: the program has problems updating the display (lots of flicker), but there are no issues exiting.
The source code (Common Lisp) used to calculate the geometric means, along with the xtime code, is in its own section above the footnotes. The data file with the raw values can be accessed here (it can be directly processed with the Common Lisp code).
Finally, here are the results when using -Dpreview_mt:
CRYSTAL_WORKERS=20, --jobs 20, WITHOUT Fiber.yield added
  Time:            00:00:23:88967938
  Samples per Sec: 6,806,318
  Memory Used:     112.62 MiB

CRYSTAL_WORKERS=20, --jobs 4, WITHOUT Fiber.yield added
  Time:            00:00:49:561379837
  Samples per Sec: 3,157,491
  Memory Used:     82.06 MiB

CRYSTAL_WORKERS=20, --jobs 1, WITHOUT Fiber.yield added
  Time:            00:01:25:153285603
  Samples per Sec: 1,829,959
  Memory Used:     53.44 MiB

CRYSTAL_WORKERS=20, --jobs 20, Fiber.yield added
  Time:            00:00:21:316036046
  Samples per Sec: 7,445,727
  Memory Used:     152.65 MiB

CRYSTAL_WORKERS=4, --jobs 20, Fiber.yield added
  Time:            00:00:49:71518083
  Samples per Sec: 3,186,044
  Memory Used:     113.33 MiB

CRYSTAL_WORKERS=4, --jobs 4, Fiber.yield added
  Time:            00:01:03:478200751
  Samples per Sec: 2,462,458
  Memory Used:     73.96 MiB

CRYSTAL_WORKERS=4, --jobs 1, Fiber.yield added
  Time:            00:01:28:427111020
  Samples per Sec: 1,766,431
  Memory Used:     58.58 MiB

CRYSTAL_WORKERS=1, --jobs 20, Fiber.yield added
  Time:            00:02:31:215964094
  Samples per Sec: 1,030,649
  Memory Used:     13.94 GiB

CRYSTAL_WORKERS=1, --jobs 1, Fiber.yield added
  Time:            00:02:26:233878863
  Samples per Sec: 1,066,631
  Memory Used:     1.20 GiB
And here are the results WITHOUT the -Dpreview_mt flag:
--jobs 20, Fiber.yield added
  Time:            00:02:25:331626453
  Samples per Sec: 1,072,926
  Memory Used:     112.62 MiB

--jobs 4, Fiber.yield added
  Time:            00:02:24:504344037
  Samples per Sec: 1,078,986
  Memory Used:     82.06 MiB

--jobs 1, Fiber.yield added
  Time:            00:02:24:695090870
  Samples per Sec: 1,077,629
  Memory Used:     53.44 MiB
Benchmark Discussion
Benben is fastest when compiled with -Dpreview_mt and allowed to use 20 worker threads and 20 jobs. This is when it behaves as if it has true, preemptively scheduled First-Class Threads on this machine. When compiled with -Dpreview_mt, Benben is slowest when using one worker thread with 20 jobs. I interpret this as meaning a 1F:1T model is ideal, and that a true 1:1 “no Fibers, all First-Class Threads” model would potentially be even better.
Reducing the worker threads causes a larger performance drop than leaving CRYSTAL_WORKERS set to 20 and reducing the number of --jobs. I interpret this as meaning the MF:NT model works best when M <= N, and that MF:NT results in slower performance than a 1F:1T model.
Fiber.yield usually results in slightly faster performance when -Dpreview_mt is used, but also slightly higher RAM usage. However, when compiled with -Dpreview_mt and run with one worker thread, adding Fiber.yield makes memory usage absurdly high (in the gigabytes) compared to using more worker threads. Fiber.yield must be used when compiling without -Dpreview_mt because of the cooperative multitasking.
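To make the cooperative requirement concrete, here is a minimal sketch of why Fiber.yield matters without -Dpreview_mt (illustrative only; not Benben’s code):

```crystal
# Without -Dpreview_mt, Fibers are cooperatively scheduled on one thread:
# a long-running Fiber must call Fiber.yield or nothing else ever runs.
counter = 0

spawn do
  10.times do
    counter += 1
    Fiber.yield # hand control back so other Fibers (e.g. the UI) can run
  end
end

# Drive the scheduler from the main Fiber until the worker finishes.
while counter < 10
  Fiber.yield
end
puts counter # => 10
```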
When compiled without -Dpreview_mt, Benben runs at its absolute slowest.
Memory allocation and deallocation are almost certainly playing some part in these numbers. However, I don’t believe it’s Benben causing the extra allocations. YunoSynth and Benben are both written specifically to allocate things like audio buffers only once, and to minimize the allocation of strings as much as possible. Any extra allocations are more than likely in the standard library.
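The allocate-once pattern mentioned above looks roughly like this (an illustrative sketch with made-up names, not YunoSynth’s actual code):

```crystal
# Illustrative allocate-once pattern (made-up names, not YunoSynth's real
# code): the audio buffer is allocated a single time and refilled on each
# render pass instead of being reallocated.
BUFFER_SIZE = 4096

buffer = Slice(Int16).new(BUFFER_SIZE) # one allocation, up front

3.times do |pass|
  # A real renderer would synthesize samples into the same buffer here.
  buffer.fill(pass.to_i16)
end

puts buffer.size # => 4096
puts buffer[0]   # => 2
```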
Overall, I feel these benchmarks support these conclusions:
- The -Dpreview_mt flag is an absolute must. An M:1 model is highly detrimental to Benben’s performance (not to mention the deadlock).
- The 1F:1T model and a 1:1 “no Fibers, all First-Class Threads” model would both be ideal for Benben.
- An MF:NT model results in worse performance overall, and sometimes much worse memory usage.
- A situation where we are not limited to cooperatively scheduled Fibers in Crystal, but instead have a proper thread API with true First-Class Threads, is justified for embarrassingly parallel problems, such as rendering VGMs to WAV with Benben.
xtime and Geometric Mean Source Code
xtime source code:
# Based on https://github.com/kostya/benchmarks/blob/master/xtime.rb

class EnergyStats
  PATH = "/sys/class/powercap/intel-rapl/intel-rapl:0/energy_uj"

  getter? hasEnergyMetrics : Bool = false
  @accum : UInt64 = 0
  @energy : UInt64 = 0
  @maxEnergy : UInt64 = 0

  def initialize
    @hasEnergyMetrics = File.readable?(PATH)
    @maxEnergy = File.read(PATH).to_u64 if @hasEnergyMetrics
  end

  @[AlwaysInline]
  def update
    return unless @hasEnergyMetrics
    newVal = File.read(PATH).to_u64

    if @energy == 0 # first reading
      @accum = 0
    elsif newVal > @energy
      @accum += newVal - @energy
    elsif newVal < @energy # counter has been reset
      @accum += @maxEnergy - @energy + newVal
    end

    @energy = newVal
  end

  def val
    sprintf("%0.3f", 0.000001_f64 * @accum)
  end
end

@[AlwaysInline]
def mem(pid : Int64) : Int64
  `ps p #{pid} -o rss`.lines[-1].to_i64 * 1024
end

def usage : Nil
  STDERR << %|Usage: xtime [args...]
       xtime [-e] [-s] [args...]

  -q : Don't print command name
  -e : Show stderr output from process
  -s : Show stdout output from process

Stderr and Stdout are both hidden by default.
|
  exit 1
end

usage if ARGV.empty?

# Check options
output = Process::Redirect::Close
outerr = Process::Redirect::Close
quiet = false
outfile = nil
outname = nil
outlang = nil

loop do
  break if ARGV.empty?

  case ARGV[0]
  when "-q"
    quiet = true
    ARGV.shift
  when "-e"
    outerr = Process::Redirect::Inherit
    ARGV.shift
  when "-s"
    output = Process::Redirect::Inherit
    ARGV.shift
  when "-o"
    ARGV.shift
    outname = ARGV.shift
    outlang = ARGV.shift
    outfile = ARGV.shift
  when "-h" then usage
  when "--" then break
  else           break
  end
end

# ARGV may now be empty...
if ARGV.empty?
  STDERR << "ERROR: No program given\n"
  exit 1
end

# Used to track memory usage
highestMem : Int64 = 0
newMem : Int64 = 0

# Start energy tracking, if possible
energy = EnergyStats.new

# Start the process and timer
start = Time.monotonic
proc = Process.new(ARGV[0], ARGV[1..], output: output, error: outerr)
pid = proc.pid

# This fiber checks memory usage
spawn do
  highestMem = mem(pid)

  loop do
    sleep 5.milliseconds
    newMem = mem(pid)
    highestMem = newMem if newMem > highestMem
    energy.update
  end
end

# All done, report
proc.wait
finish = Time.monotonic
finalMem = highestMem.humanize_bytes(format: :JEDEC)

if energy.hasEnergyMetrics?
  STDERR << "#{ARGV[0]}: == #{finalMem}, #{(finish - start)} (#{energy.val} J energy) ==\n"
else
  STDERR << "#{ARGV[0]}: == #{finalMem}, #{(finish - start)} (? J energy) ==\n"
end
Code used to calculate geometric means:
(require 'computable-reals)
(require 'cl-sdm) ;; For STRING-REPLACE and SPLIT-STRING

;; Nanoseconds per second
(defconstant +ns+ 1000000000)

;; Each time in TIMES should be a STRING in the format: hh:mm:ss.nnnnnnnnn
(defun calculate-geometric-mean-time (&rest times)
  (labels ((parse-time (time)
             ;; Converts a "hh:mm:ss.nnnnnnnnn" string into a (:HOURS hh :MINUTES mm
             ;; :SECONDS ss :FRACTION nnnnnnnnn) plist.
             (loop for part-num in (mapcar #'parse-integer
                                           (sdm:split-string
                                            (sdm:string-replace time "." ":")
                                            #\: :as-list t))
                   for part in '(:hours :minutes :seconds :fraction)
                   ;; Rather overkill, but meh
                   append (list part part-num) into ret
                   finally (return (+ (getf ret :fraction)
                                      (* (getf ret :seconds) +ns+)
                                      (* (getf ret :minutes) 60 +ns+)
                                      (* (getf ret :hours) 60 60 +ns+))))))
    (let ((gm (cr:raw-approx-r
               (cr:expt-r (reduce #'* (mapcar #'parse-time times))
                          (/ 1 (length times))))))
      (multiple-value-bind (gm-total-sec gm-ns)
          (truncate gm +ns+)
        (let* ((gm-hour (truncate (truncate gm-total-sec 60) 60))
               (gm-min (truncate (- gm-total-sec (* gm-hour 60 60)) 60))
               (gm-sec (- gm-total-sec (* gm-hour 60 60) (* gm-min 60))))
          ;; Return a new string
          (format nil "~2,'0d:~2,'0d:~2,'0d:~2,'0d" gm-hour gm-min gm-sec gm-ns))))))

(defun calculate-geometric-mean-samples (&rest samples)
  (format nil "~:d" (cr:raw-approx-r
                     (cr:expt-r (reduce #'* samples)
                                (/ 1 (length samples))))))

(defun calculate-geometric-mean-memory (&rest memory)
  (sdm:pretty-print-bytes
   (cr:raw-approx-r (cr:expt-r (reduce #'* memory)
                               (/ 1 (length memory))))))

(defun process-file (filename)
  (format t "~a"
          (with-output-to-string (out)
            (with-open-file (in filename)
              (loop for line = (read-line in nil nil)
                    while line
                    do (unless (or (sdm:empty-string-p line)
                                   (sdm:string-starts-with line "#"))
                         (if (sdm:string-starts-with line "name:")
                             (format out "== ~a ==~%" line)
                             (let* ((parts (sdm:split-string line #\, :as-list t)))
                               (format out " GM Time: ~a~%"
                                       (apply #'calculate-geometric-mean-time
                                              (subseq parts 0 3)))
                               (format out " GM Samples per Sec: ~a~%"
                                       (apply #'calculate-geometric-mean-samples
                                              (mapcar #'parse-integer (subseq parts 3 6))))
                               (format out " GM Memory Used: ~a~%"
                                       (apply #'calculate-geometric-mean-memory
                                              (mapcar #'parse-integer (subseq parts 6))))
                               (write-char #\Newline out)))))))))
Footnotes
- 1: https://crystal-lang.org/api/1.10.1/Fiber.html
- 2: https://crystal-lang.org/reference/1.10/guides/concurrency.html
- 3: https://crystal-lang.org/2019/09/06/parallelism-in-crystal/
- 4: https://en.wikipedia.org/wiki/Thread_(computing)#Threading_models
- 5: This likely only works properly on the allow-build-without-preview_mt branch, because the other branches are currently lacking the Fiber.yield calls needed to implement cooperative multitasking.
- 6: https://en.wikipedia.org/wiki/Embarrassingly_parallel
- 7: There is an undocumented internal Thread class, but using an “undocumented” and “internal” API isn’t necessarily a good idea in the long run.
- 8: Memory usage in the Common Lisp version is definitely higher, but just 1.5-3x more. This could probably be improved with optimization, but also remember that Benben is NOT intended as an embedded application, and the amount of memory the Lisp version uses would still be quite small compared to how much RAM machines ship with these days. Your desktop environment almost certainly uses more RAM than it does unless you’re on something like Fluxbox or Sawfish.
- 9: https://github.com/thezerobit/green-threads
- 10: Also midi123, which uses the same architecture as Benben, and thus suffers the exact same MF:NT issues.
- 11: https://forum.crystal-lang.org/t/can-crystal-workers-x-be-set-in-source-code/4451
- 12: https://social.sdf.org/@MistressRemilia/111315950288487313