Python is slow… Wait, it’s actually fast!

VilleKr
11 min read · May 27, 2021


Validating WebSocket performance benchmark results

TLDR

In this benchmark there is one clear winner: the uWebSockets library. uWebSockets, which is written in C/C++, was the fastest-performing WebSocket server both when used with Node.js and with Python. The speed is impressive, but so is the resource usage. It felt like uWebSockets was just stretching its legs while the Node.js websocket-library benchmarking client was hammering as hard as it could. And thanks to uWebSockets, in my tests, Python's WebSocket benchmark server was the fastest performer.

Foreword

At work I have recently been developing an Open Charge Point Protocol (OCPP) system where a Central System and Charging Stations communicate with each other over WebSockets. WebSocket is a protocol that provides a bi-directional, low-overhead communication channel between client and server. It is used in various solutions where a persistent channel with frequent or infrequent communication is necessary, and where backend systems need to send data to clients without the client first making a request.

I recently came across a piece of research where different languages and selected WebSocket libraries were benchmarked. The research was conducted and documented meticulously, all the benchmarking code was shared, and dockerized images were available. In the research, the performance of Python's various websocket libraries was terrible, and the recommendation was not to use Python in any WebSocket-related implementation. The results regarding Python intrigued me. Generally speaking, the performance of Node.js and Python should be roughly in the same ballpark, so when there are differences of multiple orders of magnitude, the difference comes either from an exceptionally well-optimized library or from a test setup that is not equal (or even incorrect).

The research was especially interesting because in our OCPP system we're using Python both in the cloud backend and in the embedded system. In the context of OCPP, WebSocket performance can't of course be neglected, but it is not the first and most likely bottleneck, for a couple of reasons. Firstly, the OCPP Central System reacts to messages sent by a Charging Station, performs some business logic in the backend (typically involving database access or other AWS services), and then responds back to the client. The messaging between client and server is quite infrequent, and transaction correctness is much more important than the speed of responses as such. Secondly, we utilize AWS infrastructure to ensure our system is available and able to scale up based on the actual load (and scale down once times are quieter).

Before proceeding further, I must make a disclaimer. It's quite easy to benchmark, get some results and then draw conclusions. However, executing a rigorous, ground-truth benchmark and drawing correct conclusions is another thing. I'm not a specialist in the WebSocket protocol as such, and I don't have experience in implementing such a protocol server or in assessing a server's compliance against the WebSocket specification. So, my validation of the original benchmark results, as well as the added benchmarks, should be considered nothing more than tests from and for a curious developer. One important issue from the original benchmark must be raised: the benchmarking client uses the Node.js websocket-library, which is stated to be (mostly) pure JavaScript code. Even if the benchmark client is run on a more powerful machine than the benchmark server, it's possible that the benchmarking client will be the bottleneck when running against a highly performing WebSocket server.

Goals of benchmark

Before jumping into benchmark setup and results, let’s summarize the goals and what I aim to achieve:

1. Evaluate the original benchmark from the Python perspective. Does Python really fail to execute the benchmark? How valid is the “do not use Python” statement?

2. Evaluate uWebSockets with Python. The fastest websocket server in the original benchmark was uWebSockets with Node.js. uWebSockets also has Python bindings, although the project is not yet finalized.

3. Evaluate how Python ASGI server performs. More about ASGI a bit later.

To clarify the point: my focus was the Python perspective, and I wasn't interested in replicating the original benchmark with all the languages.

Benchmark setup

For the benchmarking I ended up running two EC2 z1d.large virtual machine instances in AWS with Ubuntu 20.04 LTS. The instances were running in the same subnet, and the client connected to the server using its private DNS address. For a single-threaded workload, the z1d instance family is a good choice due to its high CPU frequency and high memory volume in relation to its low vCPU count. In my tests the z1d instance types proved to perform very solidly. At first, I actually attempted to run the benchmarks with C5 family (c5.large — c5.2xlarge) instances and got more or less similar results, except for the Node.js benchmark server, which consistently reached only 99/100 rounds. Another and maybe more interesting (or worrying?) issue during the initial benchmarking attempt was that both C5 and z1d type instances running the Amazon Linux 2 AMI failed to run the benchmarks successfully with any language. Note that I didn't investigate the issues with C5 instances or Amazon Linux 2 any further, so if anyone is willing to take a look at the issues, I'd like to know if there's any resolution.

The choice of virtual machine of course reduces the benefits achieved by WebSocket libraries that rely on multiprocessing compared to ones that utilize an asynchronous execution model.

The original benchmarking code was nicely dockerized based on Ubuntu images, although no version information was specified for either the Ubuntu operating system or the installed Python. When dockerizing Python for production, it's a good and recommended practice to start with the official Python Docker images, which are based on Debian. My tests for Python are based on the official python:3.9.5-slim image. In the original benchmark the Ubuntu version was 20.04.2 LTS and the installed Python was 3.8.5. If that change explains differences against the original benchmark results, then so be it.

Benchmark results part I

First, I’ll present results for the two Python options against other languages I chose to benchmark. Languages and WebSocket libraries are:

· Python / websockets

· Python / uWebSockets

· Node.js / uWebSockets

· Java / Java-WebSocket

· Rust / rust-websocket

I executed the benchmark tests 3 times for each language and picked the worst of these results. It's good to be pessimistic, right?

Request Time Elapse shows the time it takes for each test round to run in seconds.

The benchmarking client establishes a WebSocket connection to a server, sends a message, and waits for the server to return that message with timestamp information. The timestamp information is not utilized as such; it's just used to simulate some simple work that the server must perform before returning the message to the client. The number of WebSocket connections starts from 100 and increases by 100 in each step. In this benchmark 100 rounds were run.

First, a comment about the benchmark server code. When a server receives a message, it deserializes the JSON payload, gets the current timestamp from the system, creates a response as JSON, serializes it, and sends it out over the WebSocket. In other words, there's nothing asynchronous here, i.e. the task can be described as compute intensive.
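To make the per-message work concrete, here is a minimal sketch of it (the function name and message shape are mine, not from the actual benchmark code): deserialize the JSON payload, attach the current server timestamp, serialize, and send back.

```python
import json
import time

def handle_message(raw: str) -> str:
    """Sketch of the per-message work: parse JSON, add a server
    timestamp, and serialize the response. Purely synchronous CPU work."""
    payload = json.loads(raw)
    payload["ts"] = time.time()  # the "some simple work" before echoing back
    return json.dumps(payload)

# Inside e.g. a websockets-library server this would run once per message:
#   async def handler(ws):
#       async for message in ws:
#           await ws.send(handle_message(message))
```

Note that the whole request path is synchronous CPU work; the only thing awaited is the socket itself, which is why the task is compute intensive.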

Time it takes for WebSocket client and server to establish a connection.

Python’s websockets-library is built around Python’s asyncio framework, i.e. it is a pure Python implementation. That, combined with the non-async nature of the benchmark server’s code, doesn’t do any favors to that particular combination. Compared to the best results there’s quite a difference. However, the websockets-library consistently completed all 100 rounds every time, which is a credit to its reliability. The time to establish a connection starts to seriously increase after 8000 connections, which might be due to the compute-intensive task assigned to the server.

Rust’s performance is a surprise, as you would expect it to be on the same level as Java. Rust’s benchmark server’s performance might suffer from the fact that it prints out all incoming messages to the console. Rust also spins up a huge number of processes, so it might be an issue of context switching. Any Rust specialists out there to shed some light?

Java, and the uWebSockets-library with both Node.js and Python, are clearly the best performers. However, there is also quite a difference between them. For some reason I wasn’t able to get Node.js with uWebSockets to finish 100/100 rounds every time. That’s something to keep in mind and investigate further. Nevertheless, Node.js with uWebSockets seems to perform very nicely when monitored with htop on the server. While benchmarking, uWebSockets put load on both vCPUs but only very momentarily pushed them to their limits. This implies that the bottleneck when benchmarking uWebSockets was actually on the client side. Python with uWebSockets only utilized a single vCPU, so compared to the Node.js version there still might be something to improve.

The Java duration results are good, but Java shows some serious delay when establishing connections. Its very inefficient resource usage is yet another story: the benchmarking server spins up multiple processes and constantly pegs both vCPUs at 100%. Java also failed to reach the total 100 rounds due to running out of memory. So, Java’s claim “Runs everywhere” is true but should be continued with “… but not quite until the finish line.” Swapping the EC2 instance for one a level or two beefier would probably allow Java to run the full 100 rounds.

Summing up the time it took each language to finish the benchmark, we get quite clear differences.

Total duration it took for each language / websocket server to finish the test

Benchmark results analysis part II — Python specific libraries

In this second part I’ll take a more detailed look at a few other options for running WebSocket servers with Python.

My special interest here was to see how ASGI servers with the websockets-library would perform against the other Python options. Asynchronous Server Gateway Interface (ASGI) is a specification that separates the network protocol server (generally called the ASGI server) from the application server (called the ASGI application) by defining a standard asynchronous interface between the two. The ASGI specification supports the HTTP, HTTP/2 and WebSocket protocols. ASGI applications don’t have to integrate e.g. the websockets-library into their own code; they utilize whichever protocol implementations are available in the ASGI server. The ASGI server might support multiple protocol server implementations, so switching to a different websocket server should be a breeze. Also, the ASGI server might employ multiple workers, so parallelizing workloads locally is very easy and efficient. And of course, as there’s a standard event-based interface, it’s straightforward to scale in cloud infrastructure as well. For example, FastAPI is a very popular API framework that conforms to the ASGI specification. With FastAPI one could run a whole REST API behind AWS API Gateway HTTP API within a single Lambda function. The setup is simple and very, very scalable.
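The separation ASGI draws can be illustrated with a minimal WebSocket application (a sketch, not the actual benchmark code): the ASGI server speaks the wire protocol and drives this coroutine with events, so the application never touches a socket. To show the event flow without a running server, the sketch also drives the app with fake events.

```python
import asyncio
import json
import time

async def app(scope, receive, send):
    """Minimal ASGI WebSocket application: accept the connection, then
    echo each message back with a server timestamp attached."""
    assert scope["type"] == "websocket"
    while True:
        event = await receive()
        if event["type"] == "websocket.connect":
            await send({"type": "websocket.accept"})
        elif event["type"] == "websocket.receive":
            # Same work as the benchmark servers: deserialize, timestamp, serialize.
            payload = json.loads(event["text"])
            payload["ts"] = time.time()
            await send({"type": "websocket.send", "text": json.dumps(payload)})
        elif event["type"] == "websocket.disconnect":
            break

# Drive the app with fake events to show the event flow (no server needed):
events = [
    {"type": "websocket.connect"},
    {"type": "websocket.receive", "text": '{"msg": "hello"}'},
    {"type": "websocket.disconnect", "code": 1000},
]
sent = []

async def receive():
    return events.pop(0)

async def send(event):
    sent.append(event)

asyncio.run(app({"type": "websocket"}, receive, send))
print([e["type"] for e in sent])  # ['websocket.accept', 'websocket.send']
```

Served under uvicorn, exactly the same `app` coroutine works unchanged, which is the point of the specification: swapping the protocol server requires no application changes.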

Benchmarked websocket libraries are:

· websockets

· uWebSockets

· uvicorn (ASGI server) with websockets

· autobahn

The same approach is followed here, i.e. 3 benchmark rounds for each library, and the worst results are shown.

Request Time Elapse shows the time it takes for each test round to run.

All versions ran the full 100/100 rounds, but there are once again quite significant differences both in execution duration and in connection times. The autobahn-library is a pure Python implementation, apart from some accelerators, which were installed in this setup. However, its performance still clearly loses to the websockets-library.

Time it takes for WebSocket client and server to establish a connection.

I ran uvicorn with uvloop and 2 workers, which increased the performance a bit compared to the plain websockets-library. That seemed to help especially with connection establishment, duration and consistency. The uWebSockets-library results are included just to show the baseline.
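For reference, the setup can be launched with a command along these lines (a sketch; `app:app` is a placeholder for the actual `module:attribute` path of the ASGI application):

```shell
# Run the ASGI app on 2 worker processes with the uvloop event loop
uvicorn app:app --workers 2 --loop uvloop --host 0.0.0.0 --port 8000
```

With `--workers 2`, uvicorn forks separate processes so both vCPUs of the z1d.large instance can be utilized, which a single asyncio event loop cannot do on its own.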

To sum up this second part, the total time difference between the fastest and the slowest websocket server is getting very dramatic.

Conclusion

It’s quite difficult to draw the line on to what extent this benchmark tested the given WebSocket server’s performance versus the features of the given language. It’s definitely an unfair game to compare compiled and interpreted languages in terms of pure execution speed. The same goes for comparing servers running multiple processes against one doing all its work within a single process. But that’s life. Competition is rarely fair, and within given constraints, only results matter. There’s a significant performance difference between the pure Python implementations and the best-performing WebSocket servers. For applications where every microsecond matters and the WebSocket server’s performance is likely to be the bottleneck in the system, it’s definitely advisable to assess very carefully how that functionality will be implemented and which language and library will be utilized.

When it comes to Python’s performance, the results are not unexpected. Python is fast in development but not necessarily in execution as such. Python is so vastly popular and loved because of its clean syntax, rapid application development and rich ecosystem of libraries and developers. But then again, Python applications can be extremely fast. The combination of Python’s clarity and extremely high-performance libraries, typically written in C or C++, is the main reason why Python is the most popular language in data analytics and machine learning. Data Engineers and Data Scientists would probably suffer greatly if the NumPy or Pandas libraries suddenly disappeared. There are also many ways to make ordinary Python code much faster. You could start with a pure Python implementation, then profile the execution to identify the biggest bottlenecks. Based on the findings, you can look for better-performing libraries to replace existing ones in your application, or you could optimize your code by e.g. utilizing Numba JIT or Cython (i.e. translating Python to C/C++ and then compiling it as a native Python extension). Just remember when going down this rabbit hole: the “don’t fix it if it isn’t broken” analogy would be “don’t optimize it unless you know where the bottleneck is”.
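As a sketch of that profile-first workflow, Python’s built-in cProfile can pinpoint where the time actually goes before you reach for Numba or Cython (`slow_sum` here is just a stand-in for a real workload):

```python
import cProfile
import io
import pstats

def slow_sum(n):
    # deliberately naive pure-Python work standing in for a real bottleneck
    return sum(i * i for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
result = slow_sum(100_000)
profiler.disable()

# Report the functions with the largest cumulative time -- these are the
# candidates for a faster library, Numba JIT, or a Cython rewrite.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
report = buf.getvalue()
print(report.strip().splitlines()[0])
```

Only the functions at the top of that report are worth optimizing; everything else is noise.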

Yet another thing is how you run your application. A blazing-fast application on a single server doesn’t make your customers happy when that server suddenly goes down. Cloud infrastructure should be utilized so that it makes your application secure and ensures availability, scalability and performance. All of these are important; how the different aspects are prioritized varies from application to application.

In this benchmark there is one clear winner: the uWebSockets library. uWebSockets, which is written in C/C++, was the fastest-performing WebSocket server both when used with Node.js and with Python. The speed is impressive, but so is the resource usage. It felt like uWebSockets was just stretching its legs while the Node.js websocket-library benchmarking client was hammering as hard as it could. And thanks to uWebSockets, in my tests, Python’s WebSocket benchmark server was the fastest performer. So, I’ll definitely consider using uWebSockets in our Python-based applications if we need to boost our WebSocket performance.

Here are the benchmark results as well as the code for each WebSocket server.


Written by VilleKr

Working at NordHero as Cloud Solutions Architect.
