
node.js, mongodb, redis on Ubuntu: performance degradation in production, RAM is free, CPU at 100%


As the question title suggests, I'm having a hard time figuring out what can be improved in my application (or tuned in the Ubuntu operating system) to achieve acceptable performance. But first, let me explain the architecture:

The front-end server is an 8-core machine with 8 GB of RAM running Ubuntu 12.04. The application is written entirely in Javascript and runs on node.js v0.8.22 (some modules seem to complain about newer versions of Node). I use Nginx 1.4 to proxy HTTP traffic from ports 80 and 443 to 8 node workers that are managed and started with the Node Cluster API. I use the latest version of socket.io, 0.9.14, to handle the websocket connections, with only websockets and xhr-polling enabled as transports. This machine also runs an instance of Redis (2.2).
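To make the setup concrete, here is a rough sketch of how the workers and socket.io are wired together (the internal port and the trivial request handler are illustrative assumptions, not my exact code):

```js
// Rough sketch of the cluster + socket.io 0.9 setup described above.
// The internal port and the dummy request handler are illustrative assumptions.
var cluster = require('cluster');
var http = require('http');

if (cluster.isMaster) {
  // One worker per core; nginx proxies ports 80/443 to the shared port below.
  for (var i = 0; i < 8; i++) cluster.fork();
} else {
  var server = http.createServer(function (req, res) {
    res.writeHead(200);
    res.end('ok');
  });

  var io = require('socket.io').listen(server);
  // Only the two transports mentioned above are enabled.
  io.set('transports', ['websocket', 'xhr-polling']);

  server.listen(3000);
}
```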

I store persistent data (such as users and scores) on a second server running MongoDB (3.6), with 4 GB of RAM and 2 cores.

The app has been in production for a few months (it ran on a single box until a few weeks ago) and is used by around 18,000 users a day. It has always worked very well, apart from one major issue: performance degradation. As the app is used, the CPU consumed by each process grows until the worker saturates (at which point it can no longer serve requests). I have temporarily worked around it by checking the CPU used by every worker once a minute and restarting it when it hits 98%. So the problem here is mainly CPU, not memory. Memory is no longer an issue since I upgraded to socket.io 0.9.14 (the earlier version leaked memory), so I doubt it's a memory-leak problem, especially as the CPU now grows quite fast (I have to restart each worker about 10-12 times a day!). The RAM in use does grow as well, to be honest, but very slowly, about 1 GB every 2-3 days, and the weird thing is that it isn't released even if I completely restart the whole application. It is only released when I reboot the server! I can't really explain that...
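The temporary restart workaround is essentially the following sketch (reading CPU via `ps` is an assumption on my part about the mechanism; the 98% threshold and one-minute interval are what I described above):

```js
// Sketch of the per-worker CPU watchdog used as a temporary workaround.
// Reading CPU via `ps` is an assumed mechanism, not necessarily the real one.
var cluster = require('cluster');
var exec = require('child_process').exec;

function checkWorkers() {
  Object.keys(cluster.workers).forEach(function (id) {
    var worker = cluster.workers[id];
    exec('ps -p ' + worker.process.pid + ' -o %cpu=', function (err, stdout) {
      if (err) return;
      if (parseFloat(stdout) >= 98) {
        // Recycle the saturated worker and replace it with a fresh one.
        worker.process.kill();
        cluster.fork();
      }
    });
  });
}

if (cluster.isMaster) {
  setInterval(checkWorkers, 60 * 1000); // check every minute
}
```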

I've now discovered Nodefly, which is amazing, so I can finally see what's going on on my production server, and I've been collecting data for a few days. If someone wants to see the charts I can give them access, but basically I can see that I handle between 80 and 200 concurrent requests! I was expecting node.js to handle thousands of requests, not hundreds. The average response time for HTTP traffic is also between 500 and 1500 milliseconds, which I think is a lot. Right now, with 1300 users online, this is the output of "ss -s":

This shows that I have a lot of connections closed in TIME_WAIT. I have increased the maximum number of open files to 999999. Here is the output of ulimit -a:

So I thought the problem might be HTTP traffic saturating the available ports/sockets (?) for some reason, but one thing doesn't make sense to me: why is it that when I restart a worker, and all the clients re-establish their connections within a few seconds, the worker's CPU load drops to 1% and it serves requests properly, until it saturates again after about an hour (at peak time)?

I'm mainly a Javascript programmer, not a system administrator, so I don't know how much load I should expect my servers to handle, but it is certainly not performing the way it should. The application is otherwise stable, and this last problem is preventing me from shipping the mobile versions of the app, which are ready, as they will obviously add more load and eventually bring the whole thing down!

Hopefully there is something obvious I'm doing wrong and someone will help me figure it out... Feel free to ask for more information, and sorry for the length of the question, but I think it was necessary. Thanks in advance!





Answer:


After a few days of intense trial and error, I'm happy to say I have figured out where the bottleneck was, and I'm posting it here so other people can benefit from my findings.

The problem lies in the pub/sub connections I was using with socket.io, and in particular in the RedisStore that socket.io uses for cross-process communication between socket instances.

After realizing that I could easily implement my own version of pub/sub using Redis, I decided to give it a try and removed the RedisStore from socket.io, leaving it on the default memory store (I don't need to broadcast to all connected clients, only between 2 different users who may be connected through different processes).
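In practice the change just means dropping the 'store' override so socket.io falls back to its default in-process memory store; a sketch of what that looks like (the commented-out block is roughly the 0.9 RedisStore wiring that was removed, and the port is illustrative):

```js
var sio = require('socket.io');

// Before: the RedisStore routed socket.io's internal messaging through Redis
// so that events could reach sockets owned by other workers, roughly:
//
//   var RedisStore = require('socket.io/lib/stores/redis');
//   var redis = RedisStore.redis;
//   io.set('store', new RedisStore({
//     redisPub: redis.createClient(),
//     redisSub: redis.createClient(),
//     redisClient: redis.createClient()
//   }));

// After: no 'store' override at all, so socket.io keeps its default
// in-process memory store (port 8080 here is just an illustrative value).
var io = sio.listen(8080);
```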

Initially I declared only 2 global Redis connections per process to handle the pub/sub for every connected client, and the application used fewer resources, but I was still affected by constant CPU growth, so not much had changed. Then I decided to create 2 new Redis connections for each client, so that their pub/sub is handled only within their own session, and to close those connections once the user disconnects. After one day in production the CPUs were still at 0-5%... bingo! No process restarts, no errors, and the performance I expected. Now I can say that node.js rocks, and I'm glad I chose it to build this app.
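In sketch form, the per-client pub/sub looks like the following (the 'user:&lt;id&gt;' channel naming and the way the user id is obtained are illustrative assumptions, not necessarily my exact code):

```js
// Sketch of per-client Redis pub/sub with node_redis: 2 dedicated connections
// per connected socket, torn down on disconnect.
var redis = require('redis');
var sio = require('socket.io');

var io = sio.listen(8080); // illustrative port

io.sockets.on('connection', function (socket) {
  var userId = socket.handshake.query.userId; // however the user is identified (assumption)
  var sub = redis.createClient();             // this client's subscriber connection
  var pub = redis.createClient();             // this client's publisher connection

  // Deliver anything published on this user's channel straight to their socket.
  sub.subscribe('user:' + userId);
  sub.on('message', function (channel, message) {
    socket.emit('message', message);
  });

  // Sending to another user: publish on their channel; whichever worker holds
  // their socket picks it up through that socket's own subscriber.
  socket.on('send', function (data) {
    pub.publish('user:' + data.to, JSON.stringify(data));
  });

  // Close both connections when the user disconnects, so nothing accumulates.
  socket.on('disconnect', function () {
    sub.unsubscribe();
    sub.quit();
    pub.quit();
  });
});
```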

Fortunately, Redis was designed to handle many simultaneous connections (unlike Mongo). By default its limit is 10,000 clients, which leaves room for about 5,000 concurrent users (at 2 connections each) on a single Redis instance. That is fine for me for now, and I've read that it can be pushed up to 64,000 concurrent connections, so this architecture should be solid enough in my opinion.

At this point I was thinking of implementing some sort of connection pool for Redis to optimize things a little further, but I'm not sure that won't cause the pub/sub events to build up on the connections again, unless each connection is destroyed and recreated every time to clean it up.
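For what it's worth, if I do try pooling, a sketch using the generic-pool module (my own assumption of a suitable library) might look like this, and only for ordinary commands rather than pub/sub subscriptions:

```js
// Speculative sketch of a Redis connection pool using generic-pool (v2 API).
// As noted above, pooled connections and pub/sub subscriptions don't mix well
// unless connections are destroyed instead of reused.
var poolModule = require('generic-pool');
var redis = require('redis');

var pool = poolModule.Pool({
  name: 'redis',
  create: function (callback) { callback(null, redis.createClient()); },
  destroy: function (client) { client.quit(); },
  max: 10,                 // illustrative pool size
  idleTimeoutMillis: 30000
});

// Usage: borrow a connection, publish, give it back.
pool.acquire(function (err, client) {
  if (err) throw err;
  client.publish('user:42', 'hello'); // illustrative channel/message
  pool.release(client);
});
```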

Anyway, thanks for your answers, and I'm curious to hear what you think and whether you have any other suggestions.

Cheers.




Not really an answer as such, since your question is more of a story than a question with a single answer.

Just to say I successfully created a node.js server using socket.io that handles over 1 million persistent connections with an average message payload of 700 bytes.

The 1 Gbps NIC was saturated at first, and I was seeing a lot of I/O latency from publishing events to all the clients.

Removing Nginx from the proxy role also freed up precious memory, because reaching one million persistent connections with only ONE server is a tough exercise in tweaking configs, the application, and operating-system parameters. Keep in mind that it's only doable with a lot of RAM (around 1 million websocket connections consume about 16 GB of RAM with node.js; using sock.js is ideal for low memory consumption, but for now socket.io consumes that much).

This link was my starting point for reaching that volume of connections with Node. Aside from it being an Erlang app, all of the operating-system tuning is pretty much application-independent and should be useful to anyone aiming for a lot of persistent connections (websockets or long polling).

HTH,
