Node.js HTTP client keepAlive

2020-02-05

Why

A lot of services communicate with APIs that still use HTTP/1.1, and reusing TCP connections in Node.js is a straightforward way to reduce CPU load and improve end-to-end latency. Below is an example of how to do that in Node.js.

This will also come in handy for E2E tests that call REST APIs directly, where it can shave a fair amount of time off test execution.

Caveats

It's worth double-checking the TCP socket timeout on the target server: if the client's keep-alive timeout is longer than the server's, the server may close the TCP connection, and the next request on the client will try to use a closed socket and fail with an ECONNRESET. You can either handle these errors or set a shorter socket timeout on the client.
This issue is very likely to arise on Linux, where by default Node.js will try to keep the socket open indefinitely.

A second concern is a backend service that frequently changes its IP address: as long as the socket stays open, no DNS resolution happens, so the client never sees the new IP. This has to be handled as well.

The third main concern is that traffic from one client may end up pinned to a few specific instances, interfering with DNS-based load balancing.

Server

We'll generate a self-signed certificate with openssl for the test HTTPS server.

openssl genrsa -out key.pem
openssl req -new -key key.pem -out csr.pem
openssl x509 -req -days 9999 -in csr.pem -signkey key.pem -out cert.pem
rm csr.pem

const https = require("https");
const fs = require("fs");

const options = {
  key: fs.readFileSync("key.pem"),
  cert: fs.readFileSync("cert.pem"),
};

https
  .createServer(options, function (req, res) {
    res.writeHead(200, { "Content-Type": "application/json" });
    res.end('{"result": true}');
  })
  .listen(8000);

Client

const https = require("https");
const { promisify } = require("util");

// Teach util.promisify how to wrap https.get: resolve with the response
// once headers arrive, and stash a promise on `response.end` that
// settles when the body has been fully consumed.
https.get[promisify.custom] = function getAsync(options) {
  return new Promise((resolve, reject) => {
    https
      .get(options, (response) => {
        response.end = new Promise((resolve) => response.on("end", resolve));
        resolve(response);
      })
      .on("error", reject);
  });
};

const get = promisify(https.get);

const toMilliseconds = (nanoseconds) => nanoseconds / 1000000;

const sortNumber = (a, b) => a - b;

const quantile = (array, percentile) => {
  array.sort(sortNumber);
  const index = (percentile / 100) * (array.length - 1);
  if (Math.floor(index) === index) {
    return array[index];
  }
  const i = Math.floor(index);
  const fraction = index - i;
  return array[i] + (array[i + 1] - array[i]) * fraction;
};
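A quick sanity check of the quantile helper (repeated here so the snippet is self-contained): a percentile that lands exactly on an element returns it, while one that falls between two elements interpolates linearly.

```javascript
const sortNumber = (a, b) => a - b;

// Same linear-interpolation quantile as in the client code above.
const quantile = (array, percentile) => {
  array.sort(sortNumber);
  const index = (percentile / 100) * (array.length - 1);
  if (Math.floor(index) === index) {
    return array[index];
  }
  const i = Math.floor(index);
  const fraction = index - i;
  return array[i] + (array[i + 1] - array[i]) * fraction;
};

console.log(quantile([5, 1, 4, 2, 3], 50)); // 3: the middle of the sorted array
console.log(quantile([5, 1, 4, 2, 3], 87.5)); // 4.5: halfway between 4 and 5
```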

async function main(withAgent) {
  const httpAgent = new https.Agent({ keepAlive: true });
  const options = {
    port: 8000,
    rejectUnauthorized: false, // accept the self-signed test certificate
    agent: withAgent && httpAgent,
  };

  const numberOfSamples = 10000;
  const results = Array(numberOfSamples);
  for (let i = 0; i < numberOfSamples; i++) {
    const hrstart = process.hrtime();

    const res = await get(options);
    res.on("data", (ch) => {}); // consume the body so the "end" event fires
    await res.end;

    const hrend = process.hrtime(hrstart);
    results[i] = hrend[0] * 1e9 + hrend[1]; // total nanoseconds, not just the sub-second part
  }

  const total = results.reduce((total, v) => total + v, 0);

  console.log(withAgent ? "\nWith Agent" : "\nWithout Agent");
  console.log("avg", toMilliseconds(total / numberOfSamples), "ms");
  console.log("min", toMilliseconds(Math.min(...results)), "ms");
  console.log("max", toMilliseconds(Math.max(...results)), "ms");
  console.log("95 percentile", toMilliseconds(quantile(results, 95)), "ms");
}

// main is async, so run the two benchmarks sequentially; calling
// main() and main(true) back to back would interleave the runs.
(async () => {
  await main();
  await main(true);
})();

Results

node client.js

Without Agent
avg 0.9594316489 ms
min 0.794657 ms
max 20.152331 ms
95 percentile 1.2987119999999999 ms

With Agent
avg 0.3077465228 ms
min 0.100038 ms
max 11.219779 ms
95 percentile 0.54670905 ms

As you can see, the keep-alive results are roughly 2x better across the board. The max latency is also solidly better, but a single max shouldn't be taken too seriously, as it could be a coincidence. The 95th percentile is the number that matters, and a 2x improvement there is solid.

Similar performance gains apply to plain HTTP clients. The improvement will be a bit smaller than over TLS, since there's no TLS handshake being saved, but it's still solid.