
Monday, October 27, 2014

GNU/Linux Kernel Tuning

Linux is the world-leading open-source kernel.
It is designed to perform well on a wide range of hardware.

File Handle Limits

When you're serving a lot of traffic, that traffic is usually being served from a large number of local files.
The kernel has built-in limits on the number of files that a process can open, and raising these limits, at a cost of some system memory, is usually a sane thing to attempt.
You can view the current limit on the number of open files by running:
$ cat /proc/sys/fs/file-max
The limit can be raised interactively by running, as root:
# sysctl -w fs.file-max=100000
If you wish that change to be made persistently you should append to the file /etc/sysctl.conf the line:
fs.file-max = 100000
Then run the following command to make your change take effect:
# sysctl -p
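To see how close you are to that limit, the kernel also exposes the current usage via /proc/sys/fs/file-nr; the three fields are the number of allocated handles, the number of allocated-but-unused handles, and the maximum (the figures below are only illustrative):
$ cat /proc/sys/fs/file-nr
3296	0	100000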

Socket Tuning

For servers which are handling large numbers of concurrent sessions, there are some TCP options that should probably be tweaked.
With a large number of clients communicating with your server it wouldn't be unusual to have 20,000 open sockets or more. To widen the range of local ports available for those connections, append the following to the bottom of /etc/sysctl.conf:
# Use the full range of ports.
net.ipv4.ip_local_port_range = 1024 65535
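You can confirm the range currently in effect by reading it back (the output shown is just a common default, not necessarily what your system will report):
$ cat /proc/sys/net/ipv4/ip_local_port_range
32768	60999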
You can also speed up the recycling of sockets, avoiding large numbers of them sitting in the TIME_WAIT state, by adding these values to /etc/sysctl.conf:
# Enables fast recycling of TIME_WAIT sockets.
# (Use with caution according to the kernel documentation!)
net.ipv4.tcp_tw_recycle = 1

# Allow reuse of sockets in TIME_WAIT state for new connections
# only when it is safe from the network stack’s perspective.
net.ipv4.tcp_tw_reuse = 1
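Note that tcp_tw_recycle in particular is known to break connections from clients behind NAT, and it was removed entirely in Linux 4.12, so tcp_tw_reuse is the safer of the two. To get a feel for whether TIME_WAIT sockets are actually a problem, you can count them with ss from iproute2 (the figure below is illustrative, and the count includes one header line):
$ ss -tan state time-wait | wc -l
18042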
Finally, one problem you'll find is that if a listening socket is busy, a backlog of pending connections will pile up. The kernel keeps pending connections in a queue before it starts refusing them. You can tweak several values to increase the socket buffer sizes and the size of this backlog:
#
# 16MB per socket - which sounds like a lot, but will virtually never
# consume that much.
#
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216

# Increase the number of outstanding SYN requests allowed.
# cf. the use of syncookies.
net.ipv4.tcp_max_syn_backlog = 4096
net.ipv4.tcp_syncookies = 1

# The maximum number of "backlogged sockets".  Default is 128.
net.core.somaxconn = 1024
The trade-off here is that a connecting client will see a slow connection, but this is almost certainly better than a Connection Refused error.
Once you've made those additions you can cause them to be loaded by running:
# sysctl -p
Finally if you've changed these limits you will need to restart the associated daemons. (For example "service nginx restart".)
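If you want to confirm whether the listen queue has actually been overflowing, the protocol statistics are worth a look before and after tuning (the exact wording and counts vary by kernel and net-tools version; these lines are only illustrative):
$ netstat -s | grep -i listen
    4217 times the listen queue of a socket overflowed
    4217 SYNs to LISTEN sockets dropped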

Process Scheduler

If you're running a recent kernel (anything newer than roughly 2.6.23) you've got the 'Completely Fair Scheduler' (CFS). For modern systems serving lots of connections on lots of cores, you may hit issues with process migration.
There's a kernel parameter that determines how long a migrated process has to be running before the kernel will consider migrating it again to another core. The sysctl name is sched_migration_cost_ns, with a default value of 500000 (that's nanoseconds, so 0.5 ms):
$ cat /proc/sys/kernel/sched_migration_cost_ns
(It was renamed from sched_migration_cost at some point between 3.5 and 3.8.)
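A quick way to check which name your kernel uses (the output shown assumes a newer kernel):
$ ls /proc/sys/kernel/ | grep sched_migration
sched_migration_cost_ns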
Forking servers, like PostgreSQL or Apache, scale to much higher levels of concurrent connections if this is made larger, by at least an order of magnitude.
The value can be raised interactively by running, as root:
# sysctl -w kernel.sched_migration_cost_ns=5000000
If you wish that change to be made persistently you should append to the file /etc/sysctl.conf the line:
kernel.sched_migration_cost_ns = 5000000
Another parameter that can dramatically impact forking servers is sched_autogroup_enabled. This setting groups tasks by TTY to improve perceived responsiveness on an interactive system. On a server with a long-running forking daemon, this will tend to keep child processes from migrating away as soon as they should. It can be disabled like so:
# sysctl -w kernel.sched_autogroup_enabled=0
Various PostgreSQL users have reported (on the postgresql performance mailing list) gains up to 30% on highly concurrent workloads on multi-core systems.
If you wish that change to be made persistently you should append to the file /etc/sysctl.conf the line:
kernel.sched_autogroup_enabled = 0
Then run the following command to make your change take effect:
# sysctl -p
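You can verify that the setting took effect, either globally or for an individual process via its autogroup file (the PID and group number below are hypothetical):
$ cat /proc/sys/kernel/sched_autogroup_enabled
0
$ cat /proc/12345/autogroup
/autogroup-112 nice 0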

Filesystem Tuning

You almost certainly want to disable the "atime" option on your filesystems.
With atime disabled, the time a file was last accessed won't be updated every time the file is read. Since this information isn't generally useful and updating it causes extra disk writes, it's typically disabled.
To do this, just edit /etc/fstab and add "noatime" as a mount option for the filesystem. For example:
    /dev/rd/c0d0p3          /test                    ext3    noatime        1 2
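The option only takes effect at mount time, but you can apply it to an already-mounted filesystem without rebooting (the mount point matches the example fstab entry above; note that many recent kernels already default to the milder relatime behaviour):
# mount -o remount,noatime /test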

Swap Tuning

  • TODO

RAID Tuning

The deadline I/O scheduler seems to be the best choice for RAID setups, but this is something you'll want to test yourself.
Boot your kernel with elevator=deadline appended to the command-line and compare the result via your favourite filesystem test.
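You can also inspect and switch the scheduler for an individual block device at runtime, which makes such comparisons easier (sda is just an example device; the schedulers listed depend on your kernel, and the bracketed one is currently active):
$ cat /sys/block/sda/queue/scheduler
noop deadline [cfq]
# echo deadline > /sys/block/sda/queue/scheduler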

Comments

About socket tuning: modifying net.core.rmem_max and net.core.wmem_max may not be enough on its own, unless your application(s) set their own socket buffer sizes (with setsockopt() and SO_RCVBUF).
If you want these settings to apply to every socket, you should also update net.core.rmem_default and net.core.wmem_default, bearing in mind the doubling of the target value...
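For example, the defaults could be raised alongside the maximums in /etc/sysctl.conf (the 256KB figure here is purely illustrative, not a recommendation, since the default applies to every socket on the system):
net.core.rmem_default = 262144
net.core.wmem_default = 262144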

Regarding "File Handle Limits". Shouldn't the limit be tuned to peak demand? Check dmesg for errors that look like this:
[510822.012643] VFS: file-max limit 400 reached
(I set a limit of 400 to force this message for this example.)
There is also a per-process limit, "/proc/sys/fs/nr_open", which I believe is not the same as a possible ulimit that may also be applied per process (or per cgroup, perhaps?).
On a system where you find yourself bumping into the "file-max" limit, it is likely not due to a single service, because a single service should hit the "nr_open" limit before the system hits the "file-max" limit. I'm speculating here. Perhaps applications such as Apache2, which may prefork many processes, can circumvent the "nr_open" limit and push the entire system up to the "file-max" limit. I'd be interested to hear from someone who has actually had to increase the file descriptor limits.
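For reference, the per-process pieces mentioned here can be inspected like this (the PID and the values shown are hypothetical):
$ cat /proc/sys/fs/nr_open
1048576
$ ulimit -n
1024
$ grep 'Max open files' /proc/12345/limits
Max open files            1024                 4096                 files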
