Linux menu

Monday, October 27, 2014

Linux Kernel Tuning For Faster Performance

Every now and then, questions come up here internally and with many enthusiasts on what Gluster has to say about kernel tuning, if anything.
The rarity of kernel tuning is on account of the Linux kernel doing a pretty good job on most workloads. But there is a flip side to this design. The Linux kernel historically has eagerly eaten up a lot of RAM, provided there is some, or driven towards caching as the primary way to improve performance.
For most cases, this is fine, but as the amount of workload increases over time and clustered load is thrown upon the servers, this turns out to be troublesome, leading to catastrophic failures of jobs etc.
Having had a fair bit of experience looking at large memory systems with heavily loaded regressions, be it CAD, EDA or similar tools, we've sometimes encountered stability problems with Gluster. We had to carefully analyse the memory footprint and amount of disk wait times over days. This gave us a rather remarkable story of disk trashing, huge iowaits, kernel oops, disk hangs etc.
This article is the result of my many experiences with tuning options which were performed on many sites. The tuning not only helped with overall responsiveness, but it dramatically stabilized the cluster overall.
When it comes to memory tuning the journey starts with the 'VM' subsystem which has a bizarre number of options, which can cause a lot of confusion.

vm.swappiness

vm.swappiness is a tunable kernel parameter that controls how much the kernel favors swap over RAM. At the source code level, it’s also defined as the tendency to steal mapped memory. A high swappiness value means that the kernel will be more apt to unmap mapped pages. A low swappiness value means the opposite, the kernel will be less apt to unmap mapped pages. In other words, the higher the vm.swappiness value, the more the system will swap.
High system swapping has very undesirable effects when there are huge chunks of data being swapped in and out of RAM. Many have argued for the value to be set high, but in my experience, setting the value to '0' causes a performance increase.
Conforming with the details here - http://lwn.net/Articles/100978/
But again these changes should be driven by testing and due diligence from the user for their own applications. Heavily loaded, streaming apps should set this value to '0'. By changing this value to '0', the system's responsiveness improves.

vm.vfs_cache_pressure

This option controls the tendency of the kernel to reclaim the memory which is used for caching of directory and inode objects.
At the default value of vfs_cache_pressure=100 the kernel will attempt to reclaim dentries and inodes at a "fair" rate with respect to pagecache and swapcache reclaim. Decreasing vfs_cache_pressure causes the kernel to prefer to retain dentry and inode caches. When vfs_cache_pressure=0, the kernel will never reclaim dentries and inodes due to memory pressure and this can easily lead to out-of-memory conditions. Increasing vfs_cache_pressure beyond 100 causes the kernel to prefer to reclaim dentries and inodes.
With GlusterFS, many users with a lot of storage and many small files easily end up using a lot of RAM on the server side due to 'inode/dentry' caching, leading to decreased performance when the kernel keeps crawling through data-structures on a 40GB RAM system. Changing this value higher than 100 has helped many users to achieve fair caching and more responsiveness from the kernel.

vm.dirty_background_ratio

vm.dirty_ratio

The first of the two (vm.dirty_background_ratio) defines the percentage of memory that can become dirty before a background flushing of the pages to disk starts. Until this percentage is reached no pages are flushed to disk. However when the flushing starts, then it's done in the background without disrupting any of the running processes in the foreground.
Now the second of the two parameters (vm.dirty_ratio) defines the percentage of memory which can be occupied by dirty pages before a forced flush starts. If the percentage of dirty pages reaches this threshold, then all processes become synchronous, and they are not allowed to continue until the io operation they have requested is actually performed and the data is on disk. In cases of high performance I/O machines, this causes a problem as the data caching is cut away and all of the processes doing I/O become blocked to wait for I/O. This will cause a large number of hanging processes, which leads to high load, which leads to an unstable system and crappy performance.
Lowering them from standard values causes everything to be flushed to disk rather than storing much in RAM. It helps large memory systems, which would normally flush a 45G-90G pagecache to disk, causing huge wait times for front-end applications, decreasing overall responsiveness and interactivity.

"1" > /proc/sys/vm/pagecache

Page Cache is a disk cache which holds data from files and executable programs, i.e. pages with actual contents of files or block devices. Page Cache (disk cache) is used to reduce the number of disk reads. A value of '1' indicates 1% of the RAM is used for this, so that most of them are fetched from disk rather than RAM. This value is somewhat fishy after the above values have been changed. Changing this option is not necessary, but if you are still paranoid about controlling the pagecache, this value should help.

"deadline" > /sys/block/sdc/queue/scheduler

The I/O scheduler is a component of the Linux kernel which decides how the read and write buffers are to be queued for the underlying device. Theoretically 'noop' is better with a smart RAID controller because Linux knows nothing about (physical) disk geometry, therefore it can be efficient to let the controller, well aware of disk geometry, handle the requests as soon as possible. But 'deadline' seems to enhance performance. You can read more about them in the Linux kernel source documentation: linux/Documentation/block/*iosched.txt . I have also seen 'read' throughput increase during mixed-operations (many writes).

"256" > /sys/block/sdc/queue/nr_requests

This is the size of I/O requests which are buffered before they are communicated to the disk by the Scheduler. The internal queue size of some controllers (queue_depth) is larger than the I/O scheduler's nr_requests so that the I/O scheduler doesn't get much of a chance to properly order and merge the requests. Deadline or CFQ scheduler likes to have nr_requests to be set 2 times the value of queue_depth, which is the default for a given controller. Merging the order and requests helps the scheduler to be more responsive during huge load.

echo "16" > /proc/sys/vm/page-cluster

page-cluster controls the number of pages which are written to swap in a single attempt. It defines the swap I/O size, in the above example adding '16' as per the RAID stripe size of 64k. This wouldn't make sense after you have used swappiness=0, but if you defined swappiness=10 or 20, then using this value helps when your have a RAID stripe size of 64k.

blockdev --setra 4096 /dev/<devname> (eg:- sdb, hdc or dev_mapper)

Default block device settings often result in terrible performance for many RAID controllers. Adding the above option, which sets read-ahead to 4096 * 512-byte sectors, at least for the streamed copy, increases the speed, saturating the HD's integrated cache by reading ahead during the period used by the kernel to prepare I/O. It may put in cached data which will be requested by the next read. Too much read-ahead may kill random I/O on huge files if it uses potentially useful drive time or loads data beyond caches.
A few other miscellaneous changes which are recommended at filesystem level but haven't been tested yet are the following. Make sure that your filesystem knows about the stripe size and number of disks in the array. E.g. for a raid5 array with a stripe size of 64K and 6 disks (effectively 5, because in every stripe-set there is one disk doing parity). These are built on theoretical assumptions and gathered from various other blogs/articles provided by RAID experts.
-> ext4 fs, 5 disks, 64K stripe, units in 4K blocks
mkfs -text4 -E stride=$((64/4))
-> xfs, 5 disks, 64K stripe, units in 512-byte sectors
mkfs -txfs -d sunit=$((64*2)) -d swidth=$((5*64*2))
You may want to consider increasing the above stripe sizes for streaming large files.
WARNING: Above changes are highly subjective with certain types of applications. This article doesn't guarantee any benefits whatsoever without prior due diligence from the user for their respective applications. It should only be applied at the behest of an expected increase in overall system responsiveness or if it resolves ongoing issues.
More informative and interesting articles/emails/blogs to read

comment:jdarcy

Some additional tuning ideas:
   * The choice of scheduler is *really* hardware- and workload-dependent, and some schedulers have unique features other than performance.  For example, last time I looked cgroups support was limited to the cfq scheduler.  Different tests regularly do best on any of cfq, deadline, or noop.  The best advice here is not to use a particular scheduler but to try them all for a specific need.
   * It's worth checking to make sure that /sys/.../max_sectors_kb matches max_hw_sectors_kb.  I haven't seen this problem for a while, but back when I used to work on Lustre I often saw that these didn't match and performance suffered.
   * For read-heavy workloads, experimenting with /sys/.../readahead_kb is definitely worthwhile.
   * Filesystems should be built with -I 512 or similar so that more xattrs can be stored in the inode instead of requiring an extra seek.
   * Mounting with noatime or relatime is usually good for performance.

reply:y4m4

   Agreed i was about write those parameters you mentioned. I should write another elaborate article on FS changes. 
y4m4

comment:eco

       1 year ago
   This article is the model on which all articles should be written.  Detailed information, solid examples and a great selection of references to let readers go more in depth on topics they choose.  Great benchmark for others to strive to attain.
       Eco

comment:y4m4

   sysctl -w net.core.{r,w}mem_max = 4096000 - this helped us to Reach 800MB/sec with replicated GlusterFS on 10gige  - Thanks to Ben England for these test results. 
       y4m4

comment:bengland

   After testing Gluster 3.2.4 performance with RHEL6.1, I'd suggest some changes to this article's recommendations:
   vm.swappiness=10 not 0 -- I think 0 is a bit extreme and might lead to out-of-memory conditions, but 10 will avoid just about all paging/swapping.  If you still see swapping, you need to probably focus on restricting dirty pages with vm.dirty_ratio.
   vfs_cache_pressure > 100 -- why?   I thought this was a percentage.
   vm.pagecache=1 -- some distros (e.g. RHEL6) don't have vm.pagecache parameter. 
   vm.dirty_background_ratio=1 not 10 (kernel default?) -- the kernel default is a bit dependent on choice of Linux distro, but for most workloads it's better to set this parameter very low to cause Linux to push dirty pages out to storage sooner.    It means that if dirty pages exceed 1% of RAM then it will start to asynchronously write dirty pages to storage. The only workload where this is really bad: apps that write temp files and then quickly delete them (compiles) -- and you should probably be using local storage for such files anyway. 
   Choice of vm.dirty_ratio is more dependent upon the workload, but in other contexts I have observed that response time fairness and stability is much better if you lower dirty ratio so that it doesn't take more than 2-5 seconds to flush all dirty pages to storage. 
   block device parameters:
   I'm not aware of any case where cfq scheduler actually helps Gluster server.   Unless server I/O threads correspond directly to end-users, I don't see how cfq can help you.  Deadline scheduler is a good choice.  I/O request queue has to be deep enough to allow scheduler to reorder requests to optimize away disk seeks.  The parameters max_sectors_kb and nr_requests are relevant for this.  For read-ahead, consider increasing it to the point where you prefetch for longer period of time than a disk seek (on order of 10 msec), so that you can avoid unnecessary disk seeks for multi-stream workloads.  This comes at the expense of I/O latency so don't overdo it.
   network:
   jumbo frames can increase throughput significantly for 10-GbE networks.
   Raise net.core.{r,w}mem_max to 540000 from default of 131071  (not 4 MB above, my previous recommendation).  Gluster 3.2 does setsockopt() call to use 1/2 MB mem for TCP socket buffer space.
       bengland

comment:hjmangalam

   Thanks very much for noting this info - the descriptions are VERY good.. I'm in the midst of debugging a misbehaving gluster that can't seem to handle small writes over IPoIB and this contains some useful pointers.
   Some suggestions that might make this more immediately useful:
   - I'm assuming that this discussion refers to the gluster server nodes, not to the gluster native client nodes, yes?  If that's the case, are there are also kernel parameters or recommended settings for the client nodes?
   -  While there are some cases where you mention that a value should be changed to a particular # or %, in a number of cases you advise just increasing/decreasing the values, which for something like  a kernel parameter is probably not a useful suggestion.  Do I raise it by 10?  10%  2x? 10x?  

   I also ran across a complimentary page, which might be of  interest - it explains more of the vm variables, especially as it relates to writing.
   "Theory of Operation and Tuning for Write-Heavy Loads" 
      <http://www.westnet.com/~gsmith/content/linux-pdflush.htm>
   and refs therein.
       hjmangalam

comment:bengland

   Here are some additional suggestions based on recent testing:
   - scaling out number of clients -- you need to increase the size of the ARP tables on Gluster server if you want to support more than 1K clients mounting a gluster volume.  The defaults for RHEL6.3 were too low to support this, we used this:
   net.ipv4.neigh.default.gc_thresh2 = 2048
   net.ipv4.neigh.default.gc_thresh3 = 4096
   In addition, tunings common to webservers become relevant at this number of clients as well, such as netdev_max_backlog, tcp_fin_timeout, and somaxconn.
   Bonding mode 6 has been observed to increase replication write performance, I have no experience with bonding mode 4 but it should work if switch is properly configured, other bonding modes are a waste of time.
       bengland
       3 months ago

No comments: