
Wednesday, September 17, 2014

How to do NFS Performance Tuning and Optimization in Linux

Our introductory guide to NFS left out some major topics that need special attention when you configure NFS. We skipped them on purpose, because they are serious topics that deserve a separate discussion. If you have not read our introductory NFS guide, I recommend reading it before beginning this tutorial.
The topics we skipped in that guide are 1. NFS performance tuning guidelines, and 2. securing NFS. We will do a separate post on security. In this post we discuss the factors that, in one way or another, affect the performance of NFS.
NFS performance tuning can be classified into three areas. We will discuss them separately in this tutorial. Let's have a look at these areas first.
  • Underlying Disk Related Performance that affects NFS
  • NFS Application based Performance
  • And finally Network Related NFS tuning (NFS is a technology that relies heavily on network)
Tuning both the NFS server and the NFS client is important, because both take part in the network file system communication. So let's begin with some mount command options that can be used to tune NFS performance, primarily from the client side.

 

Mount command Block Size Settings to improve NFS performance

The size of the data chunks that the server and the client use to pass data between them is very important. Most NFS versions have default values for these settings. However, you can always tune these values to suit your needs. We will be working with the same NFS server and client that we used for our previous tutorial.
Assume that you have an NFS share mounted on one of your NFS client systems. Let's have a look at the default properties of this mount.
[root@slashroot2 ~]# mount 192.168.0.103:/data /mnt
[root@slashroot2 ~]# df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1              38G  5.6G   31G  16% /
tmpfs                 252M     0  252M   0% /dev/shm
192.168.0.103:/data    38G  2.8G   34G   8% /mnt
[root@slashroot2 ~]#
Let's have a look at the properties and options that the NFS client selected by default to mount this share. We can easily get that information from the file /proc/mounts.
[root@slashroot2 ~]# cat /proc/mounts
192.168.0.103:/data /mnt nfs rw,vers=3,rsize=32768,wsize=32768,hard,proto=tcp,timeo=600,retrans=2,sec=sys,addr=192.168.0.103 0 0
[root@slashroot2 ~]#
You will also see details of the other file systems mounted on your system in that file; I have only shown the entry for our NFS share, to avoid confusion.
These details show the default options that were used while mounting that particular share on the client.
rw This means the file system is mounted in read/write mode.
vers=3 This means we are using NFS version 3 for this mount.
rsize=32768  wsize=32768 These specify the maximum amount of data, in bytes, that each RPC request carries while reading and writing. Tuning them will sometimes increase performance and can sometimes reduce it. Let's see why.
Tuning rsize and wsize must always be done with the capacity of your network in mind, as well as the processing power of your client and server. Let's say you have decided to decrease rsize and wsize in your mount. Decreasing the read and write size of each RPC request increases the total number of requests, and therefore IP packets, that need to be sent over the network.
For example, 1 MB of data divided into 32 KB chunks takes 32 chunks, while dividing it into 64 KB chunks takes only 16. So if you decrease these values you need to send more IP packets over the network, and if you increase them you send fewer.
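The arithmetic above can be checked with a few lines of shell. This is only a sketch: the function name rpc_requests is made up for illustration, and it counts read/write requests rather than actual IP packets, which also depend on the MTU.

```shell
#!/bin/sh
# rpc_requests TOTAL_BYTES CHUNK_BYTES
# Prints how many read/write requests of at most CHUNK_BYTES are needed
# to move TOTAL_BYTES (rounded up). Hypothetical helper, not an NFS tool.
rpc_requests() {
    total=$1
    chunk=$2
    echo $(( (total + chunk - 1) / chunk ))
}

one_mb=$(( 1024 * 1024 ))
for size in 8192 32768 65536; do
    echo "rsize/wsize=$size -> $(rpc_requests "$one_mb" "$size") requests per 1 MB"
done
```

For 1 MB this reports 128, 32 and 16 requests respectively.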
So the decision to modify this parameter must always depend on the capability of the network. If you have 1 Gigabit ports on your NFS server and client, and the network switches connecting these servers also have 1G ports, then I would suggest tweaking these parameters to a higher value.
You can easily modify the rsize and wsize values while mounting, as shown below. (The maximum value that can be set here is 65536; the limit depends on your kernel version and on what the server supports, and newer kernels accept values up to 1048576.)
[root@slashroot2 ~]# mount 192.168.0.104:/data  /mnt -o rsize=65536,wsize=65536
You can modify the rsize and wsize options as in the mount command shown above, or you can set them permanently in the fstab mount entry.
The best way to select good rsize and wsize values is to try different values and run a read/write performance test for each, then select the values that give you the best performance. You can refer to our post on read/write performance tests in Linux to test the speed.
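One way to organise such a test is sketched below. It only prints the mount, write test and unmount commands for each block size, so you can review them before running anything as root. The server address, the mount point, the 256 MB test size and the function name print_test_plan are assumptions made for illustration.

```shell
#!/bin/sh
# Print (not run) a mount / write test / unmount sequence for each block size.
# SERVER, MOUNTPOINT and the 256 MB test size are assumptions for illustration.
SERVER="192.168.0.103:/data"
MOUNTPOINT="/mnt"

print_test_plan() {
    for size in "$@"; do
        echo "mount $SERVER $MOUNTPOINT -o rsize=$size,wsize=$size"
        # conv=fsync makes dd flush the data before it reports its timing
        echo "dd if=/dev/zero of=$MOUNTPOINT/testfile bs=1M count=256 conv=fsync"
        echo "umount $MOUNTPOINT"
    done
}

print_test_plan 8192 32768 65536
```

Compare the throughput that dd reports for each size and keep the value that performs best on your network.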

Modifying Network MTU Size for NFS

MTU stands for Maximum Transmission Unit. It is the largest amount of data that can be passed in one Ethernet frame. Most machines have it configured to the default value of 1500 bytes.
To get the current MTU value of your network interfaces, you can run the command below.
[root@slashroot2 ~]# netstat -i
Kernel Interface table
Iface       MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0       1500   0      227      0      0      0      235      0      0      0
Alternatively, you can get the MTU value from the ifconfig command in Linux.
Suppose your rsize and wsize value is 8 KB and you are using an MTU of 1500 bytes. The data will still be split across several frames while sending, because the maximum frame payload is 1500 bytes.
In that case, if you change your MTU to 9000 bytes, the whole 8 KB of read/write data can be sent in one frame without fragmenting. But for this to work, you need to change the MTU of both the server and the client to the same value, and the switches between them must support the larger frame size.
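As a rough illustration of the effect of the MTU, the sketch below counts how many frames an 8 KB chunk needs. The function name frames_needed is hypothetical, and header overhead is deliberately ignored, so treat the numbers as an estimate only.

```shell
#!/bin/sh
# frames_needed PAYLOAD_BYTES MTU_BYTES
# Rough count of frames needed to carry PAYLOAD_BYTES when each frame
# holds at most MTU_BYTES. Protocol header overhead is ignored.
frames_needed() {
    payload=$1
    mtu=$2
    echo $(( (payload + mtu - 1) / mtu ))
}

echo "8 KB with MTU 1500: $(frames_needed 8192 1500) frames"
echo "8 KB with MTU 9000: $(frames_needed 8192 9000) frame"
```

With an MTU of 1500 the 8 KB chunk is spread over 6 frames, while with an MTU of 9000 it fits in one.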
Changing the MTU is quite simple in Linux. You can specify the MTU in the configuration file of the required interface. Suppose you need to change the MTU for your eth0 interface. You simply need to edit the file /etc/sysconfig/network-scripts/ifcfg-eth0 and add the line "MTU=9000".
Alternatively, you can change the MTU with the ifconfig command, as shown below.
[root@slashroot2 ~]# ifconfig eth0 mtu 9000 up
[root@slashroot2 ~]# ifconfig
eth0      Link encap:Ethernet  HWaddr 08:00:27:55:D1:CC
          inet6 addr: fe80::a00:27ff:fe55:d1cc/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:9000  Metric:1
Note: Changing the MTU size is quite risky in a production network, because it can sometimes affect your running applications. Also, some ISPs do not accept frames that are larger than their specified MTU size.
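Before relying on a larger MTU, it helps to confirm that full-size packets really pass between client and server without fragmentation. The sketch below prints a ping command for that check. The helper name icmp_payload and the server address are assumptions; -M do (prohibit fragmentation) and -s (payload size) are options of the Linux iputils ping.

```shell
#!/bin/sh
# icmp_payload MTU_BYTES
# Largest ping payload that fits in one unfragmented IPv4 packet:
# the MTU minus 20 bytes of IPv4 header and 8 bytes of ICMP header.
icmp_payload() {
    echo $(( $1 - 28 ))
}

SERVER="192.168.0.103"   # assumed NFS server address

# Print the check instead of running it, so it can be reviewed first.
echo "ping -M do -c 3 -s $(icmp_payload 9000) $SERVER"
```

If the printed command gets replies, packets of 9000 bytes are getting through end to end. If it reports that the message is too long or fragmentation is needed, some device on the path is still using a smaller MTU.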

 

timeo and retrans options in NFS

These two options control how the client retries a request when the server's response is delayed or when there is no response at all.
The timeo option decides how long the client waits before concluding that it must retransmit the packet. The value is given in tenths of a second, so a timeo of 5 means the client will wait for 5/10 of a second before deciding that it needs to resend the packet. The often-quoted default of 0.7 seconds (timeo=7) applies to older NFS over UDP mounts; for NFS over TCP the default is 600 (60 seconds), which is what the /proc/mounts output earlier shows.
The second option, retrans, decides the total number of attempts made by the client in case it gets a timeout (after waiting for the timeo interval you provided).
So if you give retrans a value of 3, the client will resend the RPC packet 3 times (waiting for the timeo interval each time) before concluding that the server is not available and giving you a "Server not responding" message. After the message the counter resets and the client keeps on trying (with the same timeo and retrans values).
You can set the timeo and retrans values as options to the mount command, as shown below.

[root@slashroot2 ~]# mount 192.168.0.102:/data /mnt -o timeo=5,retrans=4
[root@slashroot2 ~]#
If you want to see the current NFS statistics for packet retransmission, you can use the nfsstat command, as shown below.
[root@slashroot2 ~]# nfsstat -r
Client rpc stats:
calls      retrans    authrefrsh
5          0          0
On a congested network, where your client gets replies from the server but they are a little delayed (so that retransmissions happen too often), you can increase the timeo value. This will result in a small increase in performance.
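To judge whether retransmissions happen "too often", it can help to express them as a percentage of all RPC calls. The sketch below does that for output in the format shown above. The function name retrans_percent is hypothetical, and the sample input simply repeats the figures from this article.

```shell
#!/bin/sh
# retrans_percent
# Reads "nfsstat -r" style output on stdin and prints client
# retransmissions as a percentage of client RPC calls.
retrans_percent() {
    awk '
        /^Client rpc stats/   { in_client = 1; next }
        in_client && /^calls/ { want = 1; next }
        want {
            if ($1 > 0) printf "%.2f\n", ($2 * 100) / $1
            else print "0.00"
            exit
        }
    '
}

# Sample input copied from the nfsstat output shown above
retrans_percent <<'EOF'
Client rpc stats:
calls      retrans    authrefrsh
5          0          0
EOF
```

On a live client you could feed it real data with nfsstat -rc | retrans_percent and watch whether the percentage drops after you raise timeo.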

 

Number of NFS threads on the NFS server

Another important factor that needs to be taken care of while working with NFS is the total number of NFS threads that are available on the NFS server. If you have a large number of clients that access your NFS server, then it will be better to increase the number of threads on the NFS server.
You can check the current number of threads on your NFS server with the command below.
[root@slashroot1 ~]# ps aux | grep nfs
root      4794  0.0  0.0      0     0 ?        S<   03:18   0:00 [nfsd4]
root      4795  0.0  0.0      0     0 ?        S    03:18   0:00 [nfsd]
root      4796  0.0  0.0      0     0 ?        S    03:18   0:00 [nfsd]
root      4797  0.0  0.0      0     0 ?        S    03:18   0:00 [nfsd]
root      4798  0.0  0.0      0     0 ?        S    03:18   0:00 [nfsd]
root      4799  0.0  0.0      0     0 ?        S    03:18   0:00 [nfsd]
root      4800  0.0  0.0      0     0 ?        S    03:18   0:00 [nfsd]
root      4801  0.0  0.0      0     0 ?        S    03:18   0:00 [nfsd]
root      4802  0.0  0.0      0     0 ?        S    03:18   0:00 [nfsd]
If you count the nfsd processes, there are 8 (which is the default number). This means that if you have a large number of clients accessing this NFS server, they will experience some lag in their operations.
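Counting the threads by eye gets tedious once there are more of them. The sketch below counts the [nfsd] entries in ps output. The function name count_nfsd is hypothetical, and the sample input is the process list shown above.

```shell
#!/bin/sh
# count_nfsd
# Counts the [nfsd] kernel threads in "ps aux" style output read from stdin.
# The [nfsd4] helper thread is not counted, because the pattern requires
# the closing bracket straight after "nfsd".
count_nfsd() {
    grep -c '\[nfsd\]'
}

# Sample input copied from the ps output shown above
count_nfsd <<'EOF'
root      4794  0.0  0.0      0     0 ?        S<   03:18   0:00 [nfsd4]
root      4795  0.0  0.0      0     0 ?        S    03:18   0:00 [nfsd]
root      4796  0.0  0.0      0     0 ?        S    03:18   0:00 [nfsd]
root      4797  0.0  0.0      0     0 ?        S    03:18   0:00 [nfsd]
root      4798  0.0  0.0      0     0 ?        S    03:18   0:00 [nfsd]
root      4799  0.0  0.0      0     0 ?        S    03:18   0:00 [nfsd]
root      4800  0.0  0.0      0     0 ?        S    03:18   0:00 [nfsd]
root      4801  0.0  0.0      0     0 ?        S    03:18   0:00 [nfsd]
root      4802  0.0  0.0      0     0 ?        S    03:18   0:00 [nfsd]
EOF
```

On the server itself you would run ps aux | count_nfsd. On systems where the nfsd file system is mounted, reading /proc/fs/nfsd/threads should report the same number.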
Let's increase this number to something higher, like 16. You can modify this value in the /etc/sysconfig/nfs file.
# Number of nfs server processes to be started.
# The default is 8.
RPCNFSDCOUNT=16
After modifying that value, you need to restart the nfs service. You should now get 16 instead of 8 in the process list.

 

Async and Sync in NFS mount

These are the two options that determine how data is written on the server in response to a client request.
Both have their own advantages and disadvantages. Let's first understand what async and sync mean in an NFS mount.
Whatever you do on an NFS client is converted to an equivalent RPC operation, so that it can be sent to the server using the RPC protocol. If you are using the async option, when the server receives an RPC operation for writing, it first converts that operation to a VFS (Virtual File System) operation to write the data to the underlying disk system.
As soon as the VFS hands the write operation to the underlying disk, even before getting an acknowledgement that the write has completed, the server becomes ready to accept further RPC write operations. In this case the NFS server improves write performance by reducing the time needed to complete each write operation.
But this method can sometimes cause data loss and corruption, because the NFS server starts to accept more write operations even before the underlying disk system has finished its job.
Using the sync option does the reverse. In this case the server replies only after a write operation has successfully completed (which means only after the data is completely written to the disk).
If you are dealing with critical data, I would never suggest using the async option; however, async is a good choice where your data is not that critical.
[root@slashroot2 ~]# mount 192.168.0.101:/data /mnt -o rw,async
Similarly, you can use the sync option according to your requirement. You can make this mount permanent by making an entry in fstab.
192.168.0.101:/data  /mnt  nfs  rw,async   0   0
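The server-side behaviour described above, where the server replies before or after the data reaches the disk, is also controlled by the sync and async options of the export on the server. A minimal sketch of such an /etc/exports entry follows; the client subnet 192.168.0.0/24 is an assumption for illustration.

```
# /etc/exports on the NFS server (the client subnet is an assumption)
# sync: reply only after the data has reached stable storage
/data  192.168.0.0/24(rw,sync)
# async: reply before the data is committed; faster, but data can be lost on a crash
# /data  192.168.0.0/24(rw,async)
```

After editing the file, running exportfs -ra on the server re-reads the export list.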

Tuning Input and output Socket Queue for NFS performance

Transferring large files over the network requires a lot of memory on the server as well as the client. However, a Linux machine by default does not allocate a large amount of memory for this purpose, as it needs memory for other applications as well.
You can tune this further and allocate more memory if you have heavy input and output over the network.
There are two values that can be modified to tune this. One is the socket input queue and the other is the socket output queue. The input queue is where incoming requests that need to be processed queue up.
The output queue is where outgoing requests queue up. We have already seen that increasing the number of NFS threads on the server can improve performance. Imagine you have 16 threads on your server, each processing requests from a separate client. All of them use the same socket input and output queues (and other applications on the server also use these queues for processing their requests). This means that with larger input and output socket queues, all of your threads can send and receive data effectively.
You can change these values by modifying the sysctl.conf file or, if you want, by directly modifying the files in /proc (you need to restart the NFS server after modifying them).
echo 219136 > /proc/sys/net/core/rmem_default
echo 219136 > /proc/sys/net/core/rmem_max
And you can also modify the output queue by modifying the wmem_default & wmem_max values as shown below.
echo 219136 > /proc/sys/net/core/wmem_default
echo 219136 > /proc/sys/net/core/wmem_max
Anything that you modify in the /proc file system is temporary, because the value is stored in RAM and does not persist across reboots. You can make these settings permanent by adding entries to sysctl.conf, as shown below.
[root@slashroot2 ~]# echo 'net.core.wmem_max=219136' >> /etc/sysctl.conf
[root@slashroot2 ~]# echo 'net.core.rmem_max=219136' >> /etc/sysctl.conf
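The two commands above persist only the maximum values. If you also want the default values from the earlier /proc example to survive a reboot, all four settings can be generated in one go. This is a sketch; the function name print_socket_sysctls is hypothetical, and 219136 is simply the value used in this article.

```shell
#!/bin/sh
# print_socket_sysctls BYTES
# Prints sysctl.conf lines that set the default and maximum socket
# receive (rmem) and send (wmem) buffer sizes to BYTES.
print_socket_sysctls() {
    bytes=$1
    for key in rmem_default rmem_max wmem_default wmem_max; do
        echo "net.core.$key=$bytes"
    done
}

print_socket_sysctls 219136
```

Review the printed lines, append them to /etc/sysctl.conf, and run sysctl -p to load them without rebooting.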

Underlying Disk Configuration in NFS server

The configuration and make of the underlying disk that you expose as an NFS share on the server plays a significant role in performance. If your NFS share is on a RAID array, that can improve read and write performance, depending on the RAID level configured.
RAID level 10 is usually the best choice, but it is pretty costly because of the number of disks used. If you want good read speed, you can go for RAID level 5 or 6, but RAID levels 5 and 6 are a bit slow for writes.
Tune each of the parameters suggested in this article while continuously running the read/write performance test, to reach an optimum level of tuning.
