
Wednesday, September 17, 2014

Guide to Advanced Linux Command Mastery For DBA Users

Note that these commands may differ slightly depending on the specific Linux distribution and kernel you use.

Painless Changes to Owner, Group, and Permissions

In Sheryl's article you learned how to use the chown and chgrp commands to change the ownership and group of files. Say you have several files like this:
# ls -l
total 8
-rw-r--r--    1 ananda   users          70 Aug  4 04:02 file1
-rwxr-xr-x    1 oracle   dba           132 Aug  4 04:02 file2
-rwxr-xr-x    1 oracle   dba           132 Aug  4 04:02 file3
-rwxr-xr-x    1 oracle   dba           132 Aug  4 04:02 file4
-rwxr-xr-x    1 oracle   dba           132 Aug  4 04:02 file5
-rwxr-xr-x    1 oracle   dba           132 Aug  4 04:02 file6
and you need to change the permissions of all the files to match those of file1. Sure, you could issue chmod 644 * to make that change, but what if you are writing a script and don't know the permissions beforehand? Or perhaps you are making several permission changes based on many different files, and it is infeasible to check the permissions of each one and modify them accordingly.
A better approach is to make the permissions similar to those of another file. This command makes the permissions of file2 the same as file1:
chmod --reference file1 file2
Now if you check:
# ls -l file[12]
total 8
-rw-r--r--    1 ananda   users          70 Aug  4 04:02 file1
-rw-r--r--    1 oracle   dba           132 Aug  4 04:02 file2
The permissions of file2 were changed to match those of file1 exactly; you didn't need to look up the permissions of file1 first.
You can use the same trick for the group membership of files. To make the group of file2 the same as that of file1, you would issue:
# chgrp --reference file1 file2
# ls -l file[12]
-rw-r--r--    1 ananda   users          70 Aug  4 04:02 file1
-rw-r--r--    1 oracle   users         132 Aug  4 04:02 file2
Of course, what works for changing groups works for the owner as well. Here is how you can use the same trick for an ownership change. If the files look like this:
# ls -l file[12] 
-rw-r--r--    1 ananda   users          70 Aug  4 04:02 file1
-rw-r--r--    1 oracle   dba           132 Aug  4 04:02 file2
You can change the ownership like this:
# chown --reference file1 file2
# ls -l file[12] 
-rw-r--r--    1 ananda   users          70 Aug  4 04:02 file1
-rw-r--r--    1 ananda   users         132 Aug  4 04:02 file2
Note that the group as well as the owner have changed.
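The same trick scales to many files at once. Here is a minimal, self-contained sketch (the file names are illustrative, and the GNU --reference=FILE form is used):

```shell
# Copy the permission bits of a reference file to several targets
# (all names here are illustrative; a throwaway directory is used)
cd "$(mktemp -d)"
touch reference.txt target1 target2
chmod 640 reference.txt
chmod 777 target1 target2

# Apply the permissions of reference.txt to each target
for f in target1 target2; do
    chmod --reference=reference.txt "$f"
done

stat -c '%a %n' reference.txt target1 target2
```

The same loop works with chown --reference or chgrp --reference if you also need to align ownership.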

Tip for Oracle Users

This is a trick you can use to change the ownership and permissions of Oracle executables in a directory based on some reference executable. It proves especially useful in migrations, where you can (and probably should) install as a different user and later change the files over to your regular Oracle software owner.

More on Files

The ls command, with its many arguments, provides some very useful information on files. A different and less well known command, stat, offers even more.
Here is how you can use it on the executable “oracle”, found under $ORACLE_HOME/bin.
# cd $ORACLE_HOME/bin
# stat oracle
  File: `oracle'
  Size: 93300148        Blocks: 182424     IO Block: 4096   Regular File
Device: 343h/835d       Inode: 12009652    Links: 1    
Access: (6751/-rwsr-s--x)  Uid: (  500/  oracle)   Gid: (  500/     dba)
Access: 2006-08-04 04:30:52.000000000 -0400
Modify: 2005-11-02 11:49:47.000000000 -0500
Change: 2005-11-02 11:55:24.000000000 -0500
Note the information you get from this command: in addition to the usual file size (which you can get from ls -l anyway), you get the number of blocks this file occupies. The typical Linux block size is 512 bytes, so a file of 93,300,148 bytes would span 93300148/512 = 182,226.85 blocks. Since blocks are allocated whole, this file uses a whole number of blocks; instead of guessing, you can read the exact count from the output (here 182424, slightly more than the raw division suggests because the count includes filesystem bookkeeping such as indirect blocks).
You also get from the output above the GID and UID of the ownership of the file and the octal representation of the permissions (6751). If you want to reinstate it back to the same permissions it has now, you could use chmod 6751 oracle instead of explicitly spelling out the permissions.
The most useful part of the above output is the timestamp information. The file was last accessed on 2006-08-04 04:30:52 (shown next to "Access:"), that is, August 4, 2006 at 4:30:52 AM; for the oracle executable, this is when someone last started the database. The file was modified on 2005-11-02 11:49:47 (shown next to "Modify:"). Finally, the timestamp next to "Change:" shows when the status of the file, its metadata such as permissions or ownership, was last changed.
The -f option, by contrast, makes stat report on the filesystem containing the file instead of the file itself:
# stat -f oracle
  File: "oracle"
    ID: 0        Namelen: 255     Type: ext2/ext3
Blocks: Total: 24033242   Free: 15419301   Available: 14198462   Size: 4096
Inodes: Total: 12222464   Free: 12093976  
Another option, -t, gives exactly the same information but on one line:
# stat -t oracle 
oracle 93300148 182424 8de9 500 500 343 12009652 1 0 0 1154682061 
1130950187 1130950524 4096
This is very useful in shell scripts where a simple cut command can be used to extract the values for further processing.
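A related convenience, if your stat is the GNU coreutils version: the -c option takes a format string so you can print exactly the fields you need, which is often cleaner in scripts than cutting the -t output. A small sketch (the file is created just for the demo):

```shell
# GNU stat's -c option prints only the requested fields:
# %a = octal permissions, %s = size in bytes, %U = owner name
cd "$(mktemp -d)"
printf 'hello\n' > demo.txt     # 6 bytes
chmod 640 demo.txt

stat -c '%a %s' demo.txt        # prints "640 6"
```

Note that BSD systems ship a different stat whose format mechanism (-f with other codes) is not compatible with this one.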

Tip for Oracle Users

When you relink Oracle (often done during patch installations), it moves the existing executables to a different name before creating the new one. For instance, you could relink all the utilities by
relink utilities
It recompiles, among other things, the sqlplus executable, first moving the existing executable sqlplus to sqlplusO. If the recompilation fails for some reason, the relink process renames sqlplusO back to sqlplus, and the changes are undone. Similarly, if you discover a functionality problem after applying a patch, you can quickly undo the patch by renaming the file yourself.
Here is how you can use stat on these files:
# stat sqlplus*
  File: 'sqlplus'
  Size: 9865            Blocks: 26         IO Block: 4096   Regular File
Device: 343h/835d       Inode: 9126079     Links: 1    
Access: (0751/-rwxr-x--x)  Uid: (  500/  oracle)   Gid: (  500/     dba)
Access: 2006-08-04 05:15:18.000000000 -0400
Modify: 2006-08-04 05:15:18.000000000 -0400
Change: 2006-08-04 05:15:18.000000000 -0400
 
  File: 'sqlplusO'
  Size: 8851            Blocks: 24         IO Block: 4096   Regular File
Device: 343h/835d       Inode: 9125991     Links: 1    
Access: (0751/-rwxr-x--x)  Uid: (  500/  oracle)   Gid: (  500/     dba)
Access: 2006-08-04 05:13:57.000000000 -0400
Modify: 2005-11-02 11:50:46.000000000 -0500
Change: 2005-11-02 11:55:24.000000000 -0500
It shows sqlplusO was modified on November 2, 2005, while sqlplus was modified on August 4, 2006, which also corresponds to the status change time of sqlplusO. This indicates that the original version of sqlplus was in effect from Nov 2, 2005 to Aug 4, 2006. If you want to diagnose some functionality issue, this is a great place to start. And since you also know when the permissions last changed, you can correlate that change with any perceived problem.
Another important piece of output is the file size, which differs: 9865 bytes for sqlplus as opposed to 8851 for sqlplusO. The versions are therefore not mere recompiles; the contents actually changed, perhaps with additional libraries linked in. That difference can itself point to the cause of some problems.

File Types

When you see a file, how do you know what type of file it is? The command file tells you that. For instance:
# file alert_DBA102.log
alert_DBA102.log: ASCII text
The file alert_DBA102.log is an ASCII text file. Let’s see some more examples:
# file initTESTAUX.ora.Z
initTESTAUX.ora.Z: compress'd data 16 bits
This tells you that the file is compressed, but what type of file was compressed inside it? One option is to uncompress it and run file against the result, but that is wasteful and sometimes impractical. A cleaner option is the -z option, which looks inside compressed files:
# file -z initTESTAUX.ora.Z
initTESTAUX.ora.Z: ASCII text (compress'd data 16 bits)
Another quirk is the presence of symbolic links:
# file spfile+ASM.ora.ORIGINAL   
spfile+ASM.ora.ORIGINAL: symbolic link to 
/u02/app/oracle/admin/DBA102/pfile/spfile+ASM.ora.ORIGINAL
This is useful, but what type of file is being pointed to? Instead of running file again against the target, you can use the -L option, which follows the link:
# file -L spfile+ASM.ora.ORIGINAL
spfile+ASM.ora.ORIGINAL: data
This clearly shows that the file is a data file. Note that an spfile is binary, as opposed to init.ora, so file reports it as data.

Tip for Oracle Users

Suppose you are looking for a trace file in the user dump destination directory but are unsure whether the file is located in another directory and merely exists here as a symbolic link, or whether someone has compressed it (or even renamed it). One thing you do know: it's definitely an ASCII file. Here is what you can do:
file -Lz * | grep ASCII | cut -d":" -f1 | xargs ls -ltr
This command finds the ASCII files, even compressed ones and those behind symbolic links, and lists them in chronological order.

Comparing Files

How do you find out if two files—file1 and file2—are identical? There are several ways and each approach has its own appeal.
diff. The simplest command is diff, which shows the difference between two files. Here are the contents of two files:
# cat file1
In file1 only
In file1 and file2
# cat file2
In file1 and file2
In file2 only
If you use the diff command, you will be able to see the difference between the files as shown below:
# diff file1 file2
1d0
< In file1 only
2a2
> In file2 only
#
In the output, a "<" in the first column indicates that the line exists only in the file mentioned first, that is, file1. A ">" in that position indicates that the line exists only in the second file (file2). The characters 1d0 in the first line of the output describe, in ed-style notation, the operation that would make file1 the same as file2.
Another option, -y, shows the same output, but side by side:
# diff -y file1 file2 -W 120
In file1 only                             <
In file1 and file2                             In file1 and file2
                                          >    In file2 only

The -W option is optional; it merely instructs the command to use a 120-character wide screen, useful for files with long lines.
If you just want to know whether the files differ, not necessarily how, you can use the -q option.
# diff -q file3 file4
# diff -q file3 file2
Files file3 and file2 differ
Files file3 and file4 are the same so there is no output; in the other case, the fact that the files differ is reported.
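Because diff also reports the result through its exit status (0 when the files are identical, 1 when they differ), the -q form slots naturally into scripts. A minimal sketch with throwaway files:

```shell
# diff exit status: 0 = identical, 1 = different
cd "$(mktemp -d)"
printf 'same\n'  > a.txt
printf 'same\n'  > b.txt
printf 'other\n' > c.txt

if diff -q a.txt b.txt > /dev/null; then
    echo "a and b match"
fi
diff -q a.txt c.txt > /dev/null || echo "a and c differ"
```

Redirecting to /dev/null suppresses the "Files ... differ" message when only the exit status matters.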
If you are writing a shell script, it might be useful to produce the output in such a manner that it can be parsed. The -u option does that:
# diff -u file1 file2        
--- file1       2006-08-04 08:29:37.000000000 -0400
+++ file2       2006-08-04 08:29:42.000000000 -0400
@@ -1,2 +1,2 @@
-In file1 only
 In file1 and file2
+In file2 only
The output shows the contents of both files with duplicates suppressed; the - and + signs in the first column mark lines removed from or added to the first file, and no character in the first column indicates presence in both files.
By default the command takes whitespace into consideration. If you want to ignore whitespace, use the -b option; use -B to ignore blank lines; and finally, use -i to ignore case.
The diff command can also be applied to directories. The command
diff dir1 dir2
shows the files present in either directory, reporting whether each file exists in only one of the directories or in both. If it finds a subdirectory of the same name in both, it notes it as common but does not descend into it to check whether any individual files differ. Here is an example:
# diff DBA102 PROPRD     
Common subdirectories: DBA102/adump and PROPRD/adump
Only in DBA102: afiedt.buf
Only in PROPRD: archive
Only in PROPRD: BACKUP
Only in PROPRD: BACKUP1
Only in PROPRD: BACKUP2
Only in PROPRD: BACKUP3
Only in PROPRD: BACKUP4
Only in PROPRD: BACKUP5
Only in PROPRD: BACKUP6
Only in PROPRD: BACKUP7
Only in PROPRD: BACKUP8
Only in PROPRD: BACKUP9
Common subdirectories: DBA102/bdump and PROPRD/bdump
Common subdirectories: DBA102/cdump and PROPRD/cdump
Only in PROPRD: CreateDBCatalog.log
Only in PROPRD: CreateDBCatalog.sql
Only in PROPRD: CreateDBFiles.log
Only in PROPRD: CreateDBFiles.sql
Only in PROPRD: CreateDB.log
Only in PROPRD: CreateDB.sql
Only in DBA102: dpdump
Only in PROPRD: emRepository.sql
Only in PROPRD: init.ora
Only in PROPRD: JServer.sql
Only in PROPRD: log
Only in DBA102: oradata
Only in DBA102: pfile
Only in PROPRD: postDBCreation.sql
Only in PROPRD: RMANTEST.sh
Only in PROPRD: RMANTEST.sql
Common subdirectories: DBA102/scripts and PROPRD/scripts
Only in PROPRD: sqlPlusHelp.log
Common subdirectories: DBA102/udump and PROPRD/udump
Note that the common subdirectories are simply reported as such but no comparison is made. If you want to drill down even further and compare files under those subdirectories, you should use the following command:
diff -r dir1 dir2
This command recursively goes into each subdirectory to compare the files and reports the difference between the files of the same names.

Tip for Oracle Users

One common use of diff is to compare different versions of init.ora files. As a best practice, I always copy the file to a new name, e.g. initDBA102.ora to initDBA102.080306.ora (to indicate August 3, 2006), before making a change. A simple diff between all versions of the file quickly tells me what changed and when.
This is a pretty powerful command to manage your Oracle home. As a best practice, I never update an Oracle Home when applying patches. For instance, suppose the current Oracle version is 10.2.0.1. The ORACLE_HOME could be /u01/app/oracle/product/10.2/db1. When the time comes to patch it to 10.2.0.2, I don’t patch this Oracle Home. Instead, I start a fresh installation on /u01/app/oracle/product/10.2/db2 and then patch that home. Once it’s ready, I use the following:
# sqlplus / as sysdba
SQL> shutdown immediate
SQL> exit
# export ORACLE_HOME=/u01/app/oracle/product/10.2/db2
# export PATH=$ORACLE_HOME/bin:$PATH
# sqlplus / as sysdba
SQL> @$ORACLE_HOME/rdbms/admin/catalog
...
and so on.
The point of this approach is that the original Oracle Home is not disturbed and I can easily fall back in case of problems. It also means the database goes down and comes back up almost immediately. Had I installed the patch directly into the existing Oracle Home, the database would have been down for the entire duration of the patch application. In addition, if the patch application failed for any reason, I would not have a clean Oracle Home to fall back on.
Now that I have several Oracle Homes, how can I see what changed? It’s really simple; I can use:
diff -r /u01/app/oracle/product/10.2/db1 /u01/app/oracle/product/10.2/db2 | grep -v Common
This tells me the differences between the two Oracle Homes and the differences between the files of the same name. Some important files like tnsnames.ora, listener.ora, and sqlnet.ora should not show wide differences, but if they do, then I need to understand why.
cmp. The command cmp is similar to diff:
# cmp file1 file2   
file1 file2 differ: byte 10, line 1
The output reports the first point of difference. You can use this to identify where the files diverge. Like diff, cmp has a lot of options, the most important being the -s option, which merely returns a code:
  • 0, if the files are identical
  • 1, if they differ
  • Some other non-zero number, if the comparison couldn’t be made
Here is an example:
# cmp -s file3 file4
# echo $?
0
The special variable $? indicates the return code from the last executed command. In this case it's 0, meaning the files file3 and file4 are identical.
# cmp -s file1 file2
# echo $?
1
means file1 and file2 are not the same.
This property of cmp can prove very useful in shell scripting where you merely want to check if two files differ in any way, but not necessarily check what the difference is. Another important use of this command is to compare binary files, where diff may not be reliable.
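Here is a sketch of that scripting pattern: cmp -s inside a loop, reporting which backup copies differ from the current file (all names and contents are illustrative):

```shell
# Use cmp -s (silent) to compare a file against its backup copies;
# the exit status alone tells us whether they match
cd "$(mktemp -d)"
printf 'version 2\n' > prog
printf 'version 1\n' > prog.bak1
printf 'version 2\n' > prog.bak2

for bak in prog.bak1 prog.bak2; do
    if cmp -s prog "$bak"; then
        echo "$bak: identical"
    else
        echo "$bak: differs"
    fi
done
```

Because cmp works byte by byte, the same loop is safe for binary executables where diff output would not be meaningful.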

Tip for Oracle Users

Recall from a previous tip that when you relink Oracle executables, the older version is kept prior to being overwritten. So, when you relink, the executable sqlplus is renamed to “sqlplusO” and the newly compiled sqlplus is placed in the $ORACLE_HOME/bin. So how do you ensure that the sqlplus that was just created is any different? Just use:
# cmp sqlplus sqlplusO
sqlplus sqlplusO differ: byte 657, line 7
If you check the size:
# ls -l sqlplus*
-rwxr-x--x    1 oracle   dba          8851 Aug  4 05:15 sqlplus
-rwxr-x--x    1 oracle   dba          8851 Nov  2  2005 sqlplusO
Even though the size is the same in both cases, cmp proved that the two programs differ.
comm. The command comm is similar to the others, but its output comes in three columns separated by tabs: lines only in the first file, lines only in the second file, and lines in both. Here is an example:
# comm file1 file2
        In file1 and file2
In file1 only
In file1 and file2
        In file2 only

Summary of Commands in This Installment


Command   Use
-------   ---
chmod     Change the permissions of a file, here using the --reference parameter
chown     Change the owner of a file, here using the --reference parameter
chgrp     Change the group of a file, here using the --reference parameter
stat      Find out the extended attributes of a file, such as the date last accessed
file      Find out the type of a file, such as ASCII, data, and so on
diff      See the difference between two files
cmp       Compare two files
comm      See what's common between two files, with the output in three columns
md5sum    Calculate the MD5 hash value of files, used to determine whether a file has changed
Returning to comm: this command is useful when you want to see the lines present in one file but not the other, not just a difference, sort of a MINUS utility in SQL. The -1 option suppresses the lines found only in the first file:
# comm -1 file1 file2
In file1 and file2
In file2 only
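One caveat worth knowing: comm assumes its inputs are sorted, and on unsorted files the columns can be misleading. The suppression options also combine, giving set-style results. A small sketch with throwaway files:

```shell
# comm expects sorted input; sort the files first if you are unsure
cd "$(mktemp -d)"
printf 'apple\nbanana\ncherry\n' > f1
printf 'banana\ncherry\ndate\n'  > f2

comm -12 f1 f2    # suppress columns 1 and 2: lines common to both
comm -23 f1 f2    # suppress columns 2 and 3: lines only in f1 (a SQL MINUS flavor)
```

Here comm -12 prints banana and cherry, while comm -23 prints only apple.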
md5sum. This command generates a 128-bit MD5 hash value of files, displayed as 32 hexadecimal characters:
# md5sum file1
ef929460b3731851259137194fe5ac47  file1
Two files with the same checksum can be considered identical. However, the usefulness of this command goes beyond just comparing files. It can also provide a mechanism to guarantee the integrity of the files.
Suppose you have two important files, file1 and file2, that you need to protect. You can use the --check option to confirm the files haven't changed. First, create a checksum file for both of these important files and keep it safe:
# md5sum file1 file2 > f1f2
Later, when you want to verify that the files are still untouched:
# md5sum --check f1f2      
file1: OK
file2: OK
This shows clearly that the files have not been modified. Now change one file and check the MD5:
# cp file2 file1
# md5sum --check f1f2
file1: FAILED
file2: OK
md5sum: WARNING: 1 of 2 computed checksums did NOT match
The output clearly shows that file1 has been modified.

Tip for Oracle Users

md5sum is an extremely powerful command for security implementations. Some of the configuration files you manage, such as listener.ora, tnsnames.ora, and init.ora, are extremely critical in a successful Oracle infrastructure and any modification may result in downtime. These are typically a part of your change control process. Instead of just relying on someone’s word that these files have not changed, enforce it using MD5 checksum. Create a checksum file and whenever you make a planned change, recreate this file. As a part of your compliance, check this file using the md5sum command. If someone inadvertently updated one of these key files, you would immediately catch the change.
Along the same lines, you can also create MD5 checksums for all executables in $ORACLE_HOME/bin and compare them from time to time to detect unauthorized modifications.
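A minimal sketch of that checksum-based change control, using throwaway stand-ins for the real configuration files (the --status option, which suppresses output and reports only through the exit code, is a GNU coreutils feature):

```shell
# Build a baseline of checksums for critical files
# (illustrative stand-ins, not real Oracle configuration files)
cd "$(mktemp -d)"
printf 'LISTENER = ...\n' > listener.ora
printf 'PROD = ...\n'     > tnsnames.ora

md5sum listener.ora tnsnames.ora > baseline.md5

# Later, during a compliance check: --status gives exit code only
if md5sum --check --status baseline.md5; then
    echo "configuration unchanged"
else
    echo "WARNING: a file was modified"
fi
```

Store the baseline file somewhere the monitored account cannot write to; a checksum file that can be regenerated by the same person who changed the files proves nothing.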

alias and unalias

Suppose you want to check the ORACLE_SID environment variable set in your shell. You will have to type:
echo $ORACLE_SID
As a DBA or a developer, you use this command frequently and will quickly tire of typing the entire 16 characters. Is there a simpler way?
There is: the alias command. With this approach you can create a short alias, such as "os", to represent the entire command:
alias os='echo $ORACLE_SID'
Now whenever you want to check the ORACLE_SID, you just type "os" (without the quotes) and Linux executes the aliased command.
However, if you log out and log back in, the alias is gone and you have to enter the alias command again. To eliminate this step, all you have to do is put the command in your shell's profile file. For bash, the file is .bash_profile (note the period before the file name; that's part of the name) in your home directory. For the Bourne and Korn shells it's .profile, and for the C shell, .cshrc.
You can create an alias in any name. For instance, I always create an alias for the command rm as rm -i, which makes the rm command interactive.
alias rm='rm -i'
Whenever I issue an rm command, Linux prompts me for confirmation, and unless I answer "y", it doesn't remove the file; thus I am protected from accidentally removing an important file. I use the same for mv (which moves a file to a new name), preventing accidental overwriting of existing files, and for cp (which copies a file).
Here is a list of some very useful aliases I like to define:
alias bdump='cd $ORACLE_BASE/admin/$ORACLE_SID/bdump'
alias l='ls -d .* --color=tty'
alias ll='ls -l --color=tty'
alias mv='mv -i'
alias oh='cd $ORACLE_HOME'
alias os='echo $ORACLE_SID'
alias rm='rm -i'
alias tns='cd $ORACLE_HOME/network/admin'
To see what aliases have been defined in your shell, use alias without any parameters.
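To make such aliases survive logouts, append them to the profile file mentioned above. The sketch below uses a throwaway HOME so it does not touch a real profile:

```shell
# Persist aliases by appending them to the shell profile.
# A temporary HOME is used here purely for demonstration.
export HOME="$(mktemp -d)"

cat >> "$HOME/.bash_profile" <<'EOF'
alias os='echo $ORACLE_SID'
alias oh='cd $ORACLE_HOME'
alias rm='rm -i'
EOF

grep -c '^alias' "$HOME/.bash_profile"    # prints 3
```

The quoted heredoc delimiter ('EOF') keeps $ORACLE_SID and $ORACLE_HOME from being expanded at write time, so the profile stores the literal alias definitions.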
However, there is a small problem. I have defined an alias, rm, that executes rm -i. This command will prompt for my confirmation every time I try to delete a file. But what if I want to remove a lot of files and am confident they can be deleted without my confirmation?
The solution is simple: To suppress the alias and use the command only, I will need to enter two single quotes:
$ ''rm *
Note that these are two single quotes ('') before the rm command, not one double quote. Quoting any part of the command word suppresses the alias rm. Another approach is to use a backslash (\):
$ \rm *
To remove an alias previously defined, just use the unalias command:
$ unalias rm

ls

The humble ls command is frequently used but rarely to its full potential. Without any options, ls merely displays all files and directories in tabular format.
$ ls
admin            has                  mesg         precomp
apex             hs                   mgw          racg
assistants       install              network      rdbms
                               
... output snipped ...
                            
To show them in a list, use the -1 (this is the number 1, not the letter "l") option.
$ ls -1
admin
apex
assistants
                               
... output snipped ...
                            
This option is useful in shell scripts where the file names need to be fed into another program or command for manipulation.
You have most certainly used the -l option (the letter "l", not the number "1"), which displays all the attributes of files and directories. Let's see it once again:
$ ls -l 
total 272
drwxr-xr-x    3 oracle   oinstall     4096 Sep  3 03:27 admin
drwxr-x---    7 oracle   oinstall     4096 Sep  3 02:32 apex
drwxr-x---    7 oracle   oinstall     4096 Sep  3 02:29 assistants
The first column shows the type of file and the permissions on it: "d" means directory, "-" means a regular file, "c" means a character device, "b" means a block device, "p" means named pipe, and "l" (that's a lowercase letter L, not I) means symbolic link.
One very useful option is --color, which shows files in different colors based on the type of file (the exact colors depend on your terminal's color scheme). In a typical listing, the regular files file1 and file2 appear in the default color; link1, a symbolic link, dir1, a directory, and pipe1, a named pipe, each appear in a distinct color for easier identification.
In some distros, the ls command comes pre-aliased as ls --color (aliases were described in the previous section), so you see files in color whenever you type "ls". If you find this undesirable, you can change the colors, but a quicker way may be simply to bypass the alias:
$ alias ls="''ls"  
Another useful option is the -F option, which appends a symbol after each file to show the type of the file - a "/" after directories, "@" after symbolic links, and "|" after named pipes.
$ ls -F
dir1/  file1  file2  link1@  pipe1|
If you have a subdirectory under a directory and you want to list only that directory, ls -l will show you the contents of the subdirectory as well. For instance, suppose the directory structure is like the following:
/dir1
+-->/subdir1
+--> subfile1
+--> subfile2
The directory dir1 has a subdirectory subdir1 and two files: subfile1 and subfile2. If you just want to see the attributes of the directory dir1, you issue:
$ ls -l dir1
total 4
drwxr-xr-x    2 oracle   oinstall     4096 Oct 14 16:52 subdir1
-rw-r--r--    1 oracle   oinstall        0 Oct 14 16:48 subfile1
-rw-r--r--    1 oracle   oinstall        0 Oct 14 16:48 subfile2
Note that the directory dir1 is not itself listed in the output; rather, the contents of the directory are displayed. This is the expected behavior when ls is given a directory. To show the attributes of the directory dir1 itself, use the -d option.
$ ls -dl dir1
drwxr-xr-x    3 oracle   oinstall     4096 Oct 14 16:52 dir1
Now notice the following ls -l output:
-rwxr-x--x    1 oracle   oinstall 10457761 Apr  6  2006 rmanO
-rwxr-x--x    1 oracle   oinstall 10457761 Sep 23 23:48 rman
-rwsr-s--x    1 oracle   oinstall 93300507 Apr  6  2006 oracleO
-rwx------    1 oracle   oinstall 93300507 Sep 23 23:49 oracle
You will notice that the sizes of the files are shown in bytes. That is easy to read for small files, but for large files a long byte count is hard to parse at a glance. The -h option comes in handy then, displaying sizes in human-readable form.
$ ls -lh

-rwxr-x--x    1 oracle   oinstall      10M Apr  6  2006 rmanO
-rwxr-x--x    1 oracle   oinstall      10M Sep 23 23:48 rman
-rwsr-s--x    1 oracle   oinstall      89M Apr  6  2006 oracleO
-rwx------    1 oracle   oinstall      89M Sep 23 23:49 oracle
Note how the size has been shown in M (for megabytes), K (for kilobytes), and so on.
$ ls -lr
The parameter -r shows the output in the reverse order. In this command, the files will be shown in the reverse alphabetical order.
$ ls -lR
The -R option makes the ls command execute recursively: it descends into subdirectories and shows those files too.
What if you want to show the largest to the smallest files? This can be done with the -S parameter.
$ ls -lS

total 308
-rw-r-----    1 oracle   oinstall    52903 Oct 11 18:31 sqlnet.log
-rwxr-xr-x    1 oracle   oinstall     9530 Apr  6  2006 root.sh
drwxr-xr-x    2 oracle   oinstall     8192 Oct 11 18:14 bin
drwxr-x---    3 oracle   oinstall     8192 Sep 23 23:49 lib
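The sort options combine with -t (sort by modification time): ls -ltr, used throughout this article, lists the most recently modified files last. A quick sketch (touch -d is a GNU option, used here only to backdate a file for the demo):

```shell
# -t sorts newest first; -r reverses, so -tr shows oldest first
# and the most recently changed files at the bottom
cd "$(mktemp -d)"
touch -d '2 days ago' old.trc    # GNU touch: set an older timestamp
touch new.trc

ls -1tr    # old.trc first, new.trc last
```

This ordering is why ls -ltr is so popular for trace directories: the file you most likely care about is the last line on the screen.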

xargs

Most Linux commands are about producing output: a list of files, a list of strings, and so on. But what if you want to use that output as parameters to another command? For example, the file command shows the type of a file (executable, ASCII text, and so on); you can trim its output to show only the filenames, and you may then want to pass those names to the ls -l command to see the timestamps. The command xargs does exactly that: it lets you execute another command on the output of the previous one. Remember this syntax from the File Types tip earlier:
file -Lz * | grep ASCII | cut -d":" -f1 | xargs ls -ltr
Let's dissect this command string. The first part, file -Lz *, determines the type of every file, following symbolic links (-L) and looking inside compressed files (-z). Its output goes to the next command, grep ASCII, which keeps the lines containing the string "ASCII", producing output similar to this:
alert_DBA102.log:        ASCII English text
alert_DBA102.log.Z:      ASCII text (compress'd data 16 bits)
dba102_asmb_12307.trc.Z: ASCII English text (compress'd data 16 bits)
dba102_asmb_20653.trc.Z: ASCII English text (compress'd data 16 bits)
Since we are interested in the file names only, we apply the next command, cut -d":" -f1, to keep the first field only:
alert_DBA102.log
alert_DBA102.log.Z
dba102_asmb_12307.trc.Z
dba102_asmb_20653.trc.Z
Now we want to run the ls -l command with the above list as parameters, one at a time. The xargs command lets us do that: the last part, xargs ls -ltr, takes the output and executes ls -ltr against it, as if executing:
ls -ltr alert_DBA102.log
ls -ltr alert_DBA102.log.Z
ls -ltr dba102_asmb_12307.trc.Z
ls -ltr dba102_asmb_20653.trc.Z
Thus xargs is not useful by itself, but is quite powerful when combined with other commands.
Here is another example, where we want to count the number of lines in those files:
$ file * | grep ASCII | cut -d":" -f1  | xargs wc -l
  47853 alert_DBA102.log
     19 dba102_cjq0_14493.trc
  29053 dba102_mmnl_14497.trc
    154 dba102_reco_14491.trc
     43 dba102_rvwr_14518.trc
  77122 total
(Note: the above task can also be accomplished with the following command:)
$ wc -l `file * | grep ASCII | cut -d":" -f1`
The xargs version is given to illustrate the concept. Linux has several ways to achieve the same task; use the one that suits your situation best.
Using this approach you can quickly rename files in a directory.
$ ls | xargs -t -i mv {} {}.bak
The -i option tells xargs to replace {} with the name of each item. The -t option instructs xargs to print the command before executing it.
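On current GNU findutils the -i option is deprecated in favor of the equivalent -I, which names the placeholder explicitly. The same rename trick with -I, on throwaway files:

```shell
# xargs -I{} substitutes {} with each input item; -t echoes each
# command (to stderr) before running it
cd "$(mktemp -d)"
touch report1 report2

ls | xargs -t -I{} mv {} {}.bak

ls    # now report1.bak and report2.bak
```

With -I, each input line becomes exactly one command invocation, which is what a per-file rename needs.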
Another very useful operation is when you want to open the files for editing using vi:
$ file * | grep ASCII | cut -d":" -f1 | xargs vi
This command opens the files one by one using vi. When you want to search for many files and open them for editing, this comes in very handy.
It also has several options. Perhaps the most useful is the -p option, which makes the operation interactive:
$ file * | grep ASCII | cut -d":" -f1 | xargs -p vi
vi alert_DBA102.log dba102_cjq0_14493.trc dba102_mmnl_14497.trc
dba102_reco_14491.trc dba102_rvwr_14518.trc ?...
Here xargs asks you to confirm before running the command. If you press "y", it executes it. You will find this immensely useful when performing potentially damaging and irreversible operations on files, such as removing or overwriting them.
The -t option uses a verbose mode; it displays the command it is about to run, which is a very helpful option during debugging.
What if the output passed to the xargs is blank? Consider:
$ file * | grep SSSSSS | cut -d":" -f1 | xargs -t wc -l
wc -l 
            0
$
Here the search for "SSSSSS" produces no match, so the input to xargs is empty, as shown in the second line (displayed because we used -t, the verbose option); wc -l then runs with no file arguments and reports 0. In some cases you may want xargs to do nothing at all when there is nothing to process; the -r option does exactly that:
$ file * | grep SSSSSS | cut -d":" -f1 | xargs -t -r wc -l
$
The command exits if there is nothing to run.
Suppose you want to remove the files using the rm command, which should be the argument to the xargs command. However, rm can accept a limited number of arguments. What if your argument list exceeds that limit? The -n option to xargs limits the number of arguments in a single command line.
Here is how you can limit it to two arguments per command line: even if five files are passed to xargs, only two at a time are passed on to ls -ltr.
$ file * | grep ASCII | cut -d":" -f1 | xargs -t -n2 ls -ltr  
ls -ltr alert_DBA102.log dba102_cjq0_14493.trc 
-rw-r-----    1 oracle   dba           738 Aug 10 19:18 dba102_cjq0_14493.trc
-rw-r--r--    1 oracle   dba       2410225 Aug 13 05:31 alert_DBA102.log
ls -ltr dba102_mmnl_14497.trc dba102_reco_14491.trc 
-rw-r-----    1 oracle   dba       5386163 Aug 10 17:55 dba102_mmnl_14497.trc
-rw-r-----    1 oracle   dba          6808 Aug 13 05:21 dba102_reco_14491.trc
ls -ltr dba102_rvwr_14518.trc 
-rw-r-----    1 oracle   dba          2087 Aug 10 04:30 dba102_rvwr_14518.trc
Using this approach you can quickly rename files in a directory.
$ ls | xargs -t -i mv {} {}.bak
The -i option tells xargs to replace {} with the name of each item.
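Note that -i is deprecated in current GNU xargs in favor of -I, which takes an explicit placeholder. A minimal sketch in a scratch directory (the /tmp path and file names are purely illustrative):

```shell
# Set up a throwaway directory with two sample files
rm -rf /tmp/xargs_demo && mkdir -p /tmp/xargs_demo && cd /tmp/xargs_demo
touch a.log b.log
# -I {} substitutes each input line for {} in the mv command
ls | xargs -I {} mv {} {}.bak
ls     # a.log.bak  b.log.bak
```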

rename

As you know, the mv command renames files. For example,
$ mv oldname newname
renames the file oldname to newname. However, what if you don't know the filenames yet? The rename command comes in really handy here.
rename .log .log.$(date +%F-%H:%M:%S) *
renames the .log extension of every matching file to .log.<timestamp>. So sqlnet.log becomes sqlnet.log.2006-09-12-23:26:28.
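The rename utility varies between distributions (the util-linux version shown above versus the Perl rename shipped on Debian-family systems), so a plain shell loop is the portable fallback. A sketch that appends a timestamp to every .log file in a scratch directory (paths and names are illustrative):

```shell
# Throwaway directory with two sample log files
rm -rf /tmp/rename_demo && mkdir -p /tmp/rename_demo && cd /tmp/rename_demo
touch sqlnet.log listener.log
stamp=$(date +%F-%H:%M:%S)
# Append the timestamp to each .log file, mirroring the rename example above
for f in *.log; do
  mv "$f" "$f.$stamp"
done
ls
```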

find

Among the most popular commands for Oracle users is find. By now you know about using find to locate files in a given directory. Here is an example that finds files starting with the word "file" in the current directory:
$ find . -name "file*"
./file2
./file1
./file3
./file4
However, what if you want to search for names like FILE1, FILE2, and so on? The -name "file*" will not match them. For a case-insensitive search, use the -iname option:
$ find . -iname "file*"
./file2
./file1
./file3
./file4
./FILE1
./FILE2
You can limit your search to a specific type of files only. For instance, the above command will get the files of all types: regular files, directories, symbolic links, and so on. To search for only regular files, you can use the -type f parameter.
$ find . -name "orapw*" -type f 
./orapw+ASM
./orapwDBA102
./orapwRMANTEST
./orapwRMANDUP
./orapwTESTAUX
The -type option accepts the modifiers f (regular files), l (symbolic links), d (directories), b (block devices), p (named pipes), c (character devices), and s (sockets).
A slight twist on the above command is to combine it with the file command you learned about in Part 1. The file command tells you the type of a file, and you can use it as a post-processor for the output of find. The -exec parameter executes the command that follows it; in this case, the command to execute after the find is file:
$ find . -name "*oraenv*" -type f -exec file {} \;
./coraenv: Bourne shell script text executable
./oraenv: Bourne shell script text executable
This is useful when you want to find out whether an ASCII text file is actually some type of shell script.
If you substitute -exec with -ok, the command is executed but it asks for your confirmation first. Here's an example:
$ find . -name "sqlplus*" -ok {} \;      
< {} ... ./sqlplus > ? y
 
SQL*Plus: Release 9.2.0.5.0 - Production on Sun Aug 6 11:28:15 2006
 
Copyright (c) 1982, 2002, Oracle Corporation.  All rights reserved.
 
Enter user-name: / as sysdba
 
Connected to:
Oracle9i Enterprise Edition Release 9.2.0.5.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Oracle Data Mining options
JServer Release 9.2.0.5.0 - Production
 
SQL> exit
Disconnected from Oracle9i Enterprise Edition Release 9.2.0.5.0 - 64bit Production
With the Partitioning, Real Application Clusters, OLAP and Oracle Data Mining options
JServer Release 9.2.0.5.0 - Production
< {} ... ./sqlplusO > ? n
$
Here we have asked the shell to find all programs starting with "sqlplus" and execute them. Note there is nothing between -ok and {}, so it simply executes the files it finds. It finds two files—sqlplus and sqlplusO—and asks in each case whether you want to execute it. We answered "y" to the prompt for "sqlplus" and it executed. After exiting, it prompted for the second file it found (sqlplusO) and asked for confirmation again, to which we answered "n"—thus, it was not executed.

Tip for Oracle Users

Oracle produces many extraneous files: trace files, log files, dump files, and so on. Unless they are cleaned periodically, they can fill up the filesystem and bring the database to a halt.
To ensure that doesn't happen, simply search for the files with extension "trc" and remove them if they are more than three days old. A simple command does the trick:
find . -name "*.trc" -ctime +3 -exec rm {} \;
To force removal, suppressing any prompts or error messages from rm, add the -f option:
           
find . -name "*.trc" -ctime +3 -exec rm -f {} \;
If you just want to list the files:
find . -name "*.trc" -ctime +3 -exec ls -l {} \;
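One caveat with the -exec rm form is that it forks one rm per file; on a directory with thousands of trace files, pairing find's -print0 with xargs -0 (both GNU extensions) is faster and also survives filenames containing spaces. A sketch against a scratch directory, using -mmin -1 instead of -ctime +3 so that the just-created file matches:

```shell
# Scratch directory with one trace file whose name contains a space
rm -rf /tmp/trc_demo && mkdir -p /tmp/trc_demo
touch "/tmp/trc_demo/old one.trc"
# -print0/-0 use NUL separators, so whitespace in names is harmless;
# -r skips running rm when nothing matches
find /tmp/trc_demo -name "*.trc" -mmin -1 -print0 | xargs -0 -r rm -f
ls -A /tmp/trc_demo        # prints nothing: the file is gone
```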

m4

This command takes an input file and substitutes strings inside it with the parameters passed, similar to substituting for variables. For example, here is an input file:
$ cat temp
The COLOR fox jumped over the TYPE fence.
Were you to substitute the strings "COLOR" by "brown" and "TYPE" by "broken", you could use:
$ m4 -DCOLOR=brown -DTYPE=broken temp
The brown fox jumped over the broken fence.
Alternatively, to substitute "white" and "high" for the same tokens:

$ m4 -DCOLOR=white -DTYPE=high temp  
The white fox jumped over the high fence.
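If m4 is not installed, sed can stand in for simple token substitution (a rough sketch only; unlike m4, sed has no notion of quoting or macro recursion):

```shell
# Recreate the sample template, then substitute the tokens with sed
printf 'The COLOR fox jumped over the TYPE fence.\n' > /tmp/temp_demo
sed -e 's/COLOR/brown/g' -e 's/TYPE/broken/g' /tmp/temp_demo
# prints: The brown fox jumped over the broken fence.
```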

whence and which

These commands report where an executable is located in the user's PATH. When the executable is found, they behave much the same way and display the path:
$ which sqlplus  
/u02/app/oracle/products/10.2.0.1/db1/bin/sqlplus
$ whence sqlplus 
/u02/app/oracle/products/10.2.0.1/db1/bin/sqlplus

                            
The output is identical. However, if the executable is not found in the path, the behavior is different. The which command produces an explicit message:
$ which sqlplus1
/usr/bin/which: no sqlplus1 in (/u02/app/oracle/products/10.2.0.1/db1/bin:/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin)
whereas the whence command produces no message:
$ whence sqlplus1
and simply returns to the shell prompt. This is useful in scripts where you just want to test for an executable's presence without displaying a message:
  
$ whence invalid_command
$ which invalid_command
which: no invalid_command in (/usr/kerberos/sbin:/usr/kerberos/bin:/bin:/sbin:/usr/bin:/usr/sbin:/usr/local/bin:/usr/local/sbin:/usr/bin/X11:/usr/X11R6/bin:/root/bin)
                            
When whence does not find an executable in the path, it prints no message, but its return code is nonzero. This fact can be exploited in shell scripts; for example:
whence myexec > /dev/null
if [ $? -ne 0 ]; then
   echo "myexec is not in the PATH"
fi
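In bash, which has no whence builtin (whence belongs to ksh and zsh), the POSIX command -v builtin gives the same silent, scriptable check:

```shell
# command -v prints the path and exits 0 when found;
# it exits nonzero, quietly, when not
if command -v ls > /dev/null 2>&1; then
  echo "ls is in the PATH"
fi
if ! command -v no_such_cmd_xyz > /dev/null 2>&1; then
  echo "no_such_cmd_xyz is not in the PATH"
fi
```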
A very useful option to which is -i, which displays the alias as well as the executable, if one exists. For example, ls is actually an alias in my shell, and there is an ls binary elsewhere on the system as well.
$ which ls
/bin/ls

$ which -i ls
alias ls='ls --color=tty'
        /bin/ls
The default behavior of which is to show the first occurrence of the executable in the path. If the executable exists in different directories in the path, the subsequent occurrences are ignored. You can see all the occurrences of the executable via the -a option.
$ which java   
/usr/bin/java

$ which -a java
/usr/bin/java
/home/oracle/oracle/product/11.1/db_1/jdk/jre/bin/java

top

The top command is probably the most useful one for an Oracle DBA managing a database on Linux. Say the system is slow and you want to find out who is gobbling up all the CPU and/or memory. To display the top processes, you use the command top.
Note that unlike most commands, top does not print its output and exit; it keeps refreshing the screen, so if you just issue top and leave the display up, the information shown is always current. To stop it and return to the shell, press Control-C (or q).
$ top

18:46:13  up 11 days, 21:50,  5 users,  load average: 0.11, 0.19, 0.18 
151 processes: 147 sleeping, 4 running, 0 zombie, 0 stopped 
CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle 
           total   12.5%    0.0%    6.7%   0.0%     0.0%    5.3%   75.2% 
Mem:  1026912k av,  999548k used,   27364k free,       0k shrd,  116104k buff 
                    758312k actv,  145904k in_d,   16192k in_c 
Swap: 2041192k av,  122224k used, 1918968k free                  590140k cached 
 
  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND 
  451 oracle    15   0  6044 4928  4216 S     0.1  0.4   0:20   0 tnslsnr 
 8991 oracle    15   0  1248 1248   896 R     0.1  0.1   0:00   0 top 
    1 root      19   0   440  400   372 S     0.0  0.0   0:04   0 init 
    2 root      15   0     0    0     0 SW    0.0  0.0   0:00   0 keventd 
    3 root      15   0     0    0     0 SW    0.0  0.0   0:00   0 kapmd 
    4 root      34  19     0    0     0 SWN   0.0  0.0   0:00   0 ksoftirqd/0 
    7 root      15   0     0    0     0 SW    0.0  0.0   0:01   0 bdflush 
    5 root      15   0     0    0     0 SW    0.0  0.0   0:33   0 kswapd 
    6 root      15   0     0    0     0 SW    0.0  0.0   0:14   0 kscand 
    8 root      15   0     0    0     0 SW    0.0  0.0   0:00   0 kupdated 
    9 root      25   0     0    0     0 SW    0.0  0.0   0:00   0 mdrecoveryd
                               
... output snipped ...
                            
Let's examine the different types of information produced. The first line:
18:46:13  up 11 days, 21:50,  5 users,  load average: 0.11, 0.19, 0.18
shows the current time (18:46:13) and that the system has been up for 11 days, 21 hours and 50 minutes. The load averages (0.11, 0.19, 0.18) are for the last 1, 5, and 15 minutes respectively. (By the way, you can also get this information by issuing the uptime command.)
If the load average is not required, press "l" (lowercase L) to turn it off; press "l" again to turn it back on. The second line:
151 processes: 147 sleeping, 4 running, 0 zombie, 0 stopped
shows the number of processes, running, sleeping, etc. The third and fourth lines:
CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle 
           total   12.5%    0.0%    6.7%   0.0%     0.0%    5.3%   75.2%
show the CPU utilization details. Here, user processes (which include the Oracle processes) consume 12.5% of CPU and system activity consumes 6.7%. Press "t" to toggle these lines off and on. If there is more than one CPU, you will see one line per CPU.
The next two lines:
Mem:  1026912k av, 1000688k used,  26224k free,    0k shrd,  113624k buff 
                    758668k actv,  146872k in_d,  14460k in_c
Swap: 2041192k av, 122476k used,   1918716k free             591776k cached
show the memory available and utilized. Total memory is "1026912k av", approximately 1GB, of which only 26224k (about 26MB) is free. The swap space is 2GB, but it is barely used. Press "m" to toggle these lines off and on.
The rest of the display shows the processes in a tabular format. Here is the explanation of the columns:
Column   Description
PID      The process ID of the process
USER     The user running the process
PRI      The priority of the process
NI       The nice value: the higher the value, the lower the priority of the task
SIZE     Memory used by this process (code+data+stack)
RSS      The physical memory used by this process
SHARE    The shared memory used by this process
STAT     The status of this process, shown as a code. The major status codes are:
         R - Running
         S - Sleeping
         Z - Zombie
         T - Stopped
         A second or third character may also appear, indicating:
         W - Swapped-out process
         N - Positive nice value
%CPU     The percentage of CPU used by this process
%MEM     The percentage of memory used by this process
TIME     The total CPU time used by this process
CPU      On a multi-processor system, the ID of the CPU this process is running on
COMMAND  The command issued by this process
While the top is being displayed, you can press a few keys to format the display as you like. Pressing the uppercase M key sorts the output by memory usage. (Note that using lowercase m will turn the memory summary lines on or off at the top of the display.) This is very useful when you want to find out who is consuming the memory. Here is sample output:
PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND 
31903 oracle    15   0 75760  72M 72508 S     0.0  7.2   0:01   0 ora_smon_PRODB2 
31909 oracle    15   0 68944  66M 64572 S     0.0  6.6   0:03   0 ora_mmon_PRODB2 
31897 oracle    15   0 53788  49M 48652 S     0.0  4.9   0:00   0 ora_dbw0_PRODB2
Now that you have learned how to interpret the output, let's look at the command line parameters.
The most useful is -d, which indicates the delay between the screen refreshes. To refresh every second, use top -d 1.
The other useful option is -p. If you want to monitor only a few processes, not all, you can specify only those after the -p option. To monitor processes 13609, 13608 and 13554, issue:
top -p 13609 -p 13608 -p 13554
This will show results in the same format as the top command, but only those specific processes.
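For use inside scripts or cron jobs, top's full-screen refresh is unusable; procps top offers a batch mode for this (-b, usually combined with -n to limit the number of iterations). A sketch that logs one snapshot (the /tmp path is illustrative):

```shell
# -b: batch mode (plain text to stdout); -n 1: a single iteration
top -b -n 1 | head -15 > /tmp/top_snapshot.txt
head -1 /tmp/top_snapshot.txt   # the summary line with uptime and load average
```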

Tip for Oracle Users

It's probably needless to say that the top utility comes in very handy for analyzing the performance of database servers. Here is a partial top output.
20:51:14  up 11 days, 23:55,  4 users,  load average: 0.88, 0.39, 0.27 
113 processes: 110 sleeping, 2 running, 1 zombie, 0 stopped 
CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle 
           total    1.0%    0.0%    5.6%   2.2%     0.0%   91.2%    0.0% 
Mem:  1026912k av, 1008832k used,   18080k free,       0k shrd,   30064k buff 
                    771512k actv,  141348k in_d,   13308k in_c 
Swap: 2041192k av,   66776k used, 1974416k free                  812652k cached 
 
  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND 
16143 oracle    15   0 39280  32M 26608 D     4.0  3.2   0:02   0 oraclePRODB2...
    5 root      15   0     0    0     0 SW    1.6  0.0   0:33   0 kswapd
                                 
... output snipped ...
                              
Let's analyze the output carefully. The first thing you should notice is the "idle" column under CPU states; it's 0.0%—meaning, the CPU is completely occupied doing something. The question is, doing what? Move your attention to the column "system", just slightly left; it shows 5.6%. So the system itself is not doing much. Go even more left to the column marked "user", which shows 1.0%. Since user processes include Oracle as well, Oracle is not consuming the CPU cycles. So, what's eating up all the CPU?
The answer lies in the same line, just to the right under the column "iowait", which indicates 91.2%. This explains it all: the CPU is waiting for IO 91.2% of the time.
So why so much IO wait? The answer lies in the display. Note the PID of the highest consuming process: 16143. You can use the following query to determine what the process is doing:
select s.sid, s.username, s.program
from v$session s, v$process p
where spid = 16143
and p.addr = s.paddr
/

       SID USERNAME PROGRAM
---------- -------- -----------------------------
       159 SYS      rman@prolin2 (TNS V1-V3)
An rman process is responsible for the I/O waits that are consuming the CPU cycles. This information helps you determine the next course of action.

skill and snice

From the previous discussion you learned how to identify a resource-consuming process. What if you find that a process is consuming a lot of CPU and memory, but you don't want to kill it? Consider the top output below:
$ top -c -p 16514

23:00:44  up 12 days,  2:04,  4 users,  load average: 0.47, 0.35, 0.31 
1 processes: 1 sleeping, 0 running, 0 zombie, 0 stopped 
CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle 
           total    0.0%    0.6%    8.7%   2.2%     0.0%   88.3%    0.0% 
Mem:  1026912k av, 1010476k used,   16436k free,       0k shrd,   52128k buff 
                    766724k actv,  143128k in_d,   14264k in_c 
Swap: 2041192k av,   83160k used, 1958032k free                  799432k cached 
 
  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND 
16514 oracle    19   4 28796  26M 20252 D N   7.0  2.5   0:03   0 oraclePRODB2...
Now that you have confirmed that process 16514 is consuming a lot of memory, you can "freeze" it—but not kill it—using the skill command.
$ skill -STOP 16514
After this, check the top output:
23:01:11  up 12 days,  2:05,  4 users,  load average: 1.20, 0.54, 0.38 
1 processes: 0 sleeping, 0 running, 0 zombie, 1 stopped 
CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle 
           total    2.3%    0.0%    0.3%   0.0%     0.0%    2.3%   94.8% 
Mem:  1026912k av, 1008756k used,   18156k free,       0k shrd,    3976k buff 
                    770024k actv,  143496k in_d,   12876k in_c 
Swap: 2041192k av,   83152k used, 1958040k free                  851200k cached 
 
  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND 
16514 oracle    19   4 28796  26M 20252 T N   0.0  2.5   0:04   0 oraclePRODB2...
The CPU is now 94.8% idle, up from 0%. The process is effectively frozen. After some time, you may want to revive it:
$ skill -CONT 16514
This approach is immensely useful for temporarily freezing processes to make room for more important processes to complete.
The command is very versatile. If you want to stop all processes of the user "oracle", only one command does it all:
$ skill -STOP oracle
You can pass a username, a PID, a command name, or a terminal id as the argument. The following stops all rman commands:
$ skill -STOP rman
As you can see, skill determines what kind of argument you entered—a process ID, userid, or command—and acts appropriately. This may cause a problem when a user and a command share the same name. The best example is the "oracle" process, which is typically run by the user "oracle". So, when you want to stop the process called "oracle" and you issue:
$ skill -STOP oracle
all the processes of user "oracle" stop, including the session you may be on. To be completely unambiguous, you can add a parameter specifying the type of the argument. To stop a command called oracle, use:
$ skill -STOP -c oracle
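Note that skill ships with procps and may be absent on some systems; plain kill with signal names achieves the same freeze and thaw for a single PID. A sketch on a throwaway sleep process:

```shell
# Start a disposable process to act on
sleep 60 &
pid=$!
kill -STOP "$pid"            # freeze it, exactly like skill -STOP
ps -o stat= -p "$pid"        # state starts with T (stopped)
kill -CONT "$pid"            # thaw it, exactly like skill -CONT
ps -o stat= -p "$pid"        # state starts with S (sleeping) again
kill "$pid"                  # clean up the demo process
```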
The command snice is similar. Instead of stopping a process, it lowers the process's priority. First, check the top output:
  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND 
    3 root      15   0     0    0     0 RW    0.0  0.0   0:00   0 kapmd 
13680 oracle    15   0 11336  10M  8820 T     0.0  1.0   0:00   0 oracle 
13683 oracle    15   0  9972 9608  7788 T     0.0  0.9   0:00   0 oracle 
13686 oracle    15   0  9860 9496  7676 T     0.0  0.9   0:00   0 oracle 
13689 oracle    15   0 10004 9640  7820 T     0.0  0.9   0:00   0 oracle 
13695 oracle    15   0  9984 9620  7800 T     0.0  0.9   0:00   0 oracle 
13698 oracle    15   0 10064 9700  7884 T     0.0  0.9   0:00   0 oracle 
13701 oracle    15   0 22204  21M 16940 T     0.0  2.1   0:00   0 oracle
Now, lower the priority of the "oracle" user's processes by four points. Note that the higher the nice value, the lower the priority.
$ snice +4 -u oracle

  PID USER     PRI  NI  SIZE  RSS SHARE STAT %CPU %MEM   TIME CPU COMMAND 
16894 oracle    20   4 38904  32M 26248 D N   5.5  3.2   0:01   0 oracle
Note how the NI (nice value) column now shows 4 and the priority is now 20, instead of 15. This is quite useful for reducing the priority of less critical processes.

In this installment, learn advanced Linux commands for monitoring physical components
A Linux system comprises several key physical components such as CPU, memory, network card, and storage devices. To effectively manage a Linux environment, you should be able to measure the various metrics of these resources—how much each component is processing, if there is a bottleneck, and so on—with reasonable accuracy.
In the other parts of this series you learned some commands for measuring metrics at a macro level. In this installment, you will learn advanced Linux commands for monitoring specific physical components, in the following categories:
Component    Commands
Memory       free, vmstat, mpstat, iostat, sar
CPU          vmstat, mpstat, iostat, sar
I/O          vmstat, mpstat, iostat, sar
Processes    ipcs, ipcrm
As you can see, some commands appear in more than one category because they can perform many tasks. Some commands are better suited to some components—e.g., iostat for I/O—but you should understand the differences in how they work and use the ones you are most comfortable with.
In most cases, a single command may not be useful to understand what really is going on. You should know multiple commands to get the information you want.

free

One common question is, “How much memory is being used by my applications and various server, user, and system processes?” Or, “How much memory is free right now?” If the memory used by the running processes is more than the available RAM, the processes are moved to swap. So an ancillary question is, “How much swap is being used?”
The free command answers all those questions. What's more, a very useful option, -m, shows free memory in megabytes:
# free -m
             total       used       free     shared    buffers     cached
Mem:          1772       1654        117          0         18        618
-/+ buffers/cache:       1017        754
Swap:         1983       1065        918
The above output shows that the system has 1,772 MB of RAM, of which 1,654 MB is in use, leaving 117 MB free. The second line restates used and free after subtracting the buffers and cache, which the kernel can reclaim. The third line shows swap utilization.
To show the same in kilobytes or gigabytes, replace the -m option with -k or -g respectively. You can get down to the byte level as well, using the -b option.
# free -b
             total       used       free     shared    buffers     cached
Mem:    1858129920 1724039168  134090752          0   18640896  643194880
-/+ buffers/cache: 1062203392  795926528
Swap:   2080366592 1116721152  963645440
The -t option shows a total line at the bottom of the output (the sum of physical memory and swap):
# free -m -t
             total       used       free     shared    buffers     cached
Mem:          1772       1644        127          0         16        613
-/+ buffers/cache:       1014        757
Swap:         1983       1065        918
Total:        3756       2709       1046
Although free does not show percentages, we can extract and format specific parts of the output to show used memory as a percentage of the total:
# free -m | grep Mem | awk '{print ($3 / $2)*100}' 
98.7077
This comes in handy in shell scripts where the specific numbers are important. For instance, you may want to trigger an alert when the percentage of free memory falls below a certain threshold.
Similarly, to find the percentage of swap used, you can issue:
free -m | grep -i Swap | awk '{print ($3 / $2)*100}'
You can use free to watch the memory load exerted by an application. For instance, check the free memory before starting the backup application and then check it immediately after starting. The difference could be attributed to the consumption by the backup application.
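The pieces above combine naturally into a small monitoring snippet. A sketch of such a threshold alert (the 90 percent threshold and the messages are illustrative):

```shell
THRESHOLD=90
# Used memory as an integer percentage of total, parsed from the Mem: line
used_pct=$(free -m | awk '/^Mem:/ {printf "%d", ($3 / $2) * 100}')
if [ "$used_pct" -gt "$THRESHOLD" ]; then
  echo "WARNING: memory is ${used_pct}% used"
else
  echo "memory OK at ${used_pct}% used"
fi
```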

Usage for Oracle Users

So, how can you use this command to manage the Linux server running your Oracle environment? One of the most common causes of performance issues is a lack of memory, which forces the system to "swap" memory areas to disk temporarily. Some degree of swapping is probably inevitable, but a lot of swapping is indicative of a shortage of free memory.
You can use free to get the free memory information now and follow it up with the sar command (shown later) to check the historical trend of memory and swap consumption. If the swap usage is temporary, it's probably a one-time spike; but if it's pronounced over a period of time, you should take notice. There are a few obvious suspects for chronic memory overload:
  • An SGA that is larger than the available memory
  • Very large PGA allocations
  • A process with a bug that leaks memory
For the first case, you should make sure the SGA is smaller than the available memory. A general rule of thumb is to use about 40 percent of the physical memory for the SGA, but of course you should set that parameter based on your specific situation. In the second case, you should try to reduce large buffer allocations in queries. In the third case, you should use the ps command (described in an earlier installment of this series) to identify the specific process that might be leaking memory.
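For the first case, a quick back-of-the-envelope check of the 40 percent rule of thumb can be scripted (a starting point only, not a sizing recommendation):

```shell
# Physical memory in MB from free, then 40% of it as a rough SGA ceiling
total_mb=$(free -m | awk '/^Mem:/ {print $2}')
echo "Physical memory: ${total_mb} MB; 40% rule-of-thumb SGA: $((total_mb * 40 / 100)) MB"
```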

ipcs

When a process runs, it may allocate segments of "shared memory"—one or many per process. Processes also send messages to each other ("inter-process communication", or IPC) and use semaphores to coordinate. To display information about shared memory segments, IPC message queues, and semaphores, you can use a single command: ipcs.
The -m option is very popular; it displays the shared memory segments.
# ipcs -m
 
------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status      
0xc4145514 2031618    oracle    660        4096       0                       
0x00000000 3670019    oracle    660        8388608    108                     
0x00000000 327684     oracle    600        196608     2          dest         
0x00000000 360453     oracle    600        196608     2          dest         
0x00000000 393222     oracle    600        196608     2          dest         
0x00000000 425991     oracle    600        196608     2          dest         
0x00000000 3702792    oracle    660        926941184  108                     
0x00000000 491529     oracle    600        196608     2          dest         
0x49d1a288 3735562    oracle    660        140509184  108                     
0x00000000 557067     oracle    600        196608     2          dest         
0x00000000 1081356    oracle    600        196608     2          dest         
0x00000000 983053     oracle    600        196608     2          dest         
0x00000000 1835023    oracle    600        196608     2          dest         
This output, taken on a server running Oracle software, shows the various shared memory segments. Each one is uniquely identified by a shared memory ID, shown under the “shmid” column. (Later you will see how to use this column value.) The “owner”, of course, shows the owner of the segment, the “perms” column shows the permissions (same as unix permissions), and “bytes” shows the size in bytes.
The -u option shows a very quick summary:
# ipcs -mu

------ Shared Memory Status --------
segments allocated 25
pages allocated 264305
pages resident  101682
pages swapped   100667
Swap performance: 0 attempts     0 successes
The -l option shows the limits (as opposed to the current values):
# ipcs -ml
 
------ Shared Memory Limits --------
max number of segments = 4096
max seg size (kbytes) = 907290
max total shared memory (kbytes) = 13115392
min seg size (bytes) = 1
If you see the current values at or close to the limits, you should consider raising them.
You can get a detailed picture of a specific shared memory segment by passing its shmid value to the -i option. Here is how you can see the details of shmid 3702792:
# ipcs -m -i 3702792
 
Shared memory Segment shmid=3702792
uid=500 gid=502 cuid=500        cgid=502
mode=0660       access_perms=0660
bytes=926941184 lpid=12225      cpid=27169      nattch=113
att_time=Fri Dec 19 23:34:10 2008  
det_time=Fri Dec 19 23:34:10 2008  
change_time=Sun Dec  7 05:03:10 2008    
Later you will see an example of how to interpret the above output.
The -s option shows the semaphores in the system:
# ipcs -s
 
------ Semaphore Arrays --------
key        semid      owner      perms      nsems     
0x313f2eb8 1146880    oracle    660        104       
0x0b776504 2326529    oracle    660        154     
… and so on …  
This shows some valuable data: the semaphore array with the ID 1146880 has 104 semaphores, and the other one has 154. If you add them all up, the total must stay below the system-wide maximum defined by the kernel (the semmns parameter). While installing Oracle Database software, the pre-install checker verifies this kernel setting. Later, when the system reaches a steady state, you can check the actual utilization and adjust the kernel value accordingly.
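On Linux, the four semaphore limits (SEMMSL, SEMMNS, SEMOPM, SEMMNI) are exposed together in /proc/sys/kernel/sem; comparing the totals from ipcs -s against the second field (the system-wide maximum) shows how close you are to the ceiling:

```shell
# Fields, in order: SEMMSL SEMMNS SEMOPM SEMMNI
cat /proc/sys/kernel/sem
# Sum the nsems column of ipcs -s for comparison against SEMMNS
ipcs -s | awk '$5 ~ /^[0-9]+$/ {total += $5} END {print "semaphores in use:", total + 0}'
```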

Usage for Oracle Users

How can you find out the shared memory segments used by the Oracle Database instance? To get that, use the oradebug command. First connect to the database as sysdba:
# sqlplus / as sysdba
In the SQL*Plus session, use the oradebug command as shown below:
SQL> oradebug setmypid
Statement processed.
SQL> oradebug ipc
Information written to trace file.
To find out the name of the trace file:
SQL> oradebug TRACEFILE_NAME
/opt/oracle/diag/rdbms/odba112/ODBA112/trace/ODBA112_ora_22544.trc
Now, if you open that trace file, you will see the shared memory IDs. Here is an excerpt from the file:
Area #0 `Fixed Size' containing Subareas 0-0
  Total size 000000000014613c Minimum Subarea size 00000000
   Area  Subarea    Shmid    Stable Addr      Actual Addr
      0        0  17235970   0x00000020000000 0x00000020000000
                             Subarea size     Segment size
                             0000000000147000 000000002c600000
 Area #1 `Variable Size' containing Subareas 4-4
  Total size 000000002bc00000 Minimum Subarea size 00400000
   Area  Subarea    Shmid    Stable Addr      Actual Addr
      1        4  17235970   0x00000020800000 0x00000020800000
                             Subarea size     Segment size
                             000000002bc00000 000000002c600000
 Area #2 `Redo Buffers' containing Subareas 1-1
  Total size 0000000000522000 Minimum Subarea size 00000000
   Area  Subarea    Shmid    Stable Addr      Actual Addr
      2        1  17235970   0x00000020147000 0x00000020147000
                             Subarea size     Segment size
                             0000000000522000 000000002c600000
... and so on ...
The shared memory ID appears in the Shmid column (17235970 in this excerpt). You can use it to get the details of the shared memory segment:
# ipcs -m -i 17235970
Another useful attribute is lpid—the process ID of the process that last touched the shared memory segment. To demonstrate the value of that attribute, use SQL*Plus to connect to the instance from a different session.
# sqlplus / as sysdba
In that session, find out the PID of the server process:
SQL> select spid from v$process
  2  where addr = (select paddr from v$session
  3     where sid =
  4        (select sid from v$mystat where rownum < 2)
  5  );
 
SPID
------------------------
13224
Now re-execute the ipcs command against the same shared memory segment:
# ipcs -m -i 17235970
 
Shared memory Segment shmid=17235970
uid=500 gid=502 cuid=500        cgid=502
mode=0660       access_perms=0660
bytes=140509184 lpid=13224      cpid=27169      nattch=113
att_time=Fri Dec 19 23:38:09 2008  
det_time=Fri Dec 19 23:38:09 2008  
change_time=Sun Dec  7 05:03:10 2008
Note the value of lpid, which was changed to 13224, from the original value 12225. The lpid shows the PID of the last process that touched the shared memory segment, and you saw how that value changes.
By itself this information is of limited value. The next command—ipcrm—allows you to act on it, as you will see in the next section.

ipcrm

Now that you identified the shared memory and other IPC metrics, what do you do with them? You saw some usage earlier, such as identifying the shared memory used by Oracle, making sure the kernel parameter for shared memory is set, and so on. Another common application is to remove the shared memory, the IPC message queue, or the semaphore arrays.
To remove a shared memory segment, note its shmid from the ipcs command output. Then use the -m option to remove the segment. To remove the segment with ID 3735562, use:
# ipcrm -m 3735562
This removes the shared memory segment. You can remove semaphores and IPC message queues the same way, using the -s and -q options.

Usage for Oracle Users

Sometimes when you shut down the database instance, the Linux kernel may not completely clean up the shared memory segments. The leftover shared memory is not useful, but it hogs system resources, leaving less memory available to other processes. In that case, you can check for lingering shared memory segments owned by the "oracle" user and remove them, if any, using the ipcrm command.
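As a sketch of that cleanup check (assuming the standard ipcs -m output format, where the owner appears in the third column), you could list the segment IDs owned by oracle before removing each one with ipcrm -m. The sample output below is hypothetical:

```shell
# A sketch, not a definitive procedure: the ipcs sample below is
# hypothetical; in practice pipe the live command instead:
#   ipcs -m | awk '$3 == "oracle" {print $2}'
sample='------ Shared Memory Segments --------
key        shmid      owner      perms      bytes      nattch     status
0x00000000 3735562    oracle     660        140509184  0
0x00000000 3735563    root       644        4096       2'

# Print each oracle-owned shmid; each could then be removed with
# `ipcrm -m <shmid>` -- but only after confirming the instance is down.
echo "$sample" | awk '$3 == "oracle" {print $2}'
```

Run such a removal only after verifying that no instance is attached to the segment; a truly orphaned segment should show nattch of 0.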

vmstat

The grand-daddy of all memory- and process-related displays, vmstat, runs continuously and reports its information when called. It takes two arguments:
# vmstat <interval> <count>
<interval> is the interval in seconds between two runs, and <count> is the number of repetitions vmstat makes. Here is a sample where vmstat runs every five seconds and stops after the tenth run. Every line in the output appears after five seconds and shows the stats at that time.
# vmstat 5 10
 procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b    swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  0 1087032 132500  15260 622488   89   19     9     3    0     0  4 10 82  5
 0  0 1087032 132500  15284 622464    0    0   230   151 1095   858  1  0 98  1
 0  0 1087032 132484  15300 622448    0    0   317    79 1088   905  1  0 98  0
… shows up to 10 times.
The output shows a lot about the system resources. Let’s examine them in detail:
procs
Shows the number of processes
r
Processes waiting to run. The higher the load on the system, the more processes wait for CPU cycles to run.
b
Uninterruptible sleeping processes, also known as “blocked” processes. These processes are most likely waiting for I/O but could be for something else too.
Sometimes there is another column as well, under heading “w”, which shows the number of processes that can be run but have been swapped out to the swap area.
The numbers under “b” should be close to 0. If the number under “w” is high, you may need more memory.
The next block shows memory metrics:
swpd
Amount of virtual memory or swapped memory (in KB)
free
Amount of free physical memory (in KB)
buff
Amount of memory used as buffers (in KB)
cache
Kilobytes of physical memory used as cache
The buffer memory is used to store file metadata such as i-nodes and data from raw block devices. The cache memory is used for file data itself.
The next block shows swap activity:
si
Rate at which the memory is swapped back from the disk to the physical RAM (in KB/sec)
so
Rate at which the memory is swapped out to the disk from physical RAM (in KB/sec)
The next block shows I/O activity:
bi
Rate at which the system reads data from the block devices (in blocks/sec)
bo
Rate at which the system writes data to the block devices (in blocks/sec)
The next block shows system related activities:
in
Number of interrupts received by the system per second
cs
Rate of context switching in the process space (in number/sec)
The final block is probably the most used – the information on CPU load:
us
Shows the percentage of CPU spent in user processes. The Oracle processes come in this category.
sy
Percentage of CPU used by system processes, such as all root processes
id
Percentage of free CPU
wa
Percentage spent in “waiting for I/O”
Let’s see how to interpret these values. The first line of the output is an average of all the metrics since the system was restarted. So, ignore that line since it does not show the current status. The other lines show the metrics in real time.
Ideally, the number of processes waiting or blocked (under the "procs" heading) should be 0 or close to 0. If it is high, the system may be short on resources: CPU, memory, or I/O capacity. This information comes in useful when diagnosing performance issues.
The data under “swap” indicates if excessive swapping is going on. If that is the case, then you may have inadequate physical memory. You should either reduce the memory demand or increase the physical RAM.
The data under "io" indicates the flow of data to and from the disks. This shows how much disk activity is going on, which does not necessarily indicate a problem. If you see large numbers under the "b" column of the "procs" heading (processes being blocked) along with high I/O, the issue could be severe I/O contention.
The most useful information comes under the "cpu" heading. The "id" column shows idle CPU. Subtract that number from 100 to get the percentage of time the CPU is busy. Remember the top command described in another installment of this series? That also shows a CPU free% number. The difference is: top shows that free% for each CPU, whereas vmstat shows the consolidated view for all CPUs.
The vmstat command also shows the breakdown of CPU usage: how much is used by the Linux system, how much by a user process, and how much on waiting for I/O. From this breakdown you can determine what is contributing to CPU consumption. If system CPU load is high, could there be some root process such as backup running?
The system load should be consistent over a period of time. If the system shows a high number, use the top command to identify the system process consuming CPU.
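The busy-CPU arithmetic above can be scripted. This is only a sketch, assuming the default 16-column vmstat layout shown earlier (with id in column 15); on a live system, pipe real vmstat output through the same awk filter:

```shell
# Compute CPU busy% (100 - id) per vmstat sample, skipping the column
# header and the first data row (which is the since-boot average).
sample=' r  b    swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  0 1087032 132500  15260 622488   89   19     9     3    0     0  4 10 82  5
 0  0 1087032 132500  15284 622464    0    0   230   151 1095   858  1  0 98  1'

echo "$sample" | awk '
  NR > 1 && NF == 16 {        # skip the header; keep data rows
    n++
    if (n == 1) next          # ignore the since-boot average row
    print "busy%:", 100 - $15
  }'
```

With the sample above, only the second data row is reported, as the first is the since-boot average that the text says to ignore.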

Usage for Oracle Users

Oracle processes (the background processes and server processes) and the user processes (sqlplus, apache, etc.) come under "us". If this number is high, use top to identify the processes. If the "wa" column shows a high number, the I/O system is unable to keep up with the amount of reading or writing. This can occasionally shoot up when a spike of heavy updates in the database causes a log switch and a subsequent spike in the archiving processes. But if it consistently shows a large number, you may have an I/O bottleneck.
I/O blockages in an Oracle database can cause serious problems. Apart from performance issues, slow I/O can make controlfile writes slow, which may cause a process to wait to acquire a controlfile enqueue. If the wait is more than 900 seconds, and the waiter is a critical process like LGWR, it brings down the database instance.
If you see a lot of swapping, perhaps the SGA is sized too large to fit in the physical memory. You should either reduce the SGA size or increase the physical memory.

mpstat

Another useful command to get CPU related stats is mpstat. Here is an example output:
# mpstat -P ALL 5 2
Linux 2.6.9-67.ELsmp (oraclerac1)       12/20/2008
 
10:42:38 PM  CPU   %user   %nice %system %iowait    %irq   %soft   %idle    intr/s
10:42:43 PM  all    6.89    0.00   44.76    0.10    0.10    0.10   48.05   1121.60
10:42:43 PM    0    9.20    0.00   49.00    0.00    0.00    0.20   41.60    413.00
10:42:43 PM    1    4.60    0.00   40.60    0.00    0.20    0.20   54.60    708.40
 
10:42:43 PM  CPU   %user   %nice %system %iowait    %irq   %soft   %idle    intr/s
10:42:48 PM  all    7.60    0.00   45.30    0.30    0.00    0.10   46.70   1195.01
10:42:48 PM    0    4.19    0.00    2.20    0.40    0.00    0.00   93.21   1034.53
10:42:48 PM    1   10.78    0.00   88.22    0.40    0.00    0.00    0.20    160.48
 
Average:     CPU   %user   %nice %system %iowait    %irq   %soft   %idle    intr/s
Average:     all    7.25    0.00   45.03    0.20    0.05    0.10   47.38   1158.34
Average:       0    6.69    0.00   25.57    0.20    0.00    0.10   67.43    724.08
Average:       1    7.69    0.00   64.44    0.20    0.10    0.10   27.37    434.17
It shows the various stats for the CPUs in the system. The -P ALL option directs the command to display stats for all the CPUs, not just a specific one. The parameters 5 2 direct the command to run every 5 seconds, twice. The output above shows the metrics for all the CPUs first (aggregated) and then for each CPU individually. Finally, the average for each CPU is shown at the end.
Let’s see what the column values mean:
%user
Indicates the percentage of that CPU's processing consumed by user processes. User processes are non-kernel processes, used for applications such as an Oracle database. In this example output, the user CPU %age is quite low.


%nice
Indicates the percentage of CPU when a process was downgraded by the nice command. The command nice has been described in an earlier installment. In brief, nice changes the priority of a process.


%system
Indicates the CPU percentage consumed by kernel processes


%iowait
Shows the percentage of CPU time consumed by waiting for an I/O to occur


%irq
Indicates the %age of CPU used to handle system interrupts


%soft
Indicates %age consumed for software interrupts


%idle
Shows the idle time of the CPU


intr/s
Shows the total number of interrupts received by the CPU per second
You may be wondering about the purpose of the mpstat command when you have vmstat, described earlier. There is a huge difference: mpstat can show the per processor stats, whereas vmstat shows a consolidated view of all processors. So, it’s possible that a poorly written application not using multi-threaded architecture runs on a multi-processor machine but does not use all the processors. As a result, one CPU overloads while others remain free. You can easily diagnose these sorts of issues via mpstat.
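To make that diagnosis concrete, here is a small sketch (assuming the 11-column mpstat layout shown above, with the CPU id in field 3 and %idle in field 10) that flags individual CPUs that are nearly saturated while others sit idle:

```shell
# Flag per-CPU rows whose %idle is below 10 in captured mpstat output.
# The sample is taken from the run shown above; on a live system pipe
# `mpstat -P ALL 5 1` through the same awk filter.
sample='10:42:48 PM  CPU   %user   %nice %system %iowait    %irq   %soft   %idle    intr/s
10:42:48 PM  all    7.60    0.00   45.30    0.30    0.00    0.10   46.70   1195.01
10:42:48 PM    0    4.19    0.00    2.20    0.40    0.00    0.00   93.21   1034.53
10:42:48 PM    1   10.78    0.00   88.22    0.40    0.00    0.00    0.20    160.48'

# Only numeric CPU ids are checked, so the "all" rollup row is skipped.
echo "$sample" | awk '$3 ~ /^[0-9]+$/ && $10 < 10 {print "CPU " $3 " nearly saturated: idle " $10 "%"}'
```

Here CPU 1 is flagged while CPU 0 is 93% idle: exactly the single-threaded imbalance that vmstat's consolidated view would hide.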

Usage for Oracle Users

Similar to vmstat, the mpstat command also produces CPU related stats so all the discussion related to CPU issues applies to mpstat as well. When you see a low %idle figure, you know you have CPU starvation. When you see a higher %iowait figure, you know there is some issue with the I/O subsystem under the current load. This information comes in very handy in troubleshooting Oracle database performance.

iostat

A key part of the performance assessment is disk performance. The iostat command gives the performance metrics of the storage interfaces.
# iostat
Linux 2.6.9-55.0.9.ELlargesmp (prolin3)     12/27/2008
 
avg-cpu:  %user   %nice    %sys %iowait   %idle
          15.71    0.00    1.07    3.30   79.91
 
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
cciss/c0d0        4.85        34.82       130.69  307949274 1155708619
cciss/c0d0p1      0.08         0.21         0.00    1897036       3659
cciss/c0d0p2     18.11        34.61       130.69  306051650 1155700792
cciss/c0d1        0.96        13.32        19.75  117780303  174676304
cciss/c0d1p1      2.67        13.32        19.75  117780007  174676288
sda               0.00         0.00         0.00        184          0
sdb               1.03         5.94        18.84   52490104  166623534
sdc               0.00         0.00         0.00        184          0
sdd               1.74        38.19        11.49  337697496  101649200
sde               0.00         0.00         0.00        184          0
sdf               1.51        34.90         6.80  308638992   60159368
sdg               0.00         0.00         0.00        184          0
... and so on ...
The beginning portion of the output shows metrics such as CPU free and I/O waits as you have seen from the mpstat command.
The next part of the output shows very important metrics for each of the disk devices on the system. Let’s see what these columns mean:
Device
The name of the device
tps  
Number of transfers per second, i.e. number of I/O operations per second. Note: this is just the number of I/O operations; each operation could be huge or small.
Blk_read/s  
Number of blocks read from this device per second. Blocks are usually 512 bytes in size. This is a better indicator of the disk's utilization.
Blk_wrtn/s  
Number of blocks written to this device per second
Blk_read  
Number of blocks read from this device so far. Be careful; this is not what is happening right now. These many blocks have already been read from the device. It’s possible that nothing is being read now. Watch this for some time to see if there is a change.
Blk_wrtn
Number of blocks written to the device
In a system with many devices, the output might scroll through several screens—making things a little bit difficult to examine, especially if you are looking for a specific device. You can get the metrics for a specific device only by passing that device as a parameter.
# iostat sdaj   
Linux 2.6.9-55.0.9.ELlargesmp (prolin3)     12/27/2008
 
avg-cpu:  %user   %nice    %sys %iowait   %idle
          15.71    0.00    1.07    3.30   79.91
 
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sdaj              1.58        31.93        10.65  282355456   94172401
The CPU metrics shown at the beginning may not be very useful. To suppress the CPU related stats shown in the beginning of the output, use the -d option.

You can place optional parameters at the end to let iostat display the device stats in regular intervals. To get the stats for this device every 5 seconds for 10 times, issue the following:
# iostat -d sdaj 5 10

You can display the stats in kilobytes instead of just bytes using the -k option:

# iostat -k -d sdaj    
Linux 2.6.9-55.0.9.ELlargesmp (prolin3)     12/27/2008
 
Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
sdaj              1.58        15.96         5.32  141176880   47085232
While the above output can be helpful, there is a lot of information not readily displayed. For instance, one of the key causes of disk issues is the disk service time, i.e. how fast the disk gets the data to the process asking for it. To get that level of metrics, we have to get the "extended" stats on the disk, using the -x option.
# iostat -x sdaj
Linux 2.6.9-55.0.9.ELlargesmp (prolin3)     12/27/2008
 
avg-cpu:  %user   %nice    %sys %iowait   %idle
          15.71    0.00    1.07    3.30   79.91
 
Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdaj         0.00   0.00  1.07  0.51   31.93   10.65    15.96     5.32    27.01     0.01    6.26   6.00   0.95
Let’s see what the columns mean:
Device  
The name of the device
rrqm/s
The number of read requests merged per second. The disk requests are queued. Whenever possible, the kernel tries to merge several requests to one. This metric measures the merge requests for read transfers.
wrqm/s  
Similar to reads, this is the number of write requests merged.
r/s  
The number of read requests per second issued to this device
w/s 
Likewise, the number of write requests per second
rsec/s 
The number of sectors read from this device per second
wsec/s   
The number of sectors written to the device per second
rkB/s   
Data read per second from this device, in kilobytes per second
wkB/s
Data written to this device, in kilobytes per second
avgrq-sz
Average size of the read requests, in sectors
avgqu-sz  
Average length of the request queue for this device
await 
Average elapsed time (in milliseconds) for the device for I/O requests. This is a sum of service time + waiting time in the queue.
svctm 
Average service time (in milliseconds) of the device
%util
Bandwidth utilization of the device. If this is close to 100 percent, the device is saturated.
Well, that’s a lot of information and may present a challenge as to how to use it effectively. The next section shows how to use the output.

How to Use It

You can use a combination of commands to extract meaningful information from the output. Remember, disks can be slow in serving requests from processes. The time the disk takes to serve a request is called the service time. To find the disks with the highest service times, issue:
# iostat -x | sort -nrk13
sdat         0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00    18.80     0.00   64.06  64.05   0.00
sdv          0.00   0.00  0.00  0.00    0.00    0.00     0.00     0.00    17.16     0.00   18.03  17.64   0.00
sdak         0.00   0.00  0.00  0.14    0.00    1.11     0.00     0.55     8.02     0.00   17.00  17.00   0.24
sdm          0.00   0.00  0.00  0.19    0.01    1.52     0.01     0.76     8.06     0.00   16.78  16.78   0.32
... and so on ...
This shows that the disk sdat has the highest service time (64.05 ms). Why is it so high? There could be many possibilities, but three are most likely:
  1. The disk gets a lot of requests so the average service time is high.
  2. The disk is being utilized to the maximum possible bandwidth.
  3. The disk is inherently slow.
Looking at the output we see that reads/sec and writes/sec are 0.00 (almost nothing is happening), so we can rule out #1. The utilization is also 0.00% (the last column), so we can rule out #2. That leaves #3. However, before we draw a conclusion that the disk is inherently slow, we need to observe that disk a little more closely. We can examine that disk alone every 5 seconds for 10 times.
# iostat -x sdat 5 10
If the output shows the same average service time, read rate and utilization, we can conclude that #3 is the most likely factor. If they change, then we can get further clues to understand why the service time is high for this device.
Similarly, you can sort on the read rate column (rsec/s, column 6) to display the disks with the highest read rates.
# iostat -x | sort -nrk6 
sdj          0.00   0.00  1.86  0.61   56.78   12.80    28.39     6.40    28.22     0.03   10.69   9.99   2.46
sdah         0.00   0.00  1.66  0.52   50.54   10.94    25.27     5.47    28.17     0.02   10.69  10.00   2.18
sdd          0.00   0.00  1.26  0.48   38.18   11.49    19.09     5.75    28.48     0.01    3.57   3.52   0.61
... and so on ...
   
The information helps you to locate a disk that is “hot”—that is, subject to a lot of reads or writes. If the disk is indeed hot, you should identify the reason for that; perhaps a filesystem defined on the disk is subject to a lot of reading. If that is the case, you should consider striping the filesystem across many disks to distribute the load, minimizing the possibility that one specific disk will be hot.
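One way to automate the hot-disk check is to flag devices whose %util (the last of the 14 columns in iostat -x output) exceeds a threshold. This sketch uses a hypothetical sample; the utilization figure for sdj below is made up for illustration:

```shell
# Flag devices above 80% bandwidth utilization in iostat -x output.
# The sample below is hypothetical; on a live system, run:
#   iostat -x | awk 'NR > 1 && $14 + 0 > 80 {print $1, $14}'
sample='Device:    rrqm/s wrqm/s   r/s   w/s  rsec/s  wsec/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await  svctm  %util
sdj          0.00   0.00  1.86  0.61   56.78   12.80    28.39     6.40    28.22     0.03   10.69   9.99  92.46
sdd          0.00   0.00  1.26  0.48   38.18   11.49    19.09     5.75    28.48     0.01    3.57   3.52   0.61'

echo "$sample" | awk 'NR > 1 && $14 + 0 > 80 {print $1 " is hot: " $14 "% utilized"}'
```

A device flagged this way is saturated or close to it, and is a candidate for the striping approach described above.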

sar

From the earlier discussions, one common thread emerges: Getting real time metrics is not the only important thing; the historical trend is equally important.
Furthermore, consider this situation: how many times has someone reported a performance problem, but when you dive in to investigate, everything is back to normal? Performance issues that have occurred in the past are difficult to diagnose without any specific data as of that time. Finally, you will want to examine the performance data over the past few days to decide on some settings or to make adjustments.
The sar utility accomplishes that goal. sar stands for System Activity Reporter; it records the metrics of the key components of the Linux system—CPU, Memory, Disks, Network, etc.—in a special place: the directory /var/log/sa. The data is recorded for each day in a file named sa<nn> where <nn> is the two-digit day of the month. For instance, the file sa27 holds the data for the 27th of that month. This data can be queried by the command sar.
The simplest way to use sar is to use it without any arguments or options. Here is an example:
# sar
Linux 2.6.9-55.0.9.ELlargesmp (prolin3)     12/27/2008
 
12:00:01 AM       CPU     %user     %nice   %system   %iowait     %idle
12:10:01 AM       all     14.99      0.00      1.27      2.85     80.89
12:20:01 AM       all     14.97      0.00      1.20      2.70     81.13
12:30:01 AM       all     15.80      0.00      1.39      3.00     79.81
12:40:01 AM       all     10.26      0.00      1.25      3.55     84.93
... and so on ...
The output shows the CPU related metrics collected in 10 minute intervals. The columns mean:
CPU
The CPU identifier; “all” means all the CPUs
%user
The percentage of CPU used for user processes. Oracle processes come under this category.
%nice
The %age of CPU utilization while executing under nice priority
%system
The %age of CPU executing system processes
%iowait
The %age of CPU waiting for I/O
%idle
The %age of CPU idle waiting for work
From the above output, you can see that the system is well balanced; in fact, it is severely under-utilized, as seen from the high %idle numbers. Going further through the output, we see the following:
... continued from above ...
03:00:01 AM       CPU     %user     %nice   %system   %iowait     %idle
03:10:01 AM       all     44.99      0.00      1.27      2.85     40.89
03:20:01 AM       all     44.97      0.00      1.20      2.70     41.13
03:30:01 AM       all     45.80      0.00      1.39      3.00     39.81
03:40:01 AM       all     40.26      0.00      1.25      3.55     44.93
... and so on ...
This tells a different story: the system was loaded by some user processes between 3:00 and 3:40. Perhaps an expensive query was executing, or perhaps an RMAN job was running, consuming all that CPU. This is where the sar command is useful: it replays the recorded data, showing the data as of a certain time, not now. This is exactly how you accomplish the three objectives outlined at the beginning of this section: getting historical data, finding usage patterns, and understanding trends.
If you want to see a specific day's sar data, open the file for that day with the -f option, as shown below (here, for the 26th):
# sar -f /var/log/sa/sa26
It can also display data in real time, similar to vmstat or mpstat. To get the data every 5 seconds for 10 times, use:
# sar 5 10
Linux 2.6.9-55.0.9.ELlargesmp (prolin3)     12/27/2008
 
01:39:16 PM       CPU     %user     %nice   %system   %iowait     %idle
01:39:21 PM       all     20.32      0.00      0.18      1.00     78.50
01:39:26 PM       all     23.28      0.00      0.20      0.45     76.08
01:39:31 PM       all     29.45      0.00      0.27      1.45     68.83
01:39:36 PM       all     16.32      0.00      0.20      1.55     81.93
… and so on 10 times …
Did you notice the “all” value under CPU? It means the stats were rolled up for all the CPUs. In a single processor system that is fine; but in multi-processor systems you may want to get the stats for individual CPUs as well as an aggregate one. The -P ALL option accomplishes that.
# sar -P ALL 2 2
Linux 2.6.9-55.0.9.ELlargesmp (prolin3)     12/27/2008
 
01:45:12 PM       CPU     %user     %nice   %system   %iowait     %idle
01:45:14 PM       all     22.31      0.00     10.19      0.69     66.81
01:45:14 PM         0      8.00      0.00     24.00      0.00     68.00
01:45:14 PM         1     99.00      0.00      1.00      0.00      0.00
01:45:14 PM         2      6.03      0.00     18.59      0.50     74.87
01:45:14 PM         3      3.50      0.00      8.50      0.00     88.00
01:45:14 PM         4      4.50      0.00     14.00      0.00     81.50
01:45:14 PM         5     54.50      0.00      6.00      0.00     39.50
01:45:14 PM         6      2.96      0.00      7.39      2.96     86.70
01:45:14 PM         7      0.50      0.00      2.00      2.00     95.50
 
01:45:14 PM       CPU     %user     %nice   %system   %iowait     %idle
01:45:16 PM       all     18.98      0.00      7.05      0.19     73.78
01:45:16 PM         0      1.00      0.00     31.00      0.00     68.00
01:45:16 PM         1     37.00      0.00      5.50      0.00     57.50
01:45:16 PM         2     13.50      0.00     19.00      0.00     67.50
01:45:16 PM         3      0.00      0.00      0.00      0.00    100.00
01:45:16 PM         4      0.00      0.00      0.50      0.00     99.50
01:45:16 PM         5     99.00      0.00      1.00      0.00      0.00
01:45:16 PM         6      0.50      0.00      0.00      0.00     99.50
01:45:16 PM         7      0.00      0.00      0.00      1.49     98.51
 
Average:          CPU     %user     %nice   %system   %iowait     %idle
Average:          all     20.64      0.00      8.62      0.44     70.30
Average:            0      4.50      0.00     27.50      0.00     68.00
Average:            1     68.00      0.00      3.25      0.00     28.75
Average:            2      9.77      0.00     18.80      0.25     71.18
Average:            3      1.75      0.00      4.25      0.00     94.00
Average:            4      2.25      0.00      7.25      0.00     90.50
Average:            5     76.81      0.00      3.49      0.00     19.70
Average:            6      1.74      0.00      3.73      1.49     93.03
Average:            7      0.25      0.00      1.00      1.75     97.01
This shows the CPU identifier (starting with 0) and the stats for each. At the very end of the output you will see the average of runs against each CPU.
The command sar is not only for CPU related stats. It's useful for memory related stats as well. The -r option shows extensive memory utilization data.
# sar -r
Linux 2.6.9-55.0.9.ELlargesmp (prolin3)     12/27/2008
 
12:00:01 AM kbmemfree kbmemused  %memused kbbuffers  kbcached kbswpfree kbswpused  %swpused  kbswpcad
12:10:01 AM    712264  32178920     97.83   2923884  25430452  16681300     95908      0.57       380
12:20:01 AM    659088  32232096     98.00   2923884  25430968  16681300     95908      0.57       380
12:30:01 AM    651416  32239768     98.02   2923920  25431448  16681300     95908      0.57       380
12:40:01 AM    651840  32239344     98.02   2923920  25430416  16681300     95908      0.57       380
12:50:01 AM    700696  32190488     97.87   2923920  25430416  16681300     95908      0.57       380
Let’s see what each column means:
kbmemfree
The free memory available in KB at that time
kbmemused 
The memory used in KB at that time
%memused
%age of memory used
kbbuffers 
The memory used as buffers at that time, in KB
kbcached
The memory used as cache at that time, in KB
kbswpfree
The free swap space in KB at that time
kbswpused 
The swap space used in KB at that time
%swpused 
The %age of swap used at that time
kbswpcad
The cached swap in KB at that time
At the very end of the output, you will see the average figures for the time period.
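As a sketch, you can also scan the recorded data for intervals where memory ran nearly full; this assumes the sar -r column layout shown above, with %memused in field 5:

```shell
# Report sar -r intervals where %memused reached 98 or more.
# The sample rows are taken from the output shown above; on a live
# system, pipe `sar -r` through the same awk filter.
sample='12:10:01 AM    712264  32178920     97.83   2923884  25430452  16681300     95908      0.57       380
12:20:01 AM    659088  32232096     98.00   2923884  25430968  16681300     95908      0.57       380
12:30:01 AM    651416  32239768     98.02   2923920  25431448  16681300     95908      0.57       380'

echo "$sample" | awk '$5 + 0 >= 98 {print $1, $2 ": memory " $5 "% used"}'
```

This pinpoints the timestamps of memory pressure without scrolling through a whole day's worth of samples.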
You can also get specific memory related stats. The -B option shows the paging related activity.
# sar -B
Linux 2.6.9-55.0.9.ELlargesmp (prolin3)     12/27/2008
 
12:00:01 AM  pgpgin/s pgpgout/s   fault/s  majflt/s
12:10:01 AM    134.43    256.63   8716.33      0.00
12:20:01 AM    122.05    181.48   8652.17      0.00
12:30:01 AM    129.05    253.53   8347.93      0.00
... and so on ...
The columns show the metrics recorded at that time, not the current values:
pgpgin/s
The amount of paging into the memory from disk, per second
pgpgout/s
The amount of paging out to the disk from memory, per second
fault/s
Page faults per second
majflt/s
Major page faults per second
To get a similar output for swapping related activity, you can use the -W option.
# sar -W
Linux 2.6.9-55.0.9.ELlargesmp (prolin3)     12/27/2008
 
12:00:01 AM  pswpin/s pswpout/s
12:10:01 AM      0.00      0.00
12:20:01 AM      0.00      0.00
12:30:01 AM      0.00      0.00
12:40:01 AM      0.00      0.00
... and so on ...
The columns are probably self-explanatory; but here is the description of each anyway:
pswpin/s
Pages of memory swapped back into the memory from disk, per second
pswpout/s
Pages of memory swapped out to the disk from memory, per second
If you see a lot of swapping, you may be running low on memory. It’s not a foregone conclusion but rather something that may be a strong possibility.
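A quick sketch along those lines counts how many recorded intervals showed any swapping at all. The sample below is hypothetical (the nonzero pswpin/s and pswpout/s values are made up, since the real output above was all zeros), and it assumes the sar -W layout with those rates in fields 3 and 4:

```shell
# Count sar -W intervals with nonzero swap-in or swap-out activity.
# Hypothetical sample; on a live system, pipe `sar -W` through this awk.
sample='12:10:01 AM      0.00      0.00
12:20:01 AM      2.10      5.40
12:30:01 AM      0.00      0.00'

echo "$sample" | awk '$3 + 0 > 0 || $4 + 0 > 0 {n++} END {print n + 0 " interval(s) with swapping"}'
```

A handful of such intervals is not alarming by itself; many of them, per the discussion above, suggest you may be running low on memory.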
To get the disk device statistics, use the -d option:
# sar -d
Linux 2.6.9-55.0.9.ELlargesmp (prolin3)     12/27/2008
 
12:00:01 AM       DEV       tps  rd_sec/s  wr_sec/s
12:10:01 AM    dev1-0      0.00      0.00      0.00
12:10:01 AM    dev1-1      5.12      0.00    219.61
12:10:01 AM    dev1-2      3.04     42.47     22.20
12:10:01 AM    dev1-3      0.18      1.68      1.41
12:10:01 AM    dev1-4      1.67     18.94     15.19
... and so on ...
Average:      dev8-48      4.48    100.64     22.15
Average:      dev8-64      0.00      0.00      0.00
Average:      dev8-80      2.00     47.82      5.37
Average:      dev8-96      0.00      0.00      0.00
Average:     dev8-112      2.22     49.22     12.08
Here is the description of the columns. Again, they show the metrics at that time.
tps
Transfers per second. Transfers are I/O operations. Note: this is just number of operations; each operation may be large or small. So, this, by itself, does not tell the whole story.
rd_sec/s
Number of sectors read from the disk per second
wr_sec/s
Number of sectors written to the disk per second
To get the historical network statistics, you use the -n option:
# sar -n DEV | more
Linux 2.6.9-42.0.3.ELlargesmp (prolin3)     12/27/2008
 
12:00:01 AM     IFACE   rxpck/s   txpck/s   rxbyt/s   txbyt/s   rxcmp/s   txcmp/s  rxmcst/s
12:10:01 AM        lo      4.54      4.54    782.08    782.08      0.00      0.00      0.00
12:10:01 AM      eth0      2.70      0.00    243.24      0.00      0.00      0.00      0.99
12:10:01 AM      eth1      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:10:01 AM      eth2      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:10:01 AM      eth3      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:10:01 AM      eth4    143.79    141.14  73032.72  38273.59      0.00      0.00      0.99
12:10:01 AM      eth5      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:10:01 AM      eth6      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:10:01 AM      eth7      0.00      0.00      0.00      0.00      0.00      0.00      0.00
12:10:01 AM     bond0    146.49    141.14  73275.96  38273.59      0.00      0.00      1.98
… and so on …
Average:        bond0    128.73    121.81  85529.98  27838.44      0.00      0.00      1.98
Average:         eth8      0.00      0.00      0.00      0.00      0.00      0.00      0.00
Average:         eth9      3.52      6.74    251.63  10179.83      0.00      0.00      0.00
Average:         sit0      0.00      0.00      0.00      0.00      0.00      0.00      0.00

In summary, you have these options for the sar command to get the metrics for the components:
  -P   Specific CPU(s)
  -d   Disks
  -r   Memory
  -B   Paging
  -W   Swapping
  -n   Network
What if you want to get all the available stats in one output? Instead of calling sar with each of these options, use the -A option, which shows all the stats stored in the sar files.

Conclusion

In summary, using this limited set of commands you can handle most of the tasks involved in resource management in a Linux environment. I suggest you practice these in your environment to familiarize yourself with the commands and the options described here.
In the next installments, you will learn how to monitor and manage the network. You will also learn various commands that help you manage a Linux environment: finding out who has logged in, setting shell profiles, backing up using cpio and tar, and so on.
In this installment, learn how to manage the Linux environment effectively through these commonly used commands.

ifconfig

The ifconfig command shows the details of the network interface(s) defined in the system. The most common option is -a, which shows all the interfaces.
# ifconfig -a
The usual name of the primary Ethernet network interface is eth0. To find out the details of a specific interface, e.g. eth0, you can use:
# ifconfig eth0
The output is shown below, with explanation:
[Figure 1: annotated output of ifconfig eth0]

Here are some key parts of the output:
  • Link encap: the type of the hardware physical medium supported by this interface (Ethernet, in this case)
  • HWaddr: the unique identifier of the NIC card. Every NIC card has a unique identifier assigned by the manufacturer, called the MAC address. The MAC belongs to the card, not the server: if the IP address is changed, or the card is moved to a different server, the MAC does not change.
  • Mask: the netmask
  • inet addr: the IP address attached to the interface
  • RX packets: the number of packets received by this interface
  • TX packets: the number of packets sent
  • errors: the number of errors in sending or receiving
The command is not just used to check the settings; it’s used to configure and manage the interface as well. Here is a short list of parameters and options for this command:
up/down – enables or disables a specific interface. You can use the down parameter to shutdown an interface (or disable it):
# ifconfig eth0 down
Similarly to bring it up (or enable) it, you would use:
# ifconfig eth0 up
media – sets the type of the Ethernet media, such as 10baseT or 10base2. Common values for the media parameter are 10base2, 10baseT, and AUI. If you want Linux to sense the media automatically, specify auto, as shown below:
# ifconfig eth0 media auto
add – sets a specific IP address for the interface. To set an IP address of 192.168.1.101 to the interface eth0, you would issue:
# ifconfig eth0 add  192.168.1.101
netmask – sets the netmask parameter of the interface. Here is an example where you can set the netmask of the eth0 interface to 255.255.255.0
# ifconfig eth0 netmask  255.255.255.0
In an Oracle Real Application Clusters environment you have to set the netmask in a certain way, using this command.
In some advanced configurations, you can change the MAC address assigned to the network interface. The hw parameter accomplishes that. The general format is:
ifconfig <Interface> hw <TypeOfInterface> <MAC>
The <TypeOfInterface> shows the type of the interface, e.g. ether for Ethernet. Here is how the MAC address of eth0 is changed to 12:34:56:78:90:12 (Note: the MAC address shown here is fictional; if it matches any actual MAC, it's purely coincidental.):
# ifconfig eth0 hw ether 12:34:56:78:90:12
This is useful when you add a new card (with a new MAC address) but do not want to change the Linux-related configuration such as network interfaces.

Usage for the Oracle User

This command, along with netstat described below, is one of the most widely used in managing Oracle RAC. Oracle RAC's performance depends heavily on the interconnect used between the nodes of the cluster. If the interconnect is saturated (that is, it cannot carry any additional traffic) or is failing, you may see reduced performance. The best course of action in this case is to look at the ifconfig output for failures. Here is a typical example:
# ifconfig eth9
eth9      Link encap:Ethernet   HWaddr 00:1C:23:CE:6F:82  
          inet addr:10.14.104.31   Bcast:10.14.104.255   Mask:255.255.255.0
          inet6 addr: fe80::21c:23ff:fece:6f82/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST   MTU:1500  Metric:1
          RX packets:1204285416 errors:0 dropped:560923 overruns:0 frame:0
          TX packets:587443664 errors:0 dropped:623409 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:1670104239570 (1.5 TiB)  TX bytes:42726010594 (39.7 GiB)
          Interrupt:169 Memory:f8000000-f8012100
Note the dropped counts, which are extremely high; ideally the number should be 0 or close to it. A count of more than half a million points to a faulty interconnect that drops packets, forcing the interconnect to resend them, which is a valuable clue in diagnosing the issue.
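A script can watch for this condition instead of a human. A minimal sketch, assuming the classic "dropped:N" ifconfig output format; the interface name and the threshold are arbitrary choices for illustration:

```shell
# Extract the RX "dropped" count for an interface and warn when it is
# suspiciously high.  Parsing ifconfig output is distribution-dependent;
# this sketch assumes the classic "dropped:N" format.
dropped=$(ifconfig eth9 2>/dev/null |
          sed -n 's/.*RX packets.*dropped:\([0-9]*\).*/\1/p')
if [ "${dropped:-0}" -gt 100000 ]; then
    echo "eth9: $dropped RX packets dropped - check the interconnect"
fi
```

Run it from cron to get an early warning before users notice the slowdown.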

netstat

The status of the input and output through a network interface is assessed via the netstat command. This command can provide complete information on how the network interface is performing, down to the socket level. Here is an example:
# netstat
Active Internet connections  (w/o servers)
Proto Recv-Q Send-Q Local Address Foreign Address  State      
tcp        0      0 prolin1:31027 prolin1:5500     TIME_WAIT 
tcp        4      0 prolin1:1521  applin1:40205    ESTABLISHED 
tcp        0      0 prolin1:1522  prolin1:39957    ESTABLISHED 
tcp        0      0 prolin1:3938  prolin1:31017    TIME_WAIT
tcp        0      0 prolin1:1521  prolin1:21545    ESTABLISHED
… and so on …
The above output goes on to show all the open sockets. In very simplistic terms, a socket is akin to a connection between two processes. [Please note: strictly speaking, “sockets” and “connections” are technically different. A socket could exist without a connection. However, a discussion of sockets and connections is beyond the scope of this article, so I have merely presented the concept in an easy-to-understand manner.] Naturally, a connection has a source and a destination, called the local and remote addresses. The end points could be on the same server or on different servers.
In many cases, the programs connect to the same server. For instance, if two processes communicate with each other, the local and remote addresses will be the same, as you can see in the first line: the local and remote addresses are both the server “prolin1”. However, the processes communicate over ports, which will be different. The port is shown next to the host name after the “:” (colon) mark. The user program writes the data to be sent across the socket into a queue, and the receiver reads from a queue at the remote end. Here are the columns of the output:
  1. The leftmost column “Proto” shows the type of the connection – tcp in this case.
  2. The column Recv-Q shows the bytes of data in the queue to be sent to the user program that established the connection. This value should be as close to 0 as possible. In busy servers this value will be more than 0 but shouldn’t be very high. A higher number may not mean much, unless you see a large number in Send-Q column, described below.
  3. The Send-Q column denotes the bytes in the queue to be sent to the remote program, i.e. the remote program has not yet acknowledged receiving it. This should be close to 0. A large number may indicate a network bottleneck.
  4. Local Address is the source of the connection and the port number of the program.
  5. Foreign Address is the destination host and port number. In the first line, both the source and destination are on the same host: prolin1. The connection is simply waiting. The second line shows an established connection between port 1521 of prolin1 and port 40205 of the host applin1. It’s most likely an Oracle connection coming from the client applin1 to the database server prolin1. The Oracle listener on prolin1 runs on port 1521, so the port of the source is 1521. In this connection, the server is sending the requested data to the client.
  6. The column State shows the status of the connection. Here are some common values.
    • ESTABLISHED – the connection has been established. It does not mean that any data is flowing between the end points; merely that the end points have talked to each other.
    • CLOSED – the connection has been closed, i.e. it is no longer in use.
    • TIME_WAIT – the connection is being closed but there are still packets in the network that are being handled.
    • CLOSE_WAIT – the remote end has shutdown and has asked to close the connection.
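To see at a glance how many connections are in each state, you can group the netstat output. A quick sketch using standard text tools:

```shell
# Summarize connections by state - a quick way to spot a pile-up of
# TIME_WAIT or CLOSE_WAIT sockets.  -a includes listening sockets and
# -n skips name resolution so the command returns quickly.
netstat -an 2>/dev/null | awk '/^tcp/ {print $6}' | sort | uniq -c | sort -rn
```

A sudden spike in CLOSE_WAIT, for example, often means an application is not closing its connections.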
Well, from the foreign and local addresses, especially from the port numbers, you can probably guess that the connections are Oracle related, but wouldn’t it be nice to know for sure? Of course. The -p option shows the process information as well:
#  netstat -p
Proto  Recv-Q Send-Q Local Address Foreign Address State       PID/Program name   
tcp        0       0 prolin1:1521   prolin1:33303   ESTABLISHED  1327/oraclePROPRD1  
tcp        0       0 prolin1:1521   applin1:51324   ESTABLISHED 13827/oraclePROPRD1 
tcp        0       0 prolin1:1521   prolin1:33298   ESTABLISHED  32695/tnslsnr       
tcp        0       0 prolin1:1521   prolin1:32544   ESTABLISHED  15251/oracle+ASM    
tcp        0       0 prolin1:1521   prolin1:33331   ESTABLISHED  32695/tnslsnr    
This clearly shows the process ID and the process name in the last column, which confirms them to be Oracle server processes, the listener process, and ASM processes.
The netstat command can have various options and parameters. Here are some key ones:
To find out the network statistics for various interfaces, use the -i option.
#  netstat -i
Kernel  Interface table
Iface       MTU Met    RX-OK RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR TX-DRP TX-OVR Flg
eth0       1500   0  6860659      0      0      0  2055833      0      0      0 BMRU
eth8       1500   0     2345      0      0      0      833      0      0      0 BMRU
lo        16436   0 14449079      0      0      0 14449079      0      0      0 LRU
This shows the different interfaces present in the server (eth0, eth8, etc.) and the metrics associated with the interface.
  • RX-OK shows the number of packets successfully received (on this interface)
  • RX-ERR shows the number of receive errors
  • RX-DRP shows the number of received packets dropped
  • RX-OVR shows the number of receive overruns
The next set of columns (TX-OK, TX-ERR, etc.) shows the corresponding stats for transmitted data.
The Flg column is a composite of the interface’s properties; each letter indicates a specific property being present. Here is an explanation of the letters:
B – Broadcast
M – Multicast
R – Running
U – Up
O – ARP Off
P – Point to Point Connection
L – Loopback
m – Master
s – Slave
You can use the --interface option (note: there are two hyphens, not one) to display the same information for a specific interface.
# netstat --interface=eth0 
Kernel Interface table
Iface       MTU Met    RX-OK  RX-ERR RX-DRP RX-OVR    TX-OK TX-ERR  TX-DRP TX-OVR Flg
eth0       1500   0 277903459      0      0      0 170897632      0      0      0 BMsRU
Needless to say, the output is wide and a little difficult to grasp in one glance. If you are comparing across interfaces, the tabular output makes sense. If you want to examine the values in a more readable format, use the -e option to produce an extended output:
# netstat -i -e
Kernel Interface table
eth0      Link encap:Ethernet   HWaddr 00:13:72:CC:EB:00  
          inet addr:10.14.106.0   Bcast:10.14.107.255   Mask:255.255.252.0
          inet6 addr: fe80::213:72ff:fecc:eb00/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:6861068 errors:0 dropped:0 overruns:0 frame:0
          TX packets:2055956 errors:0 dropped:0 overruns:0  carrier:0
          collisions:0 txqueuelen:1000 
          RX bytes:3574788558 (3.3 GiB)  TX bytes:401608995 (383.0 MiB)
          Interrupt:169
Does the output seem familiar? It should; it’s the same as the output of ifconfig.
If you’d rather see the output showing IP addresses instead of host names, use the -n option.
The -s option shows the summary statistics of each protocol, rather than showing the details of each connection. This can be combined with the protocol specific flag. For instance -u shows the stats related to the UDP protocol.
# netstat -s -u
Udp:
    12764104 packets received
    600849 packets to unknown port received.
    0 packet receive errors
    13455783 packets sent
Similarly, to see the stats for tcp, use -t and for raw, -r.
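These summary counters are convenient for scripting. A small sketch that extracts a single counter from the UDP summary; the awk pattern assumes the output format shown above:

```shell
# Pull one counter out of the protocol summary - here, the number of
# UDP packets sent - handy inside a monitoring script.
netstat -s -u 2>/dev/null | awk '/packets sent/ {print $1}'
```

Sampled at intervals, the difference between two readings gives the traffic rate.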
One of the really useful options is the display of the routing table, the -r option.
#  netstat -r
Kernel  IP routing table
Destination     Gateway         Genmask          Flags   MSS Window  irtt Iface
10.20.191.0     *               255.255.255.128  U         0 0          0 bond0
172.22.13.0     *               255.255.255.0    U         0 0          0 eth9
169.254.0.0     *               255.255.0.0      U         0 0          0 eth9
default         10.20.191.1     0.0.0.0          UG        0 0          0 bond0
The second column of the netstat output, Gateway, shows the gateway to which the routing entry points. If no gateway is used, an asterisk is printed instead. The third column, Genmask, shows the “generality” of the route, i.e., the network mask for this route. When given an IP address to find a suitable route for, the kernel steps through each of the routing table entries, taking the bitwise AND of the address and the netmask before comparing it to the target of the route.
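The bitwise-AND test the kernel performs can be illustrated in plain shell arithmetic. A sketch, using the hypothetical helper matches_route and the first routing entry shown above:

```shell
# Emulate the kernel's test for one routing entry: bitwise-AND the
# destination IP with the route's netmask, octet by octet, and compare
# the result with the route target.  matches_route is a hypothetical
# helper, not a standard command.
matches_route() {   # usage: matches_route IP NETMASK TARGET
    ip=$1 mask=$2 target=$3
    oldIFS=$IFS; IFS=.
    set -- $ip;   i1=$1 i2=$2 i3=$3 i4=$4
    set -- $mask; m1=$1 m2=$2 m3=$3 m4=$4
    IFS=$oldIFS
    result="$((i1 & m1)).$((i2 & m2)).$((i3 & m3)).$((i4 & m4))"
    [ "$result" = "$target" ]
}
matches_route 10.20.191.77 255.255.255.128 10.20.191.0 && echo "first route matches"
```

Here 10.20.191.77 AND 255.255.255.128 yields 10.20.191.0, so the first routing entry above would carry that traffic.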
The fourth column, Flags, displays the following flags that describe the route:
  • G means the route uses a gateway.
  • U means the interface to be used is up (available).
  • H means only a single host can be reached through the route. For example, this is the case for the loopback entry 127.0.0.1.
  • D means this route is dynamically created.
  • ! means the route is a reject route and data will be dropped.
The next three columns show the MSS, Window, and irtt that will be applied to TCP connections established via this route.
  • MSS stands for Maximum Segment Size – the size of the largest datagram for transmission via this route.
  • Window is the maximum amount of data the system will accept in a single burst from a remote host for this route.
  • irtt stands for Initial Round Trip Time. It’s a little complicated, so let me explain it separately.
The TCP protocol has a built-in reliability check. If a data packet fails during transmission, it’s re-transmitted. The protocol keeps track of how long it takes for the data to reach the destination and for the acknowledgement to be received. If the acknowledgement does not come within that timeframe, the packet is retransmitted. The amount of time the protocol waits before re-transmitting is set once for the interface (and can be changed); that value is known as the initial round trip time. A value of 0 means the default value is used.
Finally, the last field displays the network interface that this route will use.

nslookup

Every reachable host in a network should have an IP address, which identifies it uniquely in the network. On the Internet, which is just a very big network, IP addresses allow connections to reach the servers running websites, e.g. www.oracle.com. So, when one host (such as a client) wants to connect to another (such as a database server) using its name and not the IP address, how does the client know which IP address to connect to?
The mechanism of translating a host name to an IP address is known as name resolution. At the most rudimentary level, the host has a special file called hosts, which stores the IP address–hostname pairs. Here is an example file:
# cat /etc/hosts
# Do not remove the following  line, or various programs
# that require network  functionality will fail.
127.0.0.1       localhost.localdomain       localhost
192.168.1.101   prolin1.proligence.com      prolin1
192.168.1.102   prolin2.proligence.com      prolin2
This shows that the hostname prolin1.proligence.com is translated to 192.168.1.101. The special entry with the IP address 127.0.0.1 is called a loopback entry, which points back to the server itself via a special network interface called lo (which you saw earlier in the ifconfig and netstat commands).
Well, this is good, but you can’t possibly put all the IP addresses in the world in this file. There has to be another mechanism to perform name resolution. A special-purpose server called a nameserver performs that role. It’s like the phonebook that your phone company provides, not your personal phonebook. There may be several nameservers available inside or outside the private network. The host contacts one of the nameservers first, gets the IP address of the destination host it wants to contact, and then attempts to connect to that IP address.
How does the host know what these nameservers are? It looks into a special file called /etc/resolv.conf to get that information. Here is a sample resolv.conf file:
; generated by  /sbin/dhclient-script
search proligence.com
nameserver 10.14.1.58
nameserver 10.14.1.59
nameserver 10.20.223.108
How do you make sure that the name resolution is working fine for a specific host name? In other words, you want to make sure that when the Linux system tries to contact a host called oracle.com, it can find the IP address on the nameserver. The nslookup command is useful for that. Here is how you use it:
# nslookup oracle.com
Server:         10.14.1.58
Address:        10.14.1.58#53
Non-authoritative answer:
Name:   oracle.com
Address: 141.146.8.66
Let’s dissect the output. The Server output is the address of the nameserver. The name oracle.com resolves to the IP address 141.146.8.66. The name was resolved by the nameserver shown next to the word Server in the output.
If you put this IP address in a browser – http://141.146.8.66 instead of http://oracle.com – the browser will go to the oracle.com site.
If you made a mistake, or looked for a wrong host:
# nslookup oracle-site.com
Server:         10.14.1.58
Address:        10.14.1.58#53
** server can't find oracle-site.com: NXDOMAIN
The message is quite clear: this host does not exist.

dig

The nslookup command has been deprecated. Instead, a new, more powerful command – dig (domain information groper) – should be used. On some newer Linux servers the nslookup command may not even be available.
Here is an example; to check the name resolution of the host oracle.com, you use the following command:
# dig oracle.com
; <<>> DiG 9.2.4  <<>> oracle.com
;; global options:  printcmd
;; Got answer:
;; ->>HEADER<<-  opcode: QUERY, status: NOERROR, id: 62512
;; flags: qr rd ra; QUERY: 1,  ANSWER: 1, AUTHORITY: 8, ADDITIONAL: 8
 
;; QUESTION SECTION:
;oracle.com.                    IN      A
 
;; ANSWER SECTION:
oracle.com.             300     IN      A       141.146.8.66
 
;; AUTHORITY SECTION:
oracle.com.             3230    IN      NS      ns1.oracle.com.
oracle.com.             3230    IN      NS      ns4.oracle.com.
oracle.com.             3230    IN      NS      u-ns1.oracle.com.
oracle.com.             3230    IN      NS      u-ns2.oracle.com.
oracle.com.             3230    IN      NS      u-ns3.oracle.com.
oracle.com.             3230    IN      NS      u-ns4.oracle.com.
oracle.com.             3230    IN      NS      u-ns5.oracle.com.
oracle.com.             3230    IN      NS      u-ns6.oracle.com.
 
;; ADDITIONAL SECTION:
ns1.oracle.com.         124934  IN      A       148.87.1.20
ns4.oracle.com.         124934  IN      A       148.87.112.100
u-ns1.oracle.com.       46043   IN      A       204.74.108.1
u-ns2.oracle.com.       46043   IN      A       204.74.109.1
u-ns3.oracle.com.       46043   IN      A       199.7.68.1
u-ns4.oracle.com.       46043   IN      A       199.7.69.1
u-ns5.oracle.com.       46043   IN      A       204.74.114.1
u-ns6.oracle.com.       46043   IN      A       204.74.115.1
 
;; Query time: 97 msec
;; SERVER:  10.14.1.58#53(10.14.1.58)
;; WHEN: Mon Dec 29 22:05:56  2008
;; MSG SIZE  rcvd: 328
From the mammoth output, several things stand out. It shows that the command sent a query to the nameserver and got a response back. It also lists the nameservers authoritative for the domain, such as ns1.oracle.com, and shows that the query took 97 milliseconds.
If the sheer size of the output makes it less useful, you can use the +short option to suppress the verbose sections:
# dig +short oracle.com
141.146.8.66
You can also do a reverse lookup, i.e., find the host name from an IP address. The -x option is used for that:
# dig -x 141.146.8.66
The +domain parameter is useful when you are looking for a host inside a domain. For instance, if you are searching for the host otn in the oracle.com domain, you can either use:
# dig +short otn.oracle.com
Or you can use the +domain parameter:
# dig +short +tcp  +domain=oracle.com otn
www.oracle.com.
www.oraclegha.com.
141.146.8.66

Usage for the Oracle User

Suppose connectivity is to be established between the app server and the database server. The TNSNAMES.ORA file used by SQL*Net may look like this:
prodb3 =
  (description =
    (address_list =
      (address = (protocol = tcp)(host = prolin3)(port = 1521))
    )
    (connect_data =
      (sid = prodb3)
    )
  )
The host name prolin3 must be resolvable by the app server: either it should be in the /etc/hosts file, or the host prolin3 should be defined in the DNS. To make sure the name resolution works, and points to the right host, you can use the dig command.
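As a sketch of such a check (check_resolves is a hypothetical helper, and prolin3 is the example host from the entry above; getent consults the same sources the OS uses, /etc/hosts and then DNS, whereas dig +short would query DNS alone):

```shell
# Verify that the app server can resolve the host named in tnsnames.ora.
# check_resolves is a hypothetical helper for illustration.
check_resolves() {
    if getent hosts "$1" > /dev/null; then
        echo "$1 resolves"
    else
        echo "$1 does NOT resolve"
    fi
}
check_resolves prolin3
```

Running this on the app server before touching the Oracle configuration quickly separates name-resolution problems from listener problems.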
With these two commands you can handle most of the tasks involved with network in a Linux environment. In the rest of this installment you will learn how to manage a Linux environment effectively.

uptime

You just logged on to the server and see that some things that were supposed to be running are not. Perhaps the processes were killed, or perhaps all processes were killed by a shutdown. Instead of guessing, find out whether the server was indeed rebooted with the uptime command. The command shows the length of time the server has been up since the last reboot.
# uptime
 16:43:43 up 672 days, 17:46,   45 users,  load average: 4.45,  5.18, 5.38
The output shows much useful information. The first column shows the current time when the command was executed. The second portion – up 672 days, 17:46 – shows the amount of time the server has been up. The numbers 17:46 depict the hour and minutes. So this server has been up for 672 days, 17 hours, and 46 minutes as of now.
The next item – 45 users – shows how many users are logged in to the server right now.
The last portion of the output shows the load average of the server over the last 1, 5, and 15 minutes respectively. The term “load average” is a composite score that reflects the load on the system based on CPU and I/O metrics. The higher the load average, the heavier the load on the system. Unlike a percentage, it is not bounded by a fixed number such as 100, and the load averages of two different systems can’t be compared; the number quantifies load on one system and is relevant to that system alone. This output shows that the load average was 4.45 over the last 1 minute, 5.18 over the last 5 minutes, and 5.38 over the last 15 minutes.
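For scripting, the 1-minute load average can be extracted and shown next to the CPU count, since a 1-minute load well above the number of CPUs means the run queue is backed up. A minimal sketch, assuming the standard uptime output format:

```shell
# Extract the 1-minute load average from uptime and show it alongside
# the CPU count.  The parsing assumes the standard "load average: x, y, z"
# format shown above.
load1=$(uptime 2>/dev/null | awk -F'load average: ' '{print $2}' | cut -d, -f1)
cpus=$(nproc)
echo "1-min load: $load1 on $cpus CPUs"
```

The same parsing works on saved uptime output, which is handy when reviewing an incident after the fact.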
The command does not have any options or accept any parameter other than -V, which shows the version of the command.
# uptime -V
procps version 3.2.3

Usage for Oracle Users

There is no clear Oracle-specific use of this command, except that you can find out the load on the system to explain some performance issues. If you see some performance issues on the database, and you trace it to high CPU or I/O load, you should immediately check the load averages using the uptime command. If you see a high load average, your next course of action is to dive down deep below the surface to find the root cause. To perform that deep dive, you have in your arsenal tools like mpstat, iostat, and sar (covered in this installment of this series).
Consider an output as shown below:
# uptime
 21:31:04 up 330 days,   7:16,  4 users,  load average: 12.90, 1.03, 1.00
It’s interesting: the load average was very high (12.90) over the last 1 minute but has been pretty low, even negligible, at 1.03 and 1.00 over the last 5 and 15 minutes respectively. What does it mean? It indicates that less than 5 minutes ago, some process started that caused the load average to jump for the last minute. The process was not present earlier, because the previous load averages were so small. This analysis leads us to focus on the processes that kicked off during the last few minutes, speeding up the resolution.
Of course, since it shows how long the server has been up, it also tells you the longest the database instance could have been running.

who

Who is logged in to the system right now? That’s a common question you might want answered, especially when you are tracking down an errant user running some resource-consuming command.
The who command answers that question. Here is the simplest usage without any arguments or parameters.
# who
oracle   pts/2        Jan  8 15:57  (10.14.105.139)
oracle   pts/3        Jan  8 15:57  (10.14.105.139)
root     pts/1        Dec 26 13:42  (:0.0)
root     :0           Oct 23 15:32
The command can take several options. The -s option is the default; it produces the same output as the above.
Looking at the output, you might be straining your memory to remember what the columns are meant to be. Well, relax. You can use the -H option to display the header:
# who -H
NAME     LINE         TIME         COMMENT
oracle   pts/2        Jan  8 15:57  (10.14.105.139)
oracle   pts/3        Jan  8 15:57  (10.14.105.139)
root     pts/1        Dec 26  13:42 (:0.0)
root     :0           Oct 23  15:32
Now the meanings of the columns are clear. The NAME column shows the username of the logged-in user. LINE shows the terminal name; in Linux each connection is labeled as a terminal with the naming convention pts/<n>, where <n> is a number. The :0 terminal is a label for the X terminal. TIME shows when the user first logged in. COMMENT shows the IP address they logged in from.
What if you just want a list of the names of users, without all those extraneous details? The -q option accomplishes that. It displays the names of the users on one line, sorted alphabetically, along with a count of the total number of users at the end (45 in this case):
# who -q
ananda ananda jsmith klome oracle oracle root root
… and so on for 45 names
# users=45
Some users may be logged on but actually doing nothing. You can check how long they have been idle – especially useful if you are the boss – by using the -u option:
# who -uH
NAME     LINE         TIME          IDLE          PID COMMENT
oracle   pts/2        Jan  8 15:57   .          18127 (10.14.105.139)
oracle   pts/3        Jan  8 15:57  00:26       18127 (10.14.105.139)
root     pts/1        Dec 26 13:42   old         6451 (:0.0)
root     :0           Oct 23 15:32    ?         24215
The new column IDLE shows how long they have been idle in hh:mm format. Note the value “old” in that column? It means that the user has been idle for more than 1 day. The PID column shows the process ID of their shell connection.
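Building on that, a one-liner can list only the sessions that have been idle more than a day. A sketch assuming the column layout shown above, where IDLE is the sixth field of who -u output:

```shell
# List sessions idle for more than a day ("old" in the IDLE column).
# The IDLE value is the sixth field of who -u output.
who -u | awk '$6 == "old" {print $1, $2}'
```

This prints the user and terminal of each long-idle session, ready to feed into a cleanup script.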
Another useful option is -b, which shows when the system was rebooted.
# who -b
         system boot  Feb 15  13:31
It shows the system was booted on Feb 15th at 1:31 PM. Remember the uptime command? It also shows you how long this system has been up. You can subtract the days shown in uptime to know the day of the boot. The who -b command makes it much simpler; it directly shows you the time of the boot.
Very Important Caveat: The who -b command shows the month and date only, not the year. So if the system has been up longer than a year, the output will not reflect the correct value. Therefore uptime is always a preferred approach, even if you have to do a little calculation. Here is an example:
# uptime
 21:37:49 up 675 days, 22:40,   1 user,  load average: 3.35,  3.08, 2.86
# who -b
         system boot   Mar  7 22:58
Note the boot time shows as March 7. That’s in 2007, not 2008! The uptime shows the correct time – it has been up for 675 days. If subtractions are not your forte you can use a simple SQL to get that date 675 days ago:
SQL> select sysdate - 675  from dual;
SYSDATE-6
---------
07-MAR-07
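If you'd rather not start SQL*Plus at all, GNU date can do the same calendar arithmetic. A sketch (the anchored form below uses a known date so the arithmetic is verifiable):

```shell
# The same subtraction without a database: GNU date handles calendar
# arithmetic directly.
date -d '675 days ago' +%d-%b-%y             # 675 days before today
date -d '2009-01-10 675 days ago' +%d-%b-%y  # anchored at a known date
```

Anchoring at 2009-01-10 lands on 2007-03-07, matching the SQL result above.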
The -l option shows the logons to the system:
# who -lH 
NAME     LINE         TIME         IDLE          PID COMMENT
LOGIN    tty1         Feb 15  13:32              4081 id=1
LOGIN    tty6         Feb 15  13:32              4254 id=6
To find out the user terminals that have been dead, use the -d option:
# who -dH
NAME     LINE         TIME         IDLE          PID COMMENT  EXIT
                      Feb 15  13:31               489 id=si     term=0 exit=0
                      Feb 15  13:32              2870 id=l5     term=0 exit=0
         pts/1        Oct 10  14:53             31869 id=ts/1  term=0 exit=0
         pts/4        Jan 11  00:20             22155 id=ts/4  term=0 exit=0
         pts/3        Jun 29  16:01                 0 id=/3    term=0 exit=0
         pts/2         Oct 4  22:35              8371 id=/2    term=0 exit=0
         pts/5        Dec 30  03:15              5026 id=ts/5  term=0 exit=0
         pts/4        Dec 30  22:35                 0 id=/4    term=0 exit=0
Sometimes the init process (the first process to start when the system boots) kicks off other processes. The -p option shows those logins that are active.
# who -pH
NAME     LINE         TIME                PID COMMENT
                      Feb 15 13:32       4083 id=2
                      Feb 15 13:32       4090 id=3
                      Feb 15 13:32       4166 id=4
                      Feb 15 13:32       4174 id=5
                      Feb 15 13:32       4255 id=x
                      Oct  4 23:14      13754 id=h1
Later in this installment, you will learn about a command – write – that enables real time messaging. You will also learn how to disable others’ ability to write to your terminal (the mesg command). If you want to know which users do and do not allow others to write to their terminals, use the -T option:
# who -TH
NAME       LINE          TIME         COMMENT
oracle   + pts/2        Jan 11 12:08  (10.23.32.10)
oracle   + pts/3        Jan 11 12:08  (10.23.32.10)
oracle   - pts/4        Jan 11 12:08  (10.23.32.10)
root     + pts/1        Dec 26 13:42  (:0.0)
root     ? :0           Oct 23 15:32
The + sign before the terminal name means the terminal accepts write commands from others; the “-” sign means that it does not. A “?” in this field means the terminal does not support writing to it, e.g. an X-window session.
The current run level of the system can be obtained by the -r option:
# who -rH
NAME     LINE         TIME         IDLE          PID COMMENT
         run-level 5  Feb 15  13:31                   last=S
A more descriptive listing can be obtained by the -a (all) option. This option combines the -b -d -l -p -r -t -T -u options. So these two commands produce the same result:
# who  -bdlprtTu
# who -a
Here is a sample output (with the header, so that you can understand the columns better):
# who -aH
NAME       LINE          TIME         IDLE          PID COMMENT  EXIT
                        Feb 15 13:31               489 id=si     term=0 exit=0
           system boot  Feb 15 13:31
           run-level 5  Feb 15 13:31                   last=S
                        Feb 15 13:32              2870 id=l5     term=0 exit=0
LOGIN      tty1         Feb 15 13:32              4081 id=1
                        Feb 15 13:32              4083 id=2
                        Feb 15 13:32              4090 id=3
                        Feb 15 13:32              4166 id=4
                        Feb 15 13:32              4174 id=5
LOGIN      tty6         Feb 15 13:32              4254 id=6
                        Feb 15 13:32              4255 id=x
                        Oct  4 23:14             13754 id=h1
           pts/1        Oct 10 14:53             31869 id=ts/1  term=0 exit=0
oracle   + pts/2        Jan  8 15:57   .         18127 (10.14.105.139)
oracle   + pts/3        Jan  8 15:57  00:18      18127 (10.14.105.139)
           pts/4        Dec 30 03:15              5026 id=ts/4  term=0 exit=0
           pts/3        Jun 29 16:01                 0 id=/3    term=0 exit=0
root     + pts/1        Dec 26 13:42  old         6451 (:0.0)
           pts/2        Oct  4 22:35              8371 id=/2     term=0 exit=0
root     ? :0           Oct 23 15:32   ?         24215
           pts/5        Dec 30 03:15              5026 id=ts/5  term=0 exit=0
           pts/4        Dec 30 22:35                 0 id=/4    term=0 exit=0
To find out your own login, use the -m option:
# who -m
oracle   pts/2        Jan  8 15:57  (10.14.105.139)
Note the pts/2 value? That’s the terminal number. You can find your own terminal via the tty command:
# tty
/dev/pts/2
There is a special command structure in Linux to show your own login – who am i. It produces the same output as the -m option.
# who am i
oracle   pts/2        Jan  8 15:57  (10.14.105.139)
The only arguments allowed are “am i” and “mom likes” (yes, believe it or not!). Both produce the same output.

The Original Instant Messenger System

With the advent of instant messaging or chat programs, we seem to have conquered the challenge of exchanging information in real time without the distraction of voice communication. But is this capability limited to fancy, dedicated programs?
The instant messaging or chat concept has been available on *nix for quite a while. In fact, you have a full fledged secure IM system built right into Linux. It allows you to securely talk to anyone connected to the system; no internet connection is required. The chat is enabled through the commands – write, mesg, wall and talk. Let’s examine each of them.
The write command can write to a user’s terminal. If the user has logged in more than one terminal, you can address a specific terminal. Here is how you write a message “Beware of the virus” to the user “oracle” logged in on terminal “pts/3”:
# write oracle pts/3
Beware of the virus
ttyl 
<Control-D>
#
The Control-D key combination ends the message, sends it to the user’s terminal, and returns the shell prompt (#) to the sender. When the above is sent, the user “oracle” will see these messages on terminal pts/3:
Beware of the virus
ttyl
Each line appears as the sender presses ENTER at the end of it. When the sender presses Control-D, marking the end of transmission, the receiver sees EOF on the screen. The message is displayed regardless of what the user is currently doing. If the user is editing a file in vi, the message appears and the user can clear it by pressing Control-L. If the user is at the SQL*Plus prompt, the message still appears but does not affect the user’s keystrokes.
What if you don’t want that slight inconvenience? Perhaps you don’t want anyone to send you a message – akin to leaving the phone off the hook. You can arrange that via the mesg command, which disables others’ ability to send you messages. Without any arguments, the command shows the current setting:
# mesg
is y
It shows that others can write to you. To turn it off:
# mesg n
Now to confirm:
# mesg 
is n
Before you attempt to write to users’ terminals, you may want to know which terminals have disabled writing from others. The who -T command (described earlier in this installment) shows you that:
# who -TH
NAME       LINE          TIME         COMMENT
oracle   + pts/2        Jan 11 12:08 (10.23.32.10)
oracle   + pts/3        Jan 11 12:08 (10.23.32.10)
oracle   - pts/4        Jan 11 12:08 (10.23.32.10)
root     + pts/1        Dec 26 13:42 (:0.0)
root     ? :0           Oct 23 15:32
The + sign before the terminal name indicates that it accepts write commands from others; the “-” sign indicates that it doesn’t. The “?” indicates that the terminal does not support writing to it, e.g. an X-window session.
What if you want to write to all the logged in users? Instead of typing to each user, use the wall command:
# wall
hello everyone
When sent, the following shows up on the terminals of all logged in users:
Broadcast message from oracle  (pts/2) (Thu Jan  8 16:37:25 2009):

hello everyone
This is very useful for the root user. When you want to shut down the system, unmount a filesystem, or perform similar administrative functions, you may want all users to log off. Use this command to send them a message.
Finally, the program talk allows you to chat in real time.  Just type the following:
# talk oracle pts/2
If you want to talk to a user on a different server – prolin2 – you can use
# talk oracle@prolin2 pts/2
It brings up a chat window on the other terminal, and now you can chat in real time. Is it that different from a “professional” chat program you might use now? Probably not. Oh, by the way, for talk to work, the talkd daemon must be running; it may not be installed by default.

w

Yes, it’s a command, even if it’s just one letter long! The command w is a combination of the uptime and who commands, given one immediately after the other, in that order. Let’s see a very common output, without any arguments or options:
# w
 17:29:22 up 672 days, 18:31,   2 users,  load average: 4.52,  4.54, 4.59
USER     TTY      FROM              LOGIN@   IDLE   JCPU   PCPU WHAT
oracle   pts/1     10.14.105.139    16:43    0.00s   0.06s  0.01s w
oracle   pts/2     10.14.105.139    17:26   57.00s   3.17s  3.17s sqlplus   as sysdba
… and so on …
The output has two distinct parts. The first part is the output of the uptime command (described earlier in this installment): how long the server has been up, how many users are logged in, and the load averages for the last 1, 5 and 15 minutes. The second part is the output of the who command with the -H option (also explained in this installment); its columns were described under that command.
If you’d rather not display the header, use the -h option:
#  w -h
oracle   pts/1     10.14.105.139    16:43    0.00s   0.02s  0.01s w -h
This removes the header from the output. It’s useful in shell scripts where you want to read and act on the output without the additional burden of skipping the header.
The -s option produces a compact (short) version of the output, removing the login time and the JCPU and PCPU times.
# w -s
 17:30:07 up 672 days, 18:32,   2 users,  load average: 5.03,  4.65, 4.63
USER     TTY      FROM               IDLE WHAT
oracle   pts/1     10.14.105.139     0.00s w -s
oracle   pts/2     10.14.105.139     1:42  sqlplus   as sysdba
You might find that the “FROM” field is not very useful; it shows the IP address of the same server, since the logins are all local. To save space in the output, you may want to suppress it. The -f option disables printing of the FROM field:
# w -f
 17:30:53 up 672 days, 18:33,   2 users,  load average: 4.77,  4.65, 4.63
USER     TTY        LOGIN@   IDLE    JCPU   PCPU WHAT
oracle   pts/1      16:43    0.00s  0.06s   0.00s w -f
oracle   pts/2      17:26    2:28   3.17s   3.17s sqlplus   as sysdba
The command accepts only one argument: a username. By default w shows the processes and logins of all users; if you supply a username, it shows the logins for that user only. For instance, to show logins for root only, issue:
# w -h root
root     pts/1    :0.0             26Dec08 13days 0.01s   0.01s bash
root     :0       -                23Oct08 ?xdm?   21:13m  1.81s  /usr/bin/gnome-session
The -h option was again used to suppress the header.

kill

A process is running and you want it terminated. What should you do? Perhaps the process runs in the background, so there is no going to its terminal and pressing Control-C; or the process belongs to another user (using the same userid, such as “oracle”) and you want to terminate it. The kill command comes to the rescue; it does what its name suggests – it kills the process. The most common use is:
# kill <Process ID of the Linux process>
Suppose you want to kill a process called sqlplus run by the user oracle. First you need to know its process ID, or PID:
# ps -aef|grep sqlplus|grep oracle
oracle    8728 23916  0 10:36 pts/3    00:00:00 sqlplus
oracle    8768 23896  0 10:36 pts/2    00:00:00  grep sqlplus
Now, to kill the PID 8728:
# kill 8728
That’s it; the process is killed. Of course, you have to be the same user (oracle) to kill a process kicked off by oracle. To kill processes kicked off by other users you have to be super user – root.
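The whole sequence can be sketched in a script. This is a demo under one assumption: a throwaway sleep process stands in for sqlplus, so nothing real is harmed:

```shell
#!/bin/bash
# Demo only: "sleep 60" stands in for a real process such as sqlplus.
sleep 60 &
pid=$!                     # PID of the background job, as ps would report it
kill "$pid"                # sends the default signal, SIGTERM (15)
status=0
wait "$pid" 2>/dev/null || status=$?   # reap it; 128+15=143 means SIGTERM
echo "wait status: $status"
kill -0 "$pid" 2>/dev/null || echo "process $pid is gone"
```

The final kill -0 sends no signal at all; it only tests whether the process still exists, which is a convenient way to confirm the kill worked.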
Sometimes you may want to merely halt the process instead of killing it. You can use the option -SIGSTOP with the kill command.
# kill -SIGSTOP 9790
# ps -aef|grep sqlplus|grep oracle
oracle    9790 23916   0 10:41 pts/3    00:00:00 sqlplus   as sysdba
oracle    9885 23896  0 10:41 pts/2    00:00:00  grep sqlplus
This works well for background jobs; for foreground processes it merely stops the process and takes control away from the user. So, if you check for the process again after issuing the command:
# ps -aef|grep sqlplus|grep oracle
oracle    9790 23916  0 10:41 pts/3    00:00:00 sqlplus   as sysdba
oracle   10144 23896  0 10:42 pts/2    00:00:00  grep sqlplus
You see that the process is still running; it has not been terminated. To kill this process – or any stubborn process that refuses to be terminated – you have to pass a different signal, SIGKILL. (The default signal is SIGTERM.)
# kill -SIGKILL 9790
# ps -aef|grep sqlplus|grep oracle
oracle   10092 23916  0 10:42 pts/3    00:00:00 sqlplus   as sysdba
oracle   10198 23896  0 10:43 pts/2    00:00:00  grep sqlplus
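Here is a hedged sketch of the stop-versus-kill behavior, again with a throwaway sleep standing in for the real process; on Linux, ps reports a stopped process with the state “T”:

```shell
#!/bin/bash
# Demo only: sleep stands in for the real process.
sleep 60 &
pid=$!
kill -SIGSTOP "$pid"            # suspend, don't terminate
sleep 1                         # give the signal a moment to be delivered
state=$(ps -o stat= -p "$pid")  # "T" means stopped
echo "state after SIGSTOP: $state"
kill -SIGKILL "$pid"            # SIGKILL cannot be caught or ignored
wait "$pid" 2>/dev/null || true # reap the killed process
kill -0 "$pid" 2>/dev/null || echo "process $pid is gone"
```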
Note the options -SIGSTOP and -SIGKILL, which pass a specific signal (stop and kill, respectively) to the process. There are several other signals you can use likewise. To get a listing of all the available signals, use the -l (that’s the letter “L”, not the numeral “1”) option:
# kill -l
 1) SIGHUP       2) SIGINT       3) SIGQUIT      4) SIGILL
 5) SIGTRAP      6) SIGABRT      7) SIGBUS       8) SIGFPE
 9) SIGKILL     10) SIGUSR1     11) SIGSEGV     12) SIGUSR2
13) SIGPIPE     14) SIGALRM     15) SIGTERM     17) SIGCHLD
18) SIGCONT     19) SIGSTOP     20) SIGTSTP     21) SIGTTIN
22) SIGTTOU     23) SIGURG      24) SIGXCPU     25) SIGXFSZ
26) SIGVTALRM   27) SIGPROF     28) SIGWINCH    29) SIGIO
30) SIGPWR      31) SIGSYS      34) SIGRTMIN    35) SIGRTMIN+1
36) SIGRTMIN+2  37) SIGRTMIN+3  38) SIGRTMIN+4  39) SIGRTMIN+5
40) SIGRTMIN+6  41) SIGRTMIN+7  42) SIGRTMIN+8  43) SIGRTMIN+9
44) SIGRTMIN+10 45) SIGRTMIN+11 46) SIGRTMIN+12 47) SIGRTMIN+13
48) SIGRTMIN+14 49) SIGRTMIN+15 50) SIGRTMAX-14 51) SIGRTMAX-13
52) SIGRTMAX-12 53) SIGRTMAX-11 54) SIGRTMAX-10 55) SIGRTMAX-9
56) SIGRTMAX-8  57) SIGRTMAX-7  58) SIGRTMAX-6  59) SIGRTMAX-5
60) SIGRTMAX-4  61) SIGRTMAX-3  62) SIGRTMAX-2  63) SIGRTMAX-1
64) SIGRTMAX
You can also use the numeral equivalent of the signal in place of the actual signal name. For instance, instead of kill -SIGKILL 9790, you can use kill -9 9790.
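To check a name-to-number mapping quickly without scanning the whole table, you can hand a number to the -l option; a small bash sketch:

```shell
#!/bin/bash
# kill -l with a number prints the corresponding signal name.
kill -l 9    # prints: KILL
kill -l 15   # prints: TERM
name9=$(kill -l 9)
name15=$(kill -l 15)
echo "signal 9 is SIG$name9, signal 15 is SIG$name15"
```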
By the way, this is an interesting command. Remember, most Linux commands are executable files located in /bin, /sbin, /usr/bin and similar directories; the PATH variable determines where the shell looks for these command files. Some other commands are actually “built-in” commands, i.e. they are part of the shell itself. One such example is kill. To demonstrate, enter the following:
# kill -h 
-bash: kill: h: invalid signal  specification
Note the output that came back from the bash shell: the built-in kill rejected the usage, because it does not expect a -h argument. Now try the following:
# /bin/kill -h
usage: kill [ -s signal | -p ]  [ -a ] pid ...
       kill -l [ signal ]
Aha! This version of kill – the executable in the /bin directory – accepted the -h option properly. Now you know the subtle difference between shell built-in commands and their namesake utilities in the form of executable files.
Why is it important to know the difference? Because the functionality can vary significantly between the two forms. The kill built-in has less functionality than its utility counterpart. When you issue the command kill, you are actually invoking the built-in, not the utility. To get the additional functionality, you have to use the /bin/kill utility.
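You can see both forms side by side with bash’s type command; a short sketch:

```shell
#!/bin/bash
# 'type -a' lists every definition of a name the shell knows about.
type -a kill                # shows the builtin and the file, in lookup order
kind=$(type -t kill)        # the form that actually runs by default
path=$(type -P kill)        # the disk file, regardless of the builtin
echo "default form: $kind; utility file: $path"
```

To force the utility without typing its full path, you can also prefix the call with command, as in command kill.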
The kill utility has many options and arguments. The most popular use is killing processes by process name rather than by PID. Here is an example where you kill all processes with the name sqlplus:
# /bin/kill sqlplus
[1]   Terminated              sqlplus
[2]   Terminated              sqlplus
[3]   Terminated              sqlplus
[4]   Terminated              sqlplus
[5]   Terminated              sqlplus
[6]   Terminated              sqlplus
[7]-  Terminated              sqlplus
[8]+  Terminated              sqlplus
Sometimes you may want to see all the process IDs kill would terminate before it acts. The -p option accomplishes that: it prints all the PIDs it would have killed, without actually killing them, serving as a confirmation prior to action:
#  /bin/kill -p sqlplus
6798
6802
6803
6807
6808
6812
6813
6817
The output shows the PIDs of the processes it would have killed. If you reissue the command without the -p option, it will kill all those processes.
At this point you may be curious which other commands are “built-in” to the shell rather than being utilities:
# man -k builtin
. [builtins]         (1)   - bash built-in commands, see bash(1)
: [builtins]         (1)   - bash built-in commands, see bash(1)
[ [builtins]         (1)   - bash built-in commands, see bash(1)
alias [builtins]     (1)   - bash built-in commands, see bash(1)
bash [builtins]      (1)   - bash built-in commands, see bash(1)
bg [builtins]        (1)   - bash built-in commands, see bash(1)
… and so on …
Some entries seem familiar – alias, bg and so on. Some are purely built-ins, e.g. alias. There is no executable file called alias.

Usage for Oracle Users

Killing a process has many uses – mostly killing zombie processes, background processes, and processes that have stopped responding to the normal shutdown commands. For instance, an Oracle database instance may fail to shut down as a result of some memory issue; you then have to bring it down by killing one of the key processes, such as pmon or smon. This should not be a routine activity – do it only when you don’t have much choice.
You may want to kill all sqlplus sessions or all rman jobs using the kill utility. Oracle Enterprise Manager processes run as perl processes; DBCA or DBUA processes may also be running, and you may want to kill them quickly:
# /bin/kill perl rman dbca dbua java
There is also a more common use of the command. When you want to terminate a user session in Oracle Database, you typically do this:
  • Find the SID and Serial# of the session
  • Kill the session using ALTER SYSTEM command
Let’s see what happens when we want to kill the session of the user SH.
SQL> select sid, serial#,  status
  2  from v$session
  3* where username = 'SH';
       SID    SERIAL# STATUS
---------- ---------- --------
       116       5784  INACTIVE
 
SQL> alter system kill  session '116,5784'
  2  /
 
System altered.
 
It’s killed; but when you check the status of the session:
 
       SID    SERIAL# STATUS
---------- ---------- --------
       116       5784 KILLED
It shows as KILLED, not completely gone. This happens because Oracle waits until the user SH returns to the session and attempts to do something, at which point he gets the message “ORA-00028: your session has been killed”. After that, the session disappears from V$SESSION.
A faster way to kill a session is to kill the corresponding server process at the Linux level. To do so, first find the PID of the server process:
SQL> select spid
  2  from v$process
  3  where addr =
  4  (
  5     select paddr
  6     from v$session
  7     where username =  'SH'
  8  );
SPID
------------------------
30986
The SPID is the Process ID of the server process. Now kill this process:
# kill -9 30986
Now if you check the view V$SESSION, it will be gone immediately. The user will not get a message immediately; but if he attempts to perform a database query, he will get:
ERROR at line 1:
ORA-03135: connection lost  contact
Process ID: 30986
Session ID: 125 Serial number:  34528
This is a faster method of killing a session, but there are some caveats: the Oracle database still has to perform session cleanup – rolling back changes and so on – so it should be done only when the session is idle. Otherwise, you can use one of these two other ways to disconnect a session:
alter system disconnect session  '125,35447' immediate;
alter system disconnect session  '125,35447' post_transaction;

killall

Unlike the dual-natured kill, killall is purely a utility, i.e. an executable program in the /usr/bin directory. The command is similar to kill in functionality, but instead of taking the PID of the process to kill, it accepts the process name as an argument. For instance, to kill all sqlplus processes, issue:
# killall sqlplus
This kills all processes named sqlplus (which you have the permission to kill, of course). Unlike the kill built-in command, you don’t need to know the Process ID of the processes to be killed.
If the command does not terminate the process, or the process does not respond to a TERM signal, you can send an explicit SIGKILL signal as you saw in the kill command using the -s option.
# killall -s SIGKILL sqlplus
Like kill, you can use -9 option in lieu of -s SIGKILL. For a list of all available signals, you can use the -l option.
# killall -l
HUP INT QUIT ILL TRAP ABRT IOT  BUS FPE KILL USR1 SEGV USR2 PIPE ALRM TERM
STKFLT CHLD CONT STOP TSTP TTIN  TTOU URG XCPU XFSZ VTALRM PROF WINCH IO PWR SYS
UNUSED
To get a verbose output of the killall command, use the -v option:
# killall -v sqlplus
Killed sqlplus(26448) with signal 15
Killed sqlplus(26452) with signal 15
Killed sqlplus(26456) with signal 15
Killed sqlplus(26457) with signal 15
… and so on …
Sometimes you may want to examine the process before terminating it. The -i option allows you run it interactively. This option prompts for your input before killing it:
# killall -i sqlplus
Kill sqlplus(2537) ? (y/n) n
Kill sqlplus(2555) ? (y/n) n
Kill sqlplus(2555) ? (y/n) y
Killed sqlplus(2555) with signal 15
What happens when you pass a wrong process name?
# killall wrong_process
wrong_process: no process  killed
There is no running process called wrong_process, so nothing was killed, and the output clearly said so. To suppress the “no process killed” complaint, use the -q option. That option comes in handy in shell scripts, where you can’t easily parse the output; rather, you want to capture the return code from the command:
# killall -q wrong_process
# echo $?
1
The return code (shown in the shell variable $?) is “1” instead of “0”, meaning failure. You can check the return code to determine whether killall was successful, i.e. whether the return code was “0”.
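The same check-the-return-code pattern can be sketched with the kill built-in, which is handy where killall (part of the psmisc package) is not installed; signal 0 probes for a process without actually sending anything. The PID below is made up for the demo and assumed not to exist:

```shell
#!/bin/bash
# Assumption for the demo: no process exists with this made-up PID.
bogus_pid=999999
rc=0
kill -0 "$bogus_pid" 2>/dev/null || rc=$?
echo "return code: $rc"      # non-zero means failure, just as with killall -q
if [ "$rc" -ne 0 ]; then
    echo "no such process; a script can branch on this"
fi
```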
One interesting thing about this command is that it does not kill itself. Of course, it kills other killall commands given elsewhere but not itself.

Usage for Oracle Users

Like the kill command, killall is used to kill processes. Its biggest advantages are the ability to display the process IDs before killing and its interactive mode. Suppose you want to kill all perl, java, sqlplus, rman and dbca processes, but interactively; you can issue:
# killall -i -p perl sqlplus  java rman dbca
Kill sqlplus(pgid 7053) ? (y/n) n
Kill perl(pgid 31233) ? (y/n) n
... and so on ...
This allows you to view the PID before you kill them, which can be very useful.

Conclusion

In this installment you learned about these commands (shown in alphabetical order):
dig
A newer version of nslookup
ifconfig
To display information on network interfaces
kill
Kill a specific process
killall
Kill a specific process, a group of processes and names matching a pattern
mesg
To turn on or off the ability of others to display something on one’s terminal.
netstat
To display statistics and other metrics on network interface usage
nslookup
To lookup a hostname for its IP address or lookup IP address for its hostname on the DNS
talk
To establish an Instant Message system between two users for realtime chat
uptime
How long the system has been up and its load average for 1, 5 and 15 minutes
w
Combination of uptime and who
wall
To display some text on the terminals of all the logged in users
who
To display the users logged into the system and what they are doing
write
To instantly display something on a specific user’s terminal session
As I have mentioned earlier, it is not my intention to present every available Linux command here. You need to master only a handful of them to manage a system effectively, and this series shows you those very important ones. Practice them in your environment to understand these commands – with their parameters and options – very well. In the next installment, the last one, you will learn how to manage a Linux environment – on a regular machine, in a virtual machine, and in the cloud.



Shell Keyword Variables

When at the command line, you are using a ''shell'' – most likely the bash shell. In a shell you can define a variable and assign it a value to be retrieved later. Here is an example with a variable named ORACLE_HOME:
# export ORACLE_HOME=/opt/oracle/product/11gR2/db1
Later, you can refer to the variable by prefixing a ''$'' sign to the variable name, e.g.:
# cd $ORACLE_HOME
This is called a user-defined variable. Likewise, there are several variables defined by the shell itself. These variables – whose names are pre-defined by the shell – control how you interact with it. You should learn at least a handful of the important ones to improve the quality and efficiency of your work.

PS1

This variable sets the Linux command prompt. Here is an example where we change the prompt from the default ''# '' to ''$ '':
# export PS1="$ "
$
Note how the prompt changed to $. You can place any characters here to change the default prompt. The double quotes are not strictly necessary, but since we want a space after the ''$'' sign, we place quotes around the value.
Is that it – showing the prompt as some predefined character or character string? Not at all. You can also place special symbols in the variable to show dynamic values. For instance, the symbol \u shows the username of the logged-in user and \h shows the hostname. Using these two symbols, the prompt can be customized to show who is logged in and where:
$ export PS1="\u@\h# "
oracle@oradba1#
This shows that oracle is logged in on the server oradba1 – enough to remind you who and where you are. You can further customize the prompt using another symbol, \W, which shows the basename of the current directory. Here is how the prompt looks now:
# export PS1="\u@\h \W# "
oracle@oradba1 ~#
The current directory is the home directory, so it shows ''~''. As you change to a different directory, the prompt changes with it.
Adding the current directory is a great way to remind yourself where you are and the implications of your actions. Executing rm * has a different impact on /tmp than if you were on /home/oracle, doesn't it?
There is another symbol – \w – and there is an important difference between \w and \W: the latter (\W) produces the basename of the current directory, while the former (\w) shows the full directory path:
oracle@oradba1 11:59 AM db1# export PS1="\u@\h \@ \w# "
oracle@oradba1 12:01 PM /opt/oracle/product/11gR2/db1#
Note the difference? In the previous prompt, where \W was used, it showed only the directory db1, which is the basename. In the next prompt where \w was used, the full directory /opt/oracle/product/11gR2/db1 was displayed.
In many cases a full directory name in the prompt is immensely helpful. Suppose you have three Oracle Homes, each with a subdirectory called db1. How will you know where exactly you are if only ''db1'' is displayed? A full path leaves no doubt. However, a full path also makes the prompt very long, which can be a little inconvenient.
The symbol ''\@'' shows the current time in hour:minute AM/PM format:
# export PS1="\u@\h \@ \W# "
oracle@oradba1 11:59 AM db1#

Here are some other symbols you can use in PS1 shell variable:
\!
The command number in the history (more on this later)
\d
The date in Weekday Month Date format
\H
The host name with the domain name. \h is the hostname without the domain
\T
The same as \@ but displaying seconds as well.
\A
The time in hour:minutes as in \@ format but 24 hours
\t
The same as \A but with the seconds as well.
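Putting a few of these symbols together, here is a sketch of a line you might add to a startup file such as ~/.bashrc; the choice of symbols is a matter of taste:

```shell
# user@host, 24-hour time, full current directory; '\$' prints '#' for root
# and '$' for other users. Single quotes keep the backslashes intact.
export PS1='\u@\h \A \w\$ '
```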

IFS

This variable tells the shell whether to treat the value of a variable as a whole or to separate it into fields. If it is to be separated, the characters in the IFS variable are used as the separators – hence the name, Internal Field Separator (IFS). To demonstrate, let's define a variable as shown below:
# pfiles=initODBA112.ora:init.ora
These are actually two file names: initODBA112.ora and init.ora. Now, to display the first line of each of these files, you would use the head -1 command:
# head -1 $pfiles
head: cannot open `initODBA112.ora:init.ora' for reading: No such file or directory
The output says it all: the shell treated the variable as a whole – `initODBA112.ora:init.ora' – which is not the name of any file, so the head command fails. If the shell had interpreted the '':'' as a separator, it would have done the job properly. That is what we achieve by setting the IFS variable:
# export IFS=":"
# head -1 $pfiles
==> initODBA112.ora <==
# first line of file initODBA112.ora
 
==> init.ora <==
# first line of file init.ora
There you go – the shell expanded the command head -1 $pfiles to head -1 initODBA112.ora and head -1 init.ora and therefore the command executed properly.
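The same splitting can be sketched inside a script. Note that IFS affects only unquoted expansions, and it is good practice to save and restore it:

```shell
#!/bin/bash
pfiles="initODBA112.ora:init.ora"
old_ifs=$IFS
IFS=":"                      # split on colons instead of whitespace
for f in $pfiles; do         # deliberately unquoted so splitting happens
    echo "would run: head -1 $f"
done
IFS=$old_ifs                 # restore the default splitting behavior
```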

PATH

When you use a command in Linux, it's either a shell built-in, as you saw with the kill command in Part 4, or an executable file. If it's an executable, how does the shell know where it is located?
Take, for instance, the rm command, which removes files. The command can be given from any directory, but of course the executable file rm does not exist in every directory – so how does Linux know where to look?
The variable PATH holds the locations where the shell must look for that executable. Here is an example of a PATH setting:
# echo $PATH
/usr/kerberos/bin:/usr/local/bin:/bin:/usr/bin:/usr/X11R6/bin:.
When you issue a command such as rm, the shell looks for a file rm in these locations in this order:
/usr/kerberos/bin
/usr/local/bin
/bin
/usr/bin
/usr/X11R6/bin
. (the current directory)
If the file is not found in any of these locations, the shell returns the error -bash: rm: command not found. If you want to add more locations to the PATH variable, append them with '':'' as a separator.
Did you notice an interesting fact above? The ''current directory'' entry is at the very end, not the beginning. Common sense may suggest putting it first, so that the shell looks for an executable in the current directory before looking elsewhere; putting it at the end makes the shell check the current directory last. So why do it that way?
Experts recommend putting the current directory (.) at the end of the PATH, not at the beginning, for safety. Suppose you are experimenting with ideas to enhance common shell commands and inadvertently leave such a file in your home directory. When you log in you are in your home directory, and when you execute that command you are not running the standard command but the executable file sitting in your home directory.
This could be disastrous in some cases. Suppose you are toying with a new version of the ''cp'' command and a file called cp is left in your home directory; this file could potentially do some damage. If you type ''cp somefile anotherfile'', your version of cp is executed, doing that damage. Putting the current directory at the end means the normal ''cp'' command is found first, avoiding such a risk.
It also reduces the risk of a hacker planting a malicious file masquerading as a common command. Some experts even suggest removing the ''.'' from the PATH altogether, to prevent any inadvertent execution. If you have to execute something in the current directory, just use the ./ notation, as in:
# ./mycommand
This executes a file called mycommand in the present directory.
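Here is a sketch of how PATH order decides which file runs, using a harmless fake cp in a temporary directory; all names are made up for the demo:

```shell
#!/bin/bash
tmpdir=$(mktemp -d)
printf '#!/bin/sh\necho fake cp\n' > "$tmpdir/cp"
chmod +x "$tmpdir/cp"
# type -P reports which file a PATH search would pick
first=$(PATH="$tmpdir:$PATH"; type -P cp)   # fake dir first: fake cp wins
last=$(PATH="$PATH:$tmpdir"; type -P cp)    # fake dir last: real cp wins
echo "prepended -> $first"
echo "appended  -> $last"
rm -rf "$tmpdir"
```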

CDPATH

Very similar to PATH, this variable expands the scope of the cd command beyond the present directory. For instance, when you type the cd command as shown below:
# cd dbs
-bash: cd: dbs: No such file or directory
This makes sense, since the dbs directory does not exist in the present directory; it's under /opt/oracle/product/11gR2/db1, which is why the cd command fails. You could of course go to the directory /opt/oracle/product/11gR2/db1 first and then execute the cd command, which would succeed. If instead you want cd to also search /opt/oracle/product/11gR2/db1, you can issue:
# export CDPATH=/opt/oracle/product/11gR2/db1
Now if you issue the cd command from any directory:
# cd dbs 
/opt/oracle/product/11gR2/db1/dbs
# pwd
/opt/oracle/product/11gR2/db1/dbs
The cd command now also searches the directories listed in CDPATH for the subdirectory you name, and it prints the full path of the directory it changed to.
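The behavior can be sketched with throwaway directories (names invented for the demo) in place of a real ORACLE_HOME:

```shell
#!/bin/bash
base=$(mktemp -d)
mkdir -p "$base/db1/dbs"
export CDPATH="$base/db1"
cd dbs                 # resolved through CDPATH; cd prints the full path
dest=$(pwd)
echo "now in: $dest"
cd /                   # leave the demo tree before cleaning up
unset CDPATH
rm -rf "$base"
```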
There are several other variables, but these are the most widely used ones, and you should master them.

set

This command controls the behavior of the shell. It has many options and arguments but I will explain a few important ones.
A very common mistake when using overwriting commands such as cp and mv is to overwrite the wrong files inadvertently. You can reduce that risk by using an ''alias'' (shown in Part 1 of this series), e.g. aliasing mv to mv -i. But how can you prevent someone – or some script – from overwriting a file with the redirection operator (''>'')?
Let's see an example. Suppose you have a file called very_important.txt. Someone (or some script) inadvertently used something like:
# ls -l > very_important.txt
The file immediately gets overwritten and you lose its original contents. To prevent this, use the set command with the -o noclobber option, as shown below:
# set -o noclobber
After this command if someone tries to overwrite the file:
# ls -l > very_important.txt
-bash: very_important.txt: cannot overwrite existing file
The shell now prevents an existing file from being overwritten. What if you do want to overwrite it? Use the >| operator:
# ls -l >| very_important.txt
To turn it off:
# set +o noclobber
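The whole noclobber sequence can be sketched with a scratch file created just for the demo:

```shell
#!/bin/bash
f=$(mktemp)
echo "important data" > "$f"
set -o noclobber
( echo "oops" > "$f" ) 2>/dev/null || true   # blocked: cannot overwrite
kept=$(cat "$f")                             # original contents survive
echo "after >  : $kept"
echo "oops" >| "$f"                          # >| deliberately overrides it
replaced=$(cat "$f")
echo "after >| : $replaced"
set +o noclobber
rm -f "$f"
```

The failed redirection is run in a subshell here only so the demo script keeps going after the shell rejects it.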
Another very useful form of the set command lets you use the vi editor to recall and edit previous commands. Later in this installment you will learn how to review the commands you have entered and re-execute them. One quick way to re-execute a command is to recall it with vi-style keystrokes. To enable this, execute the following first:
# set -o vi
Now suppose you are looking for a command that contains the letter ''v'' (such as vi, vim, etc.). To search for it, use these keystrokes (the keys to press are shown within square brackets):
# [Escape Key][/ key][v key][ENTER Key]
This brings up the most recent command containing ''v''. The last such command in this case was set -o vi, so that comes up at the command prompt:
# set -o vi
If that's not the command you were looking for, press the ''n'' key to move to the next most recent match. This way you can cycle through all the executed commands containing the letter ''v''. When you see the command you want, press [ENTER key] to execute it. The search can be as explicit as you like. Suppose you are looking for an mpstat command executed earlier; all you have to do is enter the search string ''mpstat'':
# [Escape Key][/ key]mpstat[ENTER Key]
Suppose the above search brings up mpstat 5 5 but you really want to execute mpstat 10 10. Instead of retyping, you can edit the command in vi. To do so, press [Escape Key] and then the [v] key, which brings the command up in the vi editor. Edit the command as you want; when you save it in vi with :wq, the modified command is executed.

type

In Part 4 you learned about the kill command, which is special: it's both a utility (an executable in some directory) and a shell built-in. You also learned about aliases in a prior installment. And some words used in shell scripts, such as ''do'', ''done'' and ''while'', are not really commands at all; they are called shell keywords.
How do you know what type of command it is? The type command shows that. Here is how we have used it to show the types of the commands mv, do, fc and oh.
# type mv do fc oh
mv is /bin/mv
do is a shell keyword
fc is a shell builtin
oh is aliased to `cd $ORACLE_HOME'
It shows clearly that mv is a utility (along with its location), do is a keyword used inside scripts, fc is a built-in, and oh is an alias (along with what it is aliased to).
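This is handy in scripts: bash's type -t prints a single word (file, builtin, keyword, alias or function) that is easy to branch on. A small sketch (the command names are examples):

```shell
# Classify commands by type using bash's `type -t`
for cmd in mv "do" fc; do
    case "$(type -t "$cmd")" in
        file)    echo "$cmd is an external utility" ;;
        builtin) echo "$cmd is a shell builtin" ;;
        keyword) echo "$cmd is a shell keyword" ;;
        *)       echo "$cmd is something else (or not found)" ;;
    esac
done
```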

history

When you log in to a Linux system you typically execute many commands at the prompt. How do you know what you have executed? You may want to know for many reasons: to re-execute a command without retyping it, to make sure you ran the right command (e.g. removed the right file), to verify what commands were issued, and so on. The history command gives you the history of commands executed.
# history 
 1064  cd dbs
 1065  export CDPATH=/opt/oracle/product/11gR2/db1
 1066  cd dbs
 1067  pwd
 1068  env
 1069  env | grep HIST
 … and so on …
Note the numbers before each command. This is the event or command number. You will learn how to use this feature later in this section. If you want to display only a few lines of history instead of all available, say the most recent five commands:
# history 5
The biggest benefit of the history command comes from the ability to re-execute a command without retyping it. To do so, enter an exclamation mark (!) followed by the event or command number shown before the command in the history output. To re-execute the command cd dbs shown at number 1066, you would issue:
# !1066
cd dbs
/opt/oracle/product/11gR2/db1/dbs
The command !! (two exclamation marks) re-executes the last command. You can also pass a string after the !, which re-executes the most recent command starting with that string. The following re-executes the most recent command starting with cd:
# !cd
cd dbs
/opt/oracle/product/11gR2/db1/dbs
What if you want to re-execute a command containing a string, not starting with it? The ? modifier does pattern matching within commands. To search for a command that has network in it, issue:
# !?network?
cd network
/opt/oracle/product/11gR2/db1/network
You can modify the command being re-executed as well. For instance, suppose you had earlier given the command cd /opt/oracle/product/11gR2/db1/network and want to re-execute it with /admin appended at the end; you would issue:
# !1091/admin
cd network/admin
/opt/oracle/product/11gR2/db1/network/admin
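As an aside, a few bash variables control how much history is kept and how it is displayed. These are common settings you might place in ~/.bash_profile (the values shown are arbitrary examples):

```shell
# Common bash history settings (example values)
export HISTSIZE=5000               # commands kept in the shell's memory
export HISTFILESIZE=10000          # commands kept in ~/.bash_history on disk
export HISTTIMEFORMAT='%F %T '     # prefix each `history` entry with a timestamp
export HISTCONTROL=ignoredups      # skip consecutive duplicate commands
```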

fc

Like history, the fc shell built-in shows the command history. The most common option is -l (the letter ''L'', not the number ''1''), which shows the 16 most recent commands:
# fc -l
1055     echo $pfiles
1056     export IFS=
... and so on ...
1064     cd dbs
1065     export CDPATH=/opt/oracle/product/11gR2/db1
1066     cd dbs
You can also ask fc to show only a few commands by giving a range of event numbers, e.g. 1060 and 1064:
# fc -l 1060 1064
1060     pwd
1061     echo CDPATH
1062     echo $CDPATH
1063     cd
1064     cd dbs
The -l option also takes two string parameters for pattern matching. Here is an example that displays the history from the most recent command starting with echo through the most recent command starting with pwd:
# fc -l echo pwd
1062     echo $CDPATH
1063     cd
1064     cd dbs
1065     export CDPATH=/opt/oracle/product/11gR2/db1
1066     cd dbs
1067     pwd
If you want to re-execute the command cd dbs (command number 1066), you can simply enter that number after fc with the -s option:
# fc -s 1066
cd dbs
/opt/oracle/product/11gR2/db1/dbs
Another powerful use of fc is command substitution. Suppose you want to execute a command similar to 1066 (cd dbs), but issuing cd network instead of cd dbs; you can use the substitution argument as shown below:
# fc -s dbs=network 1066
cd network
/opt/oracle/product/11gR2/db1/network
If you omit the -s option, as shown below:
# fc 1066
It opens the command cd dbs in a vi session; you can edit it there, and the modified command executes when you save and quit.

cpio

Consider this: you want to send a set of files to someone or somewhere and don't want to risk part of the set getting lost. If you put all the files into a single file and send that one file, you can rest assured that the complete set arrives safely.
The cpio command has three main options:
  • -o (create) to create an archive
  • -i (extract) to extract files from an archive
  • -p (pass through) to copy files to a different directory
Each option has its own set of sub-options. For instance, the -c option applies to -i and -o but not to -p. So let's look at the major option groups and how they are used.
The -v option displays verbose output, which is helpful when you want definite feedback on what's going on.
First, let's see how to create an archive from a set of files. Here we take all files with the extension ''trc'' in a directory and put them in a file called myfiles.cpio:
$ ls *.trc | cpio -ocv > myfiles.cpio
+asm_ora_14651.trc
odba112_ora_13591.trc
odba112_ora_14111.trc
odba112_ora_14729.trc
odba112_ora_15422.trc
9 blocks
The -v option was for verbose output so cpio showed us each file as it was added to the archive. The -o option was used since we wanted to create an archive. The -c option was used to tell cpio to write the header information in ASCII, which makes it easier to move across platforms.
Another option is -O, which accepts the name of the output archive file as a parameter:
# ls *.trc | cpio -ocv -O mynewfiles.cpio
To extract the files:
$ cpio -icv < myfiles.cpio
+asm_ora_14651.trc
cpio: odba112_ora_13591.trc not created: newer or same age version exists
odba112_ora_13591.trc
Here the -v and -i options are used for verbose output and for extracting files from the archive. The -c option instructs cpio to read the header information as ASCII. When cpio extracts a file that is already present (as was the case for odba112_ora_13591.trc), it does not overwrite the file; it simply skips it with a message. To force overwriting, use the -u option:
# cpio -icvu < myfiles.cpio
To display the contents without actually extracting, use the -t option along with -i (extract):
# cpio -it < myfiles.cpio
+asm_ora_14651.trc
odba112_ora_13591.trc
odba112_ora_14111.trc
What if you are extracting a file that already exists? You may still want to extract it, but under a different name. For example, you may be restoring a file called alert.log (the log file of an Oracle instance) without overwriting the current alert.log.
A very useful option here is -r, which lets you rename files as they are extracted, interactively:
# cpio -ir < myfiles.cpio
rename +asm_ora_14651.trc -> a.trc
rename odba112_ora_13591.trc -> b.trc
rename odba112_ora_14111.trc -> [ENTER], which leaves the name unchanged
If you created a cpio archive of a directory and want to recreate the same directory structure on extraction, use the -d option while extracting.
While creating, you can append files to an existing archive with the -A option as shown below:
# ls *.trc | cpio -ocvA -O mynewfiles.cpio
The command has many other options, but these are all you need to use it effectively.
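Tying the options together, here is a sketch of a complete archive-and-restore round trip (the directory and file names are examples). Feeding cpio relative paths from find lets the -d option recreate the directory tree on extraction:

```shell
# Archive a directory tree with cpio, then restore it elsewhere
mkdir -p src/logs
echo "trace data" > src/logs/a.trc
( cd src && find . -name '*.trc' | cpio -ocv > ../tree.cpio )   # create
mkdir -p restore
( cd restore && cpio -icvd < ../tree.cpio )   # -d recreates ./logs
cat restore/logs/a.trc                        # prints: trace data
```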

tar

Another mechanism for creating an archive is tar. Originally created for archiving to tape drives (hence the name: Tape Archiver), tar is very popular for its simplicity. It takes three primary options:
  • -c to create an archive
  • -x to extract files from an archive
  • -t to display the files in an archive
Here is how you create a tar archive. The -f option lets you name the output file tar will create. In this example we create an archive called myfiles.tar from all the files with the extension ''trc'':
# tar -cf myfiles.tar *.trc
Once created, you can list the contents of the archive with the -t option:
# tar tf myfiles.tar
+asm_ora_14651.trc
odba112_ora_13591.trc
odba112_ora_14111.trc
odba112_ora_14729.trc
odba112_ora_15422.trc
To show the details of the files, use the -v (verbose) option:
# tar tvf myfiles.tar
-rw-r----- oracle/dba     1150 2008-12-30 22:06:39 +asm_ora_14651.trc
-rw-r----- oracle/dba      654 2008-12-26 15:17:22 odba112_ora_13591.trc
-rw-r----- oracle/dba      654 2008-12-26 15:19:29 odba112_ora_14111.trc
-rw-r----- oracle/dba      654 2008-12-26 15:21:36 odba112_ora_14729.trc
-rw-r----- oracle/dba      654 2008-12-26 15:24:32 odba112_ora_15422.trc
To extract files from the archive, use the -x option. Here is an example (the -v option shows verbose output):
# tar xvf myfiles.tar
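Here is the whole tar workflow as a runnable sketch (the file names are examples); the -C option, which extracts into a different directory, is a useful addition to the three primary options:

```shell
# Create, list, and extract a tar archive
echo "one" > a.trc
echo "two" > b.trc
tar -cf myfiles.tar a.trc b.trc        # -c: create the archive
tar -tf myfiles.tar                    # -t: list its contents
mkdir -p extracted
tar -xf myfiles.tar -C extracted       # -x: extract, -C: into this directory
cat extracted/b.trc                    # prints: two
```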

zip

Compression is a very important part of Linux administration. You may need to compress a lot of files to make room for new ones, to send them via email, and so on.
Linux offers many compression commands; here we'll examine the most common ones: zip and gzip.
The zip command produces a single file by consolidating other files and compressing them into one zip (or compressed) file. Here is a simple example usage of the command:
# zip myzip *.aud
It produces a file called myzip.zip with all the files in the directory named with a .aud extension.
zip accepts several options. The most common is -9, which instructs zip to compress as much as possible at the cost of CPU cycles (and therefore time). The -1 option does just the opposite: it compresses faster but less.
You can also protect the zip file by encrypting it with a password; without the correct password the file cannot be unzipped. The password is provided at run time with the -e (encrypt) option:
# zip -e ze *.aud
Enter password: 
Verify password: 
  adding: odba112_ora_10025_1.aud (deflated 32%)
  adding: odba112_ora_10093_1.aud (deflated 31%)
... and so on ...
The -P option allows the password to be given on the command line. Since other users could then see the password in plaintext in the process list or the command history, it is not recommended; prefer the -e option.
# zip -P oracle zp *.aud 
updating: odba112_ora_10025_1.aud (deflated 32%)
updating: odba112_ora_10093_1.aud (deflated 31%)
updating: odba112_ora_10187_1.aud (deflated 31%)
… and so on ..
You can check the integrity of a zip file with the -T option. If the zip file is encrypted with a password, you must supply it:
# zip -T ze
[ze.zip] odba112_ora_10025_1.aud password: 
test of ze.zip OK
Of course, when you zip, you need to unzip later, and the command is, you guessed it, unzip. Here is a simple use of the unzip command:
# unzip myfiles.zip
If the zip file has been encrypted with a password, you will be asked for the password. When you enter it, it will not be repeated on the screen.
# unzip ze.zip
Archive:  ze.zip
[ze.zip] odba112_ora_10025_1.aud password: 
password incorrect--reenter: 
password incorrect--reenter: 
replace odba112_ora_10025_1.aud? [y]es, [n]o, [A]ll, [N]one, [r]ename: N
In the example above you entered the password incorrectly at first, so you were prompted again. After you entered it correctly, unzip found that a file called odba112_ora_10025_1.aud already exists and prompted you for an action. Note the choices: among them a rename option, to give the extracted file a different name.
Remember the zip file protected by a password passed on the command line with zip -P? You can unzip it by passing the password on the command line as well, using the same -P option:
# unzip -P mypass zp.zip
The -P option differs from the -p option. The -p option instructs unzip to extract files to standard output, which can then be redirected to another file or program.
Much of zip's attractiveness comes from its portability: you can zip a file on Linux and unzip it on OS X or Windows, since the unzip utility is available on many platforms.
Suppose you zipped files spread across several subdirectories of a directory. When you unzip the archive, the subdirectories are recreated as needed. If you instead want all the files extracted into the current directory, use the -j option:
# unzip -j myfiles.zip
One of the most useful combinations is using tar to consolidate files and compressing the resulting archive with zip. Instead of a two-step tar-then-zip process, you can pipe the output of tar into zip as shown below:
# tar cf - . | zip myfiles -
  adding: - (deflated 90%)
The special character ''-'' tells tar to write the archive to standard output and tells zip to read its input from standard input. The above command tars everything in the current directory and creates a zip file called myfiles.zip.
Similarly, when decompressing and extracting the files from the archive, you can eliminate the two-step process and do both in one shot:
# unzip -p myfiles | tar xf -
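As a runnable sketch of this pipe trick (the directory names are examples, and zip/unzip are assumed to be installed):

```shell
# tar-through-zip round trip: consolidate and compress in one step
mkdir -p work && echo "data" > work/f1.txt
( cd work && tar cf - . | zip ../bundle - )      # creates bundle.zip
mkdir -p out
( cd out && unzip -p ../bundle.zip | tar xf - )  # unzip to stdout, untar
cat out/f1.txt                                   # prints: data
```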

gzip

The gzip command (short for GNU zip) is another compression command, intended to replace the old UNIX compress utility.
The main practical difference between zip and gzip is that the former consolidates several files into one zip file, while the latter creates a separate compressed file for each input file. Here is an example:
# gzip odba112_ora_10025_1.aud
Note that it did not ask for an output file name. gzip takes each file (e.g. odba112_ora_10025_1.aud) and creates a compressed file named odba112_ora_10025_1.aud.gz. Also, note this point carefully: it removes the original file odba112_ora_10025_1.aud. If you pass several files as parameters to the command:
# gzip *
it creates a compressed file with the extension .gz for each file present in the directory. Suppose the directory initially contained these files:
a.txt
b.pdf
c.trc
After the gzip * command, the contents of the directory will be:
a.txt.gz
b.pdf.gz
c.trc.gz
The same command is also used to decompress (or uncompress, or unzip). The option is, quite intuitively, -d for decompress.
To check the contents of a gzipped file and how much it was compressed, use the -l option. It doesn't actually compress or decompress anything; it just shows the statistics:
# gzip -l *
         compressed        uncompressed  ratio uncompressed_name
                698                1150  42.5% +asm_ora_14651.trc
                464                 654  35.2% odba112_ora_13591.trc
                466                 654  34.9% odba112_ora_14111.trc
                466                 654  34.9% odba112_ora_14729.trc
                463                 654  35.3% odba112_ora_15422.trc
               2557                3766  33.2% (totals)
You can also compress all the files in a directory, using the recursive option (-r). To gzip every file under the log directory, use:
# gzip -r log
To check integrity of a gzip-ed file, use the -t option:
# gzip -t myfile.gz
When you want the compressed output to go to a file of your own naming rather than the default .gz file, use the -c option. It instructs gzip to write to standard output, which you can redirect to a file. The same technique lets you put more than one file into a single compressed file. Here we compress two files, odba112_ora_14111.trc and odba112_ora_15422.trc, into one compressed file named 1.gz:
# gzip -c  odba112_ora_14111.trc odba112_ora_15422.trc > 1.gz
Note when you display the contents of the compressed file:
# gzip -l 1.gz
         compressed        uncompressed  ratio uncompressed_name
                                    654 -35.9% 1
The compression ratio shown applies only to the last file in the list (which is why the original appears smaller than the compressed file). When you decompress this file, the two original files are written out one after the other, and both decompress properly.
The -f option forces the output to overwrite existing files. The -v option makes the output more verbose. Here is an example:
# gzip -v *.trc
+asm_ora_14651.trc:      42.5% -- replaced with +asm_ora_14651.trc.gz
odba112_ora_13591.trc:   35.2% -- replaced with odba112_ora_13591.trc.gz
odba112_ora_14111.trc:   34.9% -- replaced with odba112_ora_14111.trc.gz
odba112_ora_14729.trc:   34.9% -- replaced with odba112_ora_14729.trc.gz
odba112_ora_15422.trc:   35.3% -- replaced with odba112_ora_15422.trc.gz
A related command is zcat. To display the contents of a gzipped file without unzipping it first, use zcat:
# zcat 1.gz
The zcat command is equivalent to gzip -dc: it writes the decompressed contents to standard output without removing or modifying the compressed file on disk.
Like zip, gzip also accepts options for the degree of compression:
# gzip -1 myfile.txt … Least compression consuming least CPU and fastest
# gzip -9 myfile.txt … Most compression consuming most CPU and slowest
The command gunzip is also available; it is equivalent to gzip -d (decompress a gzipped file).
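A quick round trip ties these options together (the file name is an example):

```shell
# gzip round trip on a single file
echo "some log text" > sample.log
gzip sample.log            # produces sample.log.gz and removes the original
gzip -l sample.log.gz      # show the compression statistics
gunzip sample.log.gz       # restores sample.log (same as gzip -d)
cat sample.log             # prints: some log text
```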

Managing Linux in a Virtual Environment

Linux has been running in data centers all over the world for quite a while now. Traditionally, a server meant a physical machine distinct from other physical machines. That was true until the arrival of virtualization, where a single server can be carved up into several virtual servers, each appearing as an independent server on the network. Conversely, a ''pool'' made up of several physical servers can be carved up as needed.
Since there is no longer a one-to-one relationship between a physical server and a logical or virtual server, some concepts can get tricky. For instance, what is available memory: (1) that of the virtual server, (2) that of the individual physical server from which the virtual server was carved, or (3) the total of the pool of servers the virtual server belongs to? Linux commands may therefore behave a little differently in a virtual environment.
In addition, the virtual environment itself needs administration, so there are specialized commands for managing the virtualized infrastructure. In this section you will learn about those specialized commands and activities, using Oracle VM as the example.
One key component of an Oracle VM environment is the Oracle VM Agent, which must be up for Oracle VM to be fully operational. To check whether the agent is up, log on to the administration server (provm1, in this case) and use the service command:
[root@provm1 vnc-4.1.2]# service ovs-agent status
ok! process OVSMonitorServer exists.
ok! process OVSLogServer exists.
ok! process OVSAgentServer exists.
ok! process OVSPolicyServer exists.
ok! OVSAgentServer is alive.
The output clearly shows that all the key processes are up. If they are not, they may have been misconfigured and you may want to configure them (or configure them for the first time):
# service ovs-agent configure
The same service command is also used to start, restart and stop the agent processes:
service ovs-agent start
service ovs-agent restart
service ovs-agent stop 
The best way to manage the environment, however, is through the Web-based GUI console. The Manager page is available on the admin server, at port 8888 by default. You can bring it up by entering the following in any Web browser (assuming the admin server name is oradba2):
http://oradba2:8888/OVS
Log in as admin with the password you created during installation. It brings up the screen shown below:


The bottom of the screen shows the physical servers in the server pool. Here the pool is called provmpool1 and the physical server IP is 10.14.106.0. From this screen you can reboot the server, power it off, take it out of the pool, and edit its details. You can also add a new physical server to the pool by clicking the Add Server button.
Clicking on the IP address of the server brings up the details of that physical server, as shown below:

Perhaps the most useful is the Monitor tab. Clicking it shows the utilization of resources on the server: CPU, disk and memory, as shown below. From this page you can visually check whether resources are under- or over-utilized, whether you need to add more physical servers, and so on.

Going back to the main page, the Server Pools tab shows the various server pools defined. Here you can define another pool, stop, reinstate the pool and so on:

If you want to add a user or another administrator, click the Administration tab. There is a default administrator called ''admin''. Here you can see all the admins and set their properties such as email address, name, etc.:

Perhaps the most frequent activity you will perform is the management of individual virtual machines. Almost all the functions are located on the Virtual Machines tab on the main home page. It shows the VMs you have created so far. Here is a partial screenshot showing two machines called provmlin1 and provmlin2:

The VM named provmlin2 shows as ''Powered Off'', i.e. it appears down to end users. The other one, provmlin1, has an error of some kind. First, let's start provmlin2: select the radio button next to it and click the Power On button. After some time it will show as ''Running'', shown below:

If you click on the VM name, you will be able to see the details of the VM, as shown below:

From the above screen we learn that the VM has been allocated 512MB of RAM, runs Oracle Enterprise Linux 5, has only one core, and so on. One key piece of information on the page is the VNC port: 5900. Using it, you can bring up a VNC terminal for this virtual machine. Here I have used a VNC viewer, with hostname provm1 and port 5900:

This brings up the VNC session on the server. Now you can start a terminal session:

Since VNC port 5900 pointed to the virtual machine called provmlin4, the terminal of that VM came up. Now you can issue your regular Linux commands in this terminal.

xm

On the server running the virtual machines, performance-measurement commands like uptime (described in Installment 3) and top (described in Installment 2) have different meanings than on their physical counterparts. On a physical server, uptime refers to the amount of time the server has been up; in a virtual world it is ambiguous, since it could refer to any of the individual virtual servers on that machine. To measure performance of the physical server pool, you use a different command: xm. Subcommands are issued through this main command. For instance, to list the virtual servers, use xm list:
[root@provm1 ~]# xm list
Name                                        ID   Mem VCPUs      State   Time(s)
22_provmlin4                                1   512     1         -b----   27.8
Domain-0                                    0   532     2         r-----   4631.9
To measure uptime, you would use xm uptime:
[root@provm1 ~]# xm uptime
Name                                ID Uptime 
22_provmlin4                        1  0:02:05
Domain-0                            0  8:34:07
The other xm subcommands are shown below. Many of them can also be executed via the GUI.
console             Attach to <Domain>'s console.                     
 create             Create a domain based on <ConfigFile>.            
 new                Adds a domain to Xend domain management           
 delete             Remove a domain from Xend domain management.      
 destroy            Terminate a domain immediately.                   
 dump-core          Dump core for a specific domain.                  
 help               Display this message.                             
 list               List information about all/some domains.          
 mem-set            Set the current memory usage for a domain.        
 migrate            Migrate a domain to another machine.              
 pause              Pause execution of a domain.                      
 reboot             Reboot a domain.                                  
 restore            Restore a domain from a saved state.              
 resume             Resume a Xend managed domain.                      
 save               Save a domain state to restore later.             
 shell              Launch an interactive shell.                      
 shutdown           Shutdown a domain.                               
 start              Start a Xend managed domain.                      
 suspend            Suspend a Xend managed domain.                    
 top                Monitor a host and the domains in real time.      
 unpause            Unpause a paused domain.                          
 uptime             Print uptime for a domain.                        
 vcpu-set           Set the number of active VCPUs allowed for the domain.
Let's look at some frequently used ones. Besides uptime, you may be interested in system performance via top. The xm top command acts much like the regular top command: it refreshes automatically and has keys that bring up different types of measurements such as CPU, I/O and network. Here is the output of the basic xm top command:
                               
xentop - 02:16:58   Xen 3.1.4
2 domains: 1 running, 1 blocked, 0 paused, 0 crashed, 0 dying, 0 shutdown
Mem: 1562776k total, 1107616k used, 455160k free    CPUs: 2 @ 2992MHz
NAME  STATE   CPU(sec) CPU(%)     MEM(k) MEM(%)  MAXMEM(k) MAXMEM(%) VCPUS NETS NETTX(k) NETRX(k) VBDS   VBD_OO   VBD_RD   VBD_WR SSID
22_provmlin4 --b---       27      0.1    524288  33.5    1048576     67.1     1    1        9      154    1        06598     1207    0
 Domain-0 -----r        4647      23.4   544768  34.9   no limit      n/a     2    8    68656  2902548    0        0         0       0
                            
It shows stats such as the percentage of CPU and memory used by each virtual machine. If you press N, you will see network activity, as shown below:
                               
xentop - 02:17:18   Xen 3.1.4
2 domains: 1 running, 1 blocked, 0 paused, 0 crashed, 0 dying, 0 shutdown
Mem: 1562776k total, 1107616k used, 455160k free    CPUs: 2 @ 2992MHz
Net0 RX:   180692bytes     2380pkts        0err      587drop  TX:     9414bytes       63pkts        0err        0drop
  Domain-0 -----r       4650   22.5     544768   34.9   no limit       n/a     2    8    68665  2902570    0        0        0        0    0
Net0 RX: 2972232400bytes  2449735pkts        0err        0drop  TX: 70313906bytes  1017641pkts        0err        0drop
Net1 RX:        0bytes        0pkts        0err        0drop  TX:        0bytes        0pkts        0err        0drop
Net2 RX:        0bytes        0pkts        0err        0drop  TX:        0bytes        0pkts        0err        0drop
Net3 RX:        0bytes        0pkts        0err        0drop  TX:        0bytes        0pkts        0err        0drop
Net4 RX:        0bytes        0pkts        0err        0drop  TX:        0bytes        0pkts        0err        0drop
Net5 RX:        0bytes        0pkts        0err        0drop  TX:        0bytes        0pkts        0err        0drop
Net6 RX:        0bytes        0pkts        0err        0drop  TX:        0bytes        0pkts        0err        0drop
Net7 RX:        0bytes        0pkts        0err        0drop  TX:        0bytes        0pkts        0err        0drop
                            
Pressing V brings up VCPU (Virtual CPU) stats.
                               
xentop - 02:19:02   Xen 3.1.4
2 domains: 1 running, 1 blocked, 0 paused, 0 crashed, 0 dying, 0 shutdown
Mem: 1562776k total, 1107616k used, 455160k free    CPUs: 2 @ 2992MHz
NAME  STATE   CPU(sec) CPU(%)     MEM(k) MEM(%)  MAXMEM(k) MAXMEM(%) VCPUS NETS NETTX(k) NETRX(k) VBDS   VBD_OO   VBD_RD   VBD_WR SSID
22_provmlin4 --b---      28      0.1     524288   33.5    1048576    67.1     1    1        9     282    1       06598     1220    0
VCPUs(sec):   0:         28s
Domain-0 -----r          4667    1.6     544768   34.9   no limit     n/a     2    8    68791  2902688   0       00        0       0
VCPUs(sec):   0:       2753s  1:       1913s
                            
Let's go through a fairly common activity: distributing the available memory among the VMs. Suppose you want to give each VM 256 MB of RAM; use the xm mem-set command as shown below, then confirm with xm list.
                               


[root@provm1 ~]# xm mem-set 1 256
[root@provm1 ~]# xm mem-set 0 256
[root@provm1 ~]# xm list
Name                                        ID   Mem VCPUs      State   Time(s)
22_provmlin4                                 1   256     1     -b----     33.0
Domain-0                                     0   256     2     r-----   4984.4


Conclusion

This brings to a close the five-installment series on advanced Linux commands. As I mentioned at the beginning of the series, Linux has thousands of useful commands, and new ones are developed and added regularly. Knowing every available command is not as important as knowing what works best for you.
In this series I presented and explained the few commands needed to perform most of your daily tasks. If you practice these commands, along with their options and arguments, you will be able to handle any Linux infrastructure with ease.
Thanks for reading and best of luck.

Arup Nanda ( arup@proligence.com) has been exclusively an Oracle DBA for more than 12 years with experiences spanning all areas of Oracle Database technology, and was named "DBA of the Year" by Oracle Magazine in 2003. Arup is a frequent speaker and writer in Oracle-related events and journals and an Oracle ACE Director. He co-authored four books, including RMAN Recipes for Oracle Database 11g: A Problem Solution Approach .
