doc/man-pages/pod8/fragments/fileserver-troubleshooting.pod

Sending process signals to the File Server Process can change its
behavior in the following ways:

  Process          Signal       OS     Result
  ---------------------------------------------------------------------

  File Server      XCPU        Unix    Prints a list of client IP
                                       Addresses.

  File Server      USR2      Windows   Prints a list of client IP
                                       Addresses.

  File Server      POLL        HPUX    Prints a list of client IP
                                       Addresses.

  Any server       TSTP        Any     Increases Debug level by a power
                                       of 5 -- 1,5,25,125, etc.
                                       This has the same effect as the
                                       -d XXX command-line option.

  Any Server       HUP         Any     Resets Debug level to 0

  File Server      TERM        Any     Run minor instrumentation over
                                       the list of descriptors.

  Other Servers    TERM        Any     Causes the process to quit.

  File Server      QUIT        Any     Causes the File Server to Quit.
                                       Bos Server knows this.

The basic metric of whether an AFS file server is doing well is the number
of connections waiting for a thread,
which can be found by running the following command:

   % rxdebug <server> | grep waiting_for | wc -l

Each line returned by C<rxdebug> that contains the text "waiting_for"
represents a connection that's waiting for a file server thread.

If the blocked connection count is ever above 0, the server is having
problems replying to clients in a timely fashion.  If it gets above
roughly 10, users will notice slowness.  The total number of
connections is a mostly irrelevant number that rises essentially
monotonically for as long as the server has been running and then drops
back to zero when the server is restarted.
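
The check and thresholds above can be wrapped in a pair of small shell
functions.  This is only a sketch: the function names C<count_blocked> and
C<classify_blocked> and the labels "ok", "delayed", and "slow" are made up
for illustration, and the cutoff of 10 is the rough figure from the text,
not a hard limit.

```shell
#!/bin/sh
# Count the lines on standard input that contain "waiting_for".
# grep -c prints 0 but exits non-zero when nothing matches, so the
# exit status is forced to success.
count_blocked() {
    grep -c 'waiting_for' || true
}

# Map a blocked-connection count onto the rough thresholds from the text:
# 0 is healthy, 1 to 10 means replies are delayed, above 10 users notice.
# The labels are hypothetical, chosen for this sketch.
classify_blocked() {
    count=$1
    if [ "$count" -eq 0 ]; then
        echo "ok"
    elif [ "$count" -le 10 ]; then
        echo "delayed"
    else
        echo "slow"
    fi
}

# Example use against a live server (not run here):
#   classify_blocked "$(rxdebug <server> | count_blocked)"
```

Keeping the counting and the classification separate makes it possible to
log the raw count over time while still alerting on the label.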

The most common cause of blocked connections rising on a server is some
process somewhere performing an abnormal number of accesses to that server
and its volumes.  If multiple servers have a nonzero blocked connection
count, the most likely explanation is that there is a volume replicated
between those servers that is absorbing an abnormally high access rate.

To get an access count on all the volumes on a server, run:

   % vos listvol <server> -long

and save the output in a file.  The results will look like B<vos examine>
output repeated for each volume on the server.  Find lines like:

   40065 accesses in the past day (i.e., vnode references)

and look for volumes with an abnormally high number of accesses.  Anything
over 10,000 is fairly high, but some volumes like root.cell and other
volumes close to the root of the cell will have that many hits routinely.
Anything over 100,000 is generally abnormally high.  The count resets about
once a day.
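
Scanning the saved output by eye gets tedious on a large server, so the
search can be sketched with B<awk>.  The function name C<busy_volumes> is
hypothetical, and the script assumes each volume entry starts with a header
line of the form "name ID type size K status", like the B<vos examine>
output shown elsewhere in this document, followed later by its "accesses in
the past day" line.  Adjust the patterns if your output differs.

```shell
#!/bin/sh
# busy_volumes FILE [THRESHOLD]
# Print "accesses name id" for every volume in the saved listing whose
# access count exceeds THRESHOLD (default 10000), busiest first.
busy_volumes() {
    awk -v limit="${2:-10000}" '
        # Assumed header line: second field is the numeric volume ID and
        # third field is the volume type.
        $2 ~ /^[0-9]+$/ && $3 ~ /^(RW|RO|BK)$/ { name = $1; id = $2 }

        # Access-count line: the first field is the count.
        /accesses in the past day/ && $1 + 0 > limit { print $1, name, id }
    ' "$1" | sort -rn
}

# Example use (not run here):
#   vos listvol <server> -long > listvol.out
#   busy_volumes listvol.out 100000
```

The default threshold of 10,000 matches the "fairly high" figure above; pass
100000 as the second argument to list only the clearly abnormal volumes.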

Another approach that can narrow the possibilities for a replicated
volume, when multiple servers are having trouble, is to find all
replicated volumes for that server.  To refresh the VLDB cache, run:

   % vos listvldb -server <server>

where <server> is one of the servers having problems, and then run:

   % vos listvldb -server <server> -part <partition>

to get a list of all volumes on that server and partition, including every
other server with replicas.

Once the volume causing the problem has been identified, the best way to
deal with the problem is to move that volume to another server with a low
load or to stop any runaway programs that are accessing that volume
unnecessarily.  Often, knowing which volume is affected will be enough
information to tell what's going on.

If you still need additional information about who's hitting that server,
sometimes you can guess at that information from the failed callbacks in the
F<FileLog> log in F</var/log/afs> on the server, or from the output of:

   % /usr/afsws/etc/rxdebug <server> -rxstats

but the best way is to turn on debugging output from the file server.
(Warning: This generates a lot of output into F<FileLog> on the AFS server.)
To do this, log on to the AFS server, find the PID of the fileserver
process, and do:

    kill -TSTP <pid>

where <pid> is the PID of the file server process.  This will raise the
debugging level so that you'll start seeing what people are actually doing
on the server.  You can do this up to three more times to get even more
output if needed.  To reset the debugging level back to normal, use the
following command (it will NOT terminate the file server):

    kill -HUP <pid>

The debugging setting on the File Server should be reset back to normal when
debugging is no longer needed.  Otherwise, the AFS server may well fill its
disks with debugging output.

The lines of the debugging output that are most useful for debugging load
problems are:

    SAFS_FetchStatus,  Fid = 2003828163.77154.82248, Host 171.64.15.76
    SRXAFS_FetchData, Fid = 2003828163.77154.82248

(The example above is partly truncated to highlight the interesting
information.)  The Fid identifies the volume and the vnode within the
volume; the volume ID is the first long number.  So, for example, this was:

   % vos examine 2003828163
   pubsw.matlab61                   2003828163 RW    1040060 K  On-line
       afssvr5.Stanford.EDU /vicepa
       RWrite 2003828163 ROnly 2003828164 Backup 2003828165
       MaxQuota    3000000 K
       Creation    Mon Aug  6 16:40:55 2001
       Last Update Tue Jul 30 19:00:25 2002
       86181 accesses in the past day (i.e., vnode references)

       RWrite: 2003828163    ROnly: 2003828164    Backup: 2003828165
       number of sites -> 3
          server afssvr5.Stanford.EDU partition /vicepa RW Site
          server afssvr11.Stanford.EDU partition /vicepd RO Site
          server afssvr5.Stanford.EDU partition /vicepa RO Site

and from the Host information one can tell what system is accessing that
volume.
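
When the log is large, tallying requests per volume and host shows the
heaviest client quickly.  The following is a rough sketch, not a supported
tool: the function name C<hosts_by_volume> is hypothetical, and it assumes
log lines carry "Fid = <volume>.<vnode>.<unique>" and "Host <IPv4 address>"
as in the truncated sample above.  Lines without a Host field are skipped.

```shell
#!/bin/sh
# hosts_by_volume [FILE...]
# Read FileLog debugging output (from the named files or standard input)
# and print "requests volume-id host", most active pair first.
hosts_by_volume() {
    awk '
        match($0, /Fid = [0-9]+/) {
            # Skip the 6 characters of "Fid = " to keep only the volume ID.
            vol = substr($0, RSTART + 6, RLENGTH - 6)
            if (match($0, /Host [0-9.]+/)) {
                # Skip the 5 characters of "Host " to keep the address.
                host = substr($0, RSTART + 5, RLENGTH - 5)
                count[vol " " host]++
            }
        }
        END { for (key in count) print count[key], key }
    ' "$@" | sort -rn
}

# Example use (not run here):
#   hosts_by_volume /var/log/afs/FileLog | head
```

The volume ID in the second column can then be passed to B<vos examine> to
find the volume name, as shown above.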

Note that the output of L<vos_examine(1)> also includes the access count, so
once the problem has been identified, B<vos examine> can be used to see if
the access count is still increasing.  Also remember that you can run
B<vos examine> on the read-only replica (e.g., pubsw.matlab61.readonly) to
see the access counts on the read-only replica on all of the servers that
it's located on.
 145 access counts on the read-only replica on all of the servers that it's
 146 located on.