>> High but not full CPU utilization (70~90%)
>> CPU Time spent in user application (85+% in user)
>> Low Disk utilization (5~15% for each disk)
>> Low Network utilization (10~30% per network, less than 5% collisions)
2. General tuning model
Agree on the level of performance to reach -> Gather data using monitoring tools (vmstat, iostat, netstat, ...) -> Analyze the data -> Work on the biggest bottleneck first
Note: System performance depends on how the system's resources are used.
3. General order of tuning
1st. Application Tuning
2nd. DataBase Tuning
3rd. OS Tuning(System Tuning)
4. Some available monitoring tools
- vmstat : command to view status of memory and CPU
- ps : command to find which processes are resource hogs
- swap : command for available swap space
- iostat : command for terminal, disk, cpu utilization
- netstat, nfsstat : command for network performance
- sar : command to view system activity (need SUNWaccr, SUNWaccu packages)
- mpstat : command to view status of multi-cpu
- truss : command to trace the system calls a process makes
- cachefs : mechanism to speed up read-mostly NFS
- PrestoServe : application for synchronous writes (many small files, Mail server)
- DiskPak of Eagle company : application for fragmented disks
5. Hints: tuning a database management system
- Configure disk for speed, Not capacity
- I/O load needs many random access disks
- 3*1.05GB is over twice as fast as 1*2.9GB
- Use raw disk for tablespaces to reduce CPU load
- Save inode and indirect block updates
- Use dd | compress into filesystem for snap backup then ufsdump normally
- Use UFS for tablespaces to reduce I/O load
- Extra level of caching needs more RAM
- With UFS, use PrestoServe/NVSIMM or Logging option
- Use large shared memory area (up to 25% of RAM)
- If the uo value is above 25%, shared memory must be expanded
6. Factors that determine system performance
- CPU : number of CPUs
- I/O Devices : disk, printer, terminal, transfer information
- Memory : primary memory(RAM), secondary memory(on disk)
- Kernel : kernel parameters (/etc/system)
- Network
Note: When tuning, always consider the system and the network together.
7. Variables to watch carefully
Item | Variables
---|---
Source Code | Algorithm, Language, Programming Model, Compiler
Executable | Environment, Filesystem Type
DataBase | Buffer Sizes, Indexing
Kernel | Buffer Sizes, Paging, Tuning, Configuring
Memory | Cache Type, Line Size and Miss Cost
Disk | Driver Algorithms, Disk Type, Load Balance
Windows & Graphics | Window System, Graphics Library, Bus Throughput
CPU | Processor Implementation
Multiprocessors | Load Balancing, Concurrency, Bus Throughput
Network | Protocol, Hardware, Usage Pattern
8. The first ten steps of system tuning
1st. The system will usually have a disk bottleneck.
2nd. You will be told that the system is NOT I/O bound.
3rd. After first pass tuning the system will still have a disk bottleneck.
4th. Poor NFS response times are hard to pin down.
5th. Avoid the common memory usage misconceptions.
6th. Don't panic when you see page-ins and page-outs in vmstat.
7th. Look for page scanner activity.
8th. Look for a long run queue (vmstat procs r).
9th. Look for processes blocked waiting for I/O (vmstat procs b).
10th. Look for CPU system time dominating user time.
9. A first approach to tuning
1st. Clear up any RAM shortage.
=> If at first the monitoring indicates paging, Add more RAM.
2nd. Check the processor speed and the number of processes.
=> Clear out unnecessary processes.
=> Make sure that the run queue is as small as possible.
3rd. Focus I/O subsystems (disk, networks)
4th. Use "iostat -x" to monitor the disk.
=> Check busy(%b) and service time(svc_t)
5th. Continue to cycle around the tuning path until all subsystems, and indeed the machine itself, are fast enough to reach the required performance metric.
Analysis using tool & How to read them
ps command
How to use ps
#> ps
TIME : the total amount of CPU time used by the process since it began
#> ps -efl
SZ : shows the amount of virtual memory required by the process
Example of ps
#> ps
PID TTY TIME COMD
346 pts/2 0:01 ksh
1029 pts/2 0:00 ps
1199 pts/2 0:01 ksh
#> ps -efl
UID PID PPID C PRI NI ADDR SZ WCHAN STIME TTY TIME COMD
root 0 0 80 0 SY f01706f0 0 19:47:44 ? 0:01 sched
root 1 0 80 99 20 fc18f800 173 fc18f9c8 19:47:48 ? 0:36 /etc/init -
vmstat command
How to use vmstat
#> vmstat 5
procs
r b w
r : In the run queue, waiting for processing time
b : Blocked, waiting for resources
cpu
cs us sy id
us : percentage of CPU time spent in USER mode
sy : percentage of CPU time spent in SYSTEM mode
id : percentage of CPU time spent idle
How to read vmstat data
▶ If CPU spends most of its time in USER mode, one or more processes may be monopolizing the CPU.
=> Check "ps -ef"
▶ A low value "id" indicates a MEMORY starved or I/O bound system.
=> RECOMMENDATION : idle time "id" should be greater than 20% most of the time
=> As a GENERAL guide
id < 15% : processes have to wait before being put into execution
us > 70% : the application load may NEED some BALANCING
sy = 30% : is a good water mark
▶ Compute Intensive program.
=> One Compute Intensive program can push utilization rate to 100%.
▶ If the CPU is mainly in system mode, then it is probably I/O bound.
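The rules of thumb above can be sketched as a small check on one vmstat sample. This is only an illustration: the function name is hypothetical, and treating the 30% "sy" watermark as an upper limit is an assumption about how the guideline is meant to be read.

```python
def read_cpu_columns(us, sy, idle):
    """Return hints for one vmstat sample, using the guideline thresholds."""
    hints = []
    if idle < 15:
        hints.append("idle below 15%: processes wait before running")
    elif idle < 20:
        hints.append("idle below the recommended 20%")
    if us > 70:
        hints.append("user time above 70%: application load may need balancing")
    if sy > 30:
        # Assumption: the 30% watermark is read as an upper limit.
        hints.append("system time above 30%: possibly I/O bound")
    return hints


print(read_cpu_columns(us=1, sy=1, idle=98))    # a quiet system yields no hints
print(read_cpu_columns(us=75, sy=15, idle=10))  # busy in user mode
```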
Example of vmstat
#> vmstat
procs memory page disk faults cpu
r b w swap free re mf pi po fr de sr f0 s0 s1 s2 in sy cs us sy id
0 0 0 72836 7688 0 1 5 1 4 0 2 0 0 0 1 16 37 30 1 1 98
Note: If the sr value stays high, suspect a memory shortage.
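When reading vmstat output in a script, it helps to address columns by name rather than by counting fields. A minimal sketch, assuming the header layout of the example above; the function name and the renamed keys "syscalls" and "cpu_sy" are hypothetical and only exist because "sy" appears twice in the header.

```python
VMSTAT_HEADER = "r b w swap free re mf pi po fr de sr f0 s0 s1 s2 in sy cs us sy id"
VMSTAT_SAMPLE = "0 0 0 72836 7688 0 1 5 1 4 0 2 0 0 0 1 16 37 30 1 1 98"


def parse_vmstat(header, line):
    """Pair each header name with the value in the same position."""
    names = header.split()
    # "sy" occurs twice: system calls (faults group) and CPU system time.
    names[-5] = "syscalls"
    names[-2] = "cpu_sy"
    values = [int(v) for v in line.split()]
    return dict(zip(names, values))


sample = parse_vmstat(VMSTAT_HEADER, VMSTAT_SAMPLE)
print("scan rate:", sample["sr"], "idle:", sample["id"])
```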
CPU Solutions
If your CPU is often busy, or
If it often deals with jobs that monopolize the system,
=> Lower the priority of other processes (nice command).
=> Check for runaway processes, or other processes monopolizing the CPU or MEMORY.
check "ps -ef" (TIME field)
check "ps -efl" (SZ field)
=> Evaluate your system's memory usage.
Note: Insufficient memory causes excessive swapping and paging.
swap command
How to use swap
#> swap -l
blocks : 512 bytes block
free : 512 bytes block
#> swap -s
How much swap space is available
Example of swap
#> swap -l
swapfile dev swaplo blocks free
/dev/dsk/c0t3d0s1 32,25 8 156232 95200
#> swap -s
total: 33092k bytes allocated + 9104k reserved = 42196k used, 53172k available
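The "swap -s" line can be turned into numbers to see how much of the swap space is in use. A minimal sketch, assuming the line has exactly the four "k" figures shown in the example; the function name is hypothetical.

```python
import re

SWAP_S_SAMPLE = "total: 33092k bytes allocated + 9104k reserved = 42196k used, 53172k available"


def parse_swap_s(line):
    """Extract the four kilobyte figures and the fraction of swap in use."""
    allocated, reserved, used, available = (
        int(n) for n in re.findall(r"(\d+)k", line)
    )
    return {
        "allocated": allocated,
        "reserved": reserved,
        "used": used,
        "available": available,
        "used_fraction": used / (used + available),
    }


info = parse_swap_s(SWAP_S_SAMPLE)
print("swap in use: %.0f%%" % (100 * info["used_fraction"]))
```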
vmstat -S command
How to use vmstat -S
#> vmstat -S 5
po : kbytes paged out
pi : kbytes paged in
si : number of pages swapped in per second
so : number of whole processes swapped out
How to read vmstat -S output
▶ po = 0 : no paging occurring
Note 1. check "procs r" field
r > 1 : indicates a processor speed limit
r > 2 - 4 : consider adding more CPUs
Note 2. check "procs b" field
If this column has values, run "iostat" to tune the disk I/O.
▶ po > 0 : not sufficient RAM for the application
Consult the application vendor regarding the memory requirements
▶ pi rate is NOT important
Note: Distinguish between memory access time and disk access time.
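The reading order above (look at po first, then at the r and b columns) can be sketched as follows. The function name is hypothetical, and using 2 as the cut-off for "add more CPU" is an assumption taken from the lower end of the 2 - 4 range.

```python
def read_vmstat_s(r, b, po):
    """Apply the po / r / b reading order to one vmstat -S sample."""
    if po > 0:
        return ["page-outs seen: RAM may be insufficient for the application"]
    hints = []
    if r > 2:
        hints.append("run queue above 2: consider adding CPUs")
    elif r > 1:
        hints.append("run queue above 1: limited by processor speed")
    if b > 0:
        hints.append("blocked processes: run iostat and tune disk I/O")
    return hints


# First and last samples from the example output above.
print(read_vmstat_s(r=0, b=0, po=8))
print(read_vmstat_s(r=0, b=0, po=0))
```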
Example of vmstat -S
#> vmstat -S 5
procs memory page disk faults cpu
r b w swap free si so pi po fr de sr f0 s0 s1 s2 in sy cs us sy id
0 0 0 53036 2848 0 0 15 8 19 0 11 0 0 0 0 52 37 92 4 3 93
0 0 0 53140 6328 0 1 5 0 0 0 0 0 0 0 0 10 21 30 1 1 98
0 0 0 53140 6328 0 0 0 0 0 0 0 0 0 0 0 31 26 62 0 1 99
sar -g command
How to use sar -g
#> sar -g 5 20
pgout/s : The average number of page-out requests per second
<= A good indication of memory performance
pgfree/s : The average number of pages per second that were added to the free list
pgscan/s : The average number of pages that needed to be scanned in order to find more memory
How to read sar data
▶ consistently pgout = 0 : NO memory problem
▶ pgout > 0 over several intervals : System performance is suffering
▶ pgfree and pgscan should be small (less than 5)
Example of sar -g
#> sar -g 5 20
SunOS hostname 5.5.1 Generic_103640-12 sun4u 11/12/98
16:59:36 pgout/s ppgout/s pgfree/s pgscan/s %ufs_ipf
16:59:41 1.39 5.78 23.90 79.88 0.00
16:59:46 1.60 113.77 27.74 253.69 0.00
16:59:51 0.80 2.00 13.80 81.40 0.00
16:59:56 0.20 0.20 0.20 0.00 0.00
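As a sketch of how the reading rules apply to the four intervals above, the following counts the intervals that break each guideline. The function name is hypothetical, and the limit of 5 comes from the "less than 5" guideline.

```python
# (pgout/s, ppgout/s, pgfree/s, pgscan/s) for the four intervals shown above
SAR_G_ROWS = [
    (1.39, 5.78, 23.90, 79.88),
    (1.60, 113.77, 27.74, 253.69),
    (0.80, 2.00, 13.80, 81.40),
    (0.20, 0.20, 0.20, 0.00),
]


def read_sar_g(rows, small=5.0):
    """Count the intervals that violate each sar -g guideline."""
    return {
        "intervals_with_pgout": sum(1 for row in rows if row[0] > 0),
        "pgfree_not_small": sum(1 for row in rows if row[2] >= small),
        "pgscan_not_small": sum(1 for row in rows if row[3] >= small),
    }


print(read_sar_g(SAR_G_ROWS))
```

Every interval in this sample has a nonzero pgout, which by the rules above points to memory pressure.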
Memory Solutions
1st. Two memory problems
=> The system spends a lot of time paging and/or swapping.
=> The system runs out of SWAP space.
2nd. Check the SZ field from "ps -efl"
3rd. For the 1st problem;
=> Add physical memory until paging and swapping stop
Factors for Disk Performance
- Speed of disk
- transfer rate
- seek time
- rotation latency
- Load balance across the multiple disk
- Access type
- single user vs multi user
- sequential vs random
- Memory
- disk buffers, used when transferring information to and from disk, are stored in memory
df -k command
How to use df
#> df
capacity : How much of the file system's total capacity has been used
How to read df output
▶ %capacity = 100%
=> remove core and any s/w packages
=> add more disk space or move files to another partition
Example of df
#> df -k
Filesystem kbytes used avail capacity Mounted on
/dev/dsk/c0t3d0s0 21615 14909 4546 77% /
/dev/dsk/c0t3d0s6 240463 211348 5075 98% /usr
/proc 0 0 0 0% /proc
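A short sketch of scanning "df -k" output for file systems that are close to full. The function name is hypothetical and the 90% default limit is an assumption chosen for illustration.

```python
DF_OUTPUT = """\
Filesystem kbytes used avail capacity Mounted on
/dev/dsk/c0t3d0s0 21615 14909 4546 77% /
/dev/dsk/c0t3d0s6 240463 211348 5075 98% /usr
/proc 0 0 0 0% /proc
"""


def nearly_full(df_text, limit=90):
    """Return (mount point, capacity %) for file systems at or above limit."""
    flagged = []
    for line in df_text.splitlines()[1:]:   # skip the header line
        fields = line.split()
        if len(fields) < 6:
            continue
        capacity = int(fields[4].rstrip("%"))
        if capacity >= limit:
            flagged.append((fields[5], capacity))
    return flagged


print(nearly_full(DF_OUTPUT))
```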
iostat command
How to use iostat
#> iostat 5
disk
serv : average service time, in milliseconds
cpu
us : time spent in user mode
sy : time spent in system mode
wt : time spent waiting I/O
#> iostat -x 5
svc_t : service time
%w : percentage of time the queue is not empty
%b : percentage of time the disk busy
How to read iostat -x output
▶ %b
5% > %b : ignore
30% < %b : be concerned about anything
60% < %b : need fixing
▶ svc_t
if %b < 5%, svc_t : ignore
if %w has values, 10 - 50ms : OK
100 - 150 : Need fixing
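The %b and svc_t thresholds can be combined into one check per disk. A minimal sketch with a hypothetical function name; the "watch" label for service times between 50 and 100 ms is an assumption, since the guideline gives no rule for that range, and the %w condition is left out for brevity.

```python
def read_disk(busy_pct, svc_t_ms):
    """Return (busy_level, service_level) for one iostat -x row."""
    if busy_pct < 5:
        # svc_t is not meaningful on a nearly idle disk
        return ("ignore", "ignore")
    if busy_pct > 60:
        busy = "needs fixing"
    elif busy_pct > 30:
        busy = "be concerned"
    else:
        busy = "ok"
    if svc_t_ms >= 100:
        service = "needs fixing"
    elif svc_t_ms <= 50:
        service = "ok"
    else:
        service = "watch"   # assumption: no rule is given for 50-100 ms
    return (busy, service)


# A disk that is 0% busy can show a long svc_t and still be ignored.
print(read_disk(busy_pct=0, svc_t_ms=99.5))
print(read_disk(busy_pct=65, svc_t_ms=120))
```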
iostat -D command
How to use iostat -D
#> iostat -D 5
util : percentage of disk utilization
We can find the load balance between disks.
Example of iostat
#> iostat 5
tty fd0 sd0 sd1 sd2 cpu
tin tout Kps tps serv Kps tps serv Kps tps serv Kps tps serv us sy wt id
0 560 0 0 0 0 0 99 1 0 49 10 2 54 4 4 4 88
0 16 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 99
#> iostat -x 5
extended disk statistics
disk r/s w/s Kr/s Kw/s wait actv svc_t %w %b
fd0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0 0
sd0 0.0 0.0 0.2 0.1 0.0 0.0 99.5 0 0
#> iostat -D
fd0 sd0 sd1 sd2
rps wps util rps wps util rps wps util rps wps util
0 0 0.0 0 0 0.2 0 0 0.4 1 0 3.1
sar -a -b -d command
How to use sar
#> sar -a 5 3
-a : Report use of file access system routines (report on file access)
iget/s
namei/s
dirbk/s
Average
#> sar -b 5 3
-b : Report buffer activity (report on disk buffers)
%rcache : Fraction of logical reads found in the system buffers
%wcache : Fraction of logical writes found in the system buffers
#> sar -d 5 3
-d : Report activity for each block device (report on disk transfers)
r+w/s : read + write per second
%busy : Percentage of time the device spends servicing a transfer
blk/s : Number of 512-byte blocks transferred to device, per second
How to read sar data
sar -a
▶ The larger the values, the more time the kernel is spending to ACCESS user files
▶ This report is USEFUL for understanding "HOW disk-dependent a system is"
sar -b
▶ %rcache < 90% and %wcache < 65%
=> may be possible to improve performance by increasing the buffer space
sar -d
▶ %busy > 85% : high utilization, load problem
▶ r+w/s > 65 : overload
Example of sar output
#> sar -a 5 3
16:59:36 iget/s namei/s dirbk/s
16:59:41 0 0 0
16:59:46 8 26 15
16:59:51 271 297 288
Average 93 108 101
#> sar -b 5 3
16:59:36 bread/s lread/s %rcache bwrit/s lwrit/s %wcache pread/s pwrit/s
16:59:41 19 57 67 22 26 17 0 0
16:59:46 23 73 69 18 23 20 0 0
16:59:51 25 51 51 15 27 46 0 0
Average 22 60 63 18 25 28 0 0
#> sar -d 5 3
13:21:07 device %busy avque r+w/s blks/s avwait avserv
13:21:12 sd1 52 0.8 30 216 4.8 21.1
sd3 6 0.1 1 14 0.0 45.4
average
sd1 32 0.5 17 187 3.5 23.9
sd2 45 0.4 24 208 0.0 18.9
sd3 4 0.0 1 9 0.0 51.1
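The "sar -b" and "sar -d" reading rules can be applied to the averages in the example output. This is a sketch with hypothetical function names, using the thresholds from the reading rules (90% and 65% for the cache hit rates, 85% for %busy).

```python
def buffer_cache_undersized(rcache_pct, wcache_pct):
    """True when both cache hit rates are below the guideline values."""
    return rcache_pct < 90 and wcache_pct < 65


def overloaded_devices(avg_busy, limit=85):
    """Return the devices whose average %busy exceeds the limit."""
    return [dev for dev, busy in avg_busy.items() if busy > limit]


# Averages taken from the sar -b and sar -d examples above
print(buffer_cache_undersized(rcache_pct=63, wcache_pct=28))
print(overloaded_devices({"sd1": 32, "sd2": 45, "sd3": 4}))
```

For the sample data the cache hit rates are low, so more buffer space may help, while no device is over the %busy limit.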
DISK Solutions
1st. Check the file system overload (above 90 or 95% capacity)
=> Clean out unused files from /var/adm, /var/adm/sa and /var/lp/logs.
=> Clean out the core files.
=> Find files that have not been modified for more than 60 days.
#> find /home -type f -mtime +60 -print
#> find /home -name core -exec rm {} \;
2nd. If you have more than one disk,
=> Distribute the file systems for a more balanced load between the disks.
=> Try reducing disk seeking through careful planning in data positioning on disk.
(I/O on the outer sectors can be far faster)
3rd. Consider adding Memory.
=> Additional memory reduces swapping and paging, and allows an expanded buffer pool.
4th. Consider buying faster disks.
5th. Make sure disks are not overloading the SCSI controller.
=> Keep SCSI bus utilization below 60%.
6th. Consider adding disks.
=> No disk should be busy more than 40 - 60% of the time.
(%b : iostat -xct 5 and %busy : sar -d 5 3)
7th. Consider using an in-memory file system for the /tmp directory.
=> It's the default in Solaris 2.X.
Overview of Network Performance
● Congested or Collision - resend
The network has limited bandwidth, and so can only transmit a certain amount of data
● EtherNet
10Mbit/sec
Size of 1 packet - 64 bytes (about 14,880 packets/sec)
Maximum packet size : 1518 bytes
Inter packet gap : 9.6 microseconds
30-40% utilization because of collision contention
● Latency
Not important as much as disk
Must consider that remote system has resources including disk
● NFS
UDP : Common protocol in use, being part of TCP/IP and allows fast network throughput with little overhead.
Logical packet size : 9Kbytes
On ethernet : 6 * 1518 bytes
After a collision, ALL of the several ethernet packets have to be resent
● Slower remote server
The remote server is CPU bound
● Network Monitoring Tool
nfsstat
netstat
snoop
ping
spray
ping command
How to use ping
#> ping
Send a packet to a host on the network.
-s : send one packet per second
How to read ping -s output
▶ 2 single SPARCstations on a quiet EtherNet always respond with less than 1 millisecond
Example of ping
#> ping -s host
PING host: 56 data bytes
64 bytes from host (1.1.1.1): icmp_seq=0. time=7. ms
64 bytes from host (1.1.1.1): icmp_seq=0. time=7. ms
----host PING Statistics----
5 packets transmitted, 5 packets received, 0% packet loss
round-trip (ms) min/avg/max = 1/2/7
spray command
How to use spray
#> spray
Send a one-way stream of packets to a host
Reports how many were received and the transfer RATE.
-c : count(number) of packets
-d : specifies the delay, in microseconds
Default: 9.6 microsecond
-l : specifies the Length(size) of the packet
How to read spray output
▶ If many packets are dropped even when the -d option is used,
=> Check hardware such as loose cables or missing termination
=> Check for a possibly congested network
=> Use the netstat command to get more information
Example of spray
#> spray -d 20 -c 100 -l 2048 host
sending 100 packets of length 2048 to host ...
no packets dropped by host
560 packets/sec, 1147576 bytes/sec
netstat -i command
How to use netstat -i
#> netstat -i 5
errs : the number of errors
packets : the number of packets
colls : the number of collisions
* Collision percentage rate = colls/output packets * 100
How to read netstat data
▶ collision percentage > 5% (one system)
=> Checking the network interface and cabling
▶ collision percentage > 5% (all system)
=> The network is congested
▶ errs field has data
=> Suspect BAD hardware generating illegal sized packets
=> Check the network repeaters
Example of netstat
#> netstat -i 5
input le0 output input (Total) output
packets errs packets errs colls packets errs packets errs colls
71853 1 27270 8 4839 72526 1 27943 8 4839
7 0 0 0 0 7 0 0 0 0
14 0 0 0 0 14 0 0 0 0
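The collision percentage formula can be applied directly to the le0 counters in the first line of the example. A minimal sketch with a hypothetical function name; note that the first line of interval output is generally a cumulative total, so it describes the long-term average rather than the last five seconds.

```python
def collision_rate(colls, output_packets):
    """Collision percentage = colls / output packets * 100."""
    if output_packets == 0:
        return 0.0
    return 100.0 * colls / output_packets


# le0 counters from the first line of the example: 27270 output packets, 4839 collisions
rate = collision_rate(colls=4839, output_packets=27270)
print("collision rate: %.1f%%" % rate)
print("above the 5% guideline" if rate > 5 else "within the 5% guideline")
```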
snoop command
How to use snoop
#> snoop
Capture packets from the network
Display their contents
Example of snoop
#> snoop host1
#> snoop -o filename host1 host2
#> snoop -i filename -t r | more
#> snoop -i filename -p99,108
#> snoop -i filename -v -p101
#> snoop -i filename rpc nfs and host1 and host2
nfsstat -c command
How to use nfsstat -c
#> nfsstat -c
Display a summary of servers and client statistics
Can be used to IDENTIFY NFS problems
retrans : Number of remote procedure calls(RPCs) that were retransmitted
badxids : Number of times that a duplicate acknowledgement was received for a single NFS request
timeout : Number of calls that timed out
readlink : Number of reads to symbolic links
How to read nfsstat data
▶ % of retrans of calls > 5% : maybe network problem
=> Looking for network congestion
=> Looking for overloaded servers
=> Check ethernet interface
▶ high badxid, as well as timeout : remote server slow
=> Increase the time-out period
#> mount -o rw,soft,timeo=15 host:/home /home
▶ % of readlink of the calls > 10% : too many symbolic links
Example of nfsstat
#> nfsstat -c
Client rpc:
calls badcalls retrans badxid timeout wait newcred timers
13185 0 8 0 8 0 0 50
Client nfs:
calls badcalls nclget nclcreate
13147 0 13147 0
null getattr setattr root lookup readlink read
0 0% 794 6% 10 0% 0 0% 2141 16% 2720 21% 6283 48%
wrcache write create remove rename link symlink
0 0% 581 4% 33 0% 29 0% 4 0% 0 0% 0 0%
mkdir rmdir readdir statf
0 0% 0 0% 539 4% 13 0%
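The percentage rules for retrans and readlink can be checked against the counters in the example. This is a sketch with a hypothetical function name; the badxid rule is left out because the text gives no numeric threshold for "high".

```python
def read_nfsstat(rpc_calls, retrans, nfs_calls, readlink):
    """Apply the retrans (> 5%) and readlink (> 10%) rules to client counters."""
    hints = []
    if rpc_calls and 100.0 * retrans / rpc_calls > 5:
        hints.append("retransmissions above 5%: check network and servers")
    if nfs_calls and 100.0 * readlink / nfs_calls > 10:
        hints.append("readlink above 10%: too many symbolic links")
    return hints


# Counters from the example: 13185 RPC calls, 8 retrans, 13147 NFS calls, 2720 readlink
print(read_nfsstat(rpc_calls=13185, retrans=8, nfs_calls=13147, readlink=2720))
```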
Network Solutions
1st. Consider adding the Prestoserve NFS Write accelerator.
=> Write % from nfsstat -s > 15%, consider installing it.
2nd. Subnetting
=> If your network is congested, consider subnetting.
=> That is, if the collision rate > 5%, subnet.
3rd. Install a bridge
=> If your network is congested and physical segmentation is NOT possible.
=> Isolate physical segments of a busy network.
4th. Install local disks in diskless machines.
>> Bottlenecks
Detection
Symptom | sar field
---|---
Uneven workload | r+w/s, avque
Many threads blocked waiting on I/O | %wio
High disk utilization rate | %busy
Active disk with no free space |
Solutions
▶ Balance the disk load
▶ Use mmap instead of read and write
▶ Use shared libraries
▶ Put busier filesystems on smaller disks
▶ Organize I/O requests to be more contiguous
▶ Add more/faster disks
Detection
Solutions
▶ Modify process load
▶ Tune paging parameters
▶ Add more memory
▶ Use shared libraries
▶ Use memcntl to use memory more efficiently within application
▶ Analyze locality of reference in applications
▶ Set memory limits - setrlimit
Symptom | sar field
---|---
CPU idle time is low | %usr, %sys, %idle
Threads waiting on run queue | runq-sz, %runocc
Slower response/interactive performance |
Solutions
▶ Use priocntl / nice to modify process/thread priorities
▶ Modify dispatch parameter tables
▶ Modify applications to use system calls more efficiently
▶ System daemons
▶ Device interrupts
▶ Modify/limit process load
▶ Custom device drivers
▶ More, faster CPUs
>> Rules Table
Network Rules
Notation used in tables
▶ Rules
To denote a measurement, the command name, a "." and the variable name are combined. For example, if disk service time is measured with "iostat -x" at 30-second intervals, the name is written as "iostat-x30.svc_t".
Variables are combined with the logical operators "&&", "||" and "==", and for brevity ranges are written as "0 <= X < 100".
▶ Levels
The level in each table indicates how serious the condition is, as shown in the table below.
Level | Description
---|---
white | low usage
blue | under-utilization/imbalance of resource
green | target utilization/no problem
amber | warning level
red | critical level that needs to be fixed
black | problems that can prevent your system
▶ Actions
The action to take for each rule in a table is listed, with brief notes on the problem and related matters.
Rules based upon ethernet collisions
Rule for each network interface | Level | Action
---|---|---
(0 < netstat-i30.output.packets < 10) && (100 * netstat-i30.output.colls / netstat-i30.output.packets < 0.5%) && (other nets white or green) | White | No Problem
(0 < netstat-i30.output.packets < 10) && (100 * netstat-i30.output.colls / netstat-i30.output.packets < 0.5%) && (other nets amber or red) | Blue | Inactive Net
(10 <= netstat-i30.output.packets) && (0.5% <= 100 * netstat-i30.output.colls / netstat-i30.output.packets < 2.0%) | Green | No Problem
(10 <= netstat-i30.output.packets) && (2.0% <= 100 * netstat-i30.output.colls / netstat-i30.output.packets < 5.0%) | Amber | Busy Net
(10 <= netstat-i30.output.packets) && (5.0% <= 100 * netstat-i30.output.colls / netstat-i30.output.packets) | Red | Busy Net
network type is not "ie", "le", "ne", or "qe", it is "bf" or "nf". | Green | Not Ether
☞ Inactive Net
An inactive network is a waste of throughput when other networks are overloaded. Rebalance the load so that all networks are used more evenly.
☞ Busy Net
A network with too many collisions reduces throughput and increases response time for users. Move some of the load to inactive networks if there are any. Add more Ethernets or upgrade to a faster interface type like FDDI, 100MBit ethernet or ATM.
☞ Not Ether
If the last letter of the interface name is not "e" then this is not an ethernet, so the collision based network performance rule should not be used.
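The collision rules for one Ethernet interface can be sketched as a function that maps 30-second counters to a level. The function and parameter names are hypothetical, and returning None for combinations the rules do not cover (for example an active interface with a collision rate under 0.5%) is an assumption.

```python
def ethernet_level(output_packets, colls, other_nets_busy=False):
    """Map one interface's output packet and collision counts to a level."""
    if output_packets <= 0:
        return None
    pct = 100.0 * colls / output_packets
    if output_packets < 10:
        if pct < 0.5:
            # A nearly idle net only matters when other nets are amber or red.
            return "blue" if other_nets_busy else "white"
        return None
    if pct < 0.5:
        return None   # no rule is given for this case
    if pct < 2.0:
        return "green"
    if pct < 5.0:
        return "amber"
    return "red"


print(ethernet_level(5, 0))                        # idle net, others fine
print(ethernet_level(5, 0, other_nets_busy=True))  # idle net, others overloaded
print(ethernet_level(1000, 60))                    # 6% collisions
```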
CPU Rules
Rules for SunOS4 and Solaris2
CPU Rule | Level | Action
---|---|---
0 == vmstat30.r | White | CPU Idle
0 < (vmstat30.r / ncpus) < 3.0 | Green | No problem
3.0 <= (vmstat30.r / ncpus) < 5.0 | Amber | CPU Busy
5.0 <= (vmstat30.r / ncpus) | Red | CPU Busy
mpstat30.smtx < 200 | Green | No problem
200 <= mpstat30.smtx < 400 | Amber | Mutex Stall
400 <= mpstat30.smtx | Red | Mutex Stall
☞ CPU Idle
The CPU power of this system is underutilized. Fewer or less powerful CPUs could be used to do this job.
☞ CPU Busy
There is insufficient CPU power and jobs are spending an increasing amount of time in the queue before being assigned to a CPU. This reduces throughput and increases interactive response times.
☞ Mutex Stall
If the number of stalls per CPU per second exceeds the limit there is mutex contention happening in the kernel which wastes CPU time and degrades multiprocessor scaling.
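The CPU rules can be sketched as two small functions, one for the run queue per CPU and one for the mutex stall rate. The function names are hypothetical, and a run queue of exactly 5.0 per CPU is treated as red here, which is an assumption about where the amber range ends.

```python
def run_queue_level(r, ncpus):
    """Level for the vmstat run queue length, averaged per CPU."""
    if r == 0:
        return "white"
    per_cpu = r / ncpus
    if per_cpu < 3.0:
        return "green"
    if per_cpu < 5.0:
        return "amber"
    return "red"


def mutex_level(smtx):
    """Level for the mpstat smtx value (mutex stalls per CPU per second)."""
    if smtx < 200:
        return "green"
    if smtx < 400:
        return "amber"
    return "red"


print(run_queue_level(r=8, ncpus=2))   # 4.0 per CPU
print(mutex_level(smtx=250))
```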
Kernel variables depend heavily on the operating system release, so a setting that works well in one release may not work properly in another.
The table below shows the recommended variable values for Solaris 2.3 and 2.4.
Name | Default | Min | Max
---|---|---|---
maxusers | MB available (physmem) | 8 | 2048
pt_cnt | 48 | 48 | 3000
ncsize | ~(maxusers*17) + 90 | 226 | 34906
ufs_ninode | ~(maxusers*17) + 90 | 226 | 34906
autoup | 30 sec | |
tune_t_fsflushr | 5 sec | |
If (fsflush > 5% of CPU time) => double autoup => tune_t_fsflushr += 5
Note: Do not set the "autoup" variable to more than 120 seconds.
maxusers : kernel-tunable variable
In most cases maxusers is set to the number of megabytes of system memory. It can be set as high as 2048 through the /etc/system file, but the system itself never sets maxusers above 1024.
▶ The following kernel variables are affected by the value of maxusers on a SunOS 5.4 system.
max_nprocs = ( 10 + 16 * maxusers )
ufs_ninode = ( max_nprocs + 16 + maxusers ) + 64
ndquot = ( ( maxusers * NMOUNT ) / 4 ) + max_nprocs
maxuprc = ( max_nprocs - 5 )
ncsize = ( max_nprocs + 16 + maxusers ) + 64
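To see what the formulas give for a particular machine, they can be evaluated directly. A minimal sketch: the function name is hypothetical, NMOUNT is passed in as a parameter because its value is not given here, and the value 40 in the usage line is purely illustrative.

```python
def derived_from_maxusers(maxusers, nmount):
    """Evaluate the SunOS 5.4 formulas for variables derived from maxusers."""
    max_nprocs = 10 + 16 * maxusers
    return {
        "max_nprocs": max_nprocs,
        "ufs_ninode": (max_nprocs + 16 + maxusers) + 64,
        "ndquot": (maxusers * nmount) // 4 + max_nprocs,   # integer division, as in C
        "maxuprc": max_nprocs - 5,
        "ncsize": (max_nprocs + 16 + maxusers) + 64,
    }


# Example: a machine with 64 MB of memory, so maxusers = 64; nmount=40 is illustrative
for name, value in derived_from_maxusers(64, nmount=40).items():
    print(name, "=", value)
```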
>> Tuning Tips
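▶ Check one thing at a time.
▶ Spend most of your time on the factor that has the biggest impact.
▶ Consider all of the following together.
- Disk Access
- CPU Access
- Main Memory Access
- Network I/O devices
▶ Keep in mind that tuning means evenly redistributing a load that is skewed to one side.
▶ Examine disk bottlenecks: if a disk is more than 30% busy and its service time is over 50ms, spread the data elsewhere or stripe it with a tool such as DiskSuite.
▶ Do not believe claims that the disks are fine. Check carefully with the "iostat -x 30" command.
▶ If tuning has improved system performance, check disk busy again.
▶ An NFS client waits for the server as idle time, not as I/O wait. On a slow NFS client, use the "nfsstat -m" command to check whether the problem lies in the network or in the NFS server.
▶ Do not be concerned about the free RAM value shown by vmstat. Only about 1/6 of the total stays free.
▶ Do not be concerned about high page-in and page-out values in vmstat. All filesystem I/O is done through page-ins and page-outs.
▶ If the run queue length or load average exceeds four times the number of CPUs, more CPUs are probably needed.
▶ If vmstat shows a procs b value as large as the procs r value, check whether the disks are slow.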