CEN3303 System Administration - Session 14


Session 14: System Troubleshooting

1. Introduction

Troubleshooting is the art and science of identifying, analyzing, and resolving system problems.

Critical Skill: For a system administrator, effective troubleshooting is the most critical skill and the one that separates good professionals from exceptional ones.

Systematic Approach - Methodical and structured

Hypothesis-Driven - Form a hypothesis, test it, validate it

Documentation-Oriented - Document every step

Calm Under Pressure - Stay calm in a crisis

2. Troubleshooting Philosophy and Methodology

An Effective Troubleshooting Mindset

Systematic Approach

Be methodical and structured at every step of troubleshooting.

Hypothesis-Driven

Form a hypothesis, test the theory, and validate the result before taking action.

Documentation-Oriented

Document every step, observation, and result for future analysis.

Calm Under Pressure

Stay calm in a crisis and think clearly under stress.

Systematic Troubleshooting Methodology

Identify Problem → Establish Theory → Test Theory → Plan of Action → Implement Solution → Verify Functionality → Document Findings

Basic Troubleshooting Principles:

Divide and Conquer - Break the problem into smaller parts

Follow the Path - Follow the flow of the data or request

Start Simple - Start with the simplest solution

Compare with Known Good - Compare against a system that is working normally

3. Comprehensive Troubleshooting Framework

OSI Model for Troubleshooting:

Layer-by-Layer Analysis Framework

OSI Layer	Focus Area	Troubleshooting Tools	Common Issues
7. Application	Application logs, error messages	tail, journalctl, app logs	Configuration errors, permission issues
6. Presentation	Data format, encryption	openssl, gpg, encoding tools	SSL errors, data corruption
5. Session	Session management, timeouts	ss, netstat, lsof	Session leaks, connection limits
4. Transport	TCP/UDP, ports, connections	telnet, nc, tcpdump	Port conflicts, firewall blocks
3. Network	IP, routing, ICMP	ping, traceroute, ip route	Routing issues, network partitions
2. Data Link	MAC addresses, switches	arp, ethtool, bridge	VLAN misconfig, duplex mismatches
1. Physical	Cables, network interfaces	ethtool, dmesg, ip link	Cable faults, hardware failures

Troubleshooting Matrix:

Symptom	Potential Area	Diagnostic Tools	Quick Checks
System slow	CPU, Memory, Disk I/O	top, vmstat, iostat	Load average, memory usage, I/O wait
Network issues	Network config, DNS, Firewall	ping, traceroute, netstat	Connectivity, DNS resolution, ports
Service down	Service status, Dependencies	systemctl, journalctl	Service status, port listening
Disk problems	Filesystem, Space, Permissions	df, du, lsblk	Disk space, inodes, filesystem errors
High load	Processes, Resource contention	ps, htop, pidstat	Running processes, resource usage
Connection refused	Firewall, Service binding	ss, iptables, netstat	Port listening, firewall rules
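
The matrix above can double as a lookup during an incident. A minimal sketch, assuming the symptom is typed exactly as it appears in the first column; `suggest_tools` is a hypothetical helper name, not part of any standard toolset:

```shell
# suggest_tools SYMPTOM
# Print the first-pass diagnostic tools for a symptom, taken from the
# troubleshooting matrix. Returns 1 for a symptom the matrix does not list.
suggest_tools() {
    case "$1" in
        "system slow")        echo "top vmstat iostat" ;;
        "network issues")     echo "ping traceroute netstat" ;;
        "service down")       echo "systemctl journalctl" ;;
        "disk problems")      echo "df du lsblk" ;;
        "high load")          echo "ps htop pidstat" ;;
        "connection refused") echo "ss iptables netstat" ;;
        *) echo "unknown symptom: $1" >&2; return 1 ;;
    esac
}
```

For example, `suggest_tools "service down"` prints the two tools listed in the "Service down" row.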

4. Essential Troubleshooting Tools

System Monitoring Tools

Real-time Monitoring:

# Process monitoring

htop

# I/O by process

iotop

# Network usage by process

nethogs

# Network bandwidth

iftop

System State Snapshot:

# Virtual memory statistics

vmstat 1 10

# CPU statistics

mpstat 1 10

# I/O statistics

iostat -x 1 10

# Network statistics

sar -n DEV 1 10

Network Troubleshooting Tools

Basic Connectivity:

# ICMP connectivity

ping google.com

# Path analysis

traceroute google.com

# Continuous path analysis

mtr google.com

Port and Service Checking:

# TCP connectivity test

telnet host port

# Check a single TCP port without sending data

nc -zv host port

# TCP connect scan of the host

nmap -sT host

Network Configuration:

# Interface configuration

ip addr show

# Routing table

ip route show

# Socket statistics

ss -tunlp

# Traditional socket info

netstat -tunlp

Log Analysis Tools

Real-time Log Monitoring:

# System logs

tail -f /var/log/syslog

# Systemd journals

journalctl -f

Log Filtering and Analysis:

# Filter errors

grep -i error /var/log/syslog

# Recent failures

journalctl --since "1 hour ago" | grep -i fail

# Pattern matching

awk '/pattern/ {print}' /var/log/file.log

Log Aggregation:

# Analyze failed SSH attempts

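sudo grep -h "Failed password" /var/log/auth.log* | awk '{print $11}' | sort | uniq -c | sort -nr

# Note: $11 is the source address only on lines for existing accounts; on "invalid user" lines the address is field $13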
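
In sshd log lines the source address moves two fields to the right when the line contains "invalid user", so a fixed field number can mix addresses with user names. A minimal sketch that takes the token following the word `from` instead; `count_failed_ssh` is a hypothetical helper name, and the usual "Failed password for ... from ADDRESS port ..." line layout is assumed:

```shell
# count_failed_ssh
# Read auth-log lines on stdin and print "<count> <source address>",
# most frequent source first.
count_failed_ssh() {
    awk '/Failed password/ {
             # The address is the field right after "from", with or
             # without "invalid user" earlier in the line.
             for (i = 1; i < NF; i++)
                 if ($i == "from") { print $(i + 1); break }
         }' |
    sort | uniq -c | sort -nr |
    awk '{ print $1, $2 }'    # strip the padding that uniq -c adds
}
```

Typical use would be `sudo cat /var/log/auth.log | count_failed_ssh`.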

Process and Service Tools

Service Management:

# Service status

systemctl status service_name

# Restart service

systemctl restart service_name

# List failed services

systemctl --failed

Process Analysis:

# Find process

ps aux | grep process_name

# Process tree

pstree -p

# Processes using port 80

lsof -i :80

# Processes using file

fuser -v /path/to/file

Resource Limits:

# User limits

ulimit -a

# System file handles

cat /proc/sys/fs/file-nr
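
`/proc/sys/fs/file-nr` holds three numbers: allocated file handles, allocated but unused handles, and the system-wide maximum. A minimal sketch that turns them into a usage percentage; `file_handle_usage` is a hypothetical helper name:

```shell
# file_handle_usage ALLOCATED UNUSED MAXIMUM
# Print the percentage of system file handles in use, rounded down.
# The three arguments are the three fields of /proc/sys/fs/file-nr.
file_handle_usage() {
    awk -v alloc="$1" -v unused="$2" -v max="$3" 'BEGIN {
        if (max <= 0) exit 1                      # missing or invalid input
        printf "%d\n", ((alloc - unused) * 100) / max
    }'
}
```

On a live system it can be called as `file_handle_usage $(cat /proc/sys/fs/file-nr)`, leaving the command substitution unquoted so the three fields become three arguments.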

5. Common Troubleshooting Scenarios

Scenario 1: High System Load

Step 1: Identify Load Average

uptime

# Example output: load average: 4.5, 3.2, 2.1 (1-, 5-, and 15-minute averages)
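
A load average only means something relative to the number of CPU cores. A minimal sketch, assuming the common rule of thumb that a 1-minute load above 1.0 per core signals saturation; `load_status` is a hypothetical helper name and the threshold is an assumption to tune per system:

```shell
# load_status LOAD CORES
# Print "overloaded" when LOAD divided by CORES exceeds 1.0, else "ok".
load_status() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        if (cores <= 0) exit 2                    # invalid core count
        if (load / cores > 1.0) print "overloaded"; else print "ok"
    }'
}
```

On Linux it could be fed with `load_status "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"`. The Linux load average also counts tasks in uninterruptible I/O wait, so "overloaded" does not by itself mean the system is CPU-bound.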

Step 2: Identify Resource Bottlenecks

top

htop

Step 3: Detailed Analysis

# CPU-bound?

mpstat -P ALL 1 5

# Memory-bound?

free -h

vmstat 1 5

# I/O-bound?

iostat -x 1 5

# I/O by process

iotop

Step 4: Identify Culprit Processes

# Top CPU processes

ps aux --sort=-%cpu | head -10

# Top memory processes

ps aux --sort=-%mem | head -10

Step 5: Take Appropriate Action

# Graceful termination

kill -TERM problematic_pid

# Force kill (last resort)

kill -KILL problematic_pid

Scenario 2: Network Connectivity Issues

Step 1: Local Interface Check

ip addr show

ip link show

Step 2: Local Network Connectivity

# Gateway connectivity

ping 192.168.1.1

# External IP connectivity

ping 8.8.8.8

# DNS test

ping google.com
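
The three pings above narrow the fault from the inside out: gateway first, then an external IP, then a hostname. A minimal sketch of that decision logic, taking the exit status of each probe as input; `classify_connectivity` is a hypothetical helper name and the messages are illustrative:

```shell
# classify_connectivity GATEWAY_RC EXTERNAL_IP_RC HOSTNAME_RC
# Each argument is the exit status of one probe (0 = success).
# Prints the first layer at which connectivity breaks.
classify_connectivity() {
    if [ "$1" -ne 0 ]; then
        echo "local network: gateway unreachable"
    elif [ "$2" -ne 0 ]; then
        echo "upstream: gateway reachable, external IP unreachable"
    elif [ "$3" -ne 0 ]; then
        echo "dns: IP connectivity works, hostname probe fails"
    else
        echo "ok"
    fi
}

# Example wiring (addresses as in the steps above):
#   ping -c 2 192.168.1.1 >/dev/null 2>&1; gw=$?
#   ping -c 2 8.8.8.8     >/dev/null 2>&1; ip=$?
#   ping -c 2 google.com  >/dev/null 2>&1; dns=$?
#   classify_connectivity "$gw" "$ip" "$dns"
```

A failed ping is only a hint, since some networks drop ICMP; the DNS tools in Step 6 confirm a suspected name resolution fault.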

Step 3: Path Analysis

traceroute google.com

mtr google.com

Step 4: Port and Service Check

telnet target_host 80

nc -zv target_host 22

Step 5: Firewall Check

# Linux firewall rules

iptables -L -n

# UFW status

ufw status

Step 6: DNS Resolution

nslookup google.com

dig google.com

# DNS configuration

cat /etc/resolv.conf

Scenario 3: Disk Space Issues

Step 1: Check Disk Usage

# Filesystem usage

df -h

# Inode usage

df -i
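
Both checks can be scripted by filtering the percentage column. A minimal sketch that reads POSIX-format output (`df -P` for space, `df -Pi` for inodes on GNU df) and prints the filesystems at or above a limit; `over_threshold` is a hypothetical helper name, and mount points without spaces are assumed:

```shell
# over_threshold LIMIT
# Read `df -P` style output on stdin and print "<mount point> <use%>"
# for every filesystem whose usage is at or above LIMIT percent.
over_threshold() {
    awk -v limit="$1" 'NR > 1 {          # skip the header line
        use = $5
        sub(/%$/, "", use)                # "95%" -> "95"
        if (use + 0 >= limit + 0) print $6, use "%"
    }'
}
```

For example, `df -P | over_threshold 90` lists filesystems that are at least 90% full.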

Step 2: Identify Large Files/Directories

# Top directories in root

du -sh /* 2>/dev/null | sort -hr | head -10

# Large files in /var/log

du -ah /var/log 2>/dev/null | sort -hr | head -10

Step 3: Find and Clean Up

# Clean package cache (Debian/Ubuntu)

apt clean

# Clean package cache (RHEL/CentOS)

yum clean all

# Delete *.log files not modified for more than 30 days (replace -exec ... with -print first to review the list)

find /var/log -name "*.log" -type f -mtime +30 -exec rm -f {} \;

# Clean systemd journals

journalctl --vacuum-time=7d

# Find large files

find / -type f -size +100M -exec ls -lh {} \; 2>/dev/null

Scenario 4: Service Failure

Step 1: Check Service Status

systemctl status nginx

systemctl is-active nginx

systemctl is-enabled nginx

Step 2: Check Service Logs

# Systemd service logs

journalctl -u nginx --since "1 hour ago"

# Application logs

tail -f /var/log/nginx/error.log

Step 3: Check Dependencies

systemctl list-dependencies nginx

Step 4: Test Configuration

# Test nginx configuration

nginx -t

# Test Apache configuration

apache2ctl configtest

Step 5: Check Resources

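# Check whether anything is listening on port 80 (the ss filter matches the port exactly, whereas grep :80 would also match :8080)

ss -tlnp 'sport = :80'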

# Check what's using port 80

lsof -i :80
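
When the port check has to run inside a script, an exact match on the local port is safer than a substring match. A minimal sketch that reads `ss -tln` output on stdin; `port_listening` is a hypothetical helper name, and the column layout of `ss -tln` (state in column 1, local address:port in column 4) is assumed:

```shell
# port_listening PORT
# Read `ss -tln` output on stdin; succeed (exit 0) if a socket is
# listening on exactly PORT, fail (exit 1) otherwise.
port_listening() {
    awk -v port="$1" '
        NR > 1 && $1 == "LISTEN" && $4 ~ (":" port "$") { found = 1 }
        END { exit (found ? 0 : 1) }'
}
```

For example, `ss -tln | port_listening 80 && echo "port 80 is taken"`.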

6. Advanced Troubleshooting Techniques

Strace for Process Debugging

Trace System Calls:

# Trace running process

strace -p 1234

# Trace command with children

strace -f command

# Trace only file operations

strace -e trace=file command

# Summary of system calls

strace -c command

Common Error Patterns:

ENOENT - File not found
EACCES - Permission denied
ENOSPC - No space left
ECONNREFUSED - Connection refused
EADDRINUSE - Address already in use
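
In strace output a failed system call ends with `= -1` followed by the errno name, so the patterns above can be counted instead of read line by line. A minimal sketch; `errno_summary` is a hypothetical helper name:

```shell
# errno_summary
# Read strace output on stdin and print "<count> <errno name>" for
# every failed system call, most frequent error first.
errno_summary() {
    grep -oE '= -1 E[A-Z0-9]+' |          # keep only "= -1 ENAME"
    awk '{ print $3 }' |                  # third token is the errno name
    sort | uniq -c | sort -nr |
    awk '{ print $1, $2 }'                # strip uniq -c padding
}
```

A trace can be written to a file with `strace -f -o trace.log command` and summarized with `errno_summary < trace.log`. Many ENOENT results are normal, since programs routinely probe several search paths; the summary only shows where to look first.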

Tcpdump for Network Analysis

Capture Network Traffic:

# Capture on interface

tcpdump -i eth0

# Capture traffic to/from host

tcpdump host 192.168.1.100

# Capture HTTP traffic

tcpdump port 80

# Save to file for analysis

tcpdump -w capture.pcap

Advanced Filters:

# Capture full-length packets on all interfaces and save them to a file

tcpdump -i any -s 0 -w full_capture.pcap

# Read from file

tcpdump -r capture.pcap

# Filter by protocol

tcpdump icmp

tcpdump tcp port 22

Memory Analysis

Check Memory Usage Details:

# System memory info

cat /proc/meminfo

# Kernel slab memory info

slabtop

Process Memory Details:

# Process memory map

pmap -x 1234

# Detailed memory segments

cat /proc/1234/smaps

Memory Leak Detection:

# Using valgrind

valgrind --leak-check=yes program_name

# Monitor memory over time

watch -n 1 'ps aux --sort=-%mem | head -10'
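
Watching the list by eye works for a few minutes; for longer observation it helps to record samples and test whether they only ever go up. A minimal sketch, assuming integer memory samples (for example resident set size in KB) collected at regular intervals; `is_growing` is a hypothetical helper name, and steady growth is a hint of a leak rather than proof:

```shell
# is_growing SAMPLE1 SAMPLE2 [SAMPLE3 ...]
# Print "growing" when every sample is larger than the one before it,
# otherwise "stable". Samples must be integers.
is_growing() {
    [ "$#" -ge 2 ] || { echo "need at least two samples" >&2; return 2; }
    prev=$1
    shift
    for cur in "$@"; do
        if [ "$cur" -le "$prev" ]; then
            echo "stable"
            return 0
        fi
        prev=$cur
    done
    echo "growing"
}
```

A single sample for process 1234 can be taken with `ps -o rss= -p 1234`.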

Performance Profiling

CPU Profiling:

# Record performance data

perf record -g command

# Analyze performance data

perf report

System-wide Profiling:

# System-wide sampling

perf record -a -g sleep 10

# Text-based report

perf report --stdio

# Specific event monitoring

perf stat -e cache-misses command

Flame Graph Generation:

# Capture stack traces

perf record -F 99 -a -g -- sleep 30

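# Generate flame graph (stackcollapse-perf.pl and flamegraph.pl come from the FlameGraph project and are assumed to be in the current directory)

perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf.svg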

7. Root Cause Analysis (RCA)

RCA Techniques

5 Whys Technique

Ask "why" repeatedly until the root cause is found.

Example: Service down → Why? → Out of memory → Why? → Memory leak → Why? → Bug in application code

Fishbone Diagram

A cause-and-effect diagram for visualizing contributing factors.

Categories: People, Process, Technology, Environment

Fault Tree Analysis

Analyze failures as a tree built from logical gates.

AND/OR gates express combinations of failure conditions

Timeline Analysis

A chronological sequence of events for correlation analysis.

Correlate system changes with the incident timeline
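
Correlating changes with an incident often starts with merging several event lists into one chronological view. A minimal sketch, assuming every line in every file begins with an ISO 8601 timestamp in the same time zone, so that text order equals time order; `merge_timeline` and the file names in the usage note are hypothetical:

```shell
# merge_timeline FILE...
# Merge event files into one list ordered by the leading timestamp.
# -s keeps the original order of lines that share a timestamp.
merge_timeline() {
    LC_ALL=C sort -s -k1,1 "$@"
}
```

For example, `merge_timeline changes.log incident.log` places each deployment or configuration change next to the symptoms that followed it.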

RCA Document Template

# ROOT CAUSE ANALYSIS DOCUMENT

## Incident Summary

- **Date/Time**: [Timestamp]

- **System Affected**: [System/Service]

- **Impact**: [Business impact]

- **Duration**: [Downtime duration]

## Timeline of Events

1. [Timestamp] - First symptom observed

2. [Timestamp] - Initial investigation started

3. [Timestamp] - Escalation to team

4. [Timestamp] - Root cause identified

5. [Timestamp] - Resolution implemented

## Root Cause

[Detailed description of underlying cause]

## Contributing Factors

- Factor 1: [Description]

- Factor 2: [Description]

- Factor 3: [Description]

## Resolution Steps

[Steps taken to resolve the issue]

## Preventive Measures

- [ ] Action item 1

- [ ] Action item 2

- [ ] Action item 3

## Lessons Learned

[Key takeaways for future improvement]

8. Troubleshooting in Production Environments

Production Troubleshooting Best Practices

Have a Rollback Plan - Always be ready to roll back a change
Communicate Proactively - Update stakeholders regularly
Monitor Impact - Watch the impact of changes in real time
Document Everything - Document every step and observation
Change Control - Follow change management procedures

Minimizing Business Impact

Traffic Management:

# Cap nginx CPU usage temporarily (--runtime drops the limit at the next reboot instead of persisting it)

systemctl set-property --runtime nginx CPUQuota=50%

# Maintenance mode page

echo "System maintenance" > /var/www/html/maintenance.html

# Load balancer drain

# Mark instance as draining in LB

Service Degradation:

# Reduce service quality temporarily

# Disable non-essential features

# Increase timeouts

# Reduce cache TTL

Graceful Degradation:

# Fallback to cached data

# Serve static content only

# Queue requests for later processing

# Return 503 with retry-after

Collaboration Tools for Troubleshooting

Shared Troubleshooting Session:

# Shared terminal session

tmux new-session -s troubleshooting

# Alternative screen session

screen -S troubleshooting

Log Sharing:

# Share logs via netcat (plain text, no authentication: trusted networks only; some netcat variants need "nc -l -p 9999")

tail -f /var/log/syslog | nc -l 9999

# Remote log monitoring

ssh user@host "tail -f /var/log/file"

# Centralized logging

# Send logs to ELK/Splunk

Learning Summary

Key Troubleshooting Skills:

Systematic methodology approach
Comprehensive tool proficiency
Effective root cause analysis
Production environment awareness
Documentation and communication

Critical Mindset:

Stay calm under pressure
Think logically and systematically
Communicate effectively with stakeholders
Learn from every incident
Build preventive measures
