Session 14: System Troubleshooting

1. Introduction

Troubleshooting is the art and science of identifying, analyzing, and resolving system problems.

Critical Skill: For a system administrator, effective troubleshooting is the most critical skill, the one that separates good professionals from exceptional ones.
Systematic Approach - Methodical and structured
Hypothesis-Driven - Form a hypothesis, test it, validate it
Documentation-Oriented - Document every step
Calm Under Pressure - Stay calm in a crisis

2. Troubleshooting Philosophy and Methodology

Mindset for Effective Troubleshooting
Systematic Approach

Be methodical and structured at every step of troubleshooting.

Hypothesis-Driven

Form a hypothesis, test the theory, and validate the result before taking action.

Documentation-Oriented

Document every step, observation, and result for future analysis.

Calm Under Pressure

Stay calm in a crisis and think clearly under stress.

Systematic Troubleshooting Methodology
1. Identify Problem
2. Establish Theory
3. Test Theory
4. Plan of Action
5. Implement Solution
6. Verify Functionality
7. Document Findings
Basic Troubleshooting Principles:
Divide and Conquer - Break the problem into smaller parts
Follow the Path - Follow the flow of the data or request
Start Simple - Start with the simplest possible solution
Compare with Known Good - Compare against a system that is working normally
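The "Compare with Known Good" principle can be sketched as a small shell helper. This is a minimal illustration, not a standard tool: the function name config_diff is hypothetical, and it assumes a line-oriented config format with '#' comments where the order of settings does not matter.

```shell
# Compare a suspect config file against a known-good copy, ignoring
# blank lines and '#' comments. Exit status 0 means no differences.
# Usage: config_diff KNOWN_GOOD SUSPECT
config_diff() {
    local good_tmp suspect_tmp status
    good_tmp=$(mktemp) || return 2
    suspect_tmp=$(mktemp) || { rm -f "$good_tmp"; return 2; }
    # Strip comments and blank lines, then sort so ordering is ignored.
    grep -Ev '^[[:space:]]*(#|$)' "$1" | sort > "$good_tmp"
    grep -Ev '^[[:space:]]*(#|$)' "$2" | sort > "$suspect_tmp"
    diff "$good_tmp" "$suspect_tmp"
    status=$?
    rm -f "$good_tmp" "$suspect_tmp"
    return "$status"
}
```

Lines prefixed with "<" exist only in the known-good file and lines prefixed with ">" only in the suspect file. For order-sensitive configs, a plain diff -u of the two files is the safer comparison.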

3. Comprehensive Troubleshooting Framework

OSI Model for Troubleshooting:
Layer-by-Layer Analysis Framework
| OSI Layer       | Focus Area                         | Troubleshooting Tools          | Common Issues                           |
|-----------------|------------------------------------|--------------------------------|-----------------------------------------|
| 7. Application  | Application logs, error messages   | tail, journalctl, app logs     | Configuration errors, permission issues |
| 6. Presentation | Data format, encryption            | openssl, gpg, encoding tools   | SSL errors, data corruption             |
| 5. Session      | Session management, timeouts       | ss, netstat, lsof              | Session leaks, connection limits        |
| 4. Transport    | TCP/UDP, ports, connections        | telnet, nc, tcpdump            | Port conflicts, firewall blocks         |
| 3. Network      | IP, routing, ICMP                  | ping, traceroute, ip route     | Routing issues, network partitions      |
| 2. Data Link    | MAC addresses, switches            | arp, ethtool, bridge           | VLAN misconfig, duplex mismatches       |
| 1. Physical     | Cables, network interfaces         | ethtool, dmesg, ip link        | Cable faults, hardware failures         |
Troubleshooting Matrix:
| Symptom            | Potential Area                  | Diagnostic Tools          | Quick Checks                            |
|--------------------|---------------------------------|---------------------------|-----------------------------------------|
| System slow        | CPU, Memory, Disk I/O           | top, vmstat, iostat       | Load average, memory usage, I/O wait    |
| Network issues     | Network config, DNS, Firewall   | ping, traceroute, netstat | Connectivity, DNS resolution, ports     |
| Service down       | Service status, Dependencies    | systemctl, journalctl     | Service status, port listening          |
| Disk problems      | Filesystem, Space, Permissions  | df, du, lsblk             | Disk space, inodes, filesystem errors   |
| High load          | Processes, Resource contention  | ps, htop, pidstat         | Running processes, resource usage       |
| Connection refused | Firewall, Service binding       | ss, iptables, netstat     | Port listening, firewall rules          |

4. Essential Troubleshooting Tools

System Monitoring Tools
Real-time Monitoring:
# Process monitoring
htop

# I/O by process
iotop

# Network usage by process
nethogs

# Network bandwidth
iftop
System State Snapshot:
# Virtual memory statistics
vmstat 1 10

# CPU statistics
mpstat 1 10

# I/O statistics
iostat -x 1 10

# Network statistics
sar -n DEV 1 10
Network Troubleshooting Tools
Basic Connectivity:
# ICMP connectivity
ping google.com

# Path analysis
traceroute google.com

# Continuous path analysis
mtr google.com
Port and Service Checking:
# TCP connectivity test
telnet host port

# Port scanning
nc -zv host port

# TCP port scan
nmap -sT host
Network Configuration:
# Interface configuration
ip addr show

# Routing table
ip route show

# Socket statistics
ss -tunlp

# Traditional socket info
netstat -tunlp
Log Analysis Tools
Real-time Log Monitoring:
# System logs
tail -f /var/log/syslog

# Systemd journals
journalctl -f
Log Filtering and Analysis:
# Filter errors
grep -i error /var/log/syslog

# Recent failures
journalctl --since "1 hour ago" | grep -i fail

# Pattern matching
awk '/pattern/ {print}' /var/log/file.log
Log Aggregation:
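# Count failed SSH logins per source IP (zgrep also reads rotated .gz logs)
# The IP is taken as the word after "from", because "invalid user" lines shift the field positions
sudo zgrep -h "Failed password" /var/log/auth.log* | \
awk '{for (i = 1; i < NF; i++) if ($i == "from") {print $(i + 1); break}}' | \
sort | uniq -c | sort -nr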
Process and Service Tools
Service Management:
# Service status
systemctl status service_name

# Restart service
systemctl restart service_name

# List failed services
systemctl --failed
Process Analysis:
# Find process (pgrep does not match itself, unlike ps aux | grep)
pgrep -af process_name

# Process tree
pstree -p

# Processes using port 80
lsof -i :80

# Processes using file
fuser -v /path/to/file
Resource Limits:
# User limits
ulimit -a

# System file handles
cat /proc/sys/fs/file-nr
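The raw numbers in /proc/sys/fs/file-nr are easier to judge as a percentage. The sketch below assumes the usual Linux layout of that file (allocated handles, unused handles, system-wide maximum); the function name fd_usage_percent is hypothetical.

```shell
# Read the three fields of /proc/sys/fs/file-nr (allocated, unused, max)
# from stdin and print the percentage of file handles in use.
fd_usage_percent() {
    awk '{ printf "%.1f\n", ($1 - $2) * 100 / $3 }'
}

# On a live Linux system:
#   fd_usage_percent < /proc/sys/fs/file-nr
```

A value approaching 100 suggests the system-wide limit (fs.file-max) is close to being exhausted, which typically surfaces as "Too many open files" errors.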

5. Common Troubleshooting Scenarios

Scenario 1: High System Load
Step 1: Identify Load Average
uptime
# Output: load average: 4.5, 3.2, 2.1
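A load average only means something relative to the number of CPU cores. The following sketch normalizes it per core; the function name load_status and the 0.7 and 1.0 thresholds are illustrative assumptions, not fixed rules.

```shell
# Classify a load average against the number of CPU cores.
# Usage: load_status LOAD CORES
load_status() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        ratio = load / cores
        if (ratio >= 1.0)      status = "saturated"   # more runnable work than cores
        else if (ratio >= 0.7) status = "busy"        # assumed warning threshold
        else                   status = "ok"
        printf "%.2f %s\n", ratio, status
    }'
}

# On a live Linux system (1-minute load average):
#   load_status "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
```

With the example output above, a load of 4.5 is saturated on a 2-core machine but would be unremarkable on a 16-core one. On Linux the load average also counts tasks blocked on I/O, so a high ratio does not by itself prove a CPU bottleneck.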
Step 2: Identify Resource Bottlenecks
top
htop
Step 3: Detailed Analysis
# CPU-bound?
mpstat -P ALL 1 5
# Memory-bound?
free -h
vmstat 1 5
# I/O-bound?
iostat -x 1 5
# I/O by process
iotop
Step 4: Identify Culprit Processes
# Top CPU processes
ps aux --sort=-%cpu | head -10

# Top memory processes
ps aux --sort=-%mem | head -10
Step 5: Take Appropriate Action
# Graceful termination
kill -TERM problematic_pid

# Force kill (last resort)
kill -KILL problematic_pid
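The "graceful first, force as a last resort" order in Step 5 can be wrapped in one helper. This is a sketch under assumptions: the function name graceful_stop and the 10-second default timeout are hypothetical choices.

```shell
# Ask a process to exit with SIGTERM, wait up to TIMEOUT seconds,
# then fall back to SIGKILL only if it is still running.
# Usage: graceful_stop PID [TIMEOUT_SECONDS]
graceful_stop() {
    local pid="$1"
    local timeout="${2:-10}"
    kill -TERM "$pid" 2>/dev/null || return 0    # already gone
    while [ "$timeout" -gt 0 ]; do
        kill -0 "$pid" 2>/dev/null || return 0   # exited after SIGTERM
        sleep 1
        timeout=$((timeout - 1))
    done
    kill -0 "$pid" 2>/dev/null || return 0
    echo "PID $pid did not exit after SIGTERM; sending SIGKILL" >&2
    kill -KILL "$pid" 2>/dev/null
}
```

SIGTERM gives the process a chance to flush data and release locks; SIGKILL cannot be caught, so any cleanup is skipped. For services managed by systemd, systemctl stop applies the same escalation using the unit's configured timeout.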
Scenario 2: Network Connectivity Issues
Step 1: Local Interface Check
ip addr show
ip link show
Step 2: Local Network Connectivity
# Gateway connectivity
ping 192.168.1.1

# External IP connectivity
ping 8.8.8.8

# DNS test
ping google.com
Step 3: Path Analysis
traceroute google.com
mtr google.com
Step 4: Port and Service Check
telnet target_host 80
nc -zv target_host 22
Step 5: Firewall Check
# Linux firewall rules
iptables -L -n

# UFW status
ufw status
Step 6: DNS Resolution
nslookup google.com
dig google.com

# DNS configuration
cat /etc/resolv.conf
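When name resolution fails intermittently, it can help to test each configured resolver on its own. The helper below is a minimal sketch; the name list_nameservers is hypothetical, and it assumes the standard resolv.conf syntax of "nameserver ADDRESS" lines.

```shell
# Print the resolver addresses configured in a resolv.conf-style file.
# Usage: list_nameservers [FILE]   (defaults to /etc/resolv.conf)
list_nameservers() {
    awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Query each resolver directly to find one that is failing:
#   for ns in $(list_nameservers); do
#       dig +short +time=2 +tries=1 google.com "@$ns" || echo "$ns failed"
#   done
```

On systems using systemd-resolved, /etc/resolv.conf often lists only the local stub address 127.0.0.53, in which case resolvectl status shows the upstream servers instead.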
Scenario 3: Disk Space Issues
Step 1: Check Disk Usage
# Filesystem usage
df -h

# Inode usage
df -i
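Both checks can be filtered down to the filesystems that actually need attention. This sketch assumes the POSIX output format of df -P (usage percentage in column 5, mount point in column 6) and mount points without spaces; the function name fs_over_threshold and the default of 90 percent are hypothetical.

```shell
# Read `df -P` output on stdin and print mount points whose usage
# is at or above the given percentage (default 90).
# Usage: df -P | fs_over_threshold [PERCENT]
fs_over_threshold() {
    awk -v limit="${1:-90}" 'NR > 1 {
        use = $5
        sub(/%$/, "", use)              # "95%" -> "95"
        if (use + 0 >= limit + 0) print $6, $5
    }'
}

# Block usage:  df -P  | fs_over_threshold 85
# Inode usage:  df -Pi | fs_over_threshold 85
```

Checking inodes separately matters because a filesystem can report free space in df -h and still refuse new files when its inodes are exhausted.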
Step 2: Identify Large Files/Directories
# Top directories in root
du -sh /* 2>/dev/null | sort -hr | head -10

# Large files in /var/log
du -ah /var/log 2>/dev/null | sort -hr | head -10
Step 3: Find and Clean Up
# Clean package cache (Debian/Ubuntu)
apt clean

# Clean package cache (RHEL/CentOS)
yum clean all

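# Clean log files older than 30 days (preview the list first, then delete)
find /var/log -name "*.log" -type f -mtime +30 -print
find /var/log -name "*.log" -type f -mtime +30 -delete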

# Clean systemd journals
journalctl --vacuum-time=7d

# Find large files
find / -type f -size +100M -exec ls -lh {} \; 2>/dev/null
Scenario 4: Service Failure
Step 1: Check Service Status
systemctl status nginx
systemctl is-active nginx
systemctl is-enabled nginx
Step 2: Check Service Logs
# Systemd service logs
journalctl -u nginx --since "1 hour ago"

# Application logs
tail -f /var/log/nginx/error.log
Step 3: Check Dependencies
systemctl list-dependencies nginx
Step 4: Test Configuration
# Test nginx configuration
nginx -t

# Test Apache configuration
apache2ctl configtest
Step 5: Check Resources
# Check if port is occupied (exact match; grep :80 would also match :8080)
ss -tlnp 'sport = :80'

# Check what's using port 80
lsof -i :80

6. Advanced Troubleshooting Techniques

Strace for Process Debugging
Trace System Calls:
# Trace running process
strace -p 1234

# Trace command with children
strace -f command

# Trace only file operations
strace -e trace=file command

# Summary of system calls
strace -c command
Common Error Patterns:
  • ENOENT - File not found
  • EACCES - Permission denied
  • ENOSPC - No space left
  • ECONNREFUSED - Connection refused
  • EADDRINUSE - Address already in use
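A long trace is easier to read once the failing calls are counted by error code. The sketch below assumes strace's usual line format, where a failed call ends in "= -1 ERRNAME (description)"; the function name strace_errno_summary is hypothetical.

```shell
# Read strace output on stdin and print "count ERRNO" for every error
# code seen, most frequent first.
strace_errno_summary() {
    awk 'match($0, /= -1 E[A-Z0-9]+/) {
        # Skip the 5 characters of "= -1 " to keep only the errno name.
        print substr($0, RSTART + 5, RLENGTH - 5)
    }' | sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}

# Typical use: write the trace to a file, then summarize it.
#   strace -f -o trace.log command
#   strace_errno_summary < trace.log
```

A few ENOENT results are normal, since programs routinely probe several paths for libraries and config files. An unexpected EACCES or ECONNREFUSED is usually the more useful lead.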
Tcpdump for Network Analysis
Capture Network Traffic:
# Capture on interface
tcpdump -i eth0

# Capture traffic to/from host
tcpdump host 192.168.1.100

# Capture HTTP traffic
tcpdump port 80

# Save to file for analysis
tcpdump -w capture.pcap
Advanced Filters:
# Capture on all interfaces without truncating packets (-s 0 = full snapshot length)
tcpdump -i any -s 0 -w full_capture.pcap

# Read from file
tcpdump -r capture.pcap

# Filter by protocol
tcpdump icmp
tcpdump tcp port 22
Memory Analysis
Check Memory Usage Details:
# System memory info
cat /proc/meminfo

# Kernel slab memory info
slabtop
Process Memory Details:
# Process memory map
pmap -x 1234

# Detailed memory segments
cat /proc/1234/smaps
Memory Leak Detection:
# Using valgrind
valgrind --leak-check=yes program_name

# Monitor memory over time
watch -n 1 'ps aux --sort=-%mem | head -10'
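To confirm a suspected leak in one process, it can be more useful to record that process's memory over time than to watch the whole process list. This is a minimal sketch: the function name sample_rss is hypothetical, and it assumes a ps that supports -o rss= and -p (procps on Linux does).

```shell
# Print COUNT samples of a process's resident set size (RSS, in KiB),
# one every INTERVAL seconds, as "epoch_seconds rss_kib".
# Usage: sample_rss PID [COUNT] [INTERVAL_SECONDS]
sample_rss() {
    local pid="$1"
    local count="${2:-5}"
    local interval="${3:-1}"
    local i=0 rss
    while [ "$i" -lt "$count" ]; do
        rss=$(ps -o rss= -p "$pid") || return 1   # process has exited
        echo "$(date +%s) $((rss))"
        i=$((i + 1))
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}

# Example: one sample per minute for an hour, saved for later comparison
#   sample_rss 1234 60 60 > rss-1234.log
```

An RSS that keeps climbing under steady load is a hint rather than proof of a leak, since caches and allocator behavior can produce the same curve; valgrind or a heap profiler is still needed to locate the source.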
Performance Profiling
CPU Profiling:
# Record performance data
perf record -g command

# Analyze performance data
perf report
System-wide Profiling:
# System-wide sampling
perf record -a -g sleep 10

# Text-based report
perf report --stdio

# Specific event monitoring
perf stat -e cache-misses command
Flame Graph Generation:
# Capture stack traces
perf record -F 99 -a -g -- sleep 30

# Generate flame graph
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > perf.svg

7. Root Cause Analysis (RCA)

RCA Techniques
5 Whys Technique

Ask "why" repeatedly until the root cause is found.

Example: Service down → Why? → Out of memory → Why? → Memory leak → Why? → Bug in application code
Fishbone Diagram

A cause-and-effect diagram that visualizes contributing factors.

Categories: People, Process, Technology, Environment
Fault Tree Analysis

Analyzes a failure as a tree of events combined through logic gates.

AND/OR gates model combinations of failure conditions
Timeline Analysis

A chronological sequence of events used for correlation analysis.

Correlate system changes with the incident timeline
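Timeline analysis often starts by merging events from several sources into one ordered list. The sketch below assumes each input line begins with an ISO 8601 timestamp in the same time zone, so that plain text sorting is also chronological sorting; the function name merge_timeline and the file names in the example are hypothetical.

```shell
# Merge timestamped event files into one chronological timeline.
# Each input line must start with an ISO 8601 timestamp.
# Usage: merge_timeline FILE...
merge_timeline() {
    local f
    for f in "$@"; do
        # Tag every event with its source file so its origin stays visible.
        awk -v src="$(basename "$f")" \
            '{ ts = $1; $1 = ""; print ts, "[" src "]" $0 }' "$f"
    done | LC_ALL=C sort -k1,1
}

# Example: merge_timeline changes.log incidents.log
```

Placing deployments and configuration changes next to the first observed symptoms makes it easier to see which change immediately preceded the incident, though proximity in time is a lead to verify rather than a proven cause.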
Template RCA Document
# ROOT CAUSE ANALYSIS DOCUMENT

## Incident Summary
- **Date/Time**: [Timestamp]
- **System Affected**: [System/Service]
- **Impact**: [Business impact]
- **Duration**: [Downtime duration]

## Timeline of Events
1. [Timestamp] - First symptom observed
2. [Timestamp] - Initial investigation started
3. [Timestamp] - Escalation to team
4. [Timestamp] - Root cause identified
5. [Timestamp] - Resolution implemented

## Root Cause
[Detailed description of underlying cause]

## Contributing Factors
- Factor 1: [Description]
- Factor 2: [Description]
- Factor 3: [Description]

## Resolution Steps
[Steps taken to resolve the issue]

## Preventive Measures
- [ ] Action item 1
- [ ] Action item 2
- [ ] Action item 3

## Lessons Learned
[Key takeaways for future improvement]

8. Troubleshooting in Production Environments

Best Practices for Production Troubleshooting
  • Have a Rollback Plan - Always be ready to roll back a change
  • Communicate Proactively - Update stakeholders regularly
  • Monitor Impact - Watch the impact of each change in real time
  • Document Everything - Document every step and observation
  • Change Control - Follow change management procedures
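Documenting every step is easier when taking a note costs one short command. The helper below is a sketch, not an established tool: the function name note and the INCIDENT_LOG variable with its default path are hypothetical.

```shell
# Append a UTC-timestamped note to an incident log.
# INCIDENT_LOG is a hypothetical variable; override it per incident.
INCIDENT_LOG="${INCIDENT_LOG:-./incident-notes.log}"

# Usage: note free-form text describing what was done or observed
note() {
    printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >> "$INCIDENT_LOG"
}

# Example:
#   note "restarted nginx after fixing upstream block"
```

Notes written this way already carry the timestamps needed for the timeline section of an RCA document.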
Minimizing Business Impact
Traffic Management:
# Temporarily cap the service's CPU usage (--runtime: the limit is not kept after reboot)
systemctl set-property --runtime nginx.service CPUQuota=50%

# Maintenance mode page
echo "System maintenance" > /var/www/html/maintenance.html

# Load balancer drain
# Mark instance as draining in LB
Service Degradation:
# Reduce service quality temporarily
# Disable non-essential features
# Increase timeouts
# Reduce cache TTL
Graceful Degradation:
# Fallback to cached data
# Serve static content only
# Queue requests for later processing
# Return 503 with retry-after
Collaboration Tools for Troubleshooting
Shared Troubleshooting Session:
# Shared terminal session
tmux new-session -s troubleshooting

# Alternative screen session
screen -S troubleshooting
Log Sharing:
# Share logs via netcat (unencrypted; traditional netcat needs: nc -l -p 9999)
tail -f /var/log/syslog | nc -l 9999

# Remote log monitoring
ssh user@host "tail -f /var/log/file"

# Centralized logging
# Send logs to ELK/Splunk

Learning Summary

Key Troubleshooting Skills:
  • Systematic methodology approach
  • Comprehensive tool proficiency
  • Effective root cause analysis
  • Production environment awareness
  • Documentation and communication
Critical Mindset:
  • Stay calm under pressure
  • Think logically and systematically
  • Communicate effectively with stakeholders
  • Learn from every incident
  • Build preventive measures