Session 14

Linux System Troubleshooting

A troubleshooting methodology for boot issues, performance, and service problems

Learning Objectives

After completing this lab, students will be able to:

  • Understand a systematic methodology for troubleshooting Linux systems
  • Identify and analyze problems at different layers of the system
  • Use diagnostic tools for troubleshooting
  • Repair problems in the boot process, filesystem, network, and services
  • Produce comprehensive troubleshooting documentation

Supporting Theory

Troubleshooting Methodology
1. Identify

Recognize the symptoms and gather initial information

2. Analyze

Analyze the potential causes based on the evidence

3. Plan

Plan an appropriate corrective action

4. Implement

Apply the fix carefully

5. Verify

Verify that the problem is resolved

6. Document

Document the process and the outcome

Common Problem Areas
Boot Issues

GRUB errors, kernel panic, filesystem corruption

Performance Issues

High load, memory leaks, I/O bottlenecks

Network Issues

Connectivity, DNS, firewall rules, routing

Service Issues

Service crashes, configuration errors, dependencies

Security Issues

Unauthorized access, malware, misconfigurations

Troubleshooting Priority Matrix
| Impact | High Priority | Medium Priority | Low Priority |
|--------|---------------|-----------------|--------------|
| High | System down, data loss | Performance degradation | Minor service issues |
| Medium | Critical service outage | Partial functionality loss | Cosmetic issues |
| Low | Security breaches | Feature limitations | Documentation updates |

Preparing the Troubleshooting Environment

1. Set Up Directories and Install Tools
# Create the working directories for this troubleshooting lab
sudo mkdir -p /troubleshooting/{logs,scripts,backups}
sudo chmod 755 /troubleshooting

# Install the troubleshooting toolset
# (on Debian/Ubuntu the Wireshark command-line tool is packaged as tshark)
sudo apt update && sudo apt install -y \
sysstat dstat nmon htop iotop iftop nethogs \
strace ltrace lsof tcpdump tshark \
auditd fail2ban net-tools iproute2
2. Back Up System Configuration
# Back up critical configuration files
sudo cp /etc/fstab /troubleshooting/backups/fstab.backup
sudo cp /etc/hosts /troubleshooting/backups/hosts.backup
sudo cp /etc/ssh/sshd_config /troubleshooting/backups/sshd_config.backup
# /etc/network/interfaces may be absent on netplan-based systems
sudo cp /etc/network/interfaces /troubleshooting/backups/interfaces.backup

# Back up the package list (the directory is root-owned, so write through sudo tee)
dpkg --get-selections | sudo tee /troubleshooting/backups/package_list.txt > /dev/null
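The backups above use fixed file names, so a second run overwrites the first copy. A minimal sketch of a timestamped, verified backup is shown here; `backup_file` is a hypothetical helper name, not part of any package, and it assumes it is called from a shell that can read the source and write to the destination.

```shell
# Sketch: copy a file into a backup directory with a timestamp suffix and
# confirm the copy is byte-identical. backup_file is a hypothetical helper.
backup_file() {
    src=$1
    dest_dir=$2
    if [ ! -f "$src" ]; then
        echo "skip: $src not found" >&2
        return 1
    fi
    mkdir -p "$dest_dir" || return 1
    dest="$dest_dir/$(basename "$src").$(date +%Y%m%d_%H%M%S).backup"
    cp -p "$src" "$dest" || return 1
    # cmp -s exits 0 only when both files have identical contents
    if cmp -s "$src" "$dest"; then
        echo "$dest"
    else
        echo "error: $dest differs from $src" >&2
        return 1
    fi
}
```

From a root shell, something like `for f in /etc/fstab /etc/hosts; do backup_file "$f" /troubleshooting/backups; done` would print the path of each backup it created and skip files that do not exist.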

Troubleshooting the Boot Process

1. Simulating Boot Problems
# Back up the GRUB configuration first (CAREFUL!)
sudo cp /boot/grub/grub.cfg /troubleshooting/backups/grub.cfg.backup
sudo cp /etc/default/grub /troubleshooting/backups/default_grub.backup

# Only in a safe lab environment:
# sudo mv /boot/grub/grub.cfg /boot/grub/grub.cfg.corrupt

# Simulate a bad kernel parameter
# (only has an effect if "quiet" appears in /etc/default/grub)
sudo sed -i 's/quiet/quiet broken_param/' /etc/default/grub
sudo update-grub
2. GRUB Recovery
# Boot from a live USB/CD
# Mount the root and boot partitions
# (/dev/sda1 and /dev/sda2 are examples; confirm the real layout with lsblk -f)
sudo mount /dev/sda1 /mnt
sudo mount /dev/sda2 /mnt/boot

# Bind-mount the system directories
sudo mount --bind /dev /mnt/dev
sudo mount --bind /proc /mnt/proc
sudo mount --bind /sys /mnt/sys

# Chroot into the installed system
sudo chroot /mnt

# Reinstall GRUB (run inside the chroot)
grub-install /dev/sda
update-grub

# Leave the chroot, unmount everything under /mnt, and reboot
exit
sudo umount -R /mnt
sudo reboot
3. Troubleshooting Kernel Panic
# Analyze kernel logs
sudo dmesg | grep -iE "error|panic|fail"
journalctl -k --since="1 hour ago"
# Kernel messages from the previous boot (requires a persistent journal)
journalctl -k -b -1

# Check for hardware issues
lspci -v
lsusb -v
sudo dmidecode -t memory

# Check kernel parameters
cat /proc/cmdline
sysctl -a 2>/dev/null | grep panic
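When a kernel log is long, a quick tally of panic, oops, and error lines can show where to look first. The following is a rough sketch only: `count_kernel_issues` is a hypothetical helper, and the three patterns it matches are an assumption about what is worth counting, not an exhaustive list.

```shell
# count_kernel_issues is a hypothetical helper: it reads kernel log text on
# stdin and counts lines that mention a panic, an oops, or an error/failure.
# Each line is counted once, in that order of precedence.
count_kernel_issues() {
    awk '
        BEGIN { panic = 0; oops = 0; err = 0 }
        { line = tolower($0) }
        line ~ /kernel panic/ { panic++; next }
        line ~ /oops/         { oops++;  next }
        line ~ /error|fail/   { err++ }
        END { printf "panic=%d oops=%d error=%d\n", panic, oops, err }
    '
}
```

On a system with a persistent journal, `journalctl -k -b -1 --no-pager | count_kernel_issues` would summarize the previous boot.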
4. Boot Process Stages and Troubleshooting
| Stage | Symptoms | Tools | Solutions |
|-------|----------|-------|-----------|
| BIOS/UEFI | No display, beep codes | Hardware diagnostics | Check cables, RAM, CPU |
| Bootloader | GRUB rescue prompt | Live USB, chroot | GRUB reinstall, config fix |
| Kernel | Kernel panic, hang | dmesg, journalctl | Kernel parameters, drivers |
| Initramfs | Init failures, module errors | initrd debugging | Rebuild initramfs |
| Systemd | Service failures, target issues | systemctl, journalctl | Service fixes, dependencies |

Troubleshooting Filesystem Issues

1. Simulating Filesystem Corruption
# Create a test filesystem inside an image file
sudo dd if=/dev/zero of=/tmp/testfs.img bs=1M count=100
sudo mkfs.ext4 /tmp/testfs.img
sudo mkdir -p /mnt/test
sudo mount -o loop /tmp/testfs.img /mnt/test

# Unmount, then force a check of the image
# (this step does not itself damage the filesystem)
sudo umount /mnt/test
sudo fsck.ext4 -f /tmp/testfs.img # Force check
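2. Filesystem Repair
# Run fsck only on an unmounted filesystem (or from live media);
# repairing a mounted filesystem can cause further damage.
# /dev/sda1 is an example device.

# Check the filesystem
sudo fsck /dev/sda1

# Repair, answering "yes" to every prompt
sudo fsck -y /dev/sda1

# For ext4 filesystems specifically
sudo e2fsck -f -y /dev/sda1

# Check filesystem type and mount options
lsblk -f
sudo blkid
mount | grep /dev/sda1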
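Because fsck should not run against a mounted filesystem, a small guard can check the mount table first. This is a sketch under assumptions: `is_mounted` and `safe_fsck` are hypothetical helpers, the device must be spelled exactly as it appears in the mount table, and `safe_fsck` only prints the command rather than running it.

```shell
# is_mounted is a hypothetical helper. It reads a mount table in /proc/mounts
# format (device in column 1) and succeeds if the device appears there.
is_mounted() {
    dev=$1
    table=${2:-/proc/mounts}
    awk -v d="$dev" '$1 == d { found = 1 } END { exit !found }' "$table"
}

# safe_fsck prints the fsck command it would run, or refuses if mounted.
safe_fsck() {
    if is_mounted "$1" "${2:-/proc/mounts}"; then
        echo "refusing: $1 is mounted; unmount it or boot from live media" >&2
        return 1
    fi
    echo "sudo fsck -f $1"
}
```

Printing the command instead of executing it keeps the sketch harmless to try; the second argument exists only so the check can be exercised against a sample mount table.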
3. Disk Space Issues
# Find large files
sudo find / -type f -size +100M -exec ls -lh {} \; 2>/dev/null | head -20

# Check disk usage by directory
sudo du -sh /* 2>/dev/null | sort -hr | head -10

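# Remove unused packages, stale package cache, and old journal entries
sudo apt autoremove -y
sudo apt autoclean -y
sudo journalctl --vacuum-time=7d

# Check inode usage, then count files per top-level directory
df -i
sudo find / -xdev -type f 2>/dev/null | cut -d "/" -f 2 | sort | uniq -c | sort -n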
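Reading `df` output by eye works on one machine; a threshold filter makes the same check repeatable. The sketch below assumes POSIX-format output from `df -P` and mount points without spaces; `over_threshold` is a hypothetical helper name.

```shell
# over_threshold is a hypothetical helper. It reads `df -P` output on stdin
# and prints "mountpoint usage%" for every filesystem at or above the limit.
over_threshold() {
    limit=$1
    awk -v limit="$limit" '
        NR == 1 { next }                 # skip the header row
        {
            use = $5
            sub(/%/, "", use)            # "91%" -> "91"
            if (use + 0 >= limit + 0) print $6, $5
        }
    '
}
```

Typical use would be `df -P | over_threshold 90`. Since `df -Pi` reports inode usage in the same column layout, the same filter should also flag inode exhaustion.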
4. Filesystem Error Symptoms and Solutions
| Symptom | Possible Cause | Diagnosis Command | Solution |
|---------|----------------|-------------------|----------|
| "Read-only filesystem" | Filesystem errors, hardware issues | dmesg \| grep error, smartctl | fsck, remount rw, check disk |
| "No space left on device" | Disk full, inode exhaustion | df -h, df -i, du -sh | Clean up files, resize partition |
| "Input/output error" | Disk failure, cable issues | dmesg, smartctl, badblocks | Replace disk, check connections |
| "Stale file handle" | NFS issues, deleted files | showmount, lsof +L1 | Unmount/remount, clear handles |

Troubleshooting Performance Issues

1. High CPU Usage Investigation
# Identify CPU-intensive processes
top -b -n 1 | head -20
ps aux --sort=-%cpu | head -10

# Process tracing and analysis
# (-o selects the oldest matching PID so each tool receives exactly one PID)
sudo strace -c -p "$(pgrep -o -f "process_name")"
sudo perf top -p "$(pgrep -o -f "process_name")"

# Check system load
uptime
cat /proc/loadavg
mpstat -P ALL 1 3
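A load average only means something relative to the number of CPU cores. As a small illustration, the hypothetical helper below divides the 1-minute load by the core count; treating values well above 1.0 as a sign of saturation is a rule of thumb, not a hard limit, and on Linux the load figure also counts tasks blocked in uninterruptible I/O.

```shell
# load_per_core is a hypothetical helper: load average divided by the
# number of CPU cores, printed with two decimals.
load_per_core() {
    load=$1
    cores=$2
    awk -v l="$load" -v c="$cores" 'BEGIN { printf "%.2f\n", l / c }'
}

# Usage on a live system:
# load_per_core "$(cut -d ' ' -f 1 /proc/loadavg)" "$(nproc)"
```

For instance, a load of 5.2 on a 4-core machine gives 1.30, which suggests more runnable or blocked work than the cores can absorb.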
2. Memory Leak Detection
# Monitor memory usage patterns
free -h
cat /proc/meminfo
vmstat 1 5

# Process memory map analysis
pmap -x $(pgrep -f "process_name")

# Memory leak tools
# (perf is provided by the linux-tools package matching the running kernel)
valgrind --leak-check=yes ./application
sudo perf mem record ./application

# Check slab memory (/proc/slabinfo is readable by root only)
sudo slabtop -o
sudo head -20 /proc/slabinfo
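A single memory snapshot cannot show a leak; what matters is whether resident memory keeps growing. One simple way to observe that, sketched here with a hypothetical `sample_rss` helper, is to read `VmRSS` from `/proc/PID/status` at intervals. Steady growth under constant workload is only a hint, and a tool such as valgrind is still needed to confirm a leak.

```shell
# sample_rss is a hypothetical helper. It prints the resident set size (VmRSS,
# in kB) of a PID from /proc/PID/status, COUNT times, INTERVAL seconds apart.
sample_rss() {
    pid=$1
    count=${2:-5}
    interval=${3:-1}
    i=0
    while [ "$i" -lt "$count" ]; do
        [ -r "/proc/$pid/status" ] || { echo "process $pid not found" >&2; return 1; }
        rss=$(awk '/^VmRSS:/ { print $2 }' "/proc/$pid/status")
        # Kernel threads have no VmRSS line, so fall back to 0
        echo "$(date +%H:%M:%S) ${rss:-0}"
        i=$((i + 1))
        if [ "$i" -lt "$count" ]; then sleep "$interval"; fi
    done
    return 0
}
```

For example, `sample_rss "$(pgrep -o -f process_name)" 10 60` would print ten timestamped readings one minute apart.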
3. I/O Bottleneck Analysis
# I/O monitoring tools
sudo iotop -o
iostat -x 1 3

# Identify I/O-intensive processes
sudo lsof +D /var/log # Files open under a directory
sudo fuser -vm /dev/sda1 # Processes using the filesystem on this device

# Disk latency analysis (ioping is a separate package: sudo apt install ioping)
sudo ioping -c 10 /
cat /sys/block/sda/queue/nr_requests
cat /sys/block/sda/queue/scheduler
4. Performance Issue Patterns
| Pattern | Likely Cause | Diagnosis Tools | Resolution |
|---------|--------------|-----------------|------------|
| High CPU, low I/O wait | CPU-bound application | top, perf, strace | Optimize code, scale horizontally |
| High I/O wait, low CPU | Disk bottleneck | iostat, iotop, fio | Upgrade storage, optimize I/O |
| High memory, swapping | Memory leak/insufficient RAM | free, vmstat, valgrind | Add RAM, fix memory leaks |
| High load, low resource usage | Process contention, locks | strace, lsof, ipcs | Identify blocking processes |
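The patterns above depend on knowing how much CPU time went to I/O wait. As a worked example, that share can be computed from two samples of the aggregate `cpu` line in `/proc/stat`. The helper name `iowait_pct` is hypothetical, and the sketch assumes the usual field order (user, nice, system, idle, iowait, irq, softirq, steal).

```shell
# iowait_pct is a hypothetical helper. It takes two "cpu ..." lines sampled
# from /proc/stat and prints the share of elapsed CPU time spent in iowait.
# Fields after the label: user nice system idle iowait irq softirq steal ...
iowait_pct() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        {
            total[NR] = $2 + $3 + $4 + $5 + $6 + $7 + $8 + $9
            iow[NR] = $6
        }
        END {
            dt = total[2] - total[1]
            if (dt <= 0) { print "0.0"; exit }
            printf "%.1f\n", 100 * (iow[2] - iow[1]) / dt
        }
    '
}

# Usage on a live system:
# a=$(head -n 1 /proc/stat); sleep 5; b=$(head -n 1 /proc/stat); iowait_pct "$a" "$b"
```

With samples whose totals differ by 1000 ticks and whose iowait counters differ by 250, the result is 25.0, which would point toward the "High I/O wait, low CPU" row.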

Troubleshooting Network Issues

1. Connectivity Testing
# Basic connectivity tests
ping -c 4 8.8.8.8
ping -c 4 google.com

# Path analysis
traceroute google.com
mtr google.com

# DNS troubleshooting
nslookup google.com
dig google.com
dig @8.8.8.8 google.com
resolvectl status # replaces the deprecated systemd-resolve --status
2. Port and Service Checking
# Local listening ports (root is needed to see process names for all sockets)
sudo netstat -tulnp
sudo ss -tulnp
sudo lsof -i :80

# Remote port checking
nc -zv google.com 80
telnet google.com 80
nmap -p 80 google.com

# Check firewall rules
sudo iptables -L -n -v
sudo ufw status verbose
3. Packet Capture and Analysis
# Capture network traffic (ens33 is an example interface; list yours with ip link)
# -c 1000 stops after 1000 packets so the capture does not run indefinitely
sudo tcpdump -i ens33 -c 1000 -w /troubleshooting/logs/network_capture.pcap

# Analyze with tshark
sudo tshark -r /troubleshooting/logs/network_capture.pcap -Y "http"
sudo tshark -r /troubleshooting/logs/network_capture.pcap -Y "dns"

# Real-time packet analysis
sudo tcpdump -i ens33 -n port 80
sudo tshark -i ens33 -f "tcp port 443"
4. Network Issue Isolation
| Layer | Symptoms | Diagnosis Tools | Common Solutions |
|-------|----------|-----------------|------------------|
| Physical | No link, packet loss | ethtool, ip link, dmesg | Check cables, NIC, drivers |
| Network | No route to host | ip route, traceroute, ping | Fix routing, gateway config |
| Transport | Connection refused | netstat, ss, nmap | Check services, firewall |
| Application | Service errors, timeouts | curl, telnet, logs | Fix app config, dependencies |
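Working up the layers in order means the first failing check names the likely fault. The sketch below encodes that ordering; `classify_network` and `probe` are hypothetical helpers, and the four checks (link, gateway, public IP, DNS) are one reasonable sequence rather than the only one.

```shell
# classify_network is a hypothetical helper. Each argument is "ok" or "fail":
#   $1 link state, $2 ping to gateway, $3 ping to a public IP, $4 DNS lookup
classify_network() {
    if [ "$1" != ok ]; then echo "physical: check cable, NIC, driver"
    elif [ "$2" != ok ]; then echo "local network: check IP address and gateway"
    elif [ "$3" != ok ]; then echo "routing or firewall: check routes and upstream"
    elif [ "$4" != ok ]; then echo "dns: check resolver configuration"
    else echo "network path looks healthy: check the application"
    fi
}

# probe runs a command quietly and prints ok or fail.
probe() {
    if "$@" > /dev/null 2>&1; then echo ok; else echo fail; fi
}

# Usage on a live system (ens33 is an example interface name):
# gw=$(ip route | awk '/^default/ { print $3; exit }')
# classify_network \
#     "$(probe sh -c 'ip link show ens33 | grep -q "state UP"')" \
#     "$(probe ping -c 1 -W 2 "$gw")" \
#     "$(probe ping -c 1 -W 2 8.8.8.8)" \
#     "$(probe getent hosts google.com)"
```

If pinging a public IP succeeds but the name lookup fails, the result points at DNS; if the gateway answers but the public IP does not, routing or a firewall is the more likely cause.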

Troubleshooting Service Issues

1. Service Status Investigation
# Service status checking
sudo systemctl status apache2
sudo systemctl is-enabled apache2
sudo systemctl is-active apache2

# Service logs analysis
sudo journalctl -u apache2 --since="1 hour ago"
sudo tail -f /var/log/apache2/error.log

# Check failed services
sudo systemctl --failed
sudo journalctl -p 3 -xb # Error priority logs
2. Service Dependency Checking
# Service dependencies
systemctl list-dependencies apache2
systemctl list-dependencies apache2 --reverse

# Declared conflicts, enabled unit files, and failed units
systemctl show apache2 -p Conflicts
systemctl list-unit-files --state=enabled
systemctl list-units --state=failed

# Service resource usage
systemd-cgtop
systemctl show apache2 -p MemoryCurrent,CPUUsageNSec
3. Configuration Validation
# Syntax checks for various services
sudo apache2ctl configtest
sudo nginx -t
sudo sshd -t
sudo named-checkconf

# Compare against a backup
# (assumes apache2.conf was copied to this path beforehand)
diff /etc/apache2/apache2.conf /troubleshooting/backups/apache2.conf.backup

# Check file permissions
namei -l /etc/apache2/apache2.conf
ls -la /etc/apache2/
4. Service Recovery Procedures
| Issue | Recovery Steps | Verification | Prevention |
|-------|----------------|--------------|------------|
| Service crash | Restart service, check logs, fix config | Service status, functionality test | Monitoring, resource limits |
| Dependency failure | Check dependent services, restart chain | All services running, dependencies met | Proper service ordering |
| Configuration error | Validate config, restore backup, test | Config test, service start | Config management, testing |
| Resource exhaustion | Increase limits, optimize, add resources | Resource monitoring, performance | Capacity planning, monitoring |
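For configuration errors, the recovery steps amount to "validate first, restart only if valid". A minimal sketch of that guard follows; `restart_if_valid` is a hypothetical helper that takes the check and restart commands as plain strings, so it makes no assumption about which service is involved.

```shell
# restart_if_valid is a hypothetical helper. The first argument is a config
# check command, the second a restart command; both are plain strings.
restart_if_valid() {
    check_cmd=$1
    restart_cmd=$2
    if sh -c "$check_cmd" > /dev/null 2>&1; then
        sh -c "$restart_cmd" && echo "restarted"
    else
        echo "config check failed; service left untouched" >&2
        return 1
    fi
}

# Usage on a live system (as root):
# restart_if_valid "apache2ctl configtest" "systemctl restart apache2"
```

Keeping the running service untouched when the check fails avoids turning a bad edit into an outage.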

Security Incident Response

1. Unauthorized Access Detection
# Check login attempts
sudo lastb -a
sudo grep "Failed password" /var/log/auth.log
sudo fail2ban-client status

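# Check for suspicious processes (-x matches the exact process name,
# which avoids false hits such as "sync" matching "nc")
pgrep -ax 'curl|wget|nc|netcat|telnet'
sudo lsof -i | grep ESTABLISHED

# Check cron jobs for suspicious entries
# (under sudo, crontab -l shows root's crontab; use -u USER for other accounts)
sudo crontab -l
sudo ls -la /etc/cron.*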
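A raw list of "Failed password" lines is hard to read during an incident; grouping them by source address shows at a glance who is hammering the host. This sketch assumes the common OpenSSH log wording and IPv4 addresses only; `failed_logins_by_ip` is a hypothetical helper name.

```shell
# failed_logins_by_ip is a hypothetical helper. It reads sshd auth log lines
# on stdin and prints "count ip" for each source address, busiest first.
failed_logins_by_ip() {
    grep 'Failed password' \
        | grep -oE 'from [0-9]+\.[0-9]+\.[0-9]+\.[0-9]+' \
        | awk '{ count[$2]++ } END { for (ip in count) print count[ip], ip }' \
        | sort -k1,1nr -k2,2
}
```

On a live system, `sudo cat /var/log/auth.log | failed_logins_by_ip | head` would list the most active sources first.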
2. Malware Scanning
# Install and run a malware scanner
sudo apt install clamav -y
sudo freshclam # Update virus database
sudo clamscan -r -i /home/

# Rootkit detection (install first: sudo apt install chkrootkit rkhunter)
sudo chkrootkit
sudo rkhunter --check

# Check for suspicious files
sudo find / -name "*.php" -mtime -1 2>/dev/null
sudo find / -name ".ssh" -type d 2>/dev/null
3. Forensic Analysis
# File integrity checking
# (both tools must be installed and have an initialized baseline database)
sudo aide --check
sudo tripwire --check

# Timeline analysis
sudo find / -mtime -1 -type f -exec ls -la {} \; 2>/dev/null | head -20

# Network connection analysis
sudo netstat -tulnp | grep -v 127.0.0.1
sudo ss -tulnp | grep -v 127.0.0.1
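The idea behind file integrity checking can be illustrated with plain checksums: record a baseline while the system is known to be good, then compare against it later. The sketch below is a teaching aid and not a substitute for AIDE or Tripwire; `make_baseline` and `changed_files` are hypothetical helpers, the baseline file should live outside the directory being fingerprinted, and newly created files are not detected.

```shell
# Hypothetical helpers: record SHA-256 checksums for a directory tree, then
# report files whose contents no longer match the recorded baseline.
make_baseline() {
    # $1 = directory to fingerprint, $2 = baseline file to write
    find "$1" -type f -exec sha256sum {} + | sort -k 2 > "$2"
}

changed_files() {
    # $1 = baseline file; prints one path per modified or missing file.
    # LC_ALL=C keeps the "OK" marker untranslated.
    LC_ALL=C sha256sum -c "$1" 2>/dev/null | awk -F ': ' '$NF != "OK" { print $1 }'
}
```

A run might look like `make_baseline /etc /troubleshooting/backups/etc.sha256` on a clean system, followed later by `changed_files /troubleshooting/backups/etc.sha256` during an investigation.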
4. Incident Response Checklist
| Phase | Actions | Tools | Documentation |
|-------|---------|-------|---------------|
| Preparation | Backup systems, establish procedures | Backup tools, documentation | Incident response plan |
| Identification | Detect incident, assess impact | Monitoring, log analysis | Incident report |
| Containment | Isolate systems, prevent spread | Firewall, network isolation | Containment actions |
| Eradication | Remove threat, restore systems | Malware scanners, backups | Remediation steps |
| Recovery | Restore operations, verify systems | Backup restoration, testing | Recovery validation |
| Lessons Learned | Analyze incident, improve processes | Post-mortem analysis | Improvement plan |

Automated Troubleshooting Scripts

1. System Health Check Script
sudo tee /troubleshooting/scripts/health_check.sh > /dev/null << 'EOF'
#!/bin/bash
# Comprehensive System Health Check
# Run with sudo: the log directory is root-owned and netstat -p needs root

LOG_FILE="/troubleshooting/logs/health_$(date +%Y%m%d_%H%M%S).log"

{
    echo "=== SYSTEM HEALTH CHECK ==="
    echo "Date: $(date)"
    echo "Hostname: $(hostname)"
    echo ""

    # CPU and Load
    echo "=== CPU AND LOAD ==="
    uptime
    top -bn1 | head -5
    echo ""

    # Memory
    echo "=== MEMORY USAGE ==="
    free -h
    echo ""

    # Disk
    echo "=== DISK USAGE ==="
    df -h
    echo ""

    # Services
    echo "=== SERVICE STATUS ==="
    systemctl list-units --state=failed
    echo ""

    # Network (-l already limits the output to listening sockets)
    echo "=== NETWORK CONNECTIONS ==="
    netstat -tulnp
    echo ""
} > "$LOG_FILE"

echo "Health check completed: $LOG_FILE"
EOF

sudo chmod +x /troubleshooting/scripts/health_check.sh
2. Incident Response Script
sudo tee /troubleshooting/scripts/incident_response.sh > /dev/null << 'EOF'
#!/bin/bash
# Basic Incident Response Script
# Run with sudo: iptables, lsof, and the auth log need root access

INCIDENT_DIR="/troubleshooting/incident_$(date +%Y%m%d_%H%M%S)"
mkdir -p "$INCIDENT_DIR"

# System snapshot
date > "$INCIDENT_DIR/timestamp.txt"
uname -a > "$INCIDENT_DIR/system_info.txt"

# Process information
ps aux > "$INCIDENT_DIR/processes.txt"
lsof > "$INCIDENT_DIR/open_files.txt" 2>/dev/null

# Network information
netstat -tulnp > "$INCIDENT_DIR/network_connections.txt"
iptables -L -n > "$INCIDENT_DIR/firewall_rules.txt"

# Log information
tail -100 /var/log/auth.log > "$INCIDENT_DIR/auth_log.txt"
tail -100 /var/log/syslog > "$INCIDENT_DIR/syslog.txt"

echo "Incident data collected: $INCIDENT_DIR"
EOF

sudo chmod +x /troubleshooting/scripts/incident_response.sh

Documentation and Reporting

1. Troubleshooting Report Template
sudo tee /troubleshooting/scripts/report_template.md > /dev/null << 'EOF'
# Troubleshooting Report

## Incident Summary
- **Date**:
- **Time**:
- **System**:
- **Reported Issue**:

## Investigation
### Symptoms
-

### Data Collected
-

### Analysis
-

## Resolution
### Steps Taken
1.
2.
3.

### Results
-

## Prevention
### Recommendations
-

## Follow-up Actions
-
EOF
2. Troubleshooting Log Template
| Time | Action Taken | Results | Next Steps |
|------|--------------|---------|------------|
| 14:00 | Checked system load | Load avg: 5.2, high | Investigate processes |
| 14:05 | Identified process | MySQL using 90% CPU | Check MySQL queries |
| 14:10 | Analyzed queries | Slow query detected | Optimize query |
| 14:20 | Optimized query | CPU usage dropped to 40% | Monitor system |
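Keeping this log while working is easier if each entry is one command. A possible sketch: the hypothetical helper `log_action` appends a row with the current time and the three remaining columns, separated by " | " so the file can be pasted into a table later. The log path is whatever the caller passes in.

```shell
# log_action is a hypothetical helper that appends one row to a troubleshooting
# log: time, action taken, result, and next step, separated by " | ".
log_action() {
    log_file=$1
    action=$2
    result=$3
    next_step=$4
    printf '%s | %s | %s | %s\n' "$(date +%H:%M)" "$action" "$result" "$next_step" >> "$log_file"
}
```

For example, `log_action ~/incident.log "Checked system load" "Load avg: 5.2, high" "Investigate processes"` would add the first row of the sample log with the current time.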

Assignments and Evaluation

  1. Explain the systematic steps to take when a system fails to boot.
  2. How do you distinguish between network problems caused by DNS, the firewall, or physical connectivity?
  3. Which tools are most effective for identifying a memory leak in a process?
  4. What is the proper incident response procedure when suspicious activity is detected on a system?
  5. Scenario: a web server suddenly responds very slowly. Write out the troubleshooting steps you would take.

Case Study: Database Performance Troubleshooting

#!/bin/bash
# Database Performance Troubleshooting Script

echo "Starting database performance investigation..."

# 1. Check system resources
echo "=== SYSTEM RESOURCES ==="
top -bn1 | head -10
free -h

# 2. Check database processes
# (the bracketed first letter stops grep from matching its own command line)
echo "=== DATABASE PROCESSES ==="
ps aux | grep -E "[m]ysql|[p]ostgres" | head -10

# 3. Check disk I/O
echo "=== DISK I/O ==="
iostat -x 1 3

# 4. Check database listening ports and client connections
# (-a includes established connections, not only listening sockets)
echo "=== DATABASE CONNECTIONS ==="
netstat -tanp | grep -E ":(3306|5432) "

# 5. Check database logs
echo "=== DATABASE LOGS ==="
tail -20 /var/log/mysql/error.log 2>/dev/null || tail -20 /var/log/postgresql/postgresql-*.log 2>/dev/null

echo "Initial investigation completed. Review outputs for further analysis."