Session 4: Process Management & System Monitoring

Master the process control, system monitoring, and resource management skills needed to maintain robust IoT systems. Understanding process management is critical for optimizing performance, troubleshooting issues, and keeping IoT applications running reliably in production.

Duration: 2 hours
Module: Linux Basics
Session: 4 of 5
Difficulty: Intermediate

Session Learning Objectives

By the end of this session, you will be able to:

Monitor System Performance

Use ps, top, htop, and system monitoring tools to analyze running processes and identify performance bottlenecks in IoT systems.

Control Process Lifecycle

Start, stop, suspend, and manage processes including background services, daemon management, and process prioritization.

Optimize Resource Usage

Understand CPU, memory, and I/O usage patterns to optimize IoT applications for resource-constrained environments.

Implement Process Automation

Create automated process management solutions for production IoT deployments including monitoring and recovery systems.

1. Process Monitoring - Your System Observatory

Understanding Processes in IoT Systems

Processes are running programs that consume system resources. In IoT environments, efficient process monitoring is crucial because devices often have limited CPU, memory, and power. Understanding process behavior helps optimize performance and troubleshoot issues.

Why Process Monitoring Matters for IoT: IoT devices run continuously, often unattended. A runaway process can drain battery, cause system crashes, or degrade performance. Monitoring helps identify resource hogs, detect security issues, and ensure system stability.

Process States and Their Significance

Running (R): Process is actively using the CPU or is ready to run. Watch for processes consuming too much CPU time.
Sleeping (S): Process is waiting for a resource or event. This is the normal state for most IoT services waiting for sensor data.
Uninterruptible sleep (D): Process is waiting for an I/O operation to complete. Many processes in this state indicate storage or network issues.
Zombie (Z): Dead process waiting for its parent to collect its exit status. Persistent zombies indicate a bug in the parent IoT application.
Stopped (T): Process suspended by a signal. Useful for debugging IoT applications.
Idle (I): Kernel thread in idle state. Normal for system maintenance threads.
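
To see how these states are distributed on a device, the state letters can be tallied directly. The following is a minimal sketch, assuming a procps-style ps; count_states is a hypothetical helper name, not a standard tool.

```shell
#!/bin/sh
# Tally process states from a list of STAT values, one per line.
# Only the first character is the state; later characters (s, l, +, <, N)
# are modifiers and are ignored here.
count_states() {
    awk 'NF { n[substr($1, 1, 1)]++ } END { for (s in n) print s "=" n[s] }' | sort
}

# Live usage: "stat=" prints the STAT column without a header line.
ps -eo stat= | count_states
```

A steadily growing Z or D count between runs is usually the signal worth investigating.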

Process Viewing with ps - Your Process Inspector

The ps command provides detailed information about running processes. It's essential for troubleshooting, performance analysis, and security monitoring in IoT systems.

# Basic process listing commands
ps                              # Show processes for current user and terminal
ps -u iot                       # Show processes for specific user (iot)
ps -e                           # Show all processes on the system
ps -f                           # Full format listing with detailed information
ps -ef                          # All processes with full format (most comprehensive)

# Advanced process information display
ps aux                          # All processes with detailed resource usage
ps aux | head -20               # Header plus first 19 processes (not sorted by resource usage; see --sort below)
ps aux | grep python            # Find all Python processes
ps aux | grep -E "(iot|sensor|mqtt)"  # Find IoT-related processes

# Custom column formatting for specific information
ps -eo pid,ppid,cmd,%mem,%cpu   # Show PID, parent PID, command, memory %, CPU %
ps -eo pid,user,cmd,start       # Show PID, user, command, and start time
ps -eo pid,cmd,etime            # Show PID, command, and elapsed time

# Process hierarchy and relationships
ps -ef --forest                # Show process tree (parent-child relationships)
ps -ejH                         # Alternative tree view
pstree                          # Visual process tree (if installed)
pstree -p                       # Process tree with PIDs
pstree iot                      # Process tree for specific user

# IoT-specific process monitoring examples
ps aux | grep mosquitto         # Check MQTT broker status
ps aux | grep node              # Find Node.js IoT applications
ps aux | grep python | grep sensor  # Find Python sensor scripts
ps -C systemd --no-headers | wc -l  # Count systemd processes
ps -eo pid,cmd | grep -E "(sensor|iot|mqtt|gpio)"  # Find hardware-related processes

# Process resource analysis for IoT optimization
ps aux --sort=-%cpu | head -10  # Top 10 CPU-consuming processes
ps aux --sort=-%mem | head -10  # Top 10 memory-consuming processes
ps aux | awk '{sum+=$6} END {print "Total memory usage: " sum/1024 " MB"}'  # Total memory usage

# Monitoring specific IoT services
ps -fp $(pgrep -d, mosquitto)   # Detailed info for MQTT broker
ps -fp $(pgrep -d, -f "sensor") # Detailed info for sensor processes
watch "ps aux | grep iot"       # Continuously monitor IoT processes

Understanding ps Output Columns

PID: Process ID, a unique identifier for each running process
PPID: Parent Process ID, the process that started this one
%CPU: CPU usage percentage, important for performance monitoring
%MEM: Memory usage percentage, critical for resource-constrained IoT devices
VSZ: Virtual memory size, the total address space reserved by the process (in KiB)
RSS: Resident Set Size, the physical memory actually in use (in KiB)
TTY: Terminal associated with the process (? means no terminal)
STAT: Process state (R=running, S=sleeping, Z=zombie, D=uninterruptible)
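
The PID and RSS columns can be combined to flag processes that exceed a memory budget. This is a sketch under stated assumptions: over_rss_limit is a hypothetical helper, and the 51200 KiB (50 MB) limit is an arbitrary example value.

```shell
#!/bin/sh
# Print PID, RSS in MB, and command name for rows whose RSS exceeds a limit.
# Input: "pid rss comm" rows on stdin (RSS in KiB, as ps reports it).
# Only the first word of the command name is printed.
over_rss_limit() {
    awk -v limit="$1" '$2 > limit { printf "%s %.1fMB %s\n", $1, $2 / 1024, $3 }'
}

# Live usage: a trailing "=" on each column suppresses the header line.
ps -e -o pid= -o rss= -o comm= | over_rss_limit 51200
```

Reading rows from stdin means the same function also works on a saved ps snapshot from a remote device.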

Real-time Monitoring with top and htop - Your System Dashboard

Real-time process monitoring is essential for IoT systems that need to maintain consistent performance. These tools help identify performance bottlenecks and resource issues as they occur.

# top command - built-in real-time process monitor
top                             # Real-time process viewer (press 'q' to quit)
top -u iot                      # Show processes for specific user only
top -p 1234,5678               # Monitor specific PIDs
top -d 5                        # Update every 5 seconds (default is 3)
top -b -n 1                     # Batch mode, single snapshot (good for scripts)

# Interactive top commands (while top is running):
# q     - quit top
# k     - kill process (enter PID when prompted)
# r     - renice process (change priority)
# M     - sort by memory usage (useful for IoT memory analysis)
# P     - sort by CPU usage (default)
# T     - sort by cumulative CPU time (TIME+ column)
# u     - filter by username
# 1     - show individual CPU cores (useful for multi-core IoT devices)
# f     - add/remove display fields
# W     - save current configuration

# htop command - enhanced process monitor (install with: sudo apt install htop)
htop                            # Better interface than top
htop -u iot                     # Show processes for IoT user
htop -d 50                      # Update every 5 seconds (delay in tenths)

# htop advantages for IoT monitoring:
# - Color-coded display for easy reading
# - Mouse support for interaction
# - Easy process killing with F9
# - Process tree view with F5
# - Search functionality with F3
# - Better memory and CPU visualization
# - More intuitive interface for beginners

# System resource monitoring for IoT devices
top -b -n 1 | head -20          # System overview snapshot
top -b -n 1 | grep "load average"  # System load information
uptime                          # Quick system load and uptime

# Continuous monitoring for IoT troubleshooting
watch "ps aux | grep -E '(iot|sensor|mqtt)' | head -10"  # Monitor IoT processes
watch "free -h"                 # Monitor memory usage
watch "df -h"                   # Monitor disk space

# IoT-specific monitoring scenarios
# Monitor MQTT broker performance
top -p "$(pgrep -d, mosquitto)"  # Monitor MQTT broker specifically (-d, joins multiple PIDs with commas)
# Monitor sensor data collection processes
htop -F "sensor"                # Filter for sensor processes in htop
# System health check for IoT gateway
top -b -n 1 | awk -F'load average: ' '/load average/ {print "Load: " $2}'  # Split on the label so the uptime format cannot shift fields
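
A raw load average is easier to judge once it is related to the number of cores. The sketch below assumes a Linux /proc filesystem and the coreutils nproc command; load_percent is a hypothetical helper name.

```shell
#!/bin/sh
# Express a load average as a percentage of the available CPU cores.
# load_percent LOAD CORES  ->  prints a rounded integer percentage
load_percent() {
    awk -v l="$1" -v c="$2" 'BEGIN { printf "%.0f\n", (l / c) * 100 }'
}

# Live usage on Linux: the first field of /proc/loadavg is the 1-minute average.
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    read -r load1 rest < /proc/loadavg
    echo "1-minute load is $(load_percent "$load1" "$(nproc)")% of CPU capacity"
fi
```

On a single-core board a load of 1.50 reads as 150%, while the same load on a quad-core gateway is well under capacity.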

System Resource Analysis - Understanding Your IoT Device Limits

IoT devices often operate with limited resources. Understanding system resource usage helps optimize performance and prevent system failures.

# Memory analysis - critical for resource-constrained IoT devices
free -h                         # Human-readable memory information
free -m                         # Memory usage in megabytes
cat /proc/meminfo               # Detailed memory statistics
cat /proc/meminfo | grep -E "(MemTotal|MemFree|MemAvailable)"  # Key memory metrics

# CPU information and performance
cat /proc/cpuinfo               # Detailed CPU information
lscpu                           # CPU architecture summary
nproc                           # Number of processing units
cat /proc/loadavg               # Current system load

# Disk usage monitoring
df -h                           # Disk space usage (human-readable)
df -i                           # Inode usage (file count limits)
du -sh /var/log                 # Directory size (logs can grow large)
du -h --max-depth=1 /home       # Subdirectory sizes in home

# System performance metrics
uptime                          # System uptime and load averages
w                               # Who is logged in and system load
vmstat 1 5                      # Virtual memory statistics (1 sec intervals, 5 times)
iostat 1 5                      # I/O statistics (if sysstat package installed)

# IoT-specific resource monitoring
cat /proc/version               # Kernel version information
cat /proc/stat                  # System and CPU statistics
ls /sys/class/thermal/          # Available temperature sensors
cat /sys/class/thermal/thermal_zone0/temp  # CPU temperature (in millidegrees)

# Network resource monitoring for IoT devices
netstat -tuln                   # Network connections and listening ports
ss -tuln                        # Modern replacement for netstat
cat /proc/net/dev               # Network interface statistics

# Process resource consumption analysis
ps aux --sort=-%cpu | head -10  # Top CPU consumers
ps aux --sort=-%mem | head -10  # Top memory consumers
ps aux | awk '{sum+=$6} END {print "Total RSS: " sum/1024 " MB"}'  # Total memory usage

# IoT device health monitoring script example
#!/bin/bash
echo "=== IoT Device Health Check ==="
echo "Uptime: $(uptime)"
echo "Memory: $(free -h | grep Mem)"
echo "Disk: $(df -h / | tail -1)"
echo "Load: $(cat /proc/loadavg)"
echo "Temperature: $(($(cat /sys/class/thermal/thermal_zone0/temp 2>/dev/null || echo 0)/1000))°C"
echo "IoT Processes: $(pgrep -fc 'iot|sensor|mqtt')"  # pgrep does not count itself, unlike ps | grep
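
For a single memory figure, MemAvailable from /proc/meminfo is generally a better basis than MemFree, because it accounts for cache the kernel can reclaim. A minimal sketch, with mem_used_percent as a hypothetical helper name:

```shell
#!/bin/sh
# Percentage of memory in use, computed from MemTotal and MemAvailable.
# Reads /proc/meminfo-style text on stdin so it can be tested with sample data.
mem_used_percent() {
    awk '/^MemTotal:/     { total = $2 }
         /^MemAvailable:/ { avail = $2 }
         END { if (total > 0) printf "%.0f\n", (total - avail) / total * 100 }'
}

# Live usage on Linux:
if [ -r /proc/meminfo ]; then
    echo "Memory in use: $(mem_used_percent < /proc/meminfo)%"
fi
```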

2. Process Control and Management - Your System Command Center

Process Termination - Controlled Shutdown Procedures

Proper process termination is crucial in IoT systems to prevent data loss, ensure clean shutdowns, and maintain system stability. Understanding different termination methods helps you handle both routine operations and emergency situations.

# Basic process termination commands
kill PID                        # Send TERM signal (graceful termination)
kill -9 PID                     # Send KILL signal (force termination)
kill -TERM PID                  # Explicit graceful termination
kill -HUP PID                   # Hang up signal (often reloads configuration)
kill -INT PID                   # Interrupt signal (equivalent to Ctrl+C)

# Killing processes by name or pattern
killall process_name            # Kill all processes with specific name
killall mosquitto               # Kill all MQTT broker processes
pkill pattern                   # Kill processes matching pattern
pkill -f "python.*sensor"      # Kill Python processes containing "sensor"
pkill -u iot                    # Kill all processes owned by 'iot' user

# Signal types and their purposes
kill -l                         # List all available signals
# Common signals for IoT process management
# (numbers shown are for x86 and ARM Linux; run kill -l to confirm on other architectures):
# TERM (15) - Graceful termination (default, allows cleanup)
# KILL (9)  - Force kill (cannot be caught or ignored)
# HUP (1)   - Hang up (often triggers config reload)
# INT (2)   - Interrupt (Ctrl+C equivalent)
# STOP (19) - Pause process (cannot be caught)
# CONT (18) - Continue paused process
# USR1 (10) - User-defined signal 1 (application-specific)
# USR2 (12) - User-defined signal 2 (application-specific)

# Professional IoT process management examples
# Reload MQTT broker configuration without restarting it
sudo kill -HUP $(pgrep mosquitto)              # Reload configuration
sudo systemctl reload mosquitto                # Alternative using systemd

# Stop sensor data collection safely
kill -TERM $(pgrep -f "sensor_collector")      # Allow data to be saved
sleep 5                                         # Wait for graceful shutdown
kill -9 $(pgrep -f "sensor_collector") 2>/dev/null  # Force kill if still running

# Emergency process cleanup
sudo pkill -9 -f "runaway_process"             # Force kill problematic process
sudo pkill -TERM -u iot                        # Gracefully stop all IoT user processes

# Process management with PID files (common in IoT daemons)
# Many IoT services store their PID in files for management
kill -HUP $(cat /var/run/iot_daemon.pid)       # Reload daemon using PID file
kill -TERM $(cat /run/sensor_service.pid)      # Stop service using PID file

# Monitoring process termination
kill -TERM $PID && echo "Termination signal sent"
while kill -0 $PID 2>/dev/null; do             # Check if process still exists
    echo "Waiting for process to terminate..."
    sleep 1
done
echo "Process terminated successfully"
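
The TERM-then-KILL pattern shown in this section can be wrapped in a function with a timeout, so a process that ignores SIGTERM cannot block a script forever. This is a sketch, not a definitive implementation; stop_gracefully is a hypothetical name and the 5-second default is an assumption.

```shell
#!/bin/sh
# Send SIGTERM, wait up to TIMEOUT seconds, then fall back to SIGKILL.
# stop_gracefully PID [TIMEOUT]
# Returns 0 if the process is gone without needing SIGKILL, 1 if SIGKILL was sent.
stop_gracefully() {
    pid=$1
    timeout=${2:-5}
    kill -TERM "$pid" 2>/dev/null || return 0    # no such process: nothing to do
    while [ "$timeout" -gt 0 ]; do
        kill -0 "$pid" 2>/dev/null || return 0   # exited after SIGTERM
        sleep 1
        timeout=$((timeout - 1))
    done
    kill -0 "$pid" 2>/dev/null || return 0
    kill -KILL "$pid" 2>/dev/null                # last resort: cannot be caught
    return 1
}
```

A call such as stop_gracefully "$(cat /run/sensor_service.pid)" 10 gives the service ten seconds to flush its data before it is forced down.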

Background and Foreground Process Management - Multitasking Mastery

IoT systems often need to run multiple processes simultaneously - sensor monitoring, data processing, network communication, and user interfaces. Understanding process execution modes is essential for efficient system operation.

# Running processes in background
command &                       # Start command in background
python3 sensor_monitor.py &     # Run sensor monitoring in background
nohup command &                 # Run command immune to hangups (terminal closure)
nohup python3 data_logger.py > /dev/null 2>&1 &  # Silent background execution

# Professional background execution for IoT
nohup python3 /opt/iot/sensor_reader.py > /var/log/iot/sensor.log 2>&1 &
nohup mosquitto -c /etc/mosquitto/mosquitto.conf > /var/log/iot/mqtt.log 2>&1 &
nohup node /opt/iot/dashboard/server.js > /var/log/iot/dashboard.log 2>&1 &

# Job control commands
jobs                            # List active background jobs
jobs -l                         # List jobs with process IDs
fg                              # Bring last background job to foreground
fg %1                           # Bring job number 1 to foreground
bg                              # Send last suspended job to background
bg %2                           # Send job number 2 to background

# Process suspension and control
Ctrl+Z                          # Suspend current foreground process
Ctrl+C                          # Interrupt (terminate) current foreground process
kill -STOP PID                  # Suspend process by PID
kill -CONT PID                  # Resume suspended process by PID

# Advanced background process management
disown %1                       # Remove job from shell's job table
disown -a                       # Remove all jobs from job table
disown -h %1                    # Mark job to not receive SIGHUP

# Screen and tmux for persistent sessions (essential for IoT)
screen -S iot_monitor python3 sensor_monitor.py    # Run in named screen session
screen -ls                      # List screen sessions
screen -r iot_monitor           # Reattach to screen session

tmux new-session -d -s iot_dashboard "python3 /opt/iot/dashboard.py"  # Tmux session
tmux list-sessions              # List tmux sessions
tmux attach-session -t iot_dashboard  # Attach to tmux session

# IoT daemon management examples
# Start IoT services that survive terminal disconnection
nohup /opt/iot/bin/temperature_service &
echo $! > /var/run/temperature_service.pid  # Save PID for later management

# Monitor background IoT processes
jobs -l | grep -E "(sensor|iot|mqtt)"      # List IoT-related background jobs
ps aux | grep -E "(sensor|iot|mqtt)" | grep -v grep  # Check running IoT processes

# Automated background process management
#!/bin/bash
# IoT service startup script
start_iot_services() {
    echo "Starting IoT services..."
    nohup python3 /opt/iot/sensor_collector.py > /var/log/iot/collector.log 2>&1 &
    echo $! > /var/run/iot_collector.pid
    
    nohup python3 /opt/iot/data_processor.py > /var/log/iot/processor.log 2>&1 &
    echo $! > /var/run/iot_processor.pid
    
    echo "IoT services started successfully"
}

# Process monitoring and restart
monitor_process() {
    local pid_file=$1
    local command=$2
    
    if [ -f "$pid_file" ]; then
        local pid=$(cat "$pid_file")
        if ! kill -0 "$pid" 2>/dev/null; then
            echo "Process died, restarting..."
            nohup $command >> /var/log/iot/restart.log 2>&1 &  # Append so output from earlier restarts is kept
            echo $! > "$pid_file"
        fi
    fi
}
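
One weakness of PID files is that a stale file may name a PID that the kernel has since reused for an unrelated program. A check along the following lines can guard against that before a signal is sent. It is a sketch: pid_file_matches is a hypothetical helper, and it compares only the short command name reported by ps.

```shell
#!/bin/sh
# Succeed only if PID_FILE names a live process whose command name is EXPECTED.
# pid_file_matches PID_FILE EXPECTED
pid_file_matches() {
    pid_file=$1
    expected=$2
    [ -r "$pid_file" ] || return 1
    pid=$(cat "$pid_file")
    case $pid in
        ''|*[!0-9]*) return 1 ;;                 # empty or non-numeric content
    esac
    actual=$(ps -o comm= -p "$pid" 2>/dev/null) || return 1
    [ "$actual" = "$expected" ]
}
```

For example, pid_file_matches /var/run/iot_collector.pid python3 && kill -TERM "$(cat /var/run/iot_collector.pid)" only signals the process if it still looks like the collector.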

Process Priorities and Resource Management - Optimization for IoT

IoT devices often have limited CPU resources that must be shared among multiple processes. Understanding process priorities helps ensure critical processes get the resources they need while preventing less important tasks from degrading system performance.

# Understanding nice values and process priorities
# Nice values range: -20 (highest priority) to +19 (lowest priority)
# Default nice value: 0
# Only root can set negative values (higher priority)
# Lower nice value = higher priority = more CPU time

# Starting processes with specific priority
nice -n 10 backup_script.sh     # Start with lower priority (nice value 10)
nice -n -5 critical_iot_daemon   # Start with higher priority (root only)
nice --adjustment=5 data_processor.py  # Alternative syntax

# Changing priority of running processes
renice 10 PID                    # Change process priority to nice value 10
renice -n -5 PID                 # Set higher priority (root only)
renice +5 -u iot                 # Change priority for all 'iot' user processes
renice 0 -g PGID                 # Reset priority for a process group (numeric PGID, not a user group name)

# Viewing process priorities
ps -eo pid,ni,cmd                # Show PID, nice value, and command
ps -o pid,ni,cmd -p PID          # Check a specific process's nice value (ps aux has no NI column)
top                              # Shows NI column for nice values
htop                             # Color-coded priority display

# IoT priority management strategies and examples
# High priority for critical IoT processes
sudo renice -10 $(pgrep mosquitto)          # High priority MQTT broker
sudo renice -5 $(pgrep -f "critical_sensor") # Critical sensor monitoring

# Medium priority for normal IoT operations
nice -n 0 python3 /opt/iot/sensor_reader.py &    # Default priority
renice 5 $(pgrep -f "data_logger")               # Slightly lower priority logging

# Low priority for background maintenance
nice -n 19 /opt/iot/scripts/cleanup.sh &        # Lowest priority cleanup
nice -n 15 rsync -av /data/ /backup/ &          # Low priority backup
nice -n 10 /opt/iot/scripts/log_rotation.sh &   # Low priority maintenance

# Real-world IoT priority scenarios
# Emergency response system
sudo nice -n -10 /opt/iot/emergency_monitor &   # Highest priority emergency monitoring
nice -n 5 /opt/iot/routine_sensors.py &         # Lower priority routine sensors

# Battery-powered IoT device optimization
nice -n 10 /opt/iot/power_saving_mode.py &      # Lower priority so it yields the CPU to other work
sudo renice -5 $(pgrep -f "battery_monitor")    # High priority battery monitoring

# Industrial IoT system
sudo renice -15 $(pgrep -f "safety_system")     # Critical safety monitoring
renice 0 $(pgrep -f "production_monitor")       # Normal priority production
nice -n 15 /opt/iot/analytics/report_generator.py &  # Low priority analytics

# Monitoring and adjusting priorities dynamically
#!/bin/bash
# Dynamic priority adjustment based on system load
adjust_priorities() {
    local load=$(cut -d' ' -f1 /proc/loadavg)
    local load_int=$(awk -v l="$load" 'BEGIN { printf "%d", l * 100 + 0.5 }')  # awk avoids a dependency on bc
    
    if [ $load_int -gt 150 ]; then  # Load > 1.5
        echo "High load detected, adjusting priorities..."
        renice 10 $(pgrep -f "backup")      # Lower backup priority
        renice 15 $(pgrep -f "analytics")   # Lower analytics priority
        sudo renice -5 $(pgrep -f "critical")  # Raise critical process priority
    fi
}

# Process resource limiting (advanced topic)
# Using systemd for IoT service resource management
# /etc/systemd/system/iot-sensor.service
[Unit]
Description=IoT Sensor Service
After=network.target

[Service]
Type=simple
User=iot
ExecStart=/opt/iot/bin/sensor_service
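# Set nice value (systemd does not allow a comment after a value on the same line)
Nice=5
# Limit to 50% of one CPU
CPUQuota=50%
# Limit memory usage (MemoryMax= replaces the deprecated MemoryLimit=)
MemoryMax=128M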
Restart=always

[Install]
WantedBy=multi-user.target

# Monitoring resource usage and priorities
watch "ps aux --sort=-%cpu | head -10"          # Monitor CPU usage
watch "ps -eo pid,ni,pcpu,pmem,cmd --sort=-pcpu | head -15"  # Monitor with priorities
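
The nice-value rules in this section can be checked without starting a real workload: when nice is run with no command, it prints the niceness it is running at, and nice -n N adds N to the inherited value. A minimal sketch, assuming the coreutils or BusyBox nice:

```shell
#!/bin/sh
# "nice" with no command prints the current niceness.
base=$(nice)

# "nice -n 5" starts its command 5 steps lower in priority than the caller,
# so the inner "nice" reports base + 5 (capped at 19).
lowered=$(nice -n 5 nice)

echo "current niceness: $base, under nice -n 5: $lowered"
```

Because the adjustment is relative, a script that is itself started with nice -n 10 launches its own nice -n 5 children at 15, not 5.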

3. System Services and Daemon Management - Production IoT Operations

Understanding systemd - The Modern Service Manager

systemd is the standard service manager for modern Linux distributions. It's essential for managing IoT services, ensuring they start automatically, restart on failure, and integrate properly with the system.

# Basic systemd service management
systemctl status service_name   # Check service status
systemctl start service_name    # Start a service
systemctl stop service_name     # Stop a service
systemctl restart service_name  # Restart a service
systemctl reload service_name   # Reload service configuration
systemctl enable service_name   # Enable service to start at boot
systemctl disable service_name  # Disable service from starting at boot

# IoT service management examples
systemctl status mosquitto      # Check MQTT broker status
systemctl start iot-sensor      # Start IoT sensor service
systemctl enable iot-dashboard  # Enable dashboard to start at boot
systemctl restart NetworkManager  # Restart network management (the unit is named network-manager on some older Debian/Ubuntu releases)

# Service discovery and listing
systemctl list-units --type=service  # List all services
systemctl list-units --type=service --state=running # List running services
systemctl list-units --type=service --state=failed  # List failed services
systemctl list-unit-files --type=service | grep enabled  # List enabled services

# IoT-specific service monitoring
systemctl list-units | grep -E "(iot|sensor|mqtt)"  # Find IoT services
systemctl status --no-pager -l 'iot-*'  # Status of all IoT services (quoted so the shell does not expand the glob)
systemctl is-active mosquitto && echo "MQTT broker is running"

# Service logs and troubleshooting
journalctl -u service_name      # View service logs
journalctl -u service_name -f   # Follow service logs in real-time
journalctl -u service_name --since "1 hour ago"  # Recent logs
journalctl -u service_name --since today  # Today's logs
journalctl -u service_name -n 50  # Last 50 log entries

# IoT service debugging examples
journalctl -u iot-sensor -f     # Monitor sensor service logs
journalctl -u mosquitto --since "10 minutes ago" | grep ERROR
journalctl -u NetworkManager --since today | grep -i wifi

# Creating custom IoT service files
# Example: /etc/systemd/system/iot-temperature.service
[Unit]
Description=IoT Temperature Monitoring Service
After=network.target
Wants=network.target

[Service]
Type=simple
User=iot
Group=iot
WorkingDirectory=/opt/iot
ExecStart=/usr/bin/python3 /opt/iot/temperature_monitor.py
ExecReload=/bin/kill -HUP $MAINPID
Restart=always
RestartSec=10
StandardOutput=journal
StandardError=journal

# Resource limits for IoT devices
# MemoryMax= replaces the deprecated MemoryLimit=
MemoryMax=64M
CPUQuota=25%

# Security settings
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ReadWritePaths=/var/lib/iot /var/log/iot

[Install]
WantedBy=multi-user.target

# Service management workflow
sudo systemctl daemon-reload    # Reload systemd configuration
sudo systemctl enable iot-temperature  # Enable new service
sudo systemctl start iot-temperature   # Start the service
systemctl status iot-temperature       # Check if it's running
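
Before running daemon-reload, it can help to scan a unit file for trailing comments. systemd treats a line as a comment only when it begins with # or ;, so text placed after a value becomes part of that value. The check below is a rough heuristic, and find_inline_comments is a hypothetical helper name.

```shell
#!/bin/sh
# List "Key=value   # comment" lines in a unit file, with line numbers.
# Heuristic only: a legitimate value containing " #" is also reported.
find_inline_comments() {
    grep -nE '^[A-Za-z][A-Za-z0-9]*=.*[[:space:]]#' "$1"
}
```

Running find_inline_comments /etc/systemd/system/iot-temperature.service prints nothing and returns a non-zero status when the file is clean.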

Process Monitoring and Alerting - Proactive System Management

Proactive monitoring prevents issues before they become critical. This is especially important for IoT systems that may be deployed in remote locations where manual intervention is difficult.

# System load monitoring
uptime                          # Quick load overview
cat /proc/loadavg               # Detailed load information
w                               # Users and load
who                             # Currently logged in users

# Memory monitoring and alerts
free -h                         # Human-readable memory info
cat /proc/meminfo | grep -E "(MemTotal|MemFree|MemAvailable|SwapTotal|SwapFree)"

# Disk space monitoring
df -h                           # Disk space usage
df -i                           # Inode usage
du -sh /var/log                 # Log directory size
du -sh /opt/iot                 # IoT application size

# Network monitoring
netstat -tuln                   # Network connections
ss -tuln                        # Modern network statistics
cat /proc/net/dev               # Network interface statistics

# Temperature monitoring (important for IoT devices)
cat /sys/class/thermal/thermal_zone*/temp 2>/dev/null | while read temp; do
    echo "Temperature: $((temp/1000))°C"
done

# Comprehensive IoT system monitoring script
#!/bin/bash
# IoT System Health Monitor

LOG_FILE="/var/log/iot/system_health.log"
ALERT_EMAIL="admin@example.com"
CPU_THRESHOLD=80
MEMORY_THRESHOLD=85
DISK_THRESHOLD=90
TEMP_THRESHOLD=70

log_message() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" | tee -a "$LOG_FILE"
}

check_cpu_usage() {
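    # Total usage is 100 minus the idle figure; the second field alone is only user-space CPU
    local cpu_usage=$(top -bn1 | grep "Cpu(s)" | sed 's/.*, *\([0-9.]*\)%* id.*/\1/' | awk '{print 100 - $1}')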
    if (( $(echo "$cpu_usage > $CPU_THRESHOLD" | bc -l) )); then
        log_message "ALERT: High CPU usage: ${cpu_usage}%"
        return 1
    fi
    return 0
}

check_memory_usage() {
    local mem_usage=$(free | grep Mem | awk '{printf "%.0f", $3/$2 * 100.0}')
    if [ "$mem_usage" -gt "$MEMORY_THRESHOLD" ]; then
        log_message "ALERT: High memory usage: ${mem_usage}%"
        return 1
    fi
    return 0
}

check_disk_usage() {
    local disk_usage=$(df / | tail -1 | awk '{print $5}' | cut -d'%' -f1)
    if [ "$disk_usage" -gt "$DISK_THRESHOLD" ]; then
        log_message "ALERT: High disk usage: ${disk_usage}%"
        return 1
    fi
    return 0
}

check_temperature() {
    local temp_file="/sys/class/thermal/thermal_zone0/temp"
    if [ -f "$temp_file" ]; then
        local temp=$(($(cat "$temp_file")/1000))
        if [ "$temp" -gt "$TEMP_THRESHOLD" ]; then
            log_message "ALERT: High temperature: ${temp}°C"
            return 1
        fi
    fi
    return 0
}

check_iot_services() {
    local failed_services=()
    for service in mosquitto iot-sensor iot-dashboard; do
        if ! systemctl is-active --quiet "$service"; then
            failed_services+=("$service")
        fi
    done
    
    if [ ${#failed_services[@]} -gt 0 ]; then
        log_message "ALERT: Failed IoT services: ${failed_services[*]}"
        return 1
    fi
    return 0
}

# Main monitoring loop
main() {
    log_message "Starting system health check"
    
    local alerts=0
    check_cpu_usage || ((alerts++))
    check_memory_usage || ((alerts++))
    check_disk_usage || ((alerts++))
    check_temperature || ((alerts++))
    check_iot_services || ((alerts++))
    
    if [ "$alerts" -eq 0 ]; then
        log_message "System health check completed - All systems normal"
    else
        log_message "System health check completed - $alerts alerts generated"
    fi
}

main "$@"

# Process recovery automation
#!/bin/bash
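# IoT Process Recovery Script

LOG_FILE="/var/log/iot/system_health.log"

# Logging helper, repeated here so this script runs on its own
log_message() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" | tee -a "$LOG_FILE"
}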

recover_process() {
    local service_name=$1
    local max_attempts=3
    local attempt=1
    
    while [ $attempt -le $max_attempts ]; do
        log_message "Attempting to restart $service_name (attempt $attempt/$max_attempts)"
        
        systemctl restart "$service_name"
        sleep 10
        
        if systemctl is-active --quiet "$service_name"; then
            log_message "Successfully restarted $service_name"
            return 0
        fi
        
        ((attempt++))
    done
    
    log_message "CRITICAL: Failed to restart $service_name after $max_attempts attempts"
    return 1
}

# Monitor and recover critical IoT services
for service in mosquitto iot-sensor iot-dashboard; do
    if ! systemctl is-active --quiet "$service"; then
        log_message "Service $service is not running, attempting recovery"
        recover_process "$service"
    fi
done
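
The recovery loop above waits a fixed 10 seconds between attempts. Where a dependency may need longer to come back, a doubling delay is a common alternative. The following is a sketch of that idea, not the method used by the script above; retry_with_backoff is a hypothetical helper name.

```shell
#!/bin/sh
# Run a command up to MAX times, doubling the pause between attempts.
# retry_with_backoff MAX INITIAL_DELAY_SECONDS COMMAND [ARGS...]
retry_with_backoff() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        "$@" && return 0                          # success: stop retrying
        if [ "$attempt" -lt "$max_attempts" ]; then
            sleep "$delay"
        fi
        delay=$((delay * 2))
        attempt=$((attempt + 1))
    done
    return 1                                      # every attempt failed
}
```

For example, retry_with_backoff 3 5 systemctl restart mosquitto pauses 5 and then 10 seconds between its three attempts.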

4. Hands-on Practice Exercises - Real-World IoT Process Management

Professional IoT Process Management Scenarios

These exercises simulate real IoT deployment scenarios that you'll encounter in professional environments. Each exercise builds practical skills for managing and optimizing IoT systems.

Exercise 1: Set up comprehensive process monitoring and optimization for resource-constrained IoT devices
Exercise 2: Create and manage custom systemd services for IoT applications with proper resource limits
Exercise 3: Implement automated process monitoring and recovery systems for production IoT deployments
Exercise 4: Troubleshoot and resolve complex process issues in multi-service IoT environments

# Exercise 1: IoT Process Monitoring and Optimization
# Set up comprehensive monitoring for IoT systems

# Step 1: Create sample IoT processes for monitoring
# Temperature sensor simulator
cat > /tmp/temp_sensor.py << 'EOF'
#!/usr/bin/env python3
import time
import random
import json
import os

while True:
    temp = 20 + random.uniform(-5, 15)
    data = {"temperature": temp, "timestamp": time.time()}
    print(f"Temperature: {temp:.2f}°C")
    
    # Simulate some CPU work
    for i in range(10000):
        pass
    
    time.sleep(2)
EOF

# Data processor simulator
cat > /tmp/data_processor.py << 'EOF'
#!/usr/bin/env python3
import time
import random

while True:
    # Simulate data processing
    print("Processing sensor data...")
    
    # Simulate variable CPU load
    work_amount = random.randint(50000, 200000)
    for i in range(work_amount):
        pass
    
    time.sleep(5)
EOF

chmod +x /tmp/temp_sensor.py /tmp/data_processor.py

# Step 2: Start processes with different priorities
nice -n 5 python3 -u /tmp/temp_sensor.py > /tmp/sensor.log 2>&1 &    # -u: unbuffered output, so the log updates immediately
SENSOR_PID=$!
echo $SENSOR_PID > /tmp/sensor.pid

nice -n 10 python3 -u /tmp/data_processor.py > /tmp/processor.log 2>&1 &
PROCESSOR_PID=$!
echo $PROCESSOR_PID > /tmp/processor.pid

# Step 3: Monitor process performance
echo "=== IoT Process Monitoring ==="
ps -eo pid,ni,pcpu,pmem,cmd | grep -E "(temp_sensor|data_processor)"
echo ""
echo "=== System Resource Usage ==="
free -h
echo ""
echo "=== Process Details ==="
ps aux | grep -E "(temp_sensor|data_processor)" | grep -v grep

# Step 4: Adjust priorities based on monitoring
echo "Adjusting process priorities..."
renice 0 $SENSOR_PID                        # Normal priority for sensor
renice 15 $PROCESSOR_PID                    # Lower priority for processor

# Step 5: Create monitoring script
cat > /tmp/iot_monitor.sh << 'EOF'
#!/bin/bash
echo "=== IoT System Health Monitor ==="
echo "Time: $(date)"
echo "Uptime: $(uptime)"
echo "Memory: $(free -h | grep Mem | awk '{print $3 "/" $2}')"
echo "Load: $(cat /proc/loadavg | cut -d' ' -f1-3)"
echo ""
echo "=== IoT Processes ==="
ps aux | grep -E "(temp_sensor|data_processor)" | grep -v grep
echo ""
echo "=== Resource Top Consumers ==="
ps aux --sort=-%cpu | head -5
EOF

chmod +x /tmp/iot_monitor.sh
/tmp/iot_monitor.sh

# Exercise 2: Custom systemd Service Creation
# Create production-ready IoT services

# Step 1: Create IoT application directory structure
sudo mkdir -p /opt/iot/{bin,config,logs}
sudo mkdir -p /var/lib/iot
sudo mkdir -p /var/log/iot

# Step 2: Create sample IoT application
sudo tee /opt/iot/bin/iot_temperature_service.py > /dev/null << 'EOF'
#!/usr/bin/env python3
import time
import json
import random
import signal
import sys
import logging

# Configure logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

class TemperatureService:
    def __init__(self):
        self.running = True
        signal.signal(signal.SIGTERM, self.signal_handler)
        signal.signal(signal.SIGINT, self.signal_handler)
        
    def signal_handler(self, signum, frame):
        logger.info(f"Received signal {signum}, shutting down gracefully...")
        self.running = False
        
    def read_temperature(self):
        # Simulate temperature reading
        return 20 + random.uniform(-5, 15)
        
    def run(self):
        logger.info("IoT Temperature Service starting...")
        
        while self.running:
            try:
                temp = self.read_temperature()
                data = {
                    "temperature": temp,
                    "timestamp": time.time(),
                    "unit": "celsius"
                }
                
                logger.info(f"Temperature reading: {temp:.2f}°C")
                
                # Write to data file
                with open("/var/lib/iot/temperature.json", "w") as f:
                    json.dump(data, f)
                    
                time.sleep(10)
                
            except Exception as e:
                logger.error(f"Error in temperature service: {e}")
                time.sleep(5)
                
        logger.info("IoT Temperature Service stopped")

if __name__ == "__main__":
    service = TemperatureService()
    service.run()
EOF

sudo chmod +x /opt/iot/bin/iot_temperature_service.py

# Step 3: Create systemd service file
sudo tee /etc/systemd/system/iot-temperature.service > /dev/null << 'EOF'
[Unit]
Description=IoT Temperature Monitoring Service
Documentation=https://example.com/iot-docs
After=network.target
Wants=network.target

[Service]
Type=simple
User=iot
Group=iot
WorkingDirectory=/opt/iot
ExecStart=/usr/bin/python3 /opt/iot/bin/iot_temperature_service.py
# ExecReload is omitted: the script has no SIGHUP handler, so "kill -HUP" would terminate it
Restart=always
RestartSec=10
TimeoutStopSec=30

# Logging
StandardOutput=journal
StandardError=journal
SyslogIdentifier=iot-temperature

# Resource limits for IoT devices (MemoryMax= supersedes the deprecated MemoryLimit=)
MemoryMax=64M
CPUQuota=25%

# Security settings
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ReadWritePaths=/var/lib/iot /var/log/iot
ProtectHome=true
ProtectKernelTunables=true
ProtectControlGroups=true

[Install]
WantedBy=multi-user.target
EOF

# Step 4: Set up service user and permissions
sudo useradd -r -s /bin/false iot 2>/dev/null || true
sudo chown -R iot:iot /opt/iot /var/lib/iot /var/log/iot

# Step 5: Enable and start the service
sudo systemctl daemon-reload
sudo systemctl enable iot-temperature
sudo systemctl start iot-temperature

# Step 6: Monitor the service
echo "=== Service Status ==="
systemctl status iot-temperature --no-pager
echo ""
echo "=== Service Logs ==="
journalctl -u iot-temperature -n 10 --no-pager

# Exercise 3: Automated Process Monitoring and Recovery
# Create comprehensive monitoring system

# Step 1: Create process monitoring script
sudo tee /opt/iot/bin/iot_process_monitor.sh > /dev/null << 'EOF'
#!/bin/bash

# Configuration
LOG_FILE="/var/log/iot/process_monitor.log"
ALERT_THRESHOLD_CPU=80
ALERT_THRESHOLD_MEM=85
CHECK_INTERVAL=60

# Services to monitor
IOT_SERVICES=("iot-temperature" "mosquitto")

# Logging function
log_message() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" | tee -a "$LOG_FILE"
}

# Check system resources
check_system_resources() {
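    # CPU usage: 100 minus the idle ("id") value from top's summary line
    local cpu_usage
    cpu_usage=$(top -bn1 | awk -F'[, ]+' '/Cpu\(s\)/ {for (i = 2; i <= NF; i++) if ($i == "id") printf "%.0f", 100 - $(i-1)}')
    if [ "${cpu_usage:-0}" -gt "$ALERT_THRESHOLD_CPU" ]; then
        log_message "ALERT: High CPU usage: ${cpu_usage}%"
    fi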
    
    # Memory usage
    local mem_usage=$(free | grep Mem | awk '{printf "%.0f", $3/$2 * 100.0}')
    if [ "$mem_usage" -gt "$ALERT_THRESHOLD_MEM" ]; then
        log_message "ALERT: High memory usage: ${mem_usage}%"
    fi
    
    # Disk usage
    local disk_usage=$(df / | tail -1 | awk '{print $5}' | cut -d'%' -f1)
    if [ "$disk_usage" -gt 90 ]; then
        log_message "ALERT: High disk usage: ${disk_usage}%"
    fi
}

# Check IoT services
check_iot_services() {
    for service in "${IOT_SERVICES[@]}"; do
        if systemctl is-active --quiet "$service"; then
            log_message "INFO: Service $service is running"
        else
            log_message "ALERT: Service $service is not running"
            restart_service "$service"
        fi
    done
}

# Restart failed service
restart_service() {
    local service=$1
    log_message "INFO: Attempting to restart $service"
    
    if systemctl restart "$service"; then
        log_message "INFO: Successfully restarted $service"
    else
        log_message "ERROR: Failed to restart $service"
    fi
}

# Main monitoring loop
main() {
    log_message "INFO: Starting IoT process monitoring"
    
    while true; do
        check_system_resources
        check_iot_services
        sleep "$CHECK_INTERVAL"
    done
}

# Handle signals for graceful shutdown
trap 'log_message "INFO: Process monitor shutting down"; exit 0' SIGTERM SIGINT

main "$@"
EOF

sudo chmod +x /opt/iot/bin/iot_process_monitor.sh

# Step 2: Create systemd service for the monitor
sudo tee /etc/systemd/system/iot-process-monitor.service > /dev/null << 'EOF'
[Unit]
Description=IoT Process Monitor
After=multi-user.target

[Service]
Type=simple
User=root
ExecStart=/opt/iot/bin/iot_process_monitor.sh
Restart=always
RestartSec=30

[Install]
WantedBy=multi-user.target
EOF

# Step 3: Enable and start the monitor
sudo systemctl daemon-reload
sudo systemctl enable iot-process-monitor
sudo systemctl start iot-process-monitor
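
The monitor script takes its CPU figure from top. As a cross-check, overall CPU usage can also be derived from two samples of the aggregate cpu line in /proc/stat. The Python sketch below shows one way to do that. The function names are illustrative and not part of the exercise, and the code assumes the standard Linux field order (user, nice, system, idle, iowait, irq, softirq, steal).

```python
import time


def parse_cpu_line(line):
    """Return (idle_ticks, total_ticks) from the aggregate 'cpu' line of /proc/stat."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    # Use only the first eight counters; guest time is already counted in user/nice.
    values = [int(v) for v in fields[1:9]]
    # Assumed order: user nice system idle iowait irq softirq steal
    idle = values[3] + (values[4] if len(values) > 4 else 0)
    return idle, sum(values)


def cpu_percent(prev_line, curr_line):
    """Busy CPU percentage between two samples of the 'cpu' line."""
    prev_idle, prev_total = parse_cpu_line(prev_line)
    curr_idle, curr_total = parse_cpu_line(curr_line)
    total_delta = curr_total - prev_total
    if total_delta <= 0:
        return 0.0
    idle_delta = curr_idle - prev_idle
    return 100.0 * (total_delta - idle_delta) / total_delta


def sample_cpu_percent(interval=1.0, path="/proc/stat"):
    """Take two live samples `interval` seconds apart (Linux only)."""
    with open(path) as f:
        first = f.readline()
    time.sleep(interval)
    with open(path) as f:
        second = f.readline()
    return cpu_percent(first, second)
```

Calling sample_cpu_percent() on a Linux device returns the busy percentage over the sampling interval. Because the result comes from a difference between two counters, it reflects recent load rather than an average since boot.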

# Exercise 4: Troubleshooting Complex Process Issues
# Simulate and resolve common IoT process problems

# Step 1: Create problematic process for troubleshooting
cat > /tmp/problematic_iot_app.py << 'EOF'
#!/usr/bin/env python3
import time
import random
import os

# Simulate memory leak
data_accumulator = []

while True:
    # Memory leak simulation
    for i in range(1000):
        data_accumulator.append(f"sensor_reading_{i}_{time.time()}")
    
    # CPU spike simulation
    if random.random() < 0.3:  # 30% chance
        print("CPU spike simulation")
        for i in range(1000000):
            pass
    
    # Occasional crash simulation
    if random.random() < 0.1:  # 10% chance
        print("Simulating application crash")
        os._exit(1)
    
    print(f"Memory usage growing... {len(data_accumulator)} items")
    time.sleep(5)
EOF

chmod +x /tmp/problematic_iot_app.py

# Step 2: Start problematic process and monitor
python3 /tmp/problematic_iot_app.py &
PROBLEM_PID=$!

# Step 3: Diagnostic commands for troubleshooting
echo "=== Troubleshooting IoT Process Issues ==="
echo "Problem PID: $PROBLEM_PID"
echo ""

# Monitor process behavior
echo "=== Process Information ==="
ps -p $PROBLEM_PID -o pid,ppid,pcpu,pmem,vsz,rss,stat,start,time,cmd

echo ""
echo "=== Memory Usage Tracking ==="
while kill -0 $PROBLEM_PID 2>/dev/null; do
    ps -p $PROBLEM_PID -o pid,pmem,vsz,rss --no-headers
    sleep 2
done &
MONITOR_PID=$!

# Let it run for a bit then clean up
sleep 20
kill $PROBLEM_PID 2>/dev/null
kill $MONITOR_PID 2>/dev/null

# Cleanup all test processes
echo ""
echo "=== Cleaning up test processes ==="
[ -f /tmp/sensor.pid ] && kill $(cat /tmp/sensor.pid) 2>/dev/null
[ -f /tmp/processor.pid ] && kill $(cat /tmp/processor.pid) 2>/dev/null
sudo systemctl stop iot-temperature 2>/dev/null
sudo systemctl stop iot-process-monitor 2>/dev/null
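
The memory tracking loop in Exercise 4 prints RSS every two seconds, but it does not decide whether the numbers point to a leak. The Python sketch below is one possible heuristic: it fits a straight line through the samples and flags a leak when the slope is large and most steps go up. The function names and the default thresholds (100 kB per sample, 80% rising steps) are assumptions for illustration and should be tuned per device.

```python
def read_rss_kb(pid):
    """Return VmRSS in kB from /proc/<pid>/status, or None if unavailable."""
    try:
        with open(f"/proc/{pid}/status") as f:
            for line in f:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])
    except OSError:
        return None
    return None


def rss_growth_kb_per_sample(samples_kb):
    """Least-squares slope of RSS (kB) against the sample index."""
    n = len(samples_kb)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples_kb) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples_kb))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def looks_like_leak(samples_kb, min_slope_kb=100.0, min_rising_fraction=0.8):
    """Heuristic: steady growth if the slope is large and most steps go up."""
    if len(samples_kb) < 3:
        return False
    steps = [b - a for a, b in zip(samples_kb, samples_kb[1:])]
    rising = sum(1 for s in steps if s > 0) / len(steps)
    slope = rss_growth_kb_per_sample(samples_kb)
    return slope >= min_slope_kb and rising >= min_rising_fraction
```

A short sampling window can produce false positives during normal warm-up, so a longer series of samples gives a more trustworthy verdict.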

Advanced Troubleshooting Scenarios

# Scenario 1: High CPU usage investigation
# Problem: IoT system experiencing high CPU usage
# Solution: Identify and optimize resource-intensive processes

# Step 1: Identify CPU-intensive processes
top -b -n 1 | head -20                    # Snapshot of top processes
ps aux --sort=-%cpu | head -10            # Top CPU consumers
ps -eo pid,pcpu,cmd --sort=-pcpu | head -10  # Focused CPU view

# Step 2: Analyze specific process
PID=$(ps aux --sort=-%cpu | awk 'NR==2 {print $2}')  # Get top CPU process
ps -p $PID -o pid,pcpu,pmem,etime,cmd    # Detailed process info
cat /proc/$PID/status                     # Process status details
ls -la /proc/$PID/fd/                     # Open file descriptors

# Step 3: Monitor process behavior over time
watch "ps -p $PID -o pid,pcpu,pmem,vsz,rss"  # Real-time monitoring

# Scenario 2: Memory leak detection and resolution
# Problem: IoT application consuming increasing memory
# Solution: Identify and fix memory leaks

# Step 1: Memory usage analysis
free -h                                   # System memory overview
ps aux --sort=-%mem | head -10           # Top memory consumers
cat /proc/meminfo | grep -E "(MemTotal|MemFree|MemAvailable)"

# Step 2: Process memory tracking
PID=$(ps aux --sort=-%mem | awk 'NR==2 {print $2}')
echo "Tracking memory for PID: $PID"
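for _ in $(seq 1 12); do                  # Sample for about one minute, then continue
    [ -d "/proc/$PID" ] || break          # Stop early if the process has exited
    ps -p $PID -o pid,pmem,vsz,rss,cmd --no-headers
    sleep 5
done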

# Step 3: Memory leak investigation
cat /proc/$PID/status | grep -E "(VmSize|VmRSS|VmData|VmStk)"
# Sum one field only; PSS (proportional set size) splits shared pages between processes
awk '/^Pss:/ {sum+=$2} END {print "Total PSS:", sum, "kB"}' /proc/$PID/smaps
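
A per-process total from smaps is only meaningful when a single field is summed, because Size, Rss, and Pss measure different things. The Python sketch below totals one named field from smaps text. The function name and the sample mappings in the usage example are made up for illustration.

```python
def sum_smaps_field(smaps_text, field="Pss"):
    """Total one smaps field (in kB) across all mappings of a process."""
    total = 0
    prefix = field + ":"
    for line in smaps_text.splitlines():
        # Match the exact field name so "Pss:" does not also pick up "SwapPss:".
        if line.startswith(prefix):
            total += int(line.split()[1])
    return total


# Hypothetical two-mapping excerpt in the /proc/<pid>/smaps layout
sample = "\n".join([
    "55d0c000-55d0d000 r--p 00000000 08:01 123 /usr/bin/python3",
    "Size:                  4 kB",
    "KernelPageSize:        4 kB",
    "Rss:                   4 kB",
    "Pss:                   2 kB",
    "SwapPss:               0 kB",
    "7f000000-7f021000 rw-p 00000000 00:00 0 [heap]",
    "Size:                132 kB",
    "Rss:                 100 kB",
    "Pss:                 100 kB",
])

print("PSS:", sum_smaps_field(sample, "Pss"), "kB")
print("RSS:", sum_smaps_field(sample, "Rss"), "kB")
```

On a live system the text would come from reading /proc/<pid>/smaps, which normally requires the same user as the process or root.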

# Scenario 3: Zombie process cleanup
# Problem: Accumulating zombie processes
# Solution: Identify and clean up zombie processes

# Step 1: Find zombie processes
ps aux | awk 'NR==1 || $8 ~ /^Z/'         # Find zombies (STAT starts with Z)
ps -eo pid,ppid,stat,cmd | awk 'NR==1 || $3 ~ /^Z/'  # Zombie processes with parent info

# Step 2: Identify parent processes
ps -eo pid,ppid,stat,cmd | awk '$3 ~ /Z/ {print "Zombie PID:", $1, "Parent PID:", $2}'

# Step 3: Clean up zombies (restart parent or kill parent)
# Note: Zombies are cleaned up when parent reads their exit status
# If parent is unresponsive, it may need to be restarted
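
Since a zombie can only be cleared through its parent, it helps to group zombies by parent PID before deciding what to restart. The Python sketch below parses the output of ps -eo pid,ppid,stat,cmd for that purpose. The function names are illustrative, and the sample process list in the usage example is invented.

```python
import subprocess


def find_zombies(ps_output):
    """Parse `ps -eo pid,ppid,stat,cmd` output; return {parent_pid: [zombie_pids]}."""
    by_parent = {}
    for line in ps_output.splitlines()[1:]:  # skip the header row
        parts = line.split(None, 3)
        if len(parts) < 3:
            continue
        pid, ppid, stat = parts[0], parts[1], parts[2]
        # STAT may carry modifiers (Z, Zs, Z+), so test the first letter only.
        if stat.startswith("Z"):
            by_parent.setdefault(int(ppid), []).append(int(pid))
    return by_parent


def find_zombies_live():
    """Run ps on the local system and group any zombies by parent."""
    result = subprocess.run(
        ["ps", "-eo", "pid,ppid,stat,cmd"],
        capture_output=True, text=True, check=True,
    )
    return find_zombies(result.stdout)


# Invented sample output for illustration
sample = "\n".join([
    "  PID  PPID STAT CMD",
    "    1     0 Ss   /sbin/init",
    "  812     1 Ssl  /usr/bin/python3 /opt/iot/bin/app.py",
    "  900   812 Z    [sensor_worker] <defunct>",
    "  901   812 Z+   [sensor_worker] <defunct>",
    "  950     1 S    sleep 60",
])

for parent, zombies in find_zombies(sample).items():
    print(f"Parent {parent} has unreaped children: {zombies}")
```

A parent that keeps accumulating entries here is the process to fix or restart, since killing the zombies themselves has no effect.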

# Scenario 4: Service dependency issues
# Problem: IoT services failing due to dependency problems
# Solution: Analyze and fix service dependencies

# Step 1: Check service status and dependencies
systemctl status iot-temperature         # Service status
systemctl list-dependencies iot-temperature  # Service dependencies
systemctl show iot-temperature | grep -E "(After|Before|Wants|Requires)"

# Step 2: Check failed services
systemctl --failed                       # List failed services
journalctl -u iot-temperature --since "1 hour ago" | grep -i error

# Step 3: Dependency resolution
systemctl cat iot-temperature            # View service file
systemctl edit iot-temperature           # Edit service dependencies

# Performance optimization script
#!/bin/bash
# IoT System Performance Optimizer

optimize_iot_system() {
    echo "=== IoT System Performance Optimization ==="
    
    # 1. Identify resource-intensive processes
    echo "Top CPU consumers:"
    ps aux --sort=-%cpu | head -5
    
    echo "Top memory consumers:"
    ps aux --sort=-%mem | head -5
    
    # 2. Optimize process priorities
    echo "Optimizing IoT process priorities..."
    
    # Lower priority for non-critical processes
    # (pgrep takes an extended regex, so alternation is a plain "|")
    for pid in $(pgrep -f "backup|sync|update"); do
        renice 10 $pid 2>/dev/null && echo "Lowered priority for PID $pid"
    done
    
    # Higher priority for critical IoT processes
    for pid in $(pgrep -f "sensor|mqtt|critical"); do
        sudo renice -5 $pid 2>/dev/null && echo "Raised priority for PID $pid"
    done
    
    # 3. Free up disk space
    echo "Cleaning up system..."
    
    # Compress large, old log files (older than 30 days and over 100 MB)
    find /var/log -name "*.log" -mtime +30 -size +100M -exec gzip {} \;
    
    # Clear temporary files
    find /tmp -type f -mtime +7 -delete 2>/dev/null
    
    # 4. System resource summary
    echo "=== System Resource Summary ==="
    echo "CPU Load: $(cat /proc/loadavg | cut -d' ' -f1-3)"
    echo "Memory: $(free -h | grep Mem | awk '{print $3 "/" $2}')"
    echo "Disk: $(df -h / | tail -1 | awk '{print $5 " used"}')"
    
    echo "Optimization complete!"
}

optimize_iot_system
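
The optimizer applies renice directly to every process whose command line matches a pattern. Before running it on a production device, it can be useful to preview which processes would be affected. The Python sketch below builds such a dry-run plan from (pid, command line) pairs. The function name, the rule that critical patterns win over non-critical ones, and the sample process list are assumptions for illustration.

```python
import re

# Same keyword groups as the optimizer script
LOW_PRIORITY = re.compile(r"backup|sync|update")
HIGH_PRIORITY = re.compile(r"sensor|mqtt|critical")


def plan_renice(processes, low_nice=10, high_nice=-5):
    """Return [(pid, nice_value)] for processes matching either keyword group."""
    plan = []
    for pid, cmd in processes:
        # Assumption: a process matching both groups is treated as critical.
        if HIGH_PRIORITY.search(cmd):
            plan.append((pid, high_nice))
        elif LOW_PRIORITY.search(cmd):
            plan.append((pid, low_nice))
    return plan


# Invented process list for illustration
procs = [
    (101, "python3 /opt/iot/bin/temp_sensor.py"),
    (102, "rsync -a /data /backup"),
    (103, "sshd"),
    (104, "python3 mqtt_bridge.py"),
]

for pid, nice_value in plan_renice(procs):
    print(f"PID {pid}: renice to {nice_value}")
```

Substring patterns are broad: "sync", for example, also matches systemd-timesyncd. Reviewing the plan before applying it helps catch such unintended matches.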

Session Summary & Next Steps

What You've Accomplished

Congratulations! You now have comprehensive process management skills essential for IoT systems. Here's what you can do:

  • Monitor System Performance: You can analyze running processes, identify bottlenecks, and optimize resource usage for IoT devices
  • Control Process Lifecycle: You can start, stop, prioritize, and manage processes including background services and daemons
  • Manage System Services: You can create, configure, and manage systemd services for production IoT deployments
  • Implement Monitoring Solutions: You can build automated monitoring and recovery systems for IoT environments
  • Troubleshoot Complex Issues: You can diagnose and resolve process-related problems in multi-service IoT systems

Real-World IoT Applications

These skills directly apply to professional IoT development and operations:

  • Production Deployment: Managing IoT services in production environments with proper monitoring and recovery
  • Performance Optimization: Optimizing resource usage for battery-powered and resource-constrained IoT devices
  • System Reliability: Ensuring IoT systems remain operational through automated monitoring and recovery
  • Scalability: Managing multiple IoT services and processes efficiently across large deployments
  • Troubleshooting: Quickly identifying and resolving issues in complex IoT system environments
  • Security: Implementing proper process isolation and resource limits for secure IoT operations

Key Commands Mastery Summary

Process monitoring: ps aux, top, htop, pgrep, pstree
Process control: kill, killall, pkill, jobs, bg, fg, nohup
Priority management: nice, renice, priority levels (-20 to +19)
Service management: systemctl, journalctl, systemd service files
Resource monitoring: free, df, uptime, vmstat, iostat
Diagnostics: /proc filesystem, system logs, performance metrics

Preparation for Session 5: Package Management & Shell Scripting

In the next session, we'll explore package management and shell scripting automation. To prepare and reinforce today's learning:

  • Practice Process Management: Set up your own IoT services and practice monitoring them
  • Experiment with Priorities: Try different priority settings and observe their effects on system performance
  • Create Monitoring Scripts: Build your own system monitoring solutions for specific IoT scenarios
  • Study systemd: Explore systemd documentation and create custom service files
  • Performance Analysis: Use the monitoring tools to analyze your system's performance patterns
  • Join DevOps Communities: Participate in system administration and DevOps discussions