Ansible Error Handling Lab

Error handling is a critical aspect of robust automation. Ansible provides several mechanisms to control how your playbooks respond to failures, define what constitutes a failure or change, and recover gracefully from unexpected conditions. Understanding these tools allows you to create resilient automation that can handle real-world scenarios effectively.

Goals

Understand different types of error handling in Ansible
Use failed_when to define custom failure conditions
Control change detection with changed_when
Implement error recovery with rescue blocks
Configure error strategies for different failure scenarios
Create robust playbooks that handle edge cases gracefully

Lab Setup

Ensure you are in the ciq-basics directory
Create a folder for this lab, let’s call it lab04

You are now ready to start the lab.

Understanding Failed When

The failed_when directive allows you to define custom conditions that determine when a task should be considered failed, regardless of the command’s exit code.

Create a playbook called failed_when_demo.yml:

- hosts: all
  gather_facts: false
  tasks:
    - name: Check disk space with custom failure condition
      shell: df -h / | tail -n 1 | awk '{print $5}' | sed 's/%//'
      register: disk_usage
      failed_when: disk_usage.stdout|int > 80
    
    - name: Command that always succeeds but we define failure
      command: echo "Operation completed"
      register: result
      failed_when: "'completed' not in result.stdout"
    
    - name: Check service status with custom failure logic
      shell: systemctl is-active NetworkManager || echo "inactive"
      register: service_status
      # The "|| echo" fallback forces rc to 0, so judge the task by its output.
      failed_when: "'inactive' in service_status.stdout"
    
    - name: Validate configuration file exists
      stat:
        path: /etc/hosts
      register: config_file
      failed_when: not config_file.stat.exists
    
    - name: Multiple failure conditions
      shell: uptime
      register: uptime_result
      # Fail if any one of these checks trips.
      failed_when: >-
        uptime_result.rc != 0 or
        'load average' not in uptime_result.stdout or
        uptime_result.stdout == ""

Execute the playbook and observe how different failure conditions work
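
One detail behind conditions like these: when failed_when is given a YAML list, Ansible combines the items with a logical AND, so the task fails only if every item is true. To fail when any single check trips, join the checks with or in one expression. A minimal sketch of both forms, using a placeholder echo command as the thing being checked (the task and variable names are illustrative only):

```yaml
- hosts: all
  gather_facts: false
  tasks:
    # List form: fails only when BOTH items are true.
    - name: Fail only if rc is non-zero AND the marker is missing
      command: echo "status ok"
      register: and_demo
      changed_when: false
      failed_when:
        - and_demo.rc != 0
        - "'ok' not in and_demo.stdout"

    # Single expression: fails when EITHER check is true.
    - name: Fail if rc is non-zero OR the marker is missing
      command: echo "status ok"
      register: or_demo
      changed_when: false
      failed_when: or_demo.rc != 0 or 'ok' not in or_demo.stdout
```

Both tasks should pass as written; changing the echoed text so that it no longer contains "ok" should make only the second task fail.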

Sample Output

    TASK [Check disk space with custom failure condition] **********************
    ok: [localhost]
    
    TASK [Command that always succeeds but we define failure] ******************
    ok: [localhost]
    
    TASK [Check service status with custom failure logic] **********************
    ok: [localhost]
    
    TASK [Validate configuration file exists] **********************************
    ok: [localhost]

Things to try

Modify the disk usage threshold to trigger a failure
Create failure conditions based on string patterns in command output
Use failed_when with register variables from multiple tasks
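
As a starting point for the last item above, here is a small sketch that registers results from two tasks and evaluates them together in a third. The 80 percent threshold mirrors the earlier example; the choice of /tmp as the second filesystem and the variable names are assumptions made for illustration.

```yaml
- hosts: all
  gather_facts: false
  tasks:
    - name: Read root filesystem usage (percent)
      shell: df -P / | tail -n 1 | awk '{print $5}' | tr -d '%'
      register: root_usage
      changed_when: false

    - name: Read /tmp filesystem usage (percent)
      shell: df -P /tmp | tail -n 1 | awk '{print $5}' | tr -d '%'
      register: tmp_usage
      changed_when: false

    # This task does no work itself; it only evaluates the two results.
    - name: Fail if either filesystem is above the threshold
      debug:
        msg: "root={{ root_usage.stdout }}% tmp={{ tmp_usage.stdout }}%"
      failed_when: root_usage.stdout | int > 80 or tmp_usage.stdout | int > 80
```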

Controlling Change Detection

The changed_when directive allows you to define when a task should be marked as “changed”, giving you precise control over change detection for idempotent playbooks.

First, create a simple template file app.conf.j2 for one of our examples:

# Application Configuration
app_name={{ app_name | default('MyApp') }}
mode={{ app_mode | default('development') }}
debug={{ debug_enabled | default(true) }}
port={{ app_port | default(8080) }}

Create a playbook called changed_when_demo.yml:

- hosts: all
  gather_facts: false
  tasks:
    - name: Command that never reports changed
      command: date
      changed_when: false
    
    - name: Command that always reports changed
      command: echo "Configuration updated"
      changed_when: true
    
    - name: Update hosts file entry
      lineinfile:
        path: /etc/hosts
        line: "127.0.0.1 myapp.local"
        regexp: '^127\.0\.0\.1.*myapp\.local'
        state: present
      register: hosts_update
      changed_when: hosts_update.changed
      become: yes
    
    - name: Install package with conditional change detection
      package:
        name: curl
        state: present
      register: package_install
      changed_when: package_install.changed and 'Nothing to do' not in package_install.msg|default('')
      become: yes
    
    - name: Set file permissions with custom change detection
      file:
        path: /tmp/test_permissions.txt
        state: touch
        mode: '0644'
        owner: root
        group: root
      register: file_perms
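      # These values describe the file after the task ran, so the expression is
      # false once the file is in the desired state. That hides the "changed"
      # status that state: touch would otherwise report on every run.
      changed_when: >-
        file_perms.mode != '0644' or
        file_perms.owner != 'root' or
        file_perms.group != 'root'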
      become: yes
    
    - name: Create user with conditional change detection
      user:
        name: testuser
        state: present
        shell: /bin/bash
        home: /home/testuser
      register: user_creation
      changed_when: user_creation.changed and user_creation.state == 'present'
      become: yes
    
    - name: Template deployment with checksum-based change detection
      template:
        src: app.conf.j2
        dest: /tmp/app.conf
        backup: yes
      vars:
        app_mode: production
        debug_enabled: false
      register: template_result
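      # The template module already compares checksums internally and reports
      # the outcome in "changed". dest_checksum does not appear among the
      # module's documented return values, hence the default('').
      changed_when: 
        - template_result.changed
        - template_result.checksum != template_result.dest_checksum|default('')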
    
    - name: Service management with state-based change detection
      service:
        name: NetworkManager
        state: started
        enabled: yes
      register: service_result
      changed_when:
        - service_result.changed
        - service_result.state == 'started' or service_result.enabled == true
      become: yes

Execute the playbook and observe the change indicators

Sample Output

    TASK [Command that never reports changed] **********************************
    ok: [localhost]
    
    TASK [Command that always reports changed] *********************************
    changed: [localhost]
    
    TASK [Update hosts file entry] *********************************************
    changed: [localhost]
    
    TASK [Install package with conditional change detection] *******************
    ok: [localhost]
    
    TASK [Template deployment with checksum-based change detection] ************
    changed: [localhost]

Notice how:

The lineinfile task only reports changed if the line was actually added or modified
The package task uses the built-in change detection but filters out “Nothing to do” messages
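The file task compares the resulting mode, owner, and group with the desired values; because the registered result describes the file after the task ran, this mainly hides the changed status that state: touch would otherwise report on every run
The template task relies on the module's own checksum comparison, which is exposed through its changed result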
The service task only reports changed when the service state or enabled status actually changes
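
The module-based tasks above already know whether they changed anything. For command and shell tasks, which report changed on every run by default, a common approach is to have the command print a marker and key changed_when on it. A minimal sketch, assuming a hypothetical directory /tmp/lab04-demo:

```yaml
- hosts: all
  gather_facts: false
  tasks:
    - name: Create a directory with a shell command
      shell: |
        if [ -d /tmp/lab04-demo ]; then
          echo "exists"
        else
          mkdir /tmp/lab04-demo && echo "created"
        fi
      register: mkdir_result
      # Report "changed" only on the run that actually created the directory.
      changed_when: "'created' in mkdir_result.stdout"
```

In practice the file module with state: directory is the better tool for this particular job; the sketch only illustrates the marker technique.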

Things to try

Use changed_when with the git module to detect when repositories are actually updated
Combine changed_when with the copy module using backup and checksum comparison
Practice with changed_when and the cron module to detect when crontab entries change
Experiment with changed_when on mount module tasks to detect filesystem mounting changes
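
For the git suggestion above, one possible sketch compares the before and after commit hashes that the git module returns. The repository URL, destination path, and branch name are placeholders, not values from this lab.

```yaml
- hosts: all
  gather_facts: false
  tasks:
    - name: Check out a repository
      git:
        repo: https://example.com/example/repo.git   # placeholder URL
        dest: /tmp/example-repo                      # placeholder path
        version: main                                # placeholder branch
      register: repo
      # "before" is empty on the first clone, so that run also counts as changed.
      changed_when: repo.before != repo.after

    - name: Show which commit is now checked out
      debug:
        msg: "Repository is at {{ repo.after }}"
```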

Ignoring Errors

The ignore_errors directive allows tasks to fail without stopping playbook execution, useful for optional operations or when you want to handle failures manually.

Create a playbook called ignore_errors_demo.yml:

- hosts: all
  gather_facts: false
  become: true
  tasks:
    - name: Attempt to stop a service that might not exist
      service:
        name: nonexistent-service
        state: stopped
      ignore_errors: yes
    
    - name: Try to remove optional packages
      package:
        name:
          - some-optional-package
          - another-optional-package
        state: absent
      ignore_errors: yes
      become: yes
    
    - name: Command that might fail but we continue anyway
      shell: cat /path/to/optional/file.txt || echo "File not found, using defaults"
      register: file_content
      ignore_errors: yes
    
    - name: Show what we got from the previous task
      debug:
        msg: "File content result: {{ file_content.stdout }}"
    
    - name: Cleanup task that shouldn't stop execution
      file:
        path: /tmp/temporary-file-that-might-not-exist
        state: absent
      ignore_errors: yes
    
    - name: This task will always run
      debug:
        msg: "Playbook execution continued despite previous failures"
    
    - name: Conditional logic based on previous failed tasks
      debug:
        msg: "Previous file operation failed, using alternative approach"
      when: file_content is failed

Execute the playbook and see how execution continues despite failures

Sample Output

    TASK [Attempt to stop a service that might not exist] **********************
    fatal: [localhost]: FAILED! => {"msg": "Could not find the requested service nonexistent-service"}
    ...ignoring
    
    TASK [Show what we got from the previous task] *****************************
    ok: [localhost] => {
        "msg": "File content result: File not found, using defaults"
    }
    
    TASK [This task will always run] *******************************************
    ok: [localhost] => {
        "msg": "Playbook execution continued despite previous failures"
    }

Things to try

Combine ignore_errors with conditionals to create fallback logic
Use ignore_errors for optional cleanup operations
Register results from failed tasks and use them in subsequent tasks
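
The first and third items above can be combined into a small fallback pattern: register the result of a task that is allowed to fail, then branch on it with the failed and succeeded tests. The file path and the app_settings variable below are hypothetical.

```yaml
- hosts: all
  gather_facts: false
  tasks:
    - name: Try to read an optional settings file
      command: cat /etc/myapp/optional.conf   # hypothetical path
      register: optional_conf
      ignore_errors: true
      changed_when: false

    - name: Fall back to defaults when the read failed
      set_fact:
        app_settings: "defaults"
      when: optional_conf is failed

    - name: Use the file contents when the read succeeded
      set_fact:
        app_settings: "{{ optional_conf.stdout }}"
      when: optional_conf is succeeded

    - name: Show the result
      debug:
        msg: "Settings in use: {{ app_settings }}"
```

Unlike the shell example with "|| echo" earlier in this section, the command here really fails when the file is missing, so the fallback branch is the one that runs.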

Using Rescue Blocks

Rescue blocks provide structured error handling, allowing you to define recovery actions when tasks fail, similar to try-catch blocks in programming languages.

Create a playbook called rescue_blocks_demo.yml:

- hosts: all
  gather_facts: true  # the rescue tasks use ansible_date_time, which comes from gathered facts
  become: true
  tasks:
    - name: Primary configuration with fallback
      block:
        - name: Try to copy primary configuration
          copy:
            src: /path/to/primary/config.yml
            dest: /tmp/app-config.yml
        
        - name: Validate configuration
          shell: python -c "import yaml; yaml.safe_load(open('/tmp/app-config.yml'))"
        
      rescue:
        - name: Primary config failed, using backup
          debug:
            msg: "Primary configuration failed, falling back to default config"
        
        - name: Create default configuration
          copy:
            content: |
              app:
                name: "Default App"
                port: 8080
                debug: false
            dest: /tmp/app-config.yml
        
        - name: Log the fallback action
          lineinfile:
            path: /tmp/deployment.log
            line: "{{ ansible_date_time.iso8601 }}: Used default configuration due to primary config failure"
            create: yes
      
      always:
        - name: Ensure configuration exists
          stat:
            path: /tmp/app-config.yml
          register: final_config
        
        - name: Report final configuration status
          debug:
            msg: "Configuration file exists: {{ final_config.stat.exists }}"
    
    - name: Package installation with fallback
      block:
        - name: Install package from primary repository
          package:
            name: httpd
            state: present
          become: yes
        
        - name: Start and enable web service
          service:
            name: httpd
            state: started
            enabled: yes
          become: yes
        
      rescue:
        - name: Package installation failed, trying alternative
          debug:
            msg: "Primary package installation failed, installing alternative web server"
        
        - name: Install alternative web server
          package:
            name: nginx
            state: present
          become: yes
        
        - name: Start and enable nginx service
          service:
            name: nginx
            state: started
            enabled: yes
          become: yes
          
        - name: Log fallback action
          lineinfile:
            path: /tmp/deployment.log
            line: "{{ ansible_date_time.iso8601 }}: Used nginx instead of httpd due to installation failure"
            create: yes
      
      always:
        - name: Check if a web server is running
          shell: ss -tuln | grep ':80 '
          register: webserver_check
          ignore_errors: yes
        
        - name: Report web server status
          debug:
            msg: "Web server is {{ 'running' if webserver_check.rc == 0 else 'not running' }} on port 80"

Execute the playbook to see rescue blocks in action

Sample Output

    TASK [Try to copy primary configuration] ***********************************
    fatal: [localhost]: FAILED! => {"msg": "Could not find or access '/path/to/primary/config.yml'"}
    
    TASK [Primary config failed, using backup] *********************************
    ok: [localhost] => {
        "msg": "Primary configuration failed, falling back to default config"
    }
    
    TASK [Create default configuration] ****************************************
    changed: [localhost]
    
    TASK [Report final configuration status] ***********************************
    ok: [localhost] => {
        "msg": "Configuration file exists: true"
    }
    
    TASK [Install package from primary repository] **************************** 
    ok: [localhost]
    
    TASK [Start and enable web service] ****************************************
    changed: [localhost]
    
    TASK [Check if a web server is running] ************************************
    ok: [localhost]
    
    TASK [Report web server status] ********************************************
    ok: [localhost] => {
        "msg": "Web server is running on port 80"
    }

Things to try

Create nested rescue blocks for multi-level error handling
Use rescue blocks with different strategies for different types of failures
Combine rescue blocks with when conditions for conditional recovery
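
A nested rescue, as suggested in the first item above, might look like the following sketch. Both failures are forced with /bin/false so that the flow from the inner rescue to the outer rescue is visible; the task names are illustrative.

```yaml
- hosts: all
  gather_facts: false
  tasks:
    - name: Outer block
      block:
        - name: Inner block with its own recovery
          block:
            - name: Primary step (forced to fail)
              command: /bin/false
          rescue:
            - name: First-level recovery (also forced to fail)
              command: /bin/false
      rescue:
        # Runs because the inner rescue itself failed.
        - name: Second-level recovery
          debug:
            msg: "Recovering after failed task: {{ ansible_failed_task.name }}"
      always:
        - name: Runs regardless of the outcome
          debug:
            msg: "Nested error handling finished"
```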

Configuring Error Strategies

Error strategies determine how Ansible behaves when tasks fail across multiple hosts, allowing you to control whether execution continues on other hosts when failures occur.

Create a playbook called error_strategy_demo.yml:

- hosts: localhost
  strategy: linear
  gather_facts: false
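  vars:
    # Informational only: "strategy" is a play-level keyword and cannot be
    # switched from inside a running play.
    error_strategy_test: "{{ strategy_type | default('linear') }}"
  tasks:
    - name: Record the requested strategy name as a fact
      set_fact:
        ansible_strategy: "{{ error_strategy_test }}"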
    
    - name: Task that might fail on some hosts
      shell: |
        # Simulate random failure for demonstration
        if [ $(( RANDOM % 3 )) -eq 0 ]; then
          echo "Simulated failure"
          exit 1
        else
          echo "Task succeeded"
        fi
      args:
        executable: /bin/bash  # $RANDOM is a bash feature, not POSIX sh
      register: random_task
    
    - name: This task only runs if previous succeeded
      debug:
        msg: "Previous task output: {{ random_task.stdout }}"

Create a more comprehensive strategy demonstration in strategy_comparison.yml:

- hosts: localhost
  strategy: free
  gather_facts: false
  serial: 1
  max_fail_percentage: 30
  tasks:
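    - name: Show current strategy
      # ansible_strategy is not a built-in variable; unless you pass it with -e,
      # this prints the default value rather than the play's actual strategy.
      debug:
        msg: "Running with strategy: {{ ansible_strategy | default('linear') }}"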
    
    - name: Demonstrate free strategy behavior
      shell: sleep {{ ansible_play_hosts.index(inventory_hostname) + 1 }}; echo "Host {{ inventory_hostname }} completed"
      
    - name: Task with failure tolerance
      shell: |
        # Different behavior based on host
        case "{{ inventory_hostname }}" in
          *1) exit 0 ;;  # Success
          *2) exit 1 ;;  # Failure  
          *) echo "Processing..." ;;
        esac
      ignore_errors: "{{ ansible_strategy | default('linear') == 'free' }}"
    
    - name: Cleanup task that always runs
      debug:
        msg: "Cleanup completed on {{ inventory_hostname }}"

Create a playbook demonstrating different error handling approaches in comprehensive_error_handling.yml:

- hosts: localhost
  gather_facts: false
  strategy: linear
  any_errors_fatal: false
  max_fail_percentage: 50
  
  tasks:
    - name: Critical task that must succeed
      shell: echo "Critical operation"
      any_errors_fatal: true
      
    - name: Optional task with custom error handling
      block:
        - name: Risky operation
          shell: |
            if [ $(( RANDOM % 2 )) -eq 0 ]; then
              echo "Operation succeeded"
            else
              echo "Operation failed"
              exit 1
            fi
          args:
            executable: /bin/bash  # $RANDOM is a bash feature, not POSIX sh
          register: risky_op
          
      rescue:
        - name: Handle the failure
          set_fact:
            operation_status: "failed"
            fallback_used: true
            
        - name: Implement fallback
          shell: echo "Using fallback procedure"
          register: fallback_result
          
      always:
        - name: Log operation result
          debug:
            msg: |
              Operation status: {{ operation_status | default('success') }}
              Fallback used: {{ fallback_used | default(false) }}
    
    - name: Final validation
      assert:
        that:
          - risky_op is succeeded or fallback_result is succeeded
        fail_msg: "Neither primary operation nor fallback succeeded"
        success_msg: "Operation completed successfully (primary or fallback)"

Sample Output

    TASK [Show current strategy] ********************************************
    ok: [localhost] => {
        "msg": "Running with strategy: linear"
    }
    
    TASK [Critical task that must succeed] **********************************
    ok: [localhost]
    
    TASK [Handle the failure] ************************************************
    ok: [localhost]
    
    TASK [Final validation] *************************************************
    ok: [localhost] => {
        "msg": "Operation completed successfully (primary or fallback)"
    }

Available Error Strategies:

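linear (default): Each task runs on all hosts in the current batch before the next task starts
free: Each host runs through its tasks as fast as it can, without waiting for other hosts
host_pinned: Like free, but a host is only started if it can run the play to completion without being interrupted by tasks for other hosts
debug: Runs like linear, but under the control of an interactive debug session for troubleshooting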

Strategy Options:

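serial: Set the batch size, that is, how many hosts run through the whole play at a time
max_fail_percentage: Abort the play when the share of failed hosts in the current batch exceeds this percentage
any_errors_fatal: Treat a failure on any host as fatal: the current task finishes on the hosts in the batch, then the play stops for all hosts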
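
The three options above are most visible with several hosts. The following sketch shows how they might be combined in a rolling update; the webservers group is a hypothetical inventory group, and the tasks are placeholders for real work.

```yaml
- hosts: webservers          # hypothetical inventory group
  gather_facts: false
  serial: 2                  # process two hosts per batch
  max_fail_percentage: 50    # abort if more than half of a batch fails
  tasks:
    - name: Pre-flight check that must pass on every host in the batch
      command: /bin/true
      changed_when: false
      any_errors_fatal: true

    - name: Rolling step (placeholder for the real update)
      debug:
        msg: "Updating {{ inventory_hostname }}"
```

Note that max_fail_percentage has to be exceeded, not just reached: with two hosts per batch and a value of 50, one failed host does not abort the play, but two do.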

Things to try

Test different strategies with multiple host inventories
Experiment with max_fail_percentage in multi-host scenarios
Use any_errors_fatal for critical sections of deployment playbooks
Combine error strategies with rescue blocks for comprehensive error handling

Advanced Error Handling Patterns

Create a playbook showing advanced error handling patterns in advanced_patterns.yml:

- hosts: localhost
  gather_facts: true  # the rescue section uses ansible_date_time, which comes from gathered facts
  vars:
    max_retries: 3
    retry_delay: 2
  
  tasks:
    - name: Retry logic with custom failure handling
      include_tasks: retry_task.yml
      vars:
        task_name: "Connect to external service"
        command_to_run: "curl -f http://httpbin.org/status/{{ item }}"
        expected_failures: [22]  # curl -f exits with code 22 when the server returns HTTP 400 or above
      loop: [200, 404, 200]
      register: retry_results
      
    - name: Conditional error handling based on error type
      shell: |
        case $RANDOM in
          *1) exit 1 ;;   # Retriable error
          *2) exit 2 ;;   # Fatal error  
          *) echo "Success" ;;
        esac
      args:
        executable: /bin/bash  # $RANDOM is a bash feature, not POSIX sh
      register: operation_result
      failed_when: false
      
    - name: Handle different error types
      block:
        - name: Check for fatal errors
          fail:
            msg: "Fatal error occurred, cannot continue"
          when: operation_result.rc == 2
          
        - name: Handle retriable errors
          debug:
            msg: "Retriable error detected, implement retry logic"
          when: operation_result.rc == 1
          
      rescue:
        - name: Log error details
          copy:
            content: |
              Error occurred at: {{ ansible_date_time.iso8601 }}
              Task: {{ ansible_failed_task.name }}
              Error: {{ ansible_failed_result.msg }}
            dest: /tmp/error.log
            
        - name: Send notification (placeholder)
          debug:
            msg: "Would send alert about critical failure"

Create the retry task file retry_task.yml:

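- name: "{{ task_name }} - status {{ item }}"
  shell: "{{ command_to_run }}"
  register: task_result
  # Fail only on exit codes that are neither 0 nor listed as expected.
  failed_when: 
    - task_result.rc != 0
    - task_result.rc not in (expected_failures | default([]))
  retries: "{{ max_retries }}"
  delay: "{{ retry_delay }}"
  # Stop retrying on success or on an expected failure; otherwise the task
  # is marked failed once the retries run out.
  until: task_result.rc == 0 or task_result.rc in (expected_failures | default([]))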

Sample Output

    TASK [Retry logic with custom failure handling] ************************
    ok: [localhost] => (item=200)
    failed: [localhost] (item=404) => {"msg": "Expected failure code encountered"}
    ok: [localhost] => (item=200)
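
The retry mechanism used in retry_task.yml can also be tried in isolation. This is a minimal sketch of retries, delay, and until on a single task, assuming a hypothetical flag file /tmp/lab04-ready.flag that some other process creates:

```yaml
- hosts: localhost
  gather_facts: false
  tasks:
    - name: Wait for the flag file to appear
      command: test -f /tmp/lab04-ready.flag   # hypothetical path
      register: ready_check
      until: ready_check.rc == 0   # stop retrying as soon as the file exists
      retries: 3                   # retry up to 3 times
      delay: 2                     # seconds between attempts
      changed_when: false

    - name: Runs only after the wait succeeded
      debug:
        msg: "Flag file found after {{ ready_check.attempts }} attempt(s)"
```

If the file never appears, the first task is marked failed once the retries are used up, and the play stops for that host before the debug task.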

Things to try

Create reusable error handling roles for common patterns
Implement circuit breaker patterns using failed task tracking
Build comprehensive logging and alerting for production deployments
Use error handling with dynamic inventory and scaling operations
