Ansible Error Handling Lab

Error handling is a critical aspect of robust automation. Ansible provides several mechanisms to control how your playbooks respond to failures, define what constitutes a failure or change, and recover gracefully from unexpected conditions. Understanding these tools allows you to create resilient automation that can handle real-world scenarios effectively.

Goals

Lab Setup

  1. Ensure you are in the ciq-basics directory

  2. Create a folder for this lab, let’s call it lab04

You are now ready to start the lab.

Understanding Failed When

The failed_when directive allows you to define custom conditions that determine when a task should be considered failed, regardless of the command’s exit code.

  1. Create a playbook called failed_when_demo.yml:
- hosts: all
  gather_facts: false
  tasks:
    - name: Check disk space with custom failure condition
      shell: df -h / | tail -n 1 | awk '{print $5}' | sed 's/%//'
      register: disk_usage
      failed_when: disk_usage.stdout|int > 80
    
    - name: Command that always succeeds but we define failure
      command: echo "Operation completed"
      register: result
      failed_when: "'completed' not in result.stdout"
    
    - name: Check service status with custom failure logic
      shell: systemctl is-active NetworkManager || echo "inactive"
      register: service_status
      # "|| echo" forces the return code to 0, so failure is decided from the output alone
      failed_when: "'inactive' in service_status.stdout"
    
    - name: Validate configuration file exists
      stat:
        path: /etc/hosts
      register: config_file
      failed_when: not config_file.stat.exists
    
    - name: Multiple failure conditions
      shell: uptime
      register: uptime_result
      # A list of conditions is AND-ed; a single expression with "or" fails on any of them
      failed_when: >-
        uptime_result.rc != 0 or
        'load average' not in uptime_result.stdout or
        uptime_result.stdout == ""
  2. Execute the playbook and observe how different failure conditions work

Sample Output

    TASK [Check disk space with custom failure condition] **********************
    ok: [localhost]
    
    TASK [Command that always succeeds but we define failure] ******************
    ok: [localhost]
    
    TASK [Check service status with custom failure logic] **********************
    ok: [localhost]
    
    TASK [Validate configuration file exists] **********************************
    ok: [localhost]
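
The list form of failed_when joins its conditions with a logical AND, while a single expression can combine them with or. The following is a minimal sketch contrasting the two; the file name failed_when_logic.yml and the echo commands are assumptions chosen for illustration, not part of the lab files.

```yaml
# failed_when_logic.yml (hypothetical file name, for illustration only)
- hosts: all
  gather_facts: false
  tasks:
    - name: List form - fails only when ALL conditions are true
      command: echo "status=warning"
      register: check
      changed_when: false
      failed_when:
        - check.rc == 0
        - "'error' in check.stdout"

    # This task is expected to fail: it demonstrates the OR behaviour
    - name: Expression form - fails when ANY condition is true
      command: echo "status=warning"
      register: check_any
      changed_when: false
      failed_when: check_any.rc != 0 or 'warning' in check_any.stdout
```

The first task passes because only one of its two conditions holds. The second task fails by design, since a single true condition is enough in the expression form.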

Things to try

Controlling Change Detection

The changed_when directive allows you to define when a task should be marked as “changed”, giving you precise control over change detection for idempotent playbooks.

  1. First, create a simple template file app.conf.j2 for one of our examples:
# Application Configuration
app_name={{ app_name | default('MyApp') }}
mode={{ app_mode | default('development') }}
debug={{ debug_enabled | default(true) }}
port={{ app_port | default(8080) }}
  2. Create a playbook called changed_when_demo.yml:
- hosts: all
  gather_facts: false
  tasks:
    - name: Command that never reports changed
      command: date
      changed_when: false
    
    - name: Command that always reports changed
      command: echo "Configuration updated"
      changed_when: true
    
    - name: Update hosts file entry
      lineinfile:
        path: /etc/hosts
        line: "127.0.0.1 myapp.local"
        regexp: '^127\.0\.0\.1.*myapp\.local'
        state: present
      register: hosts_update
      changed_when: hosts_update.changed
      become: yes
    
    - name: Install package with conditional change detection
      package:
        name: curl
        state: present
      register: package_install
      changed_when: package_install.changed and 'Nothing to do' not in package_install.msg|default('')
      become: yes
    
    - name: Set file permissions with custom change detection
      file:
        path: /tmp/test_permissions.txt
        state: touch
        mode: '0644'
        owner: root
        group: root
      register: file_perms
      changed_when: 
        - file_perms.mode != '0644' or
          file_perms.owner != 'root' or 
          file_perms.group != 'root'
      become: yes
    
    - name: Create user with conditional change detection
      user:
        name: testuser
        state: present
        shell: /bin/bash
        home: /home/testuser
      register: user_creation
      changed_when: user_creation.changed and user_creation.state == 'present'
      become: yes
    
    - name: Template deployment with checksum-based change detection
      template:
        src: app.conf.j2
        dest: /tmp/app.conf
        backup: yes
      vars:
        app_mode: production
        debug_enabled: false
      register: template_result
      changed_when: 
        - template_result.changed
        - template_result.checksum != template_result.dest_checksum|default('')
    
    - name: Service management with state-based change detection
      service:
        name: NetworkManager
        state: started
        enabled: yes
      register: service_result
      changed_when:
        - service_result.changed
        - service_result.state == 'started' or service_result.enabled == true
      become: yes
  3. Execute the playbook and observe the change indicators

Sample Output

    TASK [Command that never reports changed] **********************************
    ok: [localhost]
    
    TASK [Command that always reports changed] *********************************
    changed: [localhost]
    
    TASK [Update hosts file entry] *********************************************
    changed: [localhost]
    
    TASK [Install package with conditional change detection] *******************
    ok: [localhost]
    
    TASK [Template deployment with checksum-based change detection] ************
    changed: [localhost]
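
A common use of changed_when is to derive the changed status from a command's output, because command and shell tasks otherwise report changed on every run. The sketch below assumes a scratch file /tmp/lab04_marker.txt and the marker text shown; both are hypothetical and exist only for this illustration.

```yaml
- hosts: all
  gather_facts: false
  tasks:
    - name: Append a marker line only if it is missing
      shell: |
        # Print "updated" only when the file was actually modified
        if grep -qx "lab04 marker" /tmp/lab04_marker.txt 2>/dev/null; then
          echo "unchanged"
        else
          echo "lab04 marker" >> /tmp/lab04_marker.txt
          echo "updated"
        fi
      register: marker
      changed_when: "'updated' in marker.stdout"
```

Run it twice: the first run should report changed, and the second should report ok because the marker line already exists.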

Notice how:

Things to try

Ignoring Errors

The ignore_errors directive allows tasks to fail without stopping playbook execution, useful for optional operations or when you want to handle failures manually.

  1. Create a playbook called ignore_errors_demo.yml:
- hosts: all
  gather_facts: false
  become: true
  tasks:
    - name: Attempt to stop a service that might not exist
      service:
        name: nonexistent-service
        state: stopped
      ignore_errors: yes
    
    - name: Try to remove optional packages
      package:
        name:
          - some-optional-package
          - another-optional-package
        state: absent
      ignore_errors: yes
      become: yes
    
    - name: Command that might fail but we continue anyway
      shell: cat /path/to/optional/file.txt || echo "File not found, using defaults"
      register: file_content
      ignore_errors: yes
    
    - name: Show what we got from the previous task
      debug:
        msg: "File content result: {{ file_content.stdout }}"
    
    - name: Cleanup task that shouldn't stop execution
      file:
        path: /tmp/temporary-file-that-might-not-exist
        state: absent
      ignore_errors: yes
    
    - name: This task will always run
      debug:
        msg: "Playbook execution continued despite previous failures"
    
    - name: Conditional logic based on previous failed tasks
      debug:
        msg: "Previous file operation failed, using alternative approach"
      # Skipped in this run: "|| echo" in the earlier shell task keeps its return code at 0
      when: file_content is failed
  2. Execute the playbook and see how execution continues despite failures

Sample Output

    TASK [Attempt to stop a service that might not exist] **********************
    fatal: [localhost]: FAILED! => {"msg": "Could not find the requested service nonexistent-service"}
    ...ignoring
    
    TASK [Show what we got from the previous task] *****************************
    ok: [localhost] => {
        "msg": "File content result: File not found, using defaults"
    }
    
    TASK [This task will always run] *******************************************
    ok: [localhost] => {
        "msg": "Playbook execution continued despite previous failures"
    }
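
When a task is allowed to fail, its registered result still records the failure, so later tasks can branch on it with the failed and succeeded tests. A minimal sketch, assuming a placeholder path /tmp/lab04-optional.conf that is not part of the lab:

```yaml
- hosts: all
  gather_facts: false
  tasks:
    - name: Read an optional file (the path is a placeholder)
      command: cat /tmp/lab04-optional.conf
      register: optional_file
      changed_when: false
      ignore_errors: true

    - name: Fall back when the read failed
      debug:
        msg: "Optional file missing, using defaults"
      when: optional_file is failed

    - name: Use the file when the read succeeded
      debug:
        msg: "Loaded: {{ optional_file.stdout }}"
      when: optional_file is succeeded
```

Note that ignore_errors covers task failures only. Unreachable hosts are handled by the separate ignore_unreachable keyword.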

Things to try

Using Rescue Blocks

Rescue blocks provide structured error handling, allowing you to define recovery actions when tasks fail, similar to try-catch blocks in programming languages.

  1. Create a playbook called rescue_blocks_demo.yml:
- hosts: all
  gather_facts: true  # the rescue tasks use ansible_date_time, which requires gathered facts
  become: true
  tasks:
    - name: Primary configuration with fallback
      block:
        - name: Try to copy primary configuration
          copy:
            src: /path/to/primary/config.yml
            dest: /tmp/app-config.yml
        
        - name: Validate configuration
          shell: python3 -c "import yaml; yaml.safe_load(open('/tmp/app-config.yml'))"
        
      rescue:
        - name: Primary config failed, using backup
          debug:
            msg: "Primary configuration failed, falling back to default config"
        
        - name: Create default configuration
          copy:
            content: |
              app:
                name: "Default App"
                port: 8080
                debug: false
            dest: /tmp/app-config.yml
        
        - name: Log the fallback action
          lineinfile:
            path: /tmp/deployment.log
            line: "{{ ansible_date_time.iso8601 }}: Used default configuration due to primary config failure"
            create: yes
      
      always:
        - name: Ensure configuration exists
          stat:
            path: /tmp/app-config.yml
          register: final_config
        
        - name: Report final configuration status
          debug:
            msg: "Configuration file exists: {{ final_config.stat.exists }}"
    
    - name: Package installation with fallback
      block:
        - name: Install package from primary repository
          package:
            name: httpd
            state: present
          become: yes
        
        - name: Start and enable web service
          service:
            name: httpd
            state: started
            enabled: yes
          become: yes
        
      rescue:
        - name: Package installation failed, trying alternative
          debug:
            msg: "Primary package installation failed, installing alternative web server"
        
        - name: Install alternative web server
          package:
            name: nginx
            state: present
          become: yes
        
        - name: Start and enable nginx service
          service:
            name: nginx
            state: started
            enabled: yes
          become: yes
          
        - name: Log fallback action
          lineinfile:
            path: /tmp/deployment.log
            line: "{{ ansible_date_time.iso8601 }}: Used nginx instead of httpd due to installation failure"
            create: yes
      
      always:
        - name: Check if a web server is running
          shell: ss -tuln | grep ':80 '
          register: webserver_check
          ignore_errors: yes
        
        - name: Report web server status
          debug:
            msg: "Web server is {{ 'running' if webserver_check.rc == 0 else 'not running' }} on port 80"
  2. Execute the playbook to see rescue blocks in action

Sample Output

    TASK [Try to copy primary configuration] ***********************************
    fatal: [localhost]: FAILED! => {"msg": "Could not find or access '/path/to/primary/config.yml'"}
    
    TASK [Primary config failed, using backup] *********************************
    ok: [localhost] => {
        "msg": "Primary configuration failed, falling back to default config"
    }
    
    TASK [Create default configuration] ****************************************
    changed: [localhost]
    
    TASK [Report final configuration status] ***********************************
    ok: [localhost] => {
        "msg": "Configuration file exists: true"
    }
    
    TASK [Install package from primary repository] **************************** 
    ok: [localhost]
    
    TASK [Start and enable web service] ****************************************
    changed: [localhost]
    
    TASK [Check if a web server is running] ************************************
    ok: [localhost]
    
    TASK [Report web server status] ********************************************
    ok: [localhost] => {
        "msg": "Web server is running on port 80"
    }
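
Inside a rescue section, Ansible provides ansible_failed_task and ansible_failed_result, which describe the task that triggered the rescue. The sketch below forces a failure with /bin/false purely for illustration; the task names are assumptions.

```yaml
- hosts: all
  gather_facts: false
  tasks:
    - name: Report which task triggered the rescue
      block:
        - name: Deliberately failing step
          command: /bin/false

        - name: Skipped because the previous task failed
          debug:
            msg: "Not reached"
      rescue:
        - name: Show failure details
          debug:
            msg: "Task '{{ ansible_failed_task.name }}' failed with rc={{ ansible_failed_result.rc }}"
      always:
        - name: Runs regardless of outcome
          debug:
            msg: "Block finished"
```

If every rescue task succeeds, the host is not counted as failed, and the play recap reports it under rescued instead.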

Things to try

Configuring Error Strategies

Play strategies such as linear and free control how Ansible schedules tasks across hosts. Combined with keywords such as any_errors_fatal and max_fail_percentage, they determine whether execution continues on other hosts when failures occur.

  1. Create a playbook called error_strategy_demo.yml:
- hosts: localhost
  strategy: linear
  gather_facts: false
  vars:
    error_strategy_test: "{{ strategy_type | default('fail_fast') }}"
  tasks:
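    # set_fact only stores a variable; the play's strategy is fixed by the "strategy" keyword above
    - name: Record the requested error-handling mode
      set_fact:
        ansible_strategy: "{{ error_strategy_test }}"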
    
    - name: Task that might fail on some hosts
      shell: |
        # Simulate random failure for demonstration
        if [ $(( RANDOM % 3 )) -eq 0 ]; then
          echo "Simulated failure"
          exit 1
        else
          echo "Task succeeded"
        fi
      args:
        executable: /bin/bash  # $RANDOM is a bash feature; /bin/sh may not provide it
      register: random_task
    
    - name: This task only runs if previous succeeded
      debug:
        msg: "Previous task output: {{ random_task.stdout }}"
  2. Create a more comprehensive strategy demonstration in strategy_comparison.yml:
- hosts: localhost
  strategy: free
  gather_facts: false
  serial: 1
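  max_fail_percentage: 30
  vars:
    # Ansible does not expose the play strategy as a variable, so mirror it here for the tasks below
    ansible_strategy: free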
  tasks:
    - name: Show current strategy
      debug:
        msg: "Running with strategy: {{ ansible_strategy | default('linear') }}"
    
    - name: Demonstrate free strategy behavior
      shell: sleep {{ ansible_play_hosts.index(inventory_hostname) + 1 }}; echo "Host {{ inventory_hostname }} completed"
      
    - name: Task with failure tolerance
      shell: |
        # Different behavior based on host
        case "{{ inventory_hostname }}" in
          *1) exit 0 ;;  # Success
          *2) exit 1 ;;  # Failure  
          *) echo "Processing..." ;;
        esac
      ignore_errors: "{{ ansible_strategy == 'free' }}"
    
    - name: Cleanup task that always runs
      debug:
        msg: "Cleanup completed on {{ inventory_hostname }}"
  3. Create a playbook demonstrating different error handling approaches in comprehensive_error_handling.yml:
- hosts: localhost
  gather_facts: false
  strategy: linear
  any_errors_fatal: false
  max_fail_percentage: 50
  
  tasks:
    - name: Critical task that must succeed
      shell: echo "Critical operation"
      any_errors_fatal: true
      
    - name: Optional task with custom error handling
      block:
        - name: Risky operation
          shell: |
            if [ $(( RANDOM % 2 )) -eq 0 ]; then
              echo "Operation succeeded"
            else
              echo "Operation failed"
              exit 1
            fi
          args:
            executable: /bin/bash  # $RANDOM is a bash feature; /bin/sh may not provide it
          register: risky_op
          
      rescue:
        - name: Handle the failure
          set_fact:
            operation_status: "failed"
            fallback_used: true
            
        - name: Implement fallback
          shell: echo "Using fallback procedure"
          register: fallback_result
          
      always:
        - name: Log operation result
          debug:
            msg: |
              Operation status: {{ operation_status | default('success') }}
              Fallback used: {{ fallback_used | default(false) }}
    
    - name: Final validation
      assert:
        that:
          - risky_op is succeeded or fallback_result is succeeded
        fail_msg: "Neither primary operation nor fallback succeeded"
        success_msg: "Operation completed successfully (primary or fallback)"

Sample Output

    TASK [Show current strategy] ********************************************
    ok: [localhost] => {
        "msg": "Running with strategy: linear"
    }
    
    TASK [Critical task that must succeed] **********************************
    ok: [localhost]
    
    TASK [Handle the failure] ************************************************
    ok: [localhost]
    
    TASK [Final validation] *************************************************
    ok: [localhost] => {
        "msg": "Operation completed successfully (primary or fallback)"
    }
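
The play-level keywords used above are easier to compare side by side. This is a sketch with two independent plays: the batch size, the threshold, and the /bin/true placeholder commands are assumptions for illustration, and the behaviour only becomes visible with an inventory of several hosts.

```yaml
- hosts: all
  gather_facts: false
  serial: 5                 # process hosts in batches of five
  max_fail_percentage: 20   # abort the play if more than 20% of a batch fails
  tasks:
    - name: Step that may fail on some hosts
      command: /bin/true
      changed_when: false

- hosts: all
  gather_facts: false
  any_errors_fatal: true    # a failure on any host ends the play for every host
  tasks:
    - name: Step that must succeed everywhere
      command: /bin/true
      changed_when: false
```

The first play tolerates a limited number of failures per batch, while the second treats any single failure as a reason to stop all hosts.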

Available Error Strategies:

Strategy Options:

Things to try

Advanced Error Handling Patterns

  1. Create a playbook showing advanced error handling patterns in advanced_patterns.yml:
- hosts: localhost
  gather_facts: true  # the rescue task uses ansible_date_time, which requires gathered facts
  vars:
    max_retries: 3
    retry_delay: 2
  
  tasks:
    - name: Retry logic with custom failure handling
      include_tasks: retry_task.yml
      vars:
        task_name: "Connect to external service"
        command_to_run: "curl -f http://httpbin.org/status/{{ item }}"
        expected_failures: [22]  # compared with the return code: curl -f exits with 22 on HTTP errors (status >= 400)
      loop: [200, 404, 200]
      register: retry_results
      
    - name: Conditional error handling based on error type
      shell: |
        case $RANDOM in
          *1) exit 1 ;;   # Retriable error
          *2) exit 2 ;;   # Fatal error  
          *) echo "Success" ;;
        esac
      args:
        executable: /bin/bash  # $RANDOM is a bash feature; /bin/sh may not provide it
      register: operation_result
      failed_when: false
      
    - name: Handle different error types
      block:
        - name: Check for fatal errors
          fail:
            msg: "Fatal error occurred, cannot continue"
          when: operation_result.rc == 2
          
        - name: Handle retriable errors
          debug:
            msg: "Retriable error detected, implement retry logic"
          when: operation_result.rc == 1
          
      rescue:
        - name: Log error details
          copy:
            content: |
              Error occurred at: {{ ansible_date_time.iso8601 }}
              Task: {{ ansible_failed_task.name }}
              Error: {{ ansible_failed_result.msg }}
            dest: /tmp/error.log
            
        - name: Send notification (placeholder)
          debug:
            msg: "Would send alert about critical failure"
  2. Create the retry task file retry_task.yml:
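- name: "{{ task_name }} - status {{ item }}"
  shell: "{{ command_to_run }}"
  register: task_result
  failed_when:
    - task_result.rc != 0
    - task_result.rc not in (expected_failures | default([]))
  retries: "{{ max_retries }}"
  delay: "{{ retry_delay }}"
  # Stop on success or on an expected return code; a task that exhausts its retries is always marked failed
  until: task_result.rc == 0 or task_result.rc in (expected_failures | default([]))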

Sample Output

    TASK [Retry logic with custom failure handling] ************************
    ok: [localhost] => (item=200)
    failed: [localhost] (item=404) => {"msg": "Expected failure code encountered"}
    ok: [localhost] => (item=200)
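
For HTTP checks, the retries, delay, and until keywords can also be paired with the uri module, which exposes the response status directly instead of a curl exit code. A minimal sketch, assuming a hypothetical health endpoint on localhost port 8080 and arbitrary retry settings:

```yaml
- hosts: localhost
  gather_facts: false
  tasks:
    - name: Poll a health endpoint until it answers 200
      uri:
        url: http://localhost:8080/health   # hypothetical endpoint
        status_code: [200, 503]             # keep a 503 from failing the task outright
      register: health
      until: health.status == 200
      retries: 5
      delay: 3
```

If the endpoint never returns 200 within the allowed attempts, the task is marked failed, so it can be wrapped in a block with a rescue section like the ones shown earlier.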

Things to try
