# Python Resilience Review — Azure-AI-RAG-CSharp-Semantic-Kernel-Functions

| Field | Value |
|---|---|
| **Project** | Azure-AI-RAG-CSharp-Semantic-Kernel-Functions |
| **Date** | 2026-03-22 |
| **Steward** | Python Resilience Steward |
| **Scope** | DocumentLoaderFunction (Azure Functions blob trigger, LangChain embeddings) |
| **Critical** | 4 |
| **Notable** | 4 |
| **Minor** | 1 |
| **Info** | 2 |
| **Total** | 11 |

---

## 1. Resilience Architecture Overview

The `DocumentLoaderFunction` is a Python Azure Functions v2 app with a single blob-triggered function (`Loader`). When a blob is uploaded to the `load` container, the function:

1. Reads the blob content from the trigger input stream.
2. Acquires an Azure AD token via `DefaultAzureCredential` and sets it in environment variables.
3. Creates `SearchClient` and `SearchIndexClient` instances against Azure AI Search.
4. Parses the HTML blob into structured JSON + plain text.
5. Creates an `AzureOpenAIEmbeddings` client and invokes `embed_query` to generate a 1536-dimension vector.
6. Upserts the document into the AI Search index via `upload_documents`.
7. Copies the blob to the `completed` container.
8. Deletes the source blob from the `load` container.

**Overall resilience posture: Poor.** No retry policy is configured in `host.json`, every exception is swallowed without re-raising (so the original failure never reaches the Functions host and cannot drive a retry), no external call has an explicit timeout, and the destructive cleanup step (blob deletion) sits outside the `try` block, where it is attempted whether or not processing succeeded. Four critical-severity issues were found.

---

## 2. Azure Functions Retry Policy Assessment

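Azure Functions retry policies can be defined for the whole app in `host.json` under a `retry` key, or per function; in the Python v2 programming model the per-function form is the `@app.retry` decorator. Microsoft's documentation lists runtime retry policies as supported only for a subset of triggers (Timer, Kafka, Azure Cosmos DB, and Event Hubs), while the polling-based Blob Storage trigger has its own built-in retry: five attempts by default, after which a message is written to the `webjobs-blobtrigger-poison` queue. Whether a `retry` block takes effect for this trigger should therefore be verified against the runtime and extension versions in use.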

**Finding:** `host.json` contains no `retry` block whatsoever. The file only configures logging and the extension bundle:

```json
{
  "version": "2.0",
  "logging": { ... },
  "extensionBundle": { ... }
}
```

There is also no retry decorator (`@app.retry` in the Python v2 model) on the `Loader` function in `function_app.py`. Additionally, `functionTimeout` is not set, so the plan default applies: 5 minutes on the Consumption plan, 30 minutes on Dedicated/Premium. For a function that calls an embedding API and then copies and deletes a blob, 5 minutes may be appropriate, but the value is implicit rather than declared.

**Combined with the error swallowing finding (§3):** Even if a retry mechanism were in place, the `except` handlers in `Loader` log and return instead of re-raising, so the original failure never reaches the Functions host and cannot trigger a retry.

---

## 3. Error Handling Assessment

The `Loader` function has a top-level `try/except` block that catches `json.JSONDecodeError` and then `Exception`. However, **both exception handlers only log and then return normally** — they do not re-raise. This means:

- A transient Azure OpenAI rate limit (429) causes the function to return success.
- A network timeout on `embed_query` causes the function to return success.
- A failed AI Search upload causes the function to return success.
- The blob delete (line 100) is then attempted regardless of whether processing succeeded (see §6 for the `UnboundLocalError` this can raise).

Additionally, lines 99–101 (the `blob_client` creation and `delete_blob` call) fall **outside** the `try/except` block:

```python
    except Exception as e:
        logging.error(f"loader Failed: {e}")
        logging.error(traceback.format_exc())

    blob_client = blob_service_client.get_blob_client(container=container_name, blob=blob_name)
    blob_client.delete_blob()
```

If the `except` branch is taken before `blob_service_client` and `container_name` have been assigned (both are assigned inside the `try` block), these lines raise `UnboundLocalError`, a subclass of `NameError`. That error is unhandled and propagates to the runtime, but it is a secondary error: the root cause appears only in the earlier log lines, and the invocation fails with a message unrelated to it. If the failure occurs after both names are assigned, the delete runs and removes the source blob even though processing did not complete.
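The control flow described above can be reproduced in isolation. The following is a minimal sketch, not the project's code: `loader_like` and its `fail_early` flag are hypothetical names used only to show what Python does when a name bound inside `try` is read after the handler has swallowed an earlier error.

```python
def loader_like(fail_early: bool) -> str:
    """Mimic the pattern: a name bound inside try, then used after except."""
    try:
        if fail_early:
            raise RuntimeError("embedding call failed")  # the root cause
        container_name = "load"  # only bound when nothing failed earlier
    except Exception:
        pass  # swallowed, like the handlers described above
    # Reading the name here fails if the try body exited before binding it.
    return container_name


try:
    loader_like(fail_early=True)
    observed = None
except NameError as exc:  # UnboundLocalError is a subclass of NameError
    observed = type(exc).__name__

print(observed)
```

The exception that reaches the caller names the unbound variable, not the `RuntimeError` that actually caused the failure, which is why the secondary error is misleading during triage.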

The `populate_search_index` method does re-raise its inner exception (`raise ex`), which is correct. However, the caller (`Loader`) catches and swallows it.

---

## 4. Timeout Configuration Assessment

**No timeouts are configured on any external call:**

| External call | Timeout configured? |
|---|---|
| `DefaultAzureCredential().get_token(...)` | No |
| `SearchClient` / `SearchIndexClient` construction | No |
| `search_index_client.get_index(...)` | No |
| `search_index_client.create_index(...)` | No |
| `AzureOpenAIEmbeddings` construction | No |
| `self.embeddings.embed_query(content)` | No |
| `search_client.upload_documents(...)` | No |
| `BlobServiceClient` construction | No |
| `blob_client_completed.upload_blob(...)` | No |
| `blob_client.delete_blob()` | No |

The most dangerous missing timeout is on `embed_query`. LangChain's `AzureOpenAIEmbeddings` delegates to the `openai` client, whose default request timeout in recent versions is 10 minutes, longer than the 5-minute Consumption-plan `functionTimeout`. Because the code does not set a tighter limit, a slow or unresponsive Azure OpenAI endpoint can hold a worker slot until the host terminates the invocation.

`host.json` does not set `functionTimeout`, so the platform default applies (5 minutes on Consumption plan). This is not explicit and could silently change if the hosting plan changes.
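The preferred fix is the client's own timeout setting. Where a library offers none, a client-agnostic deadline can be sketched with the standard library as below. This is an illustrative fallback under stated assumptions, not the project's code: `call_with_deadline` and `slow_embed` are hypothetical names, and `slow_embed` merely stands in for a hung `embed_query` call.

```python
import concurrent.futures
import time


def call_with_deadline(fn, *args, timeout_s: float, **kwargs):
    """Run fn in a worker thread and stop waiting after timeout_s seconds."""
    executor = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = executor.submit(fn, *args, **kwargs)
    try:
        return future.result(timeout=timeout_s)
    finally:
        # Do not block on a stuck call: the worker thread is abandoned,
        # not killed, and finishes (or hangs) in the background.
        executor.shutdown(wait=False)


def slow_embed(text: str) -> list:
    """Hypothetical stand-in for an embedding call that responds too slowly."""
    time.sleep(0.5)
    return [0.0]


try:
    call_with_deadline(slow_embed, "hello", timeout_s=0.05)
    outcome = "completed"
except concurrent.futures.TimeoutError:
    outcome = "timed out"

print(outcome)
```

The caller regains control at the deadline, but the underlying request is not cancelled, so this limits how long an invocation waits without freeing the connection. A timeout configured on the HTTP client itself remains the better option when available.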

---

## 5. Idempotency Assessment

**Positive finding:** The AI Search `upload_documents` call is effectively an upsert because Azure AI Search's upload action replaces existing documents with the same key. The key field is `reference_code`, which is extracted from the HTML content. Re-processing the same blob will overwrite the same document — this is idempotent for the search index.

**Positive finding:** The blob copy to the `completed` container uses `overwrite=True`, so re-processing the same blob overwrites the completed copy instead of failing or duplicating it. That operation is idempotent.

**Concern:** Index creation (`create_index`) is guarded by a `get_index` check, so it is skipped when the index already exists. However, if `get_index` fails with something other than `ResourceNotFoundError` (e.g., a network error), the code still proceeds to `create_index`, which will either fail in turn and obscure the original error or attempt to create an index that already exists.
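A stricter guard treats only "not found" as a reason to create the index and lets every other error propagate. The sketch below illustrates that shape under assumptions: `ensure_index` is a hypothetical helper, and the local `ResourceNotFoundError` class is a stand-in for `azure.core.exceptions.ResourceNotFoundError` so the example runs without the Azure SDK installed.

```python
class ResourceNotFoundError(Exception):
    """Stand-in for azure.core.exceptions.ResourceNotFoundError."""


def ensure_index(index_client, index_name: str, index_definition) -> str:
    """Create the index only when the service reports it as missing."""
    try:
        index_client.get_index(index_name)
        return "exists"
    except ResourceNotFoundError:
        # The only case in which creating the index is the right response.
        index_client.create_index(index_definition)
        return "created"
    # Any other exception (network failure, auth error) propagates unchanged,
    # so the caller sees the real cause instead of a follow-on create error.
```

With this shape, a network failure during `get_index` surfaces as itself and `create_index` is never called on a guess.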

---

## 6. Resource Cleanup Assessment

**Source blob deletion is attempted unconditionally, even on processing failure:**

Lines 99–101 are outside the `try` block, so they run after the `except` handlers have swallowed the original exception. What happens next depends on where processing failed:

- If the failure occurred before `blob_service_client` and `container_name` were assigned (both are set inside the `try` block), the code raises `UnboundLocalError` and the delete does not execute. The blob stays in the `load` container, which is the safer outcome, but the invocation fails with a secondary error that hides the root cause.
- If the failure occurred after those assignments, `delete_blob()` executes and the source blob is removed even though processing did not complete.

If processing succeeds (no exception), `delete_blob()` will execute normally.

**No `with` block or `try/finally` for SDK clients:** All client objects (`blob_service_client`, `search_client`, `search_index_client`) are created but never explicitly closed. Azure SDK clients hold pooled HTTP connections that are released on an explicit `.close()` or, failing that, whenever the object is garbage collected. In a long-running Azure Functions worker, these connections may accumulate across invocations.

**`AzureOpenAIEmbeddings` client:** Created inside the function handler on every invocation. This is safe for cleanup (garbage collected after the invocation) but inefficient (see §8).

---

## 7. Graceful Degradation Assessment

The function processes exactly one blob per invocation — there is no batch processing loop, so partial batch failure does not apply here.

**Dead-letter handling:** The function has no dead-letter mechanism of its own. If a blob fails to process (e.g., it is malformed HTML, or the embedding API is unavailable), it typically stays in the `load` container because of the `UnboundLocalError` described in §6. Since the root-cause exception is swallowed, no retry is driven by the actual failure, and nothing moves the blob aside or records why it failed. The blob can then sit in the `load` container indefinitely and is reprocessed only if it is uploaded again.

**Malformed HTML:** `html_to_json` uses BeautifulSoup and accesses fields like `json_data["reference_code"]`. If the HTML does not contain the expected `<h2>Reference Code:</h2>` section, the field defaults to `"No reference code"`. The fallback keeps parsing from failing, but `reference_code` is the index key (§5), so every blob that lacks the section would be stored under the same placeholder key and overwrite the previous one, a collision and data quality problem.
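One way to keep the placeholder out of the index is to validate the key before embedding or uploading. The following is a minimal sketch under assumptions: `require_reference_code` and `MissingReferenceCodeError` are hypothetical names, and only the placeholder string and the `reference_code` field come from the code under review.

```python
PLACEHOLDER_KEY = "No reference code"  # fallback value produced by the parser


class MissingReferenceCodeError(ValueError):
    """Hypothetical error type marking a blob as permanently unprocessable."""


def require_reference_code(json_data: dict) -> str:
    """Return a usable index key, or raise instead of storing the placeholder."""
    key = str(json_data.get("reference_code", "")).strip()
    if not key or key == PLACEHOLDER_KEY:
        raise MissingReferenceCodeError("blob has no usable reference code")
    return key
```

Raising a dedicated error type lets the caller tell a permanently bad document apart from a transient service failure, which matters for the dead-letter handling discussed above.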

---

## 8. Findings

| Severity | ID | Title | File |
|---|---|---|---|
| 🔴 Critical | PYRES-RETRY-001 | No retry policy in host.json | `src/DocumentLoaderFunction/host.json` |
| 🔴 Critical | PYRES-ERR-001 | All exceptions swallowed — function always returns success on failure | `src/DocumentLoaderFunction/function_app.py` |
| 🔴 Critical | PYRES-ERR-002 | Blob delete executed outside try block — references undefined variables on error path | `src/DocumentLoaderFunction/function_app.py` |
| 🔴 Critical | PYRES-TIMEOUT-001 | No timeout on embedding API call: function can block until the host timeout | `src/DocumentLoaderFunction/function_app.py` |
| 🟡 Notable | PYRES-TIMEOUT-002 | functionTimeout not explicitly set in host.json | `src/DocumentLoaderFunction/host.json` |
| 🟡 Notable | PYRES-CLEANUP-001 | Blob clients never explicitly closed — connection pool leak risk | `src/DocumentLoaderFunction/function_app.py` |
| 🟡 Notable | PYRES-DLQ-001 | No dead-letter handling for permanently unprocessable blobs | `src/DocumentLoaderFunction/function_app.py` |
| 🟡 Notable | PYRES-ERR-003 | Index creation not guarded against non-ResourceNotFoundError exceptions | `src/DocumentLoaderFunction/function_app.py` |
| 🟢 Minor | PYRES-RETRY-002 | AzureOpenAIEmbeddings client created on every invocation — no connection reuse | `src/DocumentLoaderFunction/function_app.py` |
| ℹ️ Info | PYRES-IDEM-001 | AI Search upload_documents is effectively idempotent via key-based replace | `src/DocumentLoaderFunction/function_app.py` |
| ℹ️ Info | PYRES-IDEM-002 | Blob copy to completed container uses overwrite=True — idempotent | `src/DocumentLoaderFunction/function_app.py` |

---

### Finding Details

**PYRES-RETRY-001 — No retry policy in host.json** (🔴 Critical)

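`host.json` has no `retry` section, and no other retry behaviour is configured or documented for this function. The polling-based blob trigger retries a failed invocation on its own (five attempts by default, then a message on the `webjobs-blobtrigger-poison` queue), and runtime retry policies are documented as supported only for certain triggers, so the applicable mechanism should be confirmed before relying on a `retry` block. Whichever mechanism applies, transient failures (network blip, embedding API rate limit, AI Search timeout) need a bounded, explicit retry with a fixed delay or exponential backoff (e.g., 5 attempts). Today nothing guarantees one, and PYRES-ERR-001 prevents any retry from firing.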
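Independently of any platform-level retry, transient failures can also be retried in-process around a single call. The sketch below shows exponential backoff under assumptions: `retry_with_backoff` and `TransientError` are hypothetical names, and deciding which real exceptions count as transient (for example an HTTP 429 from the embedding API) is left to the caller.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for retriable failures (e.g., HTTP 429, timeouts)."""


def retry_with_backoff(fn, attempts: int = 5, base_delay_s: float = 1.0,
                       sleep=time.sleep):
    """Call fn until it succeeds, doubling the delay after each transient failure."""
    for attempt in range(attempts):
        try:
            return fn()
        except TransientError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(base_delay_s * (2 ** attempt))  # 1s, 2s, 4s, ...
```

The `sleep` parameter is injectable so the delays can be observed in tests without waiting. Non-transient exceptions are not caught and propagate on the first attempt.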

**PYRES-ERR-001 — All exceptions swallowed** (🔴 Critical)

Both exception handlers in `Loader` log and return without re-raising. Azure Functions interprets a normal return as success and does not retry. Any transient or permanent failure is invisible to the platform. Errors must be re-raised (or a specific non-retriable exception type used to signal permanent failure) so that the retry policy can activate.

**PYRES-ERR-002 — Blob delete outside try block** (🔴 Critical)

Lines 99–101 reference `blob_service_client` and `container_name`, which are only assigned inside the `try` block. If an exception occurs before those assignments (e.g., during `DefaultAzureCredential` token acquisition), these lines raise `UnboundLocalError`. This accidentally prevents premature blob deletion, but the secondary error masks the original failure. If the exception occurs after the assignments, the source blob is deleted despite the failure. The cleanup logic must be reorganised so that it runs only after processing has succeeded, inside the guarded block.

**PYRES-TIMEOUT-001 — No timeout on embedding API call** (🔴 Critical)

`self.embeddings.embed_query(content)` has no explicit timeout. The Azure OpenAI API can become slow or unresponsive during service incidents. Without a timeout, the function worker is blocked for up to `functionTimeout` (default 5 minutes on Consumption plan), holding a slot and making no progress. A timeout of 30–60 seconds should be set on the `AzureOpenAIEmbeddings` client or the underlying HTTP client.

**PYRES-TIMEOUT-002 — functionTimeout not set** (🟡 Notable)

`host.json` does not declare `functionTimeout`. The default is 5 minutes on the Consumption plan and 30 minutes on Dedicated/Premium plans (where the maximum, not the default, is unbounded). If the hosting plan changes, the effective timeout changes silently. An explicit `"functionTimeout": "00:05:00"` (or an appropriate value for the expected processing load) should be set.

**PYRES-CLEANUP-001 — Blob clients never explicitly closed** (🟡 Notable)

`BlobServiceClient`, `SearchClient`, and `SearchIndexClient` are created on every invocation and never closed. Azure SDK clients hold HTTP connection pools. In a long-running worker that processes many blobs, these pools may accumulate open connections. Clients should be closed with `client.close()` in a `finally` block, or ideally instantiated once at module level so the pool is shared across invocations.

**PYRES-DLQ-001 — No dead-letter handling** (🟡 Notable)

If a blob is permanently unprocessable (malformed HTML, invalid reference code, persistent embedding API errors), there is no mechanism to move it to a dead-letter location. The blob will remain in the `load` container, but because errors are swallowed, the trigger does not fire again automatically. Adding a `dead-letter` container and moving failed blobs there (with an error metadata tag) would make failures visible and recoverable.
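The routing decision can be sketched independently of the storage calls. In the example below, `process_or_dead_letter`, `PermanentDocumentError`, and the `process` and `dead_letter` callables are all hypothetical; the actual move to a `dead-letter` container would live inside whatever function is passed as `dead_letter`.

```python
class PermanentDocumentError(Exception):
    """Hypothetical marker for failures that retrying cannot fix."""


def process_or_dead_letter(blob_name: str, process, dead_letter) -> str:
    """Run process(blob_name); park permanently bad blobs, re-raise the rest."""
    try:
        process(blob_name)
        return "processed"
    except PermanentDocumentError as exc:
        # Retrying cannot help, so record the reason and move the blob aside.
        dead_letter(blob_name, reason=str(exc))
        return "dead-lettered"
    # Any other exception propagates so the host can retry the invocation.
```

Separating permanent from transient failures keeps retries for cases where they can succeed and makes every unprocessable blob visible with a recorded reason.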

**PYRES-ERR-003 — Index creation not guarded against non-ResourceNotFoundError** (🟡 Notable)

In `populate_search_index`, the guard around `get_index` does not distinguish `ResourceNotFoundError` from other failures. If `get_index` raises any other exception (e.g., `ServiceRequestError` on network failure), `index_exists` remains `False` and the code proceeds to call `create_index`, which will likely fail with a different error. The guard should treat only `ResourceNotFoundError` as "index missing" and re-raise everything else.

**PYRES-RETRY-002 — Embeddings client created on every invocation** (🟢 Minor)

`AzureOpenAIEmbeddings` is instantiated inside the `Loader` function body. Each instantiation creates a new HTTP client and connection pool. Moving this to module-level initialisation would allow connection reuse across invocations and reduce latency and resource overhead.
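Lazy, cached construction is one way to get module-level reuse without paying the cost at import time. This is a sketch under assumptions: `get_embeddings_client` is a hypothetical accessor and `FakeEmbeddingsClient` stands in for `AzureOpenAIEmbeddings` so the example runs without LangChain installed.

```python
from functools import lru_cache


class FakeEmbeddingsClient:
    """Hypothetical stand-in for AzureOpenAIEmbeddings; counts constructions."""

    instances = 0

    def __init__(self):
        FakeEmbeddingsClient.instances += 1


@lru_cache(maxsize=1)
def get_embeddings_client() -> FakeEmbeddingsClient:
    """Build the client on first use and reuse it for later invocations."""
    return FakeEmbeddingsClient()


first = get_embeddings_client()
second = get_embeddings_client()
print(first is second, FakeEmbeddingsClient.instances)
```

A cached client also outlives any short-lived credential it was built with, so if the real client captures an Azure AD token at construction, the cache needs a refresh path (for example calling `get_embeddings_client.cache_clear()` when the token nears expiry).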

**PYRES-IDEM-001 — AI Search upload_documents is effectively idempotent** (ℹ️ Info)

Azure AI Search's upload action replaces existing documents with the same key (`reference_code`). Re-processing the same blob will overwrite, not duplicate, the document. This is good resilience behaviour.

**PYRES-IDEM-002 — Blob copy to completed uses overwrite=True** (ℹ️ Info)

The copy to the `completed` container uses `overwrite=True`, so re-processing the same blob on retry does not leave orphaned copies. This is correct behaviour.

---

## 9. Recommended Improvements

| Finding | Recommended Action | Priority |
|---|---|---|
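| PYRES-RETRY-001 | Confirm which retry mechanism applies to the blob trigger, then make it explicit: add a `retry` block to `host.json` with `strategy: fixedDelay`, `maxRetryCount: 5`, `delayInterval: 00:00:30` where runtime retry policies are honoured; otherwise rely on the trigger's built-in retry and monitor its poison queue | P0 — Before shipping |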
| PYRES-ERR-001 | Re-raise exceptions in `Loader` after logging, so Azure Functions retry policy can activate | P0 — Before shipping |
| PYRES-ERR-002 | Move the blob copy and delete into the `try` block, after all processing steps, so the source blob is deleted only on success; keep client cleanup in `finally` | P0 — Before shipping |
| PYRES-TIMEOUT-001 | Set an explicit 30–60 second request timeout on `AzureOpenAIEmbeddings` (the `timeout` argument, which `langchain-openai` also exposes as `request_timeout`) or on the underlying `openai` HTTP client | P0 — Before shipping |
| PYRES-TIMEOUT-002 | Add `"functionTimeout": "00:05:00"` to `host.json` to make the timeout explicit and stable across plan changes | P1 — Fix soon |
| PYRES-CLEANUP-001 | Close `BlobServiceClient`, `SearchClient`, `SearchIndexClient` in `finally` blocks, or move them to module-level singletons | P1 — Fix soon |
| PYRES-DLQ-001 | Move blobs that fail permanently to a `dead-letter` container with error metadata; log the move | P1 — Fix soon |
| PYRES-ERR-003 | In `populate_search_index`, re-raise exceptions from `get_index` that are not `ResourceNotFoundError` | P1 — Fix soon |
| PYRES-RETRY-002 | Move `AzureOpenAIEmbeddings` instantiation to module level for connection reuse | P2 — Fix when convenient |

### Example host.json retry configuration

```json
{
  "version": "2.0",
  "retry": {
    "strategy": "fixedDelay",
    "maxRetryCount": 5,
    "delayInterval": "00:00:30"
  },
  "functionTimeout": "00:05:00",
  "logging": {
    "applicationInsights": {
      "samplingSettings": {
        "isEnabled": true,
        "excludedTypes": "Request"
      }
    }
  },
  "extensionBundle": {
    "id": "Microsoft.Azure.Functions.ExtensionBundle",
    "version": "[4.*, 5.0.0)"
  }
}
```

### Example corrected error handling skeleton

```python
import logging
import traceback

import azure.functions as func

app = func.FunctionApp()


@app.blob_trigger(arg_name="myblob", path="load", connection="BlobTriggerConnection")
def Loader(myblob: func.InputStream):
    # Bound before the try block so `finally` can always reference it safely.
    blob_service_client = None
    try:
        blob_content = myblob.read()
        # ... all processing (assigns blob_service_client) ...

        # Reached only if every step above succeeded, so the delete is safe.
        blob_client = blob_service_client.get_blob_client(...)
        blob_client.delete_blob()
    except Exception as e:
        logging.error(f"Loader failed for blob {myblob.name}: {e}")
        logging.error(traceback.format_exc())
        raise  # Surface the failure so the Functions host can retry
    finally:
        # Close the client last; closing it before the delete would break the delete.
        if blob_service_client is not None:
            blob_service_client.close()
```

---

## Cross-Cutting Observations

- The function sets `OPENAI_API_KEY` and `AZURE_OPENAI_AD_TOKEN` as environment variables at runtime. Mutating `os.environ` inside a function invocation is not thread-safe in a multi-threaded worker and could cause token leakage or stale tokens across invocations if the worker is reused. This is an application-level concern but also a resilience risk (stale tokens could cause auth failures without surfacing as retriable errors). The Python Observability Steward and Python Config Steward may also wish to note this.
- `requirements.txt` pins `azure-identity==1.17.1` and `azure-core==1.29.0` but leaves `langchain`, `langchain-openai`, `langchain-community`, `azure-functions`, `azure-storage-blob`, `azure-keyvault-secrets`, and `requests` unpinned. Unpinned dependencies are a resilience risk: a breaking update to `langchain-openai` or `azure-storage-blob` could cause silent processing failures after an automatic dependency refresh. The Python Best Practices Steward owns this concern.

---

*This review is based on static analysis of source files as of 2026-03-22. It does not reflect runtime behaviour, deployment configuration beyond the files present, or findings from dynamic testing. Generated by the Python Resilience Steward (PYRES).*
