[← Back to Reviews Index](../Stewards%20Reviews%20Index.md)

# API Observability Review — Azure-AI-RAG-CSharp-Semantic-Kernel-Functions

| Field | Value |
|---|---|
| **Project** | Azure-AI-RAG-CSharp-Semantic-Kernel-Functions |
| **Review Date** | 2026-03-21 |
| **Steward** | API Observability Steward (AOTL) |
| **Target** | `src/ChatAPI` (C# / ASP.NET Core, .NET 8) |
| **Critical** | 3 |
| **Notable** | 5 |
| **Minor** | 3 |
| **Info** | 3 |
| **Total** | 14 |

---

## 1. Observability Architecture Overview

The ChatAPI is a .NET 8 ASP.NET Core Web API that acts as a Retrieval-Augmented Generation (RAG) orchestrator. It integrates Azure OpenAI (chat completion and text embeddings), Azure AI Search (vector + semantic search), and Azure Cosmos DB (product catalog and chat history). Semantic Kernel plugins mediate calls to those services. A React front end and a Python Azure Function (document loader) exist in the same repository but are out of scope for this audit.

The current observability stack is minimal:

- **OpenTelemetry**: `Azure.Monitor.OpenTelemetry.AspNetCore` v1.1.1 is referenced and wired up via `UseAzureMonitor()`. This provides a baseline pipeline.
- **Structured logging**: `ILogger<T>` is used throughout; no `Console.WriteLine` or `Debug.WriteLine` calls exist in production code. Log messages mostly use structured message templates correctly.
- **Health checks**: None configured.
- **Custom spans/activities**: None configured.
- **Sampling**: Default adaptive sampling inherited from `UseAzureMonitor()` — not explicitly configured.
- **Correlation ID propagation**: Relies entirely on the default W3C TraceContext support baked into `UseAzureMonitor()`; no explicit middleware or validation.

Overall observability maturity is **low**. The Azure Monitor integration creates a baseline, but the pipeline is untuned, health checks are absent, critical paths lack custom spans, error paths use wrong log levels, and sensitive data may reach Application Insights through unbounded response logging.

---

## 2. OpenTelemetry Pipeline Assessment

### What is present

`Program.cs` lines 65–69 wire up the pipeline:

```csharp
builder.Services.AddOpenTelemetry().UseAzureMonitor(options =>
{
    options.ConnectionString = builder.Configuration["APPLICATIONINSIGHTS_CONNECTION_STRING"];
    options.Credential = new DefaultAzureCredential();
});
```

`UseAzureMonitor()` from `Azure.Monitor.OpenTelemetry.AspNetCore` automatically configures:
- Traces with ASP.NET Core and HttpClient auto-instrumentation.
- Metrics (ASP.NET Core request metrics, runtime metrics).
- Log export to Azure Monitor.

### Gaps identified

1. **No explicit `WithTracing`, `WithMetrics`, or `WithLogging` configuration** — the pipeline configuration is entirely implicit. There is no `AddAspNetCoreInstrumentation()`, no `AddHttpClientInstrumentation()`, and no Cosmos DB instrumentation explicitly added. While `UseAzureMonitor()` injects some defaults, the lack of explicit configuration makes it impossible to know (from code review alone) what is actually being traced.

2. **No service name or version in resource attributes** — `UseAzureMonitor()` without a `ConfigureResource()` call means the service name reported to Application Insights defaults to the process name or entry assembly name, which may not be stable or meaningful. No `ResourceBuilder.CreateDefault().AddService("ChatAPI", serviceVersion: "1.0.0")` is configured.

3. **No OTLP exporter configured for non-Azure Monitor targets** — the pipeline is 100% Azure Monitor. There is no OTLP exporter for use with local OpenTelemetry Collector or alternative backends, which constrains local debugging and future portability.

4. **No Cosmos DB / Semantic Kernel instrumentation** — the most critical upstream calls (Cosmos DB reads/writes, Azure AI Search queries, Azure OpenAI completions) are not explicitly instrumented. `UseAzureMonitor()` may capture HttpClient spans for some of these, but Cosmos DB SDK v3 has its own `AddCosmosDbClientInstrumentation()` package, which is not referenced.

5. **Swagger is unconditionally enabled** (lines 87–88 in Program.cs — the `if (app.Environment.IsDevelopment())` guard is commented out). This is not a direct observability issue but is worth noting for the audit record.

---

## 3. Structured Logging Assessment

### Positive observations

- `ILogger<T>` is consistently injected via primary constructor parameters throughout all classes.
- No `Console.WriteLine` or `Debug.WriteLine` calls are present anywhere in the ChatAPI project.
- Most log messages use structured named parameters correctly (e.g., `_logger.LogInformation("Chat History Count {count}", chatHistory.Count)`).

### Gaps identified

**String concatenation instead of structured template (GenerateProductInfo.cs, line 67):**

```csharp
_logger.LogInformation("Product Content: " + productContent);
```

This uses string concatenation rather than a structured message template. It defeats the structured logging system, prevents log aggregation by template, and — critically — logs the full serialised product JSON content at `Information` level with no size bound.

**Wrong log level on exception paths — error surfaced as Information:**

In `AISearchDataPlugin.cs` (line 46) and `ProductDataPlugin.cs` (lines 38, 62):

```csharp
_logger.LogInformation(ex, "Error retrieving aisearch data: {question}", question);
_logger.LogInformation(ex, "Error retrieving product data for ProductId: {ProductId}", product_id);
```

These are exceptions with a re-throw, meaning they represent actual failures. They must be logged at `LogError`, not `LogInformation`. Using `LogInformation` on failure paths means:
- Error counts in Application Insights will be wrong.
- Alert rules based on error log severity will miss these failures.
- The AI Search plugin also passes the user's question as a structured parameter to the error log, which may include PII (see §8).

**Full AI response logged at Information level (ChatService.cs, line 55):**

```csharp
_logger.LogInformation("Response {response}", resp);
```

The entire LLM response (which can be lengthy and may include content derived from the user's question or personal context) is logged without any truncation or scrubbing. This is a data volume concern and a potential PII concern.

**Full product JSON logged at Information level (ProductDataPlugin.cs, line 31):**

```csharp
_logger.LogInformation("Product data retrieved: {ProductData}", productJson);
```

Complete product JSON is logged unconditionally. While product data in this context is relatively controlled, the pattern of logging full serialised objects without destructuring or size limits is risky.

**ChatHistoryData uses `ILogger<ProductData>` — wrong generic type parameter:**

```csharp
public sealed class ChatHistoryData(CosmosClient cosmosClient, ILogger<ProductData> logger, IConfiguration config)
```

The logger category will appear as `ChatAPI.Data.ProductData` in all logs from `ChatHistoryData`, not `ChatAPI.Data.ChatHistoryData`. This makes log filtering and analysis harder.

---

## 4. Custom Spans / Activities Assessment

No `ActivitySource`, `Activity`, or custom span creation exists anywhere in the ChatAPI codebase. The following critical operations are currently represented in traces only if the underlying HTTP client or SDK emits spans automatically:

| Operation | Where | Custom Span? |
|---|---|---|
| LLM completion call | `ChatService.GetResponseAsync` | No |
| Vector embedding generation | `AISearchDataPlugin.ResourceLookup` | No |
| Azure AI Search query | `AISearchData.RetrieveDocumentationAsync` | No |
| Cosmos DB product read | `ProductData.GetProductAsync` | No |
| Cosmos DB chat history read | `ChatHistoryData.GetMessagesBySessionIdAsync` | No |
| Cosmos DB chat history write | `ChatHistoryData.AddUserMessageAsync` / `AddAssistantMessageAsync` | No |
| Startup data population | `GenerateProductInfo.PopulateCosmosAsync` | No |

Without custom spans on the LLM completion path and the vector search path, it is not possible to isolate latency contributions from the OpenAI API vs Azure AI Search vs Cosmos DB in Application Insights' end-to-end transaction views.

---

## 5. Health Check Assessment

No health check infrastructure is configured in the ChatAPI:

- `AddHealthChecks()` is absent from `Program.cs`.
- `MapHealthChecks()` is absent.
- No `/health`, `/healthz`, or `/ready` endpoint exists.

The application has three critical downstream dependencies:
- **Azure Cosmos DB** — used for chat history and product catalog.
- **Azure AI Search** — used for RAG document retrieval.
- **Azure OpenAI** — used for completions and embeddings.

Without health checks, container orchestrators (Azure Container Apps, AKS, App Service) cannot determine whether the application is ready to serve traffic after startup. The startup logic in `PopulateCosmosAsync()` already performs a Cosmos DB connectivity check implicitly, but this result is not exposed as a health probe.

---

## 6. Azure Monitor Integration Assessment

### Positive observations

- `Azure.Monitor.OpenTelemetry.AspNetCore` v1.1.1 is correctly referenced in the `.csproj`.
- `UseAzureMonitor()` uses `APPLICATIONINSIGHTS_CONNECTION_STRING` from configuration (not hardcoded).
- `DefaultAzureCredential` is correctly used for Azure Monitor authentication — avoiding key-based authentication.

### Gaps identified

- **No sampling configuration** — `UseAzureMonitor()` defaults to adaptive sampling. For an LLM-based API where each conversation involves multiple round trips (embedding + search + completion), sampling without explicit configuration may either over-sample expensive traces or under-sample in high-traffic scenarios.
- **`APPLICATIONINSIGHTS_CONNECTION_STRING` is not validated at startup** — if the environment variable is missing or empty, the monitor connection will silently fail or use a null connection string. No guard or startup validation exists.
- **No `ConfigureResource()` call** — service name and version are not explicitly set.

---

## 7. Correlation ID Propagation Assessment

### What is present

`UseAzureMonitor()` enables W3C TraceContext propagation by default. Incoming HTTP requests will have a trace ID assigned, and this trace ID is propagated in outgoing HttpClient calls automatically.

### Gaps identified

- **No correlation ID in error responses** — when the `ChatController` or any downstream service throws an exception, the HTTP error response does not include the trace ID or a correlation ID header. Consumers cannot correlate a failed request to a specific trace in Application Insights without server-side log access.
- **No explicit middleware validation** — there is no `app.UseRouting()` + middleware that logs or validates incoming `traceparent`/`tracestate` headers. Propagation is entirely implicit and unverified.
- **No structured logging enrichment** — log messages do not use `BeginScope` or similar to enrich log entries with `SessionId` or `TraceId`, which would make it easier to correlate logs for a single conversation across multiple requests.

---

## 8. Log Scrubbing Assessment

### Gaps identified

- **Full LLM response logged** — `ChatService.cs` line 55 logs the complete AI response. In a support assistant scenario, the response may echo user-provided information (names, account details) that constitutes PII.
- **User question logged on error path** — `AISearchDataPlugin.cs` line 46 includes the raw user question in the error log: `"Error retrieving aisearch data: {question}", question`. User queries may include PII.
- **Full product JSON logged** — `ProductDataPlugin.cs` line 31 logs complete product serialisation. While likely low-risk for product data, the pattern should be consistent: log a summary (e.g., product ID and name only).
- **No destructuring policies** — there are no log enricher configurations, no `Serilog`-style destructuring policies, and no log scrubbing middleware. The default `ILogger` pipeline provides no automatic PII filtering.

---

## 9. Findings

| Severity | ID | Title | File |
|---|---|---|---|
| 🔴 Critical | AOTL-HEALTH-001 | No health check endpoint configured | `Program.cs` |
| 🔴 Critical | AOTL-LOG-001 | Exceptions logged at Information level instead of Error | `Plugins/AISearchDataPlugin.cs`, `Plugins/ProductDataPlugin.cs` |
| 🔴 Critical | AOTL-LOG-002 | Full AI response and full product JSON logged — potential PII and data volume risk | `Services/ChatService.cs`, `Plugins/ProductDataPlugin.cs` |
| 🟡 Notable | AOTL-TRACE-001 | No custom spans on critical LLM/search/Cosmos DB paths | `Services/ChatService.cs`, `Data/AISearchData.cs`, `Data/ChatHistoryData.cs` |
| 🟡 Notable | AOTL-PIPE-001 | No explicit OpenTelemetry pipeline configuration — pipeline is entirely implicit | `Program.cs` |
| 🟡 Notable | AOTL-PIPE-002 | No service name or version set in OTel resource attributes | `Program.cs` |
| 🟡 Notable | AOTL-PIPE-003 | No Cosmos DB SDK instrumentation package referenced | `ChatAPI.csproj` |
| 🟡 Notable | AOTL-LOG-003 | String concatenation used instead of structured template | `Data/sample_data/products/GenerateProductInfo.cs` |
| 🟡 Notable | AOTL-CORR-001 | No correlation ID included in error responses | `Controllers/ChatController.cs` |
| 🟢 Minor | AOTL-LOG-004 | `ChatHistoryData` injects `ILogger<ProductData>` — wrong category | `Data/ChatHistoryData.cs` |
| 🟢 Minor | AOTL-MON-001 | No sampling explicitly configured for Azure Monitor | `Program.cs` |
| 🟢 Minor | AOTL-MON-002 | No startup validation for `APPLICATIONINSIGHTS_CONNECTION_STRING` | `Program.cs` |
| ℹ️ Info | AOTL-INFO-001 | `UseAzureMonitor()` with `DefaultAzureCredential` is a good pattern — no key-based auth | `Program.cs` |
| ℹ️ Info | AOTL-INFO-002 | `ILogger<T>` is consistently used; no Console.WriteLine calls present | All files |
| ℹ️ Info | AOTL-INFO-003 | W3C TraceContext propagation is implicitly enabled via UseAzureMonitor | `Program.cs` |

---

## 10. Recommended Improvements

| Finding | Recommended Action | Priority |
|---|---|---|
| AOTL-HEALTH-001 | Add `builder.Services.AddHealthChecks()` with liveness checks for Cosmos DB, Azure AI Search, and Azure OpenAI. Register `/healthz` and `/ready` endpoints. Exclude from authentication. | Must fix |
| AOTL-LOG-001 | Change `LogInformation(ex, ...)` to `LogError(ex, ...)` on all exception catch blocks that re-throw. Errors must be logged at Error severity. | Must fix |
| AOTL-LOG-002 | Truncate or omit large response payloads from logs. Use `resp.Length` or a preview (first 200 chars) instead of the full response. Remove full product JSON logging; log only `ProductId`. | Must fix |
| AOTL-TRACE-001 | Create a static `ActivitySource` (e.g., `"ChatAPI"`) and wrap `GetResponseAsync`, `RetrieveDocumentationAsync`, and Cosmos DB data calls in `using var activity = _activitySource.StartActivity(...)`. Add key attributes (sessionId, questionLength, resultCount). | Fix soon |
| AOTL-PIPE-001 | Augment `AddOpenTelemetry()` with explicit `WithTracing(b => b.AddAspNetCoreInstrumentation().AddHttpClientInstrumentation())` and `WithMetrics(...)` to make the pipeline configuration visible and maintainable. | Fix soon |
| AOTL-PIPE-002 | Add `.ConfigureResource(r => r.AddService("ChatAPI", serviceVersion: "1.0.0"))` before `UseAzureMonitor()`. | Fix soon |
| AOTL-PIPE-003 | Add `OpenTelemetry.Instrumentation.CosmosDB` (or equivalent) to capture Cosmos DB SDK spans explicitly. | Fix soon |
| AOTL-LOG-003 | Replace `"Product Content: " + productContent` with `_logger.LogInformation("Product Content: {ProductContent}", productContent)`. | Fix soon |
| AOTL-CORR-001 | Add a global exception handler (e.g., `app.UseExceptionHandler`) that includes `Activity.Current?.TraceId.ToString()` in the error response body or as a `X-Correlation-Id` response header. | Fix soon |
| AOTL-LOG-004 | Change `ILogger<ProductData>` to `ILogger<ChatHistoryData>` in the `ChatHistoryData` constructor. | Fix when convenient |
| AOTL-MON-001 | Configure explicit sampling: e.g., `options.SamplingRatio = 0.1` for high-volume environments, or keep `1.0` for low-volume but document the decision. | Fix when convenient |
| AOTL-MON-002 | Add a startup guard: `if (string.IsNullOrEmpty(builder.Configuration["APPLICATIONINSIGHTS_CONNECTION_STRING"])) _logger.LogWarning(...)` to surface misconfiguration early. | Fix when convenient |

---

## 11. Observability Maturity Checklist

| Check | Status | Notes |
|---|---|---|
| OpenTelemetry package referenced | PASS | `Azure.Monitor.OpenTelemetry.AspNetCore` v1.1.1 |
| `AddOpenTelemetry()` / `UseAzureMonitor()` configured | PASS | Present in `Program.cs` |
| Service name and version in resource attributes | FAIL | Not configured |
| Traces pipeline explicit | FAIL | Fully implicit via `UseAzureMonitor()` |
| Metrics pipeline explicit | FAIL | Fully implicit via `UseAzureMonitor()` |
| OTLP exporter for non-Azure Monitor targets | FAIL | Not configured |
| ASP.NET Core auto-instrumentation | PARTIAL | Implicit via `UseAzureMonitor()` |
| HttpClient auto-instrumentation | PARTIAL | Implicit via `UseAzureMonitor()` |
| Cosmos DB SDK instrumentation | FAIL | No explicit package or configuration |
| Custom spans on critical paths | FAIL | None present |
| `ILogger<T>` used consistently | PASS | All classes use constructor injection |
| No Console.WriteLine in production code | PASS | None found |
| Structured log templates (no string interpolation) | PARTIAL | One concatenation in `GenerateProductInfo.cs` |
| Correct log levels on error paths | FAIL | Multiple plugin exceptions logged at Information |
| PII / large payload scrubbing | FAIL | Full LLM responses and product JSON logged |
| Wrong logger category | FAIL | `ChatHistoryData` uses `ILogger<ProductData>` |
| Health check endpoint registered | FAIL | Not present |
| Downstream health checks (Cosmos DB, AI Search, OpenAI) | FAIL | Not present |
| Azure Monitor connection string from config | PASS | Uses `APPLICATIONINSIGHTS_CONNECTION_STRING` |
| Azure Monitor uses managed identity | PASS | `DefaultAzureCredential` configured |
| No hard-coded connection strings | PASS | None found |
| Sampling configured | FAIL | Implicit adaptive only |
| Correlation ID in error responses | FAIL | Not implemented |
| W3C TraceContext propagation | PARTIAL | Implicit; no explicit validation or enrichment |
| Log scope enrichment with SessionId | FAIL | Not implemented |

---

## Footer

This review is based on **static analysis** of the source files in `src/ChatAPI` as of 2026-03-21. No runtime behaviour, live telemetry, or deployed Application Insights data was examined. Findings may not capture runtime-only issues such as sampling gaps, actual trace completeness, or missed correlations that only manifest under load.

Generated by the **API Observability Steward** (`api-otl-steward`). PREFIX: `AOTL`.
