[← Back to Reviews Index](../Stewards%20Reviews%20Index.md)

# API Telemetry Review — Azure-AI-RAG-CSharp-Semantic-Kernel-Functions

| Field | Value |
|---|---|
| Project | Azure-AI-RAG-CSharp-Semantic-Kernel-Functions |
| Date | 2026-03-21 |
| Steward | API Telemetry Steward |
| Scope | ChatAPI (C# / ASP.NET Core) |
| Critical | 3 |
| Notable | 5 |
| Minor | 3 |
| Info | 2 |
| Total | 13 |

---

## 1. Assessment Overview

The ChatAPI project is an ASP.NET Core 8 application that implements a RAG (Retrieval-Augmented Generation) chat service using Semantic Kernel and Azure AI Search. The application is wired to Azure Monitor / Application Insights via `Azure.Monitor.OpenTelemetry.AspNetCore`, which provides baseline infrastructure telemetry (HTTP request traces, dependency calls, and unhandled exceptions that reach the runtime).

However, **business event telemetry is almost entirely absent.** No structured business events are emitted across any of the critical user flows: chat completion, session creation, vector search, or product lookup. The application relies exclusively on unstructured `ILogger.LogInformation` calls, which are diagnostic logs rather than business events. There are no `ActivitySource`-based spans, no custom `Meter` counters or histograms, and no Semantic Kernel token usage tracking.

The OpenTelemetry infrastructure package is present but carries the full weight of telemetry alone. Without business event instrumentation, it is impossible to answer operational questions such as: How many chat requests succeeded today? What is the P95 latency of chat completions? How often does RAG retrieval fail? What is the token cost per conversation?

---

## 2. Business Flow Inventory

The following critical business operations were identified in the ChatAPI:

| # | Operation | Entry Point | Notes |
|---|---|---|---|
| 1 | **Session created** | `GET /session` → `SessionController.GetSession` | Generates a new session GUID |
| 2 | **Chat message received** | `POST /chat` → `ChatController.Post` | Receives user input and session ID |
| 3 | **Chat history initialized** | `ChatService.GetResponseAsync` | Loads prior messages from CosmosDB on first turn |
| 4 | **Chat message persisted (user)** | `ChatHistoryData.AddUserMessageAsync` | Saves user turn to CosmosDB |
| 5 | **AI completion requested** | `ChatService.GetResponseAsync` → `IChatCompletionService` | Core LLM call via Semantic Kernel |
| 6 | **Semantic Kernel function invoked: troubleshoot_lookup** | `AISearchDataPlugin.ResourceLookup` | Vector search over AI Search index |
| 7 | **Vector embedding generated** | `AISearchDataPlugin.ResourceLookup` → `_embedding.GenerateEmbeddingAsync` | Embedding for RAG retrieval |
| 8 | **Document search executed** | `AISearchData.RetrieveDocumentationAsync` | Hybrid semantic + vector query |
| 9 | **Semantic Kernel function invoked: get_azure_product_by_id** | `ProductDataPlugin.GetAzureProductDetailsById` | CosmosDB point read |
| 10 | **Chat response returned** | `ChatController.Post` | Final JSON response to client |
| 11 | **Chat message persisted (assistant)** | `ChatHistoryData.AddAssistantMessageAsync` | Saves assistant turn to CosmosDB |
| 12 | **Startup data population** | `GenerateProductInfo.PopulateCosmosAsync` | Seeding CosmosDB on startup |

---

## 3. Telemetry Coverage Map

| Operation | Telemetry Event | Coverage | Status |
|---|---|---|---|
| Session created | None | ❌ No event | Missing |
| Chat message received | `LogInformation("Result: {Result}")` — log only, after completion | ⚠️ Log only | Insufficient |
| Chat history initialized | `LogInformation("Init Chat History")` | ⚠️ Log only | Insufficient |
| Chat message persisted (user) | None | ❌ No event | Missing |
| AI completion requested | None — no event before or after the LLM call | ❌ No event | Missing |
| SK function: troubleshoot_lookup invoked | None on entry; error path has `LogInformation` (wrong level) | ❌ No event | Missing |
| Vector embedding generated | None | ❌ No event | Missing |
| Document search executed | None | ❌ No event | Missing |
| SK function: get_azure_product_by_id invoked | `LogInformation` on entry/exit only | ⚠️ Log only | Insufficient |
| Chat response returned | `LogInformation("Result: {Result}")` — full response payload in log | ⚠️ Log only (+ PII risk) | Insufficient |
| Chat message persisted (assistant) | None | ❌ No event | Missing |
| SK token usage | None — not captured anywhere | ❌ No event | Missing |
| Errors on chat completion | No try/catch; no error event | ❌ No error telemetry | Missing |
| Errors on session generation | No try/catch; no error event | ❌ No error telemetry | Missing |

**Summary:** 7 of the operations mapped above have no telemetry whatsoever; the remaining 4 have only unstructured log statements, not structured business events. Startup data population (operation 12) was not assessed separately.

---

## 4. Event Naming Assessment

No structured business events (custom `Activity` spans or named event records) exist in the codebase. All current "telemetry" consists of `ILogger.LogInformation` calls with unstructured message strings. The following issues apply to existing log statements used in lieu of events:

| Location | Current Log Message | Issue |
|---|---|---|
| `ChatController.cs:26` | `"Result: {Result}"` | Vague name — does not express business intent; logs full response content |
| `ChatService.cs:32` | `"Chat History Count {count}"` | Diagnostic/implementation detail, not a business event |
| `ChatService.cs:36` | `"Init Chat History"` | Vague; no session context or count |
| `ChatService.cs:55` | `"Response {response}"` | Logs full LLM response — potential PII/data exposure in logs |
| `AISearchDataPlugin.cs:46` | `"Error retrieving aisearch data: {question}"` | Uses `LogInformation` for an error — wrong severity level; question text may be PII |
| `ProductDataPlugin.cs:31` | `"Product data retrieved: {ProductData}"` | Logs entire product JSON payload — may be excessive |

No event follows the required verb-noun past-tense format (e.g., `chat.message.sent`, `search.query.executed`). There is no consistent namespace or naming convention.

---

## 5. Payload Quality Assessment

Since no structured events exist, this assessment evaluates what would be needed and what is currently absent:

| Required Field | Present? | Notes |
|---|---|---|
| Event name (semantic) | ❌ | No named events; only log messages |
| Timestamp | ⚠️ | Azure Monitor auto-injects timestamps on logs; not present in custom events |
| Correlation ID / Trace ID | ⚠️ | OpenTelemetry propagates W3C trace context via Azure Monitor; not explicitly set in business events |
| Session ID | ❌ | Session ID exists as a method parameter but is never added to logs or spans as a structured field |
| User/Request ID | ❌ | No user identity or request ID is tagged on any event |
| Error code / message on failure | ❌ | `ChatService.GetResponseAsync` has no try/catch; errors are not caught or reported |
| Duration / latency | ❌ | No stopwatch or histogram records LLM call latency, search latency, or embedding latency |
| Token usage | ❌ | `ChatMessageContent.InnerContent` contains usage data but is never read or logged |
| Search result count | ❌ | The number of documents returned by RAG retrieval is not recorded |

**Notable risk:** `ChatService.cs:55` logs `resp` (the full LLM response) and `ProductDataPlugin.cs:31` logs the entire product JSON. If responses contain user data or product pricing, this constitutes logging of potentially sensitive business data.

---

## 6. Metric Coverage Assessment

No custom metrics (`Meter`, `Counter`, `Histogram`, `ObservableGauge`) are defined anywhere in the ChatAPI. The application has zero business-level metrics.

| Metric | Status | Impact |
|---|---|---|
| Chat requests per second | ❌ Missing | Cannot detect traffic spikes or rate-limit thresholds |
| Chat errors (count by error type) | ❌ Missing | No alerting on LLM or search failures |
| LLM call latency histogram | ❌ Missing | Cannot detect degraded AI response times |
| RAG search latency histogram | ❌ Missing | Cannot detect slow Azure AI Search queries |
| Embedding generation latency | ❌ Missing | Cannot detect embedding model latency |
| Token usage counter (prompt/completion) | ❌ Missing | Cannot monitor AI cost |
| Session creation rate | ❌ Missing | No session volume visibility |
| CosmosDB write errors | ❌ Missing | Persistence failures are silent |

Azure Monitor's `UseAzureMonitor()` will capture HTTP request duration and dependency call counts automatically, but these are infrastructure-level metrics, not business metrics.

---

## 7. Semantic Kernel Telemetry Assessment

Semantic Kernel 1.31.0 is used. SK provides built-in OpenTelemetry support via the `Microsoft.SemanticKernel.Diagnostics` namespace, but this project does not configure it.

| SK Telemetry Concern | Status | Notes |
|---|---|---|
| SK diagnostic source enabled | ❌ Not configured | `SEMANTICKERNEL_EXPERIMENTAL_GENAI_ENABLE_OTEL_DIAGNOSTICS` env var not set; no explicit OTel source registration for SK |
| Prompt token count tracking | ❌ Missing | `response.Metadata` / `OpenAIChatMessageContent` contains `CompletionUsage` — never read |
| Completion token count tracking | ❌ Missing | Same as above |
| Total token cost tracking | ❌ Missing | No derived cost metric |
| Model name / deployment recorded | ❌ Missing | `AZURE_OPENAI_DEPLOYMENT` is configured but never tagged on events |
| Function invocation telemetry | ❌ Missing | `FunctionChoiceBehavior.Auto()` is used but SK function invocations emit no custom span |
| Function invocation failure event | ⚠️ Partial | `AISearchDataPlugin` catches exceptions but uses `LogInformation` instead of `LogError` and re-throws without telemetry |
| FunctionChoiceBehavior auto-invocation logging | ❌ Missing | Which functions were auto-invoked by the planner is not recorded |

SK's built-in telemetry hooks would emit spans for chat completion and function calls if the diagnostic source were registered with the OpenTelemetry pipeline. Currently, none of these spans appear in Application Insights.

---

## 8. Findings

| Severity | ID | Title | File |
|---|---|---|---|
| 🔴 Critical | ATEL-COVER-001 | No telemetry on the chat completion flow | `Services/ChatService.cs` |
| 🔴 Critical | ATEL-COVER-002 | LLM response content logged in full — potential PII exposure | `Services/ChatService.cs`, `Controllers/ChatController.cs` |
| 🔴 Critical | ATEL-SKTEL-001 | Semantic Kernel OTel diagnostics not enabled — no AI operation spans | `Program.cs` |
| 🟡 Notable | ATEL-COVER-003 | No telemetry on session creation | `Controllers/SessionController.cs` |
| 🟡 Notable | ATEL-COVER-004 | No telemetry on RAG search execution | `Data/AISearchData.cs`, `Plugins/AISearchDataPlugin.cs` |
| 🟡 Notable | ATEL-SKTEL-002 | Token usage not tracked — no cost visibility | `Services/ChatService.cs` |
| 🟡 Notable | ATEL-EVENT-001 | Error paths use `LogInformation` instead of `LogError` | `Plugins/AISearchDataPlugin.cs` |
| 🟡 Notable | ATEL-COVER-005 | No error telemetry on chat completion failure path | `Services/ChatService.cs` |
| 🟢 Minor | ATEL-EVENT-002 | Session ID never tagged on logs or spans | `Services/ChatService.cs` |
| 🟢 Minor | ATEL-METRIC-001 | No custom business metrics defined | Entire project |
| 🟢 Minor | ATEL-EVENT-003 | Log messages use vague, non-semantic names | Multiple files |
| ℹ️ Info | ATEL-INFRA-001 | Azure Monitor / OpenTelemetry infrastructure is correctly wired | `Program.cs` |
| ℹ️ Info | ATEL-SKTEL-003 | SK version 1.31.0 supports OTel diagnostics via experimental feature flag | `ChatAPI.csproj` |

---

### Finding Details

#### ATEL-COVER-001 — 🔴 Critical: No telemetry on the chat completion flow

`ChatService.GetResponseAsync` is the core business operation. It has no try/catch, no span creation, no timing, and no structured event for either success or failure. If the LLM call hangs, fails, or returns an error, there is no business-level record of the failure. The chat flow is effectively a black box beyond infrastructure HTTP traces.

**Recommendation:** Wrap the LLM call in a try/catch. Emit a structured event on both success (with latency, token counts, session ID) and failure (with error type, message, session ID). Add an `ActivitySource` span scoped to the full `GetResponseAsync` method.
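A minimal sketch of this shape, with the LLM call abstracted as a delegate so the fragment stands alone (the `ActivitySource` name, tag keys, and event names below are suggested conventions, not existing code):

```csharp
using System.Diagnostics;
using Microsoft.Extensions.Logging;

// Sketch only — "ChatAPI.ChatService" and the tag/event names are suggestions.
public sealed class ChatServiceSketch
{
    private static readonly ActivitySource Source = new("ChatAPI.ChatService");
    private readonly ILogger<ChatServiceSketch> _logger;

    public ChatServiceSketch(ILogger<ChatServiceSketch> logger) => _logger = logger;

    // callLlm stands in for the IChatCompletionService call in the real method.
    public async Task<string> GetResponseAsync(string sessionId, Func<Task<string>> callLlm)
    {
        using var activity = Source.StartActivity("chat.completion");
        activity?.SetTag("session.id", sessionId);
        var sw = Stopwatch.StartNew();
        try
        {
            string response = await callLlm();
            _logger.LogInformation(
                "chat.response.completed — session: {SessionId}, latencyMs: {LatencyMs}, lengthChars: {Length}",
                sessionId, sw.ElapsedMilliseconds, response.Length);
            return response;
        }
        catch (Exception ex)
        {
            activity?.SetStatus(ActivityStatusCode.Error, ex.GetType().Name);
            _logger.LogError(ex,
                "chat.completion.failed — session: {SessionId}, errorType: {ErrorType}",
                sessionId, ex.GetType().Name);
            throw;
        }
    }
}
```

The span appears in Application Insights only if the source name is registered in the OTel pipeline (e.g. `.WithTracing(t => t.AddSource("ChatAPI.ChatService"))`).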

---

#### ATEL-COVER-002 — 🔴 Critical: LLM response content logged in full — potential PII exposure

`ChatService.cs:55` logs `"Response {response}"` containing the full LLM assistant reply. `ChatController.cs:26` logs `"Result: {Result}"` with the same serialized JSON. If a user asks about personal information, financial data, or healthcare topics, that content is written verbatim to Application Insights logs. This is a PII risk in production and violates most data residency and privacy policies.

**Recommendation:** Remove or truncate log statements that emit full response content. Log only metadata: response length in characters, session ID, request ID, and whether the response contained a function call. Never log the free-text response body.
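For example, the statement at `ChatService.cs:55` could be replaced with a metadata-only log. The field names are suggestions, and `hadFunctionCall` is a hypothetical flag the service would need to compute:

```csharp
// Instead of:
//   _logger.LogInformation("Response {response}", resp);
// log only metadata about the response — never the free-text body:
_logger.LogInformation(
    "chat.response.completed — session: {SessionId}, lengthChars: {Length}, hadFunctionCall: {HadFunctionCall}",
    sessionId, resp.Content?.Length ?? 0, hadFunctionCall);
```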

---

#### ATEL-SKTEL-001 — 🔴 Critical: Semantic Kernel OTel diagnostics not enabled

SK 1.31.0 ships with built-in OpenTelemetry support for AI operations, but it requires explicit activation. The project does not set `SEMANTICKERNEL_EXPERIMENTAL_GENAI_ENABLE_OTEL_DIAGNOSTICS=true` and does not register the SK `ActivitySource` with the OTel pipeline. As a result, no AI spans (chat completion, function call, embedding) appear in Application Insights, even though the OTel pipeline is otherwise active. This means the most valuable telemetry — LLM calls with model, token counts, and latency — is entirely missing.

**Recommendation:** In `Program.cs`, configure the SK diagnostic source:

```csharp
builder.Services.AddOpenTelemetry()
    .UseAzureMonitor(...)
    .WithTracing(tracing => tracing
        .AddSource("Microsoft.SemanticKernel*"));
```

Also enable the experimental GenAI diagnostics switch. Note that Semantic Kernel reads this from an environment variable or an `AppContext` switch, not from `appsettings.json`:

```csharp
// In Program.cs, before the kernel is built:
AppContext.SetSwitch("Microsoft.SemanticKernel.Experimental.GenAI.EnableOTelDiagnostics", true);
```

Alternatively, set `SEMANTICKERNEL_EXPERIMENTAL_GENAI_ENABLE_OTEL_DIAGNOSTICS=true` in the hosting environment.

---

#### ATEL-COVER-003 — 🟡 Notable: No telemetry on session creation

`SessionController.GetSession` generates a session ID and returns it. No event is emitted. There is no way to know how many sessions are created per day, detect anomalous session creation rates, or correlate a session ID with downstream chat activity.

**Recommendation:** Emit a `session.created` event with the session ID (or a hash of it) and a timestamp.
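A sketch of what this could look like in `SessionController.GetSession` (the event name follows the suggested convention; hashing is one way to keep raw session IDs out of log storage):

```csharp
using System.Security.Cryptography;
using System.Text;

// Sketch — "session.created" and "sessionIdHash" are suggested names, not existing code.
var sessionId = Guid.NewGuid().ToString();
var sessionIdHash = Convert.ToHexString(
    SHA256.HashData(Encoding.UTF8.GetBytes(sessionId)))[..16];
_logger.LogInformation("session.created — sessionIdHash: {SessionIdHash}", sessionIdHash);
```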

---

#### ATEL-COVER-004 — 🟡 Notable: No telemetry on RAG search execution

`AISearchData.RetrieveDocumentationAsync` executes a hybrid semantic + vector search but emits nothing. The number of results returned, the search latency, and any failure conditions are not recorded. `AISearchDataPlugin.ResourceLookup` catches exceptions but logs at `LogInformation` severity and re-throws without any structured event.

**Recommendation:** Emit a `search.query.executed` event after each search, including: query (hashed or truncated), result count, search latency, and model/index used.
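A sketch of the wrapping code, assuming the search call returns a materialized result list and that an `_indexName` field is available (both are assumptions for illustration):

```csharp
using System.Diagnostics;

// Sketch — event and field names are suggestions, not existing code.
var sw = Stopwatch.StartNew();
var results = await _searchData.RetrieveDocumentationAsync(question);
_logger.LogInformation(
    "search.query.executed — resultCount: {ResultCount}, latencyMs: {LatencyMs}, index: {Index}",
    results.Count, sw.ElapsedMilliseconds, _indexName);
```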

---

#### ATEL-SKTEL-002 — 🟡 Notable: Token usage not tracked

`IChatCompletionService.GetChatMessageContentAsync` returns a `ChatMessageContent` whose `InnerContent` (when cast to `OpenAI.Chat.ChatCompletion`) exposes `Usage.InputTokenCount` and `Usage.OutputTokenCount`. This data is never read. Without token tracking, there is no visibility into AI API cost, no ability to alert on runaway token consumption, and no baseline for cost optimization.

**Recommendation:** After the LLM call in `ChatService.GetResponseAsync`, extract and log token usage:

```csharp
if (response.InnerContent is OpenAI.Chat.ChatCompletion completion)
{
    _logger.LogInformation("Token usage — prompt: {PromptTokens}, completion: {CompletionTokens}, session: {SessionId}",
        completion.Usage.InputTokenCount,
        completion.Usage.OutputTokenCount,
        sessionId);
}
```

For full cost tracking, add these as custom metrics using `System.Diagnostics.Metrics.Counter<long>`.
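A sketch of the token counters (the meter and instrument names are suggestions, not existing code):

```csharp
using System.Diagnostics.Metrics;

// Sketch — "ChatAPI" and the instrument names are suggested conventions.
private static readonly Meter Meter = new("ChatAPI");
private static readonly Counter<long> PromptTokens =
    Meter.CreateCounter<long>("chat.tokens.prompt", unit: "tokens");
private static readonly Counter<long> CompletionTokens =
    Meter.CreateCounter<long>("chat.tokens.completion", unit: "tokens");

// After extracting usage from InnerContent (as in the snippet above):
// PromptTokens.Add(completion.Usage.InputTokenCount);
// CompletionTokens.Add(completion.Usage.OutputTokenCount);
```

These instruments reach Application Insights only if the meter name is registered in the OTel pipeline, e.g. `.WithMetrics(m => m.AddMeter("ChatAPI"))`.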

---

#### ATEL-EVENT-001 — 🟡 Notable: Error paths use `LogInformation` instead of `LogError`

`AISearchDataPlugin.ResourceLookup` catches exceptions and calls `_logger.LogInformation(ex, ...)` before re-throwing. Using `LogInformation` for exception paths means these errors are not surfaced in Application Insights as errors or failures — they appear as normal informational events. This suppresses alerts and error-rate dashboards.

**Recommendation:** Change all exception-path log calls from `LogInformation` to `LogError`. Apply the same fix to `ProductDataPlugin`, which likewise uses `LogInformation` for its exception path (line 38) and should also use `LogError`.

---

#### ATEL-COVER-005 — 🟡 Notable: No error telemetry on chat completion failure path

`ChatService.GetResponseAsync` has no try/catch block. If `chatCompletion.GetChatMessageContentAsync` throws (network failure, rate limit, content filter rejection), the exception propagates unhandled to the ASP.NET middleware. Azure Monitor will capture the HTTP 500 response, but no business-level event is emitted recording what was being processed, which session was affected, or what the error type was.

**Recommendation:** Add a try/catch in `GetResponseAsync`. On failure, log a structured error event including session ID, error type, and error message. Do not log the user's question text if it may contain PII.

---

#### ATEL-EVENT-002 — 🟢 Minor: Session ID never tagged on logs or spans

`sessionId` is passed to `ChatService.GetResponseAsync` but is never added to the ambient logging scope or to any OTel span attribute. All log statements in `ChatService` and downstream components are disconnected from the session context, making it impossible to correlate log lines for a single conversation.

**Recommendation:** Add the session ID to a scoped logging context at the start of `GetResponseAsync`:

```csharp
using (_logger.BeginScope(new Dictionary<string, object> { ["SessionId"] = sessionId }))
{
    // all logs within this block will carry SessionId
}
```

---

#### ATEL-METRIC-001 — 🟢 Minor: No custom business metrics defined

No `Meter`, `Counter`, or `Histogram` is defined anywhere. Infrastructure metrics (HTTP requests, dependency durations) are captured by Azure Monitor, but no business metrics exist. The absence of metrics makes it impossible to build dashboards or alerts for chat volume, error rates, search hit rates, or token consumption.

**Recommendation:** Define a `ChatMetrics` class using `System.Diagnostics.Metrics.Meter` with counters for chat requests, errors, and histograms for LLM call latency and token usage. Register it as a singleton and inject it into `ChatService`.
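A sketch of the suggested `ChatMetrics` class, using the .NET 8 `IMeterFactory` abstraction (all names are illustrative, not existing code):

```csharp
using System.Diagnostics.Metrics;

// Sketch — meter name, instrument names, and method shapes are suggestions.
public sealed class ChatMetrics
{
    private readonly Counter<long> _chatRequests;
    private readonly Counter<long> _chatErrors;
    private readonly Histogram<double> _llmLatencyMs;

    public ChatMetrics(IMeterFactory meterFactory)
    {
        var meter = meterFactory.Create("ChatAPI");
        _chatRequests = meter.CreateCounter<long>("chat.requests");
        _chatErrors   = meter.CreateCounter<long>("chat.errors");
        _llmLatencyMs = meter.CreateHistogram<double>("chat.llm.duration", unit: "ms");
    }

    public void RecordRequest() => _chatRequests.Add(1);

    public void RecordError(string errorType) =>
        _chatErrors.Add(1, new KeyValuePair<string, object?>("error.type", errorType));

    public void RecordLlmLatency(double elapsedMs) => _llmLatencyMs.Record(elapsedMs);
}
```

Register it with `builder.Services.AddSingleton<ChatMetrics>();` and add the meter to the OTel pipeline with `.WithMetrics(m => m.AddMeter("ChatAPI"))`.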

---

#### ATEL-EVENT-003 — 🟢 Minor: Log messages use vague, non-semantic names

Existing log messages (`"Chat History Count {count}"`, `"Init Chat History"`, `"Response {response}"`) are diagnostic in nature and do not follow a semantic naming convention. They cannot be used as structured business events without significant reformulation. No message follows the recommended verb-noun past-tense pattern.

**Recommendation:** Establish a naming convention for log messages that will also serve as event names. At minimum: `"chat.history.initialized"`, `"chat.response.completed"`, `"product.lookup.completed"` with consistent structured properties.
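A before/after sketch for one of the existing statements (the event name and properties are suggested conventions):

```csharp
// Before (ChatService.cs:36):
//   _logger.LogInformation("Init Chat History");
// After — semantic event name plus structured context:
_logger.LogInformation(
    "chat.history.initialized — session: {SessionId}, messageCount: {MessageCount}",
    sessionId, chatHistory.Count);
```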

---

#### ATEL-INFRA-001 — ℹ️ Info: Azure Monitor / OpenTelemetry infrastructure correctly wired

`Program.cs` registers `Azure.Monitor.OpenTelemetry.AspNetCore` via `UseAzureMonitor()` with `DefaultAzureCredential`. This provides automatic collection of HTTP request traces, Azure SDK dependency spans (CosmosDB, AI Search), and exception telemetry for unhandled exceptions. The foundation is sound; the gap is entirely at the business event layer.

---

#### ATEL-SKTEL-003 — ℹ️ Info: SK 1.31.0 supports OTel diagnostics via experimental feature flag

Semantic Kernel 1.31.0 includes built-in OpenTelemetry support under the `Microsoft.SemanticKernel.Diagnostics` activity source. Enabling it requires setting `SEMANTICKERNEL_EXPERIMENTAL_GENAI_ENABLE_OTEL_DIAGNOSTICS=true` and registering the source with the OTel pipeline. This would automatically emit spans for chat completion, function calls, and prompt rendering — providing significant value at low implementation cost.

---

## 9. Recommended Telemetry Additions

| Finding | Recommended Action | Priority |
|---|---|---|
| ATEL-SKTEL-001 | Register `Microsoft.SemanticKernel*` activity source in OTel pipeline; set SK diagnostic env var | 🔴 Critical |
| ATEL-COVER-001 | Add try/catch + `ActivitySource` span to `ChatService.GetResponseAsync` with success/failure events | 🔴 Critical |
| ATEL-COVER-002 | Remove full LLM response content from logs; log only metadata (length, session, function call flag) | 🔴 Critical |
| ATEL-SKTEL-002 | Extract and log token usage from `ChatMessageContent.InnerContent` after each completion | 🟡 Notable |
| ATEL-COVER-005 | Add try/catch to `ChatService.GetResponseAsync`; emit structured error event on failure | 🟡 Notable |
| ATEL-EVENT-001 | Change exception-path `LogInformation` calls to `LogError` in both plugin classes | 🟡 Notable |
| ATEL-COVER-003 | Emit `session.created` event (with hashed session ID) in `SessionController` | 🟡 Notable |
| ATEL-COVER-004 | Emit `search.query.executed` event in `AISearchDataPlugin.ResourceLookup` with result count and latency | 🟡 Notable |
| ATEL-EVENT-002 | Use `_logger.BeginScope` to attach `SessionId` to all log statements in `ChatService` | 🟢 Minor |
| ATEL-METRIC-001 | Define `ChatMetrics` class with `Meter`-based counters and histograms for core operations | 🟢 Minor |
| ATEL-EVENT-003 | Standardize log message format to semantic verb-noun past-tense pattern | 🟢 Minor |

---

## Footer

> This review is based on static source code analysis only. It reflects the state of the codebase as of the run date (2026-03-21) and does not incorporate runtime behavior, Azure Monitor configuration, or Application Insights dashboards that may exist outside the repository. Findings represent best-practice gaps, not confirmed runtime failures.
>
> Steward: API Telemetry Steward (`api-telemetry-steward.md`) | PREFIX: `ATEL`
