[← Back to Reviews Index](../Stewards%20Reviews%20Index.md)

# CosmosDB Review — Azure-AI-RAG-CSharp-Semantic-Kernel-Functions

| Field | Value |
|---|---|
| **Project** | Azure-AI-RAG-CSharp-Semantic-Kernel-Functions |
| **Review Date** | 2026-03-21 |
| **Steward** | CosmosDB Steward |
| **SDK Version** | Microsoft.Azure.Cosmos 3.39.1 |
| **Critical** | 2 |
| **Notable** | 4 |
| **Minor** | 3 |
| **Info** | 2 |
| **Total** | 11 |

---

## 1. Cosmos DB Usage Summary

The project uses Azure Cosmos DB in two contexts:

**C# ChatAPI (`src/ChatAPI`)**

| Container | Purpose | Partition Key | Primary Access Pattern |
|---|---|---|---|
| `CosmosDb_ProductContainer` | Azure product catalog | `/id` (set at creation) | Point read by `id`, query by `name` |
| `CosmosDb_ChatContainer` | Chat message history | `/id` (inferred from `new PartitionKey(chatMessage.id)`) | Query by `sessionid` |

**Python DocumentLoaderFunction (`src/DocumentLoaderFunction`)**

No Cosmos DB usage detected. The Python Azure Function exclusively uses Azure AI Search and Blob Storage. It is outside the scope of this CosmosDB audit.

**Files audited:**

- `src/ChatAPI/Program.cs` — DI registration, client construction
- `src/ChatAPI/Data/ChatHistoryData.cs` — chat history container access
- `src/ChatAPI/Data/ProductData.cs` — product container access
- `src/ChatAPI/Data/sample_data/products/GenerateProductInfo.cs` — startup data seeding

---

## 2. Client Initialization Assessment

**Positive:** `CosmosClient` is registered as a `Singleton` in `Program.cs` (line 23), which is the correct pattern. It is reused across all data classes via constructor injection.

**Issue:** The client is constructed with only a connection string and no `CosmosClientOptions`:

```csharp
builder.Services.AddSingleton(serviceProvider =>
    new CosmosClient(builder.Configuration["CosmosDb_ConnectionString"]));
```

This means:
- No explicit `ConnectionMode` is set (defaults to `Gateway`; `Direct` mode is preferred for throughput in production).
- No `CosmosSerializationOptions` — property naming is left at the SDK default (PascalCase), which can cause mismatches with camelCase JSON stored in Cosmos.
- No `MaxRetryAttemptsOnRateLimitedRequests` or `MaxRetryWaitTimeOnRateLimitedRequests` configured — the SDK will use its own defaults (9 retries, 30 seconds), which may be too aggressive for a chat workload.
- No `RequestTimeout` override.

---

## 3. Partition Key Design Assessment

### Products container

The partition key is `/id` (set in `GenerateProductInfo.cs` line 37). Product reads by `id` (`ReadItemAsync`) correctly supply the partition key, making them efficient single-partition point reads.

**Critical issue:** `GetProductByNameAsync` queries by `c.name` without including the partition key in the query or in `QueryRequestOptions`. Because `/id` is the partition key and the `WHERE` clause filters on `c.name`, this is a **full cross-partition fan-out** on every call:

```csharp
var query = new QueryDefinition("SELECT * FROM c WHERE c.name = @productName")
    .WithParameter("@productName", productName);
var iterator = container.GetItemQueryIterator<Dictionary<string, object>>(query);
```

No `QueryRequestOptions` with `PartitionKey` is passed, so the SDK issues a cross-partition query.

### Chat history container

The partition key used at write time is `chatMessage.id` (a new `Guid` per message):

```csharp
await container.CreateItemAsync(chatMessage, new PartitionKey(chatMessage.id));
```

However, retrieval queries by `sessionid`:

```csharp
new QueryDefinition("SELECT * FROM c WHERE c.sessionid = @sessionId")
```

This means every `GetMessagesBySessionIdAsync` call is a **cross-partition query** — the partition key is `id` but the query filters on `sessionid`. As the chat history grows, this becomes a full fan-out across all logical partitions. The natural partition key for this access pattern is `/sessionid`, not `/id`.

**Note:** The chat history container is provisioned implicitly (not created in code by this project — it must be pre-created or rely on another path). This also means the partition key for the chat container cannot be verified from this codebase directly. However, the write pattern `new PartitionKey(chatMessage.id)` strongly implies the container is keyed on `/id`, cementing the mismatch.

---

## 4. Query Pattern Assessment

### `SELECT *` usage

Both `GetProductByNameAsync` and `GetMessagesBySessionIdAsync` use `SELECT *`:

```sql
SELECT * FROM c WHERE c.name = @productName
SELECT * FROM c WHERE c.sessionid = @sessionId
```

For the products container this pulls back the full document for every result, including potentially large `description` fields, when in practice the caller only needs to return a single matched product. For the chat history container it loads all fields for every message in the session.

### Missing `QueryRequestOptions` / `MaxItemCount`

Neither query sets `QueryRequestOptions.MaxItemCount`. Without this, the SDK decides how many items to retrieve per page, which may cause large result sets to be pulled into memory in a single round trip.

### Startup seeding: sequential individual creates

`GenerateProductInfo.PopulateCosmosAsync()` seeds the product catalog in a `foreach` loop with individual `CreateItemStreamAsync` calls (line 73). This is sequential and does not use `TransactionalBatch` or `BulkExecutionOptions`. For the small sample data set this is acceptable, but the pattern would not scale.

### Null return from `GetProductByNameAsync`

The method returns `null` when no product is found rather than throwing or returning an `Optional`/`Result` type. The caller (`ProductDataPlugin`) must null-check this return value; if it does not, a `NullReferenceException` will propagate. This is a correctness concern adjacent to the query design.

---

## 5. Error Handling Assessment

### Products container

`GetProductAsync` catches `CosmosException` for `NotFound` correctly:

```csharp
catch (CosmosException ex) when (ex.StatusCode == System.Net.HttpStatusCode.NotFound)
```

However, it re-throws immediately after logging. Any other `CosmosException` (including `429 TooManyRequests`) is not caught — it will propagate as an unhandled exception to the caller with no retry logic at the application layer.

`GetProductByNameAsync` has the same structure: only `NotFound` is caught; 429 and other status codes propagate. The catch block is also semantically wrong — it catches `NotFound` but the condition can never be true for a cross-partition query that returns zero results (the SDK returns an empty result set, not a 404).

### Chat history container

`AddUserMessageAsync` and `AddAssistantMessageAsync` have **no error handling at all**. Any `CosmosException` — including 429 throttling — will propagate unhandled to the request pipeline. A transient 429 will cause the HTTP request to fail rather than being retried.

`GetMessagesBySessionIdAsync` also has no error handling. A failure during session initialization silently corrupts the chat context.

### Startup seeding

`GenerateProductInfo.PopulateCosmosAsync()` wraps everything in a broad `catch (Exception ex)` with only a log. This suppresses seeding failures silently, and the application continues to serve requests against an empty or partially populated database.

### No 429 / retry policy at the application layer

No code path implements an explicit `RetryAfter` delay for 429 responses. The application relies entirely on the SDK's built-in retry policy (which is not configured — see §2). If the SDK exhausts its retries, the exception propagates with no application-level fallback.

---

## 6. Consistency Level Assessment

No explicit consistency level is set anywhere in the codebase. The `CosmosClient` is constructed without `CosmosClientOptions`, so the account-level default applies. For a chat history + product catalog workload, `Session` consistency is appropriate and is the SDK default. There is no evidence of overriding to `Strong` (which would be costly) or `Eventual` (which would be unsafe for chat ordering).

This is acceptable for the use case, but it is worth documenting the intentional reliance on the account default rather than making it implicit.

---

## 7. SDK Configuration Assessment

| Configuration Item | Status |
|---|---|
| `CosmosClientOptions` provided | No — not configured |
| `ConnectionMode` | Not set (defaults to Gateway) |
| `CosmosSerializationOptions` (camelCase) | Not set — SDK defaults to PascalCase |
| `MaxRetryAttemptsOnRateLimitedRequests` | Not set (SDK default: 9) |
| `MaxRetryWaitTimeOnRateLimitedRequests` | Not set (SDK default: 30 s) |
| `RequestTimeout` | Not set |
| Consistency level | Not set (account default) |
| `EnableContentResponseOnWrite` | Not set (returns full document on write — unnecessary RU charge) |

Notably, the `ChatMessage` model uses lowercase property names (`id`, `sessionid`, `role`) while the `Product` model class (`Product.cs`) uses PascalCase (`Id`, `Name`). The `GenerateProductInfo` class uses a source-generated serializer (`ProductJsonContext`), sidestepping the Cosmos SDK serializer for writes. For reads via `ReadItemAsync<Dictionary<string, object>>`, the SDK's default serializer is used. The inconsistency in serialization paths is a maintainability risk.

---

## 8. Findings

| Severity | ID | Title | File |
|---|---|---|---|
| 🔴 | CSDB-PARTKEY-001 | Chat history partitioned by `/id` but queried by `sessionid` — full cross-partition scan on every session load | `src/ChatAPI/Data/ChatHistoryData.cs` |
| 🔴 | CSDB-QUERY-001 | `GetProductByNameAsync` issues cross-partition fan-out query with no partition key filter | `src/ChatAPI/Data/ProductData.cs` |
| 🟡 | CSDB-ERROR-001 | Chat history write methods have no error handling — 429 and other Cosmos errors propagate unhandled | `src/ChatAPI/Data/ChatHistoryData.cs` |
| 🟡 | CSDB-ERROR-002 | No 429 retry logic at application layer; SDK retry policy is not configured | `src/ChatAPI/Program.cs` |
| 🟡 | CSDB-QUERY-002 | `SELECT *` used in both containers — fetches full documents when only specific fields are needed | `src/ChatAPI/Data/ChatHistoryData.cs`, `src/ChatAPI/Data/ProductData.cs` |
| 🟡 | CSDB-ERROR-003 | `GetProductByNameAsync` returns `null` on no-match; callers must null-check or risk `NullReferenceException` | `src/ChatAPI/Data/ProductData.cs` |
| 🟢 | CSDB-CONFIG-001 | No `CosmosClientOptions` configured — connection mode, retry policy, and serialization options are all at defaults | `src/ChatAPI/Program.cs` |
| 🟢 | CSDB-CONFIG-002 | `EnableContentResponseOnWrite` not set to `false` — write operations return full document payloads, consuming unnecessary RUs | `src/ChatAPI/Program.cs` |
| 🟢 | CSDB-TTL-001 | Chat history container has no TTL configured — old session messages accumulate indefinitely with no automatic expiry | `src/ChatAPI/Data/ChatHistoryData.cs` |
| ℹ️ | CSDB-CLIENT-001 | `CosmosClient` correctly registered as `Singleton` — connection pool is properly shared | `src/ChatAPI/Program.cs` |
| ℹ️ | CSDB-SEED-001 | Startup seeding catches all exceptions and logs them, preventing application startup failures — but silently masks seeding errors that leave the container empty | `src/ChatAPI/Data/sample_data/products/GenerateProductInfo.cs` |

---

## 9. Recommended Improvements

| Finding | Recommended Action | Priority |
|---|---|---|
| CSDB-PARTKEY-001 | Change the chat history container partition key from `/id` to `/sessionid`. Update write calls to use `new PartitionKey(chatMessage.sessionid)` and supply `QueryRequestOptions` with `PartitionKey` on reads. | High |
| CSDB-QUERY-001 | If `/id` must remain the partition key for products, maintain a secondary index or cache for name-to-id lookup. Alternatively, change the partition key to `/name` if name-based access is the primary pattern. Pass `QueryRequestOptions` with `EnableCrossPartitionQuery = false` to surface the problem clearly in development. | High |
| CSDB-ERROR-001 | Wrap `CreateItemAsync` calls in `AddUserMessageAsync` and `AddAssistantMessageAsync` with `try/catch (CosmosException)`. At minimum log the failure and consider whether a failed write should fail the request or be tolerated as a best-effort history entry. | Medium |
| CSDB-ERROR-002 | Configure `CosmosClientOptions` with explicit `MaxRetryAttemptsOnRateLimitedRequests` (e.g., 3) and `MaxRetryWaitTimeOnRateLimitedRequests` (e.g., 10 s). Consider a Polly retry policy at the service layer for 429 responses on critical paths. | Medium |
| CSDB-QUERY-002 | Replace `SELECT *` with projected fields: for chat history use `SELECT c.id, c.sessionid, c.message, c.role, c.Timestamp`; for product name lookup use `SELECT c.id, c.name, c.description`. | Medium |
| CSDB-ERROR-003 | Change `GetProductByNameAsync` to return `Dictionary<string, object>?` and annotate with `#nullable enable`. Update callers to handle the null case explicitly. Alternatively return a `Result<T>` wrapper. | Medium |
| CSDB-CONFIG-001 | Construct `CosmosClient` with `CosmosClientOptions` specifying `ConnectionMode.Direct`, `CosmosSerializationOptions` with `PropertyNamingPolicy = CosmosPropertyNamingPolicy.CamelCase`, and explicit retry settings. | Low |
| CSDB-CONFIG-002 | Set `CosmosClientOptions.EnableContentResponseOnWrite = false` to avoid returning full document bodies on upserts and creates, reducing RU consumption and payload size. | Low |
| CSDB-TTL-001 | Enable TTL on the chat history container (e.g., 30 days). Set `DefaultTimeToLive` on the container and optionally set `ttl` per document for finer control. | Low |
| CSDB-CLIENT-001 | No action required. | — |
| CSDB-SEED-001 | Consider propagating seeding failures (or at minimum re-throwing after logging) so that a misconfigured deployment fails fast rather than serving requests against an empty product catalog. | Low |

---

## Cross-Cutting Observations

The `ChatHistory` object (Semantic Kernel `ChatHistory`) is registered as a **Singleton** in `Program.cs` (line 30), meaning all concurrent requests share a single in-memory chat history instance. Combined with the Cosmos-persisted history, this creates a race condition where concurrent sessions will corrupt each other's in-memory history. This is a significant architectural issue, but it is a general DI/service-lifetime concern rather than a Cosmos DB concern specifically — it should be flagged by the Interface or DI steward.

---

## Footer

This review is based on static analysis of source code and configuration files at the time of the audit run (2026-03-21). It does not reflect runtime behavior, actual RU consumption, or Cosmos DB account configuration (throughput mode, indexing policy, replication regions). Findings should be validated against the live environment before remediation.

**Steward:** CosmosDB Steward (`cosmosdb-steward`) | **Prefix:** CSDB
