Can a .NET endpoint handle a million requests per second?

There is a trap in this question.
When someone asks whether a .NET endpoint can handle a million requests per second, the instinct is to jump straight into Minimal APIs, Kestrel tuning, JSON serialisation, async code and benchmarks. Those things are useful, but they're not the real answer.
A million requests per second is rarely an endpoint problem. It's a system design problem. The endpoint is only the front door. Behind it you have load balancers, TLS termination, network limits, CPU, memory allocation, and the list goes on.
So the real question is: what kind of endpoint are we talking about, and what work does each request force the system to do? That's where the answer changes completely.
A million requests per second is not one thing
There are three very different versions of this target. A benchmark endpoint is the simplest case. It receives a request and returns a tiny response. It does not authenticate the caller, touch a database, call another service, or run business rules. It is useful for proving the raw HTTP stack can move traffic, but it tells you very little about the production system.
A cached read endpoint is more realistic. It might return a feature flag, a pricing value, a public product summary, a lookup list, or a configuration document. If the response is served from an edge cache, memory cache or Redis, the API can stay fast because most requests avoid the database.
A write endpoint is different. If every request creates an order, starts a payment, uploads a claim, writes an audit trail, updates relational tables and publishes integration events, you are no longer benchmarking ASP.NET Core. You are benchmarking the slowest shared dependency in the system. Most of the time that will be the database, the message broker, the network, or an external service.
This distinction is important because a million requests per second means this:
1,000,000 requests per second
60,000,000 requests per minute
3,600,000,000 requests per hour
86,400,000,000 requests per day
If each request writes one row, you are designing for 86.4 billion rows per day. That is not a controller problem.
Start with the capacity model
Before writing code, define the unit of work. For a simple read endpoint, the question is how many requests each API instance can serve when the response is already available in memory or a nearby cache. For a write endpoint, the question is how much durable ingestion capacity the system has, how quickly workers can process the backlog, how the data is partitioned, and how the system behaves when downstream services slow down.
A reasonable first model:
Target throughput: 1,000,000 RPS
Expected API instance throughput: 10,000 RPS
Required API instances: 100
Headroom target: 40 percent
Operational target: 140 API instances
That's a simple model, but it is already more realistic than imagining one huge server doing all the work. The real capacity model needs to include latency targets too. One million RPS with terrible latency is not success. For a public API, you care about p50, p95, p99 and error rate. The average does not tell you enough. At high scale, the tail becomes the product.
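The arithmetic behind that first model can be written down as a small helper, which makes it easy to re-run when a load test gives you a different per-instance number. This is only a sketch: `RequiredInstances` is a hypothetical helper, and the inputs are the planning assumptions from the model above, not measurements.

```csharp
using System;

// Hypothetical helper: instances needed for a target rate, plus headroom.
// decimal is used so that 40 percent headroom on 100 instances is exactly 140.
static int RequiredInstances(decimal targetRps, decimal perInstanceRps, decimal headroom)
{
    // Instances needed at the assumed per-instance rate, rounded up.
    var minimum = Math.Ceiling(targetRps / perInstanceRps);

    // Headroom is extra capacity on top of the minimum (0.40m = 40 percent).
    return (int)Math.Ceiling(minimum * (1 + headroom));
}

var required = RequiredInstances(1_000_000, 10_000, headroom: 0m);
var operational = RequiredInstances(1_000_000, 10_000, headroom: 0.40m);

Console.WriteLine($"Required: {required}, operational target: {operational}");
```

The per-instance figure is the input that matters most. It should come from a load test of the real endpoint, not from a benchmark of an empty handler.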
The architecture for a million RPS endpoint
The architecture depends on whether the endpoint is read-heavy or write-heavy, but the shape usually looks like this.
The important part is that the HTTP endpoint does not do unlimited work. It does the minimum safe work and then hands off the rest. For reads, it should avoid the database as much as possible. For writes, it should validate, accept, deduplicate, enqueue and return. The expensive processing happens behind the API where it can be batched, retried and scaled independently.
The endpoint should be thin
The hot path should be brutally simple. It should not contain complex middleware. It should not perform chatty database access. It should not synchronously call external services. It should not create huge objects. It should not log full payloads for every request. It should not use reflection-heavy mapping on every call. It should not do anything that scales linearly into a disaster.
A fast endpoint is usually simple.
using System.Text.Json.Serialization;

var builder = WebApplication.CreateSlimBuilder(args);

builder.WebHost.ConfigureKestrel(options =>
{
    // Do not emit the Server response header on every response.
    options.AddServerHeader = false;
});

// Put source-generated JSON metadata first in the resolver chain.
builder.Services.ConfigureHttpJsonOptions(options =>
{
    options.SerializerOptions.TypeInfoResolverChain.Insert(
        0,
        ApiJsonSerializerContext.Default);
});

builder.Services.AddSingleton<IPriceCache, PriceCache>();

var app = builder.Build();

app.MapGet("/prices/{productId:int}", async (
    int productId,
    IPriceCache cache,
    CancellationToken stopToken) =>
{
    var price = await cache.GetAsync(productId, stopToken);

    return price is null
        ? Results.NotFound()
        : Results.Ok(price);
});

app.Run();

public sealed record PriceResponse(
    int ProductId,
    decimal Amount,
    string Currency,
    DateTimeOffset LastUpdatedAt);

[JsonSerializable(typeof(PriceResponse))]
internal sealed partial class ApiJsonSerializerContext : JsonSerializerContext
{
}
This example is intentionally small. It uses Minimal APIs, CreateSlimBuilder, async I/O, explicit cancellation and source-generated JSON metadata. It doesn't mean every API should look exactly like this. It means the hot path should avoid unnecessary framework and application overhead.
Minimal APIs are a good fit for the hot path
Controllers are fine for many applications. They give you structure, filters, conventions, model binding patterns and a familiar MVC programming model. For a very high-throughput endpoint, Minimal APIs are usually the better starting point. You get a direct route handler, fewer moving pieces, less ceremony and a clearer execution path. That does not magically give you a million RPS, but it removes overhead you do not need. The real benefit is architectural discipline. Minimal APIs make it easier to see what the endpoint actually does. If the handler starts growing into validation, mapping, authorisation checks, database reads, database writes, external calls and logging, you can see the problem quickly. A hot endpoint should look small because the expensive work should live somewhere else.
Kestrel is not usually the first bottleneck
Kestrel is fast. ASP.NET Core is fast. The framework is not normally the weakest part of a real production endpoint. The bottleneck is usually one of these: database access, external service calls, excessive logging, payload size, TLS cost, network bandwidth, memory allocation, lock contention, connection pool starvation, slow clients, queue throughput, partition design, or noisy neighbours in the infrastructure.
That doesn't mean Kestrel settings are irrelevant. It means Kestrel tuning should happen after you understand the workload. For example, there's no point raising connection limits if the database connection pool is already exhausted. There's no point squeezing another 10 percent out of JSON serialisation if every request writes to one hot SQL table. There's no point scaling to 200 pods if Redis has become the shared choke point.
Read endpoints need cache-first design
A read endpoint that needs one million RPS should not treat the database as the primary read path. It should treat the database as the source of truth, then serve traffic from faster layers.
The best request is the one your API never sees, because the edge cache serves it before it reaches your infrastructure. The next best request is served directly from memory, followed by one served from Redis. The worst request is the one that reaches the primary database during peak traffic. That is not because databases are bad. It is because the database is usually the most expensive shared dependency in the request path, and once every request starts competing for the same database resources, your API performance is no longer really controlled by the API.
ASP.NET Core gives you several caching options, including in-memory caching, distributed caching, HybridCache, response caching and output caching. For a cloud or server farm deployment, distributed cache becomes important because any API instance can receive the request. Redis is a common choice because it gives lower latency and higher throughput than using SQL Server as a cache in most applications.
A simple cache-backed abstraction keeps the endpoint clean.
using Microsoft.Extensions.Caching.Hybrid;

public interface IPriceCache
{
    Task<PriceResponse?> GetAsync(
        int productId,
        CancellationToken stopToken);
}

public sealed class PriceCache : IPriceCache
{
    private readonly HybridCache _cache;
    private readonly IPriceStore _store;

    public PriceCache(
        HybridCache cache,
        IPriceStore store)
    {
        _cache = cache;
        _store = store;
    }

    public async Task<PriceResponse?> GetAsync(
        int productId,
        CancellationToken stopToken)
    {
        var cacheKey = $"price:{productId}";

        // GetOrCreateAsync returns a ValueTask<T>, so it is awaited here
        // rather than returned directly from a Task-returning method.
        return await _cache.GetOrCreateAsync(
            cacheKey,
            async token => await _store.GetAsync(productId, token),
            cancellationToken: stopToken);
    }
}
The endpoint should not care whether the response came from memory, Redis or the database. It should care that the cache abstraction has clear expiry, invalidation and failure behaviour.
Output caching can protect simple HTTP responses
For endpoints where the full HTTP response can be cached, output caching is worth considering.
var builder = WebApplication.CreateBuilder(args);

builder.Services.AddOutputCache(options =>
{
    options.AddPolicy("public-config", policy =>
    {
        policy.Expire(TimeSpan.FromSeconds(30));
        policy.SetVaryByRouteValue("tenantId");
    });
});

var app = builder.Build();

app.UseOutputCache();

app.MapGet("/config/{tenantId}", async (
    string tenantId,
    IConfigReader reader,
    CancellationToken stopToken) =>
{
    var config = await reader.GetAsync(tenantId, stopToken);

    return config is null
        ? Results.NotFound()
        : Results.Ok(config);
})
.CacheOutput("public-config");

app.Run();
This is useful for stable responses where a short amount of staleness is acceptable. It's not a magic switch for every endpoint. You need to understand cache keys, variation, authorisation, tenant boundaries and invalidation. Caching the wrong thing at this scale is not a performance problem. It is a production incident.
Write endpoints need an ingestion design
A write-heavy million RPS endpoint should usually not attempt to fully process every request synchronously. A better model is to accept the request, perform cheap validation, enforce idempotency, publish to a durable stream and return a 202 Accepted response.
This gives you three useful properties. The API stays fast because it is not trying to do all the work during the request. The queue or stream absorbs spikes, so every downstream dependency does not have to keep up instantly. The workers can then process messages in batches, which is usually far more efficient than running one database transaction for every HTTP request.
A very simple endpoint:
app.MapPost("/events", async (
    EventRequest request,
    IIdempotencyStore idempotencyStore,
    IEventPublisher publisher,
    CancellationToken stopToken) =>
{
    // Cheap validation only. Anything expensive belongs behind the broker.
    if (string.IsNullOrWhiteSpace(request.EventType))
    {
        return Results.BadRequest(new ErrorResponse("event_type_required"));
    }

    if (string.IsNullOrWhiteSpace(request.IdempotencyKey))
    {
        return Results.BadRequest(new ErrorResponse("idempotency_key_required"));
    }

    // A repeated key returns the original operation instead of publishing again.
    var existing = await idempotencyStore.TryGetAsync(
        request.IdempotencyKey,
        stopToken);

    if (existing is not null)
    {
        return Results.Accepted($"/events/status/{existing.OperationId}");
    }

    var operationId = Ulid.NewUlid().ToString();

    await publisher.PublishAsync(
        new IngestedEvent(
            operationId,
            request.IdempotencyKey,
            request.EventType,
            request.Payload,
            DateTimeOffset.UtcNow),
        stopToken);

    await idempotencyStore.StoreAcceptedAsync(
        request.IdempotencyKey,
        operationId,
        stopToken);

    return Results.Accepted($"/events/status/{operationId}");
});

public sealed record EventRequest(
    string IdempotencyKey,
    string EventType,
    JsonElement Payload);

public sealed record IngestedEvent(
    string OperationId,
    string IdempotencyKey,
    string EventType,
    JsonElement Payload,
    DateTimeOffset AcceptedAtUtc);

public sealed record ErrorResponse(string Code);
In a real system, the ordering of idempotency storage and publishing needs careful design. You may use an outbox, transactional store, broker-side deduplication, or an idempotency state machine. The right answer depends on whether duplicate events are acceptable, whether exactly-once effects are required, and what the downstream system can tolerate. At this scale, you should assume duplicate delivery will happen. The design should make duplicate processing harmless.
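One way to close the gap between the lookup and the store in the endpoint above is to reserve the idempotency key atomically before publishing, so that two concurrent requests with the same key agree on a single operation ID. The sketch below shows the idea with an in-memory dictionary standing in for a shared store. `TryReserve` and `reservations` are hypothetical names, and a real deployment would need an equivalent atomic insert-if-absent operation in whatever shared store it uses.

```csharp
using System;
using System.Collections.Concurrent;

// In-memory stand-in for a shared idempotency store (assumption for the sketch).
var reservations = new ConcurrentDictionary<string, string>();

// Returns true only for the caller that created the reservation.
// Every caller receives the operation ID that owns the key.
bool TryReserve(string idempotencyKey, out string operationId)
{
    var candidate = Guid.NewGuid().ToString("N");

    // GetOrAdd stores one value per key; all callers observe that same value.
    operationId = reservations.GetOrAdd(idempotencyKey, candidate);
    return operationId == candidate;
}

var firstIsNew = TryReserve("order-123", out var firstId);
var retryIsNew = TryReserve("order-123", out var retryId);

// Only the request that created the reservation should publish.
Console.WriteLine($"first publishes: {firstIsNew}, retry publishes: {retryIsNew}");
Console.WriteLine($"same operation: {firstId == retryId}");
```

Reserving first has its own failure mode: if the publish fails after the reservation succeeds, the key must be released or marked as failed, otherwise retries are answered with an operation that never happened. That is the state machine the paragraph above refers to.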
Use batching behind the API
The worker side is where you regain efficiency.
public sealed class EventIngestionWorker : BackgroundService
{
    private readonly IEventConsumer _consumer;
    private readonly IEventWriter _writer;
    private readonly ILogger<EventIngestionWorker> _logger;

    public EventIngestionWorker(
        IEventConsumer consumer,
        IEventWriter writer,
        ILogger<EventIngestionWorker> logger)
    {
        _consumer = consumer;
        _writer = writer;
        _logger = logger;
    }

    protected override async Task ExecuteAsync(CancellationToken stopToken)
    {
        // A batch is emitted when it is full or when the wait time elapses.
        await foreach (var batch in _consumer.ReadBatchesAsync(
            maxBatchSize: 1_000,
            maxWaitTime: TimeSpan.FromMilliseconds(100),
            stopToken))
        {
            try
            {
                await _writer.WriteBatchAsync(batch, stopToken);
                _logger.BatchProcessed(batch.Count);
            }
            // Cancellation during shutdown is not a batch failure, so it is not logged as one.
            catch (Exception ex) when (ex is not OperationCanceledException)
            {
                _logger.BatchFailed(ex, batch.Count);
                throw;
            }
        }
    }
}

internal static partial class WorkerLog
{
    [LoggerMessage(
        EventId = 1001,
        Level = LogLevel.Information,
        Message = "Processed ingestion batch with {Count} events.")]
    public static partial void BatchProcessed(
        this ILogger logger,
        int count);

    [LoggerMessage(
        EventId = 1002,
        Level = LogLevel.Error,
        Message = "Failed to process ingestion batch with {Count} events.")]
    public static partial void BatchFailed(
        this ILogger logger,
        Exception exception,
        int count);
}
The source-generated logging pattern avoids some of the overhead of regular logging extension methods and gives you structured logs without unnecessary allocations. The key design point is batching. One database call for a thousand events is usually far cheaper than a thousand database calls for one event each.
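The worker above relies on a consumer that groups events by size or by time, whichever limit is reached first. As an illustration of that rule only, here is a minimal sketch built on an in-process channel. `ReadBatchesAsync` in this form is a hypothetical helper, not the `IEventConsumer` from the worker, and a real broker client would normally provide its own batching.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

// Emits a batch when it reaches maxBatchSize, when maxWaitTime has elapsed
// since its first item became available, or when the channel completes.
static async IAsyncEnumerable<IReadOnlyList<T>> ReadBatchesAsync<T>(
    ChannelReader<T> reader,
    int maxBatchSize,
    TimeSpan maxWaitTime,
    [EnumeratorCancellation] CancellationToken stopToken = default)
{
    // Wait without a deadline for the first item of the next batch.
    while (await reader.WaitToReadAsync(stopToken))
    {
        var items = new List<T>(maxBatchSize);
        using var window = CancellationTokenSource.CreateLinkedTokenSource(stopToken);
        window.CancelAfter(maxWaitTime);

        try
        {
            while (items.Count < maxBatchSize)
            {
                // Drain whatever is already buffered.
                while (items.Count < maxBatchSize && reader.TryRead(out var item))
                {
                    items.Add(item);
                }

                if (items.Count >= maxBatchSize) break;

                // Wait for more until the window closes or the channel completes.
                if (!await reader.WaitToReadAsync(window.Token)) break;
            }
        }
        catch (OperationCanceledException) when (!stopToken.IsCancellationRequested)
        {
            // The batching window elapsed: flush what has been collected.
        }

        if (items.Count > 0) yield return items;
    }
}

var source = Channel.CreateBounded<int>(capacity: 16);
for (var i = 1; i <= 5; i++) source.Writer.TryWrite(i);
source.Writer.Complete();

var sizes = new List<int>();
await foreach (var chunk in ReadBatchesAsync(source.Reader, 2, TimeSpan.FromMilliseconds(100)))
{
    sizes.Add(chunk.Count);
}

Console.WriteLine(string.Join(",", sizes)); // five items, batch size 2: 2,2,1
```

One limitation to be aware of: if `stopToken` is cancelled in the middle of a batch, the items already read are dropped. With a durable broker that is acceptable only because unacknowledged messages are redelivered.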
Databases need partitioning, not hope
If your endpoint depends on one relational database table with one hot index, the system will break long before the API layer reaches a million RPS. A high-throughput write system needs partitioning by a key that spreads load. That might be tenant ID, account ID, region, product ID, event type, customer shard, time bucket, or a generated partition key. The right key depends on the access pattern.
Bad partitioning creates hot shards. Hot shards make horizontal scale look better on a diagram than it behaves in production. For example, partitioning only by date might look sensible until every request for the current day hits the same partition. Partitioning only by tenant might work until one large tenant generates most of the traffic. Partitioning by a random key can spread writes, but make reads and reprocessing harder.
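A common compromise between those extremes is a composite key: keep the tenant in the key, but split each tenant across a fixed number of buckets chosen by a stable hash of the entity. The sketch below illustrates that idea. `StableHash` and `PartitionFor` are hypothetical helpers, and the bucket and partition counts are placeholders you would derive from your own traffic.

```csharp
using System;
using System.Text;

// FNV-1a 32-bit hash. string.GetHashCode is randomised per process in .NET,
// so it cannot be used for partition assignment that has to stay stable.
static uint StableHash(string value)
{
    const uint offsetBasis = 2166136261;
    const uint prime = 16777619;

    var hash = offsetBasis;
    foreach (var b in Encoding.UTF8.GetBytes(value))
    {
        hash ^= b;
        hash = unchecked(hash * prime);
    }

    return hash;
}

// The same tenant and entity always map to the same partition, so per-entity
// ordering is kept, while one tenant is spread over up to bucketsPerTenant partitions.
static int PartitionFor(string tenantId, string entityId, int bucketsPerTenant, int partitionCount)
{
    var bucket = StableHash(entityId) % (uint)bucketsPerTenant;
    return (int)(StableHash($"{tenantId}:{bucket}") % (uint)partitionCount);
}

var partition = PartitionFor("tenant-42", "order-1001", bucketsPerTenant: 8, partitionCount: 32);
Console.WriteLine($"tenant-42/order-1001 -> partition {partition}");
```

The trade-off is the one described above: reading everything for one tenant now means reading up to `bucketsPerTenant` partitions instead of one.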
The data model has to match the traffic model.
A million RPS design should also separate the write model from the read model when needed. You may ingest events into a durable stream, write to append-only storage, project into read models, and serve queries from denormalised stores. That is more complex than a simple CRUD application, but CRUD is rarely the right model for this volume.
EF Core is not automatically wrong, but know where it fits
EF Core is good for a lot of business applications. It gives you change tracking, LINQ, migrations and a productive unit-of-work model. For a million RPS hot path, EF Core is usually not the first tool I would reach for inside the endpoint itself. That does not mean removing EF Core from the system. It means keeping the hot path lean and moving heavier data work into workers, batch processors or specialised repositories.
For read-heavy endpoints, the ideal path is cache first, so EF Core might only appear during cache misses or background refresh. For write-heavy endpoints, the API may not touch the relational database at all. It may append to a broker and let workers use bulk insert, Dapper, raw ADO.NET, database-specific copy APIs, or EF Core where the throughput is acceptable. The mistake is not using EF Core. The mistake is pretending a high-level ORM can hide a bad throughput model.
Auth and authorisation need a plan
Security is often where benchmark designs fall apart. A real endpoint may need authentication, authorisation, tenant isolation, quotas, fraud checks, WAF rules and audit logging. Each of those has a cost. The solution is not to remove them but to make them scale. JWT validation is usually cheaper than introspecting a token against an identity provider on every request. Tenant entitlements should be cached. Authorisation decisions should avoid remote calls in the hot path. API keys should be hashed and cached safely. Rate limits should exist at multiple levels.
A typical production layout
The app should still reject invalid traffic, but it should not be the first and only place abusive traffic is handled.
Rate limiting protects the system
Rate limiting is a stability feature. In ASP.NET Core, the rate limiting middleware can be used to apply fixed window, sliding window, token bucket or concurrency policies.
using Microsoft.AspNetCore.RateLimiting;

var builder = WebApplication.CreateBuilder(args);

builder.Services.AddRateLimiter(options =>
{
    // One shared fixed window for every caller that uses this policy.
    options.AddFixedWindowLimiter("tenant-window", limiter =>
    {
        limiter.PermitLimit = 10_000;
        limiter.Window = TimeSpan.FromSeconds(1);
        limiter.QueueLimit = 0;
    });

    options.RejectionStatusCode = StatusCodes.Status429TooManyRequests;
});

var app = builder.Build();

app.UseRateLimiter();

app.MapPost("/events", (
    EventRequest request,
    CancellationToken stopToken) =>
{
    return Results.Accepted();
})
.RequireRateLimiting("tenant-window");

app.Run();
For a real multi-tenant system, you probably need partitioned limits by tenant, API key, client ID, IP range, region or workload type. You also need upstream limits at the WAF, gateway or load balancer layer. Application rate limiting should be the final guardrail, not the only guardrail.
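To make the idea of partitioned limits concrete, the sketch below uses the System.Threading.RateLimiting primitives directly, outside the HTTP pipeline, so that each tenant ID gets its own fixed window. The limits are deliberately tiny so the behaviour is visible, and `TryAdmit` is a hypothetical helper. In a console project this assumes the System.Threading.RateLimiting package is referenced. In ASP.NET Core the same partitioning is normally expressed as a rate limiter policy keyed from the request.

```csharp
using System;
using System.Threading.RateLimiting;

// One independent fixed window per tenant ID. The numbers are illustrative only.
using var limiter = PartitionedRateLimiter.Create<string, string>(tenantId =>
    RateLimitPartition.GetFixedWindowLimiter(
        partitionKey: tenantId,
        factory: _ => new FixedWindowRateLimiterOptions
        {
            PermitLimit = 2,
            Window = TimeSpan.FromSeconds(10),
            QueueLimit = 0 // reject immediately instead of queueing
        }));

bool TryAdmit(string tenantId)
{
    using var lease = limiter.AttemptAcquire(tenantId);
    return lease.IsAcquired;
}

Console.WriteLine(TryAdmit("tenant-a")); // within the limit
Console.WriteLine(TryAdmit("tenant-a")); // within the limit
Console.WriteLine(TryAdmit("tenant-a")); // third call in the same window is rejected
Console.WriteLine(TryAdmit("tenant-b")); // a different tenant has its own window
```

The point of the partition key is isolation: a noisy tenant exhausts its own window without consuming anyone else's permits.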
Backpressure is not optional
A million RPS system must have a clear answer for what happens when downstream systems cannot keep up. Without backpressure, the API keeps accepting work until something fails badly. That might be memory, thread pool, queue capacity, connection pools, database locks, disk, broker partitions, or cloud spend. Good systems reject or shed load deliberately. For a write endpoint, this might mean returning 429 when a tenant exceeds quota, returning 503 when the broker is unhealthy, or accepting only priority traffic during an incident.
For an internal worker, it might mean slowing consumption, reducing batch size, pausing low-priority partitions, or switching to a degraded processing mode. A simple in-process channel can demonstrate the idea, although a real distributed system would use a durable broker.
builder.Services.AddSingleton(_ =>
{
    return Channel.CreateBounded<IngestedEvent>(
        new BoundedChannelOptions(capacity: 100_000)
        {
            // In Wait mode, TryWrite returns false while the channel is full.
            FullMode = BoundedChannelFullMode.Wait,
            SingleReader = false,
            SingleWriter = false
        });
});

app.MapPost("/events/local", (
    EventRequest request,
    Channel<IngestedEvent> channel) =>
{
    var item = new IngestedEvent(
        Ulid.NewUlid().ToString(),
        request.IdempotencyKey,
        request.EventType,
        request.Payload,
        DateTimeOffset.UtcNow);

    // TryWrite never waits. A full (or completed) channel becomes an immediate
    // rejection instead of a request parked in memory until space frees up.
    if (!channel.Writer.TryWrite(item))
    {
        return Results.StatusCode(StatusCodes.Status429TooManyRequests);
    }

    return Results.Accepted();
});
This is not a replacement for Kafka, Event Hubs, RabbitMQ or another durable broker. It is a useful pattern inside a process when you need bounded work and explicit pressure. The key word is bounded. Unbounded queues are delayed outages.
Logging can become your bottleneck
Logging every request at high volume is expensive. At one million RPS, even a tiny log line per request becomes a massive ingestion problem. If each request emits 500 bytes of logs, that is roughly 500 MB per second before indexing overhead. That is not observability. That is a bill and probably an incident. The better model is structured, sampled and aggregated telemetry. Log errors. Log state transitions. Log unusual behaviour. Log important business events. Sample high-volume success paths. Use metrics for counts, latency, queue depth, cache hit ratio and error rate. Use distributed tracing carefully, with sampling.
High-performance logging in .NET should use source-generated logging for hot paths.
internal static partial class ApiLog
{
    [LoggerMessage(
        EventId = 2001,
        Level = LogLevel.Warning,
        Message = "Rejected event for tenant {TenantId} because the queue is full.")]
    public static partial void QueueFull(
        this ILogger logger,
        string tenantId);

    [LoggerMessage(
        EventId = 2002,
        Level = LogLevel.Warning,
        Message = "Rejected duplicate request with idempotency key {IdempotencyKey}.")]
    public static partial void DuplicateRequest(
        this ILogger logger,
        string idempotencyKey);
}
Do not log full request bodies on the hot path. If you need payload capture for debugging, make it sampled, temporary and protected. Also make sure it does not capture secrets or personal data.
Memory allocation decides how far you get
High RPS magnifies small allocation mistakes. Allocating a few extra kilobytes per request sounds harmless until you multiply it by one million. At that point you are generating gigabytes of allocation pressure per second, and the garbage collector becomes part of your latency profile. The first rule is simple. Measure allocations before guessing. Use load tests, dotnet-counters, dotnet-trace, Application Insights, OpenTelemetry metrics, GC counters and allocation profiling. Watch allocation rate, Gen 0 collections, Gen 2 collections, LOH pressure and pause times. Common causes include large JSON payloads, repeated string concatenation, unnecessary mapping, buffering request bodies, creating HttpClient instances incorrectly, excessive LINQ in hot paths, reflection-heavy serialisation, and logging templates that allocate before the log level is checked.
For hot endpoints, prefer small request and response contracts, source-generated JSON, pooled reusable objects only where justified, and streaming where payloads are large. Don't optimise everything. Optimise what profiling proves is hot.
Network bandwidth can become the real limit
A million RPS with a 100 byte response is a different problem from a million RPS with a 50 KB response.
The rough maths.
1,000,000 RPS x 1 KB response = about 1 GB/s before protocol overhead
1,000,000 RPS x 10 KB response = about 10 GB/s before protocol overhead
1,000,000 RPS x 50 KB response = about 50 GB/s before protocol overhead
That has consequences for instance size, network interface limits, load balancer capacity, cross-zone traffic, Redis bandwidth, observability ingestion and cloud cost. Payload design is infrastructure design. Keep responses small. Compress only when it helps. Avoid returning large graphs from hot endpoints. Use pagination, projections, field selection, ETags and cacheable resources.
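Network interfaces and load balancers are usually rated in bits per second, so it helps to convert the figures above before comparing them with instance limits. A small sketch, assuming decimal units (1 KB = 1,000 bytes) and ignoring protocol overhead; `GigabitsPerSecond` is a hypothetical helper.

```csharp
using System;

// Payload bandwidth in gigabits per second, before protocol overhead.
static decimal GigabitsPerSecond(long requestsPerSecond, long responseBytes)
    => requestsPerSecond * responseBytes * 8m / 1_000_000_000m;

foreach (var kilobytes in new long[] { 1, 10, 50 })
{
    var gbps = GigabitsPerSecond(1_000_000, kilobytes * 1_000);
    Console.WriteLine($"{kilobytes} KB responses at 1,000,000 RPS: {gbps} Gbit/s");
}
```

At 1 KB per response the fleet as a whole moves about 8 Gbit/s, and at 50 KB about 400 Gbit/s. That total is spread across instances, but every shared hop in front of them has to carry the sum.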
Infrastructure is part of the endpoint
A production design on AKS or another Kubernetes platform.
The API instances need to scale horizontally. The node pool needs enough capacity to schedule them. The autoscaler needs metrics that represent real pressure, not just CPU. CPU can be low while the system is failing because the bottleneck is queue depth, Redis latency, connection pool exhaustion or downstream throttling.
For Kubernetes, you need sensible CPU and memory requests so the scheduler can place pods correctly. You need limits carefully. Too low and you throttle healthy pods. Too high and one pod can hurt the node. You need pod disruption budgets so deployments and node maintenance do not take out too much capacity at once. You need readiness probes that remove unhealthy pods from traffic. You need liveness probes that restart broken pods. You need startup probes if cold start is slow.
A minimal deployment shape.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hot-api
spec:
  replicas: 20
  selector:
    matchLabels:
      app: hot-api
  template:
    metadata:
      labels:
        app: hot-api
    spec:
      containers:
        - name: hot-api
          image: myregistry.azurecr.io/hot-api:1.0.0
          ports:
            - containerPort: 8080
          resources:
            requests:
              cpu: "1000m"
              memory: "512Mi"
            limits:
              cpu: "2000m"
              memory: "1024Mi"
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 8080
            periodSeconds: 5
            failureThreshold: 2
          livenessProbe:
            httpGet:
              path: /health/live
              port: 8080
            periodSeconds: 10
            failureThreshold: 3
And the autoscaler.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: hot-api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: hot-api
  minReplicas: 20
  maxReplicas: 200
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
CPU-based autoscaling is only a starting point. For a serious ingestion endpoint, custom metrics such as queue depth, broker publish latency, p99 latency, request rate per pod and rejection rate are often better scaling signals.
Health checks should reflect dependency health
Health endpoints are easy to get wrong. A liveness check should tell the platform whether the process is alive. It should not fail just because Redis or the database is slow. If liveness checks depend on external systems, the orchestrator may restart healthy pods during a dependency outage and make the incident worse.
A readiness check should tell the platform whether the pod should receive traffic. Readiness can include critical dependency checks, warmup state and local queue pressure.
builder.Services
    .AddHealthChecks()
    .AddCheck("self", () => HealthCheckResult.Healthy())
    .AddRedis(redisConnectionString, name: "redis");

var app = builder.Build();

// Liveness: only the process itself, never external dependencies.
app.MapHealthChecks("/health/live", new HealthCheckOptions
{
    Predicate = check => check.Name == "self"
});

// Readiness: the process plus the dependencies it needs to serve traffic.
app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    Predicate = check => check.Name is "self" or "redis"
});
The exact checks depend on the endpoint. A cached read endpoint may be ready if it has a warm local cache even during a short Redis issue. A write endpoint may not be ready if it cannot publish to the broker. Readiness is not a formality. It controls traffic.
Event streams need partition planning
If the endpoint accepts writes and publishes to Event Hubs, Kafka or another broker, broker capacity becomes part of the design. You need enough partitions to parallelise producers and consumers. You need enough throughput capacity to handle ingress and egress. You need a partition key that spreads load without destroying ordering requirements. You need consumer groups and worker scaling that match partition count. You need replay strategy, retention settings, poison message handling and dead-letter flows.
With Azure Event Hubs, throughput is controlled by throughput units, processing units, capacity units and partitions, depending on the tier. Auto-inflate can help the Standard tier scale up throughput units when load increases, but it is not a substitute for capacity modelling. Premium and Dedicated tiers give stronger isolation and higher scale options for demanding workloads.
The API code can look clean while the broker is under-partitioned. That is why broker metrics are as important as API metrics.
External calls do not belong in the hot path
A million RPS endpoint should not synchronously depend on a third-party HTTP service unless there is no alternative. External calls introduce latency, retry storms, rate limits, DNS issues, TLS overhead, regional failure modes and unpredictable tail latency. If you must call another service, use IHttpClientFactory, timeouts, circuit breakers, bulkheads and clear retry policy. But for the hottest paths, prefer local data, cache, precomputed state, async workflows and background reconciliation.
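If a dependency call has to stay in the request path, the minimum protection is a hard deadline with a defined fallback. The sketch below shows only that one piece, using a linked cancellation token. `WithTimeoutAsync` is a hypothetical helper, not a circuit breaker or bulkhead, and a resilience library would normally cover those together with retries.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Runs a dependency call with a deadline. If the deadline passes first, the
// call is cancelled and the fallback (for example, a cached value) is returned.
static async Task<T> WithTimeoutAsync<T>(
    Func<CancellationToken, Task<T>> call,
    TimeSpan timeout,
    T fallback,
    CancellationToken stopToken = default)
{
    using var deadline = CancellationTokenSource.CreateLinkedTokenSource(stopToken);
    deadline.CancelAfter(timeout);

    try
    {
        return await call(deadline.Token);
    }
    catch (OperationCanceledException) when (!stopToken.IsCancellationRequested)
    {
        // The deadline fired, not the caller: degrade instead of failing.
        return fallback;
    }
}

// A simulated slow dependency: it takes far longer than the 50 ms budget.
var result = await WithTimeoutAsync(
    async token =>
    {
        await Task.Delay(TimeSpan.FromSeconds(5), token);
        return "live";
    },
    timeout: TimeSpan.FromMilliseconds(50),
    fallback: "cached");

Console.WriteLine(result);
```

The deadline only works if the dependency call actually observes the token it is given. A call that ignores cancellation will still hold the request open for its full duration.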
Synchronous fan-out is one of the fastest ways to destroy tail latency.
This shape looks simple, but the request is now only as fast and reliable as the slowest dependency. At high scale, it also multiplies traffic internally. The better shape is often to precompute what the endpoint needs.
The endpoint becomes a read from a purpose-built model instead of a live integration workflow.
Native AOT can help, but it is not the main answer
Native AOT can reduce startup time and memory footprint. That can help in serverless environments, scale-out scenarios, cold starts and dense hosting. ASP.NET Core supports Native AOT for suitable app shapes, with Minimal APIs being the natural fit. However, Native AOT does not fix a database bottleneck, a bad partition key, excessive logging or an endpoint that calls five services per request. Use it where the constraints fit. Be aware of reflection, dynamic code generation, serialisation requirements and library compatibility. It's a deployment and runtime optimisation, not a system architecture strategy.
Do not confuse load testing with benchmarking
A benchmark asks how fast one thing can go under controlled conditions. A load test asks how the system behaves under expected and unexpected traffic. You need both, but they answer different questions. Start with the smallest possible endpoint to understand the ceiling of your API host. Then test the real endpoint with real payloads, auth, caching, logging, rate limiting, queue publishing and dependency behaviour. Then test failure modes.
A local smoke test might use wrk.
wrk -t16 -c1024 -d60s http://localhost:8080/health-fast
A more realistic API test might use k6.
import http from "k6/http";
import { check, sleep } from "k6";

export const options = {
  vus: 500,
  duration: "5m",
  thresholds: {
    http_req_failed: ["rate<0.001"],
    http_req_duration: ["p(95)<100", "p(99)<250"]
  }
};

export default function () {
  const payload = JSON.stringify({
    idempotencyKey: crypto.randomUUID(),
    eventType: "page_view",
    payload: {
      page: "/products/123"
    }
  });

  const response = http.post("https://api.example.com/events", payload, {
    headers: {
      "Content-Type": "application/json"
    }
  });

  check(response, {
    "accepted": r => r.status === 202
  });

  sleep(1);
}
Don't stop when the happy path passes. Test Redis latency. Test broker throttling. Test database failover. Test a bad deploy. Test a region outage. Test noisy tenant traffic. Test what happens when logs cannot be exported. Test what happens when the queue is full. The system should fail predictably.
The metrics you actually need
For the API layer, watch request rate, p50, p95, p99, p999 if needed, error rate, saturation, CPU, memory, allocation rate, GC pause time, thread pool queue length, active connections and response size.
For the cache layer, watch hit ratio, miss ratio, latency, evictions, memory pressure, command rate, hot keys and network bandwidth.
For the broker, watch publish latency, ingress throughput, egress throughput, throttling, partition skew, consumer lag, failed publishes and retry count.
For workers, watch batch size, batch duration, processing rate, retry rate, poison messages, dead-letter count and backlog age.
For the database, watch write latency, lock waits, deadlocks, CPU, I/O, log flush waits, index pressure, hot partitions, connection count and replication lag.
For the platform, watch pod restarts, readiness failures, HPA behaviour, node pressure, cross-zone traffic, load balancer errors and WAF rejects.
If you cannot see these numbers, you are not ready to claim the system can handle a million RPS.
Deployment strategy
At high scale, deployments are traffic events. A rolling deployment that replaces too many pods at once can cut capacity. A bad image can trigger mass restarts. A cold cache can stampede the database. A schema migration can lock a table. A new log line can overload your telemetry pipeline. Use progressive delivery. Deploy to a small slice first. Warm caches before taking full traffic. Use readiness gates. Keep enough surge capacity. Separate database migrations from application rollout when possible. Use backward-compatible schema changes. Watch metrics automatically and roll back quickly when error rate or latency crosses a threshold. The deployment process should protect capacity, not merely ship code.
Cost is part of the architecture
One million RPS can get expensive quickly. API compute is only one line item. You also pay for load balancing, WAF, bandwidth, cross-zone traffic, Redis, broker throughput, storage writes, database capacity, logging, metrics, traces and retained data.
Logging can cost more than compute. Cross-zone traffic can surprise you. Cache misses can become database spend. Overly aggressive autoscaling can hide inefficient code by adding machines. A serious design should include a cost per million requests, not just a latency chart.
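A cost per million requests is simple arithmetic once the line items are known. The sketch below uses placeholder hourly figures that are entirely made up for illustration and are not real prices from any provider. The point is the shape of the calculation, not the numbers.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical hourly costs in US dollars. Placeholders only, not real prices.
var hourlyCosts = new Dictionary<string, decimal>
{
    ["API compute"] = 300m,
    ["Redis"] = 120m,
    ["Broker throughput"] = 150m,
    ["Database and storage writes"] = 180m,
    ["Logs, metrics and traces"] = 150m,
};

// Divides the hourly bill by the number of millions of requests served in that hour.
decimal CostPerMillionRequests(decimal totalHourlyCost, decimal sustainedRps)
{
    decimal millionsOfRequestsPerHour = sustainedRps * 3600m / 1_000_000m;
    return totalHourlyCost / millionsOfRequestsPerHour;
}

decimal totalHourly = hourlyCosts.Values.Sum();
decimal perMillion = CostPerMillionRequests(totalHourly, sustainedRps: 1_000_000m);

foreach (var (item, cost) in hourlyCosts)
{
    // Share of the total bill, to show which line item dominates.
    Console.WriteLine($"{item}: {cost / totalHourly:P0} of hourly spend");
}
Console.WriteLine($"Total hourly cost: {totalHourly}");
Console.WriteLine($"Cost per million requests: {perMillion}");
```

With these placeholder inputs the total is 900 dollars per hour across 3,600 million requests, which works out to 0.25 dollars per million. Tracking that figure per release makes it visible when a new log line or a lower cache hit ratio quietly raises the cost of every request.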
What I would build first
I would not start with the full million RPS system. I would build a thin Minimal API endpoint that represents the real request contract. I would make it cache-first for reads or broker-first for writes. I would add source-generated JSON, cheap validation, cancellation tokens, bounded work, rate limiting, health checks and structured source-generated logs.
Then I would run a single-instance benchmark to understand the ceiling. Then a small multi-instance test behind a load balancer. Then I would add Redis, Event Hubs or Kafka, workers and the real persistence model. Then load test the full path and measure p99, cache hit ratio, broker lag, database write throughput and error rate.
Only after that would I tune Kestrel, pod CPU, GC settings, serialiser details or Native AOT. Those optimisations are useful, but only after the architecture stops doing obviously expensive things.
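Once the single-instance ceiling is measured, a first estimate of the fleet size follows from it. This is a back-of-the-envelope sketch: the 50,000 RPS ceiling and the 60% target utilisation are assumed values, and the InstancesNeeded helper is hypothetical. It ignores zone failures and deployment surge, which would both push the number higher.

```csharp
using System;

// Estimates how many instances are needed to serve targetRps when each instance
// is run at only a fraction (targetUtilisation) of its measured ceiling.
int InstancesNeeded(double targetRps, double measuredCeilingRps, double targetUtilisation)
{
    if (measuredCeilingRps <= 0 || targetUtilisation <= 0 || targetUtilisation > 1)
    {
        throw new ArgumentOutOfRangeException(
            nameof(targetUtilisation), "Ceiling must be positive and utilisation must be in (0, 1].");
    }

    double usableRpsPerInstance = measuredCeilingRps * targetUtilisation;
    return (int)Math.Ceiling(targetRps / usableRpsPerInstance);
}

// Assumed inputs: 50,000 RPS measured on one instance, run at 60% to keep headroom.
int instances = InstancesNeeded(targetRps: 1_000_000, measuredCeilingRps: 50_000, targetUtilisation: 0.6);
Console.WriteLine($"Instances needed: {instances}"); // prints "Instances needed: 34"
```

The estimate is only as good as the benchmark behind it. A ceiling measured on a trivial endpoint will overstate what the real request path can sustain, so the number should be revisited after the full-path load test.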
A practical reference implementation shape
The solution structure could look like this:
src/
  HotEndpoint.Api/
    Program.cs
    Contracts/
    Json/
    Middleware/
    Health/
  HotEndpoint.Application/
    Ingestion/
    Caching/
    Idempotency/
    RateLimits/
  HotEndpoint.Infrastructure.Redis/
    RedisPriceCache.cs
    RedisIdempotencyStore.cs
  HotEndpoint.Infrastructure.EventHubs/
    EventHubPublisher.cs
    EventHubConsumer.cs
  HotEndpoint.Workers/
    EventIngestionWorker.cs
    Projections/
  HotEndpoint.Storage/
    EventWriter.cs
    ReadModels/
tests/
  HotEndpoint.LoadTests/
  HotEndpoint.IntegrationTests/
The API project stays thin. The application layer owns the use cases. Infrastructure projects own Redis, Event Hubs and storage integrations. Workers scale separately from API pods.
That separation is useful because a million RPS design needs independent scaling. The API layer, cache layer, broker layer, worker layer and storage layer all have different bottlenecks.
The honest answer
Can a .NET endpoint handle a million requests per second?
Yes, if the endpoint is designed as part of a horizontally scaled system, the request path is short, reads are cached, writes are queued, dependencies are partitioned, backpressure is deliberate, and the infrastructure is built for the traffic.
No, if the endpoint means one normal API method that authenticates, logs, validates, calls other services, writes to SQL and returns a fully processed result for every request.
Minimal APIs, Kestrel, async I/O, source-generated JSON, output caching, high-performance logging and Native AOT can all help. But the architecture matters more. At this scale, the endpoint is not the hero. The design around the endpoint is.