AWS: App Runner Deep Dive

2021-06-18

Launched last month, AWS App Runner gets a lot of things right to simplify spinning up simple web applications – to the point that no prior knowledge of all the AWS infrastructure bells and whistles is required.

There’s also a new mode of billing, which is very exciting to me.

Let’s take a look, especially at the technical details!

Overview

The concept is awesome: App Runner is what happens when you mix CodeBuild, CodeDeploy, Beanstalk, Lambda and Fargate in a vial, shake it thoroughly, and then distill it into something beautifully simple.

It aims to provide the whole shebang from source code to production, while taking care of seamless deployments. I just plug in my web application, hit play, and I have a TLS-secured load-balanced application running in multiple availability zones. And it autoscales out of the box! That’s pretty cool.

To say this again: Virtually no prior AWS-specific knowledge is required. I don’t need to know anything about VPC networking. I don’t have to fumble with Elastic Load Balancing for half an hour. I don’t need to know who ACM even is, or Fargate, or ECS, or – heaven forbid – EKS.

Note that I said web application: App Runner is explicitly designed for something that speaks HTTP, is health-checkable by HTTP and is used from the outside world via HTTP. And, needless to say, the application needs to be stateless – just like any web application that wants to be loadbalanced and scaled properly.

App Runner operates on containers. I can either simply use a container image from a registry, or I can have App Runner build one for me.

In the latter case, we’re not talking about the classic mode of “here’s my Dockerfile, go build that”. Instead, App Runner provides base images for some programming languages. It connects to a Github repository, places the actual source code in the base image, and builds a container image from that. This is actually nice, because I don’t have to worry about maintaining and security-patching the base image. And this “just add source code” concept is already known from AWS Lambda. It’s magic.

Exciting New Billing Model

I’m really excited about a feature that got little attention: CPU pricing is only charged when the application is actually handling requests. Repeat: An idle application will not incur any CPU charges.

This brings the whole model much closer to what AWS Lambda does. For almost all applications, that’s a significant reduction. Not only for development and QA environments, but also for most production applications. While Lambda can do this per request with millisecond precision, App Runner does this by the second, which already is great.

Keep in mind that on AWS, paying less translates pretty well into less waste of resources – and therefore to my application’s carbon footprint. Less waste is good for everyone, not just for the CFO.

App Runner can do this easily because it’s in control of the loadbalancing. It knows when there are no active requests.

In contrast to Lambda, App Runner cannot fully stop an application when there are no requests. There may be background threads, cron jobs and so on. Also, the container’s startup time might be way too high to have the next request wait for it.

The secret here is that App Runner keeps the application running, but it throttles the CPU massively. This is a brilliant solution for containerized applications. And it’s a huge step towards the vision that I’m driving with re:Web.

This is also why only CPU charges are dynamic; memory is charged for the whole time the application is available – but memory pricing isn’t too significant in comparison.

An awesome next step would be if App Runner allowed opting in to fully stopping the application, so there is actually no instance running while there are no requests – like Lambda does.

Technical Deep Dive

Enough chitchat. You can get all the usual details from the AWS blog or the AWS blog.

Let’s take a look under the hood, as far as we can do that from a customer’s perspective.

Building

Building from Github is actually the only process where I get proper log output. It looks very much like I’d expect a Docker build to look:

[AppRunner] Starting to build your application source code.
[Build] foo
[Build] Sending build context to Docker daemon  9.216kB
[Build] Step 1/5 : FROM 172267607524.dkr.ecr.eu-west-1.amazonaws.com/awsfusionruntime-python3:3.8.5
[Build] 3.8.5: Pulling from awsfusionruntime-python3
[...]
[Build] Status: Downloaded newer image for 172267607524.dkr.ecr.eu-west-1.amazonaws.com/awsfusionruntime-python3:3.8.5
[Build]  ---> cca0c7fe0d48
[Build] Step 2/5 : COPY . /app
[Build]  ---> 1809bc3e008b
[Build] Step 3/5 : WORKDIR /app
[Build]  ---> Running in 1c8671d6366f
[Build] Removing intermediate container 1c8671d6366f
[Build]  ---> 5a491f6a232a
[Build] Step 4/5 : RUN echo foo
[Build]  ---> Running in 33150a894724
[Build] foo
[Build] Removing intermediate container 33150a894724
[Build]  ---> 23df2b9f7940
[Build] Step 5/5 : EXPOSE 8080
[Build]  ---> Running in 0d6f001a39d7
[Build] Removing intermediate container 0d6f001a39d7
[Build]  ---> e0d9f18a3c50
[Build] Successfully built e0d9f18a3c50
[Build] Successfully tagged application-image:latest

In case you’re wondering: echo foo is my build command, because I was forced to give one.

When not using the Console, the build configuration is maintained in apprunner.yaml, as part of the repository. A simple example looks like this:

version: 1.0
runtime: python3 
build:
  commands:
    build:
      - pip install pipenv
      - pipenv install 
run: 
  command: python app.py

(taken from the documentation)
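
To make the run command above a bit more tangible, here is a minimal sketch of what such an app.py could look like, using only Python’s standard library. The handler, the response body and the make_server helper are my own assumptions, not something App Runner prescribes; port 8080 matches the EXPOSE line in the build output.

```python
"""Hypothetical app.py: a tiny stateless web application."""
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class Handler(BaseHTTPRequestHandler):
    # http.server answers with HTTP/1.0 unless told otherwise.
    protocol_version = "HTTP/1.1"

    def do_GET(self):
        # No state is kept between requests, so any instance can answer any request.
        body = b"Hello from App Runner\n"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def make_server(port=8080):
    # Bind to all interfaces so the proxy in front of the container can reach the app.
    return ThreadingHTTPServer(("0.0.0.0", port), Handler)


# To use this as the service entry point, finish the file with:
# make_server().serve_forever()
```

Since every GET returns a 200, the same endpoint also works as a health check target.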

One small weirdness: The resulting container will have the generated Dockerfile in /app. Nothing to see there, though.

Sadly, I was not allowed to pull the awsfusionruntime-python3 image myself.

Architecture

From what I can tell, App Runner isn’t built on Application Loadbalancer. There seems to be a huge fleet of envoy proxies handling the requests from the internet.

The application container itself runs on an AWS-owned ECS cluster.

App Runner isn’t orchestrating other AWS services for me – it encapsulates them. This means that any resources used by App Runner will not show up in my account. I won’t see the ACM certificate or the ECS service, for example. Luckily, this also means that I will not have the usual IAM headaches to grant carefully scoped access to other AWS services.

DNS

The DNS record returns three IPs (even when only one instance is available), which I assume is one per availability zone. But for reasons unknown, App Runner takes a quick CNAME detour to explore the length limits of DNS names:

ysvp2d42qc.eu-west-1.awsapprunner.com. 60 IN CNAME 8a3b877e0fff4ad5abcbb491626a7531.5ysp12zfctly24v37dziw9tv9vgyog35m8cxbz6vyafmg15s70.6dd878t7dna7cgqgxld7zry5aq57tlrwg7uq14fk6g3sr656w4.eu-west-1.awsapprunner.com.
8a3b877e0fff4ad5abcbb491626a7531.5ysp12zfctly24v37dziw9tv9vgyog35m8cxbz6vyafmg15s70.6dd878t7dna7cgqgxld7zry5aq57tlrwg7uq14fk6g3sr656w4.eu-west-1.awsapprunner.com. 60 IN A 54.75.254.235
8a3b877e0fff4ad5abcbb491626a7531.5ysp12zfctly24v37dziw9tv9vgyog35m8cxbz6vyafmg15s70.6dd878t7dna7cgqgxld7zry5aq57tlrwg7uq14fk6g3sr656w4.eu-west-1.awsapprunner.com. 60 IN A 34.255.134.39
8a3b877e0fff4ad5abcbb491626a7531.5ysp12zfctly24v37dziw9tv9vgyog35m8cxbz6vyafmg15s70.6dd878t7dna7cgqgxld7zry5aq57tlrwg7uq14fk6g3sr656w4.eu-west-1.awsapprunner.com. 60 IN A 52.210.161.22

I don’t know how I feel about the 60-second TTL – it seems to introduce a lot of delay through repeated lookups.
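
If you want to check the number of addresses yourself, a small sketch like the following should do. The function name is mine, and the standard library does not expose the TTL or the CNAME chain, so this only shows the final A records.

```python
import socket


def resolve_ipv4(hostname):
    """Return the sorted, de-duplicated IPv4 addresses the resolver reports for hostname."""
    infos = socket.getaddrinfo(hostname, 443, family=socket.AF_INET, type=socket.SOCK_STREAM)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr is (ip, port) for IPv4.
    return sorted({info[4][0] for info in infos})


# Example, using the service hostname from the DNS records above:
# print(resolve_ipv4("ysvp2d42qc.eu-west-1.awsapprunner.com"))
```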

App Runner’s DNS names do not use DNSSEC.

TLS

There is no support for TLS 1.3 (bad). Only version 1.2 is supported, i.e. no older TLS versions (good).
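
One way to verify this from the outside is to offer the server exactly one TLS version at a time and see whether the handshake completes. This is a rough sketch with Python’s ssl module; context_for and accepts are hypothetical helper names, and the result also depends on what the local OpenSSL build supports.

```python
import socket
import ssl


def context_for(version):
    """Build a client context that offers exactly one TLS version."""
    ctx = ssl.create_default_context()
    ctx.minimum_version = version
    ctx.maximum_version = version
    return ctx


def accepts(host, version, port=443, timeout=5.0):
    """Return True if the server completes a handshake when only this version is offered."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            with context_for(version).wrap_socket(sock, server_hostname=host) as tls:
                return tls.version() is not None
    except OSError:
        # Covers ssl.SSLError (handshake rejected) as well as connection problems.
        return False


# Example, to be run against your own service hostname:
# for v in (ssl.TLSVersion.TLSv1_2, ssl.TLSVersion.TLSv1_3):
#     print(v.name, accepts("ysvp2d42qc.eu-west-1.awsapprunner.com", v))
```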

Certificates for custom domain names are created with Certificate Transparency.

The Qualys SSL Labs server test grades an App Runner-hosted site as A (I’d assume that A+ would be possible if the application emitted HSTS headers).

Frontend

Most notably, App Runner-hosted applications will let everyone know they’re running behind envoy in the HTTP headers:

server: envoy
x-envoy-upstream-service-time: 2005

This even happens for successful requests. It’s rather irritating that a proxy overwrites my Server header.

Also there is no support for HTTP/1.0 (envoy limitation). While this isn’t an issue for modern browsers, some tools still default to HTTP/1.0 (Python’s http.server and ApacheBench, for example).

During my tests, I could see about 30 different proxy IPs, all from the 10.0.0.0/16 range. The TCP connection comes from some link-local address.

In the access log, it looks like this:

169.254.175.249 - - [18/Jun/2021 14:07:50] "GET / HTTP/1.1" 200 -

And X-Forwarded-For looks like this (where the first one is my obfuscated client IP):

X-Forwarded-For: 37.123.234.345, 10.0.187.62
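
For an application that wants the real client IP, this means picking the right entry out of that header. The sketch below assumes that exactly one entry (the last one) is added by the proxy fleet and that the entry before it is the client as seen from the edge; that is my reading of the header above, not documented behaviour. Anything further to the left could have been sent by the client itself and shouldn’t be trusted.

```python
def split_forwarded_for(header_value):
    """Split an X-Forwarded-For value into its hops, left to right."""
    return [hop.strip() for hop in header_value.split(",") if hop.strip()]


def client_and_proxy(header_value):
    """Return (client, proxy), assuming only the last hop was appended by the proxy fleet."""
    hops = split_forwarded_for(header_value)
    if len(hops) < 2:
        raise ValueError("expected at least a client and a proxy entry")
    return hops[-2], hops[-1]


# 203.0.113.7 is a documentation address standing in for the client.
print(client_and_proxy("203.0.113.7, 10.0.187.62"))
# A client-supplied extra entry on the left is ignored:
print(client_and_proxy("1.2.3.4, 203.0.113.7, 10.0.187.62"))
```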

Loadbalancing seems to be simple round-robin or connection count or something – I did not observe any kind of persistence, neither with a cookie nor by client IP. I consider this a good thing.

I tried three instances concurrently, and they were placed in eu-west-1{a,b,c}, so I’m guessing App Runner always spreads them evenly.

By the way: When the service is paused, envoy returns HTTP 404 (Not found). I’m not sure if that’s a smart move, especially with regard to search engine crawlers etc.

Container execution / ECS

The container will find itself in an ECS environment. This comes with the expected environment variables being injected:

AWS_EXECUTION_ENV=AWS_ECS_FARGATE
AWS_DEFAULT_REGION=eu-west-1
AWS_REGION=eu-west-1
ECS_CONTAINER_METADATA_URI=http://169.254.170.2/v3/58f40ddf9c4b42fc83bfc6c7fe1a1cd8-193386898
ECS_CONTAINER_METADATA_URI_V4=http://169.254.170.2/v4/58f40ddf9c4b42fc83bfc6c7fe1a1cd8-193386898
[...]

Tickling $ECS_CONTAINER_METADATA_URI_V4/task reveals some interesting bits:

{
 "Cluster": "arn:aws:ecs:eu-west-1:172068907549:cluster/bullet-srv-999999999999",
 "TaskARN": "arn:aws:ecs:eu-west-1:172068907549:task/bullet-srv-999999999999/58f40ddf9c4b42fc83bfc6c7fe1a1cd8",
 "Family": "bullet-td-6b4120addb68419590ea47b94506ed43-8777b77689784d7a9cd6815f70cf4bc5",
 "Revision": "1",
 "Containers": [
   {
     "DockerId": "58f40ddf9c4b42fc83bfc6c7fe1a1cd8-193386898",
     "Name": "instance",
     "DockerName": "instance",
     "Image": "736217495344.dkr.ecr.eu-west-1.amazonaws.com/image-repo-6b4120addb68419590ea47b94506ed43@sha256:02b8e350b106f837b1346b52e64590281ed953fd506737779f6a59072d4c5cdb",
     "ImageID": "sha256:02b8e350b106f837b1346b52e64590281ed953fd506737779f6a59072d4c5cdb",
     "Labels": {
       "com.amazonaws.ecs.cluster": "arn:aws:ecs:eu-west-1:172068907549:cluster/bullet-srv-999999999999",
       "com.amazonaws.ecs.container-name": "instance",
       "com.amazonaws.ecs.task-arn": "arn:aws:ecs:eu-west-1:172068907549:task/bullet-srv-999999999999/58f40ddf9c4b42fc83bfc6c7fe1a1cd8",
       "com.amazonaws.ecs.task-definition-family": "bullet-td-6b4120addb68419590ea47b94506ed43-8777b77689784d7a9cd6815f70cf4bc5",
       "com.amazonaws.ecs.task-definition-version": "1"
     },
     "Limits": {
       "CPU": 960
      }
   },
   {
     "DockerId": "58f40ddf9c4b42fc83bfc6c7fe1a1cd8-329321356",
     "Name": "aws-fargate-request-proxy",
     "DockerName": "aws-fargate-request-proxy",
     "Image": "172068907549.dkr.ecr.eu-west-1.amazonaws.com/aws-fargate-request-proxy:cell9",
     "ImageID": "sha256:71bb81a8e1a15726cc5711f7334e11ccf27f53504724128a75bf3915ace1960b",
     "Labels": {
       "com.amazonaws.ecs.cluster": "arn:aws:ecs:eu-west-1:172068907549:cluster/bullet-srv-999999999999",
       "com.amazonaws.ecs.container-name": "aws-fargate-request-proxy",
       "com.amazonaws.ecs.task-arn": "arn:aws:ecs:eu-west-1:172068907549:task/bullet-srv-999999999999/58f40ddf9c4b42fc83bfc6c7fe1a1cd8",
       "com.amazonaws.ecs.task-definition-family": "bullet-td-6b4120addb68419590ea47b94506ed43-8777b77689784d7a9cd6815f70cf4bc5",
       "com.amazonaws.ecs.task-definition-version": "1"
     },
     "Limits": {
       "CPU": 64,
       "Memory": 40
     }
   }
 ],
 "Limits": {
   "CPU": 1,
   "Memory": 2048
 },
 "AvailabilityZone": "eu-west-1c"
}

(boring parts removed; and 999999999999 is where my account-id appeared)
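
For reference, this is roughly how the document can be fetched and boiled down from inside the container. It’s a sketch: fetch_task_metadata and summarize are names I made up, and the fields are simply the ones visible in the output above. Note that the two container CPU limits (960 + 64) add up to 1024 CPU units, which would be exactly the one vCPU I configured.

```python
import json
import os
import urllib.request


def fetch_task_metadata(timeout=2.0):
    """Fetch the task metadata from the endpoint that ECS injects into the container."""
    base = os.environ["ECS_CONTAINER_METADATA_URI_V4"]
    with urllib.request.urlopen(f"{base}/task", timeout=timeout) as resp:
        return json.load(resp)


def summarize(task):
    """Pick the cluster, availability zone and per-container limits out of the document."""
    return {
        "cluster": task.get("Cluster"),
        "availability_zone": task.get("AvailabilityZone"),
        "containers": {c["Name"]: c.get("Limits", {}) for c in task.get("Containers", [])},
    }


# Inside the container:
# print(json.dumps(summarize(fetch_task_metadata()), indent=2))
```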

I wonder what the story behind bullet is. It seems to be the internal name of App Runner, or they changed the name last-minute. Additional data point: At some point, the Console used to link to bullet.yml instead of apprunner.yml.

That fargate-request-proxy sounds interesting, so I tried pulling those images, of course. I wasn’t allowed to.

Other than that, I didn’t notice anything out of the ordinary.

Auto-Scaling

Scaling happens quite quickly.

I configured each container instance to stomach ten concurrent connections.

Then I overwhelmed the service by going immediately from 0 connections (idle container instance) to 40 concurrent connections. The available container instance responded immediately, as it should. A second container instance started serving requests just 22 seconds later, and the third instance another 28 seconds later.

The maximum number of instances was configured at three, so a fourth instance was never started.

The important part is to configure the concurrency per container instance lower than what it actually can handle. That leaves some headroom until scaling kicks in – or in case one container instance (or even a whole availability zone) fails.
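
The arithmetic behind my test can be sketched like this. It’s a simplified model of concurrency-based scaling under my own assumptions, not App Runner’s actual algorithm, and desired_instances is a made-up name.

```python
import math


def desired_instances(concurrent, per_instance, max_instances, min_instances=1):
    """Instances needed so that none exceeds its configured concurrency, clamped to the limits."""
    needed = math.ceil(concurrent / per_instance)
    return max(min_instances, min(needed, max_instances))


print(desired_instances(40, 10, 3))   # 3: four would be needed, but the maximum caps it
print(desired_instances(40, 10, 10))  # 4
print(desired_instances(0, 10, 3))    # 1: the idle instance stays around
```

With three instances at ten connections each, my 40 connections left ten more than the configured total, which is exactly the situation where the headroom matters.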

I’d say scaling reacts fast enough for pretty much all use-cases (unless you’re expecting Advent of Code-style traffic peaks).

Throttling

Now this is super interesting.

As mentioned earlier, the application gets throttled almost immediately when there are no active requests.

Let’s put a number on that. Running some Python code that does 10,000 iterations of random math stuff, here’s how long that takes:

~0.9 seconds on my 2017 iMac
~2.5 seconds on App Runner with 1 CPU configured, while active
~150 seconds while inactive

Yes, 2.5 minutes. The task is throttled by a factor of 60. While that throttled CPU time is actually free, I’ll assume that mining coins still isn’t worth it, even if memory pricing is small.
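
The benchmark itself is nothing fancy. Something along these lines will do; this is a sketch rather than my exact code, and the inner loop size is an arbitrary assumption, so don’t expect it to reproduce the numbers above. It measures wall-clock time on purpose, because that is what throttling stretches.

```python
import math
import random
import time


def burn(iterations=10_000, inner=200):
    """Run some arbitrary floating-point math and return the elapsed wall-clock seconds."""
    rng = random.Random(42)  # fixed seed so every run does the same work
    start = time.perf_counter()
    acc = 0.0
    for _ in range(iterations):
        for _ in range(inner):
            acc += math.sqrt(rng.random()) * math.sin(rng.random())
    return time.perf_counter() - start


# Slowdown factor from the measurements above: inactive time divided by active time.
print(150 / 2.5)  # 60.0
```

To compare both states, call burn() once from within a request handler (active) and once from a background thread while no request is in flight (inactive).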

Limitations

Source code: Github only
Images: ECR only, but ECR Public is supported
Managed runtimes: Python and Node only
No VPC support (therefore no EFS support either)
No support for IPv6, of course
No support for secrets (Secrets Manager / Parameter Store)
No support for ECS Exec
No support for WAF
No support for ARM (amd64 / “Intel” only)
No support for HTTP/2
HTTP/1.0 is not supported
Backend: no HTTPS support between App Runner proxies and the application – which is perfectly fine with me, because nobody needs that within a VPC, really (even if it’s a VPC in an AWS-owned account)
Frontend: TLS version 1.2 only
Frontend: plain HTTP is not supported: port 80 is blocked (will time out) – I think a forced redirect to HTTPS would have made more sense
Scaling based on concurrent connections only

Many of those items are already on the roadmap.

Premature Launch

What follows is a “grab bag” of nuisances. App Runner is a mere month old now, but still, it feels surprisingly unpolished. Maybe it would have been worth delaying the launch a bit.

That being said, these are all minor issues and I’m sure they will be resolved quickly.

Initial setup of Github causes a pop-up window – it took me a minute or two until I realized this (Safari is very good at hiding blocked pop-ups)
Maybe it’s because I’m a Github noob, but this sentence confused me just as much: “Choose from locations where a GitHub app is already installed or install another elsewhere in your account.” (I didn’t even know Github had “apps”)
The lack of CodeCommit support at launch is… an interesting statement
Completing the Github connection gave me this meaningful error: null is not an object (evaluating 'Ke.postMessage') – but, as I figured out later, it was successful anyway
There is a message “Deleting foo” when you click refresh (that was a new one for me)
The application setup page needs a clear hint that auto-deploy costs a very silly $1/month – or rather, that silly $1 needs to go away entirely
A build command is required by the form, even though it’s not needed in many scenarios
All names need to be at least four characters long – I get upset when I can’t name my things “foo”
Healthcheck claims to be TCP, when it’s actually HTTP
Healthcheck interval has a vague definition – description: “interval can’t be less than the timeout”, yet the error message says “must be greater than timeout”
Healthcheck timeout has some funkiness too – my test application does sleep(2), but a timeout of 3 caused it to fail; setting it to 4 worked
The log is displayed upside down (newest lines first)… I mean… what?
When creating the service fails deployment, the log claims to roll back – to what?
When creating the service fails deployment, it cannot be updated and has to be deleted
An obvious no-win-situation like command not found when starting the container should fail immediately, but App Runner keeps trying for minutes
Generally, any kind of screwed-up deployment will incur a five- to 20-minute penalty, often with no helpful log output whatsoever – having to wait “forever” and having virtually no insight into what’s happening is highly frustrating
Even better is this:
```
12:05:27 [AppRunner] Service resume started.
12:05:27 [AppRunner] Service status is set to OPERATION_IN_PROGRESS.
12:22:34 [AppRunner] Service resume failed.
12:22:35 [AppRunner] Service status is set to PAUSED.
```
When the application is paused and won’t come up healthy, you can only delete it – this means a new URL is generated and the custom domain’s DNS records have to be updated, potentially running into caching issues (depending on the DNS TTL). And yes, that’s all the log there is. Note the timestamps…
Updating a service that is in the PAUSED state will give this error: Service status must be RUNNING or PAUSED to be updated
Deployment is slooooow, even when successful
Initially my application’s stdout wouldn’t show up as log for an hour or so, but later, everything worked as expected – and I couldn’t reproduce that, so maybe it was just my fat fingers
Feature request: Make the built container available in my ECR (as an export-style option)
Feature request: Generate apprunner.yaml for a manually configured service
When a healthcheck suddenly fails, the container is correctly restarted – but there is no logging about that whatsoever
Custom domain: Only CNAME is supported, no Alias (just give me the Hosted Zone ID…?)
Custom domain: This needs an ACM-style button “this zone lives in Route53, just create the records for me”
The very last log entry of a deleted service is Service status is set to OPERATION_IN_PROGRESS (no log output to confirm deletion).
When re-creating a service with the same name, Cloudwatch metrics will from then on be labelled with the new service-id for ActiveInstances and labelled with the old service-id for all request metrics

Apparently, even pausing a service can fail:

06-19-2021 03:54 PM [AppRunner] Service pause started.
06-19-2021 03:54 PM [AppRunner] Service status is set to OPERATION_IN_PROGRESS.
06-19-2021 04:12 PM [AppRunner] Service pause failed.

And again note the timestamps – 18 minutes and no reason given. Also, here’s the obligatory xkcd about ISO 8601.

Pricing

Official pricing information is here.

Sadly, the smallest possible configuration is 2 GB RAM / 1 vCPU, which puts the monthly cost between $10.50 and $58.50 (depending on the amount of idle time).
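
Here is how that range comes together, as a small sketch. The hourly rates and the 750-hour month are assumptions on my part, chosen because they reproduce both figures; check the pricing page for the authoritative numbers. Build and auto-deploy fees are left out.

```python
VCPU_HOUR_USD = 0.064   # assumed rate per vCPU-hour, charged only while active
GB_HOUR_USD = 0.007     # assumed rate per GB-hour, charged whenever the service is provisioned
HOURS_PER_MONTH = 750   # assumed month length


def monthly_cost(vcpus, memory_gb, active_fraction):
    """Estimate the monthly USD cost of one instance that is active for the given share of time."""
    memory = memory_gb * GB_HOUR_USD * HOURS_PER_MONTH
    cpu = vcpus * VCPU_HOUR_USD * HOURS_PER_MONTH * active_fraction
    return round(memory + cpu, 2)


print(monthly_cost(1, 2, 0.0))  # 10.5  (always idle)
print(monthly_cost(1, 2, 1.0))  # 58.5  (always busy)
print(monthly_cost(1, 2, 0.1))  # 15.3  (busy 10% of the time)
```

The memory part is the fixed floor, which is why smaller memory configurations would make such a difference for mostly idle services.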

There really need to be configurations with less RAM.

Saturday Morning Coffee Hacking

And now for a small detour that happened after finishing this article:

If you paid close attention to App Runner’s build output, you’ll have noticed that the build command is executed twice: In the build environment and as part of the Dockerfile. I can only assume that this maps to the pre-build and build commands of the configuration file. Anyway, this reminded me that I should poke around in the build environment, too.

This is the build environment’s STS get-caller-identity:

[Build] {
[Build]     "UserId": "AROA2W2P3O4YHNMOKQJK7:AWSCodeBuild-2834f7f3-ca90-4f78-b630-0183887718a0",
[Build]     "Account": "736217495344",
[Build]     "Arn": "arn:aws:sts::736217495344:assumed-role/bullet-system-build-role-6b4120addb68419590ea47b94506ed43/AWSCodeBuild-2834f7f3-ca90-4f78-b630-0183887718a0"
[Build] }

Some interesting environment variables:

[Build] BULLET_IMAGE_REPO=736217495344.dkr.ecr.eu-west-1.amazonaws.com/image-repo-6b4120addb68419590ea47b94506ed43:latest
[Build] AWS_CONTAINER_CREDENTIALS_RELATIVE_URI=/v2/credentials/739b1a1e-ff7d-42b0-9b8c-05b07d3cc414
[Build] N_SRC_DIR=/n
[Build] CODEBUILD_BUILD_URL=https://eu-west-1.console.aws.amazon.com/codebuild/home?region=eu-west-1#/builds/Bufkpuli2wgb7b:96d122fd-7ae9-443f-b021-4a439e4a62fb/view/new
[Build] AWS_EXECUTION_ENV=AWS_ECS_EC2
[Build] BULLET_ASSETS_ACCOUNT_ID=172267607524
[Build] BULLET_AWS_ACCOUNT_ID=736217495344
[Build] CODEBUILD_SOURCE_REPO_URL=fusion-source-prod-eu-west-1-cell18/ufkpuli2wgb7b/d173d836d33a414595e5e5f316ad5e88/app

The ECS /task metadata didn’t reveal anything interesting; as far as I can tell, it’s just a standard CodeBuild environment. Though the image’s tag caught my eye:

      "Image": "570169269855.dkr.ecr.eu-west-1.amazonaws.com/codefactory-eu-west-1-prod-default-amazonlinux2-x86_64-standard:3.0-thirdwave"

Interesting that they are using ECS-EC2 instead of Fargate.

Retrieving the container credentials works too, as documented, but refuses to give me the complete credentials: "SecretAccessKey":"***". As it turns out, this is just a clever filter in the log output (in Cloudwatch maybe?) – piping it through base64 makes it possible to liberate the full credentials and use them on any EC2 instance.
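
Why does base64 get past the filter? My assumption is that the masking simply replaces literal occurrences of the secret in the log stream. The following sketch illustrates that idea with a dummy value; the mask function is a stand-in I wrote for the illustration, not the real filter.

```python
import base64


def mask(line, secrets):
    """Naive log filter: replace every literal occurrence of a known secret with ***."""
    for secret in secrets:
        line = line.replace(secret, "***")
    return line


secret = "wJalrEXAMPLEKEY"  # dummy value, not a real credential
plain = f'"SecretAccessKey":"{secret}"'
encoded = base64.b64encode(plain.encode()).decode()

print(mask(plain, [secret]))                        # "SecretAccessKey":"***"
print(mask(encoded, [secret]) == encoded)           # True: the filter finds nothing to replace
print(base64.b64decode(encoded).decode() == plain)  # True: the original is recoverable
```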

This allowed me to conveniently get an ECR authorization token and docker-pull the base image, as well as the image that App Runner creates for me:

[ec2-user@ip-10-0-0-230 ~]$ docker pull 172267607524.dkr.ecr.eu-west-1.amazonaws.com/awsfusionruntime-python3:3.8.5
3.8.5: Pulling from awsfusionruntime-python3
a1534962cb01: Pull complete 
30101d64442c: Pull complete 
Digest: sha256:06cffb6a687d8ca78c477e3de301b2a57650080fefa43259cdbf74308e5ce87f
Status: Downloaded newer image for 172267607524.dkr.ecr.eu-west-1.amazonaws.com/awsfusionruntime-python3:3.8.5
172267607524.dkr.ecr.eu-west-1.amazonaws.com/awsfusionruntime-python3:3.8.5

(same for 736217495344.dkr.ecr.eu-west-1.amazonaws.com/image-repo-6b4120addb68419590ea47b94506ed43)

I couldn’t find much else I can do with this role’s credentials. Neither was it possible to pull aws-fargate-request-proxy nor did the role have any permissions that it shouldn’t have, as far as I could tell from trying a few harmless things. So, while interesting, everything looks sane here.

Conclusion

Conceptually, App Runner is great – it’s very simple, yet fits a lot of simple web applications perfectly. Also the billing model is a huge step in the right direction to reduce CPU time waste.

I’m really curious to see how App Runner will evolve – especially how the team will fight off feature creep. Some additions will be necessary (VPC support for example), but keeping the service simple will be very tough. If they add everything someone might be wishing for, we’ll end up where we started before App Runner came along.
