Friday, December 07, 2018

Testing and using AWS EKS #kubernetes - findings


I have been working in a team that has used Kubernetes in production (not the nginx example - the real shit) for 2 years now. I have configured and used Kubernetes clusters from version 1.4.x with tools like kube-aws to 1.6-1.7 configured with kops. Amazon's EKS is the third breed of Kubernetes provisioning solution that I have had the chance to try, and this post is about my recent week-long experience of trying to bring a production-level EKS cluster to life and checking whether it would cut it for our production needs.

This post would not have been possible without the contribution and hard work of my colleague JV - thanks!!! 

EKS basics

For those who seek an executive summary of EKS: it's an AWS managed service (like, for example, Amazon ElastiCache). Amazon provisions, updates and patches the brains of your cluster, aka the control plane + etcd. There is a flat rate for the masters plus standard EC2 billing for your worker fleet. AWS also provides a custom networking layer, eliminating the need for an additional overlay network solution like the one you would use if you created the cluster on your own. You are responsible for provisioning and attaching the worker nodes; AWS provides CloudFormation templates with pre-configured workers. You are also responsible for installing on top of the cluster all the other services or applications that your platform needs, e.g. how to collect logs, how to scrape metrics, other specific daemons etc. Also note that once the cluster is up there is nothing AWS specific; you get a vanilla experience (the exception is the networking plugin).

How do I start?

There are a couple of options for spinning up an EKS cluster:
  1. The infamous click click on the dashboard (it's good if you want to play, but it is not production ready, i.e. not something you can re-provision and test repeatedly).
  2. Go through the official EKS installation guide using command line tools like aws eks etc. It's a good option, especially if you love the AWS command line tooling.
  3. Use third-party command line tools that offer extra functionality behind the scenes, namely things like eksctl. It's a very promising tool by the way.
  4. Terraform all the things!
    1. Follow the official guide here.
    2. Or use samples like this interesting module, see here.
Although slightly unrelated to the 4 points above, don't forget to bookmark and read the eksworkshop. It is one of the best written getting started guides I have seen lately - many thanks to

We started the PoC with option 4.1, so we used the official Terraform guide (thank you HashiCorp), and the worker provisioning was terraformed as well, so we did not keep the standard CloudFormation template from AWS. As you can understand, the tool of choice is sometimes dictated by the available skills and experience within the team. In general we love Terraform (especially us developers).

Other things to consider before I start?

So, as we discovered (and of course it was very well documented), an EKS cluster, due to the networking features that it brings (more on this later), really shines when it occupies its own VPC! It's not that you cannot spin up an EKS cluster on your existing VPCs, but make sure you have enough free IPs and ranges available, since by default the cluster, and specifically the workers, will start eating your IPs. No, this is not a bug, it's a feature, and it actually makes a lot of sense. It is one of the things that I really loved about EKS.
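
To get a feeling for how quickly a shared subnet can run dry, here is a rough back-of-the-envelope sketch. It assumes the 5 addresses AWS reserves in every VPC subnet and takes the number of IPs each worker claims up front as an input (10 below, matching what we saw on our workers). The CIDR and the count of addresses already in use are made-up values.

```python
import ipaddress

# AWS reserves 5 addresses in every VPC subnet (the first four and the last one).
AWS_RESERVED_PER_SUBNET = 5


def usable_ips(cidr: str) -> int:
    """Number of addresses in a subnet that instances and pods can actually use."""
    return ipaddress.ip_network(cidr).num_addresses - AWS_RESERVED_PER_SUBNET


def workers_that_fit(cidr: str, ips_per_worker: int, already_used: int = 0) -> int:
    """How many workers fit if each one claims `ips_per_worker` addresses up front."""
    free = usable_ips(cidr) - already_used
    return max(free // ips_per_worker, 0)


if __name__ == "__main__":
    # Hypothetical shared subnet: a /24 where 150 addresses are already taken.
    print(usable_ips("10.0.1.0/24"))
    print(workers_that_fit("10.0.1.0/24", ips_per_worker=10, already_used=150))
```

The numbers are only as good as the inputs, so check what your own workers really pre-allocate before trusting the result.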

First milestone - spin the masters and attach workers

The first and most important step is to spin up your masters and then provision your workers. Once the workers are accepted and join the cluster, you more or less have the core ready. Spinning up just the masters (as many articles out there feature) is like 50% of the work. Once you can create an auto-scaling group where your workers are created and then added to the cluster, you are very close to the real thing.

Coming back to the Pod Networking feature

If you have ever provisioned a Kubernetes cluster on AWS using tools like kops or kube-aws, then you have most probably already installed or even configured the overlay network plugin that provides pod networking in your clusters. As you know, pods have IPs, and overlay networks on a Kubernetes cluster provide this abstraction (see Calico, flannel etc). On an EKS cluster, by default you don't get this overlay layer. Amazon has actually managed to bridge the pod networking world (Kubernetes networking) with its native AWS networking. In plain words, your pods (apps) within a cluster get a real VPC IP. When I heard about this almost a year ago, I have to admit I was not sure about it at all; after some challenges and failures, I started to appreciate simplicity on the networking layer for any Kubernetes cluster on top of AWS. In other words, if you can remove one layer of abstraction because your cloud can natively take over, why keep an extra layer of networking and hops when you can have the real thing?

But the workers pre-allocate so many IPs

To optimize pod placement on the workers, EKS uses the underlying EC2 capabilities to reserve IPs on the worker's ENIs. So when you spin up a worker, even if no pods or daemons have been allocated to it, you can see on the dashboard that it has already pre-allocated a pool of IPs, 10 or some other number depending on the instance class and size. If you happen to operate your cluster on a VPC with other 'residents', your EKS cluster can be considered a threat! One way to keep the benefits of AWS CNI networking but make some room on VPCs that are running out of free IPs is to configure, after bringing up the masters, the 'aws-node' daemon set. This is an AWS-specific daemon, part of the EKS magic that makes all this happen. See here for a similar issue. So just

kubectl edit daemonset aws-node -n kube-system

and add the `WARM_IP_TARGET` environment variable, set to something smaller.

Note that, as we discovered, setting WARM_IP_TARGET to something smaller does not limit the capacity of your worker to host more pods. If your worker has no warm IPs left to offer to newly created and allocated pods, it will request new ones from the networking pool.
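
The arithmetic behind this can be sketched roughly as follows. It is a simplified model and not the CNI plugin's actual code: it assumes the worker tries to keep WARM_IP_TARGET spare addresses on top of the ones its pods use, up to what its ENIs can hold. The pod ceiling formula is the commonly documented one for the AWS VPC CNI, and the 3 ENIs with 10 addresses each used below are an example input (what an m5.large offers), so plug in the limits of your own instance type.

```python
def max_pods(enis: int, ips_per_eni: int) -> int:
    """Pod ceiling for an instance type with the AWS VPC CNI.

    Each ENI keeps its primary address for itself, and two host-network
    pods (aws-node and kube-proxy) do not need a VPC IP of their own.
    """
    return enis * (ips_per_eni - 1) + 2


def ips_held(pods_with_ip: int, warm_ip_target: int, enis: int, ips_per_eni: int) -> int:
    """Simplified model: secondary IPs a worker holds for a given WARM_IP_TARGET."""
    ceiling = enis * (ips_per_eni - 1)  # secondary addresses the ENIs can carry
    return min(pods_with_ip + warm_ip_target, ceiling)


if __name__ == "__main__":
    print(max_pods(enis=3, ips_per_eni=10))
    # An idle worker with WARM_IP_TARGET=2 holds only a couple of spare IPs...
    print(ips_held(pods_with_ip=0, warm_ip_target=2, enis=3, ips_per_eni=10))
    # ...and the pool simply grows as pods arrive, up to the ENI limits.
    print(ips_held(pods_with_ip=20, warm_ip_target=2, enis=3, ips_per_eni=10))
```

In this model the warm target only changes how many spare addresses sit idle; the ceiling on pods per worker stays the same.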

If even this workaround is not enough, there is always the option to switch on Calico on top of the cluster. See here. Personally, after seeing CNI in action, I would prefer to stick with it. After 2 years of networking errors, I think I can trust AWS networking more. There is also the maintenance and troubleshooting side of things. Overlay networking is not rocket science, but at the same time it is not something that you want to spend time and energy troubleshooting, especially if you are not full of people with these skills! Also, the more complex your AWS networking setup is, the harder it becomes to find issues when packets jump from the Kubernetes world to your AWS layer and vice versa. It is always up to the team and the people making the decisions to choose the support model that they think fits their team, or to assess the capacity of the team to provide real support on challenging occasions.

What else did you like? - the aws-iam-authenticator

Apart from appreciating the simplicity of CNI, I found the integration of EKS with the existing IAM infrastructure very straightforward. You can use the corporate (even SAML-based) roles / users of your AWS account to give or restrict access to your EKS cluster(s). This is a BIG pain point for many companies out there, especially if you are an AWS shop. EKS, as just another AWS managed service, follows the same principles and provides a bridge between IAM and Kubernetes RBAC! People doing Kubernetes on AWS already know that in the early days access to the cluster and distribution of kube configs was (and still is) a very manual and tricky job, since AWS users and roles mean nothing to the Kubernetes master(s). Heptio has done a very good job with this.

What actually happens is that you install the aws-iam-authenticator and attach it to your kubectl through ~/.kube/config. Every time you issue a kubectl command, it goes through the aws-iam-authenticator, which reads your AWS credentials (~/.aws/credentials) so that they can be mapped to Kubernetes RBAC rules. So you can map AWS IAM roles or users to Kubernetes RBAC roles, or create your own RBAC rules and map them. It was the first time I used this tool and it actually works extremely well! Of course, if you run an old Kubernetes cluster with no RBAC it won't be useful, but in the EKS case RBAC is enabled by default! In your ~/.kube/config the entry will look like this (the account-specific values are omitted).

- name: arn:aws:eks:eu-west-1:
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1alpha1
      command: aws-iam-authenticator
      args:
      - token
      - -i
      env:
      - name: AWS_PROFILE

Note that, on the EKS admin side, you will need to map the IAM roles (or users) on your cluster:

kubectl edit configmap aws-auth -n kube-system

- rolearn:
  username: some-other-role
  groups:
    - system:masters # or some other Kubernetes RBAC group
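
Conceptually, the mapping boils down to a lookup from an IAM role to a Kubernetes username and a set of groups, which RBAC then takes over. The sketch below is a toy model of that idea and not the authenticator's real code. The account id, role names and the dev-readonly group are all invented for illustration.

```python
from typing import Dict, List, NamedTuple, Optional


class Identity(NamedTuple):
    username: str
    groups: List[str]


# Hypothetical mapping, mirroring the shape of the mapRoles entries in aws-auth.
MAP_ROLES: Dict[str, Identity] = {
    "arn:aws:iam::111122223333:role/eks-admins": Identity("admin", ["system:masters"]),
    "arn:aws:iam::111122223333:role/eks-developers": Identity("developer", ["dev-readonly"]),
}


def resolve(role_arn: str, mapping: Dict[str, Identity] = MAP_ROLES) -> Optional[Identity]:
    """Return the Kubernetes identity for an IAM role, or None if it is not mapped."""
    return mapping.get(role_arn)


if __name__ == "__main__":
    print(resolve("arn:aws:iam::111122223333:role/eks-admins"))
    print(resolve("arn:aws:iam::111122223333:role/somebody-else"))
```

A role that is missing from the mapping has no Kubernetes identity at all, which is why a perfectly valid AWS user can still be refused by the cluster.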

What about all the other things that you have to install?

Once the cluster is ready, i.e. you have masters and workers running, the next steps are the following; they can be done by any admin user with appropriate `kubectl` rights.

  • Install and configure Helm
  • Install and configure the aws-cluster-autoscaler, which is more or less straightforward; see here and here for references.
  • Install and configure Fluentd to push logs, e.g. to Elasticsearch.
  • Install and configure Prometheus.
  • And of course... all the things that you need or have as dependencies on your platform.

Should I use EKS?

  • If you are an AWS user and you have no plans to move away, I think it is the way to go!
  • If you are a company / team that wants to focus on business delivery and not spend a lot of energy keeping different Kubernetes clusters alive, then YES, by all means. EKS reduces your maintenance nightmares and challenges by 60-70%, based on my experience.
  • If you want to get patches and upgrades (on your masters) for free and transparently, then yes. See the latest Kubernetes security exploit and ask your friends around how many of them were pushed to ditch old clusters and start over this week (it was fun in the early days but it is not fun any more). As a user I am dreaming of easily patched clusters and auto upgrades, and not of cases like 'let's evacuate the cluster, we will build a new one!'
  • Is it locking you into a specific flavour? No, the end result is vanilla Kubernetes, and even though you might be leveraging the custom networking, this is more or less also the case when you use a similar, more advanced offering from Google (which is a more complete ready-made offering).
  • If you have second thoughts about region availability, then you should wait until Amazon offers EKS in a broad range of regions; I think this is the only limiting factor now for many potential users.
  • If you already have a big organization tightly coupled with AWS and the IAM system, EKS is a perfect fit in terms of securing your clusters and making them available to the development teams!
Overall it was a very challenging and at the same time interesting week. Trying to bring up an EKS cluster kind of pushed me to read and investigate things on the AWS ecosystem that I was ignoring in the past.

Sunday, December 02, 2018

#terraform your #fastly Service - a simple example

In the past 2.5 years, while working for Ticketmaster in London, I had the chance to use Fastly extensively (I still do), one of the best CDN and Edge Compute services out there. Fastly is not just a web cache but a holistic edge service offering. It enables integrators (users) to bring their application (web service/site) closer to their end users and at the same time offload some of the logic and decision making that every web application has to do from the main origins (your servers) to the edge (a simplified explanation of edge computing).

This post is mostly a follow-up (I know it took some time) to my short talk at Fastly's EU Altitude day here in London. You can find the video here. One of the problems we were facing at Ticketmaster was the extremely large number of services (domains) owned by different teams and parts of the company. Sometimes a website might span 3-5 different individual Fastly configs, e.g. you have the dev site, the prod site, the staging site, or a slightly experimental version of the site; on each of them you maintain a different set of origins and you want to `proxy` them through Fastly in order to test the integration end to end etc.

To cut a long story short, the number of domains (services) was getting big, very big. In the early adoption days, most teams and individuals would log in to the Fastly web console and start updating configs (like you would do with AWS years ago). Some other teams would use things like Ansible and some custom integrations. I was always overwhelmed by both approaches. I could see that having 200 different configs and people manually editing them was not going to scale; at the same time, bespoke integrations with different tools felt very complex and lacked out-of-the-box support. After some time I discovered the Fastly Terraform provider, and I think this was a milestone: we could finally treat our Fastly configs as code, push them to repos, introduce CI/CD practices and have a very good grip on who is doing what at any given time with our services.

The example
You can find the example here. My main motivation is not to show you how to do Terraform, but to provide a simple project structure (it is dead easy) and show how the Terraform provider can be applied. Also, don't worry about the specifics: this example is about having a personal domain (which I own) and then serving some static content from an S3 bucket. You don't have to do the same; the domain could be your company's or team's domain, and the backends do not have to be simple S3 buckets.

The main points of this example are:
  • Provide a sample structure to start with
  • Illustrate the basics of the Fastly terraform provider
  • Illustrate how you can add and upload custom VCL to your terraformed Fastly Service

You will need a Fastly account. Once you activate your account you will need to create a new API token, see here. This token will be used by the Terraform provider to talk to Fastly.
You will need to export this key or otherwise provide it to Terraform!
Also, you might notice that in my example I use a private S3 bucket to store my Terraform state. You don't have to do the same, especially if you plan to experiment first. So you can remove the section where I configure a backend for the Terraform provider.

In case you want to use an S3 bucket, you will need to provide AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY to Terraform in your environment.
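
If you want a quick sanity check before running terraform, something like the small sketch below can tell you which variables are still missing. It assumes the token is exported as FASTLY_API_KEY (the variable the Fastly provider reads) and that the AWS pair is only needed when the state lives in S3. The helper itself is hypothetical and not part of the example repo.

```python
import os
from typing import List, Mapping

# FASTLY_API_KEY is read by the Fastly provider; the AWS pair is only
# needed when the Terraform state is kept in an S3 backend.
FASTLY_VARS = ["FASTLY_API_KEY"]
S3_BACKEND_VARS = ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY"]


def missing_vars(env: Mapping[str, str], use_s3_backend: bool = True) -> List[str]:
    """Return the required variables that are unset or empty."""
    required = FASTLY_VARS + (S3_BACKEND_VARS if use_s3_backend else [])
    return [name for name in required if not env.get(name)]


if __name__ == "__main__":
    # Prints an empty list when everything needed is in place.
    print("missing:", missing_vars(os.environ))
```

Pass use_s3_backend=False if you removed the backend section and keep the state locally.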


In case you want to replay the example 100%, you need your own domain, a Fastly service attached to that domain, and an S3 bucket. So here is what I did:
  • I registered a domain on AWS, example :
  • I added a CNAME on my domain through the AWS console to point to Fastly (NON SSL) see here
  • I created an S3 bucket (with the appropriate policies, so I can serve some static html)

About the example

As you can see, the structure of the project is quite flat. You have a couple of Terraform files and then a subfolder with the custom VCL that we want to upload to our service. The Fastly Terraform provider is actually a mirror of the properties and settings you can configure manually through the dashboard. If you are a newcomer, I would suggest you open a browser tab on Fastly's dashboard and another one on the documentation of the provider (see here), so that you can see how each section in Terraform correlates with the real thing. It is actually OK to start experimenting with Fastly using the console (dashboard). Once you gain some experience, you can move to automating most of the configs that you had applied manually in the past.

So far, the most frequent question I get is about combining Terraform and VCL. In very advanced scenarios you will want to take advantage of Fastly's capability to be a platform where you can make smart decisions based on the incoming requests and actually offload some processing time from your origins. The more logic you add at the edge, the more you offload from your origins (but watch out not to overcomplicate things; your origins will still be the source of your business logic). Fastly can help you execute some of this web-related logic, like redirects, smart caching and load balancing, through the use of a somewhat primitive C-like language called VCL. For those with extensive experience of Varnish: we are talking about Varnish 2.x VCL code.

You attach snippets like the following to your Terraform settings:

vcl {
    name    = "Main"
    content = "${file("${path.module}/vcl/main.vcl")}"
    main    = true
}

This way you attach pieces of your VCL code to your service. Fastly will take all this code and combine it with the standard code that is executed for you per request. Here is an example from the provided main.vcl of my service, where I do a redirect to my personal blog (here) when you try to hit the url [].

sub vcl_recv {

  # if you want my blog - redirect to it!
  if (req.url.path ~ "^/blog") {
    error 666;
  }
  #FASTLY recv
}

sub vcl_error {
  if (obj.status == 666) {
    set obj.status = 307;
    set obj.http.Cache-Control = "no-cache";
    set obj.response = "Moved Temporarily";
    set obj.http.Location = "";
    return (deliver);
  }
}

This is a typical example where I leveraged Fastly to make a smart decision on an incoming request and perform a redirect without letting the request land on my origins or web servers. You can find the reference documentation of Fastly VCL here. So have a look at the example and use it as a starting point... Happy Fastly-ing!

Repo link :