Production networking long term state
Status: Active
Last Modified: 2024-02-01
Related Issue: #1051
Author: Kai Siren
Deciders: Aaron Couch, Alsia Plybeah, James Bursa
Context and Problem Statement
We need to decide now on the long term state we want to leave our production networking in. Do we want to leave it on the non-ideal "default" VPC, or try to move it to a "prod" VPC? The "prod" VPC contains all of our ideal networking changes, whereas the "default" VPC needs them to be backfilled in. Moving our application to the "prod" VPC is a complex action, and would likely require downtime.
State of The World
Before State
Prior to work on #1051, there was one VPC. Think of a VPC as an isolated container of resources, where the isolation is enforced by a network boundary. Since there was one VPC, there was essentially no network isolation between dev and prod. Isolation was instead enforced by the application always choosing which things to deploy where, e.g. via data selectors, SSM parameters, or otherwise. It was easy to get this wrong, and in fact the author has done so already. Without a network boundary to enforce isolation between resources, you could very easily:
Accidentally set the production application to write to the dev database
Deploy an application in such a way that it's reading / writing from every stage at once (eg. dev / staging / prod)
...and other similar mishaps. Having multiple VPCs guards against some of these mistakes, at least in that it makes those mistakes harder to make.
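The difference a network boundary makes can be sketched in a few lines of Python. This is a minimal illustration, not part of our infrastructure code: the `Resource` class, the `cross_vpc_references` function, and the resource names are all hypothetical. In reality the VPC boundary itself blocks the traffic; the check below only models that.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    """A deployed resource and the VPC it lives in (hypothetical model)."""
    name: str
    vpc: str


def cross_vpc_references(app_vpc: str, resources: list[Resource]) -> list[str]:
    """Return the names of resources that sit outside the application's VPC."""
    return [r.name for r in resources if r.vpc != app_vpc]


# Before state: everything shares one VPC, so the misconfiguration is invisible.
single_vpc = [Resource("prod-app", "default"), Resource("dev-database", "default")]
print(cross_vpc_references("default", single_vpc))  # []

# Multi-VPC state: the same mistake now crosses a boundary and can be caught.
multi_vpc = [Resource("prod-app", "prod"), Resource("dev-database", "dev")]
print(cross_vpc_references("prod", multi_vpc))  # ['dev-database']
```

With a single VPC there is nothing for such a check to compare against, which is why the mistakes listed above went unnoticed so easily.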
Additionally, not having a non-default network layer inhibits testing. If only a "default" network layer exists, and you must test a networking change, then your only recourse is to test it in prod.
Current State
Work has already been done on #1051 to move us into the multi-VPC world. Presently, we are in a state where:
Our prod application is using the "default" VPC
Our staging application is using the "staging" VPC
Our dev application is using the "dev" VPC
This is a step in the right direction, and it solves many of the issues of our previous state. Applications are deployed into separate, network-bounded containers, and there's an available test bed for testing networking changes. So then, what's the issue?
Namely, the migration is incomplete. How so? If you look at the names, you'll notice that prod is still using the "default" VPC. If you know how our terraform / AWS setup works, you'll probably have a good guess of what is going on here. What is happening is that a "prod" VPC was created, but migrating to it requires downtime. Specifically, it requires redeploying the load balancer. This is incredibly problematic, as the load balancer's URL is hardcoded in our DNS record. So some amount of downtime is required if we want to use an application that's connected to the new VPC.
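The naming mismatch described above can be spotted mechanically. The sketch below assumes, as the names in this section suggest, that each environment's intended VPC shares the environment's name; the `CURRENT_STATE` mapping and the function name are illustrative only.

```python
# Environment -> VPC mapping as described in the "Current State" list.
CURRENT_STATE = {"prod": "default", "staging": "staging", "dev": "dev"}


def environments_pending_migration(state: dict[str, str]) -> list[str]:
    """Return environments whose VPC is not the one named after them."""
    return sorted(env for env, vpc in state.items() if vpc != env)


print(environments_pending_migration(CURRENT_STATE))  # ['prod']
```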
Ideal State
It's worth mentioning the ideal state here. The ideal state is one where Simpler is synced with the Nava template and uses the non-default VPC everywhere. Nothing will be using the default VPC; in fact, it will be deleted.
The current state differs from the ideal state in that the ideal state is much easier to reason about. In the ideal world, you test your changes in dev, you promote them to prod, and everything is simple. In the current world, you have to test your theoretical changes in dev, and then backfill a similar type of change into the default VPC. The private subnets were backfilled into the default VPC in this manner. Doing that required high-level knowledge of terraform imports and conditions, whereas that knowledge wouldn't be required in the ideal state. In the ideal state, the networking is simply the same, everywhere.
Decisions
There are two decisions to make here:
Is the ideal state valuable enough to migrate to?
If so, how should we do that migration?
Decision 1 - Should We Migrate?
This is a binary decision: should we migrate or not? Do our infrastructure engineers want to deal with a state, long term, where promoting any networking change from dev to prod (i.e. to "default") is complex? Do we still want to be dealing with that complexity 6 months, 2 years, 5 years from now? Given that we are currently a static site and are soon going to have data, this is a "now or never" decision.
Similarly, do our stakeholders feel that now is a good time to incur some background infrastructure work for the sake of making our future lives easier?
Decision 2 - How Should We Migrate?
There are functionally two good options here. The tradeoff is lead time versus downtime. Specifically:
2a. Plan The Load Balancer And DNS Change On The Same Day
Nava and MicroHealth coordinate a day where the new prod VPC load balancer will be created. Nava lets MicroHealth know that at some point in that day Nava will have a new load balancer URL. MicroHealth is responsible for receiving and deploying the load balancer URL on the same day, thus minimizing downtime. This should be able to happen during the course of a 90 minute meeting, if possible. That would limit the downtime to that same 90 minute window.
This option:
Is an efficient use of both Nava and MicroHealth's time, but requires pre-planning.
Has a rather long downtime window of about 90 minutes max.
2b. Prepare A New Prod Load Balancer, Wait For DNS Changeover
Nava prepares a new production load balancer URL inside the prod VPC by deploying a totally new production application. This application needs a name, let's say prod-beta. prod-beta can be deployed at any time, and MicroHealth is then free to update the DNS record on their own time.
Requires the most prep work from Nava, and the new application would need to be permanently named prod-beta (or whatever we name it). Other name suggestions follow:
prod-alpha (It just can't be called prod)
live
prod-0
The application would have no downtime.
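The downtime tradeoff between the two options comes down to the gap between the old load balancer going away and the DNS record being updated. The sketch below is a simplified model under stated assumptions: the function name and the timings for option 2b are hypothetical, and DNS propagation delay is ignored.

```python
def downtime_minutes(old_lb_removed_at: int, dns_updated_at: int) -> int:
    """Minutes during which DNS still points at a load balancer that is gone.

    Both arguments are minutes since the start of the migration.
    DNS propagation delay (TTL) is deliberately ignored in this sketch.
    """
    return max(0, dns_updated_at - old_lb_removed_at)


# Option 2a: the old load balancer is replaced at the start of the meeting,
# and the DNS record is updated by the end of it, at worst 90 minutes later.
print(downtime_minutes(old_lb_removed_at=0, dns_updated_at=90))  # 90

# Option 2b: the old load balancer keeps serving until after the DNS change
# (the 7-day and 8-day offsets are arbitrary illustrative values).
week = 7 * 24 * 60
print(downtime_minutes(old_lb_removed_at=week + 24 * 60, dns_updated_at=week))  # 0
```

In this model, option 2b reaches zero downtime only because the old load balancer stays up until after the DNS change, which is what the extra prep work buys.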
Decision State 2024-01-31
2024-01-31
After some time and feedback from Alsia Plybeah, the current plan is to go with option 2b, with a new application called prod-live. This is meant to incorporate two conclusions:
The fact that we will eventually want a prod backup environment, and that we could call that environment prod-backup and place it in a different region.
Learnings from other collaborations between Nava and MicroHealth, with regard to the ability to coordinate technical projects between the two organizations.
Decision State 2024-02-01
2024-02-01
Given 2 pieces of feedback, the plan is now to do option 2a. That is, the "create the new load balancer on the same day" option. This requires more coordination, which will happen outside the direct scope of this ADR.