Simpler.Grants.gov Public Wiki

Production networking long term state




  • Status: Active

  • Last Modified: 2024-02-01

  • Related Issue: #1051

  • Author: Kai Siren

  • Deciders: Aaron Couch, Alsia Plybeah, James Bursa

Context and Problem Statement

We need to decide now on the long-term state of our production networking. Do we want to leave it on the non-ideal "default" VPC, or try to move it to a "prod" VPC? The "prod" VPC already contains all of our ideal networking changes, whereas the "default" VPC needs them to be backfilled. Moving our application to the "prod" VPC is a complex action, and would likely require downtime.

State of The World

Before State

Prior to work on #1051, there was one VPC. Think of a VPC as an isolated container of resources, where the isolation is enforced by a network boundary. Since there was one VPC, there was essentially no network isolation between dev and prod. Isolation was enforced only by the application choosing which things to deploy where, e.g. via data selectors, SSM parameters, or otherwise. It was easy to get this wrong, and in fact the author has done so already. Without the network boundary to enforce isolation between resources, you could very easily:

  • Accidentally set the production application to write to the dev database

  • Deploy an application in such a way that it's reading from / writing to every stage at once (e.g. dev / staging / prod)

...and other similar mishaps. Having multiple VPCs guards against some of these mistakes, at least in that it makes those mistakes harder to make.

Additionally, not having a non-default network layer inhibits testing. If only a "default" network layer exists, and you must test a networking change, then your only recourse is to test it in prod.
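The first mishap listed above (a prod application pointed at the dev database) can be illustrated with a very simplified model of a network boundary. This is only a sketch: the CIDR ranges and the `is_reachable` helper are hypothetical, the real ranges are not given in this ADR, and real VPC isolation is enforced by routing and security rules rather than by an address check.

```python
import ipaddress

# Hypothetical CIDR ranges, one per VPC. These values are assumptions
# made for illustration; they do not come from this ADR.
VPC_CIDRS = {
    "dev": ipaddress.ip_network("10.0.0.0/16"),
    "staging": ipaddress.ip_network("10.1.0.0/16"),
    "prod": ipaddress.ip_network("10.2.0.0/16"),
}


def is_reachable(app_env: str, target_ip: str) -> bool:
    """Return True if target_ip lies inside the VPC that app_env is deployed into."""
    return ipaddress.ip_address(target_ip) in VPC_CIDRS[app_env]


# A prod application that was pointed at a dev database address by mistake.
dev_database_ip = "10.0.12.34"
print(is_reachable("prod", dev_database_ip))  # False: the boundary blocks the mistake
print(is_reachable("dev", dev_database_ip))   # True: same VPC
```

With a single shared VPC, every lookup in this model would succeed, which is why the mistake was so easy to make.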

Current State

  • Our prod application is using the "default" VPC

  • Our staging application is using the "staging" VPC

  • Our dev application is using the "dev" VPC

This is a step in the right direction, and it solves a lot of the issues of our previous state. Applications are deployed into separate, network-bounded containers, and there's an available test bed for networking changes. So then, what's the issue?

Namely, the migration is incomplete. If you look at the names, you'll notice that prod is still using the "default" VPC. If you know how our terraform / AWS setup works, you'll probably have a good guess of what is going on here. What is happening is that a "prod" VPC was created, but migrating to it requires downtime. Specifically, it requires redeploying the load balancer. This is incredibly problematic, as the load balancer's URL is hardcoded in the DNS record for our domain name. So some amount of downtime is required if we want to use an application that's connected to the new VPC.

Ideal State

It's worth mentioning the ideal state here. The ideal state is one where Simpler is synced with the Nava template and uses the non-default VPC everywhere. Nothing will be using the default VPC; in fact, it will be deleted.

The current state differs from the ideal state in that the ideal state is much easier to reason about. In the ideal world, you test your changes in dev, you promote them to prod, and everything is simple. In the current world, you have to test your theoretical changes in dev, and then backfill a similar change into the default VPC. The private subnets were backfilled into the default VPC in this manner. Doing that required advanced knowledge of terraform imports and conditions, which wouldn't be required in the ideal state. In the ideal state, the networking is simply the same everywhere.
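One way to picture "the networking is simply the same everywhere" is as a drift check against a shared template. The sketch below is purely illustrative: the setting names, their values, and the `drift` helper are all hypothetical and are not read from our terraform or from the Nava template. The only detail taken from this ADR is that private subnets have already been backfilled into the default VPC.

```python
# Hypothetical network settings. Apart from private subnets being present in
# the default VPC, every value here is an assumption for illustration.
TEMPLATE = {"private_subnets": True, "nat_gateway": True, "vpc_endpoints": True}

ENVIRONMENTS = {
    "dev": {"private_subnets": True, "nat_gateway": True, "vpc_endpoints": True},
    "staging": {"private_subnets": True, "nat_gateway": True, "vpc_endpoints": True},
    # "default" is where prod currently lives; each change must be backfilled by hand.
    "default": {"private_subnets": True, "nat_gateway": False, "vpc_endpoints": False},
}


def drift(template: dict, actual: dict) -> dict:
    """Return {setting: (expected, found)} for every setting that differs from the template."""
    return {
        key: (expected, actual.get(key))
        for key, expected in template.items()
        if actual.get(key) != expected
    }


for name, settings in ENVIRONMENTS.items():
    print(name, drift(TEMPLATE, settings))
```

In the ideal state every environment would report an empty drift, so promoting a change from dev to prod would not need any special-case backfill.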

Decisions

There are two decisions to make here:

  • is the ideal state valuable enough to migrate to?

  • if so, how should we do that migration?

Decision 1 - Should We Migrate?

This is a binary decision: should we migrate or not? Do our infrastructure engineers want to deal with a state, long term, where promoting any networking change from dev to prod (i.e. to "default") is complex? Do we still want to be dealing with that complexity 6 months, 2 years, 5 years from now? Given that we currently are a static site, and we are soon going to have data, this is a "now or never" decision.

Similarly, do our stakeholders feel like now is a good time to incur some background infrastructure work for the sake of making our future lives easier?

Decision 2 - How Should We Migrate?

There are functionally 2 good options here. The tradeoff is lead time versus downtime. Specifically:

2a. Plan The Load Balancer And DNS Change On The Same Day

Nava and MicroHealth coordinate a day on which the new prod VPC load balancer will be created. Nava lets MicroHealth know that at some point during that day Nava will have a new load balancer URL. MicroHealth is responsible for receiving and deploying the load balancer URL on the same day, thus minimizing downtime. If possible, this should happen during the course of a 90-minute meeting, which would limit the downtime to that same 90-minute window.

This option:

  • Is an efficient use of both Nava and MicroHealth's time, but requires pre-planning.

  • Has a rather long downtime window of about 90 minutes max.

2b. Prepare A New Prod Load Balancer, Wait For DNS Changeover

Nava prepares a new production load balancer URL inside the prod VPC by deploying a totally new production application. This application needs a name, let's say prod-beta. prod-beta can be deployed at any time, and MicroHealth is then free to update the DNS record on their own time.

This option:

  • Requires the most prep work from Nava, and the new application would need to be permanently named prod-beta (or whatever we name it). Other name suggestions follow:

    • prod-alpha (It just can't be called prod)

    • live

    • prod-0

  • The application would have no downtime.
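The downtime difference between the two options can be sketched with a small model. It assumes (this is an inference, not something the ADR states outright) that in option 2b the existing load balancer stays up until after the DNS record has moved. The `downtime_minutes` helper and the day 3 / day 4 timings are hypothetical, and the model ignores DNS TTL and caching, which would lengthen a real outage.

```python
def downtime_minutes(old_lb_removed_at: int, dns_updated_at: int) -> int:
    """Minutes during which DNS still points at a load balancer that no longer exists.

    Both arguments are minutes since the start of the migration.
    """
    return max(0, dns_updated_at - old_lb_removed_at)


# Option 2a: the old load balancer is replaced at the start of the meeting and
# DNS is updated, at the latest, when the 90-minute meeting ends.
worst_case_2a = downtime_minutes(old_lb_removed_at=0, dns_updated_at=90)

# Option 2b: DNS moves first (hypothetically on day 3) and the old load
# balancer is only removed afterwards (hypothetically on day 4).
case_2b = downtime_minutes(old_lb_removed_at=4 * 24 * 60, dns_updated_at=3 * 24 * 60)

print(worst_case_2a)  # 90
print(case_2b)        # 0
```

The cost of option 2b does not show up in this model: it is paid in lead time and in the permanent second application name.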

Decision State 2024-01-31

After some time and feedback from Alsia Plybeah, the current plan is to go with option 2b, with a new application called prod-live. This is meant to incorporate two conclusions:

  • The fact that we will eventually want a prod backup environment, and that we could call that environment prod-backup and place it in a different region.

  • Learnings from other collaborations between Nava and MicroHealth, with regards to the ability to coordinate technical projects between the two organizations.

Decision State 2024-02-01

Given 2 pieces of feedback, the plan is now to do option 2a. That is, the "create the new load balancer on the same day" option. This requires more coordination, which will happen outside the direct scope of this ADR.

Work has already been done on #1051 to move us into the multi-VPC world. The state we are presently in is the one listed under Current State above.