Simpler.Grants.gov Public Wiki
Logging and Monitoring Platform


Last updated 1 year ago


  • Status: Active

  • Last Modified: 2024-03-04

  • Related Issue: #630

  • Deciders: Lucas and/or Billy

Context and Problem Statement

We want to decide on our long-term logging and monitoring platform. The platform should meet a wide variety of needs, and be highly usable when meeting those needs.

Decision Drivers

  • Platform UX: we want a platform that is easy to use and learn, for a variety of roles in the organization

  • Capabilities: we want a platform that satisfies a variety of production operations needs

  • Cost: we want a cost-effective platform. Note that this ADR does not attempt to calculate the prices of various platforms directly.

  • (...?)

Options

  • Cloudwatch

  • Sentry

  • Datadog

  • New Relic

  • Splunk

  • Grafana

Cloudwatch

Cloudwatch is the built-in monitoring platform that comes for free with AWS. As such, it wins on the "cost-effectiveness" decision driver. Unfortunately, Cloudwatch is only an ideal solution if you are significantly cost-constrained. As a free platform, and as one part of AWS's massive product offering, it has little motivation to keep its feature set competitive with the market. Cloudwatch gets the worst score for usability: it is challenging to find what you need, which makes it an inefficient production operations platform. On the other hand, Cloudwatch's generous free tier, and the fact that it is active in AWS by default, make it a good security and compliance platform. The recommendation is to use Cloudwatch exclusively for security and compliance purposes, and to put the money it saves toward another platform that will be more effective at actual production operations.

  • Decision Status: Not Recommended

  • Pros

    • Generous free tier

  • Cons

    • Abysmal UX


Sentry

  • Decision Status: Not Recommended (at this time)

  • Pros

    • Great UX

  • Cons

    • Not a fully featured platform

Datadog

One large advantage of Datadog is that it was built around its metrics and dashboard capabilities, and was built for API-driven applications. As such, Datadog has best-in-class dashboard functionality, which gives it a slight leg up on New Relic. Datadog can also make dashboards public, a capability we may want to use to expose our API status (4XX / 5XX rate, etc.) to the general public.
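To make the "4XX / 5XX rate" idea concrete, here is a minimal sketch of how such rates could be computed from a window of HTTP status codes. It is illustrative only: the function name and the sample data are hypothetical, and it does not use any Datadog or New Relic API. A monitoring platform would compute an equivalent figure from request logs or metrics.

```python
from collections import Counter


def status_class_rates(status_codes):
    """Return the share of responses in the 4XX and 5XX classes.

    status_codes: iterable of integer HTTP status codes (hypothetical sample
    data; a real platform would derive these from request logs or metrics).
    """
    codes = list(status_codes)
    if not codes:
        # No traffic in the window: report zero rather than dividing by zero.
        return {"4XX": 0.0, "5XX": 0.0}
    # Bucket each code by its leading digit, e.g. 404 -> 4, 503 -> 5.
    classes = Counter(code // 100 for code in codes)
    total = len(codes)
    return {"4XX": classes[4] / total, "5XX": classes[5] / total}


if __name__ == "__main__":
    # Hypothetical window of ten responses.
    sample = [200, 200, 404, 500, 201, 403, 200, 502, 200, 200]
    print(status_class_rates(sample))
```

Reporting the two classes separately is a deliberate choice in this sketch: 4XX responses usually point at client-side problems, while 5XX responses point at failures in the API itself, so a public status view would likely want to show them apart.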

Grants.gov used Datadog in the past and then decided to move to Cloudwatch, due to cost.

  • Decision Status: Top Choice

  • Pros

    • Well-known to our team

    • Fully featured platform (logs, metrics, APM)

    • Strong metrics product

    • Best for API-driven applications

  • Cons

    • Likely more expensive than its closest competitor


New Relic

New Relic has a free tier, and their pricing is less complex than Datadog's.

New Relic was built around its APM capability, so it will have better capabilities for debugging performance issues and hunting down esoteric production bugs.

  • Decision Status: Top Choice

  • Pros

    • Well-known to our team

    • Fully featured platform (logs, metrics, APM)

    • Free tier, simpler pricing than the closest competitor

    • Strong APM product


Splunk

  • Decision Status: Viable Option

  • Pros

    • Fully featured platform (logs, metrics, APM)

  • Cons

    • Platform features not designed for our use case

Grafana

Grafana is a fully open-source metrics platform that comes with other services that handle functionality like logging and APM. Deploying the full suite of Grafana Labs tooling gives you a FOSS platform that is feature-competitive with all the closed-source platforms in this ADR. The open-source nature comes with high costs to both UX and deployment, though. The UX of Grafana sits roughly halfway between Cloudwatch and Datadog. Grafana has managed options, but ultimately it is a platform built for self-hosting. As such, Grafana works best for companies that have an infrastructure team that can run an on-call rotation to support it. At the time of writing, Simpler.Grants.gov has no on-call rotation, which is a strong point against our use of a platform like Grafana.

  • Decision Status: Not Recommended (at this time)

  • Pros

    • Open-source

  • Cons

    • High maintenance and training burden

    • Mediocre UX

Decision Status

Having collected all this information, it is the opinion of the author that both New Relic and Datadog would be strong choices. New Relic is the safer option due to its less complex pricing scheme. Datadog is more likely to have niche features that we find incredibly valuable (like public dashboards). We should move forward from here with a New Relic trial, and consider an additional Datadog trial if New Relic turns out to be non-ideal in some way.

Cloudwatch: the first three social links for "Cloudwatch UX" are all about how bad it is.

Sentry is a frontend-focused APM (application performance monitoring) platform that has not (yet) expanded to become a logging and metrics platform. It focuses specifically on frontend exception and error handling, as well as performance. As Simpler.Grants.gov is an API-driven platform, Sentry is a non-ideal choice for our first production operations platform. That said, Sentry fulfills its role very effectively, and with fantastic UX, so it might be a good platform to consider if we decide down the line that we need additional monitoring coverage on the frontend. Sentry is a fairly popular product but is not yet listed by FedRAMP.

Datadog is a fully featured platform with support for logging, metrics dashboards, and APM. Several members of our team have prior experience with Datadog. Datadog excels in the market, is listed on FedRAMP, and would be an excellent choice of platform. It has fantastic UX, for both backend and frontend applications. The only "flaw" is that, at a high level, Datadog's product offering is hard to distinguish from New Relic's. The bulk of this paragraph is copied into the New Relic description, to emphasize that point.

New Relic is a fully featured platform with support for logging, metrics dashboards, and APM. Several members of our team have prior experience with New Relic. New Relic excels in the market, is listed on FedRAMP, and would be an excellent choice of platform. It has fantastic UX, for both backend and frontend applications. The only "flaw" is that, at a high level, New Relic's product offering is hard to distinguish from Datadog's. The bulk of this paragraph is copied into the Datadog description, to emphasize that point.

Splunk is a fully featured platform with support for logging, metrics dashboards, and APM. At least one member of our current team has used Splunk, and would not recommend it. Splunk is listed on FedRAMP, and would likely be a good choice of platform. There are some notable high-level differences between Splunk and New Relic or Datadog. Splunk is a fully featured platform, yes, but at its core it is an enterprise data-driven platform. Splunk is best at helping understand data flow in large and complex applications. Simpler.Grants.gov is not a data-driven platform and is not looking to reach an "enterprise application" scale. Therefore, Splunk's core product offering would likely be slightly mismatched for our needs, despite being outwardly very similar to New Relic or Datadog.

This ADR is generally aligned with the contents of several blog posts comparing Datadog with New Relic.
