Startup Tech Lessons

July 02, 2021

Summary

I am a 29-year-old software engineer with six years of professional experience. After earning my BS in Computer Science from the University of Minnesota, I worked on IMDb and AWS at Amazon (2015-2020). Highlights include design, development, and delivery of AWS Config Remediation and owning the IMDb home page and SEO. I gained a lot of experience building full-stack applications with billions of monthly page views and web services used by tens of thousands of enterprise customers in the context of a large, established company.

Eager for a new challenge, I left Amazon in the summer of 2020 and tried starting a SaaS company with a construction industry veteran. I spent most of the last year building a full-stack application from scratch. Unfortunately, the start-up did not work out, but I learned a lot (about code, myself, and other people), and I have written this post to share my experience and give advice.

If you are creating a new SaaS application, I hope you will find useful, pragmatic guidance below.

Front-end

I recommend TypeScript/React/Next.js.

I had some experience with TypeScript (typed JavaScript) at AMZN, and I remain convinced that TypeScript saves enormous amounts of development effort over raw JavaScript. Airbnb reports that 38% of JavaScript bugs could be prevented by using TypeScript. [1] The percentage feels even higher based on my personal experience.
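As a small illustration of the kind of bug the type checker catches, here is a minimal sketch. The `LoadResult` type and `summarize` function are hypothetical names made up for this example, not code from the project.

```typescript
// Hypothetical shape for the result of loading data; names are illustrative only.
type LoadResult =
  | { status: "ok"; titles: string[] }
  | { status: "error"; message: string };

// The compiler only allows access to fields that exist on the current branch.
function summarize(result: LoadResult): string {
  switch (result.status) {
    case "ok":
      return `${result.titles.length} titles loaded`;
    case "error":
      // Reading result.titles here is a compile-time error in TypeScript.
      // The same access in plain JavaScript would throw a TypeError at runtime.
      return `failed: ${result.message}`;
  }
}

console.log(summarize({ status: "ok", titles: ["A", "B"] }));
console.log(summarize({ status: "error", message: "timeout" }));
```

The mistake is caught in the editor rather than in production, which is where most of the saved effort comes from in my experience.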

I chose React as my front-end framework due to its well-designed interface, performance, and widespread adoption. I started with Redux for state management, but after some cursory research I realized that others were moving away from Redux and onto native React state-management hooks (e.g., Context). State management with React Context is more concise and readable.

I chose Next.js as the web framework, as it appeared to be outpacing other JavaScript alternatives. It saves you from the most annoying bits of full-stack implementation, like bundling, code-splitting, and server-side rendering. I briefly considered using a different back-end stack (such as Java/Spring, the standard at AMZN) but estimated that the efficiency gained from the vast Node.js-based ecosystem and from sharing code across front-end and back-end was justification enough for a TypeScript/Node.js back-end.

Styling

I recommend using Tailwind CSS, with templates from Tailwind UI.

The leading component/design frameworks are Bootstrap, Material UI, and Ant Design. I ended up choosing Ant Design because it is well regarded and its API looked comprehensive and well designed. Ant has first-class TypeScript support and well-organized documentation. I had very few problems with its API - I doubt any other framework would have been superior. But there are some issues: Ant lacks accessibility support, and the ecosystem can be confusing.

I did some research into CSS-in-JS, which was a compelling concept to me after dealing with a large number of poorly maintained stylesheets at AMZN. In large codebases, localizing styles with corresponding components aids in readability and completely eliminates a pernicious class of styling problems. I rarely struggled with cascading rules. I originally was going to use styled-jsx but ended up using Styled Components due to the wider support. I don’t think the choice here is important; they’re all very similar. I had used Sass before as a CSS preprocessor, but Styled Components (and others) support the same logic.

After some initial development, I became frustrated writing large amounts of styles for every component and discovered Tailwind CSS. I ended up integrating it, mostly to use in constructing layouts.

I’m not a trained UX designer, so I made mistakes as I tried to build the app as fast as possible. The biggest mistake I made was mixing different design systems. I would have been better off hiring a dedicated UX designer, skipping Ant Design, and building an in-house design system, probably utilizing Tailwind exclusively and avoiding ad-hoc CSS-in-JS styles as much as possible. It also would have helped to use Storybook to organize components. After working on subsequent projects, I think Tailwind CSS is the easiest and most maintainable way to quickly build a web application.

Back-end API

I recommend using GraphQL, via the Apollo server/client.

I decided initially to use GraphQL for all API connections between front and back-end. I reasoned that the data schema was going to be complex and highly nested, which turned out to be true. GraphQL enabled me to clearly define a data schema up front, then work front-to-back to request data used by the app. I think GraphQL is best considered from the point of view of the UI/UX demands of the client experience - i.e., as a way for the client to request exactly the data it needs.

GraphQL also has a rich ecosystem of TypeScript tooling, some of which I contributed to (on GitHub) during the first two months of development as I set up the GraphQL infrastructure. This allowed me to share data types across the entire stack, which was an efficiency boon. Apollo is an open-source software package that provides both a GraphQL server, which parses GraphQL queries and fetches the requested data, and a client, which dispatches queries to the server and manages a local cache for state. I underestimated how much I would end up using Apollo as a state-management tool on the front-end. Overall, Apollo was a good choice, but I frequently struggled getting the cache management features to work how I wanted.
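To make the idea of sharing data types across the stack concrete, here is a minimal sketch. It does not use Apollo or any GraphQL library. The schema in the comment, the `Project` and `Task` types, and both functions are hypothetical, and the types are hand-written here, whereas in a real setup they would typically be derived from the schema by tooling.

```typescript
// Hypothetical schema, defined up front (shown as a comment for reference):
//   type Task    { id: ID!  title: String!  done: Boolean! }
//   type Project { id: ID!  name: String!   tasks: [Task!]! }
//   type Query   { project(id: ID!): Project }

// Shared TypeScript types mirroring the schema above.
interface Task {
  id: string;
  title: string;
  done: boolean;
}

interface Project {
  id: string;
  name: string;
  tasks: Task[];
}

// "Server" side: a resolver-shaped function backed by an in-memory map.
const projects = new Map<string, Project>([
  [
    "p1",
    {
      id: "p1",
      name: "Warehouse build",
      tasks: [
        { id: "t1", title: "Pour foundation", done: true },
        { id: "t2", title: "Frame walls", done: false },
      ],
    },
  ],
]);

function resolveProject(id: string): Project | null {
  return projects.get(id) ?? null;
}

// "Client" side: UI logic that consumes the same Project type.
function openTaskTitles(project: Project): string[] {
  return project.tasks.filter((task) => !task.done).map((task) => task.title);
}

const project = resolveProject("p1");
console.log(project ? openTaskTitles(project) : "not found");
```

Because both sides depend on the same `Project` definition, renaming or removing a field breaks the build on both ends instead of failing silently at runtime.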

At times it becomes difficult to decide whether a piece of client state should reside in Apollo or React Context. In general, if a piece of state is specific to the UI and doesn’t make sense to persist, I put it in context - otherwise, it goes in Apollo. For example, whether a toolbar is expanded would live in context. A project name or configuration would live in Apollo.

Persistence Layer

I recommend DynamoDB, the managed NoSQL database from AWS.

I value the consistency, predictability, and speed of NoSQL over the organizational benefits of SQL. Startup codebases that use SQL approaches, such as RoR, become unwieldy as they grow.

At AMZN nearly everyone uses DynamoDB - but I decided to try MongoDB, as it is more of an industry standard and was easy to set up on my local machine for testing/development. I was also curious to learn how it compared with DynamoDB overall (which was probably not a great reason to use it). In retrospect, I should’ve chosen DynamoDB to keep everything in AWS, but the choice wasn’t important enough to warrant a migration. In practice, DynamoDB is not much more expensive than managed MongoDB. Most of the runaway DynamoDB expenses I have seen are either due to 1) legitimately high scale, in which case the cost is not really avoidable, or 2) excessive reads/writes that would be better served by an in-memory cache fleet or persistence layer. I also tried a lot of MongoDB querying capabilities, which are much more flexible than those available in DynamoDB, but the MongoDB code is less readable than the simpler DynamoDB API, and any performance gain with MongoDB is at best negligible.

Search

Use AWS Elasticsearch for searching your NoSQL database.

I briefly considered the built-in search functionality of MongoDB Atlas, but ended up using Elasticsearch directly for increased flexibility and the ability to spin it up on my local development machine. I also didn’t like the vendor lock-in of Atlas, which was a bad reason, since it’s unlikely you’re going to spend time migrating something like search infrastructure unless the initial choice was something wildly poor. Elasticsearch needed some infrastructure to pipe data from MongoDB - I ended up using a containerized solution called Monstache, which worked reasonably well.

User Authentication

I settled on Auth0 for authentication early on.

Overall, it was pleasant to work with, though their Next.js integration changed significantly during my 12-month project. Auth0 also released their Organizations feature after I had already home-rolled the exact same feature, which was poor luck, but evidence that their priorities are well-placed.

Payment Processing

I used Stripe for payment processing.

Their APIs are extremely good, but I made the initial mistake of using their managed Stripe Checkout UX. This ended up being insufficient for use cases like dynamic sales tax collection, and I had to rebuild a custom payment page inside my app. I also rebuilt the Auth/Subscription integration several times as I realized the best model was to listen to payment events from Stripe and keep a copy of pertinent subscription information in my own datastore, to reduce Stripe API calls for subscription checks. This is a performance and reliability improvement over making calls to Stripe that could potentially be throttled, and it is the approach Stripe recommends.
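The event-driven model can be sketched roughly as follows. This is a simplified illustration, not my production code: it does not use the Stripe SDK, the `SubscriptionEvent` shape is a flattened, hypothetical stand-in for the much larger real event payload, and the `Map` stands in for a real datastore.

```typescript
// Simplified, hand-written shapes; real Stripe events carry many more fields
// and more possible statuses.
type SubscriptionStatus = "active" | "trialing" | "past_due" | "canceled" | "unpaid";

interface SubscriptionEvent {
  type:
    | "customer.subscription.created"
    | "customer.subscription.updated"
    | "customer.subscription.deleted";
  customerId: string; // hypothetical flattened field
  status: SubscriptionStatus;
}

// Stand-in for the application's own datastore.
const subscriptionByCustomer = new Map<string, SubscriptionStatus>();

// Intended to be called by a webhook endpoint (not shown) once the event
// has been verified as genuinely coming from Stripe.
function applySubscriptionEvent(event: SubscriptionEvent): void {
  if (event.type === "customer.subscription.deleted") {
    subscriptionByCustomer.delete(event.customerId);
  } else {
    subscriptionByCustomer.set(event.customerId, event.status);
  }
}

// Subscription checks read the local copy, so no Stripe API call is made.
function hasActiveSubscription(customerId: string): boolean {
  const status = subscriptionByCustomer.get(customerId);
  return status === "active" || status === "trialing";
}

applySubscriptionEvent({
  type: "customer.subscription.created",
  customerId: "cus_123",
  status: "active",
});
console.log(hasActiveSubscription("cus_123"));
```

Which statuses count as "active" is a product decision; treating `trialing` as active is just the assumption made in this sketch.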

Observability

I recommend using Sentry.io for full-stack monitoring.

For larger projects that require deeper observability, I recommend Prometheus for metrics collection in conjunction with Grafana for monitoring. I initially considered a few different error-tracking and monitoring tools. When it came time to implement these features, I ended up with Sentry.io, which is very well-designed. Due to the nature of Next.js, Sentry can track both client and server-side errors. This ended up being the main observability tool I used.

The Prometheus/Grafana stack gave some useful usage/performance/failure metrics. It is friendlier than the internal tools I was used to at AMZN, but the time it took to set up everything in a non-Kubernetes environment (ECS) was significant. It was particularly useful for general health metrics and to monitor services or jobs distinct from the web stack and external API.

Infrastructure

I used GitHub and CircleCI for continuous integration.

CircleCI is great, but $30/month was expensive for what I actually used. The YAML syntax was also difficult to navigate, but their web-based interface was intuitive.

I recommend using a monorepo for source control.

Initially I used separate repositories, which complicated dependency organization and the CI process. Once I switched to a monorepo organization, it was easy to share code between the homepage, app, and backend services. Builds/deployments are a little difficult on CircleCI with a monorepo setup, but not intractable, and I found the benefits outweighed this cost. I spoke with CircleCI staff and they agreed this was a common issue they’re trying to address.

I recommend AWS CDK for infrastructure-as-code.

I originally wanted to use Terraform, having been extremely frustrated with CloudFormation at AMZN. In practice, I skipped infrastructure-as-code at first and manually crafted everything in AWS. This was a mistake and wasted a lot of time. It let me get the app running in the fastest possible way, but infrastructure modifications and spinning up new services became extremely difficult. I re-built everything in AWS CDK, which is what I should have done from the beginning. CDK is well-typed and I believe it’s the fastest, most maintainable way to author AWS infrastructure at present. Prior to CDK, it would take me hours or days to set up infrastructure for a new service and get the myriad permissions and network settings right, versus <10 minutes total with CDK.

I initially considered using a PaaS like Heroku to host the service. I initially overestimated the amount of work needed to set up the necessary AWS infrastructure, especially in light of AWS tools such as the CDK. My experience with AWS led me to eschew a PaaS and set up containerized services myself. CDK templates make it easy to set up secure infrastructure. I’m also very familiar with AWS IAM (Identity and Access Management), which saves me having to figure out how to handle credentials across even more services from different vendors.

I ended up using AWS ECS for container orchestration.

I considered using Kubernetes but it seemed like a needless layer of complexity since I already knew I wanted to be on AWS.

One problem with AWS is that service quality varies widely. The biggest services such as EC2/S3/DynamoDB tend to be reliable and reasonably priced. I had a particularly bad experience with customer service for SES, who refused to approve my use case for unknown reasons. Service from competitors such as Twilio is far superior.

Key Engineering Lessons Learned

In general, I think picking the most widely supported technology is probably a good heuristic. Your critical mistakes are going to be made elsewhere:

The biggest mistake was not starting with the core of the application. Don’t waste time setting up things like authentication, authorization, payment, or navigation. Those are all subject to change later. Get the most important (i.e. innovative) parts working first. You want to gather user data around this part as fast as possible.
When developing a new feature, design the front-end, then design the APIs, build the back-end, write integration tests (always), build the front-end, and only write front-end tests for complex logic. The front-end will probably be heavily modified later, more frequently than the back-end. It’s a lot easier to fiddle with the front-end if you’re confident the back-end works as expected.
Avoid modal “windows” as much as possible. They are useful in specific contexts, but easily abused and not mobile-friendly. A webapp is not a window manager.
Try to minimize credential keys or tokens that require manual rotation - use AWS IAM as much as possible. AWS offers lots of useful tools to get started (notably CDK and Copilot).
Don’t render PDFs client-side. It is much, much slower and more error-prone than rendering them server-side and making an HTTP request. The machinery to render PDFs client-side is also resource intensive. Getting a prototype working with client-side PDF rendering is much faster, but the result is untenably slow in some cases.