Five years ago, our Data Science & Analytics team wrote about our centralized data stack in this post. The setup that previous members of our team built has served the organization’s needs very well over time, and we’re grateful that our predecessors eagerly adopted modern tools and technologies to power the entire org’s data work and give everyone at DonorsChoose easy access to our data.
Since then, we’ve added a few more tools and services to our data stack, and thought it was a good time to write an update.
Our Updated Data Stack
Some of the biggest changes in our stack over the last five years include:
Integrating more data sources with Fivetran connectors (primarily more Amazon S3 connectors)
Transitioning some data transformations from Looker to dbt
Switching from ExactTarget to Simon Data for our Email Service Provider
Introducing Amazon Comprehend for one of our core machine learning models
Limiting access to our “open data” Looker instance in favor of broader access to a standardized data set that we update annually
A Deeper Dive
The data stack is a little more complicated than it was five years ago, so we’ll break it down below in some detail.
Fivetran still lives at the heart of our data ingestion process. Fivetran connectors are easy to set up and have allowed us to pull data from several different sources into Redshift. Current sources include our site’s PostgreSQL database, Zendesk, Salesforce, several Amazon S3 buckets, and CSV files from our staff users. You can read a case study about our experience with Fivetran.
We use Heap Analytics to capture and aggregate end-user interaction events on our website, and Heap pipes that data directly to Redshift.
We’ve onboarded Simon Data as our new Email Service Provider. They push engagement data (e.g., sends, opens, clicks) into an S3 bucket, which Fivetran then pulls into Redshift. Simon Data also pulls data from Redshift (more on that in the Data Security section below).
Our data lives in a single Redshift cluster. We use 3 ds2.xlarge nodes, which are storage-efficient and cost-efficient. Currently we’re using about 25% of our allotted storage space. Redshift Advisor analyzes queries and automatically recommends specific sort and distribution keys to optimize table setup over time.
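When we do act on a recommendation from Redshift Advisor, it comes down to a couple of ALTER TABLE statements. Here's a minimal sketch of that pattern; the table and column names are hypothetical examples, not our real schema.

```python
# Sketch: applying a Redshift Advisor-style key recommendation as DDL.
# Table and column names are illustrative, not our actual warehouse schema.

def alter_key_statements(table, dist_key, sort_keys):
    """Build ALTER TABLE statements for a distribution key and a compound sort key."""
    return [
        f"ALTER TABLE {table} ALTER DISTKEY {dist_key};",
        f"ALTER TABLE {table} ALTER COMPOUND SORTKEY ({', '.join(sort_keys)});",
    ]

stmts = alter_key_statements("analytics.donations", "donor_id", ["donated_at"])
for s in stmts:
    print(s)
```

Choosing the distribution key to match a common join column minimizes data shuffling between nodes, while the sort key speeds up range-filtered scans.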
Up until recently, all data transformations happened in Looker Persistent Derived Tables (PDTs).
Last year we started transitioning some of our transformations from Looker to dbt, and it has been a game-changer. For example, dbt allows for incremental updates on tables that would otherwise take a long time to rebuild, and can automatically generate a DAG that shows dependency relationships between all our models.
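The core idea behind an incremental model is simple: rather than rebuilding a table from scratch, upsert only the rows that are newer than the table's current high-water mark. Here's that logic sketched in plain Python with made-up rows; in practice dbt handles this for us in SQL.

```python
# Sketch of incremental-update logic: upsert only rows newer than the
# current high-water mark. Rows and field names are illustrative.
from datetime import date

existing = {  # rows already materialized in the warehouse table, keyed by id
    1: {"id": 1, "amount": 25, "updated_at": date(2021, 1, 1)},
    2: {"id": 2, "amount": 50, "updated_at": date(2021, 1, 2)},
}

def incremental_merge(existing, source_rows):
    """Upsert source rows whose updated_at exceeds the table's max timestamp."""
    high_water = max(r["updated_at"] for r in existing.values())
    for row in source_rows:
        if row["updated_at"] > high_water:
            existing[row["id"]] = row
    return existing

new_rows = [
    {"id": 2, "amount": 75, "updated_at": date(2021, 1, 5)},    # updated row
    {"id": 3, "amount": 10, "updated_at": date(2021, 1, 6)},    # brand-new row
    {"id": 1, "amount": 99, "updated_at": date(2020, 12, 31)},  # stale, skipped
]
merged = incremental_merge(existing, new_rows)
```

For a table with millions of historical rows, touching only the handful changed since the last run is what turns an hours-long rebuild into a quick refresh.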
Even though we’re transitioning some transformations over to dbt, we’re still as happy as ever with Looker as a business intelligence platform. In any given week, about 100 Looker users each spend more than 2 hours in the platform. In addition to adding Explores and fields as requested, we’ve archived older views and fields that are no longer used. With the newer updates to our data stack, our email marketing team can use Looker to join email engagement from Simon Data with transactional data from our website for a more holistic view of our audience.
Our most recent machine learning model in production, Taxonomy, is a text classification algorithm in Amazon Comprehend that categorizes requested classroom resources into one of 11 categories, such as Art Supplies and Sports Equipment.
In addition, we use an Amazon EC2 instance to run machine learning cron jobs on some website data. For example, one model predicts whether current projects will be funded the next day, and another predicts donor lifetime value.
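To give a feel for what one of these nightly scoring jobs does, here's a minimal sketch of a funded-tomorrow score. The hand-rolled logistic scorer, its features (funding progress, days live), and its coefficients are all made up for illustration; our production models are trained, not hard-coded.

```python
# Hypothetical sketch of a nightly scoring job: a hand-rolled logistic
# scorer over illustrative features. Coefficients are invented for the
# example, not taken from our production model.
import math

def funded_tomorrow_probability(percent_funded, days_live):
    """Logistic score: projects closer to their funding goal score higher."""
    z = 4.0 * percent_funded - 0.05 * days_live - 2.0
    return 1.0 / (1.0 + math.exp(-z))

almost_funded = funded_tomorrow_probability(0.95, 10)  # nearly at goal
just_posted = funded_tomorrow_probability(0.05, 1)     # brand-new project
```

A cron entry on the EC2 instance runs a script like this over all live projects each night and writes the scores back to the warehouse.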
We’re also testing a few other machine learning ideas using DataRobot, which has provided us with a free license and expert consulting through their wonderful AI for Good program.
Closing the Feedback Loop
Historically, we’ve analyzed data by using Looker to visualize trends, running raw SQL on our warehouse tables, or conducting more detailed Python analyses using Google Colab. We’re now thinking about how to use our data in operations more effectively. One example we’ve implemented recently involves identifying unengaged users based on their email history. We now have two sources of email data: Simon Data and website data from our PostgreSQL database. After combining both in Redshift and transforming them with dbt, we can easily discern which of our users haven’t opened or clicked any of our emails over the last six months. We then feed that list back into an S3 bucket, which our Engineering team uses to send less email to inactive users. The entire process helps us avoid spam traps and keep our domain reputation high.
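The unengaged-user check boils down to a simple rule, sketched here in plain Python with fake records. In production this logic lives in a dbt-transformed Redshift table rather than application code, and the 180-day window and email addresses below are illustrative.

```python
# Sketch of the unengaged-user rule: flag anyone with no open or click
# in the last 180 days. Records and the window are illustrative.
from datetime import date, timedelta

def unengaged_users(users, today, window_days=180):
    """Return emails of users with no open or click within the window."""
    cutoff = today - timedelta(days=window_days)
    flagged = []
    for u in users:
        events = [d for d in (u["last_open"], u["last_click"]) if d is not None]
        if not events or max(events) < cutoff:
            flagged.append(u["email"])
    return flagged

users = [
    {"email": "a@example.org", "last_open": date(2021, 6, 1), "last_click": None},
    {"email": "b@example.org", "last_open": date(2020, 1, 1), "last_click": None},
    {"email": "c@example.org", "last_open": None, "last_click": None},
]
inactive = unengaged_users(users, today=date(2021, 7, 1))
```

The resulting list is what lands in the S3 bucket for the Engineering team to act on.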
Data Security
As our data stack has grown in complexity, we’ve thought carefully about improving data security at every level.
In our warehouse, we provide external vendors with their own usernames with the minimum level of permissions they need for their service. The Heap Analytics user can only write logging data to their own schema. For the Simon Data user, we’ve created a schema full of important views that they can pull a few times a day. The views reference tables in other schemas, but don’t allow the user to actually access the raw data from those other schemas.
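The permissioning pattern behind this is to grant the vendor's user usage on its dedicated schema and select on the curated views only, so the underlying tables stay out of reach. Here's a sketch that builds those statements; the user, schema, and view names are hypothetical.

```python
# Sketch of the vendor-permissioning pattern: a dedicated user can read
# only curated views in its own schema, never the underlying tables.
# All identifiers here are hypothetical examples.

def vendor_grant_ddl(vendor_user, schema, views):
    """Build statements granting a vendor read access to curated views only."""
    stmts = [f"GRANT USAGE ON SCHEMA {schema} TO {vendor_user};"]
    for view in views:
        stmts.append(f"GRANT SELECT ON {schema}.{view} TO {vendor_user};")
    return stmts

ddl = vendor_grant_ddl("simon_data", "simon_exports", ["user_profiles", "donation_totals"])
```

Because the views run with their owner's permissions, the vendor user never needs (and never gets) access to the source schemas the views reference.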
We make personally identifiable information (PII) available to internal users only when they need it for their jobs, using role-based access control within Looker, and we time-bound that access. Staff who do not need detailed data can still see aggregated information. We find that this setup gives our users all the info they need while limiting exposure of our donors’ and teachers’ PII. Lastly, our awesome IT team has set up Okta, a single sign-on provider that enforces strong passwords and multi-factor authentication for account creation and access to our Looker instance.