Improving Data Architecture with AWS, Snowflake, and Open Source Tools
I'm trying that on as a title for a book I'm writing. One of the things I like about Medium is that it lets me see who my readers are, and overwhelmingly they are fellow data engineers and architects. Originally I was thinking of more of a theory book, but I am realizing the incredible need for recipes to build big data applications in Snowflake.
With over twenty years of experience building data architecture, I can see the larger picture, and I have also gathered quite a few opinions.
For example:
Build Value First - It's easy to be myopic about technology and blind to whether what you're building has any value.
Delivery - Pick a data platform appropriate to your needs.
Snowflake fits very large (TB+) data sets and the data lake.
Postgres or another relational system will be appropriate for many applications.
Security - Security should come first because it is very difficult to build security into applications once they are already in production.
Repeatability - It might be called "DevOps" or "CI/CD", but the point is that your build must be repeatable. You need to be able to stand up infrastructure in a repeatable way.
Simplicity - Less code is always better. By picking quality open-source tools to realize your data architecture and a powerful platform like Snowflake to build on, you can focus on building your application as simply as possible. Just because you can write code doesn't mean you should.
We'll see where that goes.
My Medium articles have taught me that my audience is technical and interested in actual recipes for building data architectures. This is great because my forte is putting together crafted, simple recipes for building out important components on a data platform. That leaves me to pick the tools to use. They are:
Snowflake - The Data Platform
Terraform - Infrastructure Components
Airbyte - Data Integration Tool
dbt - Data Pipeline Transformations
Kubernetes - Runtime and scaling
Airflow - Orchestration
These are all tools with solid open-source and community support. They've been proven repeatedly and are considered state-of-the-art today.
For the platform, I am choosing Snowflake over Databricks: though Databricks is extremely innovative, I feel that Snowflake offers governance and maturity that Databricks doesn't have. The cost comparison between the two (as in my previous article) is a red herring. Cost is driven by poor governance, and no technology will save you from bad strategy, governance, and execution. Focus on value and the cost will follow (the point of this article).
Notes
Steps
Provision Snowflake Pipeline using Terraform
Create Snowpipe using dbt (see the Snowpipe sketch after this outline)
Parameterize new data additions
Track lineage consistently
Load Historical Data (the backfill in the same sketch)
Set up SQS Data Triggers on AWS (sketch below)
Monitor Data Loading (observability; sketch below)
Decentralized Governance
Create Dashboards to Monitor Data Quality (sketch below)
The people who deal with the data every day should be the ones creating the rules.
Many domains feed into the final result.
As the data warehouse (the knot) gets smaller, the incoming data and the outgoing analytics capabilities both need to get bigger, like a bow tie.
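To make the "Create Snowpipe" and "Load Historical Data" steps concrete, here is a minimal sketch in Python using the snowflake-connector-python package. In the book, the stage and pipe would be managed through Terraform and dbt; this only shows the underlying Snowflake SQL, and every name in it (the account, warehouse, database, target table, S3 bucket, and storage integration) is a hypothetical placeholder.

```python
# Minimal sketch: create an external stage, an auto-ingest Snowpipe, and run a
# one-time historical backfill. All object names and the S3 URL are hypothetical,
# and the target table events_raw (e.g. a single VARIANT column for raw JSON)
# is assumed to already exist.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",   # hypothetical account identifier
    user="loader",
    password="...",         # prefer key-pair auth or a secrets manager in practice
    warehouse="LOAD_WH",
    database="RAW_DB",
    schema="RAW",
)

statements = [
    # External stage over the landing bucket (storage integration assumed to exist)
    """CREATE STAGE IF NOT EXISTS events_stage
         URL = 's3://my-landing-bucket/events/'
         STORAGE_INTEGRATION = s3_int
         FILE_FORMAT = (TYPE = 'JSON')""",
    # Auto-ingest pipe; Snowflake exposes an SQS channel that S3 will notify
    """CREATE PIPE IF NOT EXISTS events_pipe AUTO_INGEST = TRUE AS
         COPY INTO events_raw FROM @events_stage""",
    # One-time backfill of the historical files already sitting in the bucket
    "COPY INTO events_raw FROM @events_stage",
]

with conn.cursor() as cur:
    for stmt in statements:
        cur.execute(stmt)
conn.close()
```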
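The "Set up SQS Data Triggers on AWS" step wires S3 object-created events to the SQS queue that Snowflake creates for an auto-ingest pipe (the notification_channel shown by SHOW PIPES). Here is a sketch using boto3; the bucket name and queue ARN below are hypothetical, and the real ARN comes from Snowflake.

```python
# Minimal sketch: point S3 "object created" notifications at the SQS queue Snowflake
# created for the auto-ingest pipe. Bucket name and ARN are hypothetical placeholders;
# the real ARN is the notification_channel value from SHOW PIPES / DESC PIPE.
import boto3

SNOWPIPE_SQS_ARN = "arn:aws:sqs:us-east-1:123456789012:sf-snowpipe-xxxx"  # hypothetical

s3 = boto3.client("s3")
s3.put_bucket_notification_configuration(
    Bucket="my-landing-bucket",
    NotificationConfiguration={
        "QueueConfigurations": [
            {
                "QueueArn": SNOWPIPE_SQS_ARN,
                "Events": ["s3:ObjectCreated:*"],
                # Only notify on the prefix our stage points at
                "Filter": {
                    "Key": {"FilterRules": [{"Name": "prefix", "Value": "events/"}]}
                },
            }
        ]
    },
)
```

Note that this call replaces the bucket's existing notification configuration, which is one more reason the wiring belongs in Terraform next to the bucket definition rather than in ad-hoc scripts.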
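For the "Monitor Data Loading" step, Snowflake already exposes pipe status and copy history, so observability can start as a couple of queries that a dashboard or alert polls. A sketch reusing the hypothetical names from the Snowpipe sketch above:

```python
# Minimal sketch: basic Snowpipe observability queries. Connection details, pipe,
# and table names are the same hypothetical placeholders used in the earlier sketch.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="loader", password="...",
    warehouse="LOAD_WH", database="RAW_DB", schema="RAW",
)

with conn.cursor() as cur:
    # Current state of the pipe: pending file count, last received message, etc.
    cur.execute("SELECT SYSTEM$PIPE_STATUS('events_pipe')")
    print(cur.fetchone()[0])

    # Files loaded (or failed) into the target table over the last 24 hours
    cur.execute("""
        SELECT file_name, status, row_count, first_error_message
        FROM TABLE(information_schema.copy_history(
                 TABLE_NAME => 'EVENTS_RAW',
                 START_TIME => DATEADD(hour, -24, CURRENT_TIMESTAMP())))
        ORDER BY last_load_time DESC
    """)
    for row in cur.fetchall():
        print(row)

conn.close()
```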
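Finally, for "Create Dashboards to Monitor Data Quality", the checks themselves are just queries, which is what makes decentralized governance workable: the domain teams who deal with the data every day define them, and a dashboard tool reads the results. A trivial sketch of two such checks against the hypothetical raw table (the column names are placeholders; in the book these would live in dbt tests):

```python
# Minimal sketch: two data-quality checks a domain team might define. Table and
# column names (events_raw, loaded_at, event_id) are hypothetical placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account", user="loader", password="...",
    warehouse="LOAD_WH", database="RAW_DB", schema="RAW",
)

checks = {
    # Freshness: has anything landed in the last 24 hours?
    "rows_loaded_last_24h": """
        SELECT COUNT(*) FROM events_raw
        WHERE loaded_at >= DATEADD(hour, -24, CURRENT_TIMESTAMP())
    """,
    # Completeness: rows missing a primary identifier
    "null_event_ids": "SELECT COUNT(*) FROM events_raw WHERE event_id IS NULL",
}

with conn.cursor() as cur:
    for name, sql in checks.items():
        cur.execute(sql)
        print(name, cur.fetchone()[0])

conn.close()
```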