VPC Endpoints are the key to connecting AWS services to each other privately. When you use VPC endpoints, all traffic stays on the AWS network, and the AWS network is the best in the world.
Add to the quality of the network the fact that the primary endpoint for data science is S3, and the S3 Gateway Endpoint is free. You have a WIN, WIN, WIN proposition for using VPC endpoints:
They are free
They protect your data from Internet access
They are horizontally scaled, redundant, and highly available
See the details on the AWS Docs here:
Setting up a VPC Endpoint with CloudFormation is easy and effective.
Define the VPC Endpoint based on an existing VPC.
AWSTemplateFormatVersion: "2010-09-09"
Description: >
  This template constructs a VPC endpoint for the Scholar data

Parameters:
  S3DataHome:
    Type: String
    Description: "The S3 Bucket Containing the Data Lake Data"
  VpcId:
    Type: String
  RouteTableA:
    Type: String
  RouteTableB:
    Type: String

Resources:
  S3Endpoint:
    Type: 'AWS::EC2::VPCEndpoint'
    Properties:
      PolicyDocument:
        Version: 2012-10-17
        Statement:
          - Effect: Allow
            Principal: '*'
            Action:
              - 's3:GetObject'
            Resource:
              - !Sub "arn:aws:s3:::${S3DataHome}/*"
      RouteTableIds:
        - !Ref RouteTableA
        - !Ref RouteTableB
      ServiceName: !Sub 'com.amazonaws.${AWS::Region}.s3'
      VpcId: !Ref VpcId
To be a little bit fancy, you can define a dependency between CloudFormation templates with the following syntax.
In the "Parameters" section, add the VPC stack CloudFormation reference:
Parameters:
  S3DataHome:
    Type: String
    Description: "The S3 Bucket Containing the Data Lake Data"
  NetworkStackName:
    Type: String
    Description: "The base VPC Network stack"
    Default: "PersonatorVpc"
Instead of passing each network parameter as an argument to the CloudFormation build, pass the name of the VPC stack and use its exported values as imports in your resource definitions.
ScholarS3Connection:
  Type: AWS::Glue::Connection
  Properties:
    CatalogId: !Sub "${AWS::AccountId}"
    ConnectionInput:
      ConnectionType: NETWORK
      Description: "The connection to the S3 location of the Scholar Data"
      Name: "scholar-data-connection"
      PhysicalConnectionRequirements:
        AvailabilityZone: "us-east-1a"
        SecurityGroupIdList:
          - Fn::ImportValue:
              Fn::Sub: "${NetworkStackName}-SecurityGroupId"
        SubnetId:
          Fn::ImportValue:
            Fn::Sub: "${NetworkStackName}-PrivateSubnetAId"
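For those imports to resolve, the base VPC stack has to export values under matching names. Here is a minimal sketch of the Outputs section that stack would need; DataScienceSecurityGroup and PrivateSubnetA are placeholder resource names I made up for illustration, not resources from this post.
Outputs:
  SecurityGroupId:
    Description: "Security group for the data science workloads"
    Value: !Ref DataScienceSecurityGroup    # placeholder resource name
    Export:
      Name: !Sub "${AWS::StackName}-SecurityGroupId"
  PrivateSubnetAId:
    Description: "Private subnet in availability zone A"
    Value: !Ref PrivateSubnetA              # placeholder resource name
    Export:
      Name: !Sub "${AWS::StackName}-PrivateSubnetAId"
With the network stack deployed under the name "PersonatorVpc", those export names line up with the Fn::ImportValue references above.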
In this way, you can build data science applications using CloudFormation templates only and have a clean DevOps process of building all your network and application infrastructure through scripting.
Next Steps - Glue or ECS with Fargate?
As I explore implementing AWS::Glue::Job in CloudFormation, I am learning that all is not well with AWS Glue. The problems started small: I noticed that the format is not consistent.
DefaultArguments:
  "--job-bookmark-option": "job-bookmark-disable"
  "--job-language": "scala"
  "--class": "GlueApp"
  "--TempDir": "s3://aws-glue-temporary-account-us-east-1/tim.burns/scholar-acm"
This key-value argument format does not match the other AWS CloudFormation resource types.
The S3 endpoint I defined above needs to be included in the Glue job through the connection to utilize the VPC endpoint, and the way the connection is included in CloudFormation is a disgusting hack.
Connections:
  Connections:
    - !Ref ScholarS3Connection
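For context, here is a minimal sketch of how those fragments might fit together in an AWS::Glue::Job resource. The job name, role parameter, Glue version, and script location are placeholders I made up for illustration, not values from my actual stack.
ScholarConversionJob:
  Type: AWS::Glue::Job
  Properties:
    Name: "scholar-conversion-job"           # placeholder name
    Role: !Ref GlueJobRoleArn                # hypothetical parameter holding the job's IAM role
    GlueVersion: "2.0"                       # assumed version
    Command:
      Name: glueetl
      ScriptLocation: "s3://my-glue-scripts/scholar-acm.scala"   # placeholder script path
    DefaultArguments:
      "--job-bookmark-option": "job-bookmark-disable"
      "--job-language": "scala"
      "--class": "GlueApp"
    Connections:
      Connections:
        - !Ref ScholarS3Connection
Even spelled out as a complete resource, the doubled Connections key is the nesting I am complaining about.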
The support for generating scripts is limited, and the recommendation is to use the UI and copy the results into the deployment, which is a poor DevOps practice.
On top of that, I'm running across articles discussing how ECS Fargate is cheaper and more effective than Glue.
AWS reposts the argument in their blog, confirming that they know that Glue is deeply flawed.
Snowflake handles much of this process, but it is expensive. A simple query over 40 MB of data that takes less than 30 seconds can cost around $10 in Snowflake, and their pricing model is not transparent. This blog article does a good job of digging into the details behind outrageous Snowflake bills.
The other issue with Snowflake is that once the data is in Snowflake, it is not as readily accessible and transferable in the outside ecosystem as a Parquet file would be.
To be honest, I think the best route here is to go back to basics and write PySpark to turn CSV files into Parquet.
Here is a very solid article outlining how to start from basic principles: