AWS for Data Scientists: From VPC to Spark

Tim Burns


VPC endpoints are the key to connecting AWS services to each other privately. When you use VPC endpoints, all traffic stays on the AWS network instead of traversing the public internet, and the AWS network is the best in the world.


Add to the quality of the network the fact that the primary endpoint for data science is S3, and the S3 Gateway Endpoint is free. You have a WIN, WIN, WIN proposition for using VPC endpoints:

  • They are free

  • They protect your data from Internet access

  • They are horizontally scaled, redundant, and highly available

See the details in the AWS documentation on VPC endpoints.

Setting up a VPC endpoint with CloudFormation is easy and effective.


Define the VPC Endpoint based on an existing VPC.


AWSTemplateFormatVersion: "2010-09-09"
Description: >
  This template constructs a VPC endpoint for the Scholar data

Parameters:
  S3DataHome:
    Type: String
    Description: "The S3 Bucket Containing the Data Lake Data"

  VpcId:
    Type: String

  RouteTableA:
    Type: String

  RouteTableB:
    Type: String

Resources:
  # Gateway endpoint that keeps S3 traffic on the AWS network
  S3Endpoint:
    Type: 'AWS::EC2::VPCEndpoint'
    Properties:
      # Limit the endpoint to read-only access on the data lake bucket
      PolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal: '*'
            Action:
              - 's3:GetObject'
            Resource:
              - !Sub "arn:aws:s3:::${S3DataHome}/*"
      # Associate the endpoint with the route tables that should reach S3 through it
      RouteTableIds:
        - !Ref RouteTableA
        - !Ref RouteTableB
      ServiceName: !Sub 'com.amazonaws.${AWS::Region}.s3'
      VpcId: !Ref VpcId

To be a little bit fancier, you can define a dependency between CloudFormation stacks with the following syntax.


In the "Parameters" section, add the VPC Stack Cloud Formation reference


Parameters:
  S3DataHome:
    Type: String
    Description: "The S3 Bucket Containing the Data Lake Data"

  NetworkStackName:
    Type: String
    Description: "The base VPC Network stack"
    Default: "PersonatorVpc"

Instead of passing the network parameters as arguments on the CloudFormation build, pass only the name of the VPC stack and use its exported values as imports in the resources that need them.


ScholarS3Connection:
  Type: AWS::Glue::Connection
  Properties:
    CatalogId: !Sub "${AWS::AccountId}"
    ConnectionInput:
      ConnectionType: NETWORK
      Description: "The connection to the S3 location of the Scholar Data"
      Name: "scholar-data-connection"
      PhysicalConnectionRequirements:
        AvailabilityZone: "us-east-1a"
        SecurityGroupIdList:
          - Fn::ImportValue:
              Fn::Sub: "${NetworkStackName}-SecurityGroupId"
        SubnetId:
          Fn::ImportValue:
            Fn::Sub: "${NetworkStackName}-PrivateSubnetAId"

In this way, you can build data science applications with CloudFormation templates alone and have a clean DevOps process that builds all of your network and application infrastructure through scripting.


Next Steps - Glue or ECS with Fargate?

As I explore implementing AWS::Glue::Job in CloudFormation, I am learning that all is not well with AWS Glue. The problems started small, with noticing that the format is not consistent.

DefaultArguments:
  "--job-bookmark-option": "job-bookmark-disable"
  "--job-language": "scala"
  "--class": "GlueApp"
  "--TempDir": "s3://aws-glue-temporary-account-us-east-1/tim.burns/scholar-acm"    

This string-keyed style does not match the other AWS CloudFormation resource types.


The S3 endpoint I defined above is only used if the connection is included in the Glue job. The way the connection is included in CloudFormation is a disgusting hack:

Connections:
  Connections:
  - !Ref ScholarS3Connection
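
Putting those pieces together, here is a rough sketch of how the full AWS::Glue::Job resource might look; the job name, role parameter, and script location are placeholders rather than values from the actual project.

ScholarGlueJob:
  Type: AWS::Glue::Job
  Properties:
    Name: "scholar-acm-job"                # placeholder name
    Role: !Ref GlueJobRole                 # assumed IAM role parameter
    GlueVersion: "2.0"
    Command:
      Name: glueetl
      ScriptLocation: "s3://my-glue-scripts/scholar-acm.scala"   # placeholder script path
    # String-keyed arguments, unlike most CloudFormation properties
    DefaultArguments:
      "--job-bookmark-option": "job-bookmark-disable"
      "--job-language": "scala"
      "--class": "GlueApp"
      "--TempDir": "s3://aws-glue-temporary-account-us-east-1/tim.burns/scholar-acm"
    # The doubly nested Connections block that routes the job through the VPC endpoint
    Connections:
      Connections:
        - !Ref ScholarS3Connection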

Support for generating job scripts is limited, and the recommendation is to build them in the console UI and copy the results into the deployment, which is a poor DevOps practice.


On top of that, I keep running across articles discussing how ECS with Fargate is cheaper and more effective than Glue.


AWS reposts the argument on their own blog, confirming that they know Glue is deeply flawed.

Snowflake does much of this process, but it is expensive. A simple query over 40 MB of data that takes less than 30 seconds can cost around $10 in Snowflake, and their pricing model is not transparent. This blog article does a good job of digging into the details behind outrageous Snowflake bills.


The other issue with Snowflake is that once the data is in Snowflake, it is not as readily accessible and transferable in the outside ecosystem as a Parquet file would be.



To be honest, I think the best route here is to go back to basics and write PySpark to turn CSV files into Parquet.
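
A minimal sketch of that job, assuming a Spark environment with S3 access and using placeholder paths rather than the real data lake locations, looks something like this:

from pyspark.sql import SparkSession

# Placeholder locations; substitute the actual data lake paths
CSV_PATH = "s3://my-data-lake/raw/scholar/*.csv"
PARQUET_PATH = "s3://my-data-lake/curated/scholar/"

spark = SparkSession.builder.appName("csv-to-parquet").getOrCreate()

# Read the CSV files, using the first row as headers and inferring column types
df = spark.read.csv(CSV_PATH, header=True, inferSchema=True)

# Write the same rows back out in Parquet format
df.write.mode("overwrite").parquet(PARQUET_PATH)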


Here is a very solid article outlining how to start from basic principles:


