Thursday, February 13, 2014

AWS Redshift/S3 Data Warehouse Solution

Well, over the last week I have been asked to set up an Amazon Web Services Redshift data warehouse solution. At first I was a little taken aback, since I didn't know much about it. But over the past couple of days I have been reading over the infrastructure and design principles. At the end of the day, after jumping through a few hoops, it comes down to a traditional star schema design on a PostgreSQL backend. Below I will outline some key resources to help get started with the project.

  • Create an account with Amazon. Most of you already have one; you just need to sign in to the AWS Console. 
  • You need to set up a drop point in S3. This bucket lets you drop bulk-load files into the online store for batch operations over to Redshift.
  • You need to set up a Redshift cluster. Size and scale are key here to cost. Make sure to start small and scale as your data set grows.
  • You need to set up external IP access. By default you will not be able to connect to the cluster endpoint via JDBC or ODBC connections until your IP is authorized in the cluster's security group.
  • You need to set up local dev connectivity. I used SQL Workbench/J via the JDBC driver. 
  • Now that you are set up and connected, you can start creating a schema: databases, users, tables, and views. One thing to keep in mind while developing your schema is that you are working with a limited subset of the PostgreSQL data types. 
  • How to get mass amounts of data into your new cloud-hosted solution. This is where it gets a little tricky. Typically, a quick ETL tool like SSIS or Talend would do the trick, but these tools load data with row-by-row INSERT statements, which is not what Redshift is optimized for. The difference is 10 rows per second versus 100K per second. So Amazon recommends the S3-to-Redshift "COPY" command. I would compare this command to a T-SQL bulk copy command in MSSQL. 
  • A programmatic solution for loading data into S3
    • I found the AWS SDK, which has examples of how to do this. The key is to make sure the permissions are set up correctly and the region is bound correctly in the app config.
  • Using Data Pipeline to automate moving data from your S3 drop point to Redshift
    • more to come
  • Staging Updates and Deletes
    • more to come
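As a sketch of the schema step above, here is a hypothetical star-schema pair (one dimension, one fact table) that sticks to data types Redshift supports. The table and column names are my own examples, and the helper assumes an open DB-API cursor (e.g. psycopg2 connected to the cluster endpoint), not any specific library:

```python
# Hypothetical star-schema DDL for Redshift, using only supported
# PostgreSQL data types (no arrays; IDENTITY instead of SERIAL).
DDL_DIM_CUSTOMER = """
CREATE TABLE dim_customer (
    customer_key BIGINT IDENTITY(1,1),
    customer_id  VARCHAR(64) NOT NULL,
    full_name    VARCHAR(256),
    created_at   TIMESTAMP
) DISTSTYLE ALL;
"""

DDL_FACT_SALES = """
CREATE TABLE fact_sales (
    sale_id      BIGINT NOT NULL,
    customer_key BIGINT NOT NULL,
    sale_date    DATE   NOT NULL,
    amount       DECIMAL(12,2)
)
DISTKEY (customer_key)
SORTKEY (sale_date);
"""

def run_ddl(cursor, statements):
    """Execute each DDL statement on an open database cursor."""
    for stmt in statements:
        cursor.execute(stmt)
```

Small dimensions get DISTSTYLE ALL so they are replicated to every node, while the fact table is distributed on the join key and sorted by date for range scans.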
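For the programmatic S3 load above, here is a minimal sketch of the drop-point upload. The bucket name, prefix, and dated-key layout are my own assumptions; the client is anything with an `upload_file(path, bucket, key)` method, such as the S3 client from the AWS SDK for Python, with credentials and region configured as noted above:

```python
import datetime

def drop_point_key(prefix, filename, when=None):
    """Build a dated S3 object key like 'redshift-drop/2014/02/13/sales.csv'
    so each day's batch files are grouped under one prefix (my convention,
    not an AWS requirement)."""
    when = when or datetime.date.today()
    return "%s/%04d/%02d/%02d/%s" % (prefix, when.year, when.month, when.day, filename)

def upload_batch_file(s3_client, bucket, local_path, prefix, filename):
    """Upload a local bulk-load file to the S3 drop point and return its key."""
    key = drop_point_key(prefix, filename)
    s3_client.upload_file(local_path, bucket, key)
    return key
```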
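And for the COPY step, a small helper that builds the statement Redshift expects for a file sitting in the drop point. The table, bucket, and key names are illustrative; the statement would be run over the same JDBC/ODBC connection used for the DDL:

```python
# Hypothetical builder for the S3-to-Redshift COPY statement. The
# CREDENTIALS clause embeds the access key pair; delimiter defaults
# to a comma for CSV-style batch files.
def build_copy_statement(table, bucket, key, access_key, secret_key, delimiter=","):
    return (
        "COPY %s FROM 's3://%s/%s' "
        "CREDENTIALS 'aws_access_key_id=%s;aws_secret_access_key=%s' "
        "DELIMITER '%s';"
        % (table, bucket, key, access_key, secret_key, delimiter)
    )
```

This is the bulk path that gets you the 100K-rows-per-second range instead of row-by-row inserts.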

Please check back; as I develop the solution I will provide additional updates. 

