Recently, my focus has been on using Amazon Redshift for a data warehousing solution, and I wanted to share some insights gained while working with UNLOAD, COPY, and the SUPER data type. Unloading allows efficient extraction of data from Redshift to S3, facilitating downstream analytics and archival. Similarly, the COPY command serves as a linchpin for ingesting data into Redshift, enabling seamless integration from diverse sources. The SUPER data type is a flexible option for accommodating complex, semi-structured data without continuous schema changes.
Unload and rename a subset of columns
If you need to unload a subset of columns from a source table for loading into a target table with different column names, you can do so as follows:
unload (' select old_col_01 as new_col_01
,old_col_23 as new_col_02
,old_col_42 as new_col_03
from my_schema.my_table_name
where old_col_24 is not null
'
)
to 's3://my-s3-bucket/table_01_data_'
iam_role 'arn:aws:iam::93073150:role/my-redshift-role'
JSON
;
The column old_col_01 from the source table is renamed to new_col_01 for the target table.
The target S3 bucket should not contain any existing files with the prefix used in the unload command. A quick way to clean them up, without opening the AWS Console, filtering and selecting the files, and typing “permanently delete”, is to run the following from AWS CloudShell:
aws s3 rm s3://my-s3-bucket/ --recursive --exclude "*" --include "table_01_data_*"
Please make sure that you have populated the --exclude and --include options correctly. There is no warning or confirmation: all files matching the patterns are deleted immediately, with no recovery.
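Because the deletion is immediate, it can be worth previewing what will match before running the real command. The AWS CLI supports a --dryrun flag that lists the operations without performing them (the bucket and prefix below are the same placeholders used above):

```shell
# Preview which objects the rm command would delete, without deleting anything
aws s3 rm s3://my-s3-bucket/ --recursive \
    --exclude "*" --include "table_01_data_*" \
    --dryrun
```

Each matching object is printed with a (dryrun) prefix; remove --dryrun once the list looks right.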
Unload a subset of rows from a table
The Redshift UNLOAD command does not allow a LIMIT clause directly in the outermost SELECT. The workaround is to use it in a subquery, as shown below:
unload (' select *
from my_schema.my_table_name
where col_01 in
(
select col_01
from my_schema.my_table_name
limit 400
)
'
)
to 's3://my-s3-bucket/table_01_data_'
iam_role 'arn:aws:iam::93073150:role/my-redshift-role'
JSON
;
The subquery refers to the same table and applies the LIMIT clause; the outer query then selects only the matching rows.
Copy certain columns from a file that contains other columns
If you have a file in S3 that contains an unload of a table with a number of columns, and you only want to populate one or more columns in a different table, you can do so with the command below:
copy my_schema.my_table_name
(col_01, col_02)
from 's3://my-s3-bucket/table_01_data_'
iam_role 'arn:aws:iam::93073150:role/my-redshift-role'
region 'us-east-1'
FORMAT JSON 'auto ignorecase'
;
Columns named col_01 and col_02 are copied into the table my_schema.my_table_name, while the other columns in the S3 file are ignored.
Copying data and the Redshift SUPER data type
If you have a file containing JSON data and you need to load it into a Redshift table with a SUPER column, you can use the following command:
copy my_schema.my_table_name
from 's3://my-s3-bucket/table_01_data_'
iam_role 'arn:aws:iam::93073150:role/my-redshift-role'
region 'us-east-1'
FORMAT JSON 'noshred'
;
With 'noshred', each JSON record in the S3 file is loaded intact into the table's SUPER column, rather than being shredded into separate columns.
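For reference, a minimal sketch of what such a target table might look like (the table and column names are the hypothetical ones used in the querying example that follows):

```sql
-- Hypothetical target table with a single SUPER column
-- that receives each JSON record intact when loaded with 'noshred'
create table my_schema.my_super_data_table
(
    my_json_data super
);
```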
The JSON stored in a SUPER column can be queried directly. Assume the table is my_schema.my_super_data_table with a SUPER column named my_json_data, and that the JSON contains elements named create_time, account_id, and so on. You can refer to the individual elements as follows:
select my_json_data."create_time"::timestamp::date as create_date
,my_json_data."account_id" as account_id
,my_json_data."cust_name" as cust_name
from my_schema.my_super_data_table
where my_json_data."zip_code" = 12345
;
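If the JSON also contains arrays, Redshift's PartiQL syntax can unnest them in the FROM clause. The orders array and its amount element below are hypothetical, purely to illustrate the pattern:

```sql
-- Unnest a hypothetical "orders" array inside the SUPER column:
-- each array element becomes a row aliased as o
select t.my_json_data."account_id" as account_id
      ,o."amount"::decimal(10,2)   as order_amount
from my_schema.my_super_data_table as t
    ,t.my_json_data.orders as o
;
```

Each row of the result pairs the parent record's account_id with one element of its orders array.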