I had a requirement to delete about 680,000 files from multiple folders in an S3 bucket. This Python script automates the process of deleting old files from an Amazon S3 bucket based on pattern matching against folder and file names. It connects to the bucket, identifies files older than a specified timeframe, and deletes them while keeping a detailed audit trail. Here’s a breakdown of the script:
1. Setting Up:
- The script imports the necessary modules: `datetime` for date manipulation, `boto3` for interacting with S3, and `timedelta` for time calculations.
- It defines variables for the bucket name, prefix, the local file paths used to store the S3 file names and the files to be deleted, and the target file-name pattern used for identification.
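For reference, here is the setup portion of the full script listed at the end of this post:

```python
from datetime import datetime, timedelta
import boto3

# Bucket details and local files used for the audit trail
bucket_name = 'my-bucket'
bucket_prefix = 'my-prefix/'
files_in_s3 = 'C:/dean/python/s3_list.txt'
files_to_delete = 'C:/dean/python/s3_delete.txt'
```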
2. Gathering Files from S3:
- A connection is established to S3 using `boto3`.
- The `list_objects_v2` paginator retrieves all objects under the specified bucket and prefix. Without the paginator, only the first 1,000 keys would be listed.
- The script iterates over each page and extracts the object keys, storing them in a text file (`files_in_s3`), as shown in the excerpt below.
- A timestamp is recorded to indicate the completion of this stage.
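Continuing with the variables defined in the setup excerpt, the listing stage looks like this; the paginator is what lets the script see every key rather than only the first 1,000:

```python
s3 = boto3.client('s3')

# Paginate so every key under the prefix is returned, not just the first 1,000
paginator = s3.get_paginator('list_objects_v2')
page_iterator = paginator.paginate(Bucket=bucket_name, Prefix=bucket_prefix)

with open(files_in_s3, 'w') as f:
    for page in page_iterator:
        for item in page.get('Contents', []):
            f.write(item['Key'] + '\n')

print(f"Collected files from S3 at {datetime.now()}")
```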
3. Identifying Files for Deletion:
- The script calculates the date two months ago (approximated as 60 days) using `timedelta` and `datetime`.
- It iterates through the list of files from S3 and checks whether each one:
  - starts with the specified pattern (`my-file-name-pattern`), and
  - contains the two-months-ago date (`yy_months_ago`) in its name.
- If both conditions are met, the file name is written to another text file (`files_to_delete`) for deletion; see the excerpt below.
- A timestamp and a count of files marked for deletion are printed.
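The identification stage, condensed slightly from the full script (it reuses `files_in_s3` and `files_to_delete` from the setup excerpt):

```python
# Approximate "two months ago" as 60 days and keep only the year/month portion,
# which is the part of the key the script matches against
n_months_ago = datetime.now() - timedelta(days=60)
yy_months_ago = n_months_ago.strftime('%Y/%m')

file_ctr = 0
with open(files_in_s3, 'r') as f, open(files_to_delete, 'w') as file_out:
    for line in f:
        file_name = line.strip()
        if file_name.startswith('my-file-name-pattern') and yy_months_ago in file_name:
            file_out.write(file_name + '\n')
            file_ctr += 1

print(f"Identified {file_ctr} files to delete at {datetime.now()}")
```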
4. Deleting Identified Files:
- The script iterates through the list of files to be deleted.
- For each file, it extracts the folder and region information from the object key.
- If the current folder or region differs from the previous one, the script prints a timestamp indicating the start of deletion for that folder/region.
- The script then calls `delete_object` to remove the file from the S3 bucket (see the excerpt below).
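The deletion stage, again condensed from the full script. Note that the folder and region come from fixed positions in the object key (the fourth and fifth path segments), which is specific to this bucket's layout:

```python
prev_folder = ''
prev_region = ''
with open(files_to_delete, 'r') as f:
    for line in f:
        file_name = line.strip()
        # Folder/region positions are specific to this bucket's key layout
        parts = file_name.split('/')
        cur_folder, cur_region = parts[3], parts[4]
        # Log a timestamp whenever the script moves to a new folder/region
        if cur_folder != prev_folder or cur_region != prev_region:
            print(f"Deleting files from {cur_folder}/{cur_region} at {datetime.now()}")
            prev_folder, prev_region = cur_folder, cur_region
        s3.delete_object(Bucket=bucket_name, Key=file_name)
```

At this volume, the batch `delete_objects` call, which accepts up to 1,000 keys per request, would reduce the number of API round trips considerably; the script sticks with per-key `delete_object` calls for simplicity.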
5. Completion and Audit Trail:
- A final timestamp marks the completion of file deletion.
- The script prints “End of program” as a closing message.
Benefits:
- Automates deletion of old files, reducing storage costs and improving data management.
- Maintains an audit trail of files identified for deletion and their removal timestamps.
- Customizable to different bucket configurations and deletion criteria.
Note:
- This script assumes the necessary AWS credentials are configured for accessing S3 resources.
- Modify the script parameters like bucket name, prefix, pattern, and file paths as needed for your specific scenario.
This script provides a simple, repeatable way to identify and delete old files in your S3 bucket, keeping storage under control and leaving an audit trail of exactly what was removed and when.
Code:
```python
from datetime import datetime, timedelta
import boto3

now = datetime.now()
print(f"Starting at : {now}")
print(' ')

#
# ## Bucket details
#
bucket_name = 'my-bucket'
bucket_prefix = 'my-prefix/'
files_in_s3 = 'C:/dean/python/s3_list.txt'
files_to_delete = 'C:/dean/python/s3_delete.txt'

#
# ## Connect to S3 and get the file names
#
s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')
page_iterator = paginator.paginate(Bucket=bucket_name, Prefix=bucket_prefix)

with open(files_in_s3, 'w') as f:
    for page in page_iterator:
        contents = page.get('Contents', [])
        for item in contents:
            f.write(item['Key'] + '\n')

now = datetime.now()
print(f"Collected files from S3 at {now}")
print(' ')

#
# ## find the n-2 month
#
n_months_ago = datetime.now() - timedelta(days=60)
yy_months_ago = n_months_ago.strftime('%Y/%m')
print(f"Deleting files for {yy_months_ago}")
print(' ')

#
# ## Write the files to be deleted to an audit trail
#
file_ctr = 0
file_out = open(files_to_delete, 'w')
with open(files_in_s3, 'r') as f:
    for line in f:
        file_name = line.strip()
        if file_name.startswith('my-file-name-pattern'):
            if yy_months_ago in file_name:
                file_out.write(file_name + '\n')
                file_ctr = file_ctr + 1

now = datetime.now()
print(f"Identified files to delete at {now}")
temp = 'Number of files to delete ' + str(file_ctr)
print(temp)
print(' ')
file_out.close()

#
# ## Delete the files
#
prev_folder = ''
prev_region = ''
with open(files_to_delete, 'r') as f:
    for line in f:
        cur_folder = line.split('/')[3]
        cur_region = line.split('/')[4]
        if cur_folder != prev_folder or cur_region != prev_region:
            now = datetime.now()
            print(f"Deleting files from {cur_folder}/{cur_region} at {now}")
            prev_folder = cur_folder
            prev_region = cur_region
        file_name = line.strip()
        s3.delete_object(Bucket=bucket_name, Key=file_name)

print(' ')
now = datetime.now()
print(f"Completed file deletion at {now}")
print(' ')
print('End of program')
```