Python Script for Deleting Old Files from S3 Bucket

I had a requirement to delete about 680,000 files from multiple folders in an S3 bucket. This Python script automates the process of deleting old files from an Amazon S3 bucket based on pattern matching against folder and file names. It connects to the S3 bucket, identifies files older than a specified timeframe, and deletes them while keeping a detailed audit trail. Here’s a breakdown of the script:

1. Setting Up:

  • The script imports the necessary modules: datetime for date manipulation, boto3 for interacting with S3, and timedelta for time calculations.
  • It defines variables for bucket name, prefix, file paths for storing S3 file names and files to be deleted, and the target file pattern for identification.

2. Gathering Files from S3:

  • A connection is established to S3 using boto3.
  • The list_objects_v2 paginator retrieves all files under the specified bucket and prefix. Without pagination, a single list_objects_v2 call returns at most 1,000 objects, so only the first 1,000 files would be listed.
  • The script iterates over each page and extracts the file names, storing them in a text file (files_in_s3).
  • A timestamp is recorded to indicate the completion of this stage.
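To make the 1,000-object limit concrete, here is a sketch of what the paginator does under the hood, assuming a boto3-style client: keep calling list_objects_v2 with the continuation token until IsTruncated is false. The function name is illustrative, not from the script.

```python
def list_all_keys(s3_client, bucket, prefix):
    """Collect every key under bucket/prefix by following continuation tokens."""
    keys = []
    token = None
    while True:
        kwargs = {'Bucket': bucket, 'Prefix': prefix}
        if token:
            kwargs['ContinuationToken'] = token
        page = s3_client.list_objects_v2(**kwargs)
        # Each page holds at most 1,000 objects.
        keys.extend(item['Key'] for item in page.get('Contents', []))
        if not page.get('IsTruncated'):
            return keys
        token = page['NextContinuationToken']
```

The get_paginator('list_objects_v2') call in the script performs exactly this loop for you, yielding one page per request.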

3. Identifying Files for Deletion:

  • The script calculates the date two months ago using timedelta and datetime.
  • It iterates through the list of files from S3 and checks if they:
    • Start with the specified pattern (my-file-name-pattern).
    • Contain the two-month-ago date (yy_months_ago) in their name.
  • If both conditions are met, the file name is written to another text file (files_to_delete) for deletion.
  • A timestamp and a count of files marked for deletion are printed.
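Note that the script approximates "two months ago" as 60 days, which can land in the wrong calendar month (e.g., 60 days before April 30 is March 1, not February). If you need the actual calendar month, a minimal stdlib-only alternative is sketched below; the function name is hypothetical.

```python
from datetime import datetime

def months_ago_prefix(ref, months=2):
    """Return 'YYYY/MM' for the calendar month `months` before `ref`."""
    # Roll the month number back, borrowing from the year when it
    # goes below 1, then format to match the S3 key layout.
    month = ref.month - months
    year = ref.year
    while month < 1:
        month += 12
        year -= 1
    return f"{year}/{month:02d}"
```

For example, months_ago_prefix(datetime(2024, 1, 15)) gives '2023/11', correctly crossing the year boundary.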

4. Deleting Identified Files:

  • The script iterates through the list of files to be deleted.
  • For each file, it extracts the folder and region information.
  • It checks if the current folder or region is different from the previous one. If yes, it prints a timestamp indicating the start of deletion for that specific folder/region.
  • The script then uses the delete_object function to remove the file from the S3 bucket.
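One delete_object call per file means one HTTP request per file, which is slow at this scale. S3 also offers delete_objects, which removes up to 1,000 keys per request. The helper below is an assumed refactoring of the deletion step, not part of the original script.

```python
def delete_in_batches(s3_client, bucket, keys, batch_size=1000):
    """Delete keys with delete_objects, up to batch_size (max 1,000) per call."""
    for i in range(0, len(keys), batch_size):
        batch = keys[i:i + batch_size]
        s3_client.delete_objects(
            Bucket=bucket,
            # Quiet=True suppresses per-key success entries in the response.
            Delete={'Objects': [{'Key': k} for k in batch], 'Quiet': True},
        )
```

For hundreds of thousands of files this cuts the number of API calls by roughly a factor of 1,000, at the cost of coarser per-file logging.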

5. Completion and Audit Trail:

  • A final timestamp marks the completion of file deletion.
  • The script prints “End of program” as a closing message.

Benefits:

  • Automates deletion of old files, reducing storage costs and improving data management.
  • Maintains an audit trail of files identified for deletion and their removal timestamps.
  • Customizable to different bucket configurations and deletion criteria.

Note:

  • This script assumes the necessary AWS credentials are configured for accessing S3 resources.
  • Modify the script parameters like bucket name, prefix, pattern, and file paths as needed for your specific scenario.

This script provides a comprehensive and efficient way to manage and delete old files in your S3 bucket, ensuring optimal storage utilization and data governance.

Code:

from datetime import datetime, timedelta
import boto3

now = datetime.now()
print(f"Starting at : {now}")
print(' ')

#
## Bucket details
#
bucket_name = 'my-bucket'
bucket_prefix = 'my-prefix/'
files_in_s3 = 'C:/dean/python/s3_list.txt'
files_to_delete = 'C:/dean/python/s3_delete.txt'

#
## Connect to S3 and get the file names
#
s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')
page_iterator = paginator.paginate(Bucket=bucket_name, Prefix=bucket_prefix)
with open(files_in_s3, 'w') as f:
    for page in page_iterator:
        contents = page.get('Contents', [])
        for item in contents:
            f.write(item['Key'] + '\n')  
now = datetime.now()
print(f"Collected files from S3 at {now}")
print(' ')

#
## Find the date two months ago (approximated as 60 days)
#
n_months_ago = datetime.now() - timedelta(days=60)
yy_months_ago = n_months_ago.strftime('%Y/%m')
print(f"Deleting files for {yy_months_ago}")
print(' ')

#
## Write the files to be deleted to an audit trail
#
file_ctr = 0
with open(files_in_s3, 'r') as f, open(files_to_delete, 'w') as file_out:
    for line in f:
        file_name = line.strip()
        if file_name.startswith('my-file-name-pattern') and yy_months_ago in file_name:
            file_out.write(file_name + '\n')
            file_ctr += 1
now = datetime.now()
print(f"Identified files to delete at {now}")
print(f"Number of files to delete: {file_ctr}")
print(' ')

#
## Delete the files
#
prev_folder = ''
prev_region = ''
with open(files_to_delete, 'r') as f:
    for line in f:
        # Strip the newline before splitting so the last path
        # component never carries a trailing '\n'.
        file_name = line.strip()
        parts = file_name.split('/')
        cur_folder = parts[3]
        cur_region = parts[4]
        if cur_folder != prev_folder or cur_region != prev_region:
            now = datetime.now()
            print(f"Deleting files from {cur_folder}/{cur_region} at {now}")
            prev_folder = cur_folder
            prev_region = cur_region
        s3.delete_object(Bucket=bucket_name, Key=file_name)
print(' ')
now = datetime.now()
print(f"Completed file deletion at {now}")
print(' ')
print('End of program')

 

Author: Dean Capps

Database consultant at Amazon Web Services.