Python Script for Deleting Old Files from S3 Bucket

I had a requirement to delete about 680,000 files from multiple folders in an S3 bucket. This Python script automates the process of deleting old files from an Amazon S3 bucket based on pattern matching of folder and file names. It connects to the S3 bucket, identifies files older than a specified timeframe, and deletes them while keeping a detailed audit trail. Here’s a breakdown of the script:

1. Setting Up:

  • The script imports the necessary modules: datetime and timedelta (from the datetime module) for date calculations, and boto3 for interacting with S3.
  • It defines variables for bucket name, prefix, file paths for storing S3 file names and files to be deleted, and the target file pattern for identification.

2. Gathering Files from S3:

  • A connection is established to S3 using boto3.
  • The list_objects_v2 paginator retrieves all files under the specified bucket and prefix. Without pagination, a single list_objects_v2 call returns at most 1,000 keys.
  • The script iterates over each page and extracts the file names, storing them in a text file (files_in_s3).
  • A timestamp is recorded to indicate the completion of this stage.

3. Identifying Files for Deletion:

  • The script calculates the date two months ago using timedelta and datetime.
  • It iterates through the list of files from S3 and checks if they:
    • Start with the specified pattern (my-file-name-pattern).
    • Contain the two-month-ago date (yy_months_ago) in their name.
  • If both conditions are met, the file name is written to another text file (files_to_delete) for deletion.
  • A timestamp and a count of files marked for deletion are printed.

4. Deleting Identified Files:

  • The script iterates through the list of files to be deleted.
  • For each file, it extracts the folder and region information.
  • It checks if the current folder or region is different from the previous one. If yes, it prints a timestamp indicating the start of deletion for that specific folder/region.
  • The script then uses the delete_object function to remove the file from the S3 bucket.

5. Completion and Audit Trail:

  • A final timestamp marks the completion of file deletion.
  • The script prints “End of program” as a closing message.

Benefits:

  • Automates deletion of old files, reducing storage costs and improving data management.
  • Maintains an audit trail of files identified for deletion and their removal timestamps.
  • Customizable to different bucket configurations and deletion criteria.

Note:

  • This script assumes the necessary AWS credentials are configured for accessing S3 resources.
  • Modify the script parameters like bucket name, prefix, pattern, and file paths as needed for your specific scenario.

This script provides a comprehensive and efficient way to manage and delete old files in your S3 bucket, ensuring optimal storage utilization and data governance.
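
The script relies on the default boto3 credential chain (environment variables, ~/.aws/credentials, or an instance/role profile). If you keep several credential sets, a minimal sketch of pointing boto3 at a named profile looks like this; the profile name is a placeholder:

import boto3

# Hypothetical named profile; replace with a profile defined in ~/.aws/credentials
session = boto3.Session(profile_name='my-profile')
s3 = session.client('s3')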

Code:

from datetime import datetime, timedelta
import boto3

now = datetime.now()
print(f"Starting at : {now}")
print(' ')

#
## Bucket details
#
bucket_name = 'my-bucket'
bucket_prefix = 'my-prefix/'
files_in_s3 = 'C:/dean/python/s3_list.txt'
files_to_delete = 'C:/dean/python/s3_delete.txt'

#
## Connect to S3 and get the file names
#
s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')
page_iterator = paginator.paginate(Bucket=bucket_name, Prefix=bucket_prefix)
with open(files_in_s3, 'w') as f:
    for page in page_iterator:
        contents = page.get('Contents', [])
        for item in contents:
            f.write(item['Key'] + '\n')  
now = datetime.now()
print(f"Collected files from S3 at {now}")
print(' ')

#
## find the n-2 month
#
n_months_ago = datetime.now() - timedelta(days=60)
yy_months_ago = n_months_ago.strftime('%Y/%m')
print(f"Deleting files for {yy_months_ago}")
print(' ')

#
## Write the files to be deleted to an audit trail
#
file_ctr = 0
file_out = open(files_to_delete, 'w')
with open(files_in_s3, 'r') as f:
    for line in f:
        file_name = line.strip()
        if file_name.startswith('my-file-name-pattern'):
            if yy_months_ago in file_name:
                file_out.write(file_name + '\n')
                file_ctr = file_ctr + 1
file_out.close()
now = datetime.now()
print(f"Identified files to delete at {now}")
print(f"Number of files to delete: {file_ctr}")
print(' ')

#
## Delete the files
#
prev_folder = ''
prev_region = ''
with open(files_to_delete, 'r') as f:
    for line in f:
        cur_folder = line.split('/')[3]
        cur_region = line.split('/')[4]
        if cur_folder != prev_folder or cur_region != prev_region:
            now = datetime.now()
            print(f"Deleting files from {cur_folder}/{cur_region} at {now}")
            prev_folder = cur_folder
            prev_region = cur_region
        file_name = line.strip()
        s3.delete_object(Bucket=bucket_name, Key=file_name)
print(' ')
now = datetime.now()
print(f"Completed file deletion at {now}")
print(' ')
print('End of program')
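
As a side note, when removing hundreds of thousands of objects, s3.delete_objects() can delete up to 1,000 keys per request, which is much faster than one delete_object call per file. A hedged sketch of batching the keys from the audit-trail file, reusing the bucket and file-path variables above:

import boto3

s3 = boto3.client('s3')
bucket_name = 'my-bucket'
files_to_delete = 'C:/dean/python/s3_delete.txt'

# Send deletes in batches of up to 1,000 keys per request
batch = []
with open(files_to_delete, 'r') as f:
    for line in f:
        batch.append({'Key': line.strip()})
        if len(batch) == 1000:
            s3.delete_objects(Bucket=bucket_name, Delete={'Objects': batch})
            batch = []
if batch:
    s3.delete_objects(Bucket=bucket_name, Delete={'Objects': batch})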

 

Create a Lambda alerting process

Introduction

The goal is to create a lambda that runs once a day and sends an alert listing all EC2 instances that are currently configured in all regions for a given account. The solution consists of:

  1. A role to provide permissions
  2. An SNS topic that can be subscribed to by users who wish to be notified
  3. A lambda written in python to identify the EC2 instances
  4. A scheduling process consisting of an EventBridge rule and an EventBridge trigger

Create the role

Navigate to the IAM Dashboard and click on “Roles” in the left panel

  1. Click on the orange “Create role” button
  2. Select “AWS service” under the “Trusted entity type”
  3. Select “Lambda” under the “Use case”
  4. Under the “Permissions policies” search for “AWSLambdaBasicExecutionRole” and select it
  5. Click on the orange “Next” button
  6. Provide a “Role name” and meaningful “Description”

Click on the orange “Create role” button. We will be modifying the role later to add more permissions.

Return to the IAM Roles dashboard and search for the role, as we have to add two more permissions:

  1. Click on the “Role name” and then on “Add permissions” and “Attach policies” on the next page
  2. On the next page, add the “AmazonEC2ReadOnlyAccess” policy, then repeat the process to add the “AmazonSNSFullAccess” policy

The role creation is now complete.
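
For reference, the same role and policy attachments can be created with boto3 instead of the console; a minimal sketch, with the role name as a placeholder (the managed policy ARNs are the standard AWS-managed ones):

import json
import boto3

iam = boto3.client('iam')

# Trust policy allowing the Lambda service to assume the role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "lambda.amazonaws.com"},
        "Action": "sts:AssumeRole"
    }]
}

iam.create_role(
    RoleName='dc-running-assets-lambda-role',   # placeholder name
    Description='Role for the daily EC2 inventory lambda',
    AssumeRolePolicyDocument=json.dumps(trust_policy)
)

for policy_arn in [
    'arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole',
    'arn:aws:iam::aws:policy/AmazonEC2ReadOnlyAccess',
    'arn:aws:iam::aws:policy/AmazonSNSFullAccess'
]:
    iam.attach_role_policy(RoleName='dc-running-assets-lambda-role', PolicyArn=policy_arn)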

Create the SNS topic

To demonstrate the AWS Command Line Interface (CLI), we will create the topic via a CLI command rather than the console. The AWS CLI command can be executed either from an EC2 instance with the required permissions or from AWS CloudShell. I will be using CloudShell as it does not require any setup. The command is as follows:

aws sns create-topic --name dc-running-assets-sns-topic

The output will display the ARN of the SNS topic. Save the ARN as it will be needed later.

Navigate to the “Amazon SNS” “Topics” dashboard and search for the SNS topic with the name from the above create command. Click on the “Name” and then on the orange “Create subscription” button on the next page. On the next page, populate the “Protocol” as “Email” and the “Endpoint” with your email address, then click on the orange “Create subscription” button.

You will receive an email requesting you to confirm subscription. After you click on the “Confirm subscription” link, you will be taken to the subscription confirmation webpage. This can also be confirmed by returning to the SNS dashboard and checking the subscriptions. Additionally, you will receive a subscription confirmation email.
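
The subscription step can also be scripted with boto3 if you prefer; a minimal sketch, with the topic ARN and email address as placeholders:

import boto3

sns = boto3.client('sns')
sns.subscribe(
    TopicArn='arn:aws:sns:us-east-2:xxxxx:dc-running-assets-sns-topic',  # placeholder ARN
    Protocol='email',
    Endpoint='you@example.com'  # placeholder address; the recipient must still confirm
)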

Create the lambda function in python

Navigate to the Lambda functions page in the console and click on the orange “Create function” button.

On the “Create function” page:

  1. Select the “Author from scratch” option
  2. Populate the “Function name”. I will use dc-running-assets-lambda
  3. Select Python 3.9 under the “Runtime” drop down
  4. Select x86_64 under “Architecture”
  5. Under “Change default execution role”, select “Use an existing role” and populate the “Existing role” drop down with the role created above

Finally click on the orange “Create function” button

On the next page, click on the “Code” tab if not already selected and replace the prepopulated code with the code below after making the following modifications:

  1. Replace the sns_topic_arn variable with the ARN of the SNS topic created earlier
  2. Comment or uncomment the lines with the comments “Running instances only” or “All instances” depending on your use case
  3. The “import os” is in place in the event you need to debug with print statements

import boto3
import os

def lambda_handler(event, context):
    
    sns_topic_arn = 'arn:aws:sns:us-east-2:xxxxx:dc-running-assets-sns-topic'
    
    ec2_regions = [region['RegionName'] for region in boto3.client('ec2').describe_regions()['Regions']]
    all_instances = []
    
    for region in ec2_regions:
        all_instances.append(' ')
        all_instances.append(f"**** Region: {region} ***")
        
        ec2 = boto3.client('ec2', region_name=region)

        # Running instances only
        #response = ec2.describe_instances(Filters=[{'Name': 'instance-state-name', 'Values': ['running']}])
        
        # All instances
        response = ec2.describe_instances()
        
        for reservation in response['Reservations']:
            for instance in reservation['Instances']:
                instance_id = instance['InstanceId']
                instance_state = instance['State']['Name']
                instance_type = instance['InstanceType']
                private_ip = instance.get('PrivateIpAddress', 'n/a')  # not every instance has a private IP (e.g. terminated)
                all_instances.append(f"Region: {region}, Inst. ID: {instance_id}, State: {instance_state}, Type: {instance_type}, Prvt. IP: {private_ip}")

    if all_instances:
        sns = boto3.client('sns')
        message = "List of EC2 Instances:\n" + '\n'.join(all_instances)
        sns.publish(TopicArn=sns_topic_arn, Subject="List of EC2 Instances", Message=message)
    
    return {
        'statusCode': 200,
        'body': 'Email sent successfully'
    }

After pasting the code, click on the “Deploy” button and the “Changes not deployed” message will be removed.

Configuring timeouts

Lambda functions are created with a default timeout of 3 seconds. This particular lambda needs approximately 45 seconds to execute as it loops through all the regions and all the EC2 instances in each region, so we need to increase the default timeout. This is accomplished as follows:

  1. Select the “Configuration” tab to the right of the “Code” tab and click on “General configuration”
  2. Click on the “Edit” button
  3. On the “Edit basic settings” page, I added a description in the “Description – optional” box
  4. Change the “Timeout” to 45 seconds (a boto3 equivalent is sketched below)
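
The same timeout change can also be made with a couple of lines of boto3; a minimal sketch, assuming the function name used above:

import boto3

lambda_client = boto3.client('lambda')
lambda_client.update_function_configuration(
    FunctionName='dc-running-assets-lambda',
    Timeout=45  # seconds
)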

Create an AWS event to trigger the lambda on a set schedule

Create the scheduler as follows:

  1. On the lambda page, click on the “Add trigger” button in the “Function overview” section at the top of the page
  2. On the Add trigger page, type “Schedule” into the “Select a source” box and select “EventBridge (CloudWatch events)”
  3. On the “Trigger configuration” page, select “Create a new rule” and populate
    • “Rule name” with the name of the rule
    • “rule description” with a meaningful description
  4. Under “Rule type”
    • Select “Schedule expression”
    • Enter the schedule in the “Schedule expression” box. For example, “cron(0 20 * * ? *)” indicates that the schedule is every day at 20:00 hours

Click on the orange “Add” button to create the rule
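
For completeness, the same rule and trigger can be created with boto3; a hedged sketch in which the rule name, statement ID, and Lambda ARN are placeholders:

import boto3

events = boto3.client('events')
lambda_client = boto3.client('lambda')

# Create (or update) the scheduled rule
rule = events.put_rule(
    Name='dc-running-assets-daily',            # placeholder rule name
    ScheduleExpression='cron(0 20 * * ? *)'    # every day at 20:00 UTC
)

# Allow EventBridge to invoke the function
lambda_client.add_permission(
    FunctionName='dc-running-assets-lambda',
    StatementId='dc-running-assets-eventbridge',  # placeholder statement id
    Action='lambda:InvokeFunction',
    Principal='events.amazonaws.com',
    SourceArn=rule['RuleArn']
)

# Point the rule at the Lambda function
events.put_targets(
    Rule='dc-running-assets-daily',
    Targets=[{'Id': '1', 'Arn': 'arn:aws:lambda:us-east-2:xxxxx:function:dc-running-assets-lambda'}]
)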

Conclusion

The lambda function will now execute as per the defined schedule and email the list of servers from the account.

Creating an Excel workbook with multiple worksheets via Python

Discovering the wonders of Python is an ever-surprising journey. Just when I thought I had seen it all, a new cool feature emerged before me. Recently, I stumbled upon a fascinating solution while exploring ways to create Excel workbooks with multiple worksheets. It’s called XlsxWriter – a Python module dedicated to crafting Excel XLSX files.

https://xlsxwriter.readthedocs.io/index.html

Below is a sample program to create an Excel Workbook with two worksheets.

#! /usr/bin/python3

import os
import xlsxwriter

# Function to write the 'Dream cars' worksheet
def write_sheet_cars():
    worksheet = workbook.add_worksheet('Dream cars')

    # Formatting for header row
    cell_format = workbook.add_format({'font_color': 'blue', 'bold': True, 'align': 'Left'})
    row = 0
    col = 0
    worksheet.write(row, col, 'Name & model', cell_format)
    worksheet.write(row, col + 1, 'Price (USD)', cell_format)

    # Formatting for data rows
    cell_format = workbook.add_format({'font_color': 'black', 'num_format': '$###,###,##0', 'align': 'Right'})
    row = row + 1
    col = 0
    worksheet.write(row, col, 'Alfa Romeo 8C 2900B Lungo Spider')
    worksheet.write(row, col + 1, 19800000, cell_format)

    row = row + 1
    col = 0
    worksheet.write(row, col, '1955 Jaguar D-Type')
    worksheet.write(row, col + 1, 21780000, cell_format)

    row = row + 1
    col = 0
    worksheet.write(row, col, '1957 Ferrari 335 Sport Scaglietti')
    worksheet.write(row, col + 1, 35710000, cell_format)

    # Autofit the columns to fit the content
    worksheet.autofit()

# Function to write the 'Cool planes' worksheet
def write_sheet_planes():
    worksheet = workbook.add_worksheet('Cool planes')

    # Formatting for header row
    cell_format = workbook.add_format({'font_color': 'blue', 'bold': True, 'align': 'Left'})
    row = 0
    col = 0
    worksheet.write(row, col, 'Name & model', cell_format)
    worksheet.write(row, col + 1, 'Maximum speed (km/h)', cell_format)

    # Formatting for data rows
    cell_format = workbook.add_format({'font_color': 'black', 'num_format': '###,###,##0', 'align': 'Right'})
    row = row + 1
    col = 0
    worksheet.write(row, col, 'Mig 25')
    worksheet.write(row, col + 1, 3000, cell_format)

    row = row + 1
    col = 0
    worksheet.write(row, col, 'F-15 Eagle')
    worksheet.write(row, col + 1, 3087, cell_format)

    row = row + 1
    col = 0
    worksheet.write(row, col, 'Su-27 Flanker')
    worksheet.write(row, col + 1, 2500, cell_format)

    # Autofit the columns to fit the content
    worksheet.autofit()

#
## Main processing
#

os.system('clear')

# Create a new Excel workbook
workbook = xlsxwriter.Workbook('Excel_Workbook_From_Python.xlsx')

# Write the 'Dream cars' worksheet
write_sheet_cars()

# Write the 'Cool planes' worksheet
write_sheet_planes()

# Close the workbook and save the changes
workbook.close()

The "Dream cars" worksheet:
Sheet 01

The "Cool planes" worksheet:
Sheet 01

Using Python/BOTO3 code to put test data into a DynamoDB table

In this example, I created a Python script that uses the Boto3 SDK to write data to a DynamoDB table. The script defines a function called put_item() that takes in four arguments: part_key, sort_key, alt_sort_key, and more_info. The function prints out the part_key, sort_key, and alt_sort_key, writes an item to a DynamoDB table with the given attributes, and prints out the response in JSON format.

The script also defines a main logic section that iterates through a list of alphabet letters and a list of numbers to generate different combinations of part_key, sort_key, and alt_sort_key. It calls the put_item() function for each combination of keys, with the more_info argument generated by repeating the alt_sort_key string 3,029 times (alt_sort_key * 3029) to pad the item with extra data.

The script uses the os module to clear the console before execution and the json module to print out responses in JSON format. It also imports the boto3 module to create a resource for DynamoDB operations.

#--------------------------------------------------------------------
#
# Author      : Dean Capps 
# Description : Put an item in a DynamoDB table
#
#--------------------------------------------------------------------

print("Starting")

import os
os.system('clear')

#
## import Python SDK for AWS
#
import boto3
import json

#
## create a boto3 resource for DynamoDB operations
#
dynamodb = boto3.resource("dynamodb")

#
## Write (put_item) into a dynamodb table with different 
## ReturnConsumedCapacity options
def put_item(part_key, sort_key, alt_sort_key, more_info):
    print(part_key, sort_key, alt_sort_key)
    try:
        TableName = dynamodb.Table('dean_test_table')
        response = TableName.put_item(
            Item={
                    'part_key'     : part_key,
                    'sort_key'     : sort_key,
                    'alt_sort_key' : alt_sort_key,
                    'more_info'    : more_info
            },
            #ReturnConsumedCapacity="NONE"
            ReturnConsumedCapacity="INDEXES"
        )
        # print(json.dumps(response, indent=2))
        
        # response = TableName.put_item(
            # Item={
                    # 'part_key':'b',
                    # 'sort_key':'4',
                    # 'alt_sort_key':'b1'
            # },
            # ReturnConsumedCapacity="TOTAL"
        # )
        # print(json.dumps(response, indent=2))
        
        # response = TableName.put_item(
            # Item={
                    # 'part_key':'d',
                    # 'sort_key':'1',
                    # 'alt_sort_key':'c1',
                    # 'more_info': {
                        # 'field-1':'Field 01 data',
                        # 'field-2':'Field 02 data'
                    # }
            # },
            # ReturnConsumedCapacity="INDEXES"
        #)
        print(json.dumps(response, indent=2))
        
    except Exception as e:
        print("Error writing to table")
        print(e)



#
## main logic
#
part_key = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
sort_key = ['1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12', '13', '14', '15', '16', '17', '18', '19', '20' ]
for i in range(len(part_key)):
    for j in range(len(sort_key)):
        part_key_temp = part_key[i]
        sort_key_temp = sort_key[j]
        alt_sort_key  = part_key_temp + sort_key_temp
        more_info     = alt_sort_key*3029
        put_item(part_key_temp, sort_key_temp, alt_sort_key, more_info)
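
As an aside, for loading larger volumes of test data, boto3's batch_writer() buffers items and sends them in batches, which is considerably faster than one put_item call per row; a minimal sketch against the same table (note that it does not return per-request consumed capacity):

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("dean_test_table")

# batch_writer() groups the puts into BatchWriteItem requests behind the scenes
with table.batch_writer() as batch:
    for pk in ['a', 'b', 'c']:
        for sk in ['1', '2', '3']:
            alt_sort_key = pk + sk
            batch.put_item(
                Item={
                    'part_key': pk,
                    'sort_key': sk,
                    'alt_sort_key': alt_sort_key,
                    'more_info': alt_sort_key * 3
                }
            )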

Using Python/BOTO3 code to create a DynamoDB table

This week I have been experimenting with the interface between DynamoDB and Python/BOTO3. In this example, I am creating a DynamoDB table along with a local secondary index (LSI) and a global secondary index (GSI). It is important that the following order be maintained in the table specification within the “dynamodb.create_table” structure:

a. Specify the key schema (Primary key in the RDBMS world)

b. Next specify the attributes of the table (Columns in the RDBMS world). Note that I have specified one extra attribute (alt_sort_key), which will be used in the LSI

c. In the next chunk of code, I create an LSI with the partition key matching the table’s key and the alternate sort key, alt_sort_key. Also included in the LSI specification is the projection clause which is the set of attributes that is to be copied from the table into the LSI. DynamoDB provides three different options for this:

KEYS_ONLY – Each item in the index consists only of the table partition key and sort key values, plus the index key values

INCLUDE – In addition to the attributes described in KEYS_ONLY, the secondary index will include other non-key attributes that you specify.

ALL – The secondary index includes all of the attributes from the source table.
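
As an illustration of the INCLUDE option, the projection block inside the index definition would look something like the snippet below; the non-key attribute name is just an example:

# Hypothetical INCLUDE projection: index keys plus the listed non-key attributes
include_projection = {
    'ProjectionType': 'INCLUDE',
    'NonKeyAttributes': ['more_info']
}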

d. The last structure is the GSI. Note that the LSI uses capacity from the table, while the GSI requires that you specify the capacity separately.

The entire create table code is structured to be within a try/catch logic to handle errors.

#! /usr/bin/env python
#--------------------------------------------------------------------
#
# Author      : Dean Capps 
# Description : Create a DynamoDB table
#
#--------------------------------------------------------------------
#
print("Starting")

import os
os.system('clear')

#
## import Python SDK for AWS
#
import boto3

#
## create a boto3 client for DynamoDB operations
#
dynamodb = boto3.client("dynamodb")

#
## Create the table
##
## Keep the order of
##   a. key schema
##   b. attributes
##   c. LSI
##   d. GSI
#
try:
    response = dynamodb.create_table(
        TableName="dean_test_table",  
        KeySchema=[
            {
                'AttributeName': 'part_key',
                'KeyType': 'HASH'
            },
            {
                "AttributeName": "sort_key",                
                'KeyType': 'RANGE'                
            }
        ],
        AttributeDefinitions=[
            {
                "AttributeName": "part_key",
                "AttributeType": "S"
            },
            {
                "AttributeName": "sort_key",
                "AttributeType": "S"
            },
            {
                "AttributeName": "alt_sort_key",
                "AttributeType": "S"
            }
        ],
        LocalSecondaryIndexes=[
            {
                'IndexName': 'dean_test_table_lsi',
                'KeySchema': [
                    {
                        'AttributeName': 'part_key',
                        'KeyType': 'HASH'
                    },
                    {
                        'AttributeName': 'alt_sort_key',
                        'KeyType': 'RANGE'
                    }
                ],
                'Projection': {
                    'ProjectionType': 'ALL'
                },
            }
        ],      
        GlobalSecondaryIndexes=[
            {
                'IndexName': 'dean_table_gsi',
                'KeySchema': [
                    {
                        'AttributeName': 'alt_sort_key',
                        'KeyType': 'HASH'
                    },
                ],
                'Projection': {
                    'ProjectionType': 'ALL'
                },
                'ProvisionedThroughput' :{
                    'ReadCapacityUnits': 1,
                    'WriteCapacityUnits': 1,
                }
            }
        ],        
        ProvisionedThroughput={
            "ReadCapacityUnits": 5,
            "WriteCapacityUnits": 5
        }
    )
    print("Table created successfully")
except Exception as e:
    print("Error creating table:")
    print(e)

Manipulating CSV files with Python

I had a CSV file with over 50 columns of which I only needed 11 columns in a slightly different order. I had been manipulating the file manually but got frustrated after the second time I had to do this repetitive manual task and turned to Python to see if I could write some quick and dirty code. As with most things in Python, it was relatively easy and quick to accomplish this:

import csv

with open("file_with_many_columns.csv", "r", newline="") as source:
    rdr = csv.reader(source)
    with open("file_with_columns_needed.csv", "w", newline="") as result:
        wtr = csv.writer(result)
        for r in rdr:
            #
            ## Keep only the needed columns, in the new order
            #
            wtr.writerow((r[2], r[1], r[0], r[3], r[4], r[5], r[6], r[7], r[27], r[44], r[45]))

The number in the square brackets corresponds to the column number in the original file. Like most things in Python, numbering starts at [0], which is the first or “A” column. The order of the columns in the writerow statement is the order in which they will appear in the output file.
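
If the source file has a header row, a variation with csv.DictReader/DictWriter lets you pick columns by name instead of counting positions; a sketch with placeholder column names:

import csv

wanted = ['col_c', 'col_b', 'col_a']  # placeholder header names, in the desired output order

with open('file_with_many_columns.csv', 'r', newline='') as source, \
     open('file_with_columns_needed.csv', 'w', newline='') as result:
    rdr = csv.DictReader(source)
    wtr = csv.DictWriter(result, fieldnames=wanted, extrasaction='ignore')
    wtr.writeheader()
    for row in rdr:
        wtr.writerow(row)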

Hope this helps with your use case.

Script to connect to multiple Linux servers with Python & Paramiko

I needed a quick and dirty script to check if I had an account and valid password on a list of servers. Instead of manually logging on to each server to confirm my access, I created a Python script using Paramiko. The script reads in a text file with a list of servers, then logs on to each server and tests whether I am able to successfully log on with my user account. In order for the script to execute successfully, you need to have Paramiko installed in addition to Python.

Script:

import sys
import time
import paramiko
import getpass
 
my_id = 'johnsmith'
my_password = getpass.getpass('Password:')
 
ssh = paramiko.SSHClient()
ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
 
out_file = open('connection_results.txt','w')
in_file = open('list_of_servers.txt', 'r') 
for host in in_file: 
      host=host.strip()
      print ("Checking server",host)
      
      try:
            ssh.connect(host, username=my_id, password=my_password)
            terminal = ssh.invoke_shell()
            terminal.send('junk')
            terminal.send('\n')
            time.sleep(5)
            print (terminal.recv(9999).decode('utf-8'))
      
            command = 'hostname'
            (stdin, stdout, stderr) = ssh.exec_command(command)
            for line in stdout.readlines():
                  print ("Connected to",line)
                  out_file.write("connected to " + line + "\n")
    
            terminal.send('exit')
            terminal.send('\n')
            time.sleep(5)
      
            ssh.close()
 
      except Exception as e:
            out_file.write("Could not connect to " + host + ": " + str(e) + "\n")
            
 
in_file.close()
out_file.close()

Using Python to search for key words in text files

As I have previously blogged, Python is very useful in solving everyday tasks that you may encounter. This week, I was tasked with reviewing resumes of potential candidates. As some of the resumes were unexpectedly long (8 pages in one instance), I came up with a list of search terms and then used the below Python code to identify the presence of these terms in the resumes. The script generates a skills matrix of candidate name and skill as a CSV that can be opened in Excel and filtered as needed. Additionally, it counts the number of search terms found in each resume to help rank candidates.

This process does not in any way replace reading the resumes. It is just a convenient way to generate a cross reference of skills and candidates.

import os
import string

searchstrings = ('install', 'patch', 'upgrade', 'migrate', 'shell', 'scripts', 'rac', 'asm', 'performance', 'data guard', 'golden gate', 'import', 'export', 'pump', 'loader', 'rman', 'recovery', 'tde', 'db2', 'mysql', 'sybase', 'mongo', 'teradata', 'postgres', 'postgresql', 'casandra', 'mssql', 'aws', 'jenkins', 'json', 'cloud', 'oci')
src_dict = ("C:/temp/Resumes/temp/") #Specify base directory
 
report_file_name = 'C:/temp/Resumes/temp.txt'
report_file = open(report_file_name, 'w')
#temp = 'Name,Skill' + '\n'
#report_file.write(temp)
                
for resumes in os.listdir(src_dict):
    #print (resumes, 'contains the below terms')
    files = os.path.join(src_dict, resumes)
    #print (files)
    strng = open(files)
    for line in strng.readlines():
        #print (line)
        for word in searchstrings:
            if  word in line.lower():
                #print ('    Found', word, 'at line-->', line.rstrip())
                temp = resumes + ',' + word + '\n'
                #print (resumes,',',word)
                report_file.write(temp)
                
report_file.close()

#
## Sort the data to remove the duplicates
#
duplicates_file = open(report_file_name, 'r').readlines()
content_set = sorted(set(duplicates_file))

unique_file_name = 'C:/temp/Resumes/report.csv'
unique_file = open(unique_file_name, 'w')
for line in content_set:
    unique_file.write(line)

unique_file.close()

#
## Count the number of skills that each person has
#
unique_file_name = 'C:/temp/Resumes/report.csv'

# Read the first line to initialize the previous candidate name
with open(unique_file_name) as unique_file:
    line = unique_file.readline()
fields = line.split(",")
prev_name = fields[0]
#print (prev_name)
skill_ctr = 0

unique_file = open(unique_file_name, 'r')
for line in unique_file:
    fields = line.split(",")
    curr_name = fields[0]
    if  curr_name == prev_name:
        skill_ctr = skill_ctr + 1
    else:
        temp = prev_name + ' has ' + str(skill_ctr) + ' skills.'
        print (temp)
        prev_name = curr_name
        skill_ctr = 1  # the current line already counts as one skill for the new candidate
temp = curr_name + ' has ' + str(skill_ctr) + ' skills.'
print (temp)

unique_file.close()
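
A shorter alternative for the counting step is collections.Counter over the deduplicated report file; a minimal sketch using the same report path:

import csv
from collections import Counter

counts = Counter()
with open('C:/temp/Resumes/report.csv', newline='') as f:
    for row in csv.reader(f):
        if len(row) == 2:
            name, skill = row
            counts[name] += 1

for name, skill_ctr in counts.items():
    print(name, 'has', skill_ctr, 'skills.')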

Python to the rescue – comparing text (fields) in a file

Python has increasingly become my go-to language for solving the annoying little problems that occur on a day to day basis. A good book for beginners is “Automate the Boring Stuff with Python: Practical Programming for Total Beginners” by Al Sweigart.

I have a process that runs multi-streamed on a server. The input parameters and the log files containing the output are read from and written to different directories, which are based on the stream name. Each log file is parsed at the end of the process to ensure that all the input parameters have been processed. I noticed that two streams were encountering issues because the stream name did not match the directory name in the input parameters. Rather than manually find the mismatch, I found it in a couple of minutes with the below Python code:

#!/usr/bin/python
import locale
locale.setlocale(locale.LC_ALL, 'en_US')
#
##-----------------------------------------------------------------------------
## Main Logic
##-----------------------------------------------------------------------------
#
all_parms = open('/opt/oracle/all_parms.txt', 'r')

for line in all_parms:
    #
    ## Filter out some non-data lines
    #
    if '/dbbackup' in line:
        stream_name, fld_01, fld_02, fld_03, fld_04, temp, fld_05, fld_06 = line.split("|")
        cmd_01, cmd_02, cmd_03, location = temp.split("/")
        if stream_name != location:
            print("Found a mismatch", stream_name, location)
            print(line)
            print(" ")

all_parms.close()

The input file looks like:

Stream_name|field01|field02|field03|field04|/directory/name/stream_name|field05|field06

Python/Selenium example

My daughter asked me to create a process that would log on to her high school grade website, collect her current grades and calculate the current grade-point average. I used this as an opportunity to become familiar with Selenium. You can find more information about Selenium here:

http://www.seleniumhq.org/

I coded this in Python and the below blog posting contains the relevant parts of the script with explanation. I was running this on Kali Linux as root.

Assuming that you already have Python installed on your system, you can get Selenium as follows:

pip install selenium

After you have installed Selenium, you have to install the correct driver for the browser that you intend to use. For this example, I was using the Firefox browser which needs the geckodriver. Additional information on this can be found at:
https://developer.mozilla.org/en-US/docs/Mozilla/QA/Marionette/WebDriver

The geckodriver location has to be added to the PATH variable. As an example:

export PATH=$PATH:/path/to/geckodriver

At the top of the program, after the usual imports for Python, add in the Selenium commands:

#!/usr/bin/python
#
import sys
import time
import string

from decimal import *

from selenium import webdriver
from selenium.webdriver.common.keys import Keys

This routine will set up the username and password and then open the browser:

def read_web_page():
    user_name = 'user name'
    pass_word = 'password'
        
    #
    ## Open the browser
    #
    driver = webdriver.Firefox()

Navigate to the web page with:

    driver.get("https:website name.jsp?status=login")

Confirm that you are at the correct web page with:

    assert "Campus Parent Portal Login" in driver.title
    temp = driver.current_url
    print 'Now on page ' + temp

Now that we are on the login page, find the elements username and password and supply the information from the variables created previously:

    elem = driver.find_element_by_name("username")
    elem.send_keys(user_name)

    elem = driver.find_element_by_name("password")
    elem.send_keys(pass_word)

Click on the sign in button and wait for 30 seconds for the website to respond:

    driver.find_element_by_css_selector("input[src*=btn_sign_in]").click()
    time.sleep(30)

Confirm that we are at the grades web page:

    #
    ## click on grades
    #
    temp = driver.current_url
    print 'Now on page ' + temp

This web page had multiple frames. The following commands switch to the frame named frameDetail, find the Grades link and click on it. After the click is complete, wait for 20 seconds for the website to respond:

    driver.switch_to_frame("frameDetail")
    driver.find_element_by_link_text("Grades").click()
    time.sleep(20)

After the website has responded (i.e. the grades are being displayed), grab the contents of the web page (i.e. page source) and convert all characters to “printable characters”:

    #
    ## get the grades
    #
    temp = driver.page_source
    printable = set(string.printable)
    temp2 = filter(lambda x: x in printable, temp)

Write the page source to a file as text:

    grades_file = open('/dean/python/grades_file_raw.txt','w')
    grades_file.write(temp2)
    grades_file.close()

Sign out of the web site:

    #
    ## Sign out
    #
    #driver.switch_to_frame("frameDetail")
    driver.find_element_by_link_text("Sign Out").click()
    time.sleep(10)

This particular website had an alert window that pops up with the question “Do you really want to log off?”. The below commands switch to the alert window and accept the alert, indicating consent to log off:

    alert = driver.switch_to_alert()
    alert.accept()
    time.sleep(10)
    driver.quit()

All of the above processing involves the selenium driver. Now that the information is available as text in a file, I was able to parse it with the regular Python commands. Some of these are shown below as examples:

    
    grades_file_raw = open('/dean/python/grades_file_raw.txt','r')
    grades_file_fin = open('/dean/python/grades_file_fin.txt','w')
    prev_teacher = ""
    for raw_line in grades_file_raw:
        printable = set(string.printable)
        temp = filter(lambda x: x in printable, raw_line)
        raw_line = temp

        if  'div id="studentName"' in raw_line:
            student = cut_str(2, '>', '<', raw_line)
            out_line = student
            grades_file_fin.write('%-60s\n' %(out_line))
            out_line = '-' * len(student)
            grades_file_fin.write('%-60s\n' %(out_line))

Additional code removed for brevity.