I had a requirement to delete about 680,000 files from multiple folders in an S3 bucket. This Python script automates the process of deleting old files from an Amazon S3 bucket based on pattern matching against folder and file names. It connects to the bucket, identifies files older than a specified timeframe, and deletes them while keeping a detailed audit trail. Here’s a breakdown of the script:
1. Setting Up:
- The script imports the necessary modules: `datetime` for date manipulation, `boto3` for interacting with S3, and `timedelta` for time calculations.
- It defines variables for the bucket name, prefix, the local file paths used to store the S3 file names and the files to be deleted, and the target file-name pattern used for identification.
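For reference, here is the setup portion of the full script listed at the end of this post:

```python
from datetime import datetime, timedelta
import boto3

# Bucket details and local files used for the audit trail
bucket_name = 'my-bucket'
bucket_prefix = 'my-prefix/'
files_in_s3 = 'C:/dean/python/s3_list.txt'
files_to_delete = 'C:/dean/python/s3_delete.txt'
```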
2. Gathering Files from S3:
- A connection is established to S3 using `boto3`.
- The `list_objects_v2` paginator retrieves all objects under the specified bucket and prefix. Without the paginator, only the first 1,000 keys would be listed.
- The script iterates over each page and extracts the object keys, storing them in a text file (`files_in_s3`), as shown in the excerpt below.
- A timestamp is recorded to indicate the completion of this stage.
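Continuing with the variables defined in the setup excerpt, the listing stage looks like this; the paginator is what lets the script see every key rather than only the first 1,000:

```python
s3 = boto3.client('s3')

# Paginate so every key under the prefix is returned, not just the first 1,000
paginator = s3.get_paginator('list_objects_v2')
page_iterator = paginator.paginate(Bucket=bucket_name, Prefix=bucket_prefix)

with open(files_in_s3, 'w') as f:
    for page in page_iterator:
        for item in page.get('Contents', []):
            f.write(item['Key'] + '\n')

print(f"Collected files from S3 at {datetime.now()}")
```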
3. Identifying Files for Deletion:
- The script calculates the date two months ago (approximated as 60 days) using `timedelta` and `datetime`.
- It iterates through the list of files from S3 and checks whether each one:
  - starts with the specified pattern (`my-file-name-pattern`), and
  - contains the two-months-ago date (`yy_months_ago`) in its name.
- If both conditions are met, the file name is written to another text file (`files_to_delete`) for deletion; see the excerpt below.
- A timestamp and a count of files marked for deletion are printed.
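The identification stage, condensed slightly from the full script (it reuses `files_in_s3` and `files_to_delete` from the setup excerpt):

```python
# Approximate "two months ago" as 60 days and keep only the year/month portion,
# which is the part of the key the script matches against
n_months_ago = datetime.now() - timedelta(days=60)
yy_months_ago = n_months_ago.strftime('%Y/%m')

file_ctr = 0
with open(files_in_s3, 'r') as f, open(files_to_delete, 'w') as file_out:
    for line in f:
        file_name = line.strip()
        if file_name.startswith('my-file-name-pattern') and yy_months_ago in file_name:
            file_out.write(file_name + '\n')
            file_ctr += 1

print(f"Identified {file_ctr} files to delete at {datetime.now()}")
```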
4. Deleting Identified Files:
- The script iterates through the list of files to be deleted.
- For each file, it extracts the folder and region information from the object key.
- If the current folder or region differs from the previous one, the script prints a timestamp indicating the start of deletion for that folder/region.
- The script then calls `delete_object` to remove the file from the S3 bucket (see the excerpt below).
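The deletion stage, again condensed from the full script. Note that the folder and region come from fixed positions in the object key (the fourth and fifth path segments), which is specific to this bucket's layout:

```python
prev_folder = ''
prev_region = ''
with open(files_to_delete, 'r') as f:
    for line in f:
        file_name = line.strip()
        # Folder/region positions are specific to this bucket's key layout
        parts = file_name.split('/')
        cur_folder, cur_region = parts[3], parts[4]
        # Log a timestamp whenever the script moves to a new folder/region
        if cur_folder != prev_folder or cur_region != prev_region:
            print(f"Deleting files from {cur_folder}/{cur_region} at {datetime.now()}")
            prev_folder, prev_region = cur_folder, cur_region
        s3.delete_object(Bucket=bucket_name, Key=file_name)
```

At this volume, the batch `delete_objects` call, which accepts up to 1,000 keys per request, would reduce the number of API round trips considerably; the script sticks with per-key `delete_object` calls for simplicity.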
5. Completion and Audit Trail:
- A final timestamp marks the completion of file deletion.
- The script prints “End of program” as a closing message.
Benefits:
- Automates deletion of old files, reducing storage costs and improving data management.
- Maintains an audit trail of files identified for deletion and their removal timestamps.
- Customizable to different bucket configurations and deletion criteria.
Note:
- This script assumes the necessary AWS credentials are configured for accessing S3 resources.
- Modify the script parameters like bucket name, prefix, pattern, and file paths as needed for your specific scenario.
This script provides a simple, repeatable way to identify and delete old files in your S3 bucket, keeping storage under control and leaving an audit trail of exactly what was removed and when.
Code:
```python
from datetime import datetime, timedelta
import boto3

now = datetime.now()
print(f"Starting at : {now}")
print(' ')

#
# ## Bucket details
#
bucket_name = 'my-bucket'
bucket_prefix = 'my-prefix/'
files_in_s3 = 'C:/dean/python/s3_list.txt'
files_to_delete = 'C:/dean/python/s3_delete.txt'

#
# ## Connect to S3 and get the file names
#
s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')
page_iterator = paginator.paginate(Bucket=bucket_name, Prefix=bucket_prefix)

with open(files_in_s3, 'w') as f:
    for page in page_iterator:
        contents = page.get('Contents', [])
        for item in contents:
            f.write(item['Key'] + '\n')

now = datetime.now()
print(f"Collected files from S3 at {now}")
print(' ')

#
# ## find the n-2 month
#
n_months_ago = datetime.now() - timedelta(days=60)
yy_months_ago = n_months_ago.strftime('%Y/%m')
print(f"Deleting files for {yy_months_ago}")
print(' ')

#
# ## Write the files to be deleted to an audit trail
#
file_ctr = 0
file_out = open(files_to_delete, 'w')
with open(files_in_s3, 'r') as f:
    for line in f:
        file_name = line.strip()
        if file_name.startswith('my-file-name-pattern'):
            if yy_months_ago in file_name:
                file_out.write(file_name + '\n')
                file_ctr = file_ctr + 1

now = datetime.now()
print(f"Identified files to delete at {now}")
temp = 'Number of files to delete ' + str(file_ctr)
print(temp)
print(' ')
file_out.close()

#
# ## Delete the files
#
prev_folder = ''
prev_region = ''
with open(files_to_delete, 'r') as f:
    for line in f:
        cur_folder = line.split('/')[3]
        cur_region = line.split('/')[4]
        if cur_folder != prev_folder or cur_region != prev_region:
            now = datetime.now()
            print(f"Deleting files from {cur_folder}/{cur_region} at {now}")
            prev_folder = cur_folder
            prev_region = cur_region
        file_name = line.strip()
        s3.delete_object(Bucket=bucket_name, Key=file_name)

print(' ')
now = datetime.now()
print(f"Completed file deletion at {now}")
print(' ')
print('End of program')
```