Automating website scraping

For many years I have had a personal quest to develop a simple way to automate the collection of information from websites. In 2017, I played around with a Python/Selenium-based solution. See blog post Python/Selenium example for details.

While the Python/Selenium-based solution was interesting it did not meet all my requirements such as ease of use, low development time, and flexibility. While browsing the web I ran across Windows Power Automate which looked interesting. I was pleasantly surprised to see that a desktop version of the software was included in my Microsoft 365 subscription. I downloaded and installed the product and attempted to automate the collection of information.

The use case was to get the account balance from my brokerage account, the current DOW and S&P numbers, and write the information to a CSV file with the current date. The steps are as follows:

01. Click on create a new flow (aka program unit)
02. Set up variables for user names and passwords by clicking on “Variables” on the left and dragging the function “Set variable” into the body of the flow
03. Launch a new web browser instance with “Launch new name-of-browser”. I was not able to get the Firefox plugins to work, however, Microsoft Edge worked well
04. Populate the user name and password with “Populate text field on web page”. The product displays a UI when you are on the web page so you simply have to highlight the field on the web page and assign the variable set in 02. above
05. Usually you have to click on a login button of some sort. This is accomplished with the “Press button on web page” function
06. I coded a “Wait” for 30 seconds to let the web page complete the log on
07. Use the “Extract data from web page” to scrape the required data and store it into a variable
08. The “Get current date and time” function retrieves the current date into a variable
09. Write the data into a file with the “Write text to file” function
10. Close the web browser with the “Close web browser” function

Repeat the above steps for each web page you need to extract the information from. The individual functions can be moved up or down in the flow and can be copied to repeat a function.

One issue I ran into was that the is written to a file in columnar format. I am sure that I will be able to change this as I get more familiar with the product. In the interim, I created a few lines of code in Python and executed it from the flow with the “Run DOS command function” to format the data into a CSV file. Now when I want the information from these three websites, I run the flow and open the CSV file in Excel.

The tool also has options to debug such as run from a step or stop at a step. The product is intuitive and easy to learn for anyone with a few basic coding skills.

Complete documentation is here.

Creating an AWS RDS instance via the command line interface (CLI)

We recently ran into an unusual circumstance working with a customer and were unable to create an RDS instance via the console and ended up creating the instance via the command line. This is a useful command in that we can use it in a script to create multiple instances.

The complete documentation for all options can be found at AWS CLI reference.

Code with explanation of the options

I will attempt to explain each option of the command below. The complete code in one unit is at the bottom of the post.

aws rds create-db-instance \
--db-name MYDB \

The name of the database. This option has different results depending on the database engine you use. This example is for Oracle.

--db-instance-identifier mydb \

The DB instance identifier. This parameter is stored as a lowercase string.

--allocated-storage 300 \

The amount of storage in gibibytes (GiB) to allocate for the DB instance. The constraints on the storage are different depending on the database engine you use.

--db-instance-class db.m5.2xlarge \

The compute and memory capacity of the DB instance. Not all DB instance classes are available in all Amazon Web Services Regions, or for all database engines.

--engine oracle-se2-cdb \

The type of database engine to be used for this instance. In this example, it is Oracle, Standard Edition container database.

--master-username admin \

The account name for the master user.

--master-user-password admin_pwd \

The password for the master user. Please pick a secure password.

--vpc-security-group-ids "sg1" "sg2" \

A list of Amazon EC2 VPC security groups to associate with this DB instance in the format “string 01””String 02”

--db-subnet-group-name “sub_net_1” \

A DB subnet group to associate with this DB instance.

--db-parameter-group-name default.oracle-se2-cdb-19 \

The name of the DB parameter group to associate with this DB instance. If you do not specify a value, then the default DB parameter group for the specified DB engine and version is used.

--backup-retention-period 2 \

The number of days for which automated backups are retained. Setting this parameter to a positive number enables backups. Setting this parameter to 0 disables automated backups.

--port 1521 \

The database port. There are different default values depending on the database engine.

--no-multi-az \

Specifies that the database is not to be deployed in multi-AZ mode. This is a Boolean and can be specified as –multi-az or –no-multi-az

--engine-version 19.0.0.0.ru-2022-04.rur-2022-04.r1 \

The version number of the database engine to use. For a list of valid engine versions, use the DescribeDBEngineVersions operation.

--auto-minor-version-upgrade \

Indicates whether minor engine upgrades are applied automatically to the DB instance during the maintenance window. By default, minor engine upgrades are applied automatically.

--license-model license-included \

License model information for this DB instance.

--option-group-name my_option_group \

Indicates that the DB instance should be associated with the specified option group.

--character-set-name WE8ISO8859P1 \

For supported engines, such as Oracle, this value indicates that the DB instance should be associated with the specified Character Set.

--nchar-character-set-name AL16UTF16 \

The name of the NCHAR character set for the Oracle DB instance.

--no-publicly-accessible \

Whether the DB instance is publicly accessible

--storage-type gp2 \

Specifies the storage type to be associated with the DB instance.

--storage-encrypted \

Specifies that the DB instance storage is encrypted with the below KMS Key

--kms-key-id  \

Specify the ARN of the KMS key to be used to encrypt the storage

--copy-tags-to-snapshot \

Specifies that the tags associated with the database instance will be applied to the snapshots of the database

--no-enable-iam-database-authentication \

A value that indicates whether to enable mapping of Amazon Web Services Identity and Access Management (IAM) accounts to database accounts. IAM database authentication works with MariaDB, MySQL, and PostgreSQL. With this authentication method, you don’t need to use a password when you connect to a DB instance. Instead, you use an authentication token.

--enable-performance-insights \

Whether to enable Performance Insights for the DB instance. Performance Insights expands on existing Amazon RDS monitoring features to illustrate and help you analyze your database performance.

--performance-insights-kms-key-id  \

The ARN of the KMS key used to encrypt your performance insights. If you do not specify a value, then Amazon RDS uses your default KMS key.

--performance-insights-retention-period 7 \

The duration to retain the performance insights data in days.

--deletion-protection \

Specifies that this DB instance has deletion protection enabled.

--max-allocated-storage 500 \

The upper limit in gibibytes (GiB) to which Amazon RDS can automatically scale the storage of the DB instance.

--region us-east-1

The region that the database will be deployed in.

Complete code in one unit

aws rds create-db-instance \
--db-name MYDB \
--db-instance-identifier mydb \
--allocated-storage 300 \
--db-instance-class db.m5.2xlarge \
--engine oracle-se2-cdb \
--master-username admin \
--master-user-password admin_pwd \
--vpc-security-group-ids "sg1" "sg2" \
--db-subnet-group-name “sub_net_1” \
--db-parameter-group-name default.oracle-se2-cdb-19 \
--backup-retention-period 2 \
--port 1521 \
--no-multi-az \
--engine-version 19.0.0.0.ru-2022-04.rur-2022-04.r1 \
--auto-minor-version-upgrade \
--license-model license-included \
--option-group-name my_option_group \
--character-set-name WE8ISO8859P1 \
--nchar-character-set-name AL16UTF16 \
--no-publicly-accessible \
--storage-type gp2 \
--storage-encrypted \
--kms-key-id  \
--copy-tags-to-snapshot \
--no-enable-iam-database-authentication \
--enable-performance-insights \
--performance-insights-kms-key-id  \
--performance-insights-retention-period 7 \
--deletion-protection \
--max-allocated-storage 500 \
--region us-east-1