
Using ChatGPT to Learn New Tricks

An Introduction #

So here's the thing ... with a hobby project that I'm currently working on, a heavily populated pixel-art gallery for a 40-year-old computer, I wanted to use some modern methods for building the website: Static Site Generation for the bulk of the HTML, images hosted on Amazon's AWS S3, JSON files for metadata, sparing use of JavaScript (only where absolutely essential), and ChatGPT to help me get there.

Static Site Generation (SSG) is something that I'm very much in favour of right now. It makes website development feel comfortable for me... I can see the entirety of the website that I'm working with, I can generate it from familiar tools, and I don't need to worry about data stored in MySQL or any other database. It's just awesome.

SSG isn't great at dealing with lots of images and metadata files. It can do it - but everyone was advising me that it would quickly fall apart. So I took their advice and opted for Amazon's AWS S3. S3 can serve my images and JSON files very quickly, and I don't need to worry about nesting things within organised folders - apparently it works better without that, actually.


The Problem With AWS S3's Sync #

So, the files that I have hosted on AWS can be updated, added and deleted whenever I update the website. In order to push those updates to S3, I need to do a "sync" with my bucket, which essentially mirrors my local folder to the one stored on AWS. Amazon provide a sync tool for exactly that: with a single commandline, it takes a whole folder of files and uploads them, recursing into subfolders. It has some functionality for checking which files have been updated - but it's problematic in ways that I'll come to in a moment. To do the sync you just need the following commandline:-

aws s3 sync --delete . s3://[bucketname]

That's it and, in theory, the folder is now mirrored to AWS S3. However... the problem is that the sync tool only does a very "dumb" check of whether a file has actually changed. It looks at the timestamp and filesize - if either is different, up the file goes. That's problematic for what I'm doing, as my files are often recreated - exactly the same data saved again, but with a new timestamp.

This is a problem that many people have expressed frustration with. Despite many requests, Amazon haven't added any hash-based checking for changes to the tool. They simply suggest that people use an additional option, "--size-only", like this:-

aws s3 sync --delete --size-only . s3://[bucketname]

Not exactly ideal .. and, really, asking for problems. With this option the timestamp is ignored and a file will only be uploaded if its size changes. So if you have a text file with just "HELLO" written inside and you change it to "!HEY!", the file won't be synced and the change won't show on the website. Doh!


The First (Failed) Solution #

As well as the command above for syncing an entire folder, there are also individual commands for adding/updating files on S3 and for deleting them:-

aws s3 cp         <-- to add/update a file

aws s3 rm         <-- to delete a file
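
For example, copying a single file up to the bucket and removing another would look something like this (using the bucket name that crops up later in this post):-

aws s3 cp .\json\238472.json s3://c64graphicsdb/
aws s3 rm s3://c64graphicsdb/238466.png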

Great! So the first thing I asked ChatGPT to do was help me iterate over the files in a folder, recursing into subfolders, and create a batch file of the necessary commands. I had it write a whole host of hashing code, along with code to import and export a "file-hash.txt" file storing a hash value for each file in my folder.

It did a pretty good job, actually. I had it add functionality for a "full sync" where it would revert back to the earlier sync command if there was no "file-hash.txt" file to be found (eg. on a first run). I also had it write the updated hashes to a temp file first - and then add lines to the batch file so that the main hash file would only be replaced (by deleting the current version and renaming the temp version to the correct filename) at the end of the batch run.
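
The generator ended up roughly along these lines - a simplified sketch rather than the exact code that ChatGPT produced (the "AWSSync.bat" name is just for illustration), and it assumes a calculateFileHash() helper along the lines of the one shown later in this post:-

#include <filesystem>
#include <fstream>
#include <string>
#include <unordered_map>

//; assumes a calculateFileHash() helper along the lines of the one shown later in this post
std::string calculateFileHash(const std::string& filePath);

void WriteSyncBatchFile()
{
    const std::string hashFileName = "AWSSyncFileHashes.txt";
    const std::string tempHashFileName = "AWSSyncFileHashes_temp.txt";

    //; load the previously stored hashes (relative path -> hash); if the file is missing
    //; (eg. on a first run) this stays empty - the real version fell back to a full "aws s3 sync" in that case
    std::unordered_map<std::string, std::string> oldHashes;
    std::ifstream oldHashFile(hashFileName);
    std::string path, hash;
    while (oldHashFile >> path >> hash)
        oldHashes[path] = hash;

    //; "AWSSync.bat" is just an illustrative name for the generated batch file
    std::ofstream batchFile("AWSSync.bat");
    std::ofstream newHashFile(tempHashFileName);
    batchFile << "cd AWS\n";

    //; walk the sync folder; anything new, or whose hash has changed, gets an "aws s3 cp" line
    //; (files that have disappeared since the last run would get an "aws s3 rm" line in the same way)
    for (const auto& entry : std::filesystem::recursive_directory_iterator("AWS"))
    {
        if (!entry.is_regular_file())
            continue;

        std::string relPath = std::filesystem::relative(entry.path(), "AWS").string();
        std::string newHash = calculateFileHash(entry.path().string());
        newHashFile << relPath << " " << newHash << "\n";

        auto it = oldHashes.find(relPath);
        if (it == oldHashes.end() || it->second != newHash)
            batchFile << "aws s3 cp .\\" << relPath << " s3://c64graphicsdb/\n";
    }

    //; only swap the new hash file in at the very end of the batch run
    batchFile << "cd \"" << std::filesystem::current_path().string() << "\"\n";
    batchFile << "del " << hashFileName << "\n";
    batchFile << "rename " << tempHashFileName << " " << hashFileName << "\n";
    batchFile << "pause\n";
}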

An example batch file to show how a few files might be updated:-

cd AWS
aws s3 cp .\artwork\238466.png s3://c64graphicsdb/
aws s3 cp .\artwork\238472.png s3://c64graphicsdb/
aws s3 cp .\artwork\238492.png s3://c64graphicsdb/
aws s3 cp .\collages\10805.png s3://c64graphicsdb/
aws s3 cp .\collages\38941.png s3://c64graphicsdb/
aws s3 cp .\json\238472.json s3://c64graphicsdb/
aws s3 cp .\json\238492.json s3://c64graphicsdb/
cd "D:\\Git\\C64GFXDb\\CPPTool"
del AWSSyncFileHashes.txt
rename AWSSyncFileHashes_temp.txt AWSSyncFileHashes.txt
pause

The problem that I had..? The whole batch file was ridiculously slow. It was pretty clear that this method wasn't going to work - with 33,000 files to update, I couldn't have each file taking 10 seconds to sync! A full sync would take approximately 4 days! Plus.. what about the costs from AWS S3? Combined, these two problems were the main reason for looking at better methods of syncing in the first place!

The reason this method was so slow is that each individual command has to reconnect to the S3 bucket - which is, apparently, a time-consuming thing to do.

I asked ChatGPT what it thought about the slowness .. the advice it gave me was to use the command that I'd started the day with, the one that I was originally hoping to improve on. Gah!


My Second Idea #

Next up, I figured that I could use something like a dual-folder solution. I would output files to one folder, AWS, as I created/recreated them ... and I would have a second folder, AWSSync, which I would copy the contents of AWS into - but ONLY copying a file if the one it replaces isn't identical. I would also delete any files that no longer existed in AWS. With that, running the full sync command on AWSSync should be quick, as only the files that had actually changed would have new timestamps.
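
The copy step itself would have been simple enough - something like this quick sketch (hypothetical folder-mirroring code that I never actually wrote, called as MirrorFolder("AWS", "AWSSync")):-

#include <algorithm>
#include <filesystem>
#include <fstream>
#include <iterator>
#include <vector>

namespace fs = std::filesystem;

//; returns true if the target file exists and has exactly the same size and contents as the source
bool filesAreIdentical(const fs::path& source, const fs::path& target)
{
    if (!fs::exists(target) || fs::file_size(source) != fs::file_size(target))
        return false;

    std::ifstream a(source, std::ios::binary), b(target, std::ios::binary);
    return std::equal(std::istreambuf_iterator<char>(a), std::istreambuf_iterator<char>(),
                      std::istreambuf_iterator<char>(b));
}

void MirrorFolder(const fs::path& sourceDir, const fs::path& targetDir)
{
    //; copy anything that's new or has genuinely changed - untouched files keep their old timestamps,
    //; so a later "aws s3 sync" of the target folder only uploads what actually changed
    for (const auto& entry : fs::recursive_directory_iterator(sourceDir))
    {
        if (!entry.is_regular_file())
            continue;

        fs::path targetPath = targetDir / fs::relative(entry.path(), sourceDir);
        if (!filesAreIdentical(entry.path(), targetPath))
        {
            fs::create_directories(targetPath.parent_path());
            fs::copy_file(entry.path(), targetPath, fs::copy_options::overwrite_existing);
        }
    }

    //; remove anything from the target folder that no longer exists in the source folder
    std::vector<fs::path> toDelete;
    for (const auto& entry : fs::recursive_directory_iterator(targetDir))
    {
        if (entry.is_regular_file() && !fs::exists(sourceDir / fs::relative(entry.path(), targetDir)))
            toDelete.push_back(entry.path());
    }
    for (const auto& deadFile : toDelete)
        fs::remove(deadFile);
}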

Before I properly got started on this solution, I had another idea.. a different, very cunning - and to most of you readers, probably quite obvious - solution.


The Third (and Final) Solution #

I realised that Amazon would have a C++ SDK for AWS, so surely it would be possible to get rid of my whole batch-file method and use the SDK to do the whole thing efficiently.. connect to AWS, and to the bucket, once. Grab hashes for the files stored on AWS. Compare them against the local files and figure out which are actually different and which don't exist any more .. modify, add and delete only what's absolutely necessary.. and then disconnect.

I figured at first that I'd still need to store a hashkey file on disk in order to know what the hash is for all the files sent to the AWS bucket... as it turned out, even this wouldn't be needed!

After a word or two with my good friend ChatGPT, who is quite the expert with the AWS SDK as it happens (unlike myself, who had never touched it before), it turned out that all of this would be relatively easy to implement, with only a few SDK functions needed: connect, get a list of the files stored on AWS, push a file, delete a file.. all quite easy! I also found out that the AWS SDK already exposes a hash for every file stored in a bucket (the ETag) - surprising, really, given that the AWS commandline sync tool doesn't make use of this (to the chagrin of seemingly thousands of users!).

As this might be useful to others, let me just post the entirety of my sync function below.

#include <aws/core/Aws.h>
#include <aws/s3/S3Client.h>
#include <aws/s3/model/DeleteObjectRequest.h>
#include <aws/s3/model/ListObjectsRequest.h>
#include <aws/s3/model/PutObjectRequest.h>
#include <aws/core/utils/memory/stl/AWSStringStream.h>
#include <openssl/evp.h>

//; standard library headers used below
#include <algorithm>
#include <filesystem>
#include <fstream>
#include <iomanip>
#include <iostream>
#include <sstream>
#include <string>
#include <unordered_map>
#include <vector>

//; compute an MD5 hash of a file's contents, returned as a quoted hex string so that it
//; matches the ETag format S3 reports for (non-multipart) objects
std::string calculateFileHash(const std::string& filePath)
{
	std::ifstream file(filePath, std::ios::binary);
	std::vector<unsigned char> fileContents((std::istreambuf_iterator<char>(file)), std::istreambuf_iterator<char>());

	EVP_MD_CTX* mdctx = EVP_MD_CTX_new();
	const EVP_MD* md = EVP_md5();
	unsigned char mdValue[EVP_MAX_MD_SIZE];
	unsigned int mdLen = 0;

	EVP_DigestInit_ex(mdctx, md, NULL);
	EVP_DigestUpdate(mdctx, fileContents.data(), fileContents.size());
	EVP_DigestFinal_ex(mdctx, mdValue, &mdLen);
	EVP_MD_CTX_free(mdctx);

	std::stringstream hashStream;
	hashStream << "\"";
	for (unsigned int i = 0; i < mdLen; i++)
	{
		hashStream << std::hex << std::setw(2) << std::setfill('0') << static_cast<int>(mdValue[i]);
	}
	hashStream << "\"";
	return hashStream.str();
}

void DoAWSSync()
{
    static const std::string S3BucketName("c64graphicsdb"); //; our bucket
    std::string directoryPath = "AWS"; //; our sync folder

    //; create a log file for detailing what, if anything, was synced
    std::string outputFilePath = "AWSSync.txt";
    std::ofstream outputFile(outputFilePath, std::ofstream::trunc);

    bool bSomethingSynced = false;  //; we just use that so that, if nothing changed, our log file can say as much

    //; save our current working folder and move into our sync folder - this makes the AWS sync simpler
    std::filesystem::path originalDirectory = std::filesystem::current_path();
    std::filesystem::current_path(directoryPath);

    //; initialise the AWS SDK
    Aws::SDKOptions options;
    Aws::InitAPI(options);

    {
        Aws::S3::S3Client s3_client;

        //; our filename and hashkey map detailing files already in our AWS bucket
        std::unordered_map<std::string, std::string> s3ObjectsHash;

        //; progress counter - count the total number of files we need to consider for sync
        size_t totalFiles = std::distance(std::filesystem::recursive_directory_iterator("."), std::filesystem::recursive_directory_iterator());
        size_t processedFiles = 0;
        std::cout << "       : Updating AWS\r";

        //; list all objects in S3 and store their ETags
        //; note that this is done in chunks - we can't get all the files in one go
        Aws::S3::Model::ListObjectsRequest objectsRequest;
        objectsRequest.WithBucket(S3BucketName);

        bool moreObjectsToList = true;
        while (moreObjectsToList)
        {
            auto listObjectsOutcome = s3_client.ListObjects(objectsRequest);

            if (listObjectsOutcome.IsSuccess())
            {
                const auto& objects = listObjectsOutcome.GetResult().GetContents();

                //; grab all the ETag values (these are the hashkeys)
                for (const auto& object : objects)
                {
                    std::string key = object.GetKey();
                    s3ObjectsHash[key] = object.GetETag();
                }

                //; are there more objects? If so, set the marker to the end ready for the next list-grab
                moreObjectsToList = listObjectsOutcome.GetResult().GetIsTruncated();
                if (moreObjectsToList)
                {
                    objectsRequest.SetMarker(objects.back().GetKey());
                }
            }
            else
            {
                //; failed to grab the list for some reason - so we output an error and quit out of the loop
                outputFile << "S3: Error listing objects from bucket '" << S3BucketName << "' .. error: '" << listObjectsOutcome.GetError().GetMessage() << "'" << std::endl;
                break;
            }
        }

        // Now, iterate over local files and upload if necessary
        for (const auto& entry : std::filesystem::recursive_directory_iterator("."))
        {
            if (entry.is_regular_file())
            {
                std::string localFilePath = entry.path().string();

                // Remove the leading ".\" from the local file path
                if (localFilePath.find(".\\") == 0)
                {
                    localFilePath.erase(0, 2);
                }

                // Replace backslashes with forward slashes - we need to do this so that the find() below works .. AWS SDK uses "/" between foldernames, the directory iterator uses "\"
                std::replace(localFilePath.begin(), localFilePath.end(), '\\', '/');
                
                //; calculate the hashkey for our local file
                std::string localFileHash = calculateFileHash(localFilePath);

                //; search for the file in the AWS list
                auto s3ObjectHashIter = s3ObjectsHash.find(localFilePath);

                //; if not found, or if the hashkey for the AWS file is different, we need to upload the file to AWS - whether to add it for the first time or to update
                if (s3ObjectHashIter == s3ObjectsHash.end() || s3ObjectHashIter->second != localFileHash)
                {
                    //; push the file to AWS
                    Aws::S3::Model::PutObjectRequest putRequest;
                    putRequest.WithBucket(S3BucketName).WithKey(localFilePath);
                    std::ifstream file(localFilePath, std::ios::binary);
                    std::stringstream buffer;
                    buffer << file.rdbuf();
                    putRequest.SetBody(Aws::MakeShared<Aws::StringStream>("", buffer.str()));

                    //; did the push work? Output to the log in either case
                    auto putObjectOutcome = s3_client.PutObject(putRequest);
                    if (!putObjectOutcome.IsSuccess())
                    {
                        outputFile << "Failed to upload file: " << localFilePath << std::endl;
                    }
                    else
                    {
                        outputFile << "File uploaded: " << localFilePath << std::endl;
                    }
                    bSomethingSynced = true;
                }

                //; remove from the map to track which S3 objects are not present locally
                s3ObjectsHash.erase(localFilePath);
            }
         
            //; update our on-screen progress indicator
            processedFiles++;
            double progress = (static_cast<double>(processedFiles) / static_cast<double>(totalFiles)) * 100.0;
            std::cout << std::fixed << std::setprecision(1) << progress << "%\r";
        }
        //; make sure the progress indicator ends on 100%
        std::cout << "\r" << std::fixed << std::setprecision(1) << 100.0 << "%\n";

        //; iterate over any objects that remain in s3ObjectsHash - these are files
        //; that don't exist locally - so they can be deleted in AWS
        for (const auto& [key, _] : s3ObjectsHash)
        {
            //; Skip keys that end with '/', as they represent directories
            if (!key.empty() && key.back() == '/') {
                continue;
            }

            //; delete the file
            Aws::S3::Model::DeleteObjectRequest deleteRequest;
            deleteRequest.WithBucket(S3BucketName).WithKey(key);

            //; report either success or failure of the delete to our logfile
            auto deleteObjectOutcome = s3_client.DeleteObject(deleteRequest);
            if (!deleteObjectOutcome.IsSuccess())
            {
                outputFile << "Failed to delete object: " << key << std::endl;
            }
            else
            {
                outputFile << "File deleted: " << key << std::endl;
            }
            bSomethingSynced = true;
        }
    }

    //; if we didn't update anything on AWS, our log will be empty - so we just add a line to mention that there were no changes
    if (!bSomethingSynced)
    {
        outputFile << "No Updates Required - AWS is up to date!" << std::endl;
    }

    //; shut down the AWS SDK and release its resources
    Aws::ShutdownAPI(options);

    //; restore the original current working directory
    std::filesystem::current_path(originalDirectory);
}

The code above should be fairly close to what you'd need if you want to do something similar. I guess it would be simple enough to turn it into a standalone tool - a console executable, perhaps - that does the syncing, hash checking and all, but I'll leave that as a project for the reader.
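
If you did want to spin it off into its own console executable, the wrapper around DoAWSSync() needn't be much more than this (a sketch - in reality you'd probably pass the bucket and folder names in as arguments rather than leaving them hard-coded inside the function):-

int main()
{
    //; DoAWSSync() handles the AWS SDK init/shutdown and its own logging,
    //; so a standalone tool really only needs to call it
    DoAWSSync();
    return 0;
}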


Summary #

As well as helping anyone else wishing to utilise Amazon AWS in this way, I hope that the above demonstrates nicely just how useful ChatGPT can be. Honestly, I'm very, very impressed. It would probably have taken me a few days to write this code on my own - rather than the few hours that it actually took.

Colour me impressed. I shall definitely be talking with ChatGPT, and other AIs, a lot more in future.
