Building an Imperfect Database
Flag Factory
An Imperfect Database
While I'm working on the mechanics for the Flag Factory clicker game, I'm also trying to get all of the ingredients and materials in place. I realized that I would need some type of database that gives me access to the flags and can also hold descriptions of them. When doing the initial web scraping, I didn't pull any information other than the images from Google Image Search, so the files don't contain any metadata, alt tags, or descriptive names. I also knew that I didn't want to use machine learning to try to interpret the flags' language or content. Since the flags already seem to lack human intervention, I strongly felt that adding another layer of automation to the content would push the project away from where I want it to go. So with all of that in mind, I realized that I was going to have to roll up my sleeves for some good old-fashioned data entry.
Categories
I also knew that I wouldn't want to input this data more than once if possible. So, although I'm not sure yet how useful it will be, I decided to capture a description of each flag's content and also to categorize each flag on a scale from Pro-Trump to Pro-Biden. Both of these things are highly subjective. (It would be interesting to task several people with logging this data to see the differences in their results.) Some flags were quite easy to define as "HIGH-TRUMP" or "HIGH-BIDEN" or "NEUTRAL."
However, there was a lot of gray area. For example, what if a flag supported a cause that is generally associated with Biden supporters, but doesn't necessarily proclaim itself to be Pro-Biden? Or what if the flag was for a heavy metal band which may or may not have associations with white supremacy? How do you categorize those things? To try to capture these areas in a predictable manner, I came up with 6 possible categories:
- "HIGH-TRUMP"
- "SOMEWHAT TRUMP"
- "NEUTRAL"
- "SOMEWHAT BIDEN"
- "HIGH-BIDEN"
- "UNSURE"
For the HIGH values and the NEUTRAL value, I asked myself whether, 100% no question, everyone would categorize the flag this way. If I felt like most people would categorize it as HIGH in one direction but I wasn't confident, it went in the SOMEWHAT section. If I found myself being unsure or feeling like I would need to do extensive research to find out, I tagged it as UNSURE so I could keep myself moving.
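To make the six categories concrete, here is a minimal sketch of what a single flag entry could look like, with a check that rejects any category outside the list. The field names (`filename`, `description`, `category`), the helper `makeFlagRecord`, and the example filename are assumptions for illustration, not the project's actual schema.

```javascript
// The six categories from the list above.
const CATEGORIES = Object.freeze([
  "HIGH-TRUMP",
  "SOMEWHAT TRUMP",
  "NEUTRAL",
  "SOMEWHAT BIDEN",
  "HIGH-BIDEN",
  "UNSURE",
]);

// Hypothetical helper: builds one flag entry and refuses unknown categories,
// so a typo during data entry fails loudly instead of creating a seventh bucket.
function makeFlagRecord(filename, description, category) {
  if (!CATEGORIES.includes(category)) {
    throw new Error(`Unknown category: ${category}`);
  }
  return { filename, description, category };
}

// Example entry (the filename and description are made up).
const example = makeFlagRecord(
  "flag_0001.jpg",
  "Plain red flag with white block text",
  "NEUTRAL"
);
console.log(example);
```

Keeping the list in one frozen array means the entry form and any later stats can share the same source of truth.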
I currently have 1,419 flags cataloged. Here are the category stats from those flags:
- "HIGH-TRUMP": 57
- "SOMEWHAT TRUMP": 23
- "NEUTRAL": 727
- "SOMEWHAT BIDEN": 27
- "HIGH-BIDEN": 13
- "UNSURE": 571
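Stats like these can be tallied straight from the entries. The following is a rough sketch, assuming each entry is an object with a `category` field (an assumed field name); `countByCategory` and the sample data are hypothetical.

```javascript
// Hypothetical helper: counts how many entries fall into each category.
function countByCategory(records) {
  const counts = {};
  for (const record of records) {
    counts[record.category] = (counts[record.category] || 0) + 1;
  }
  return counts;
}

// Tiny made-up sample, not the real flag data.
const sample = [
  { filename: "a.jpg", category: "NEUTRAL" },
  { filename: "b.jpg", category: "UNSURE" },
  { filename: "c.jpg", category: "NEUTRAL" },
];

console.log(countByCategory(sample)); // { NEUTRAL: 2, UNSURE: 1 }
```

A tally like this also makes it easy to check that the per-category counts add up to the total number of cataloged flags.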
Descriptions
The Flag Content Descriptions are also highly subjective. While some of the flags are just words, others require some description. A lot of the flags were for characters or symbols that were clearly something specific, but I didn't know what. I was trying to minimize the per-flag input time, but for some of them there's just no solution other than looking it up. The more of these I do, the more I realize that even with me as the only person writing them, there is still a great deal of variation in my effort and my annoyance. Depending on how tired I am or what else I'm thinking about, some flags have more in-depth descriptions than others. Sometimes the descriptions are more mechanical and sometimes they are colorful. While this makes for a rather poor scientific database, I think it is making for a richer artistic one. An obvious human element is starting to develop in the data, which gives me an opportunity to inject some of that texture while still utilizing data and automation.
Data Input
I first started by making a text snippet in aText with the data structure and going through my flag image folder to manually input data. I did this for about an hour and was clocking an average of 1 flag/minute. Looking at a set of 15,000 images, this data entry alone would take approximately 207 hours or 8.5 days. Woof. No good.
I realized that I could cut down on some of that by starting with a JSON file that already had the image file names in place. I did this while on an airplane ride, which led me to a process that is best described as "if it's stupid but it works, it's not stupid." What I wanted was to get all of the filenames in a list. So I listed them in Terminal and copied them into a string that I could split by spaces. However, when I went to do this, the line breaks from Terminal were preventing VSCode from reading it as one long string.
I realized that if I zoomed wayyyyy out in Terminal, there would be a lot more names per line and I could take one minute and manually remove the line breaks. This was silly but ultimately was probably faster than me figuring out a "better" solution.
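For what it's worth, one possible "better" solution is to split on any run of whitespace rather than on single spaces, which treats line breaks the same as spaces. This is only a sketch and not what I actually did; `parseFilenames` and the sample filenames are made up, and it assumes none of the filenames contain spaces.

```javascript
// Hypothetical helper: turns a pasted Terminal listing into an array of names.
// /\s+/ matches spaces, tabs, and line breaks, so wrapped lines are not a problem.
// Assumption: filenames themselves contain no whitespace.
function parseFilenames(pasted) {
  return pasted.split(/\s+/).filter((name) => name.length > 0);
}

// Made-up listing with a line break and uneven spacing, as a terminal might wrap it.
const pasted = "flag_001.jpg flag_002.jpg\nflag_003.png   flag_004.jpg\n";

console.log(parseFilenames(pasted));
```

The `filter` step drops the empty string that `split` leaves behind when the pasted text starts or ends with whitespace.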
With those in place, I was able to write a little app in JavaScript that would take all of the filenames and pre-write a bunch of the JSON file for me.
In all, this process took me about 3 hours. With my pre-built script, I cataloged for an hour and was clocking in 2 flag entries per minute. Double the speed! This was a great start. The investment of 3 hours got the 207-hour data entry time down to an estimated 103 hours (about 4.3 days). I knew I could do better, but I did worry about over-investing time. I now had a neat little JSON file with which I could dynamically cycle through the images. I decided the investment was worth it, but I tried to be conscious of keeping it as bare bones as possible.
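The idea can be sketched roughly like this: map each filename to an entry with empty fields, then serialize the whole list. The field names, `buildSkeleton`, and the sample filenames are assumptions for illustration and not the actual app.

```javascript
// Hypothetical helper: pre-writes one empty entry per image filename,
// leaving description and category blank for manual data entry.
function buildSkeleton(filenames) {
  return filenames.map((filename) => ({
    filename,
    description: "",
    category: "",
  }));
}

// Made-up filenames standing in for the real image folder.
const skeleton = buildSkeleton(["flag_001.jpg", "flag_002.jpg"]);

// Pretty-print with 2-space indentation so the file is easy to edit by hand.
const json = JSON.stringify(skeleton, null, 2);
console.log(json);
```

From there, the output can be pasted or written into a `.json` file and filled in one entry at a time.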
So, I took about 5 hours and built out a very rudimentary database builder using nedb. I also found it helpful to give myself a little write-up as a reminder of the strategies I was using to categorize the flags. I overspent time trying (and failing) to resolve some issues with duplicate flags and with the flow for logging flags and then picking back up where I left off. Rather than agonize over making this thing production-ready, I'm integrating it with a manual workflow: after I do a logging session, I copy and paste the results into my stable JSON file. And to pick back up, if I accidentally close the window or refresh, I just use the console to jump my flagCount to where I left off.
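If I ever do automate the "pick back up" step, one simple approach would be to scan the stable JSON for the first entry that has no category yet and start there. This is a sketch under assumptions, not the nedb builder itself: the entry shape and `findResumeIndex` are hypothetical, and only the name `flagCount` comes from my actual workflow.

```javascript
// Hypothetical helper: returns the index of the first entry that has not been
// categorized yet, or the list length if every entry is already done.
function findResumeIndex(records) {
  const index = records.findIndex((record) => !record.category);
  return index === -1 ? records.length : index;
}

// Made-up entries: the first two are logged, the rest are still blank.
const records = [
  { filename: "flag_001.jpg", category: "NEUTRAL" },
  { filename: "flag_002.jpg", category: "UNSURE" },
  { filename: "flag_003.jpg", category: "" },
  { filename: "flag_004.jpg", category: "" },
];

// Instead of setting the counter by hand in the console, derive it from the data.
let flagCount = findResumeIndex(records);
console.log(flagCount); // 2
```

This only works if entries are logged in order; skipping around would need a different bookkeeping scheme.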
If I need to do something similar in the future, maybe I'll resolve those things, but for now it was good enough for me to just get to work on the data entry. I now had my entry speed up to 4 flags per minute! So overall, I invested 8 hours of programming time, but was able to get the overall estimate of 207 hours down to 51 hours of data entry. It's also now a lot easier for me to just pick up and put down again, so I find that I can simply sit and log some flags while I'm watching TV or having some down time. I was able to more quickly get to a critical mass of flag data, which will give me enough to build off of for a prototype. With the structures in place, I can easily build out a more robust data set without affecting any of the building or experience that I'm working on in the near future. Success!
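The trade-off between tooling time and data entry time comes down to simple arithmetic. Here is a small sketch that takes the 207-hour estimate at 1 flag per minute as given and scales it by entry speed; the function names are made up for illustration.

```javascript
// Baseline from the text: about 207 hours of data entry at 1 flag per minute.
const BASELINE_HOURS = 207;

// Hypothetical helper: entry time shrinks in proportion to entry speed.
function hoursAtRate(baselineHours, flagsPerMinute) {
  return baselineHours / flagsPerMinute;
}

// Hypothetical helper: hours saved after subtracting the time spent on tooling.
function netHoursSaved(baselineHours, flagsPerMinute, hoursInvested) {
  return baselineHours - hoursAtRate(baselineHours, flagsPerMinute) - hoursInvested;
}

console.log(hoursAtRate(BASELINE_HOURS, 2)); // 103.5
console.log(hoursAtRate(BASELINE_HOURS, 4)); // 51.75
console.log(netHoursSaved(BASELINE_HOURS, 4, 8)); // 147.25
```

By this rough accounting, the 8 hours of programming pay for themselves many times over, as long as the 4 flags/minute pace holds across the whole set.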