2022-04-13
- Just wrote up a data pipeline in an evening. Pretty fun!
- Pipeline extracts color information from outfits that are part of my WAYWT app. Idea being that part of developing a strong sense of style/personal aesthetic involves learning about color/palettes.
- First step was to collect the building blocks. I needed some software that I didn’t want to write (bc it would be pretty complicated). Specifically, I needed software to remove the background from an image automatically, and I needed software that would extract colors from an image, ordered by importance. I would combine these things to extract the important colors from the outfit. Fortunately, I found a couple of Python libraries just for this. Did some initial testing, and they worked great.
- Next step was to connect to the database. I ported over the code I had used to test the libraries and changed it to write its results to the database. This was actually pretty seamless, and I was able to start sending data to my database at a rate of like 4 outfits / minute.
- Unfortunately, I have like 20k outfits in my database rn, so 4/min just wasn’t going to cut it from a throughput perspective. It needed to be faster. The default way to speed things up would be to use more processors. And I knew the workload could easily be run on more processors without any major synchronization issues (the work involved in determining the colors for a given outfit is independent of the work for any other outfit). The problem was only that I didn’t really know how to do this kind of thing in Python. A little Googling showed me that my best option was to use the `multiprocessing` library. Refactoring my code to use it took a little care/effort, and I biffed parts of it around accessing variables in proper scopes (each process should have its own set of vars, generally). But it wasn’t super hard to fix the issues. Once it was working, I jumped from handling ~4 outfits/min to ~20/min, which makes the overall problem much more tractable (at ~20k outfits, that works out to ~17 hours of execution vs ~83 before).
- The major lesson in the whole experience is that simplicity is king. I was able to make lots of progress on the problem precisely because I didn’t do too many complicated things. Writing my own code to remove the background from images would be complicated and take me a lot of time. Writing my own code to extract colors from an image would probably be a bit less complicated than removing the background, but still much more complicated than I could reasonably accomplish in an evening. Each of the tools I chose to use had extremely simple APIs, which made it possible to hook everything together in a short period, and make quick fixes when I made mistakes. Of course, this lesson on the importance of simplicity is taught to me over and over again when building things. Accidental complexity is the kiss of death for these side projects. Must avoid it like the plague.
- I’m a bit shocked by the absolute dominance of tones/muted colors in the dataset. ~10x more outfits seem to use charcoal than pink, and if you compare navy to pink, it’s still 5x in favor of navy. Either people don’t like colors, or (more likely) my color extraction process doesn’t pick up accent colors very well.
- Almost nothing in life gives me the energy to stay up late the way coding does. Absolutely intoxicating.
- I should probably invest in learning JavaScript for real.
- Saw a comment on HN about “bread winner conversations”. Made me think some stuff.
- Still need to get my WAYWT app out the door. Some things I would like to do before launch:
  - Get the UI for searching posts in a basic working state
  - Make it possible to post to the subreddit directly from the app or, alternatively, create a private post
  - Add even a little bit of error handling to the app
  - Add metrics/alerting to the app
  - Set up GitHub Actions CI pipeline
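
The color step described above (strip the background, then rank what's left by importance) can be sketched in plain Python. This is not the code from the pipeline and does not use the libraries it relied on, which aren't named here. It assumes the background remover has already marked background pixels as fully transparent, and it treats "importance" as simple pixel share. The `rank_colors` function and its `bucket` parameter are hypothetical names made up for the illustration.

```python
from collections import Counter


def rank_colors(pixels, top_n=5, bucket=32):
    """Return the top_n colors among non-transparent pixels, by pixel share.

    `pixels` is an iterable of (red, green, blue, alpha) tuples, 0-255 each.
    Channels are snapped down to multiples of `bucket` so near-identical
    shades count as one color. Pixels with alpha == 0 are assumed to be
    removed background and are skipped.
    """
    counts = Counter()
    for r, g, b, a in pixels:
        if a == 0:
            continue  # background, already removed upstream
        key = (r // bucket * bucket, g // bucket * bucket, b // bucket * bucket)
        counts[key] += 1

    total = sum(counts.values())
    if total == 0:
        return []
    # most_common orders by count, so the largest share comes first.
    return [(color, n / total) for color, n in counts.most_common(top_n)]


# Toy "image": 6 charcoal pixels, 2 pink pixels, 10 transparent background pixels.
toy = [(40, 42, 50, 255)] * 6 + [(250, 120, 160, 255)] * 2 + [(255, 255, 255, 0)] * 10
print(rank_colors(toy))
```

Ranking purely by pixel share means a small accent piece lands far down the list or gets cut off by `top_n`. That may be part of why muted colors dominate a dataset built this way, though it's only a guess about how the real libraries behave.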
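
For the `multiprocessing` refactor mentioned above, here is a minimal sketch of the general shape, assuming the per-outfit work is independent and each worker process sets up its own state. `init_worker`, `extract_colors`, `process_all`, and the fake color math are all hypothetical stand-ins, not the real pipeline code. The real worker would load the image, remove the background, rank the colors, and write to the database.

```python
import multiprocessing as mp

# Per-process state. Every worker process fills in its own copy via
# init_worker, rather than sharing an object created in the parent.
_worker_state = {}


def init_worker(bucket):
    """Runs once inside each worker process before it takes any work."""
    _worker_state["bucket"] = bucket


def extract_colors(outfit_id):
    """Hypothetical stand-in for the real per-outfit work.

    Derives a fake gray shade from the id so the example runs without
    images or a database.
    """
    bucket = _worker_state["bucket"]
    shade = (outfit_id * 37) % 256 // bucket * bucket
    return outfit_id, (shade, shade, shade)


def process_all(outfit_ids, workers=None, bucket=32):
    """Run extract_colors over outfit_ids and return {outfit_id: color}."""
    if workers == 1:
        # Serial path: same code, no child processes. Handy for debugging.
        init_worker(bucket)
        return dict(map(extract_colors, outfit_ids))

    with mp.Pool(processes=workers, initializer=init_worker, initargs=(bucket,)) as pool:
        # Outfits don't depend on each other, so results can come back
        # in whatever order the workers finish them.
        return dict(pool.imap_unordered(extract_colors, outfit_ids, chunksize=16))
```

Calling something like `process_all(range(100))` from under an `if __name__ == "__main__":` guard keeps worker processes from re-running the script's top-level code on platforms that start workers by importing the main module. Passing `workers=1` runs the same code in a single process, which makes scope mistakes easier to track down.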