So I’ve just released my first patch for the recipe parser, which took care of a few of the more glaring issues the program still had. Most notable was a bug that prevented a number of screens from being shown, due to an errant variable in my layout.html file. I removed the offending variable and set guest users to be automatically redirected to their list when they click the “home” button instead.
More importantly, however, I added support for a number of additional recipes, and worked on the architecture for my web scraper. This has been a fun bit of coding; because there are so many different recipe websites out there, implementing them one by one is very tedious and I was incentivized to come up with ways to avoid repetition as much as possible.
For the recipes I’ve looked at so far, I’ve found several different patterns for how they store their recipe lines and titles. The first, and simplest, is an html component with a specified attribute. In these cases, essentially the same BeautifulSoup command can be run, and I just change out the parts that I look for. Here’s the start of the dictionary I use for these cases:
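A minimal sketch of what such a dictionary could look like (the domain names and class attributes here are placeholders, not the actual sites):

```python
# Each simple-case domain maps to the tag name and attributes that mark
# the recipe title and the individual recipe lines on that site.
# Domains and class names below are invented for illustration.
SCRAPER_ATTRIBUTES = {
    "www.example-recipes.com": {
        "title": ("h1", {"class": "recipe-title"}),
        "lines": ("li", {"class": "recipe-ingredient"}),
    },
    "www.another-food-blog.com": {
        "title": ("h2", {"class": "entry-title"}),
        "lines": ("span", {"class": "ingredient"}),
    },
}
```

Adding a new simple site then only means adding one more entry, with no new scraping code.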
I parse the url and check if the main domain is in the dictionary. If it is, then I retrieve the necessary component names and scrape accordingly:
(Here, “o” is a urlparse object initialized with the provided url.)
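Sketched in Python, assuming the page html has already been fetched (the function name, domain, and class names are my own stand-ins):

```python
from urllib.parse import urlparse
from bs4 import BeautifulSoup

# Illustrative attribute table: tag and attributes for title and lines.
SCRAPER_ATTRIBUTES = {
    "www.example-recipes.com": {
        "title": ("h1", {"class": "recipe-title"}),
        "lines": ("li", {"class": "recipe-ingredient"}),
    },
}

def scrape_recipe(url, html):
    """Return (title, lines) for a known simple-case domain, else None."""
    o = urlparse(url)  # "o" carries the parsed pieces of the url
    if o.netloc not in SCRAPER_ATTRIBUTES:
        return None
    spec = SCRAPER_ATTRIBUTES[o.netloc]
    soup = BeautifulSoup(html, "html.parser")
    title_tag, title_attrs = spec["title"]
    line_tag, line_attrs = spec["lines"]
    title = soup.find(title_tag, attrs=title_attrs).get_text(strip=True)
    lines = [el.get_text(strip=True)
             for el in soup.find_all(line_tag, attrs=line_attrs)]
    return title, lines
```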
Of course, some websites are a bit more complicated. For these, I actually have to make a specific function to scrape from them. I tried to make this as easy as possible as well, however, by creating a second dictionary that stores the functions:
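Roughly, that second dictionary could look like this (the domain and the markup its function expects are invented for illustration):

```python
from bs4 import BeautifulSoup

# A site too irregular for the attribute table gets its own function,
# which takes an already-parsed page and pulls out the recipe pieces.
def scrape_complicated_example(soup):
    title = soup.find("h1").get_text(strip=True)
    lines = [el.get_text(strip=True)
             for el in soup.select("div.ingredients span")]
    return title, lines

# Second lookup table: domain -> dedicated scraping function.
SCRAPER_FUNCTIONS = {
    "www.complicated-example.com": scrape_complicated_example,
}
```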
This way, I can check if the domain name is in the second list, and return the information right then and there:
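The dispatch step might be sketched like this, again with placeholder names rather than the actual code:

```python
from urllib.parse import urlparse
from bs4 import BeautifulSoup

# Placeholder site-specific function (the markup is invented).
def scrape_complicated_example(soup):
    return (soup.find("h1").get_text(strip=True),
            [el.get_text(strip=True) for el in soup.find_all("li")])

SCRAPER_FUNCTIONS = {
    "www.complicated-example.com": scrape_complicated_example,
}

def scrape_recipe(url, html):
    o = urlparse(url)
    # If the domain has its own function, hand it the parsed page and
    # return the result right then and there.
    if o.netloc in SCRAPER_FUNCTIONS:
        soup = BeautifulSoup(html, "html.parser")
        return SCRAPER_FUNCTIONS[o.netloc](soup)
    return None  # otherwise, fall through to the attribute-table path
```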
There are a few other small tricks that I’m using to make this process easier. For one, many of the recipe blogs are WordPress-based, and use one of the many WordPress recipe plugins. I created standard templates for the websites that use them, such as this one for WordPress Recipe Maker:
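A sketch of what a shared template might look like: the wprm- class prefix follows what the plugin typically emits, but the exact tags and class names (and the domains) are assumptions that would need checking against real pages:

```python
# One template for every site that runs the same WordPress recipe plugin.
WPRM_TEMPLATE = {
    "title": ("h2", {"class": "wprm-recipe-name"}),
    "lines": ("li", {"class": "wprm-recipe-ingredient"}),
}

# Each such domain just points at the shared template (domains invented).
SCRAPER_ATTRIBUTES = {
    "www.example-blog-one.com": WPRM_TEMPLATE,
    "www.example-blog-two.com": WPRM_TEMPLATE,
}
```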
Another plugin requires its own function to parse, but luckily that function can be used on more than one recipe:
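In outline, that looks like registering one function for several domains (the plugin markup and site names here are stand-ins, not the real ones):

```python
from bs4 import BeautifulSoup

# Hypothetical parser for a plugin whose markup needs custom handling.
def scrape_plugin_recipe(soup):
    card = soup.find("div", attrs={"class": "recipe-card"})
    title = card.find("h2").get_text(strip=True)
    lines = [el.get_text(strip=True)
             for el in card.find_all("li", attrs={"class": "ingredient"})]
    return title, lines

# The same function serves every site that runs the plugin.
SCRAPER_FUNCTIONS = {
    "www.food-blog-a.com": scrape_plugin_recipe,
    "www.food-blog-b.com": scrape_plugin_recipe,
}
```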
… and so on and so forth. There are a lot of sites out there, and there’s no way I could get them all, but I’d like to cover enough that the odds are fairly good you’ll be using a site that’s supported. Regardless, I had quite a bit of fun working on this code; one of the things I enjoy most is figuring out new ways to solve problems that would otherwise require a lot of busy work. Writing out a dedicated function for every single recipe website would have been a lot of unneeded effort.
Anyway, that’s all I’ve got for now, but I’ll be posting some new updates to the recipe parser as I go. I have a feeling that I’m almost done with this one, at least for now. Huh, finishing a project. Great feeling.