Parsing

Yak shaving. Seriously.

Before you build a pumpkin spice client, you have to have something to read with it. I mean, technically you want to select things to read ("follow" or "friend" people) within the client, but still.

So I started fiddling with a little Perl script that takes a URL on the command line and checks it out. If it's a Facebook or Twitter URL, it can tell who the person/page is (and if I had bothered to register with the API, it could go off and do stuff with it). I haven't done anything with Instagram or Google Plus, but there are libraries for those on CPAN, so I'm just treating them as a solved problem for now.
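A minimal sketch of that kind of URL check, written in Python rather than Perl and not taken from the actual script. The host table, the `classify` name, and the idea that the first path segment names the person or page are all assumptions made for illustration.

```python
from urllib.parse import urlparse

# Hypothetical host-to-service table; only the two services the
# post says are recognized so far are listed.
KNOWN_SERVICES = {
    "facebook.com": "facebook",
    "twitter.com": "twitter",
}


def classify(url):
    """Return (service, handle) for a recognized profile URL, else (None, None)."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    service = KNOWN_SERVICES.get(host)
    if service is None:
        return (None, None)
    # Assumption: the first non-empty path segment is the person/page name.
    segments = [s for s in parsed.path.split("/") if s]
    handle = segments[0] if segments else None
    return (service, handle)
```

Anything that comes back as `(None, None)` is an at-large web page and falls through to the feed hunting described next.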

Tilde.club pages are really a pretty good early test for at-large web pages. Some of them, like mine, have the feed properly marked in the header, and I can just call XML::Feed->find_feeds($uri) and get something that presumably represents a blog feed.
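For the curious, "properly marked in the header" means a `<link rel="alternate">` tag with a feed MIME type. Here is a rough Python sketch of that discovery step using only the standard library. It is an illustration of the idea, not how XML::Feed does it internally, and the class and function names are made up.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types conventionally used to advertise RSS and Atom feeds.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collects the href of every <link rel="alternate"> with a feed type."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        rels = (attr.get("rel") or "").lower().split()
        mime = (attr.get("type") or "").lower()
        if "alternate" in rels and mime in FEED_TYPES and attr.get("href"):
            # Resolve relative hrefs against the page's own address.
            self.feeds.append(urljoin(self.base_url, attr["href"]))


def find_feeds(html, base_url):
    """Return absolute feed URLs advertised in an already-fetched HTML string."""
    finder = FeedLinkFinder(base_url)
    finder.feed(html)
    return finder.feeds
```

The sketch takes the page source as a string, so fetching the page is left to whatever HTTP client is already in use.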

Others have a feed but it's not marked, so I look for things like index.rss or feed.xml or any of the other standard-ish names in the same directory as the page I've got.
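The guessing step could look something like this Python sketch. Only `index.rss` and `feed.xml` come from the paragraph above; the other names in the list are my guesses at "standard-ish", and `candidate_feed_urls` is a made-up name.

```python
from urllib.parse import urljoin

# index.rss and feed.xml are the names mentioned above; the rest are
# assumed examples of other common feed file names.
CANDIDATE_NAMES = ["index.rss", "feed.xml", "index.xml", "atom.xml", "rss.xml"]


def candidate_feed_urls(page_url):
    """Return feed URLs to try, in the same directory as page_url."""
    # urljoin replaces the last path segment, so "…/dir/page.html" and
    # "…/dir/" both resolve to files inside "…/dir/".
    return [urljoin(page_url, name) for name in CANDIDATE_NAMES]
```

Each candidate still has to be requested and checked that it actually parses as a feed. One thing to watch: a page address without a trailing slash, such as `http://tilde.club/~someone`, resolves to the parent directory, so it is worth normalizing the address first.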

And then there's the large majority of them: pages that have no machine-readable equivalents. In my real life, I use Page2RSS to produce a changelog RSS. That works pretty well, except for two things. One, it's rather tedious in TinyTinyRSS: copy the page address, open the Page2RSS page, paste the page address into the form, copy the resulting page address, waaaaait for TinyTinyRSS to open, select Subscribe to Feed, paste the address, select the category, waaaaait for TinyTinyRSS to poll the site, waaaaaaaait for TinyTinyRSS to reload. (I'm not sure if TTR is slow, or our server is, or my browser is, or all three.)

Now, once it's set up, it works great, and generally hands me everyone's blog entry in a reasonably timely manner. So if pumpkin spice does all of those steps under the hood and asynchronously, that solves most of my problem. That leaves the second thing: I'm relying on a third-party service whose customer I am not. So PS has to have a built-in diff function that can preserve the formatting relatively well. And ideally, it should be a little smarter than Page2RSS and not hand me every update where only a timestamp (or a temperature, or whatever) changed.
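One way the "smarter" part might work, sketched in Python with the standard `difflib` module: mask the volatile bits on each line before comparing, so lines that differ only in a clock time or a temperature count as unchanged. The patterns and function names here are assumptions for illustration, and the sketch works on plain text lines, so it does nothing yet about preserving formatting.

```python
import difflib
import re

# Hypothetical "noise" patterns; a real list would need tuning per page.
VOLATILE = [
    re.compile(r"\d{4}-\d{2}-\d{2}"),                        # ISO dates
    re.compile(r"\d{1,2}:\d{2}(:\d{2})?(\s?[AaPp][Mm])?"),   # clock times
    re.compile(r"-?\d+(\.\d+)?\s?°\s?[CF]"),                 # temperatures
]


def normalize(line):
    """Replace volatile values with a placeholder so they compare equal."""
    for pattern in VOLATILE:
        line = pattern.sub("#", line)
    return line.strip()


def meaningful_changes(old_text, new_text):
    """Return lines of new_text that are new or changed, ignoring noise."""
    old_lines = old_text.splitlines()
    new_lines = new_text.splitlines()
    # Compare the normalized lines, but report the original new lines.
    matcher = difflib.SequenceMatcher(
        a=[normalize(line) for line in old_lines],
        b=[normalize(line) for line in new_lines],
        autojunk=False,
    )
    changed = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        if tag in ("insert", "replace"):
            changed.extend(new_lines[j1:j2])
    return changed
```

An empty result means nothing worth reporting changed, so no feed entry would be generated for that poll.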

I'm really hoping somebody else has solved that problem, and that I just need to pore over CPAN (and GitHub, these days) to find it, because otherwise I have to go into the hell that is HTML parsing.

Page created: 07 November 2014
