One month ago, we acquired (via freedom of information request) the Office for National Statistics’ guidance on what its staff and agents should be looking for when observing the “basket” of items used to monitor inflation.
We wrote then:
We’ll try to make a tidier version of the ONS’s table soon
In the meantime, stuff has happened. But now we’re SO back, and we’ve got the mega spreadsheet to prove it.
If you’re from a hedge fund and are just here to scrape said spreadsheet, please scroll down. If you’re interested in a woeful tale, stick with us.
A woeful tale, pt.1
We mentioned last month that:
Because civil servants are cruel, the document’s formatting is horrible, half of its text isn’t actual text, and there are redactions. 🙃
Wah, wah, wah — yes, journalism really IS hard and don’t let anyone tell you otherwise.
The ONS’s response to our freedom of information request landed as a 70-page PDF file, which can be downloaded here.
The PDF is, basically, a spreadsheet, consisting of item names, and up to six “help line” columns detailing exactly what the ONS considers valid for that category.
So, for instance:
This is clearly odd. Why a continuous string of text is being written across six columns in this way, we have no idea.
Of course, not all of the items are this oddly separated, eg:
While others are much more condensed, such as:
There is no apparent rhyme nor reason to why the guidance is sometimes spread out across columns, nor is there anything resembling a consistent style to how things are written. It very much feels like the product of years of ad hoc tweaks and edits (presumably in response to requests for clarification), and gives us concrete poetry flashbacks.
Readers may have noticed references to codes such as N and C in those help lines above. N means a new product is non-comparable to a previously observed product, while C means it is comparable. Simple!
A woeful tale, pt.2
We’re not exactly sure how the ONS managed this, but the PDF’s pagination per item loosely goes like this:
First pages (labelled 1–35, odd numbers in PDF)
Item description / Help line 1 / Help Line 2
Second pages (labelled 36–70, even numbers in PDF)
Help line 3 / Help Line 4 / Help line 5 / Help line 6
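For anyone trying to stitch the two halves back together, here is a minimal sketch of that page arithmetic. It assumes our loose reading above is exact: that the page labelled 1 sits at PDF position 1, its continuation (labelled 36) at position 2, and so on. The function names and the SHEETS constant are ours, made up for illustration, and have nothing to do with the ONS.

```python
SHEETS = 35  # assumption: 35 "first" pages and 35 "second" pages, 70 in total


def label_at(position: int) -> int:
    """Return the printed page label found at a 1-based PDF position."""
    if not 1 <= position <= 2 * SHEETS:
        raise ValueError(f"position must be between 1 and {2 * SHEETS}")
    if position % 2 == 1:
        # Odd positions hold the first halves, labelled 1 to 35.
        return (position + 1) // 2
    # Even positions hold the second halves, labelled 36 to 70.
    return SHEETS + position // 2


def positions_for(sheet: int) -> tuple[int, int]:
    """Return the PDF positions of both halves of one sheet (1 to 35)."""
    if not 1 <= sheet <= SHEETS:
        raise ValueError(f"sheet must be between 1 and {SHEETS}")
    # First half at the odd position, second half right after it.
    return 2 * sheet - 1, 2 * sheet
```

So under this assumption the help lines 3 to 6 for whatever is on labelled page 12 would be on labelled page 47, which is the 24th page of the PDF.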
Obviously this isn’t a big issue in and of itself, but we wanted to convert this into a more usable format, and here (non-exhaustively) are some of the other issues the PDF has:
— Some pages, including the first, have not been saved as text
— Some cells are full of random line breaks and other typographic artefacts
— Nothing redacted ever pastes good
— Text copied from cells came out as unbroken strings when pasted
— In Acrobat and OS X Preview, some pages copied sideways, natch.
We reached out to the ONS and asked if we could simply get the underlying spreadsheet, and in the meantime went to work doing things semi-manually (copying individual cells or typing them out by hand).
A woeful tale, pt.3
[REDACTED] was actually extremely helpful, and there’s a twisted logic to how things were done here. Even so, we ended up primarily deriving our information from the optical-character-recognition-generated .docx version of a PDF version of an Excel spreadsheet.
Behold, a basket
Some time later, we’d got everything into a single spreadsheet. Caveats:
— we have undoubtedly made some typos given the amount of manual copying of weirdly-written data we had to do
— we’ve recreated the guidance in the exact style it was written, which is often massively inconsistent.
— in some instances, the OCR process applied by the ONS means that letters with descenders that appear at the bottom of a page have scanned weirdly, or that some letters are borked (eg we had to fix “fluten” to “gluten”)
As mentioned, the column divisions made very little overall sense. So we concatenated them to make each set of guidance read as one paragraph.
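A minimal sketch of that concatenation step, for the curious. This is not the exact process we followed (much of ours was manual); it just shows the idea, assuming each row arrives as a list of up to six help-line cells, some of them empty. The function name and the sample row are invented for illustration and are not taken from the ONS document.

```python
def join_help_lines(cells):
    """Join up to six help-line cells into one paragraph of guidance."""
    parts = []
    for cell in cells:
        if cell is None:
            continue  # the column was empty for this item
        # split() with no argument breaks on any whitespace, so this
        # collapses stray line breaks and runs of spaces inside a cell.
        text = " ".join(cell.split())
        if text:
            parts.append(text)
    return " ".join(parts)


# Invented sample row, not real ONS guidance.
row = ["Any brand,  \n sliced", None, "", "or unsliced."]
paragraph = join_help_lines(row)
```

Joining with a single space keeps a sentence that was split mid-flow across columns readable, at the cost of losing any hint of where the original column breaks were.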
For everyone else, enjoy (nb there’s a direct download link in the bottom right if you want the raw data):
Observation notes on observation notes
Last time around, we mentioned some of the oddities in this document. The laborious process of preparing this spreadsheet caused us to note several more:
— a child’s soft toy/teddy bear CANNOT be a hand puppet
— a child’s baby doll must have a plastic element
— gold rings aren’t allowed to be rose gold (don’t be basic)
— organic bread should only be priced if it’s “representative” of the shop it’s from, whatever that means
— quiche crusts are optional
— observers must specify whether popcorn is sweet/salty/both, but it doesn’t actually matter
— the guidance on livery charges is… thorough
— 2-in-1 shower gel products aren’t allowed, sorry Boris
— chicken nuggets don’t count as chicken for the purposes of a chicken and chips takeaway
In particular, we found ourselves drawn to a number of items where there is little or no help provided, including…
— Digital Media Player
— Computer Software
— Mobile Phone Applications
— Interchangeable Lens Digital Camera
— Laptop Computers
…and dozens more, all of which have hugely variable product types and costs. ¯\_(ツ)_/¯
In fact, almost everything in here seems absurd if you look closely enough. Oh well.
Further reading:
— small caged mammal (FTAV)
— small caged mammal infinity (FTAV)
— small caged mammal revelations (FTAV)
— small caged mammal merchandise (Redbubble)