XPath for Web Scraping

You are currently reading XPath for Web Scraping: A Practical, Hands-On Guide to Web Scraping with XPath, a free online book designed to get developers up and running with XPath as quickly and easily as possible.

Getting Started

This is chapter 2 of 3.

The best way to learn how to write XPath expressions is to actually write them. You need to be actively experimenting, tweaking, and testing expressions, allowing yourself to learn why some expressions work as expected and why some expressions don’t. XPath isn’t a particularly finicky syntax, but there are some “gotchas” that are most easily understood once you’ve encountered them in the wild.

With this in mind, we’re going to dive right into using XPath to extract content from a web page. Specifically, we’re going to extract content from example.com – a website that was designed to be used as an example project in books.

It’s not the most exciting website in the world, but once we have a strong grasp of XPath’s fundamentals (which won’t take long), we can start working on some more compelling projects.

What You’ll Need

XPath, in itself, is not a piece of software for scraping the web. It’s a syntax that is used to find and extract elements from HTML documents, and this syntax is simply supported by a number of scraping tools.

With this in mind, it’s important that everyone reading this book has access to software that allows them to play around with XPath expressions.

If you’re already familiar with a particular scraping tool, feel free to use it. The expressions should work the same, no matter what tool you’re using. If you’re not familiar with a web scraping tool though, or if you’d like to use the exact tool that I’ll be using in this book:

  1. Download a copy of the Firefox web browser.
  2. Install the FirePath extension. (This extension relies on the Firebug extension, so both will need to be installed.)

FirePath is a brilliant tool that allows us to test XPath expressions from directly within the browser. I use it for all of my web scraping projects, and it’s wonderfully useful for both learning and actively using XPath.

After installing FirePath, open the Firebug pane and click the tab that says “FirePath”. This pane is where the magic happens.

FirePath is a fairly straightforward tool, but for the sake of completeness, here’s everything you need to know about how to use it:

You start by writing XPath expressions in the text box at the top of the pane and then clicking the “Evaluate” button.

If your XPath expression is invalid – maybe you’ve left out a necessary character, for instance – the text field will turn red. If your expression is valid though, three things will happen:

  1. Whatever content is found by the XPath expression will be highlighted in the browser with a dashed blue outline.
  2. Whatever content is found by the XPath expression will be highlighted in the source code, beneath the text box.
  3. The number of elements found by the XPath expression will appear in the status bar, at the bottom of the browser.
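The same evaluate-and-inspect loop can be roughly imitated in a few lines of Python. This is only a sketch under some assumptions: it uses Python’s built-in xml.etree.ElementTree module, which understands only a limited subset of XPath; PAGE is a hand-written, simplified stand-in for a web page, not any real site’s markup; and evaluate is a helper name invented for this example. ElementTree raises SyntaxError for an expression it can’t parse, which plays the role of the text field turning red.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a web page (not a real site's markup).
PAGE = """
<html>
  <head><title>Example Domain</title></head>
  <body>
    <div>
      <h1>Example Domain</h1>
      <p>This domain is for use in illustrative examples.</p>
      <p><a href="https://www.iana.org/domains/example">More information</a></p>
    </div>
  </body>
</html>
"""


def evaluate(expression, root):
    """Return the list of matching elements, or None if the expression is rejected."""
    try:
        return root.findall(expression)
    except SyntaxError:
        # ElementTree reports an expression it cannot parse with SyntaxError.
        return None


root = ET.fromstring(PAGE)

# The first expression is valid; the second has an empty predicate.
for expression in (".//p", ".//h1[]"):
    matches = evaluate(expression, root)
    if matches is None:
        print(f"{expression}: invalid expression")
    else:
        print(f"{expression}: {len(matches)} matching element(s)")
```

Note that the expressions start with a dot (.//p rather than //p). That’s a quirk of ElementTree, which evaluates expressions relative to an element; FirePath and most scraping tools accept //p directly.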

You’ll also notice that, if you right-click any element in the web page, there’s a new “Inspect in FirePath” option that appears in the context menu. If you select this option, FirePath will figure out the XPath expression for this particular element and display the results of it, as if you’d typed it out yourself and clicked the “Evaluate” button. The expressions that FirePath generates aren’t particularly elegant – they’re cryptic and overly specific – but, once you’re familiar with the XPath syntax, they can be a handy way to lay the foundations for your own creations.
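To get a feel for what “overly specific” means, here’s a small sketch that compares a fully spelled-out path with a shorter hand-written one. The assumptions are worth stating plainly: the “generated” expression below is a made-up illustration of the style, not FirePath’s actual output; PAGE is a hypothetical, simplified page; and the code uses Python’s built-in xml.etree.ElementTree module, which supports only a subset of XPath and evaluates expressions relative to the root element (hence the leading dot).

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for a web page (not a real site's markup).
PAGE = """
<html>
  <head><title>Example Domain</title></head>
  <body>
    <div>
      <h1>Example Domain</h1>
      <p>This domain is for use in illustrative examples.</p>
      <p><a href="https://www.iana.org/domains/example">More information</a></p>
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# Made-up example of a step-by-step path, in the style a tool might generate.
generated = "./body/div/p[2]/a"

# A shorter hand-written expression: any <a> element, anywhere in the page.
handwritten = ".//a"

generated_matches = root.findall(generated)
handwritten_matches = root.findall(handwritten)

# On this page, both expressions select the very same link element.
same_result = generated_matches == handwritten_matches
print(same_result)
```

The two expressions agree here only because this page contains a single link. The spelled-out path breaks as soon as the page’s layout shifts, while the short one starts matching extra elements as soon as more links appear, which is why a generated expression is best treated as a starting point.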

Project Setup

Before we write any XPath expressions, we need to ask ourselves an important question: “What content do we actually want to extract from the web page?”

In this case, the answer doesn’t particularly matter, since we’re scraping an arbitrary page without any particular goal in mind. But for the sake of having some direction, let’s say that our initial goal is to extract the text of the page’s heading. We’ll do a lot more than this, but this is a good first step.
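Outside the browser, that first goal might look something like the sketch below. It rests on a few assumptions: PAGE is a hand-written, simplified stand-in rather than the real page’s source, the heading is assumed to be an h1 element, and the code uses Python’s built-in xml.etree.ElementTree module, which supports only a subset of XPath and needs well-formed markup.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the page (not its real source).
PAGE = """
<html>
  <head><title>Example Domain</title></head>
  <body>
    <div>
      <h1>Example Domain</h1>
      <p>This domain is for use in illustrative examples.</p>
    </div>
  </body>
</html>
"""

root = ET.fromstring(PAGE)

# Find the first <h1> anywhere beneath the root element.
heading = root.find(".//h1")

# find() returns None when nothing matches, so guard before reading the text.
heading_text = heading.text if heading is not None else None
print(heading_text)
```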