
XPath for Web Scraping: A Practical, Hands-On Guide to Web Scraping with XPath

Introduction

What is XPath?

Why XPath?

The magic of XPath is best experienced, rather than explained, which is why the bulk of this book is based around practical examples and lessons. If you’re not sure if XPath is worth your time though, I’m happy to put on my salesman hat and convince you that, yes, if you’re serious about web scraping, you need to know how to write XPath expressions.

Here’s why:

Benefit #1: Flexibility

You can use XPath expressions to grab almost any content from any web page. The big exception to this rule is content that’s delivered via JavaScript – there are separate tools to handle that part of the process – but when it comes to grabbing text, links, images, or any other standard elements from HTML documents, XPath is the best tool for the job.

To illustrate this flexibility, consider the following scenarios:

  1. You want to extract the href attribute of every “Buy Now” link on a page.
  2. You want to extract the src attribute of the first image inside a particular div element.
  3. You want to extract every row of a table, except for the first and last rows.

In theory, these sound like relatively complicated things to achieve, but XPath can handle all of these situations without breaking a sweat. In fact, these are some of the easiest things we can do with XPath. Once we dive into the more advanced details, what we can achieve is absolutely mind-boggling.

Benefit #2: Power

XPath expressions don’t simply make it easier to scrape the web. They allow us to avoid writing a ton of repetitive code. When you consider how volatile web scraping can be – the web is constantly changing, after all – this ability to write less code becomes an incredible advantage.

For example, think about how much time it’d take to handle the scenarios we just discussed using the programming language of your choice, or a visual scraping tool (like import.io). Then know that we can achieve all of that same functionality with three relatively short XPath expressions:

Extract href attribute of “Buy Now” links:

//a[text()="Buy Now"]/@href

Extract src of first image inside div.content:

(//div[@class="content"]//img)[1]/@src

Extract all rows of a table, except for the first and last rows:

//table//tr[position()>1 and position()<last()]

You don’t have to understand what these expressions are actually doing – we’ll uncover their mysteries soon enough – but at least appreciate how much functionality we’re able to achieve in less than fifty characters.
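If you’d like to see these three expressions in action before we dissect them, here’s a minimal sketch using Python’s lxml library – one XPath-capable tool among many, and an assumption on my part, since any other tool would evaluate the expressions identically. The sample HTML is hypothetical:

```python
from lxml import html  # third-party library: pip install lxml

# A hypothetical page containing a "Buy Now" link, images, and a table.
page = html.fromstring("""
<html><body>
  <div class="content"><img src="a.png"><img src="b.png"></div>
  <a href="/buy/1">Buy Now</a>
  <a href="/about">About</a>
  <table>
    <tr><td>header</td></tr>
    <tr><td>row 1</td></tr>
    <tr><td>row 2</td></tr>
    <tr><td>footer</td></tr>
  </table>
</body></html>""")

# href attribute of every "Buy Now" link
print(page.xpath('//a[text()="Buy Now"]/@href'))             # ['/buy/1']

# src of the first image inside div.content
print(page.xpath('(//div[@class="content"]//img)[1]/@src'))  # ['a.png']

# every table row except the first and the last
rows = page.xpath('//table//tr[position()>1 and position()<last()]')
print([row.text_content().strip() for row in rows])          # ['row 1', 'row 2']
```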

Benefit #3: Reliability

Let me know if this sounds familiar:

  1. You create a web scraper.
  2. The website you’re scraping changes its design.
  3. Everything breaks.

Even if you’re new to the world of web scraping, it’s not hard to imagine battling this scenario on a fairly regular basis. The web is a constantly shifting beast, and it’s this lack of stability that ensures web scraping remains an inherently fragile process.

The bad news is, we can’t avoid this fragility altogether (unless we only scrape websites that literally never change), but there is some good news:

XPath has a number of features that allow us to scrape content in a far more reliable manner, so even if a website does change, there is less of a risk that our scrapers will completely fall apart. This does require a bit of luck (or psychic powers), but since these features are so easy to implement, there’s plenty of upside for only a minimal investment of time.

Benefit #4: Readability

Okay, this might not look readable:

(//div[@class='content']//img)[1]/@src

In fact, it probably looks downright cryptic. But once you learn the basic rules of how XPath expressions are structured – and don’t worry, there aren’t many rules to learn – you’ll see XPath for what it truly is: a wonderfully expressive syntax that can be as easy to read as your native tongue.

If you’re skeptical of this point, that’s fair enough. When I first came across XPath, it seemed like impenetrable gibberish. If you stick with it until the end of this book though, you’ll have a strong grasp of any XPath expression that you come across.

Benefit #5: Ubiquity

Some web scraping tools, like Python’s BeautifulSoup library, have their own syntax for extracting content from the web, but throughout the web scraping world, you’ll find that XPath is the standard. It’s supported by a ton of scraping libraries, tools, and applications, including Nokogiri, lxml, Scrapy, and Selenium.

The benefit of this ubiquity is that, once you learn XPath, your knowledge of the syntax will carry over between countless platforms. You won’t be locked into any specific, proprietary way of working, so if you ever want to move to a different set of tools, you’ll be able to simply copy and paste your XPath expressions and they should work exactly the same.

XPath vs. CSS

Many web scraping tools that support XPath expressions also allow users to extract content from the web using CSS selectors, and there are a couple of significant reasons you might prefer to use CSS instead of XPath:

  1. You’re probably already familiar with CSS selectors, which means you don’t have to learn anything new.
  2. The syntax is a lot simpler.

To illustrate the second point, imagine that we want to grab all of the elements from a web page that have a class of “button” by using the CSS features of the Nokogiri library. Here’s how we’d write that code:

page.css('.button')

Pretty slick, right?

In contrast, this is how we’d achieve the same effect with XPath:

page.xpath("//*[@class='button']")

The syntax, clearly, isn’t as seamless.

The problem with CSS is that you’re limited in the amount of logic you can write into your expressions. You can’t, for instance, use CSS to grab all of the href attributes from all of the links on a page, nor can you extract the second-last row of data from the first table on a page. You could select all of the links on a page, and select that first table, but you’d have to write (and maintain) additional code for the rest of the logic.

It’s for precisely this reason that you need to learn XPath if you’re serious about web scraping. If you only have knowledge of using CSS selectors, you’ll quickly find yourself limited (and frustrated) by what you can achieve.

That’s not to say CSS doesn’t have its place. I still use CSS in my web scrapers on a regular basis. It’s quick, easy, and aesthetically pleasing, so it’s a handy tool to have in your toolkit. The moment I need to do anything remotely sophisticated though, I reach for XPath.

Getting Started

The best way to learn how to write XPath expressions is to actually write them. You need to be actively experimenting, tweaking, and testing expressions, allowing yourself to learn why some expressions work as expected and why some expressions don’t. XPath isn’t a particularly finicky syntax, but there are some “gotchas” that are most easily understood once you’ve encountered them in the wild.

With this in mind, we’re going to dive right into using XPath to extract content from a web page. Specifically, we’re going to extract content from example.com – a website that was designed to be used as an example in documentation and books.

It’s not the most exciting website in the world, but once we have a strong grasp of XPath’s fundamentals (which won’t take long), we can start working on some more compelling projects.

What You’ll Need

XPath, in itself, is not a piece of software for scraping the web. It’s a syntax that is used to find and extract elements from HTML documents, and this syntax is simply supported by a number of scraping tools.

With this in mind, it’s important that everyone reading this book has access to software that allows them to play around with XPath expressions.

If you’re already familiar with a particular scraping tool, feel free to use it. The expressions should work the same, no matter what tool you’re using. If you’re not familiar with a web scraping tool though, or if you’d like to use the exact tool that I’ll be using in this book:

  1. Download a copy of the Firefox web browser.
  2. Install the FirePath extension. (This extension relies on the Firebug extension, so both will need to be installed.)

FirePath is a brilliant tool that allows us to test XPath expressions from directly within the browser. I use it for all of my web scraping projects, and it’s wonderfully useful for both learning and actively using XPath.

After installing FirePath, open the Firebug pane and click the tab that says “FirePath”. This pane is where the magic happens.

FirePath is a fairly straightforward tool, but for the sake of completeness, here’s everything you need to know about how to use it:

You start by writing XPath expressions in the text box at the top of the pane and then clicking the “Evaluate” button.

If your XPath expression is invalid – maybe you’ve left out a necessary character, for instance – the text field will turn red. If your expression is valid though, three things will happen:

  1. Whatever content is found by the XPath expression will be highlighted in the browser with a dashed blue outline.
  2. Whatever content is found by the XPath expression will be highlighted in the source code, beneath the text box.
  3. The number of elements found by the XPath expression will appear in the status bar, at the bottom of the browser.

You’ll also notice that, if you right-click any element in the web page, a new “Inspect in FirePath” option appears in the context menu. If you select this option, FirePath will figure out the XPath expression for that particular element and display its results, as if you’d typed it out yourself and clicked the “Evaluate” button. The expressions that FirePath generates aren’t particularly elegant – they’re cryptic and overly specific – but, once you’re familiar with the XPath syntax, they can be a handy way to lay the foundations for your own creations.

Project Setup

Before we write any XPath expressions, we need to ask ourselves an important question: “What content do we actually want to extract from the web page?”

In this case, the answer doesn’t precisely matter, since we’re scraping an arbitrary page without any particular goal in mind. But for the sake of having some direction, let’s say that our initial goal is to extract the text of the page’s heading. We’ll do a lot more than this, but this is a good first step.

XPath Fundamentals

Selecting Elements

In the majority of cases, XPath expressions start with a forward-slash:

/

There is one exception to this rule, where XPath expressions will sometimes start with a period and then a forward-slash, like so:

./

But that’s for a specific reason, which we’ll cover later in this book, so while you might see both of these options floating around in other tutorials, you only have to think about the single forward-slash for the time being.

Beyond the initial forward-slash, we can reference whatever element we want to extract from the web page.

For example, if we wanted to select and extract the entire web page, we could reference the html element itself:

/html

If you enter this expression into FirePath and click the “Evaluate” button:

  1. a blue outline will appear around the web page
  2. the html element will be highlighted in the source code
  3. the status bar will read, “1 matching node”

The entire web page has now been selected.

But what if we wanted to extract everything from between the web page’s head tags? To achieve this, we can use another forward-slash and reference the head element:

/html/head

This expression will extract everything from between the page’s head tags, including the page’s title – “Example Domain” – and the page’s styles.

If we wanted to specifically extract the page’s title element, we can dig one level deeper with another forward-slash:

/html/head/title

Already, a pattern should be obvious. With each expression, we’re using the forward-slash to descend through the tree-like structure of the document.

But what if we wanted to only extract the contents of the body element?

To achieve this, we’d write the following expression:

/html/body

Here, it’s important to understand that we go straight from the html element to the body element because the body element is a direct child of the html element.

With this in mind, consider how we might explain these XPath expressions using simple words. The /html/body expression, for example, translates to:

Get the body element, which is a child of the html element.

Basically, if we read our XPath expressions from right to left, the / character is synonymous with the words “which is a child of”.

For example, try converting the following expression into words:

/html/body/div/h1

The same rule applies:

Get the h1 element, which is a child of the div element, which is a child of the body element, which is a child of the html element.

It doesn’t matter how deeply structured the HTML might be. You can always navigate down, descending through as many layers of elements as you wish.
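To make this navigation concrete, here’s a minimal sketch using Python’s lxml library (an assumption on my part – any XPath-capable tool behaves the same way), run against a small page that mirrors example.com’s structure:

```python
from lxml import html  # third-party library: pip install lxml

# A small hypothetical page mirroring example.com's structure.
page = html.fromstring(
    "<html><head><title>Example Domain</title></head>"
    "<body><div><h1>Example Domain</h1></div></body></html>")

# Each forward-slash descends exactly one parent-child level.
title = page.xpath('/html/head/title')
heading = page.xpath('/html/body/div/h1')

print(title[0].text)    # Example Domain
print(heading[0].text)  # Example Domain
```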

Multiple Elements

If you’ve been paying attention, you might have noticed that, each time we’ve evaluated our XPath expressions, the status bar has read: “1 matching node”.

This is because, with every expression we’ve written, only a single element has been found. But this is simply because of the specific expressions we’ve been writing. The default behavior of XPath is to actually return all of the elements found by an expression – not just the first one.

To see this in action, try the following expression:

/html/body/div/p

This expression says:

Find all of the p elements, which are children of the div element, which is a child of the body element, which is a child of the html element.

You’ll see that FirePath highlights all of the p elements.

Sometimes, this default behavior is what you’ll want to happen, but sometimes it’s not. Soon, we’ll see how we can extract individual elements, or even a range of specific elements. For the time being though, it’s only important to understand that XPath will return all matching elements unless told otherwise.
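Here’s a quick sketch of that default behavior, again using Python’s lxml library against a hypothetical page with three paragraphs:

```python
from lxml import html  # third-party library: pip install lxml

page = html.fromstring(
    "<html><body><div>"
    "<p>First</p><p>Second</p><p>Third</p>"
    "</div></body></html>")

# XPath returns every matching element by default, not just the first.
paras = page.xpath('/html/body/div/p')
print([p.text for p in paras])  # ['First', 'Second', 'Third']
```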

Finding Descendants

So far, we’ve been writing XPath expressions that navigate through the HTML on a child-by-child basis. To grab the h1 tags, for instance, we had to specify an expression with four parts:

/html/body/div/h1

But the more specific an expression is, the more fragile it becomes, because what if the website owner moves the h1 tags inside a span element? Or they wrap the current div element inside another div element? All of a sudden, our expression would break, even though the content we want to extract hasn’t fundamentally changed.

What we want to be able to say is this:

Get all of the h1 elements on this page. I don’t care exactly where those h1 elements are. It doesn’t matter if they’re wrapped in a span element or if they’re nested in layers upon layers of div elements. Just give me the h1 elements.

What we can’t do is this:

/html/h1

This expression won’t work because the h1 element is not a direct child of the html element. The h1 element is a descendant of the html element, and that’s a critical distinction.

There is, however, a simple solution to this problem.

All we have to do is use two forward slashes within our expression:

/html//h1

This expression will find all of the h1 elements that are descendants – not necessarily children – of the html element. Basically, it’s a much looser way of finding and selecting elements. XPath won’t care about direct, parent-child relationships. It’s simply looking for any instance of the specified element, no matter where exactly it’s located.

Because of how the // works though, we don’t actually need the /html part at the start of the expression. This expression will provide the same result:

//h1

The downside of using // is that, technically, it’s slower, since we’re forcing XPath to search through the entire document. In practice, you probably won’t notice this slowness, but it’s generally considered a best practice to use / instead of // if you’re absolutely sure that two elements share a parent-child relationship.
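The difference between the two axes is easy to demonstrate. In this sketch (Python’s lxml library again, with hypothetical markup), the h1 is buried under an extra layer of div elements:

```python
from lxml import html  # third-party library: pip install lxml

# The h1 is buried two div layers deep in this hypothetical page.
page = html.fromstring(
    "<html><body><div><div><h1>Deeply nested</h1></div></div></body></html>")

# The strict child-by-child path finds nothing once wrappers appear...
print(page.xpath('/html/body/div/h1'))       # []

# ...but the descendant axis finds the h1 wherever it lives.
print([h.text for h in page.xpath('//h1')])  # ['Deeply nested']
```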

Extracting Text

The problem with the //h1 expression is that it’s extracting the entire h1 element, when what we really want is the text from within the h1 element. We want the text itself – not the text wrapped in the element’s tags.

To achieve this, we can use XPath’s text() function:

//h1/text()

Here, we have the same expression as before, but we’ve added a forward-slash, along with this text() function. This expression now says:

Get all of the text nodes that are children of the h1 element.

What is a “text node”?

A text node is basically a chunk of text.

In this HTML, for instance:

<h1>Example Domain</h1>

…the “Example Domain” part is a text node.

But if there was some other element between these two words:

<h1>Example <br> Domain</h1>

…we’d actually be left with two text nodes. “Example” would be a text node and “Domain” would be another text node.

It’s also worth noting that we still need to take into consideration whether or not we want to use a single or double forward-slash.

For example, if the text between the h1 tags was inside a span element:

<h1><span>Example Domain</span></h1>

…this expression would not extract the text from between the h1 tags:

//h1/text()

Why?

Because we’re using a single forward-slash, but the text nodes are not a direct child of the h1 element. The span element is the direct child of the h1 element. As such, the single forward-slash is not the right choice.

To account for this, we could either add the span element to the expression:

//h1/span/text()

…or use a double forward-slash:

//h1//text()

This latter expression would extract all of the text nodes from within the h1 element, no matter what other elements might be wrapped around the individual text nodes.
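Both behaviors can be checked in one small sketch with Python’s lxml library (an assumption; any XPath tool agrees). The h2 here stands in for the h1/span example above, purely so both cases fit on one hypothetical page:

```python
from lxml import html  # third-party library: pip install lxml

page = html.fromstring(
    "<html><body>"
    "<h1>Example <br> Domain</h1>"
    "<h2><span>Wrapped text</span></h2>"
    "</body></html>")

# The <br> splits the h1's contents into two separate text nodes.
print(page.xpath('//h1/text()'))   # ['Example ', ' Domain']

# A single slash misses text nodes hidden inside a child element...
print(page.xpath('//h2/text()'))   # []

# ...while a double slash collects text nodes at any depth.
print(page.xpath('//h2//text()'))  # ['Wrapped text']
```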

Extracting Attributes

An essential part of web scraping is not only extracting elements from a web page, but also extracting the attributes of those elements. For example, if you’re building a scraper that will navigate through a website and extract all of its links, you need to be able to extract the href attributes of those links. Fortunately, XPath makes this extremely easy to do.

To begin, extract all of the a elements from the page:

//a

(The page we’re working with at the moment only has a single a element, so this expression will only return that one element, but you’re more than welcome to test these expressions on other web pages.)

Then modify the expression to resemble the following:

//a/@href

Here, we’ve added this /@href part to the expression. Whenever the @ symbol is used within an XPath expression, it’s synonymous with the word “attribute”. It doesn’t matter what attribute is being referenced – it could be the href attribute or the src attribute or the rel attribute – the @ symbol always means “attribute”.

You could extract all of the class attributes from every div element:

//div/@class

…or extract all of the src attributes from every img element:

//img/@src

It’s worth noting, however, that the @ symbol does not have to be used directly after a forward-slash. We could, for instance, extract every href attribute from a page without specifying the element:

//@href

The problem with this approach is its lack of specificity. The href attribute, after all, is not just used with the a element. You’ll also find the href attribute inside the link element, which is used to embed a stylesheet. As such, I generally suggest using the @ symbol directly after a single forward-slash, to represent that direct, parent-child relationship. In this context, being explicit ensures that you don’t accidentally select unwanted data.
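The difference in specificity shows up immediately in practice. Here’s a sketch with Python’s lxml library (an assumption, as before), using hypothetical markup with href attributes on both a link element and an a element:

```python
from lxml import html  # third-party library: pip install lxml

# A hypothetical page with href attributes on both link and a elements.
page = html.fromstring(
    "<html><head><link rel='stylesheet' href='style.css'></head>"
    "<body><a href='/home'>Home</a><img src='logo.png'></body></html>")

# Scoped to a elements, only the anchor's destination comes back.
print(page.xpath('//a/@href'))        # ['/home']

# Unscoped, the stylesheet's href is swept up as well.
print(sorted(page.xpath('//@href')))  # ['/home', 'style.css']

print(page.xpath('//img/@src'))       # ['logo.png']
```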

Using Conditions

To make our XPath expressions more powerful, we can use conditions, which allow us to only select elements that match (or don’t match) a particular set of rules that we define.

For example, imagine we want to extract all of the a elements from a page that have a class attribute. At this point, we don’t care what the value of the class attribute actually is. We just want to ensure that the a elements have a class attribute.

To achieve this, start by writing a familiar expression:

//a

Then append a pair of square brackets to the a element:

//a[]

It’s between these brackets that we define the conditions we want to apply to the element immediately to the left of the brackets.

In this case, to extract the a elements that have a class attribute, we can write:

//a[@class]

It’s worth noting that we never place a forward-slash between the name of an element and the square brackets of a condition. The condition should always be placed to the immediate right of an element name.

If we want to check whether an element’s attribute is equal to a certain value, rather than only testing whether or not an attribute exists, we can do that as well.

For example, here’s how we can extract all of the a elements from a page that have their class attribute defined as “button”:

//a[@class='button']

…and here’s how we can retrieve a div element that has its id attribute defined as “content”:

//div[@id='content']

It’s worth noting, however, that writing conditions in this format is somewhat restrictive, because these expressions will only retrieve elements that match our condition exactly.

For example, this condition:

//a[@class='button']

…would not extract any a elements that are structured like this:

<a href="http://example.com" class="button primary">Home</a>

Why?

Because the expression is looking for a elements where the class attribute is equal to “button”, not a elements where the class attribute is equal to “button primary”. It’s not an exact match, so the expression won’t return this element.

In some cases, this will be the desired behavior, but not always, and in a later section, we’ll explore a less strict way to use conditions.

It’s also worth noting that multiple conditions can be used throughout an XPath expression, which is what makes them particularly powerful.

For example, here’s how we’d retrieve all of the div elements on a page where the class attribute is equal to “entry-content”:

//div[@class='entry-content']

…and here’s how we could use another condition to extract all of the a elements from within this div element where the rel attribute of those a elements is equal to “nofollow”:

//div[@class='entry-content']//a[@rel='nofollow']

Pretty nifty, right?
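All of the conditions from this section can be exercised in one sketch. As before, this uses Python’s lxml library (an assumption on my part) against hypothetical markup:

```python
from lxml import html  # third-party library: pip install lxml

# Hypothetical markup exercising each condition from this section.
page = html.fromstring("""
<html><body>
  <div class="entry-content">
    <a href="http://example.com" class="button">Home</a>
    <a href="/buy" class="button primary">Buy</a>
    <a href="/spam" rel="nofollow">Spam</a>
  </div>
  <a href="/outside" class="button">Outside</a>
</body></html>""")

# [@class] keeps any a element that has a class attribute at all.
print(len(page.xpath('//a[@class]')))            # 3

# [@class='button'] is an exact match, so "button primary" is skipped.
print(page.xpath('//a[@class="button"]/@href'))  # ['http://example.com', '/outside']

# Conditions can be chained along an expression.
print(page.xpath(
    '//div[@class="entry-content"]//a[@rel="nofollow"]/@href'))  # ['/spam']
```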

Practical XPath

Code Snippets

Conclusion