XPath for Web Scraping

You are currently reading XPath for Web Scraping: A Practical, Hands-On Guide to Web Scraping with XPath – a free online book designed to get developers up and running with XPath as quickly and easily as possible.

Introduction

This is chapter 1 of 3.

What is XPath?

Why XPath?

The magic of XPath is best experienced, rather than explained, which is why the bulk of this book is based around practical examples and lessons. If you’re not sure if XPath is worth your time though, I’m happy to put on my salesman hat and convince you that, yes, if you’re serious about web scraping, you need to know how to write XPath expressions.

Here’s why:

Benefit #1: Flexibility

You can use XPath expressions to grab almost any content from any web page. The big exception to this rule is content that’s delivered via JavaScript – there are separate tools to handle that part of the process – but when it comes to grabbing text, links, images, or any other standard elements from HTML documents, XPath is the best tool for the job.

To illustrate this flexibility, consider the following scenarios:

  • You want to grab the href value of every link on a page that has the words “Buy Now” in the anchor text.
  • You want to grab the src attribute of the first img element that appears inside a div element with a class of “content”.
  • You want to grab all of the rows of data from a table, except for the first row and the last row.

In theory, these sound like relatively complicated things to achieve, but XPath can handle all of these situations without breaking a sweat. In fact, these are some of the easiest things we can do with XPath. Once we dive into the more advanced details, what we can achieve is absolutely mind-boggling.

Benefit #2: Power

XPath expressions don’t simply make it easier to scrape the web. They allow us to avoid writing a ton of repetitive code. When you consider how volatile web scraping can be – the web is constantly changing, after all – this ability to write less code becomes an incredible advantage.

For example, think about how much time it’d take to handle the scenarios we just discussed using the programming language of your choice, or a visual scraping tool (like import.io). Then know that we can achieve all of that same functionality with three relatively short XPath expressions:

Extract href attribute of “Buy Now” links:

//a[contains(., "Buy Now")]/@href

Extract src of first image inside div.content:

(//div[@class="content"]//img)[1]/@src

Extract all rows of a table, except for the first and last rows:

//table//tr[position()>1 and position()<last()]

You don’t have to understand what these expressions are actually doing – we’ll uncover their mysteries soon enough – but at least appreciate how much functionality we’re able to achieve in fewer than fifty characters per expression.

Benefit #3: Reliability

Let me know if this sounds familiar:

  1. You create a web scraper.
  2. The website you’re scraping changes its design.
  3. Everything breaks.

Even if you’re new to the world of web scraping, it’s not hard to imagine battling this scenario on a fairly regular basis. The web is a constantly shifting beast, and it’s this lack of stability that ensures web scraping remains an inherently fragile process.

The bad news is, we can’t avoid this fragility altogether (unless we only scrape websites that literally never change), but there is some good news:

XPath has a number of features that allow us to scrape content in a far more reliable manner, so even if a website does change, there is less of a risk that our scrapers will completely fall apart. This does require a bit of luck (or psychic powers), but since these features are so easy to implement, there’s plenty of upside for only a minimal investment of time.

Benefit #4: Readability

Okay, this might not look readable:

(//div[@class='content']//img)[1]/@src

In fact, it probably looks downright cryptic. But once you learn the basic rules of how XPath expressions are structured – and don’t worry, there aren’t many rules to learn – you’ll see XPath for what it truly is: a wonderfully expressive syntax that can be as easy to read as your native tongue.

If you’re skeptical of this point, that’s fair enough. When I first came across XPath, it seemed like impenetrable gibberish. If you stick with it until the end of this book though, you’ll have a strong grasp of any XPath expression that you come across.

Benefit #5: Ubiquity

Some web scraping tools, like Python’s BeautifulSoup library, have their own syntax for extracting content from the web, but throughout the web scraping world, you’ll find that XPath is the standard. It’s supported by a ton of scraping libraries, tools, and applications.

The benefit of this ubiquity is that, once you learn XPath, your knowledge of the syntax will carry over between countless platforms. You won’t be locked into any specific, proprietary way of working, so if you ever want to move to a different set of tools, you’ll be able to simply copy and paste your XPath expressions and they should work exactly the same.

XPath vs. CSS

Many web scraping tools that support XPath expressions also allow users to extract content from the web using CSS selectors, and there are a couple of significant reasons you might prefer to use CSS instead of XPath:

  1. You’re probably already familiar with CSS selectors, which means you don’t have to learn anything new.
  2. The syntax is a lot simpler.

To illustrate the second point, imagine that we want to grab all of the elements from a web page that have a class of “button” by using the CSS features of the Nokogiri library. Here’s how we’d write that code:

page.css('.button')

Pretty slick, right?

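In contrast, this is how we’d achieve roughly the same effect with XPath:

page.xpath("//*[@class='button']")

The syntax, clearly, isn’t as seamless. It isn’t strictly equivalent either: this expression only matches elements whose class attribute is exactly “button”, whereas the CSS selector also matches elements that carry additional classes. A fully equivalent XPath expression is longer still.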

The problem with CSS is that you’re limited in the amount of logic you can write into your expressions. You can’t, for instance, use CSS to grab all of the href attributes from all of the links on a page, and you can’t extract the second-last row of data from the first table on a page. You could select all of the links on a page, and select that first table, but you’d have to write (and maintain) additional code for the rest of the logic.

It’s for precisely this reason that you need to learn XPath if you’re serious about web scraping. If you only have knowledge of using CSS selectors, you’ll quickly find yourself limited (and frustrated) by what you can achieve.

That’s not to say CSS doesn’t have its place. I still use CSS in my web scrapers on a regular basis. It’s quick, easy, and aesthetically pleasing, so it’s a handy tool to have in your toolkit. The moment I need to do anything remotely sophisticated though, I reach for XPath.