Use regular expressions to parse text

In this chapter, you’ll learn how to use Regular Expressions to search and replace text.

What is RegEx and what to use it for

Regular Expressions, or RegEx for short, are used for searching and replacing text. Technically, a RegEx is a sequence of characters that specifies a search pattern.

Why use RegEx

RegEx is primarily used for searching, replacing and extracting text. String libraries can do some of what RegEx is capable of, and sometimes using them is the best thing to do. However, sometimes a RegEx pattern is better.

A fair word of warning though: RegEx is hard to get right. You are encouraged to learn more about how regular expressions work, because they are quite powerful.

My hope is that, after reading this chapter, you will find RegEx less intimidating and see it as a valuable tool in your toolbox.

Where is it used?

RegEx shows up in many different contexts:

Your first RegEx

Let’s construct a simple RegEx to get a feel for it. Here it is:

an

If you apply the search pattern an to the following text:

highlands is a part of Scotland

It will match in two places, the an in highlands and the an in Scotland:

highlands is a part of Scotland

For simple cases, where you want to check whether a specific word occurs in one or more places in a sentence, a pattern like the one above is enough.

RegEx in Go

To start using RegEx in Go, there’s the regexp package in the standard library. There are two approaches:
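The regexp package can be used through package-level helpers such as regexp.MatchString, or through a compiled pattern created with regexp.Compile or regexp.MustCompile. The sketch below assumes these are the two styles of interest and applies the an pattern from earlier with each of them. The names containsAn, anPattern and findAn are made up for this illustration.

```go
package main

import (
	"fmt"
	"log"
	"regexp"
)

// containsAn uses the package-level helper, which compiles the pattern on each call.
func containsAn(text string) (bool, error) {
	return regexp.MatchString(`an`, text)
}

// anPattern is compiled once and can be reused for many searches.
var anPattern = regexp.MustCompile(`an`)

// findAn returns every non-overlapping match of the pattern in text.
func findAn(text string) []string {
	return anPattern.FindAllString(text, -1)
}

func main() {
	text := "highlands is a part of Scotland"

	ok, err := containsAn(text)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println(ok) // true

	fmt.Println(findAn(text)) // [an an]
}
```

A compiled pattern is usually the better choice when the same pattern is applied many times, since the compilation cost is paid only once.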

Character classes

Character classes are able to distinguish between different types of characters. Different types can be newlines, digits, letters and so on.

Let’s have a look at some common types you are likely to encounter:

Type Description
. matches any character except a newline (line feed)
\ escapes the character that follows
\w matches a word character: a letter a-z or A-Z, a digit, or the underscore _
\d matches any digit
\D the inverse of \d, matches any character that is not a digit
\s matches a whitespace character such as a space, tab or line feed

Let’s show an example:

matched, err := regexp.MatchString(`\d`, "abc123")

Here matched is true because of the digits 123. Against “abc” there would be no match, as there are no digits in it.
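To try the character classes from the table, here is a minimal sketch. The helper hasMatch is a name made up for this example; it uses regexp.MustCompile, which panics on an invalid pattern and is therefore best kept for fixed patterns like these.

```go
package main

import (
	"fmt"
	"regexp"
)

// hasMatch reports whether pattern matches anywhere in text.
// It panics if the pattern is invalid.
func hasMatch(pattern, text string) bool {
	return regexp.MustCompile(pattern).MatchString(text)
}

func main() {
	fmt.Println(hasMatch(`\d`, "abc123")) // true: the text contains digits
	fmt.Println(hasMatch(`\d`, "abc"))    // false: no digits
	fmt.Println(hasMatch(`\D`, "123"))    // false: every character is a digit
	fmt.Println(hasMatch(`\w`, "_"))      // true: underscore counts as a word character
	fmt.Println(hasMatch(`\s`, "a b"))    // true: the space is whitespace
	fmt.Println(hasMatch(`.`, "\n"))      // false: . does not match a newline by default
}
```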

Repetition

If you want to express repetition, there are two characters of interest:
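The two characters are not listed here, so the sketch below assumes the common quantifiers + (one or more) and * (zero or more). The helper firstMatch is a name made up for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// firstMatch returns the leftmost match of pattern in text,
// or an empty string if there is no match.
func firstMatch(pattern, text string) string {
	return regexp.MustCompile(pattern).FindString(text)
}

func main() {
	fmt.Printf("%q\n", firstMatch(`\d+`, "abc123")) // "123": one or more digits
	fmt.Printf("%q\n", firstMatch(`ab*`, "abbbc"))  // "abbb": a followed by zero or more b
	fmt.Printf("%q\n", firstMatch(`ab*`, "ac"))     // "a": zero b is allowed
	fmt.Printf("%q\n", firstMatch(`ab+`, "ac"))     // "": at least one b is required
}
```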

Anchors and boundaries

There are different anchors you can use, for example:
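The anchors themselves are not listed here, so the sketch below assumes three that Go's regexp package supports: ^ (start of text), $ (end of text) and \b (word boundary). The helper matches is a name made up for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// matches reports whether pattern matches anywhere in text.
func matches(pattern, text string) bool {
	return regexp.MustCompile(pattern).MatchString(text)
}

func main() {
	checks := []struct{ pattern, text string }{
		{`^high`, "highlands"},     // true: text starts with high
		{`^land`, "highlands"},     // false: land is not at the start
		{`land$`, "Scotland"},      // true: text ends with land
		{`\bpart\b`, "a part of"},  // true: part stands alone as a word
		{`\bpart\b`, "apartment"},  // false: part is inside a longer word
	}
	for _, c := range checks {
		fmt.Printf("%-10s %-12q %v\n", c.pattern, c.text, matches(c.pattern, c.text))
	}
}
```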

Groups

Groups are a way to capture part of a string and have it returned. They are very useful for parsing out the info you need. Consider this example, which parses the values out of a comma-separated row:

Name: myarticle, Price: 114, Quantity: 3

To get the data you need, you want the word after each colon (:). You can construct a RegEx like so:

\w+:\s?(\w+)

What we are doing is capturing a group with parentheses (), but that group should come after:

All this ends up capturing myarticle, 114 and 3.
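As a sketch of how this looks in Go, FindAllStringSubmatch returns one slice per match, where index 0 holds the whole match and index 1 holds the first captured group. The function name values is made up for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// values returns the text captured by the group for every key: value pair in row.
func values(row string) []string {
	r := regexp.MustCompile(`\w+:\s?(\w+)`)
	var out []string
	for _, m := range r.FindAllStringSubmatch(row, -1) {
		// m[0] is the whole match, e.g. "Price: 114"; m[1] is the captured group.
		out = append(out, m[1])
	}
	return out
}

func main() {
	fmt.Println(values("Name: myarticle, Price: 114, Quantity: 3")) // [myarticle 114 3]
}
```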

Named groups

A named group is a captured group that has a name. Why would you want that? Well, say that you want to break down a URL into pieces and want to know what’s what. Given a URL “http://myapi.com/products?page=1”, you have:

So how can we break it apart and give it a name?

Well, to break it apart, we will use named groups, which allow us to look at our matches and know what’s what. So instead of getting:

http

We will get a key and value that says:

protocol: http

Syntax-wise, we put ?<name of our group> at the start of our parentheses (), like so:

(?<mygroup>\w+)

In Go, the traditional syntax has a P right after the question mark (Go 1.22 and later also accept the form above), so the code for this would be:

r, err := regexp.Compile(`(?P<mygroup>\w+):`)

Extract the data from a URL

Let’s approach this problem, then, given the string “http://myapi.com/products?page=1”:

Here’s what our Go code would look like:

r, err := regexp.Compile(`^(?P<protocol>\w+):\/\/(?P<domain>\w+\.\w+)\/(?P<route>\w+)\/?`)

OK, so we have the pattern. What about printing the parsed parts?

To pair the named groups with their values, we need to combine values from both the RegEx and the match result. First, we call FindStringSubmatch(), which gives us the values.

m := r.FindStringSubmatch("http://myapi.com/products")

Then, we need to match the names with these values. To do that, we call r.SubexpNames() and iterate over the names it returns.

result := make(map[string]string)
for i, name := range r.SubexpNames() {
	if i != 0 && name != "" {
		result[name] = m[i]
	}
}

Note this line where each name is assigned a value:

result[name] = m[i]

Finally, to get the values, we can print them out as they are now in a map structure:

fmt.Println(result["protocol"]) // http
fmt.Println(result["domain"]) // myapi.com
fmt.Println(result["route"]) // products

Assignment - create a Go program that parses a URL

From the above use case on named groups, write a Go program that takes a URL and analyzes it. It should work like so:

Type URL: http://myapi.com/products
The URL consists of:
protocol: http
domain: myapi.com
route: products

Solution

package main

import (
	"fmt"
	"log"
	"regexp"
)

func main() {
	var url string
	fmt.Print("Type URL: ")
	fmt.Scan(&url)

	// Named groups: protocol, domain and route.
	r, err := regexp.Compile(`^(?P<protocol>\w+):\/\/(?P<domain>\w+\.\w+)\/(?P<route>\w+)\/?`)
	if err != nil {
		log.Fatal("Error compiling: ", err)
	}
	m := r.FindStringSubmatch(url)
	if m == nil {
		log.Fatal("no match")
	}
	// Pair each group name with the value captured at the same index.
	result := make(map[string]string)
	for i, name := range r.SubexpNames() {
		if i != 0 && name != "" {
			result[name] = m[i]
		}
	}
	fmt.Println("The URL consists of:")
	fmt.Println("protocol:", result["protocol"])
	fmt.Println("domain:", result["domain"])
	fmt.Println("route:", result["route"])
}

Replacing

A common use case for RegEx is replacing one piece of text with another.

Go offers more than one method for this. One of them is ReplaceAllString(), which is a method on the compiled RegEx object:

r := regexp.MustCompile(`aa`)
s := r.ReplaceAllString("aabbcc", "cc") // s = ccbbcc

The above replaces all occurrences of aa with cc on the string aabbcc.

You can also use capture groups and refer to a captured group in the replacement string. Here’s an example:

r := regexp.MustCompile(`(\d+)`)
s := r.ReplaceAllString("productid:114", "0${1}") // s = productid:0114

In the above case, we replace 114 with itself, but we also prepend a 0 to it. Note the + in the pattern: with (\d) alone, each digit would be matched and prefixed separately, giving productid:010104.
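How much the group captures decides what ${1} refers to in the replacement. The following sketch compares a group that captures a whole run of digits with one that captures a single digit. The helper prependZero is a name made up for this example.

```go
package main

import (
	"fmt"
	"regexp"
)

// prependZero replaces every match of pattern in text with a 0
// followed by whatever the first capture group matched.
func prependZero(pattern, text string) string {
	return regexp.MustCompile(pattern).ReplaceAllString(text, "0${1}")
}

func main() {
	// (\d+) captures the whole number, so a single 0 is prepended.
	fmt.Println(prependZero(`(\d+)`, "productid:114")) // productid:0114

	// (\d) captures one digit at a time, so every digit gets its own 0.
	fmt.Println(prependZero(`(\d)`, "productid:114")) // productid:010104
}
```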

Use case: replace XML nodes

Imagine you are working with XML for example and want to rename all nodes with a certain name.

Here’s your XML:

<books>
    <book>
      <author>Shakespeare</author>
      <title>Romeo and Juliet</title>
      <pages>400</pages>
      <type>paperback</type>
      <cost>17</cost>
    </book>
    <book>
      <author>Shakespeare</author>
      <title>Hamlet</title>
      <pages>270</pages>
      <type>paperback</type>
      <cost>15</cost>
    </book>
</books>

Imagine title should be replaced by name. How do we do that?

Well, it would seem straightforward to just replace title with name. Let’s say we have this file content though:

<books>
    <book>
      <author>Shakespeare</author>
      <title>The title is Romeo and Juliet</title>
      <pages>400</pages>
      <type>paperback</type>
      <cost>17</cost>
    </book>

</books>

Then we would rename not only the element title to name; the content would also change to “The name is Romeo and Juliet”, and that’s NOT what we want.

We need to restrict the replace operation to target only the element tags, like so:

\<\/?(title)\>

The above would match, for example, <title> and </title>. However, if we replace each match with name on this XML, we only almost get what we want:

<title>Romeo and Juliet</title>

becomes

nameRomeo and Julietname

What happened? Why did we lose the <>? The replacement substitutes the whole match, angle brackets included. We need a way to keep what was there before AND replace the name. One way to do that is to put capture groups around the angle brackets and the element name, like so:

(\<\/?)(title)(\>)

Now we have three groups, and we fit the result together from the first and third group with the new name in between:

${1}name${3}
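Here is a small sketch that contrasts a plain text replacement with the grouped pattern. The helper renameTag is made up for this illustration and assumes tags without attributes. It writes the pattern without backslashes before <, / and >, which Go treats the same way since these characters have no special meaning in its RegEx syntax.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// renameTag renames <from> and </from> tags to <to> and </to>.
// Hypothetical helper; it assumes tags without attributes.
func renameTag(xml, from, to string) string {
	r := regexp.MustCompile(`(</?)(` + regexp.QuoteMeta(from) + `)(>)`)
	// ${1} and ${3} put the captured brackets back around the new name.
	return r.ReplaceAllString(xml, "${1}"+to+"${3}")
}

func main() {
	line := "<title>The title is Romeo and Juliet</title>"

	// A plain text replacement also changes the element content.
	fmt.Println(strings.ReplaceAll(line, "title", "name"))
	// <name>The name is Romeo and Juliet</name>

	// The grouped pattern only touches the tags.
	fmt.Println(renameTag(line, "title", "name"))
	// <name>The title is Romeo and Juliet</name>
}
```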

Assignment - replace content

Take the file books.xml containing:

<books>
    <book>
      <author>Shakespeare</author>
      <title>Romeo and Juliet</title>
      <pages>400</pages>
      <type>paperback</type>
      <cost>17</cost>
    </book>
    <book>
      <author>Shakespeare</author>
      <title>Hamlet</title>
      <pages>270</pages>
      <type>paperback</type>
      <cost>15</cost>
    </book>
</books>

and replace:

TIP: you might need to apply the replace twice.

Solution II

package main

import (
 "fmt"
 "regexp"
)

func main() {
 file := `<books>
    <book>
      <author>Shakespeare</author>
      <title>Romeo and Juliet</title>
      <pages>400</pages>
      <type>paperback</type>
      <cost>17</cost>
    </book>
    <book>
      <author>Shakespeare</author>
      <title>Hamlet</title>
      <pages>270</pages>
      <type>paperback</type>
      <cost>15</cost>
    </book>
</books>`

 r := regexp.MustCompile(`(\<\/?)(title)(\>)`)
 s := r.ReplaceAllString(file, "${1}name${3}")
 fmt.Println(s)

 r = regexp.MustCompile(`(\<\/?)(cost)(\>)`)
 s = r.ReplaceAllString(s, "${1}price${3}")
 fmt.Println(s)
}