Text Processing with Ruby: Extract Value from the Data That Surrounds You


Whatever you want to do with text, Ruby is up to the job. No matter what the source – web pages, databases, the contents of files – learn how to acquire the text and get it into your program. Explore techniques to process that text and then output the transformed or extracted text. Cut even the most complex text-based tasks down to size and learn how to master regular expressions, scrape information from Web pages, develop reusable utilities to process text in pipelines, and more.

Customer Reviews

It is rare that a programming language can be unequivocally stated to be the right
tool for a job. But when it comes to scanning, extracting, and transforming text,
Ruby is that tool, and Rob Miller is the right guide to instruct you in the most effective
and efficient application of it.

- Avdi Grimm

Author, "Confident Ruby;" Head Chef,

This is a fun, readable, and very useful book. I’d recommend it to anyone who
needs to deal with text—which is probably everyone.

- Paul Battley

Developer, maintainer of text gem

While Ruby has become established as a Web development language, thanks to
Rails, it’s an excellent language for working with text as well. Text Processing with
Ruby covers the nuts and bolts of what I believe is a natural domain for Ruby, all
the way from bringing text into the environment via files, the Web, and other
means through to parsing what it says and sending it back out again.

- Peter Cooper

Editor of "Ruby Weekly," Cooper Press

I’d recommend this book to anyone who wants to get started with text processing.
Ruby has powerful tools and libraries for the whole ETL workflow, and this book
describes everything you need to get started and succeed in learning.

- Hajba Gábor László


A lot of people get into Ruby via Rails. This book is really well suited to anyone
who knows Rails, but wants to know more Ruby.

- Drew Neil

Director, Studio Nelstrom, and author of "Practical Vim"

About this Title

Pages: 272
Published: 2015-09-25
Release: P1.0 (2015-09-29)
ISBN: 978-1-68050-070-7

Most information in the world is in text format, and programmers often find themselves needing to make sense of the data hiding within. You want to do this efficiently, avoiding labor-intensive, manual work—and Ruby is ideally suited to this task.

Text Processing with Ruby takes a practical approach to working with text:

  • First, Acquire: Explore Ruby’s core and standard library, and what’s possible with IO and its derived classes like File. Extract text into your Ruby programs from the file system and standard input. Process delimited files such as CSVs, and write utilities that interact with other programs in text-processing pipelines. Process web pages with Nokogiri to pull out information from even the messiest of HTML, and decipher character encoding mysteries.
  • Second, Transform: Use regular expressions to match, extract, and replace patterns in text. Write a parser using Ruby’s StringScanner library. Use Natural Language Processing techniques to extract keywords and implement fuzzy searching.
  • Finally, Load: Write the transformed text and data to standard output, files, and other processes. Serialize text into JSON, XML, and CSV, and use ERB to create more complex formats.

You’ll soon be able to tackle even the most enormous and entangled text with ease, scything through gigabytes of data and effortlessly extracting the bits that matter.
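
As a rough illustration of the Acquire, Transform, Load flow described above, here is a minimal sketch that uses only Ruby's standard library. It is not taken from the book: the sample log format, the `SAMPLE_LOG` constant, and the `extract_requests` helper are all hypothetical, invented for this example.

```ruby
require "json"
require "stringio"

# Hypothetical sample data: a tiny request log, including one malformed line.
SAMPLE_LOG = <<~TEXT
  GET /index.html 200
  GET /missing 404
  this line is not a request
  POST /login 200
TEXT

# Pattern for one well-formed line: verb, path, three-digit status.
REQUEST_LINE = /\A(?<verb>[A-Z]+) (?<path>\S+) (?<status>\d{3})\z/

# Acquire and transform: read any IO-like object line by line and
# turn each matching line into a hash. Non-matching lines are skipped.
def extract_requests(io)
  io.each_line.filter_map do |raw|
    match = REQUEST_LINE.match(raw.chomp)
    next unless match

    { verb: match[:verb], path: match[:path], status: match[:status].to_i }
  end
end

# StringIO stands in for an open File here; both respond to each_line.
records = extract_requests(StringIO.new(SAMPLE_LOG))

# Load: serialize the extracted records as JSON on standard output.
puts JSON.generate(records)
```

Because `extract_requests` only relies on `each_line`, the same method should work unchanged with `File.open` or `$stdin` in place of the `StringIO`.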

Top Five Text Processing Tips
by Rob Miller, author of Text Processing with Ruby

Clean up your data first
Data in the real world is messy. It almost always pays off to take some
time to normalize different sources of data and to get them into the
same format before you begin whatever actual processing you need to do.
You’ll have fewer exceptions and special cases in your code, and it’ll be
a lot more resilient.
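
One way to picture this kind of clean-up is a small normalizer that accepts dates in several styles and emits a single format. This is a sketch under assumptions: the three input formats and the `normalize_date` helper are hypothetical, chosen only to illustrate the idea.

```ruby
require "date"

# Hypothetical list of formats that the incoming data is assumed to use.
DATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%b %d, %Y"].freeze

# Try each known format in turn and return the date as an ISO 8601 string.
# Returns nil when none of the formats match.
def normalize_date(text)
  DATE_FORMATS.each do |format|
    begin
      return Date.strptime(text.strip, format).iso8601
    rescue ArgumentError
      next # this format did not match; try the next one
    end
  end
  nil
end

["2015-09-25", " 25/09/2015 ", "Sep 25, 2015", "not a date"].each do |raw|
  puts "#{raw.inspect} -> #{normalize_date(raw).inspect}"
end
```

Once every source passes through a step like this, the later processing code only ever has to deal with one date format, plus `nil` for values it could not understand.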

Master regular expressions
There are definitely some text processing problems that can’t be solved
with regular expressions, but not that many. While they’re not always
the best or most readable option, knowing regular expressions well will
get you out of many tight spots, and even more often than that will be
the first step towards a more robust solution.
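
For a flavor of what regular expressions can do in Ruby, the sketch below matches, extracts, and rewrites dates in a short piece of text. The helper names (`find_dates`, `first_year`, `to_day_first`) are hypothetical and exist only for this example.

```ruby
# A date pattern with named groups, so the parts can be referred to by name.
DATE = /(?<year>\d{4})-(?<month>\d{2})-(?<day>\d{2})/

# Extract every date in the text. A pattern without groups is used here
# so that String#scan returns whole matches rather than arrays of groups.
def find_dates(text)
  text.scan(/\d{4}-\d{2}-\d{2}/)
end

# Match once and read a named capture; returns nil if there is no date.
def first_year(text)
  match = DATE.match(text)
  match && match[:year].to_i
end

# Substitute: rewrite each date as day/month/year using named backreferences.
def to_day_first(text)
  text.gsub(DATE, '\k<day>/\k<month>/\k<year>')
end

sample = "Published: 2015-09-25\nRelease: P1.0 (2015-09-29)"
p find_dates(sample)
p first_year(sample)
puts to_day_first(sample)
```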

Break your problem into discrete steps
Almost all text processing tasks, no matter how complicated they seem on
the face of it, are really a series of small transformations. Figuring
out how to frame your problem in this way will make it easy to take a
pipeline approach, where your text flows through a series of small,
discrete steps, each of which transforms the data in a particular way and
then passes it on. Such programs are both easier to reason about and
easier to modify and extend.
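
The pipeline idea can be sketched in a few lines of Ruby by treating each step as a lambda and passing the data through them in order. The steps and the `run_pipeline` helper below are hypothetical, picked only to show the shape of the approach.

```ruby
# Each step takes an array of lines and returns a new array of lines.
STEPS = [
  ->(lines) { lines.map(&:strip) },     # trim surrounding whitespace
  ->(lines) { lines.reject(&:empty?) }, # drop blank lines
  ->(lines) { lines.map(&:downcase) },  # fold case
  ->(lines) { lines.uniq }              # remove duplicates
].freeze

# Feed the input through every step in order, passing each result on.
def run_pipeline(input, steps)
  steps.reduce(input) { |data, step| step.call(data) }
end

p run_pipeline(["  Ruby ", "", "ruby", "Text  "], STEPS)
# → ["ruby", "text"]
```

Adding, removing, or reordering a transformation is then a one-line change to `STEPS`, which is much of the appeal of structuring a program this way.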

Figure out a strategy for missing data
Data in the real world, as well as being messy, also frequently has gaps. Decide early on how you’re going to cope with that — how you’ll represent the absence of particular fields or properties — and you’ll
avoid messiness later on.
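
One possible strategy, sketched here under assumptions, is to represent every absent field as `nil` at the moment of parsing, so that blank and missing values look the same to later code. The tab-separated layout, the `FIELDS` list, and the `parse_row` helper are hypothetical.

```ruby
# Hypothetical column names for a tab-separated record.
FIELDS = %i[name email phone].freeze

# Parse one tab-separated line into a hash. Blank fields and fields
# that are missing from the end of the line both become nil.
def parse_row(line)
  values = line.chomp.split("\t", -1).map do |value|
    cleaned = value.strip
    cleaned.empty? ? nil : cleaned
  end
  FIELDS.zip(values).to_h # zip pads short rows with nil
end

p parse_row("Alice\t\t555-0100")
p parse_row("Bob")
```

The `-1` limit passed to `split` keeps trailing empty fields, so a row that ends in tabs is treated the same way as one with blanks in the middle.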

Make the most of existing tools
There are hundreds of command-line tools that exist solely to process
textual data. Each of them is capable of performing a particular
transformation, which means you don’t need to reinvent the wheel. If you
use existing tools for the parts of your problem that have already been
solved, all that remains is to solve the unique problem that you have.
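
As a small sketch of leaning on an existing tool, the example below hands text to the standard `sort` command through Ruby's `Open3` library instead of reimplementing it. It assumes a Unix-like system with `sort` on the `PATH`; the `unique_sorted` helper is hypothetical.

```ruby
require "open3"

# Pipe the text through `sort -u` and return its output.
# Assumes a Unix-like system where the `sort` command is available.
def unique_sorted(text)
  output, status = Open3.capture2("sort", "-u", stdin_data: text)
  raise "sort exited with status #{status.exitstatus}" unless status.success?

  output
end

puts unique_sorted("banana\napple\nbanana\n")
```

Passing the command and its arguments as separate strings avoids going through a shell, so the input text never needs to be escaped.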

What You Need

This book requires a passing familiarity with the Ruby programming language, and assumes that you already have Ruby installed on your computer.

Contents & Extracts


  • Extract: Acquiring Text
    • Reading from Files
      • Opening a File
      • Reading from a File
      • Treating Files as Streams
      • Reading Fixed-Width Files
      • Wrapping Up
    • Processing Standard Input
      • Redirecting Input from Other Processes
      • Example: Extracting URLs
      • Concurrency and Buffering
      • Wrapping Up
    • Shell One-liners (excerpt)
      • Arguments to the Ruby Interpreter
      • Prepending and Appending Code
      • Example: Parsing Log Files
      • Wrapping Up
    • Flexible Filters with ARGF
      • Reading from ARGF as a Stream
      • Modifying Files
      • Manipulating ARGV
      • Wrapping Up
    • Delimited Data
      • Parsing a TSV
      • Delimited Data and the Command Line
      • The CSV Format
      • Wrapping Up
    • Scraping HTML
      • The Right Tool for the Job: Nokogiri
      • Searching the Document
      • Working With Elements
      • Exploring a Page
      • Example: Reading a League Table
      • Wrapping Up
    • Encodings
      • A Brief Introduction to Character Encodings
      • Ruby’s Support for Character Encodings
      • Detecting Encodings
      • Wrapping Up
  • Transform: Modifying and Manipulating Text
    • Regular Expressions Basics
      • A Gentle Introduction
      • Pattern Syntax
      • Regular Expressions in Ruby
      • Wrapping Up
    • Extraction and Substitution with Regular Expressions
      • Matching Against Patterns
      • Global Match Variables
      • Extracting Multiple Matches
      • Transforming Text
      • Wrapping Up
    • Writing Parsers
      • Simple Parsers with StringScanner
      • Example: Parsing a Config File
      • Rule-Based Parsers
      • Example: Parsing RTF Files
      • Wrapping Up
    • Natural Language Processing (excerpt)
      • What is Natural Language Processing?
      • Example: Extracting Keywords from Articles
      • Example: Fuzzy Searching
      • Wrapping Up
  • Load: Writing Text
    • Standard Output and Standard Error
      • Simple Output
      • Formatting Output with printf
      • Redirecting Standard Output
      • Wrapping Up
    • Writing to Other Processes and to Files (excerpt)
      • Writing to Other Processes
      • Writing to Files
      • Temporary Files
      • Wrapping Up
    • Serialization and Structure: JSON, XML, CSV
      • JSON
      • XML
      • CSV
      • Wrapping Up
    • Templating Output with ERB
      • Writing Templates
      • Example: Generating a Purchase Ledger
      • Evaluating Templates
      • Passing Data to Templates
      • Controlling Presentation with Decorators
      • Wrapping Up
  • Appendices
    • A Shell Primer
      • Running Commands
      • Controlling Output
      • Exit Statuses and Flow Control
    • Useful Shell Commands


Rob Miller is Head of Digital for a London-based marketing consultancy. He spends his days merrily chewing through huge quantities of text in Ruby, turning raw data into meaningful analysis. He blogs at and tweets @robmil.