Hunter Liu's Website

7. Week 4 Tuesday: Characters and Strings


Imagine the following scenario: you are a secretary working for a large public school, and you realise that there’s an issue in how names are stored in the school’s database of students. Fortunately, everyone’s names are spelled correctly, but you realise there are errors when consolidating records. Consider the student John Old McDonald, who has a bunch of mismatched paperwork because of the following issues:

In addition, you may imagine a future student named Smith O'Niel enrolling at your school. The apostrophe may wreak havoc on how the computer sorts names in the system, and you’d like to have the system remember the name SMITH ONIEL instead.

Between report cards, immunisation records, disciplinary records, contact information, and whatever other paperwork schools keep track of, there are just too many students and too much information for you to crawl through yourself. However, you realise this is a process that can be automated, and you want to put your programming skills to use.

The above tasks are actually relatively common in various contexts. While more sophisticated tasks like correcting close misspellings are beyond the scope of PIC 10A, we can certainly develop some techniques for manipulating text by separating a string into several components, seeing if a string contains another string, and making minor modifications like capitalising certain characters and “blacking out” others.

Characters and the ASCII Table

A string variable represents a string of text, and that’s really just a bunch of char variables (i.e. characters) that have been grouped together. In order to learn how to manipulate strings, we ought to learn how to manipulate individual characters.

char variables, unlike string variables, are declared using single quotation marks, such as char c = 'C';. Perhaps surprisingly, we can also declare char variables using numbers. Consider the following code:

char c = 67;
int n = 67;
cout << c << " " << n << endl;

What do you think the output will be? If you run this on your computer, you should get the output C 67. We can even take it one step further:

char c = 'C';
if(c != 67) {
    cout << "Numbers and letters are different" << endl;
} else {
    cout << "Computers are illiterate" << endl;
}

What do you think the output is this time? This illuminates two important facts:

  1. The character 'C' and the integer 67 are the same thing. More broadly, all characters are just numbers to the computer.
  2. C++ interprets the value 67 (or the character 'C') differently depending on what type of variable it’s stored in.

Naturally, we should ask how the computer knows which numbers correspond to which letter and vice-versa. About 3 billion years ago, the leading prokaryotic cells of the time congregated and designed the American Standard Code for Information Interchange, or ASCII. This is a table describing which characters correspond to which numbers, and essentially every C++ compiler you are likely to use adheres to this code. You may look up an ASCII table online through Google, and you will not need to memorise any portion of the ASCII table for this class.

As with all numbers, we may add, subtract, and compare characters to each other. This is useful for a variety of reasons:

  1. The digits 0 through 9 occupy a contiguous block on the ASCII table. To see if a char variable c is a digit, we may use the code c >= '0' && c <= '9', as opposed to c == '0' || ... || c == '9'. We can do a similar trick for checking if c is a lower case letter or an upper case letter.
  2. To convert a character from lower case to upper case, we can serendipitously notice that 'a' has a value of 97 while 'A' has a value of 65; they differ by exactly 32. The other 25 letters obey the same rule! If c is a char variable that holds a lower case letter, then to convert it to upper case, one may perform c -= 32;. If the constant 32 is too hard to remember (I think it is), you can also do c = c - 'a' + 'A';. Try to squint at the table and see why this works!

Accessing Characters within Strings

Let’s go back to the start of this post. Imagine one has the line

string name = "John Old McDonald";

(More realistically, a user may input their name.) You may want to convert this name to have all upper-case characters. You know how to do this for a single character at a time, but what about a full string? A natural thing to do is to modify each character one at a time. Here’s some pseudocode that describes this approach:

  1. For every character in the name, repeat the following:
    1. Check if the character is a lower case letter.
    2. If it is, replace it with its upper case counterpart.
    3. If it isn’t, don’t do anything.

Okay, that was rather short. Let’s put this into code. Each character in the string has an “index”, which describes the position of the character within the string. In the example above, we have:

J  o  h  n     O  l  d     M  c  D  o  n  a  l  d 
0  1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16

To access the character occupying index i in the string, use the syntax name.at(i). You may use this syntax to retrieve and store a specific character, and you can also use it to replace a character at a specific index. For instance,

string name = "John Old McDonald";
name.at(0) = 'a';
char c = name.at(1);

cout << name << endl;
cout << c << endl;

prints aohn Old McDonald on one line and o on the next. The second statement replaces the character at index 0 (i.e., the J) with an a, and the third declares a char variable named c with the value of the character at index 1 (i.e., o).

Common Mistake 1. Indexing by 1 vs. indexing by 0

Notice that the indices start at 0 instead of 1 (we call this “indexing by 0”). That is, the first character has index 0, the fifth character has index 4, etc. This is the source of much pain. It’s very common to start loops at the second character instead of the first, and it’s also common for programs to crash when attempting to access the last character of a string.

We are now almost ready to convert this into code. We should write a for loop to do this: we will be performing some task on name.at(i), where i = 0, 1, ..., 16. But how do we know when to stop if the name is something that the user inputs?

We need a way to determine the length of the string. In this case, "John Old McDonald" has 17 characters, and we stop just before the index reaches the length. In general, we can use name.length() to find the length of the string. Thus, our indices are i = 0, 1, ..., name.length() - 1. In code,

string name = "John Old McDonald";
for(int i = 0; i < name.length(); i++) {
    // if the i-th character is lower case, add the quantity
    // 'A' - 'a' to convert it to upper case.
    if(name.at(i) >= 'a' && name.at(i) <= 'z') {
        name.at(i) += 'A' - 'a';
    }

    // otherwise, do nothing.
}

cout << name << endl;

You can replace name with any string, or even have a user input their own name, and this code will still convert it to all caps!

Some Other String Operations

There are some other operations that are above the level of character-by-character manipulations, and here are the ways we would perform them:

There are some other operations one can perform with strings, but they’re far too numerous for me to list here. If you’re curious about how to search for a substring of a string, how to remove spaces from either end of a string, etc., you may check the C++ reference, which contains documentation of every single function available for use on strings! Some key functions to know are find, rfind, getline, push_back, and pop_back.

Common Mistake 2. Concatenating String Literals

One may be tempted to write the code

string name = "John Old " +
              "McDonald";

if one has a particularly narrow screen. However, this code does not compile. When using + to join two strings, at least one of the two operands must be a string variable rather than a string literal, i.e. an explicit string of text enclosed by quotation marks.

Practice Problems

Strings can be difficult to work with, especially when incorporating skills from loops and control flow at the same time. Practice is key, and below are a few problems to hone your skills with. Make sure you know the difference between cin and getline! I won’t spell out the difference here, so make sure you read the documentation.

Problem 3. Name Formatting

Write a C++ program that allows the user to input their full name. Then, remove any characters that aren’t spaces or letters, and convert their name to all caps. Print out the fully capitalised name.

Sample Run: 
INPUT   John Smith
OUTPUT  JOHN SMITH

INPUT   John Old McDonald
OUTPUT  JOHN OLD MCDONALD

INPUT   John O'Niel McDonald, Sr. 
OUTPUT  JOHN ONIEL MCDONALD SR 

Problem 4. Digits

Write a C++ program that allows a user to input any line of text. Then, sum the value of every digit present in the inputted text, ignoring negative signs, and finally print out this sum. Print a zero if no digits are present.

Sample run: 
INPUT   1 2 3 4 5
OUTPUT  15

INPUT   1 23 ORUCGD4OEUCOH5
OUTPUT  15

INPUT   drcgoekbocuhaoelucraoeutnamkbbjkhc
OUTPUT  0

Problem 5. Phone Numbers

A phone number, for the purposes of this problem, contains three components: a country code, a 3-digit area code, and a 7 or 8-digit phone number. The country code is sometimes preceded by a ‘+’; the area code is sometimes surrounded by parentheses; the phone number itself is sometimes hyphenated. These may or may not be separated by spaces. The following are all examples of valid phone numbers:

+1 (800) 888-8888 
86 372 2649371 
1247736491
1(747)9914   416

Formatting aside, we consider two phone numbers to be the same if they consist of identical digits. So, +1 (800) 888-8888 and 18008888888 are the same phone number.

Write a C++ program that accepts two valid phone numbers as input (each on its own line). Determine if the two phone numbers are the same phone number or not.

Sample Run: 
INPUT   +1 (800) 888-8888
INPUT   18008888888
OUTPUT  Same 

INPUT   +1(800)  888-88-88
INPUT   1(747)991 4416
OUTPUT  Different

Problem 6. Card Game

In a standard deck of playing cards, each card has a suit and a face value. The suit can be Hearts, Spades, Clubs, or Diamonds. The face value can be an ace, a number between 2 and 10, a jack, queen, or king. Each card can be represented by two characters: a face value followed by a suit. The suit is represented by its first letter (H, S, C, or D). Aces, Tens, Jacks, Queens, and Kings are represented by their leading letters, and the numbers 2 through 9 are represented by their digits. For instance,

5H  Five of Hearts 
TC  Ten of Clubs
AS  Ace of Spades
KD  King of Diamonds

Two cards can be compared by their face values. Kings beat queens, which beat jacks, which beat tens. Next come the number cards from 9 down to 2, and the ace has the lowest value. The suits do not affect this.

Write a C++ program that allows a user to input two cards. Then, determine which card is larger and output the winning card, or decide if there is a tie.

Sample Run: 
INPUT   5H TC 
OUTPUT  TC

INPUT   AD KS
OUTPUT  KS

INPUT   JD JH
OUTPUT  Tie