7. Week 4 Tuesday: Characters and Strings
Imagine the following scenario: you are a secretary working for a large public school, and you realise that there’s an issue in how names are stored in the school’s database of students. Fortunately, everyone’s names are spelled correctly, but you realise there are errors when consolidating records. Consider the student John Old McDonald, who has a bunch of mismatched paperwork because of the following issues:
- Separating first, last, and middle names: the system knows his name is "McDonald, John Old", but believes his first name is "John Old".
- Capitalisation: his name is stored as "McDonald, John Old", but other paperwork indicates his name as "MCDONALD, JOHN OLD" or "Mcdonald, John Old".
- Misspellings: some records have his name spelled as "MacDonald, John Old" instead of "McDonald, John Old", and you need to have these corrected.
In addition, you may imagine a future student named Smith O'Niel enrolling at your school. The apostrophe may wreak havoc on how the computer sorts names in the system, and you’d like to have the system remember the name SMITH ONIEL instead.
Between report cards, immunisation records, disciplinary records, contact information, and whatever other paperwork schools keep track of, there are just too many students and too much information for you to crawl through yourself. However, you realise this is a process that can be automated, and you want to put your programming skills to use.
The above tasks are actually relatively common in various contexts. While more sophisticated tasks like correcting close misspellings are beyond the scope of PIC 10A, we can certainly develop some techniques for manipulating text by separating a string into several components, seeing if a string contains another string, and making minor modifications like capitalising certain characters and “blacking out” others.
Characters and the ASCII Table
A string variable represents a string of text, and that’s really just a bunch of char variables (i.e. characters) that have been grouped together. In order to learn how to manipulate strings, we ought to learn how to manipulate individual characters.
char values, unlike string values, are written using single quotation marks, such as char c = 'C';. We can also initialise char variables using numbers. Consider the following code:
What do you think the output will be? If you run this on your computer, you should get the output C 67. We can even take it one step further:
char c = 'C';
if(c != 67) {
    cout << "Numbers and letters are different" << endl;
} else {
    cout << "Computers are illiterate" << endl;
}
What do you think the output is this time? This illuminates two important facts:
- The character 'C' and the integer 67 are the same thing. More broadly, all characters are just numbers to the computer.
- C++ interprets the value 67 (or the character 'C') differently depending on what type of variable it’s stored in.
Naturally, we should ask how the computer knows which numbers correspond to which letter and vice-versa. About 3 billion years ago, the leading prokaryotic cells of the time congregated and designed the American Standard Code for Information Interchange, or ASCII. This is a table describing which characters correspond to which numbers, and virtually every C++ environment you are likely to use adheres to this code. You may look up an ASCII table online through Google, and you will not need to memorise any portion of the ASCII table for this class.
As with all numbers, we may add, subtract, and compare characters to each other. This is useful for a variety of reasons:
- The digits 0 through 9 occupy a contiguous block on the ASCII table. To see if a char variable c is a digit, we may use the code c >= '0' && c <= '9', as opposed to c == '0' || ... || c == '9'. We can do a similar trick for checking if c is a lower case letter or an upper case letter.
- To convert a character from lower case to upper case, we can serendipitously notice that 'a' has a value of 97 while 'A' has a value of 65; they differ by exactly 32. The other 25 letters obey the same rule! If c is a char variable that holds a lower case letter, then to convert it to upper case, one may perform c -= 32;. If the constant 32 is too hard to remember (I think it is), you can also do c = c - 'a' + 'A';. Try to squint at the table and see why this works!
Accessing Characters within Strings
Let’s go back to the start of this post. Imagine one has the line
string name = "John Old McDonald";
(More realistically, a user may input their name.) You may want to convert this name to have all upper-case characters. You know how to do this for a single character at a time, but what about a full string? A natural thing to do is to modify each character one at a time. Here’s some pseudocode that describes this approach:
- For every character in the name, repeat the following:
  - Check if the character is a lower case letter.
  - If it is, replace it with its upper case counterpart.
  - If it isn’t, don’t do anything.
Okay, that was rather short. Let’s put this into code. Each character in the string has an “index”, and it describes the position of the character within the string. In the example above, we have:
Character:  J   o   h   n   ' ' O   l   d   ' ' M   c   D   o   n   a   l   d
Index:      0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15  16
Here ' ' marks a space, which occupies an index just like any other character.
To access the character occupying index i in the string, use the syntax name.at(i). You may use this syntax to retrieve and store a specific character, and you can also use it to replace a character at a specific index. For instance,
string name = "John Old McDonald";
name.at(0) = 'a';
char c = name.at(1);

cout << name << endl;
cout << c << endl;
prints aohn Old McDonald on one line and o on the next. The second line of the code replaces the character at index 0 (i.e., the J) with an a, and the third line declares a char variable named c with the value of the character at index 1 (i.e., o).
Common Mistake 1. Indexing by 1 vs. indexing by 0
Notice that the indices start at 0 instead of 1 (we call this “indexing by 0”). That is, the first character has index 0, the fifth character has index 4, etc. This is the source of much pain. It’s very common to start loops at the second character instead of the first, and it’s also common for programs to crash when attempting to access the last character of a string.
We are now almost ready to convert this into code. We should write a for loop to do this: we will be performing some task on name.at(i), where i = 0, 1, ..., 16. But how do we know when to stop if the name is something that the user inputs?
We need a way to determine the length of the string. In this case, "John Old McDonald" has 17 characters, and we stop just before the index reaches the length. In general, we can use name.length() to find the length of the string. Thus, our indices are i = 0, 1, ..., name.length() - 1. In code,
string name = "John Old McDonald";
for(int i = 0; i < name.length(); i++) {
    // if the i-th character is lower case, add the quantity
    // 'A' - 'a' to convert it to upper case.
    if(name.at(i) >= 'a' && name.at(i) <= 'z') {
        name.at(i) += 'A' - 'a';
    }

    // otherwise, do nothing.
}

cout << name << endl;
You can replace name with any string, or even have a user input their own name, and this code will still convert it to all caps!
Some Other String Operations
There are some other operations that are above the level of character-by-character manipulations, and here are the ways we would perform them:
- Joining two strings together: If string first = "John"; and string last = "McDonald";, we may want to put them into a single variable holding the full name. To do this, we would type string full_name = first + last;, and this (almost) does what you would expect. full_name now holds the string JohnMcDonald. To add a space between the names, we may perform first + " " + last.
- Getting a substring: If string full_name = "John Old McDonald"; and we want a variable only holding the first name, we may use the substr function. The syntax is full_name.substr([start], [length]). Thus, we could type string first = full_name.substr(0, 4);, which keeps the first four characters starting from index 0. You may omit the length, in which case the substring will run to the end of the string; for instance, string last = full_name.substr(9); keeps everything from index 9 and onwards, so last will hold "McDonald".
There are some other operations one can perform with strings, but they’re far too numerous for me to list here. If you’re curious about how to search for a substring of a string, how to remove spaces from either end of a string, etc., you may check the C++ reference, which contains documentation of every single function available for use on strings! Some key functions to know are find, rfind, getline, push_back, and pop_back.
Common Mistake 2. Concatenating String Literals
One may be tempted to split a long piece of text across several lines by joining two string literals with +, particularly if one has a narrow screen. This does not compile. When using + to join two strings, at least one of the two summands must be a string variable instead of a string literal, i.e. an explicit string of text enclosed by quotation marks.
Practice Problems
Strings can be difficult to work with, especially when incorporating skills from loops and control flow at the same time. Practice is key, and below are a few problems to hone your skills with. Make sure you know the difference between cin and getline! I won’t spell out the difference here, so make sure you read the documentation.
Problem 3. Name Formatting
Write a C++ program that allows the user to input their full name. Then, remove any characters that aren’t spaces or letters, and convert their name to all caps. Print out the fully capitalised name.
Sample Run:
INPUT John Smith
OUTPUT JOHN SMITH
INPUT John Old McDonald
OUTPUT JOHN OLD MCDONALD
INPUT John O'Niel McDonald, Sr.
OUTPUT JOHN ONIEL MCDONALD SR
Problem 4. Digits
Write a C++ program that allows a user to input any line of text. Then, sum the value of every digit present in the inputted text, ignoring negative signs, and finally print out this sum. Print a zero if no digits are present.
Sample Run:
INPUT 1 2 3 4 5
OUTPUT 15
INPUT 1 23 ORUCGD4OEUCOH5
OUTPUT 15
INPUT drcgoekbocuhaoelucraoeutnamkbbjkhc
OUTPUT 0
Problem 5. Phone Numbers
A phone number, for the purposes of this problem, contains three components: a country code, a 3-digit area code, and a 7- or 8-digit phone number. The country code is sometimes preceded by a ‘+’; the area code is sometimes surrounded by parentheses; the phone number itself is sometimes hyphenated. These may or may not be separated by spaces. The following are all examples of valid phone numbers:
+1 (800) 888-8888
86 372 2649371
1247736491
1(747)9914 416
Formatting aside, we consider two phone numbers to be the same if they consist of identical digits. So, +1 (800) 888-8888 and 18008888888 are the same phone number.
Write a C++ program that accepts two valid phone numbers as input (each on its own line). Determine if the two phone numbers are the same phone number or not.
Sample Run:
INPUT +1 (800) 888-8888
INPUT 18008888888
OUTPUT Same
INPUT +1(800) 888-88-88
INPUT 1(747)991 4416
OUTPUT Different
Problem 6. Card Game
In a standard deck of playing cards, each card has a suit and a face value. The suit can be Hearts, Spades, Clubs, or Diamonds. The face value can be an ace, a number between 2 and 10, a jack, queen, or king. Each card can be represented by two characters: a face value and a suit, where the suit is given by its first letter (H, S, C, or D). Aces, Tens, Jacks, Queens, and Kings are represented by their leading letters, and the numbers 2 through 9 are represented by their digits. For instance,
5H Five of Hearts
TC Ten of Clubs
AS Ace of Spades
KD King of Diamonds
Two cards can be compared by their face values. Kings beat queens, which beat jacks, which beat tens. Next come the number cards from 9 down to 2, arranged in the obvious order, and the ace has the lowest value. The suits do not affect this.
Write a C++ program that allows a user to input two cards. Then, determine which card is larger and output the winning card, or decide if there is a tie.
Sample Run:
INPUT 5H TC
OUTPUT TC
INPUT AD KS
OUTPUT KS
INPUT JD JH
OUTPUT Tie