Reading a Line in a File Python
How to extract specific portions of a text file using Python
Updated: 06/30/2020 by Computer Hope

Extracting text from a file is a common task in scripting and programming, and Python makes it easy. In this guide, we'll discuss some simple ways to extract text from a file using the Python 3 programming language.
Make sure you're using Python 3
In this guide, we'll be using Python version 3. Most systems come pre-installed with Python 2.7. While Python 2.7 is used in legacy code, Python 3 is the present and future of the Python language. Unless you have a specific reason to write or support Python 2, we recommend working in Python 3.
For Microsoft Windows, Python 3 can be downloaded from the Python official website. When installing, make sure the "Install launcher for all users" and "Add Python to PATH" options are both checked.
On Linux, you can install Python 3 with your package manager. For instance, on Debian or Ubuntu, you can install it with the following command:
sudo apt-get update && sudo apt-get install python3
For macOS, the Python 3 installer can be downloaded from python.org. If you are using the Homebrew package manager, it can also be installed by opening a terminal window (Applications → Utilities), and running this command:
brew install python3
Running Python
On Linux and macOS, the command to run the Python 3 interpreter is python3. On Windows, if you installed the launcher, the command is py. The commands on this page use python3; if you're on Windows, substitute py for python3 in all commands.
Running Python with no options starts the interactive interpreter. For more information about using the interpreter, see Python overview: using the Python interpreter. If you accidentally enter the interpreter, you can exit it using the command exit() or quit().
Running Python with a file name will interpret that Python program. For example:
python3 program.py
...runs the program contained in the file program.py.
Okay, how can we use Python to extract text from a text file?
Reading data from a text file
First, let's read a text file. Let's say we're working with a file named lorem.txt, which contains lines from the Lorem Ipsum example text.
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Nunc fringilla arcu congue metus aliquam mollis.
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
Quisque at dignissim lacus.
Note
In all the examples that follow, we work with the four lines of text contained in this file. Copy and paste the Latin text above into a text file, and save it as lorem.txt, so you can run the example code using this file as input.
A Python program can read a text file using the built-in open() function. For example, the Python 3 program below opens lorem.txt for reading in text mode, reads the contents into a string variable named contents, closes the file, and prints the data.
myfile = open("lorem.txt", "rt")  # open lorem.txt for reading text
contents = myfile.read()          # read the entire file to a string
myfile.close()                    # close the file
print(contents)                   # print string contents
Here, myfile is the name we give to our file object.
The "rt" parameter in the open() function means "we're opening this file to read text data."
The hash mark ("#") means that everything after it on that line is a comment, and it's ignored by the Python interpreter.
If you save this program in a file called read.py, you can run it with the following command.
python3 read.py
The command above outputs the contents of lorem.txt:
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Nunc fringilla arcu congue metus aliquam mollis.
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
Quisque at dignissim lacus.
Using "with open"
It's important to close your open files as soon as possible: open the file, perform your operation, and close it. Don't leave it open for extended periods of time.
When you're working with files, it's good practice to use the with open...as compound statement. It's the cleanest way to open a file, operate on it, and close the file, all in one easy-to-read block of code. The file is automatically closed when the code block completes.
Using with open...as, we can rewrite our program to look like this:
with open('lorem.txt', 'rt') as myfile:  # Open lorem.txt for reading text
    contents = myfile.read()             # Read the entire file to a string
print(contents)                          # Print the string
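To check for yourself that the file really is closed when the block ends, you can inspect the file object's closed attribute. The sketch below is only an illustration: it writes its own small file (sample.txt, a made-up name) to a temporary directory so it runs without lorem.txt.

```python
import os
import tempfile

# Create a small sample file in a temporary directory.
# (sample.txt is a made-up name used only for this illustration.)
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with open(path, "wt") as f:
    f.write("Lorem ipsum dolor sit amet.\n")

with open(path, "rt") as myfile:
    contents = myfile.read()
    closed_inside = myfile.closed   # False: the file is still open here

closed_after = myfile.closed        # True: leaving the block closed the file

print(closed_inside, closed_after)
```

The file is also closed if an exception is raised inside the block, which is one reason this form is generally preferred over calling close() by hand.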
Note
Indentation is important in Python. Python programs use white space at the beginning of a line to define scope, such as a block of code. We recommend you use four spaces per level of indentation, and that you use spaces rather than tabs. In the following examples, make sure your code is indented exactly as it's presented here.
Example
Save the program as read.py and execute it:
python3 read.py
Output:
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Nunc fringilla arcu congue metus aliquam mollis.
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
Quisque at dignissim lacus.
Reading text files line-by-line
In the examples so far, we've been reading in the whole file at once. Reading a full file is no big deal with small files, but generally speaking, it's not a great idea. For one thing, if your file is bigger than the amount of available memory, you'll encounter an error.
In almost every case, it's a better idea to read a text file one line at a time.
In Python, the file object is an iterator. An iterator is a type of Python object which behaves in certain ways when operated on repeatedly. For instance, you can use a for loop to operate on a file object repeatedly, and each time the same operation is performed, you'll receive a different, or "next," result.
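To see what "receiving the next result" means, here is a minimal sketch that steps through an iterator by hand with the built-in next() function. It uses io.StringIO as a stand-in for an open text file (an assumption made so the sketch needs no file on disk); a real file object opened in text mode behaves the same way.

```python
import io

# io.StringIO stands in for an open text file, so no file on disk is needed.
myfile = io.StringIO("first line\nsecond line\n")

first = next(myfile)     # each call returns the next line, newline included
second = next(myfile)

# Once the lines run out, next() raises StopIteration.
# A for loop catches this for you and simply stops.
try:
    next(myfile)
    exhausted = False
except StopIteration:
    exhausted = True

print(repr(first), repr(second), exhausted)
```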
Example
For text files, the file object iterates one line of text at a time. It considers one line of text a "unit" of data, so we can use a for...in loop statement to iterate one line at a time:
with open('lorem.txt', 'rt') as myfile:  # Open lorem.txt for reading
    for myline in myfile:                # For each line, read to a string,
        print(myline)                    # and print the string.
Output:
Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Nunc fringilla arcu congue metus aliquam mollis.

Mauris nec maximus purus. Maecenas sit amet pretium tellus.

Quisque at dignissim lacus.

Notice that we're getting an extra line break ("newline") after every line. That's because two newlines are being printed. The first one is the newline at the end of every line of our text file. The second newline happens because, by default, print() adds a line break of its own at the end of whatever you've asked it to print.
Let's store our lines of text in a variable (specifically, a list variable) so we can look at it more closely.
Storing text data in a variable
In Python, lists are similar to, but not the same as, an array in C or Java. A Python list contains indexed data, of varying lengths and types.
Example
mylines = []                             # Declare an empty list named mylines.
with open('lorem.txt', 'rt') as myfile:  # Open lorem.txt for reading text data.
    for myline in myfile:                # For each line, stored as myline,
        mylines.append(myline)           # add its contents to mylines.
print(mylines)                           # Print the list.
The output of this program is a little different. Instead of printing the contents of the list, this program prints our list object, which looks like this:
Output:
['Lorem ipsum dolor sit amet, consectetur adipiscing elit.\n', 'Nunc fringilla arcu congue metus aliquam mollis.\n', 'Mauris nec maximus purus. Maecenas sit amet pretium tellus.\n', 'Quisque at dignissim lacus.\n']
Here, we see the raw contents of the list. In its raw object form, a list is represented as a comma-delimited list. Here, each element is represented as a string, and each newline is represented as its escape character sequence, \n.
Much like a C or Java array, the list elements are accessed by specifying an index number after the variable name, in brackets. Index numbers start at zero. In other words, the nth element of a list has the numeric index n-1.
Note
If you're wondering why the index numbers start at zero instead of one, you're not alone. Computer scientists have debated the usefulness of zero-based numbering systems in the past. In 1982, Edsger Dijkstra gave his opinion on the subject, explaining why zero-based numbering is the best way to index data in computer science. You can read the memo yourself; he makes a compelling argument.
Example
We can print the first element of mylines by specifying index number 0, contained in brackets after the name of the list:
print(mylines[0])
Output:
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Example
Or the third line, by specifying index number 2:
print(mylines[2])
Output:
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
But if we try to access an index for which there is no value, we get an error. Our list has four elements, so the highest valid index is 3:
Example
print(mylines[4])
Output:
Traceback (most recent call last):
  File <filename>, line <linenum>, in <module>
    print(mylines[4])
IndexError: list index out of range
Example
A list object is iterable, so to print every element of the list, we can iterate over it with for...in:
mylines = []                             # Declare an empty list
with open('lorem.txt', 'rt') as myfile:  # Open lorem.txt for reading text.
    for line in myfile:                  # For each line of text,
        mylines.append(line)             # add that line to the list.
for element in mylines:                  # For each element in the list,
    print(element)                       # print it.
Output:
Lorem ipsum dolor sit amet, consectetur adipiscing elit.

Nunc fringilla arcu congue metus aliquam mollis.

Mauris nec maximus purus. Maecenas sit amet pretium tellus.

Quisque at dignissim lacus.

But we're still getting extra newlines. Each line of our text file ends in a newline character ('\n'), which is being printed. Also, after printing each line, print() adds a newline of its own, unless you tell it to do otherwise.
We can change this default behavior by specifying an end parameter in our print() call:
print(element, end='')
By setting end to an empty string (two single quotes, with no space), we tell print() to print nothing at the end of a line, instead of a newline character.
Example
Our revised program looks like this:
mylines = []                             # Declare an empty list
with open('lorem.txt', 'rt') as myfile:  # Open file lorem.txt
    for line in myfile:                  # For each line of text,
        mylines.append(line)             # add that line to the list.
for element in mylines:                  # For each element in the list,
    print(element, end='')               # print it without extra newlines.
Output:
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
Nunc fringilla arcu congue metus aliquam mollis.
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
Quisque at dignissim lacus.
The newlines you see here are actually in the file; they're a special character ('\n') at the end of each line. We want to get rid of these, so we don't have to worry about them while we process the file.
How to strip newlines
To remove the newlines completely, we can strip them. To strip a string is to remove one or more characters, usually whitespace, from either the beginning or end of the string.
Tip
This process is sometimes also called "trimming."
Python 3 string objects have a method called rstrip(), which strips characters from the right side of a string. The English language reads left-to-right, so stripping from the right side removes characters from the end.
If the variable is named mystring, we can strip its right side with mystring.rstrip(chars), where chars is a string of characters to strip. For example, "123abc".rstrip("bc") returns 123a.
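One detail that is easy to miss: rstrip(chars) treats chars as a set of characters to remove, not as a suffix to match. The short sketch below illustrates this with a few made-up strings.

```python
# rstrip(chars) treats chars as a set of characters to remove,
# not as a suffix to match.
results = [
    "123abc".rstrip("bc"),      # '123a'
    "123abc".rstrip("cb"),      # '123a': the order of chars does not matter
    "123abccbb".rstrip("bc"),   # '123a': every trailing b or c is removed
    "line\n\n".rstrip("\n"),    # 'line': all trailing newlines are removed
    "text \t\n".rstrip(),       # 'text': no argument strips all whitespace
]
for result in results:
    print(repr(result))
```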
Tip
When you represent a string in your program with its literal contents, it's called a string literal. In Python (as in most programming languages), string literals are always quoted: enclosed on either side by single (') or double (") quotes. In Python, single and double quotes are equivalent; you can use one or the other, as long as they match on both ends of the string. It's traditional to represent a human-readable string (such as Hello) in double quotes ("Hello"). If you're representing a single character (such as b), or a single special character such as the newline character (\n), it's traditional to use single quotes ('b', '\n'). For more information about how to use strings in Python, you can read the documentation of strings in Python.
The statement string.rstrip('\n') will strip any newline characters from the right side of string. The following version of our program strips the newlines when each line is read from the text file:
mylines = []                                 # Declare an empty list.
with open('lorem.txt', 'rt') as myfile:      # Open lorem.txt for reading text.
    for myline in myfile:                    # For each line in the file,
        mylines.append(myline.rstrip('\n'))  # strip newline and add to list.
for element in mylines:                      # For each element in the list,
    print(element)                           # print it.
The text is now stored in a list variable, so individual lines can be accessed by index number. Newlines were stripped, so we don't have to worry about them. We can always put them back later if we reconstruct the file and write it to disk.
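As a rough sketch of "putting them back," the example below writes a list of stripped lines to a new file, adding one newline per line. The output file name (out.txt) and the temporary directory are assumptions made so the example can run anywhere without touching your own files.

```python
import os
import tempfile

mylines = ["Lorem ipsum dolor sit amet.", "Quisque at dignissim lacus."]

# out.txt in a temporary directory is a made-up destination for this sketch.
outpath = os.path.join(tempfile.mkdtemp(), "out.txt")

with open(outpath, "wt") as outfile:
    for line in mylines:
        outfile.write(line + "\n")   # put the stripped newline back

with open(outpath, "rt") as infile:  # read the file back to check the result
    written = infile.read()

print(repr(written))
```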
Now, let's search the lines in the list for a specific substring.
Searching text for a substring
Let's say we want to locate every occurrence of a certain phrase, or even a single letter. For instance, maybe we need to know where every "e" is. We can accomplish this using the string's find() method.
The list stores each line of our text as a string object. All string objects have a method, find(), which locates the first occurrence of a substring in the string.
Let's use the find() method to search for the letter "e" in the first line of our text file, which is stored in the list mylines. The first element of mylines is a string object containing the first line of the text file. This string object has a find() method.
In the parentheses of find(), we specify parameters. The first and only required parameter is the string to search for, "e". The statement mylines[0].find("e") tells the interpreter to search forward, starting at the beginning of the string, one character at a time, until it finds the letter "e." When it finds one, it stops searching, and returns the index number where that "e" is located. If it reaches the end of the string, it returns -1 to indicate nothing was found.
Example
print(mylines[0].find("e"))
Output:
3
The return value "3" tells us that the letter "e" is the fourth character, the "e" in "Lorem". (Remember, the index is zero-based: index 0 is the first character, 1 is the second, etc.)
The find() method takes two optional, additional parameters: a start index and an end index, indicating where in the string the search should begin and end. For example, string.find("abc", 10, 20) searches for the substring "abc", but only from the 11th to the 20th character, because the end index itself is not included in the search. If end is not specified, find() starts at index start, and stops at the end of the string.
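The effect of the start and end parameters is easiest to see on a short string. The following sketch uses a made-up string, "abcabcabc", and shows in particular that a match has to fit entirely before the end index.

```python
s = "abcabcabc"

a = s.find("abc")          # no limits: the first occurrence, at index 0
b = s.find("abc", 1)       # start at index 1: the next occurrence, at index 3
c = s.find("abc", 1, 5)    # only "bcab" (indexes 1-4) is searched: -1
d = s.find("abc", 1, 6)    # "bcabc" (indexes 1-5) is searched: found at 3

print(a, b, c, d)
```

Note that the value returned is always an index into the whole string, not a position relative to start.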
Example
For example, the following statement searches for "e" in mylines[0], beginning at the fifth character.
print(mylines[0].find("e", 4))
Output:
24
In other words, starting at the fifth character in mylines[0], the first "e" is located at index 24 (the "e" in "amet").
Example
To start searching at index 10, and stop at index 30, in the third line of the file:
print(mylines[2].find("e", 10, 30))
Output:
28
(The first "e" in "Maecenas").
If find() doesn't locate the substring in the search range, it returns the number -1, indicating failure:
print(mylines[0].find("e", 25, 30))
Output:
-1
There were no "e" occurrences between indices 25 and 30.
Finding all occurrences of a substring
But what if we want to locate every occurrence of a substring, not just the first one we encounter? We can iterate over the string, starting from the index of the previous match.
In this example, we'll use a while loop to repeatedly find the letter "e". When an occurrence is found, we call find again, starting from a new location in the string. Specifically, the location of the last occurrence, plus the length of the substring (so we can move forward past the last one). When find returns -1, or the start index exceeds the length of the string, we stop.
# Build array of lines from file, strip newlines

mylines = []                                 # Declare an empty list.
with open('lorem.txt', 'rt') as myfile:      # Open lorem.txt for reading text.
    for myline in myfile:                    # For each line in the file,
        mylines.append(myline.rstrip('\n'))  # strip newline and add to list.

# Locate and print all occurrences of letter "e"

substr = "e"                      # substring to search for.
for line in mylines:              # string to be searched
    index = 0                     # current index: character being compared
    prev = 0                      # previous index: last character compared
    while index < len(line):      # While index has not exceeded string length,
        index = line.find(substr, index)  # set index to next occurrence of "e"
        if index == -1:           # If nothing was found,
            break                 # exit the while loop.
        print(" " * (index - prev) + substr, end='')  # print spaces from previous
                                                      # match, then the substring.
        prev = index + len(substr)    # remember this position for next loop.
        index += len(substr)      # increment the index by the length of substr.
                                  # (Repeat until index > line length)
    print('\n' + line)            # Print the original string under the e's
Output:
   e                    e       e  e               e
Lorem ipsum dolor sit amet, consectetur adipiscing elit.
                         e  e
Nunc fringilla arcu congue metus aliquam mollis.
        e                   e e          e    e      e
Mauris nec maximus purus. Maecenas sit amet pretium tellus.
      e
Quisque at dignissim lacus.
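If you only need the positions and not the printed markers, the same loop can be wrapped in a small function that returns a list of indexes. This is a sketch, not a built-in: find_all is a hypothetical helper name chosen for this example.

```python
def find_all(text, substr):
    """Return a list of every index where substr starts in text.

    find_all is a hypothetical helper name, not a built-in string method.
    Matches do not overlap, because each search resumes after the
    previous match.
    """
    indexes = []
    if not substr:               # an empty substring would never advance
        return indexes
    index = text.find(substr)
    while index != -1:
        indexes.append(index)
        index = text.find(substr, index + len(substr))
    return indexes

line = "Lorem ipsum dolor sit amet, consectetur adipiscing elit."
print(find_all(line, "e"))       # prints [3, 24, 32, 35, 51]
```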
Incorporating regular expressions
For complex searches, use regular expressions.
The Python regular expressions module is called re. To use it in your program, import the module before you use it:
import re
The re module implements regular expressions by compiling a search pattern into a pattern object. Methods of this object can then be used to perform match operations.
For example, let's say you want to search for any word in your document which starts with the letter d and ends in the letter r. We can accomplish this using the regular expression "\bd\w*r\b". What does this mean?
character sequence | meaning |
---|---|
\b | A word boundary. It matches the empty string (it consumes no characters), but only at the beginning or end of a word: between a word character and a non-word character, or at the start or end of the string. "Word characters" are the digits 0 through 9, the lowercase and uppercase letters, or an underscore ("_"). |
d | Lowercase letter d. |
\w* | \w represents any word character, and * is a quantifier meaning "zero or more of the previous character." So \w* will match zero or more word characters. |
r | Lowercase letter r. |
\b | Word boundary. |
So this regular expression will match any string that can be described as "a word boundary, then a lowercase 'd', then zero or more word characters, then a lowercase 'r', then a word boundary." Strings described this way include the words destroyer, dour, and doctor, and the abbreviation dr.
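To try the pattern out, here is a small sketch using re.findall(), which returns every non-overlapping match in a string as a list. The sample sentence is invented for the illustration.

```python
import re

# A made-up sentence containing some words that fit the pattern and some
# that do not.
text = "The doctor gave a dour look at the destroyer. Dr. Adder did not stir."

words = re.findall(r"\bd\w*r\b", text)
print(words)   # prints ['doctor', 'dour', 'destroyer']
```

"Dr" is skipped because the pattern asks for a lowercase d, and "Adder" is skipped because its d is in the middle of a word, so there is no word boundary in front of it.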
To use this regular expression in Python search operations, we first compile it into a pattern object. For instance, the following Python statement creates a pattern object named pattern which we can use to perform searches using that regular expression.
pattern = re.compile(r"\bd\w*r\b")
Note
The letter r before our string in the statement above is important. It tells Python to interpret our string as a raw string, exactly as we've typed it. If we didn't prefix the string with an r, Python would interpret escape sequences such as \b in other ways (in a normal string, \b is a backspace character). Whenever you need Python to interpret your strings literally, specify it as a raw string by prefixing it with r.
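The difference is easy to verify. In this short sketch, the same characters are typed with and without the r prefix, and then each version is used as a search pattern.

```python
import re

plain = "\b"    # one character: a backspace (character code 8)
raw = r"\b"     # two characters: a backslash followed by the letter b

print(len(plain), len(raw))     # prints 1 2

# Without the r prefix, the pattern contains backspace characters,
# so it does not behave as a word boundary.
without_r = re.search("\bdoctor\b", "Good morning, doctor.")
with_r = re.search(r"\bdoctor\b", "Good morning, doctor.")

print(without_r is None, with_r is not None)
```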
Now we can use the pattern object's methods, such as search(), to search a string for the compiled regular expression, looking for a match. If it finds one, it returns a special result called a match object. Otherwise, it returns None, a built-in Python constant that is used like the boolean value "false".
import re
text = "Good morning, doctor."
pat = re.compile(r"\bd\w*r\b")    # compile regex "\bd\w*r\b" to a pattern object
if pat.search(text) is not None:  # Search for the pattern. If found,
    print("Found it.")
Output:
Found it.
To perform a case-insensitive search, you can specify the special constant re.IGNORECASE in the compile step:
import re
text = "Hello, DoctoR."
pat = re.compile(r"\bd\w*r\b", re.IGNORECASE)  # upper and lowercase will match
if pat.search(text) is not None:
    print("Found it.")
Output:
Found it.
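The match object is useful for more than a yes-or-no test. As a brief sketch, its group(), start(), and end() methods report what was matched and where.

```python
import re

text = "Good morning, doctor."
pat = re.compile(r"\bd\w*r\b")
match = pat.search(text)

if match is not None:
    word = match.group()     # the text that matched
    start = match.start()    # index where the match begins
    end = match.end()        # index just past the end of the match
    print(word, start, end)  # prints doctor 14 20
```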
Putting it all together
So now we know how to open a file, read the lines into a list, and locate a substring in any given list element. Let's use this knowledge to build some example programs.
Print all lines containing substring
The program below reads a log file line by line. If the line contains the word "error," it is added to a list called errors. If not, it is ignored. The lower() string method converts all strings to lowercase for comparison purposes, making the search case-insensitive without altering the original strings.
Note that the find() method is called directly on the result of the lower() method; this is called method chaining. Also, note that in the errors.append() call, we construct an output string by joining several strings with the + operator.
errors = []                       # The list where we will store results.
linenum = 0
substr = "error".lower()          # Substring to search for.
with open('logfile.txt', 'rt') as myfile:
    for line in myfile:
        linenum += 1
        if line.lower().find(substr) != -1:    # if case-insensitive match,
            errors.append("Line " + str(linenum) + ": " + line.rstrip('\n'))
for err in errors:
    print(err)
Input (stored in logfile.txt):
This is line 1
This is line 2
Line 3 has an error!
This is line 4
Line 5 also has an error!
Output:
Line 3: Line 3 has an error!
Line 5: Line 5 also has an error!
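The same idea can be written a little more compactly. The sketch below is an alternative, not a replacement: it uses the built-in enumerate() function to count lines and the in operator to test for the substring. To keep it self-contained, io.StringIO stands in for logfile.txt, and the last sample line is capitalized to show the case-insensitive test.

```python
import io

# io.StringIO stands in for logfile.txt so the sketch runs on its own.
logfile = io.StringIO(
    "This is line 1\n"
    "This is line 2\n"
    "Line 3 has an error!\n"
    "This is line 4\n"
    "Line 5 also has an ERROR!\n"
)

errors = []
for linenum, line in enumerate(logfile, start=1):   # count lines from 1
    if "error" in line.lower():                     # case-insensitive test
        errors.append("Line " + str(linenum) + ": " + line.rstrip('\n'))

for err in errors:
    print(err)
```

The in operator is enough when you only need to know whether the substring is present; find() is the better fit when you also need its position.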
Extract all lines containing substring, using regex
The program below is similar to the above program, but uses the re regular expressions module. The errors and line numbers are stored as tuples, e.g., (linenum, line). The tuple is created by the additional enclosing parentheses in the errors.append() statement. The elements of the tuple are referenced similar to a list, with a zero-based index in brackets. As constructed here, err[0] is a linenum and err[1] is the associated line containing an error.
import re
errors = []
linenum = 0
pattern = re.compile("error", re.IGNORECASE)  # Compile a case-insensitive regex
with open('logfile.txt', 'rt') as myfile:
    for line in myfile:
        linenum += 1
        if pattern.search(line) is not None:  # If a match is found
            errors.append((linenum, line.rstrip('\n')))
for err in errors:                            # Iterate over the list of tuples
    print("Line " + str(err[0]) + ": " + err[1])
Output:
Line 6: Mar 28 09:10:37 Error: cannot contact server. Connection refused.
Line 10: Mar 28 10:28:15 Kernel error: The specified location is not mounted.
Line 14: Mar 28 11:06:30 ERROR: usb 1-1: can't set config, exiting.
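Because each element of errors is a two-item tuple, the for loop can also unpack it into two names instead of indexing with err[0] and err[1]. Here is a small sketch of both styles; the sample tuples are invented for the illustration.

```python
# Sample data in the same (linenum, line) shape used above; the log
# lines themselves are invented for this illustration.
errors = [
    (6, "Error: cannot contact server."),
    (10, "Kernel error: location not mounted."),
]

report = []
for linenum, line in errors:          # unpack each tuple into two names
    report.append("Line " + str(linenum) + ": " + line)

first = errors[0]
print(first[0], first[1])             # index access works as with a list
print("\n".join(report))
```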
Extract all lines containing a phone number
The program below prints any line of a text file, info.txt, which contains a US or international phone number. It accomplishes this with the regular expression "(\+\d{1,2})?[\s.-]?\d{3}[\s.-]?\d{4}". This regex matches the following phone number notations:
- 123-456-7890
- (123) 456-7890
- 123 456 7890
- 123.456.7890
- +91 (123) 456-7890
import re
errors = []
linenum = 0
pattern = re.compile(r"(\+\d{1,2})?[\s.-]?\d{3}[\s.-]?\d{4}")
with open('info.txt', 'rt') as myfile:
    for line in myfile:
        linenum += 1
        if pattern.search(line) is not None:  # If pattern search finds a match,
            errors.append((linenum, line.rstrip('\n')))
for err in errors:
    print("Line", str(err[0]), ": " + err[1])
Output:
Line 3 : My phone number is 731.215.8881.
Line 7 : You can reach Mr. Walters at (212) 558-3131.
Line 12 : His agent, Mrs. Kennedy, can be reached at +12 (123) 456-7890
Line 14 : She can also be contacted at (888) 312.8403, extension 12.
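It's worth testing a regular expression like this against known inputs before trusting it on real data. The sketch below runs the pattern against the notations listed above, plus a few invented strings. One thing it reveals: search() is satisfied by the last seven digits of a number, so any run of seven digits counts as a match, whether or not it is a phone number.

```python
import re

pattern = re.compile(r"(\+\d{1,2})?[\s.-]?\d{3}[\s.-]?\d{4}")

samples = [
    "123-456-7890",
    "(123) 456-7890",
    "123 456 7890",
    "123.456.7890",
    "+91 (123) 456-7890",
    "no digits here",       # invented: should not match
    "room 12, floor 3",     # invented: too few digits to match
    "order 1234567",        # invented: seven digits, so it does match
]

results = {s: pattern.search(s) is not None for s in samples}
for sample, found in results.items():
    print(repr(sample), found)
```

If false positives like the last sample matter for your data, the pattern would need to be made stricter, for example by requiring the area code.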
Search a dictionary for words
The program below searches the dictionary for any words that start with h and end in pe. For input, it uses a dictionary file included on many Unix systems, /usr/share/dict/words.
import re
filename = "/usr/share/dict/words"
pattern = re.compile(r"\bh\w*pe$", re.IGNORECASE)
with open(filename, "rt") as myfile:
    for line in myfile:
        if pattern.search(line) is not None:
            print(line, end='')
Output:
Hope
heliotrope
hope
hornpipe
horoscope
hype
Source: https://www.computerhope.com/issues/ch001721.htm