Python File System
File System in Python
In Python 2, the file system was accessed using a variety of functions built into several modules: sys, os, glob, and shutil are some of the more common. The standard way to open a file was with the built-in function open(<str>).
a = open('myTest.txt')
print (type(a)) #<type 'file'>
This attempted to open the file named by the string and returned a file-object. File-objects could be read, written to and searched. However, it was not possible or convenient to rename, move, or copy a file through its file-object. Making matters worse, path naming conventions vary by operating system, meaning Python 2 programs written for Windows might not work on macOS, Android, etc.
Python 3 resolves this with the Path object in the pathlib module, introduced in 2014. Path-objects are operating system agnostic[1] and offer a host of methods to support a rich range of operations.
from pathlib import Path
a = Path('myTest.txt')
print (type(a)) #<class 'pathlib.PosixPath'> Linux (our OS) uses a posix file system
Path objects can represent files, directories or place-holders (paths where nothing exists yet). If they are files they can be opened for reading and writing. We will use Path-objects for file access in this course; in terms of what you can do with them, Path-objects are like a superset of File-objects.
There are various resources the Python interpreter, like the runtime of any computer language, must receive from the operating system. The most common are memory and input/output streams^[a file descriptor is a kind of IO stream].
The Python interpreter manages the memory your programs use. It occasionally runs a “garbage collector” which searches for objects that have been deleted or have fallen out of scope and tells the OS to reclaim the memory. This impacts performance, as the garbage collector uses up computer cycles even if there is nothing to collect.
Older languages (notably C) and high performance languages make memory management the programmer’s responsibility. Programmers allocate (malloc()) and free up memory (free()) as required. Well written programs of this type tend to be smaller (lower memory footprint) and faster in execution than programs written in managed memory languages like Python.
Most programming languages, including Python, leave managing open files up to the programmer. While a program may make thousands of memory requests a second, it will normally only have a handful of files open at a time. There is basically one rule:
Once A File is Opened it MUST be Closed
It can be hard to do this when File/Path-objects are attributes of a class or object, when they are passed as parameters, or when exceptions occur.
from pathlib import Path
some_path = Path('some_file.txt')
in_file = some_path.open('r')
for line in in_file:
    print(int(line))
in_file.close()
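Here we create a Path-object, open it for reading, print each line as an integer, and then close the file.
That is fine, as long as there is no error.
But what if a line in the file were ‘cat’? Certainly an exception would get thrown: ‘cat’ cannot be turned into an int.
The program would terminate without ever reaching in_file.close(), so the file handle is never released cleanly. In a long-running program that catches the exception somewhere else, the handle simply stays open, and the more file handles that are open, the slower the performance. When writing to files there are additional hazards, such as buffered data never reaching the file.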
To help with this, Python has a built-in context management statement: with .. as .. :
with <item> as <identifier>:
    with-body
from pathlib import Path
some_path = Path('some_file.txt')
with some_path.open('r') as in_file:
    for line in in_file:
        print(line)
Here we open the file using the with structure. with will ensure that .close() is run on in_file whenever the context (with block) is left. It will do this even if the reason for leaving the with-block is an unhandled exception.
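To make that guarantee concrete, a with-block behaves roughly like a try/finally block. The sketch below is only an illustration: the file name and contents are made up, and a temporary directory is used so nothing in the workspace is touched (write_text assumes Python 3.5 or newer).

```python
import tempfile
from pathlib import Path

error_seen = False
with tempfile.TemporaryDirectory() as tmp:
    # hypothetical input file containing one bad line
    some_path = Path(tmp) / 'some_file.txt'
    some_path.write_text('1\ncat\n')

    in_file = some_path.open('r')
    try:
        for line in in_file:
            print(int(line))      # raises ValueError on 'cat'
    except ValueError:
        error_seen = True
    finally:
        # runs whether or not an exception occurred,
        # just like leaving a with-block
        in_file.close()
    closed_after = in_file.closed
```

The with form is shorter and harder to get wrong, which is why it is preferred.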
The with statement runs a special enter-method (__enter__()) when the with-block starts and an exit-method (__exit__()) when it finishes. These methods come from the type (class) of the item you create. Here, the Path-object’s open() creates an io.TextIOWrapper, which is a kind of stream that has these special methods defined.
Writing these methods can require detailed knowledge of operating systems, network protocols and application specific “shutdown” procedures.
Imagine the complexity of shutting down a networked database session that uses multiple network connections for speed and redundancy. In this case, __exit__() might have to properly sequence database, network and parallel thread “shutdown” activities to ensure no corrupt data gets to the database and the network connections can be re-established.
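For a feel of the mechanics without any of that complexity, here is a toy sketch of the enter/exit protocol. The Announcer class is hypothetical and exists only for this illustration; it simply records the order in which things happen.

```python
class Announcer:
    """Toy context manager (hypothetical) that records when it is entered and exited."""

    def __init__(self):
        self.events = []

    def __enter__(self):
        self.events.append('enter')
        return self            # this is the object bound by `as`

    def __exit__(self, exc_type, exc_value, traceback):
        self.events.append('exit')
        return False           # False means: do not swallow the exception


announcer = Announcer()
try:
    with announcer as a:
        a.events.append('body')
        raise ValueError('cat')   # leave the block through an exception
except ValueError:
    pass

print(announcer.events)   # ['enter', 'body', 'exit']
```

Even though the block was left through an exception, the exit-method still ran.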
You should always wrap your file reads and writes in a with-statement.
[1] When Python is installed, it sets up the pathlib module for the installed OS, allowing the “generic” path constructs to access the file system.
As you should recall, a Linux (Posix) path can be represented by a string, starting from the root (/).
The terminal always starts in /home/codio/workspace and we have typically worked in /home/codio/workspace/python. When in the terminal, the present working directory (pwd) is the directory “level” you are currently at.
When you invoke the Python interpreter with the command python3, Python sets the current working directory (cwd) for the program to Linux’s pwd.
/home/codio/workspace$ python3 demo/demo.py
# cwd in program is set to /home/codio/workspace
/home/codio/workspace/demo$ python3 demo.py
# cwd in program is set to /home/codio/workspace/demo
To create a path object in Python, we use code similar to this:
from pathlib import Path
p_object = Path("/home/codio/workspace/demo/memo.txt")
In the code above, we can simply replace the string "/home/codio/workspace/demo/memo.txt" with any string to create the indicated path object. It will even accept relative paths, so Path("memo.txt") would create a path equivalent to <cwd>/memo.txt.
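Path objects have access to many attributes and methods. Some useful ones, all used later in this section, are name, absolute(), exists(), is_file(), open(), mkdir(), touch(), rename(), unlink() and rmdir().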
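As a small illustration, the sketch below inspects a few attributes that only look at the path string and never touch the file system. It uses PurePosixPath (an assumption made for this example) so the output is the same on any operating system; on Linux a regular Path gives the same results for these attributes.

```python
from pathlib import PurePosixPath

p = PurePosixPath("/home/codio/workspace/demo/memo.txt")

print(p.name)           # memo.txt
print(p.stem)           # memo
print(p.suffix)         # .txt
print(p.parent)         # /home/codio/workspace/demo
print(p.is_absolute())  # True
```

Methods such as exists() and is_file() are only available on Path itself, because they have to ask the operating system.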
Creating a Path-object does not interact with the file system. So a reasonable practice when trying to use or access a file is to create the path, then check if it exists. In this manner one can safely create new directories and files. Particularly with files, programming languages assume the programmer knows what they are doing. Most methods to write to a file will overwrite and destroy any existing file with the same name.
By first checking to see if the path exists, it is possible to choose a new, unused file name so as to not overwrite existing data.
This process of checking if a path exists and is a file before attempting to treat it as such is a type of “look before you leap” programming.
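One way this check could look in practice is sketched below. The helper unused_path and the file name report.txt are hypothetical, invented for this illustration; a temporary directory keeps the example from touching real files.

```python
import tempfile
from pathlib import Path


def unused_path(path):
    """Return `path` if nothing exists there, otherwise a numbered variant (hypothetical helper)."""
    candidate = path
    counter = 1
    while candidate.exists():
        # report.txt -> report_1.txt -> report_2.txt ...
        candidate = path.with_name('{}_{}{}'.format(path.stem, counter, path.suffix))
        counter += 1
    return candidate


with tempfile.TemporaryDirectory() as tmp:
    report = Path(tmp) / 'report.txt'
    first = unused_path(report).name    # nothing exists yet
    report.touch()                      # now report.txt is taken
    second = unused_path(report).name

print(first)    # report.txt
print(second)   # report_1.txt
```

Note that another program could still create the file between the check and the write, so this reduces the risk of overwriting data rather than eliminating it.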
In the file demo/demo.py you can see the use of some common Path features.
from pathlib import Path
p_memo = Path("memo.txt")
try:
    print (p_memo.name)
    print (p_memo.absolute())
    print (p_memo.exists())
except IOError as e:
    print(e)
It creates a Path-object for /home/codio/workspace/demo/memo.txt (when run from the demo directory) and prints out its name, absolute path and whether or not it exists.
Note the exception handling. While not all features of the Path-object interact with the file system, the Python documentation does not always state which attributes and methods can throw which exceptions^[Several languages, like Java, require module authors to list, by method or class, which exceptions can be raised. This simplifies the programmer’s task.]. Here, if there is any IO problem with accessing the object’s features, we simply catch it and print out the error.
In the examples below it is assumed that from pathlib import Path has been used.
We can also create either a file or directory based on a Path:
pathObject = Path("test")
# Create a directory
pathObject.mkdir()
# Create a file (by opening it)
pathObject.touch()
The mkdir() method will raise a FileExistsError if something already exists at that path. To create a file, .touch() creates an empty file. If the file already exists, its contents are left alone (only its modification time is updated).
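The behaviour described above can be checked with a short experiment. This is a sketch under a few assumptions: the names test and note.txt are arbitrary, everything happens inside a temporary directory, and write_text/read_text require Python 3.5 or newer.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / 'test'
    folder.mkdir()                      # first call creates the directory
    try:
        folder.mkdir()                  # second call finds it already there
        second_mkdir = 'ok'
    except FileExistsError:
        second_mkdir = 'FileExistsError'

    note = folder / 'note.txt'
    note.write_text('keep me')
    note.touch()                        # existing file: contents are untouched
    contents_after_touch = note.read_text()

print(second_mkdir)           # FileExistsError
print(contents_after_touch)   # keep me
```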
There are also methods we can use to copy or move an item from one path to another path:
destination = Path("/home/codio/workspace/file2.txt")
# Move (rename) a file
pathObject.rename(destination)
# Copy a file
# We need to add `import shutil` to the top of the script
shutil.copy2(str(pathObject), str(destination))
These work very similarly to the cp and mv commands we’ve already seen on the Linux terminal. To move a file, we can simply rename it. We’re even allowed to use the rename() method to change the path to the file as well.
To copy a file, we use the copy2() function from the shutil library. In Python 3.4, which is the version of Python currently supported in Codio, we need to convert both of our path objects to strings before we can use them with copy2(). However, in more recent versions of Python, we don’t have to do that step.
We can also delete an existing file or path:
# Delete a file
pathObject.unlink()
# Delete an empty directory
pathObject.rmdir()
The unlink() method is used to delete a single file. To remove a directory, we can use the rmdir() method instead. However, the directory must be empty, or else the method will raise an OSError.
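Putting the move, copy and delete operations together, a hedged end-to-end sketch might look like this. The directory and file names are invented for the example, and it runs inside a temporary directory (write_text assumes Python 3.5 or newer).

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    work = Path(tmp) / 'work'
    work.mkdir()
    original = work / 'file1.txt'
    original.write_text('hello\n')

    moved = work / 'file2.txt'
    original.rename(moved)                    # like mv file1.txt file2.txt
    copy = work / 'file3.txt'
    shutil.copy2(str(moved), str(copy))       # like cp file2.txt file3.txt

    after_move = sorted(p.name for p in work.iterdir())

    try:
        work.rmdir()                          # fails: directory is not empty
        rmdir_result = 'removed'
    except OSError:
        rmdir_result = 'OSError'

    moved.unlink()                            # delete both files...
    copy.unlink()
    work.rmdir()                              # ...now the directory can go
    still_there = work.exists()

print(after_move)     # ['file2.txt', 'file3.txt']
print(rmdir_result)   # OSError
print(still_there)    # False
```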
Let’s look at how we might take input from either a file or the keyboard. Our goal is that the program uses the first command line argument for input if it is a file. If no argument is given it should use the keyboard. If the first command line argument is not a file, or if an error occurs, the program should print “Error processing the first argument as a file”. A start might be:
import sys
import io
from pathlib import Path
...
def main(cls, args):
    if len(args) > 1:
        pass
        # some commands to handle the file reading
        # reader = ???
    else:
        reader = sys.stdin
According to the Python documentation, Path’s .open('r') opens the Path as a file for read-access. Our strategy will be to read the entire file into memory as a string, then convert that string into an io.StringIO stream so it has the same method calls as the keyboard stream.
if len(args) > 1:
    # make a Path-object
    in_path = Path(args[1])
    # make a File-object from the Path object
    try:
        if in_path.is_file():
            with in_path.open('r') as in_file:
                # store all the text in the file as a string
                data = "".join(in_file.readlines())
                # convert 'data' to a stream of text
                reader = io.StringIO(data)
        else:
            raise IOError()
    except IOError as e:
        print("Path is not a file")
        return
else:
    reader = sys.stdin
The line data = "".join(in_file.readlines()) reads all the lines from the file. in_file.readlines() creates a list of str to represent the data, and data = "".join(..) is a way to make them all one str. That str is then used to create an object of type io.StringIO; this object is in the same “family” as sys.stdin and they share most of the same method names. You can look up readlines() and str.join() if you want more information.
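The join-then-wrap step can be tried on its own, without any file. In this sketch the list lines is a made-up stand-in for what readlines() would return.

```python
import io

# stand-in for in_file.readlines(): one str per line, terminators included
lines = ["first\n", "second\n", "third\n"]

data = "".join(lines)          # one str holding the whole "file"
reader = io.StringIO(data)     # a stream with the same reading methods as sys.stdin

first = reader.readline()
rest = reader.readlines()

print(repr(first))   # 'first\n'
print(rest)          # ['second\n', 'third\n']
```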
Running the if/else snippet above has one of three outcomes:
- reader is a string-stream attached to the data that was in the input file (the file is closed by exiting the with statement)
- reader is the stream attached to standard input
- the program prints “Path is not a file” and returns, because the first argument was not a file
Of course, as we learned in an earlier chapter, we should also add some Try-Except statements to this code to prevent any exceptions from crashing the program. So, let’s do that now:
if len(args) > 1:
    # make a Path-object
    in_path = Path(args[1])
    # make a File-object from the Path object
    try:
        if in_path.is_file():
            with in_path.open('r') as in_file:
                # store all the text in the file as a string
                data = "".join(in_file.readlines())
                # convert 'data' to a stream of text
                reader = io.StringIO(data)
        else:
            print("Error processing the first argument as a file")
            return
    except Exception as e:
        print("Error processing the first argument as a file")
        return
else:
    reader = sys.stdin
We wrap all the Path and File commands in a single Try-Except block.
If there is a command line argument, the code uses the Path method is_file() to decide if it is a file and thus if it should try to open it for reading in the with statement. Note we print an error message and exit main() using a return statement if the path is not a file.
The exceptions which could be thrown are numerous, esoteric and potentially undocumented. Since we know the file exists, we should not normally see a FileNotFoundError. But we could still lose connection with a network drive or, if the file is huge, run out of memory trying to create the StringIO object. So let’s catch the generic Exception and exit the program.
An unfortunate compromise: in order to have the exact same code handle terminal or file input and accept a variable number of inputs, there has to be a way for a program to determine it is done getting data. We have chosen a blank line as this sentinel value. This means we cannot write programs that accept blank lines as valid input.
Fortunately, it is very rare to need a program that handles input from the terminal and from a file in exactly the same way. In well designed object oriented code, different classes, and thus different code, will generally be used to process input from different sources.
Data read via the readline() or readlines() methods includes the line terminator at the end of each line, a fancy way of saying the \n character on Linux.
In this course, it is best to use the string strip() method to eliminate leading and trailing whitespace characters. Again, we want you focused on the logic and syntax of programming, not hunting for missing “spaces”.
So to read and print every line you would write:
for line in reader:
    if line is None or len(line.strip()) == 0:
        break
    ...
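To see the blank-line sentinel in action, the loop can be run against an in-memory stream. This is only a sketch: the helper read_until_blank and the sample text are made up for the illustration.

```python
import io


def read_until_blank(reader):
    """Collect stripped lines from `reader` until a blank line or end of input (hypothetical helper)."""
    collected = []
    for line in reader:
        if len(line.strip()) == 0:
            break                      # blank line: the sentinel value
        collected.append(line.strip())
    return collected


# the blank third line stops the loop, so "30" is never read
reader = io.StringIO("10\n 20 \n\n30\n")
values = read_until_blank(reader)
print(values)   # ['10', '20']
```

The same function would work unchanged if reader were sys.stdin.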
This content is presented in the course directly through Codio. Any references to interactive portions are only relevant for that interface. This content is included here as reference only.
Beyond just reading data from files, we can also create our own files and write data directly to them.
In Python, we’ll use the .open() method we’ve been using to read from a file to also open a file for writing. So, here’s an example of what that might look like:
import sys
from pathlib import Path
import io
class Writer:
    @classmethod
    def main(cls, args):
        if len(args) > 1:
            path = Path(args[1])
            try:
                with path.open("w") as writer:
                    pass # do some writing
            except IOError as e:
                # unable to write to the file
                print("IOError: {}".format(e))
            except Exception as e:
                # unable to write to the file
                print("Exception: {}".format(e))
Let’s break this code down into smaller parts so we can understand how it works.
First, we check to see if a command line argument was provided, and if so we make a Path object using it. Next, we use a Try-Except statement to handle any exceptions that are raised when we use the file.
Then, we use a With statement to handle automatically closing our file once we are done with it. Otherwise, we’d need to add a finally block that includes writer.close() to make sure the file is closed properly. If we don’t do that step, there is a chance that our data may not get written to the file correctly.
We have the With statement:
with path.open("w") as writer:
This line creates an io object using Path’s open() method to open a file. We provide the argument "w", which stands for “write” and allows us to open the file for writing. If we don’t provide that argument, it will just open the file for reading. Then, we use the as keyword to assign it to the variable writer. This is the same use of the as keyword we saw when reading files.
It is important to note that, by default, if the file we are writing to already exists, it will be overwritten with the new output. If it doesn’t exist, it will be created. There are ways to open a file and append new data to it without overwriting the file, which we’ll discuss below.
Inside of the With statement, we can write data to the file using the write() method. That method can be used to write any string to the file. So, we can use this just like we would print() when writing output to the terminal. We can even use formatted strings!
Note that the write() method does not output a newline character by default each time it is used. This is different than print(), which always outputs a newline after each output unless we specify otherwise. So, we need to use \n each time we want to print a newline to the file.
Finally, there are several exception handlers at the end of the Try-Except statement. They handle the most common exceptions that can occur when opening and writing to a file, but may not account for all possible exceptions. So, we’ve also included a block to catch the generic Exception as well.
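Here is a hedged sketch of what the body of such a With statement could contain. The file name output.txt and the text written are made up, and the file lives in a temporary directory (read_text assumes Python 3.5 or newer).

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'output.txt'
    with path.open('w') as writer:
        writer.write("Hello World")                          # no newline added
        writer.write("\n")                                   # so we add one ourselves
        writer.write("{} + {} = {}\n".format(2, 3, 2 + 3))   # formatted string
        print("written with print", file=writer)             # print adds the newline for us
    written = path.read_text()

print(repr(written))   # 'Hello World\n2 + 3 = 5\nwritten with print\n'
```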
When opening a file, we can also give a set of options, known as modes in Python, to specify how we’d like to handle the file when it is opened. By default, when we use the open() method to open a file, it uses the following option:
r - open for reading
If we’d like to change those options, we can specify them when opening the file. For example, if we’d like to append to an existing file, we can use the following code in our With statement to open the file:
with path.open("a") as writer:
There are many other modes available in Python.
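The difference between a few of these modes can be seen in a short experiment. This is a sketch with a made-up file name, run in a temporary directory; "w", "a" and "x" are standard mode strings, and read_text assumes Python 3.5 or newer.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'log.txt'

    with path.open('w') as writer:       # creates the file
        writer.write("first\n")
    with path.open('w') as writer:       # "w" again: old contents are discarded
        writer.write("second\n")
    after_w = path.read_text()

    with path.open('a') as writer:       # "a": new data goes at the end
        writer.write("third\n")
    after_a = path.read_text()

    try:
        with path.open('x') as writer:   # "x": create only, refuse to overwrite
            pass
        x_result = 'created'
    except FileExistsError:
        x_result = 'FileExistsError'

print(repr(after_w))   # 'second\n'
print(repr(after_a))   # 'second\nthird\n'
print(x_result)        # FileExistsError
```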
When writing data to a file using a program, it is important to understand how the underlying operating system handles that process. In many cases, the operating system will store, or buffer the output in memory, then write the output directly to the file a bit later. This allows the operating system to tell our program that the write was successful while it waits for the storage device the file is actually stored on to respond. So, our programs appear to run very quickly.
However, at times we want to tell the operating system to write the data it has stored in memory directly to the file. To do that, we can use the flush() method of our file object to flush the buffer, or make sure the data is written to the file. Here’s an example:
writer.write("Hello World")
writer.write("\n")
writer.flush()
writer.write("More data")
writer.close()
Thankfully, the close() method will automatically write any buffered data to the file before closing it. So, we can either use the close() method ourselves, or use a With statement to make sure that the file is closed automatically for us.
We can even use sys.stdout.flush() to perform the same operation when printing output to the terminal. In most cases all of our output is printed directly to the terminal, but we can make sure that the output buffer is empty by using the sys.stdout.flush() method anytime.
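The effect of buffering can sometimes be observed directly by reading the file through a second handle before and after flush(). The sketch below is an experiment, not a guarantee: whether the first read sees the data depends on buffer sizes and the platform, so only the result after flush() is relied on. The file name is made up, and read_text assumes Python 3.5 or newer.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / 'buffered.txt'
    writer = path.open('w')
    try:
        writer.write("Hello World\n")
        before_flush = path.read_text()   # often still empty: data is sitting in the buffer
        writer.flush()
        after_flush = path.read_text()    # the data has now been handed to the OS
    finally:
        writer.close()

print(repr(before_flush))
print(repr(after_flush))   # 'Hello World\n'
```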
Here’s another quick example program to make sure we are able to write to files correctly. Feel free to adapt the code above to complete the exercise.
This video uses `open(str)` instead of `pathlib` Paths. Most online sources and examples will use the older style, as `pathlib` is fairly recent. The written text uses `pathlib`'s approach.
Now that we’ve seen how to handle working with files in Python, let’s go through an example program to see how we can apply that knowledge to a real program.
Here’s a problem statement we can use:
Write a driver program in Example.py that accepts three files as command line arguments. The first two represent input files, and the third one represents the desired output file. If there aren’t three arguments provided, either input file is not an existing file, or the output file is an existing directory, print “Invalid Arguments” and exit the program. The output file may be an existing file, since it will be overwritten.
The program should open each input file and read the contents. Each input file will consist of a list of whole numbers, one per line. If there are any errors parsing the contents of either file, the program should print “Invalid Input” and exit. As the input is read, the program should keep track of both the count and sum of all even inputs and odd inputs.
Once all input is read, the program should create the output file and print the following four items, in this order, one per line: number of even inputs, sum of even inputs, number of odd inputs, sum of odd inputs.
Finally, when the program is done, it should simply print “Complete” and exit. Don’t forget to close any open files!
So, let’s break down this problem statement and see if we can build a program to perform this action.
First sketch out the control flow, where and what kind of loops you want to use, what variables you’ll create and which modules you will need.
Above is an example sketch for this problem. It does not contain all the details but does give a guide to how the program might go. Notice how the flow chart abstracts away from the class structure; for an OOP driver program, that detail is unnecessary for understanding the program flow.
Let’s start programming by parsing the command line arguments. First, we’ll want to make sure there are exactly three arguments. This is probably best done using an If statement, since it makes the code a bit simpler to read than if we would use a Try-Except statement. However, either approach will work.
import sys
from pathlib import Path
class Example:
    @classmethod
    def main(cls, args):
        if len(args) != 4:
            print("Invalid Arguments")
            return # returning from main exits the program
if __name__ == "__main__":
    Example.main(sys.argv)
Don’t forget that sys.argv includes the name of the program, so we’ll really need to make sure that there are 4 arguments in total.
In general, you should check for detectable errors rather than handling them with try-except blocks.
For example, the code snippet above could be skipped and an IndexError caught later when access to a missing argument is attempted. But when exceptions are actually thrown, raising and handling them is many times slower than a regular conditional comparison.
Additionally, the logic of the If statements is clearer: a reader does not need to know which exceptions your code can throw and then scan the “except” blocks to see which ones are handled.
An exception to this “Look Before You Leap” approach is type checking a variable before using it; more on this in the methods chapter.
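The two styles can be compared side by side on the argument check. Both functions below are hypothetical helpers written for this illustration; they return the same result and differ only in how the missing-argument case is detected.

```python
def first_argument_lbyl(args):
    """Look before you leap: test the length first (hypothetical helper)."""
    if len(args) < 2:
        return None
    return args[1]


def first_argument_try(args):
    """Leap first: index the list and handle the IndexError (hypothetical helper)."""
    try:
        return args[1]
    except IndexError:
        return None


print(first_argument_lbyl(["Example.py"]))              # None
print(first_argument_try(["Example.py"]))               # None
print(first_argument_lbyl(["Example.py", "data.txt"]))  # data.txt
```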
Next, we’ll need to check and make sure that each of the first two arguments is a valid file that we can open. Since we intend to open them anyway, let’s just use a Try-Except statement containing a With statement:
in_file1 = Path(args[1])
in_file2 = Path(args[2])
out_file1 = Path(args[3])
try:
    if not in_file1.is_file() or not in_file2.is_file():
        print("Invalid Arguments")
        return
    with in_file1.open('r') as scanner1, in_file2.open('r') as scanner2, out_file1.open("w") as writer:
        # -=-=-=-=- MORE CODE GOES HERE -=-=-=-=-
        pass
except IOError as e:
    print("Invalid Arguments")
    return
# -=-=-=-=- MORE EXCEPTIONS GO HERE -=-=-=-=-
There are a few new things in this code that we haven’t seen before:
- We can open several files in a single With statement by separating them with a comma (,).
- Since we are not reading from sys.stdin, we can just open our files directly in the With statement. Actually, this makes the code very straightforward.
- The sys.argv variable is actually a list, so we can access additional command line arguments by simply using different list indices. We haven’t done that yet, but it should make sense.
Now that we’ve confirmed that we can open each file, we can start coding the program’s logic. For the rest of this example, we’ll look at a smaller portion of the code. That code can be placed where the MORE CODE GOES HERE comment is in the skeleton above. We’ll also need to handle a few more exceptions, which can be added where the MORE EXCEPTIONS GO HERE comment is above.
The program’s logic should be pretty straightforward. First, we’ll need to create loops to read input from each input file. We use two separate For loops; since we are dealing with two different input files that are unrelated, this is the simplest way to go.
Inside each loop, we can parse the input to an integer, and then determine if it is even or odd:
for line in scanner1:
    line = line.strip()
    input = int(line)
    if input % 2 == 0:
        # even
        pass
    else:
        # odd
        pass
for line in scanner2:
    line = line.strip()
    input = int(line)
    if input % 2 == 0:
        # even
        pass
    else:
        # odd
        pass
Finally, we can add a few state variables to keep track of how many of each type we’ve had, and their sum as well:
countEven = 0
countOdd = 0
sumEven = 0
sumOdd = 0
for line in scanner1:
    line = line.strip()
    input = int(line)
    if input % 2 == 0:
        countEven += 1
        sumEven += input
    else:
        countOdd += 1
        sumOdd += input
for line in scanner2:
    line = line.strip()
    input = int(line)
    if input % 2 == 0:
        countEven += 1
        sumEven += input
    else:
        countOdd += 1
        sumOdd += input
In the new code above, we are converting strings to integers, which could result in a ValueError. So, we’ll need to add one more except block to the Try-Except statement in the skeleton at the top of this page:
except ValueError as e:
    print("Invalid Input")
That will handle any problems with the input files themselves.
Finally, we can simply print our four variables to the output file:
countEven = 0
countOdd = 0
sumEven = 0
sumOdd = 0
for line in scanner1:
    line = line.strip()
    input = int(line)
    if input % 2 == 0:
        countEven += 1
        sumEven += input
    else:
        countOdd += 1
        sumOdd += input
for line in scanner2:
    line = line.strip()
    input = int(line)
    if input % 2 == 0:
        countEven += 1
        sumEven += input
    else:
        countOdd += 1
        sumOdd += input
writer.write("{}\n".format(countEven))
writer.write("{}\n".format(sumEven))
writer.write("{}\n".format(countOdd))
writer.write("{}\n".format(sumOdd))
print("Complete")
The four lines at the end should be pretty self-explanatory. We can simply print each number, but we’ll need to convert them to a string first. The simplest way to do that is to use the .format() method to create a formatted string. We’ll also need to remember to print a newline between each of them using the \n character. Of course, we also need to print Complete once we are finished!
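For reference, here is one way the pieces might be assembled into a single file. This is a sketch rather than the official solution, and it makes a few choices of its own that are not in the walkthrough: the repeated loop is moved into a hypothetical tally() helper, the totals are kept in a dictionary, and the output file is opened only after both inputs have been read, so bad input does not leave an empty output file behind. The demonstration at the bottom uses made-up files in a temporary directory (write_text and read_text assume Python 3.5 or newer).

```python
import tempfile
from pathlib import Path


class Example:
    @staticmethod
    def tally(scanner, totals):
        """Add every whole number in `scanner` to the running totals (hypothetical helper)."""
        for line in scanner:
            number = int(line.strip())              # ValueError on bad input
            kind = "even" if number % 2 == 0 else "odd"
            totals[kind + "_count"] += 1
            totals[kind + "_sum"] += number

    @classmethod
    def main(cls, args):
        if len(args) != 4:
            print("Invalid Arguments")
            return
        in_file1, in_file2, out_file1 = Path(args[1]), Path(args[2]), Path(args[3])
        if not in_file1.is_file() or not in_file2.is_file() or out_file1.is_dir():
            print("Invalid Arguments")
            return
        totals = {"even_count": 0, "even_sum": 0, "odd_count": 0, "odd_sum": 0}
        try:
            with in_file1.open("r") as scanner1, in_file2.open("r") as scanner2:
                cls.tally(scanner1, totals)
                cls.tally(scanner2, totals)
            with out_file1.open("w") as writer:
                for key in ("even_count", "even_sum", "odd_count", "odd_sum"):
                    writer.write("{}\n".format(totals[key]))
        except ValueError:
            print("Invalid Input")
            return
        except IOError:
            print("Invalid Arguments")
            return
        print("Complete")


# demonstration with made-up input files
with tempfile.TemporaryDirectory() as tmp:
    first, second, out = (Path(tmp) / n for n in ("a.txt", "b.txt", "out.txt"))
    first.write_text("1\n2\n3\n")
    second.write_text("4\n6\n")
    Example.main(["Example.py", str(first), str(second), str(out)])
    result = out.read_text()

print(repr(result))   # '3\n12\n2\n4\n'
```

In a real script the demonstration block would be replaced by the usual call to Example.main(sys.argv), with import sys added at the top.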
Once we’ve completed the code, we can use the button below to test it and see if it works. Don’t forget to open the example.pregrade.log.txt file that it creates to see detailed feedback from your program.