Read in Last Sheet in Excel File Pandas
Introduction
With pandas it is easy to read Excel files and convert the data into a DataFrame. Unfortunately, Excel files in the real world are often poorly constructed. In those cases where the data is scattered across the worksheet, you may need to customize the way you read the data. This article will discuss how to use pandas and openpyxl to read these types of Excel files and cleanly convert the data to a DataFrame suitable for further analysis.
The Problem
The pandas read_excel function does an excellent job of reading Excel worksheets. However, in cases where the data is not a continuous table starting at cell A1, the results may not be what you expect.
If you try to read in this sample spreadsheet using read_excel(src_file), the results include a lot of Unnamed columns, header labels within a row, and several extra columns we don't need.
Pandas Solutions
The simplest solution for this data set is to use the header and usecols arguments to read_excel(). The usecols parameter, in particular, can be very useful for controlling the columns you would like to include.
If you would like to follow along with these examples, the file is on GitHub.
Here is one approach to read only the data we need.
    import pandas as pd
    from pathlib import Path

    src_file = Path.cwd() / 'shipping_tables.xlsx'
    df = pd.read_excel(src_file, header=1, usecols='B:F')
The resulting DataFrame only contains the data we need. In this example, we purposely exclude the notes column and date field.
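To try this without the sample file, here is a minimal sketch that builds a small workbook in memory and reads it both ways. The cell values and the three column names are made up for illustration and are not the contents of shipping_tables.xlsx; it assumes pandas and openpyxl are installed.

```python
from io import BytesIO

import pandas as pd
from openpyxl import Workbook

# Made-up worksheet whose table starts at B2 instead of A1.
wb = Workbook()
ws = wb.active
rows = [
    ("item_type", "state", "notes"),
    ("widget", "TX", "rush"),
    ("gadget", "CA", "fragile"),
]
for r, row in enumerate(rows, start=2):        # start on Excel row 2
    for c, value in enumerate(row, start=2):   # start on Excel column B
        ws.cell(row=r, column=c, value=value)

# Save to an in-memory buffer so no file is written to disk.
buffer = BytesIO()
wb.save(buffer)

# Default read: Excel row 1 is empty, so pandas finds no usable header labels.
buffer.seek(0)
raw = pd.read_excel(buffer)

# Targeted read: the header is on Excel row 2 (index 1); keep columns B and C.
buffer.seek(0)
clean = pd.read_excel(buffer, header=1, usecols="B:C")

print(list(raw.columns))
print(clean)
```

The first read shows placeholder column labels, while the second returns only the two requested columns with their real names.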
The logic is relatively straightforward. usecols can accept Excel ranges such as B:F and read in only those columns. The header parameter expects a single integer that defines the header row. This value is 0-indexed, so we pass in 1 even though this is row 2 in Excel.
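The relationship between Excel's letters and row numbers and the 0-based values pandas works with can be sketched in plain Python. The helper names below are hypothetical and only illustrate the arithmetic; they are not part of pandas.

```python
def column_letter_to_index(letter: str) -> int:
    """Convert an Excel column letter such as 'B' or 'AA' to a 0-based index."""
    index = 0
    for char in letter.upper():
        # Treat the letters as base-26 digits where A=1 ... Z=26.
        index = index * 26 + (ord(char) - ord("A") + 1)
    return index - 1


def excel_range_to_indexes(col_range: str) -> list[int]:
    """Expand a range such as 'B:F' into the 0-based positions it covers."""
    start, end = col_range.split(":")
    first = column_letter_to_index(start)
    last = column_letter_to_index(end)
    return list(range(first, last + 1))  # Excel ranges include both ends


def excel_row_to_header(row_number: int) -> int:
    """Convert a 1-based Excel row number to the 0-based header value."""
    return row_number - 1


print(excel_range_to_indexes("B:F"))  # [1, 2, 3, 4, 5]
print(excel_row_to_header(2))         # 1
```

This is why the range B:F and the integer list 1 through 5 select the same columns.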
In some instances, we may want to define the columns as a list of numbers. In this example, we could define the list of integers:
    df = pd.read_excel(src_file, header=1, usecols=[1, 2, 3, 4, 5])
This approach might be useful if you have some sort of numerical pattern you want to follow for a large data set (e.g. every 3rd column or only even numbered columns).
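Such patterns are easy to generate with range() rather than typing the list by hand. This is a small sketch; n_columns is a hypothetical sheet width, and the positions are 0-based as pandas expects.

```python
# Hypothetical sheet width; in practice this depends on the real file.
n_columns = 12

# Every 3rd column, starting from the first one (0-based positions).
every_third = list(range(0, n_columns, 3))

# Columns at even 0-based positions (Excel columns A, C, E, ...).
even_positions = [i for i in range(n_columns) if i % 2 == 0]

print(every_third)     # [0, 3, 6, 9]
print(even_positions)  # [0, 2, 4, 6, 8, 10]

# Either list could then be passed to read_excel, for example:
# df = pd.read_excel(src_file, header=1, usecols=every_third)
```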
The pandas usecols can also take a list of column names. This code will create an equivalent DataFrame:
    df = pd.read_excel(src_file, header=1,
                       usecols=['item_type', 'order id', 'order date', 'state', 'priority'])
Using a list of named columns is going to be helpful if the column order changes but you know the names will not change.
Finally, usecols can take a callable function. Here's a simple long-form example that excludes unnamed columns as well as the priority column.
    # Define a more complex function:
    def column_check(x):
        if 'unnamed' in x.lower():
            return False
        if 'priority' in x.lower():
            return False
        if 'order' in x.lower():
            return True
        return True

    df = pd.read_excel(src_file, header=1, usecols=column_check)
The key concept to keep in mind is that the function will parse each column by name and must return True or False for each column. Those columns that evaluate to True will be included.
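Because the callable only receives column names, it can be checked on a plain list before it is handed to read_excel. In this sketch, sample_headers is an assumed set of header labels, not the actual headers of the sample file.

```python
def column_check(name: str) -> bool:
    """Return True for columns to keep, False for columns to drop."""
    lowered = name.lower()
    if "unnamed" in lowered:
        return False
    if "priority" in lowered:
        return False
    return True


# Assumed header labels, for illustration only.
sample_headers = [
    "Unnamed: 0", "item_type", "order id",
    "order date", "state", "priority", "notes",
]

# Apply the same test read_excel would apply to each column name.
kept = [name for name in sample_headers if column_check(name)]
print(kept)  # ['item_type', 'order id', 'order date', 'state', 'notes']
```

Testing the function this way makes it easier to spot a rule that keeps or drops more than intended.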
Another approach to using a callable is to include a lambda expression. Here is an example where we want to include only a defined list of columns. We normalize the names by converting them to lower case for comparison purposes.
    cols_to_use = ['item_type', 'order id', 'order date', 'state', 'priority']
    df = pd.read_excel(src_file, header=1,
                       usecols=lambda x: x.lower() in cols_to_use)
Callable functions give us a lot of flexibility for dealing with the real world messiness of Excel files.
Ranges and Tables
In some cases, the data could be even more obfuscated in Excel. In this example, we have a table called ship_cost that we want to read. If you must work with a file like this, it might be challenging to read in with the pandas options we have discussed so far.
In this case, we can use openpyxl directly to parse the file and convert the data into a pandas DataFrame. The fact that the data is in an Excel table can make this process a little easier.
Here's how to use openpyxl (once it is installed) to read the Excel file:
    from openpyxl import load_workbook
    import pandas as pd
    from pathlib import Path

    src_file = Path.cwd() / 'shipping_tables.xlsx'
    wb = load_workbook(filename=src_file)
This loads the whole workbook. If we want to see all the sheets:
    wb.sheetnames
    ['sales', 'shipping_rates']
To access the specific sheet:
    sheet = wb['shipping_rates']
To see a list of all the named tables:
    sheet.tables.keys()
    dict_keys(['ship_cost'])
This key corresponds to the name we assigned in Excel to the table. Now we access the table to get the equivalent Excel range:
    lookup_table = sheet.tables['ship_cost']
    lookup_table.ref
    'C8:E16'
This worked. We now know the range of data we want to load. The final step is to convert that range to a pandas DataFrame. Here is a short code snippet to loop through each row and convert to a DataFrame:
    # Access the data in the table range
    data = sheet[lookup_table.ref]
    rows_list = []

    # Loop through each row and get the values in the cells
    for row in data:
        # Get a list of all columns in each row
        cols = []
        for col in row:
            cols.append(col.value)
        rows_list.append(cols)

    # Create a pandas DataFrame from the rows_list.
    # The first row is the column names
    df = pd.DataFrame(data=rows_list[1:], index=None, columns=rows_list[0])
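The same steps can be wrapped in a small reusable function. The sketch below is one possible variation, not the article's code: table_to_dataframe is a hypothetical helper, it uses openpyxl's range_boundaries and iter_rows(values_only=True) instead of the nested loop, and the workbook is built in memory with made-up values so the example runs without the sample file.

```python
import pandas as pd
from openpyxl import Workbook
from openpyxl.utils import range_boundaries
from openpyxl.worksheet.table import Table


def table_to_dataframe(sheet, table_name):
    """Return the named Excel table on `sheet` as a DataFrame (hypothetical helper)."""
    ref = sheet.tables[table_name].ref
    # range_boundaries returns 1-based (min_col, min_row, max_col, max_row).
    min_col, min_row, max_col, max_row = range_boundaries(ref)
    rows = list(
        sheet.iter_rows(
            min_row=min_row, max_row=max_row,
            min_col=min_col, max_col=max_col,
            values_only=True,  # yield cell values instead of Cell objects
        )
    )
    # The first row of the table holds the column names.
    return pd.DataFrame(rows[1:], columns=list(rows[0]))


# Made-up workbook that mimics a table placed away from cell A1.
wb = Workbook()
ws = wb.active
ws.title = "shipping_rates"
sample = [("region", "rate"), ("east", 5.0), ("west", 7.5)]
for r, row in enumerate(sample, start=8):       # start on Excel row 8
    for c, value in enumerate(row, start=3):    # start on Excel column C
        ws.cell(row=r, column=c, value=value)
ws.add_table(Table(displayName="ship_cost", ref="C8:D10"))

df = table_to_dataframe(wb["shipping_rates"], "ship_cost")
print(df)
```

Looking the range up by table name means the code keeps working if the table is moved or resized in Excel, as long as the name stays the same.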
The resulting DataFrame is a clean table that we can use for further calculations.
Summary
In an ideal world, the data we use would be in a simple, consistent format. See this paper for a nice discussion of what good spreadsheet practices look like.
In the examples in this article, you could easily delete rows and columns to make this more well-formatted. However, there are times where this is not feasible or advisable. The good news is that pandas and openpyxl give us all the tools we need to read Excel data - no matter how crazy the spreadsheet gets.
Changes
- 21-Oct-2020: Clarified that we don't want to include the notes column
Source: https://pbpython.com/pandas-excel-range.html