Read in Last Sheet in Excel File Pandas

March 03, 2022

Introduction

With pandas it is easy to read Excel files and convert the data into a DataFrame. Unfortunately, Excel files in the real world are often poorly constructed. In those cases where the data is scattered across the worksheet, you may need to customize the way you read the data. This article will discuss how to use pandas and openpyxl to read these types of Excel files and cleanly convert the data to a DataFrame suitable for further analysis.

The Problem

The pandas read_excel function does an excellent job of reading Excel worksheets. However, in cases where the data is not a continuous table starting at cell A1, the results may not be what you expect.

If you try to read in this sample spreadsheet using read_excel(src_file):

Excel

You will get something that looks like this:

Excel

These results include a lot of Unnamed columns, header labels within a row, and several extra columns we don't need.

Pandas Solutions

The simplest solution for this data set is to use the header and usecols arguments to read_excel(). The usecols parameter, in particular, can be very useful for controlling the columns you would like to include.

If you would like to follow along with these examples, the file is on GitHub.

Here is one approach to read only the data we need.

    import pandas as pd
    from pathlib import Path

    src_file = Path.cwd() / 'shipping_tables.xlsx'

    # Headers are in Excel row 2; keep only columns B through F
    df = pd.read_excel(src_file, header=1, usecols='B:F')

The resulting DataFrame only contains the data we need. In this example, we purposely exclude the notes column and date field:

Clean DataFrame

The logic is relatively straightforward. usecols can take Excel ranges such as B:F and read in only those columns. The header parameter expects a single integer that defines the header row. This value is 0-indexed, so we pass in 1 even though this is row 2 in Excel.
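If you do not have the sample file at hand, the same call can be tried against a small workbook built in memory. This is only a sketch: the sheet layout, the header labels and the values below are assumptions made for illustration, not the contents of the real shipping_tables.xlsx.

```python
from io import BytesIO

import pandas as pd
from openpyxl import Workbook

# Build a tiny workbook in memory. The layout is assumed for illustration:
# a title in Excel row 1, header labels in row 2, nothing in column A.
wb = Workbook()
ws = wb.active
ws.append([None, "Shipping report"])
ws.append([None, "item_type", "order id", "order date", "state", "priority", "notes"])
ws.append([None, "Cereal", 1001, "2020-01-05", "TX", "H", "fragile"])
ws.append([None, "Snacks", 1002, "2020-01-06", "MN", "L", None])

buffer = BytesIO()
wb.save(buffer)
buffer.seek(0)

# header=1 means Excel row 2 holds the labels (0-indexed);
# usecols="B:F" drops the empty column A and the notes in column G.
df = pd.read_excel(buffer, header=1, usecols="B:F", engine="openpyxl")
print(df.columns.tolist())
```

The printed list should contain only the five labels from columns B through F.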

In some cases, we may want to define the columns as a list of numbers. In this example, we could define the list of integers:

    df = pd.read_excel(src_file, header=1, usecols=[1, 2, 3, 4, 5])

This approach might be useful if you have some sort of numerical pattern you want to follow for a large data set (e.g. every 3rd column or only even numbered columns).
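As a rough sketch of such patterns, the integer lists can be generated with range() and then passed to usecols. The column count of 12 is an assumption for the example. Note that usecols positions are 0-based, so the "even numbered" Excel columns (B, D, F, ...) are the odd positions here.

```python
# Assumption: the sheet has 12 columns (A through L); adjust to your file.
n_columns = 12

# Every 3rd column starting with the first one: A, D, G, J.
every_third = list(range(0, n_columns, 3))

# Even-numbered columns in Excel terms (B, D, F, ...) sit at the odd
# zero-based positions, because usecols counts from 0.
excel_even = list(range(1, n_columns, 2))

print(every_third)  # [0, 3, 6, 9]
print(excel_even)   # [1, 3, 5, 7, 9, 11]
```

Either list could then be used as usecols=every_third or usecols=excel_even in read_excel().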

The pandas usecols can also take a list of column names. This code will create an equivalent DataFrame:

    df = pd.read_excel(
        src_file,
        header=1,
        usecols=['item_type', 'order id', 'order date', 'state', 'priority'])

Using a list of named columns is going to be helpful if the column order changes but you know the names will not change.

Finally, usecols can take a callable function. Here's a simple long-form example that excludes unnamed columns as well as the priority column.

    # Define a more complex function:
    def column_check(x):
        if 'unnamed' in x.lower():
            return False
        if 'priority' in x.lower():
            return False
        if 'order' in x.lower():
            return True
        return True

    df = pd.read_excel(src_file, header=1, usecols=column_check)

The key concept to keep in mind is that the function will be called with each column name and must return True or False for each column. Those columns that evaluate to True will be included.
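To see the per-column behavior without opening a file, we can run a check function over a plain list of names. The header labels below are hypothetical stand-ins for what pandas would pass in, and the function is a trimmed variant of column_check.

```python
def column_check(x):
    """Return True for columns to keep and False for columns to skip."""
    name = x.lower()
    if "unnamed" in name:
        return False
    if "priority" in name:
        return False
    return True


# Hypothetical header labels, similar to a messy sheet's header row.
headers = ["Unnamed: 0", "item_type", "order id", "order date",
           "state", "priority", "notes"]

# Mimic what usecols does: call the function once per column name.
kept = [name for name in headers if column_check(name)]
print(kept)
```

Notice that a column such as notes is kept by this check, because nothing in the function excludes it. A callable only drops what it explicitly returns False for.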

Another approach to using a callable is to include a lambda expression. Here is an example where we want to include only a defined list of columns. We normalize the names by converting them to lower case for comparison purposes.

    cols_to_use = ['item_type', 'order id', 'order date', 'state', 'priority']
    df = pd.read_excel(src_file,
                       header=1,
                       usecols=lambda x: x.lower() in cols_to_use)

Callable functions give us a lot of flexibility for dealing with the real world messiness of Excel files.
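One possible way to reuse this idea across files is a small factory that builds the callable from a list of wanted names. The helper name make_column_filter is hypothetical, and the extra whitespace stripping and str() coercion are assumptions about the kind of messiness you might meet, not something the sample file requires.

```python
def make_column_filter(wanted):
    """Build a usecols-style callable from a list of column names."""
    normalized = {name.strip().lower() for name in wanted}

    def keep(column_name):
        # Header cells are not always text, so coerce to str before comparing.
        return str(column_name).strip().lower() in normalized

    return keep


keep = make_column_filter(
    ["item_type", "order id", "order date", "state", "priority"])

# Hypothetical labels with inconsistent case and stray spaces.
headers = ["Item_Type", " Order ID", "order date", "State",
           "Priority", "Notes", "Unnamed: 7"]
selected = [h for h in headers if keep(h)]
print(selected)
```

The returned function could then be passed as usecols=keep.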

Ranges and Tables

In some cases, the data could be even more obfuscated in Excel. In this example, we have a table called ship_cost that we want to read. If you must work with a file like this, it might be challenging to read in with the pandas options we have discussed so far.

Excel table

In this case, we can use openpyxl directly to parse the file and convert the data into a pandas DataFrame. The fact that the data is in an Excel table can make this process a little easier.

Here's how to use openpyxl (once it is installed) to read the Excel file:

    from openpyxl import load_workbook
    import pandas as pd
    from pathlib import Path

    src_file = Path.cwd() / 'shipping_tables.xlsx'

    wb = load_workbook(filename=src_file)

This loads the whole workbook. If we want to see all the sheets:

    wb.sheetnames

['sales', 'shipping_rates']

To access the specific sheet:

    sheet = wb['shipping_rates']

To see a list of all the named tables:

    sheet.tables.keys()

dict_keys(['ship_cost'])

This key corresponds to the name we assigned in Excel to the table. Now we access the table to get the equivalent Excel range:

    lookup_table = sheet.tables['ship_cost']
    lookup_table.ref

'C8:E16'
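As a quick sanity check on what a reference like this covers, the string can be split into numeric bounds. The parse_range helper below is a hypothetical pure-Python sketch that only handles simple A1-style ranges. openpyxl also provides range_boundaries in openpyxl.utils for the same job.

```python
import re


def parse_range(ref):
    """Split an A1-style range such as 'C8:E16' into 1-based bounds."""
    match = re.fullmatch(r"([A-Z]+)(\d+):([A-Z]+)(\d+)", ref.upper())
    if match is None:
        raise ValueError(f"not a simple A1 range: {ref!r}")

    def col_number(letters):
        # Column letters are a base-26 number: A=1, Z=26, AA=27, ...
        number = 0
        for ch in letters:
            number = number * 26 + (ord(ch) - ord("A") + 1)
        return number

    c1, r1, c2, r2 = match.groups()
    return col_number(c1), int(r1), col_number(c2), int(r2)


min_col, min_row, max_col, max_row = parse_range("C8:E16")
n_cols = max_col - min_col + 1
n_rows = max_row - min_row + 1
print(n_cols, n_rows)  # 3 9
```

So C8:E16 spans three columns and nine rows, which would be one header row plus eight rows of data.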

This worked. We now know the range of data we want to load. The final step is to convert that range to a pandas DataFrame. Here is a short code snippet to loop through each row and convert to a DataFrame:

    # Access the data in the table range
    data = sheet[lookup_table.ref]
    rows_list = []

    # Loop through each row and get the values in the cells
    for row in data:
        # Get a list of all columns in each row
        cols = []
        for col in row:
            cols.append(col.value)
        rows_list.append(cols)

    # Create a pandas dataframe from the rows_list.
    # The first row is the column names
    df = pd.DataFrame(data=rows_list[1:], index=None, columns=rows_list[0])

Here is the resulting DataFrame:

Excel shipping table

Now we have the clean table and can use it for further calculations.
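To try the whole round trip without the sample file, the following sketch writes a small named table into an in-memory workbook and reads it back with the same steps. The column names and values are made up for illustration, and the nested loop is condensed into a list comprehension, which is just one way to write it.

```python
from io import BytesIO

import pandas as pd
from openpyxl import Workbook, load_workbook
from openpyxl.worksheet.table import Table

# Write a small sheet with a named table that does not start at cell A1.
# The column names and values are made up for this sketch.
wb_out = Workbook()
ws_out = wb_out.active
ws_out.title = "shipping_rates"
sample = [
    ("region", "weight_class", "cost"),
    ("north", "light", 4.5),
    ("south", "heavy", 9.25),
]
for r, row in enumerate(sample, start=8):          # first table row is row 8
    for c, value in enumerate(row, start=3):       # column C is number 3
        ws_out.cell(row=r, column=c, value=value)
ws_out.add_table(Table(displayName="ship_cost", ref="C8:E10"))

buffer = BytesIO()
wb_out.save(buffer)
buffer.seek(0)

# Read it back: look up the table, take its range, collect the cell values.
wb = load_workbook(buffer)
sheet = wb["shipping_rates"]
table_ref = sheet.tables["ship_cost"].ref
rows = [[cell.value for cell in row] for row in sheet[table_ref]]

# The first row of the range holds the column names.
df = pd.DataFrame(rows[1:], columns=rows[0])
print(table_ref)
print(df)
```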

Summary

In an ideal world, the data we use would be in a simple consistent format. See this paper for a nice discussion of what good spreadsheet practices look like.

In the examples in this article, you could easily delete rows and columns to make this more well-formatted. However, there are times where this is not feasible or advisable. The good news is that pandas and openpyxl give us all the tools we need to read Excel data - no matter how crazy the spreadsheet gets.

Changes

21-Oct-2020: Clarified that we don't want to include the notes column

Source: https://pbpython.com/pandas-excel-range.html
