Python 3, Pandas and Encoding Issues

It is not unusual to run into encoding problems when opening files in Python 3. Encoding is a large topic in its own right, so here I am only providing some quick ways to deal with a typical issue you are likely to encounter.

Say you want to load a CSV file into a pandas dataframe. If the stars align and whoever generated your CSV is magnanimous, they may have saved the file using UTF-8. If so, you may get away with reading the file (here called myfile.csv) as follows:

import pandas as pd

df = pd.read_csv('myfile.csv')

You should in principle pass a parameter to pandas telling it what encoding the file has been saved with, so a more complete version of the snippet above would be:

import pandas as pd

df = pd.read_csv('myfile.csv', encoding='utf-8')
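
To see why the encoding matters, here is a small sketch of what goes wrong when the guess is off. It uses only the standard library so it runs anywhere; the sample text and the temporary file are made up for illustration. As far as I know, the encoding names used here are the same ones pandas accepts in its encoding parameter.

```python
import os
import tempfile

# Hypothetical sample data containing accented characters.
text = "name,city\nJosé,Málaga\n"

# Save the sample as Latin-1 rather than UTF-8.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(text.encode("latin-1"))
    path = tmp.name

# Reading it back as UTF-8 fails on the first accented character.
try:
    with open(path, encoding="utf-8") as fh:
        fh.read()
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# Reading it with the encoding it was saved in recovers the text.
with open(path, encoding="latin-1") as fh:
    recovered = fh.read()

os.remove(path)
print(utf8_ok, recovered == text)
```

pandas raises the same kind of UnicodeDecodeError when read_csv is given an encoding that cannot decode the file.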

Encoding conundrum

What happens when you don’t know what encoding was used to save the file? Well, you can ask, but it is very unlikely that whoever generated the file knows… What to do? Well, there are some libraries that can help.

Install the chardet module from the terminal as follows:

pip install chardet

And use the following snippet as a guide:

import chardet
import pandas as pd

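def find_encoding(fname):
    # Read the raw bytes; the context manager closes the file afterwards
    with open(fname, 'rb') as f:
        raw_data = f.read()
    # chardet returns a dict; 'encoding' holds its best guess (may be None)
    result = chardet.detect(raw_data)
    return result['encoding']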


my_encoding = find_encoding('myfile.csv')
df = pd.read_csv('myfile.csv', encoding=my_encoding)
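
Detection is a statistical guess, so it can still come back wrong or empty. One fallback, which is my own addition rather than anything chardet provides, is to try a short list of likely encodings and keep the first one that decodes the whole file. The function name first_working_encoding and the candidate list are hypothetical; adjust them to the encodings you expect to see.

```python
import os
import tempfile


def first_working_encoding(fname, candidates=("utf-8", "cp1252", "latin-1")):
    """Return the first candidate that decodes the whole file, else None."""
    with open(fname, "rb") as fh:
        raw = fh.read()
    for enc in candidates:
        try:
            raw.decode(enc)
        except UnicodeDecodeError:
            continue  # this candidate cannot decode the bytes, try the next
        return enc
    return None


# Demo on a made-up file saved as cp1252.
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write("drink\ncafé\n".encode("cp1252"))
    demo_path = tmp.name

guess = first_working_encoding(demo_path)
os.remove(demo_path)
print(guess)
```

A clean decode does not prove the encoding is the right one, only that the bytes are valid in it. In particular latin-1 accepts any byte sequence, which is why it sits last in the list as a catch-all. It is still worth eyeballing a few rows of the resulting dataframe.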

Et voilà!
