It is not unusual to come across encoding problems when opening files in Python 3. Encoding is a large topic in its own right, so here I am only providing some quick ways to deal with a typical issue you are likely to encounter.
Say you are interested in opening a CSV file to be loaded into a pandas dataframe. If the stars align and the generator of your CSV is magnanimous, they may have saved the file using UTF-8. If so, you may get away with reading the file (here called myfile.csv) as follows:
import pandas as pd

df = pd.read_csv('myfile.csv')
You should in principle pass a parameter to pandas telling it what encoding the file has been saved with, so a more complete version of the snippet above would be:
import pandas as pd

df = pd.read_csv('myfile.csv', encoding='utf-8')
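To see why the encoding argument matters, here is a small sketch using only the standard library. The sample text and variable names are made up for illustration. The same bytes decode cleanly with the codec they were written in, and either fail or turn into gibberish with the wrong one:

```python
# Illustrative sample text; any string with non-ASCII characters would do.
text = "café, naïve, señor"

# Pretend these bytes are the content of a file saved as Latin-1.
raw = text.encode("latin-1")

# Decoding with the codec used for saving gives the text back.
assert raw.decode("latin-1") == text

# Decoding the same bytes as UTF-8 fails. This is typically the error
# you see from read_csv when the encoding does not match the file.
try:
    raw.decode("utf-8")
    decode_failed = False
except UnicodeDecodeError as err:
    decode_failed = True
    print(f"Could not decode as UTF-8: {err}")

# The opposite mistake does not fail; it silently garbles the text.
mojibake = text.encode("utf-8").decode("latin-1")
print(mojibake)
```

The second case is the nastier one: no exception is raised, so the only symptom is odd characters in your dataframe.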
Encoding conundrum
What happens when you don’t know what encoding was used to save the file? Well, you can ask, but it is very unlikely that whoever generated the file knows… What to do? Well, there are some libraries that can be helpful.
Install the chardet module from the terminal as follows:
pip install chardet
And use the following snippet as a guide:
import chardet
import pandas as pd

def find_encoding(fname):
    # Read the raw bytes so chardet can inspect them
    with open(fname, 'rb') as f:
        r_file = f.read()
    # detect() returns a dict; the 'encoding' key holds the best guess
    result = chardet.detect(r_file)
    charenc = result['encoding']
    return charenc

my_encoding = find_encoding('myfile.csv')
df = pd.read_csv('myfile.csv', encoding=my_encoding)
Et voilà!
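One caveat: chardet's answer is a statistical guess, and the dictionary returned by chardet.detect also carries a confidence value between 0 and 1. Below is a minimal sketch of how you might use it. The function name guess_encoding, the sample_size parameter, the 0.5 threshold and the UTF-8 fallback are all my own arbitrary choices, not anything chardet prescribes. The detect argument is only there so the function can be tried with a stand-in detector; by default it uses chardet.detect.

```python
def guess_encoding(fname, sample_size=100_000, min_confidence=0.5,
                   fallback="utf-8", detect=None):
    """Return a best-guess encoding for fname, or fallback if unsure.

    Hypothetical helper: the threshold, sample size and fallback are
    illustrative defaults, not recommendations from chardet.
    """
    if detect is None:
        import chardet  # imported lazily; needs `pip install chardet`
        detect = chardet.detect

    # Read only the first sample_size bytes so large files stay cheap.
    with open(fname, "rb") as f:
        sample = f.read(sample_size)

    result = detect(sample)
    encoding = result.get("encoding")
    confidence = result.get("confidence") or 0.0

    # chardet reports encoding=None when it cannot decide at all.
    if encoding is None or confidence < min_confidence:
        return fallback
    return encoding
```

You would then call pd.read_csv('myfile.csv', encoding=guess_encoding('myfile.csv')) as before. Sampling only the start of the file is a trade-off: it is faster, but it can miss non-ASCII characters that appear only later on.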