Python Quickstart

  1. Before you start
  2. Install the Python Client
  3. Work with qri in a Jupyter Notebook

1. Before you start

Qri has a simple python client that makes it easier to work with qri datasets using the tools you already use, like pandas and Jupyter.

Since the qri python client uses qri under the hood, you need to have qri installed and set up before using it. To install qri, you can either install the desktop app for OS X or install the CLI and create your repo (for the CLI, run qri setup to do this).

Next, try adding a dataset. If you need an example dataset (or would like to follow along in later steps), you can download one from our repo with the following command:

curl https://raw.githubusercontent.com/qri-io/qri-python/master/example_data/body.csv -o "body.csv" https://raw.githubusercontent.com/qri-io/qri-python/master/example_data/head.yaml -o "head.yaml"

To add it, run the following from the directory containing the downloaded files:

qri new --file head.yaml --body body.csv me/BirthdatesOfUSPresidents

(where ‘BirthdatesOfUSPresidents’ can be changed to any name you find descriptive). If this succeeds, you should be good to move on to installing and using the qri python client.

2. Install the Python Client

Ensure that you are running python 3, and then install with:

pip install qri
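
If you are unsure which interpreter you are running, a quick check like the following can confirm that it is python 3. This is a minimal sketch using only the standard library; the helper name is_python3 is made up for illustration and is not part of the qri client.

```python
import sys


def is_python3(version_info=sys.version_info):
    """Return True when the given version tuple is python 3 or newer."""
    return version_info[0] >= 3


# Show the interpreter's full version string and the result of the check.
print(sys.version)
print("python 3:", is_python3())
```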

3. Work with qri in a Jupyter Notebook

The qri python client does not currently support the full array of features available in the desktop and CLI clients. The functionality it currently supports includes:

  • listing datasets saved in your repo
  • loading datasets into a pandas dataframe for manipulation in python
  • saving datasets back to your repo

To demonstrate the functionality we’ll walk through loading, changing, and saving the dataset we just added in a Jupyter notebook:


First, we import qri:

import qri

To see what datasets we have in our repo, we use qri.list_ds:

qri.list_ds()
['fivethirtyeight/weather_ksea',
 'dustmop/test3_repo',
 'osterbit/BirthdatesOfUSPresidents']

To load a dataset into memory, use qri.load_ds, passing the name of the dataset you want to load.

ds = qri.load_ds('osterbit/BirthdatesOfUSPresidents')

The python QriDataset is represented as an object with two properties: a head containing the dataset’s metadata as a python dictionary and a body containing the data as a pandas dataframe. To manipulate these objects we can just use the native methods already available to them.

ds.head
{'bodyPath': '/ipfs/QmV5kQAyeDEJKkTTk97pEAFCnEty2iRvrzErYhgsz87BZu',
 'commit': {'author': {'id': 'QmQDAHk8jx6mJ1migbC6oEij52odepBV7RHBoGoGFWUr7F'},
  'path': '/ipfs/QmUpPqbzXQMWFDXChruFH4JTJchbHkGNyaL9A6g9LWXbJa',
  #...
 'meta': {'description': 'Date and location of birth and death of US Presidents as of 2018',
  'qri': 'md:0',
  'title': 'Birthdsates of US Presidents'},
 'path': '/ipfs/QmQc5vDSpa9UfpUu9o2vpmahoVJueE5F7mdgoS1pfD37kR/dataset.json',
 'qri': 'ds:0',
 'root': 'osterbit/[email protected]GFWUr7F/ipfs/QmQc5vDSpa9UfpUu9o2vpmahoVJueE5F7mdgoS1pfD37kR',
 'structure': {'checksum': 'QmPtotmvHgy8bREmf5oQKN5EDKfpjzjUYmeunXyqV9UVHR',
  #...
  'schema': {'items': {'items': [{'title': 'president', 'type': 'string'},
     {'title': 'birth_date', 'type': 'string'},
     {'title': 'birth_place', 'type': 'string'},
     {'title': 'death_date', 'type': 'string'},
     {'title': 'location_of_death', 'type': 'string'}],
    'type': 'array'},
   'type': 'array'}}}
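
Because the head is an ordinary dictionary, standard dict access is enough to pull out, for example, the column titles declared in the schema. The sketch below uses a hand-copied fragment of the output above as a stand-in for ds.head, so it runs without a qri repo; the helper name column_titles is hypothetical.

```python
# Stand-in for ds.head: only the schema fragment from the output above.
head = {
    "structure": {
        "schema": {
            "type": "array",
            "items": {
                "type": "array",
                "items": [
                    {"title": "president", "type": "string"},
                    {"title": "birth_date", "type": "string"},
                    {"title": "birth_place", "type": "string"},
                    {"title": "death_date", "type": "string"},
                    {"title": "location_of_death", "type": "string"},
                ],
            },
        }
    }
}


def column_titles(head):
    """List the column titles declared in the dataset's schema."""
    columns = head["structure"]["schema"]["items"]["items"]
    return [column["title"] for column in columns]


print(column_titles(head))
```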

You’ll notice in the body below that the field ‘birth_date’ is inconsistently formatted. In some entries the month is abbreviated while in others it is written out, and in the later entries the date is given with the day before the month and a two-digit year:

ds.body
     birth_date     birth_place            death_date     location_of_death     president
0    Feb 22, 1732   Westmoreland Co., Va.  Dec 14, 1799   Mount Vernon, Va.     George Washington
1    Oct 30, 1735   Quincy, Mass.          July 4, 1826   Quincy, Mass.         John Adams
2    Apr 13, 1743   Albemarle Co., Va.     July 4, 1826   Albemarle Co., Va.    Thomas Jefferson
3    Mar 16, 1751   Port Conway, Va.       June 28, 1836  Orange Co., Va.       James Madison
4    Apr 28, 1758   Westmoreland Co., Va.  July 4, 1831   New York, New York    James Monroe
5    July 11, 1767  Quincy, Mass.          Feb 23, 1848   Washington, D.C.      John Quincy Adams
...  ...            ...                    ...            ...                   ...
33   29-May-17      Brookline, Mass.       22-Nov-63      Dallas, Texas         John F. Kennedy
34   27-Aug-08      Gillespie Co., Texas   22-Jan-73      Gillespie Co., Texas  Lyndon B. Johnson
35   9-Jan-13       Yorba Linda, Cal.      22-Apr-94      New York, New York    Richard Nixon
36   14-Jul-13      Omaha, Nebraska        26-Dec-06      Rancho Mirage, Cal.   Gerald Ford
37   1-Oct-24       Plains, Georgia                                             Jimmy Carter
38   6-Feb-11       Tampico, Illinois      5-Jun-04       Los Angeles, Cal.     Ronald Reagan
39   12-Jun-24      Milton, Mass.                                               George Bush
40   19-Aug-46      Hope, Arkansas                                              Bill Clinton
41   6-Jul-46       New Haven, Conn.                                            George W. Bush
42   4-Aug-61       Honolulu, Hawaii                                            Barack Obama
43   14-Jun-46      New York, New York                                          Donald Trump

Fixing the data

To fix the inconsistent date formatting, we’ll write a function to parse the dates. It looks like there are 3 different formats, so we’ll want to handle each of them (for more info on dates and strings in python, strftime.org has a good cheat sheet).

import datetime

def parse_date(date_string):
    # try each of the three formats found in the data, in order
    for date_fmt in ('%b %d, %Y', '%B %d, %Y', '%d-%b-%y'):
        try:
            return datetime.datetime.strptime(date_string, date_fmt)
        except ValueError:
            pass
    # none of the known formats matched
    raise ValueError('unable to parse date {!r} with given formats'.format(date_string))
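
As a quick check, the parser can be tried on one sample of each format taken from the body above. The block carries its own copy of the parser so that it runs on its own; note the year that comes back for the two-digit sample.

```python
import datetime


def parse_date(date_string):
    # try each of the three formats found in the data, in order
    for date_fmt in ('%b %d, %Y', '%B %d, %Y', '%d-%b-%y'):
        try:
            return datetime.datetime.strptime(date_string, date_fmt)
        except ValueError:
            pass
    raise ValueError('unable to parse date {!r} with given formats'.format(date_string))


# one sample per format: abbreviated month, full month, day-first with 2-digit year
samples = ['Feb 22, 1732', 'July 11, 1767', '29-May-17']
years = [parse_date(sample).year for sample in samples]
print(years)  # [1732, 1767, 2017]
```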

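If we apply this function alone, we’ll have an issue: two-digit years such as ‘17’ are parsed as 2017 rather than 1917, because python’s %y maps 00 to 68 onto 2000 to 2068 (and 69 to 99 onto 1969 to 1999). Since a president must be at least 35 years old, any parsed birth year later than the current year minus 35 must belong to the previous century. We’ll use that fact to fix the year, then combine the two functions and convert back to a string: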

def fix_date(date, max_year, adjustment=100):
    # a year later than max_year means the two-digit year was placed in the
    # wrong century (2000s instead of 1900s), so shift it back
    if date.year > max_year:
        return date.replace(year=date.year - adjustment)
    return date

# (combining the above)
def make_dates_consistent(date_string):
    # parse the date into a datetime object
    date_obj = parse_date(date_string)
    # since the minimum age to be president is 35,
    # no birth year can be later than the current year minus 35
    max_possible_dob_year = datetime.datetime.today().year - 35
    dob = fix_date(date_obj, max_possible_dob_year)
    # CSVs don't know about datetime objects, so we convert back to a string;
    # the format also includes the day of the week since it's easy to add
    fmt = "%a %b %d, %Y"
    return dob.strftime(fmt)
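
For experimenting outside the notebook, the same logic can be condensed into one self-contained function. This is a sketch rather than the tutorial's own code: the name normalize_dob and the current_year parameter are assumptions introduced here, the latter so that the result does not depend on the day the code is run.

```python
import datetime

FORMATS = ('%b %d, %Y', '%B %d, %Y', '%d-%b-%y')


def normalize_dob(date_string, current_year):
    """Parse a birth date and return it in the form 'Tue May 29, 1917'."""
    for fmt in FORMATS:
        try:
            date = datetime.datetime.strptime(date_string, fmt)
            break
        except ValueError:
            continue
    else:
        # the loop finished without a successful parse
        raise ValueError('unable to parse date {!r}'.format(date_string))
    # A president must be at least 35, so a later birth year means the
    # two-digit year was placed in the wrong century.
    if date.year > current_year - 35:
        date = date.replace(year=date.year - 100)
    return date.strftime('%a %b %d, %Y')


print(normalize_dob('29-May-17', current_year=2018))      # Tue May 29, 1917
print(normalize_dob('July 11, 1767', current_year=2018))  # Sat Jul 11, 1767
```

Passing the year in explicitly also makes the century rule easy to test, which the version based on datetime.datetime.today() is not.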

Now we can apply this function to the field as we would with a regular dataframe:

ds.body['birth_date'] = ds.body['birth_date'].apply(make_dates_consistent)

Next we check the output to make sure it worked:

ds.body
     birth_date        birth_place            death_date     location_of_death     president
0    Fri Feb 22, 1732  Westmoreland Co., Va.  Dec 14, 1799   Mount Vernon, Va.     George Washington
1    Sun Oct 30, 1735  Quincy, Mass.          July 4, 1826   Quincy, Mass.         John Adams
2    Sat Apr 13, 1743  Albemarle Co., Va.     July 4, 1826   Albemarle Co., Va.    Thomas Jefferson
3    Tue Mar 16, 1751  Port Conway, Va.       June 28, 1836  Orange Co., Va.       James Madison
4    Fri Apr 28, 1758  Westmoreland Co., Va.  July 4, 1831   New York, New York    James Monroe
5    Sat Jul 11, 1767  Quincy, Mass.          Feb 23, 1848   Washington, D.C.      John Quincy Adams
...  ...               ...                    ...            ...                   ...
33   Tue May 29, 1917  Brookline, Mass.       22-Nov-63      Dallas, Texas         John F. Kennedy
34   Thu Aug 27, 1908  Gillespie Co., Texas   22-Jan-73      Gillespie Co., Texas  Lyndon B. Johnson
35   Thu Jan 09, 1913  Yorba Linda, Cal.      22-Apr-94      New York, New York    Richard Nixon
36   Mon Jul 14, 1913  Omaha, Nebraska        26-Dec-06      Rancho Mirage, Cal.   Gerald Ford
37   Wed Oct 01, 1924  Plains, Georgia                                             Jimmy Carter
38   Mon Feb 06, 1911  Tampico, Illinois      5-Jun-04       Los Angeles, Cal.     Ronald Reagan
39   Thu Jun 12, 1924  Milton, Mass.                                               George Bush
40   Mon Aug 19, 1946  Hope, Arkansas                                              Bill Clinton
41   Sat Jul 06, 1946  New Haven, Conn.                                            George W. Bush
42   Fri Aug 04, 1961  Honolulu, Hawaii                                            Barack Obama
43   Fri Jun 14, 1946  New York, New York                                          Donald Trump
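
Beyond eyeballing the output, a small check can confirm that every value in the column parses with the target format. The sketch below uses a plain list as a stand-in for the values of ds.body['birth_date'] so that it runs without a qri repo; the helper name is_consistent is hypothetical, and one unconverted value is included on purpose to show what a failure looks like.

```python
import datetime

TARGET_FORMAT = '%a %b %d, %Y'


def is_consistent(date_string, fmt=TARGET_FORMAT):
    """Return True if the string parses with the target format."""
    try:
        datetime.datetime.strptime(date_string, fmt)
    except ValueError:
        return False
    return True


# Stand-in for the column values: three converted dates from the output
# above, plus one date still in its original day-first form.
values = ['Fri Feb 22, 1732', 'Thu Jan 09, 1913', 'Fri Jun 14, 1946', '9-Jan-13']
bad = [value for value in values if not is_consistent(value)]
print(bad)  # ['9-Jan-13']
```

An empty list means every value is in the new format. Note that strptime only checks the shape of the string here; it does not verify that the weekday name agrees with the date.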

Saving your improvements

Finally, we save the dataset back to our repo with a commit message describing the changes we made:

ds.save("fixed inconsistent date formatting", publish=True)
posting dataset to registry ...

dataset saved