Travel Diary of a Data Scientist: February 2017

Saturday, February 11, 2017

Pandas Documentation

The challenges with all of the free Pandas videos and tutorials online are these:

1. Presenter quality varies
2. Sound/video quality varies
3. Teacher/presenter skills vary
4. Organization is often lacking
5. The resource (audio/video file) label usually doesn't describe the contents adequately, so much time is spent just looking around to find out if it's useful
6. Difficult to find again
7. Difficult to keep organized - and remember one's place - when navigating away from it and returning later

Currently, my go-to resources are:

1. Working through examples from the official documentation found at PyData: http://pandas.pydata.org/

2. Taking the inexpensive, so-far high-quality Python courses at Udemy.

Notes:

I am starting to get the hang of various methods and functions. Tonight, I am studying these items:

1. Object Creation:

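import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])
dates = pd.date_range('20130101', periods=6)  ## the index used by df below; start date is arbitrary
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))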

2. Viewing Data:

df.head()
df.tail(3)
df.index ## attribute, not a method (no parentheses)
df.columns ## attribute
df.values ## attribute
df.describe()
df.T
df.sort_index(axis=1, ascending=False)
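A minimal, self-contained sketch of the viewing calls above. The `dates` index and its start date are assumptions made only so the snippet runs on its own; the random values will differ on every run, so only shapes and labels are printed.

```python
import numpy as np
import pandas as pd

# Assumed setup: six consecutive days; the start date is arbitrary.
dates = pd.date_range("2013-01-01", periods=6)
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))

first_rows = df.head()       # head() returns the first 5 rows by default
last_rows = df.tail(3)       # last 3 rows
transposed = df.T            # rows and columns swapped
reversed_cols = df.sort_index(axis=1, ascending=False)  # sort column labels, descending

print(first_rows.shape, last_rows.shape, transposed.shape)
print(df.index)              # index, columns and values are attributes, not methods
print(list(reversed_cols.columns))
print(df.describe().loc[["mean", "std"]])  # summary statistics per column
```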

3. Data Selection:

df['A'] ## selecting a single column, which yields a Series, equivalent to df.A
df[0:3] ## slice rows
df.loc[dates[0]] ## get a cross-section using a label
df.loc[:, ['A','B']]   ## select on multiple axes by label
df.loc[dates[0], 'A'] ## for getting a scalar value
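To check what each label-based selection actually returns, here is a small sketch. It assumes a frame filled with the integers 0 to 23 instead of random numbers, purely so the results are predictable; the `dates` index is likewise an assumption.

```python
import numpy as np
import pandas as pd

# Assumed setup: predictable integer values so each result can be checked by eye.
dates = pd.date_range("2013-01-01", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

col = df["A"]                  # a Series holding column A
rows = df[0:3]                 # a DataFrame with the first three rows
row = df.loc[dates[0]]         # a Series: one row, indexed by column name
sub = df.loc[:, ["A", "B"]]    # a DataFrame: all rows, columns A and B
val = df.loc[dates[0], "A"]    # a single scalar value

print(type(col).__name__, rows.shape, sub.shape)
print(row)
print(val)
```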

4. Selection by Position:

df.iloc[3] ## via position of passed integers
df.iloc[3:5, 0:2]   ## by integer slice
df.iloc[[1,2,4],[0,2]]   ## by lists of integer position locations, similar to numpy
df.iloc[1:3, :] ## for slicing rows explicitly
df.iloc[:, 1:3] ## for slicing columns explicitly
df.iloc[1,1] ## for getting a value explicitly
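A sketch of the position-based calls above, again assuming a frame of the integers 0 to 23 (not random data) so the picked values are easy to verify. The last two lines illustrate one difference worth remembering: integer slices with iloc exclude the end point, while label slices with loc include it.

```python
import numpy as np
import pandas as pd

# Assumed setup: row i holds the values 4*i .. 4*i + 3.
dates = pd.date_range("2013-01-01", periods=6)
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

row = df.iloc[3]                      # fourth row as a Series
block = df.iloc[3:5, 0:2]             # rows 3-4, columns 0-1
picked = df.iloc[[1, 2, 4], [0, 2]]   # explicit lists of row and column positions
value = df.iloc[1, 1]                 # a single scalar

by_position = df.iloc[1:3]                # end point excluded
by_label = df.loc[dates[1]:dates[3]]      # end point included

print(block)
print(picked)
print(value, len(by_position), len(by_label))
```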

Monday, February 6, 2017

Pandas, Pandas, Pandas

Today I'm going to get caught up on the three ConversionXL Institute "Google Tag Manager" videos on which I am behind. I have watched the pre-videos but none of the recordings of the live webinars.

In Python news: I'm increasingly aware of: a. how great Wes McKinney's book is, and yet, b. how outdated it is. Pandas has come a long way in 5 years. I've already pre-ordered his 2nd edition - but there is no promised release date.

I'm going to favor the online pandas documentation and cookbook for the time being. But I will still be reviewing Wes's work.

I've also read parts of "Data Wrangling with Python." Good stuff, but it feels like a lot of the items covered there can be handled more easily with pandas. I'm looking forward to reading the section on how to choose the appropriate data set for analysis.

Tag Manager Notes:

Benefits:

Use Tag Manager like a video game controller, setting up "if, then" rules for firing events based upon very specific actions.
Does what Google Analytics can't do on its own. GA alone is incomplete: we want to know more about specific actions (How far did they scroll down? How long were they actively engaged on the page? What is the real bounce rate?). With Tag Manager, much more useful information is sent into Google Analytics.
Future-proofs you: everything is moving into Tag Manager

Definitions:

Tags: what you want GTM to do

Triggers: when you want GTM to fire a tag

Variables: additional information you can provide to GTM to help the trigger to fire a tag (not always needed with every tag or trigger)

Folders: a way to organize

Notes:

Tags:

Use Universal Analytics (unless the GA account still uses the old Classic GA)
For Google AdWords and Google Retargeting, use one of the built-in tags in GTM rather than creating a new one from scratch
Facebook and Google don't play nice, so we have to create a custom HTML tag to fire Facebook or other third-party scripts
Use custom image tags that fire back pixel data

Saturday, February 4, 2017

Data Cleanup Toolset

Listened to "Talk Python to Me" episode #90 today, "Data Wrangling with Python" featuring guest Katharine Jarmul, co-author of "Data Wrangling with Python."

Here is a list of some amazing data cleanup tools Katharine shared:

Dedupe Python Library: github.com/datamade/dedupe
probablepeople: github.com/datamade/probablepeople
usaddress: github.com/datamade/usaddress
jellyfish: github.com/jamesturk/jellyfish
Fuzzywuzzy: github.com/seatgeek/fuzzywuzzy
scrubadub: github.com/datascopeanalytics/scrubadub
pint: pint.readthedocs.io
arrow: github.com/crsmithdev/arrow
pdftables.six: github.com/vnaydionov/pdftables
Datacleaner: github.com/rhiever/datacleaner
Parserator: github.com/datamade/parserator
Gensim: radimrehurek.com/gensim
Faker: github.com/joke2k/faker
Dask: dask.pydata.org
SpaCy: spacy.io
Airflow: airflow.incubator.apache.org
Luigi: luigi.readthedocs.io
Hypothesis (testing): hypothesis.works

Friday, February 3, 2017

Pandas Pivot Tables

Working hard on learning the pandas pivot table function, pd.pivot_table.

Notably, several of the methods in McKinney's book are outdated and I've had to search for online documentation.

This has been a great site:

http://pandas.pydata.org/pandas-docs/stable/

And specifically this page:

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html

Currently, looking to differentiate between "index" and "columns."

Here is an example query on some data I am working with:

pd.pivot_table(my_data, values=["Number"], columns=["Accumulating Suburban Families", "Client"], aggfunc=[np.sum])
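To see the difference between "index" and "columns," a tiny sketch helps. The data below is made up for illustration (the `sales` frame and its "Client", "Segment", and "Number" columns are hypothetical, not my real data set). Keys passed to `index` become the row labels of the result; keys passed to `columns` become the column labels.

```python
import pandas as pd

# Hypothetical toy data, invented only to show where the keys end up.
sales = pd.DataFrame({
    "Client": ["Acme", "Acme", "Bolt", "Bolt"],
    "Segment": ["Urban", "Suburban", "Urban", "Suburban"],
    "Number": [10, 20, 30, 40],
})

# Clients down the side, segments across the top.
by_client = pd.pivot_table(sales, values="Number", index="Client",
                           columns="Segment", aggfunc="sum")

# Same numbers with the roles swapped: segments down the side, clients across the top.
by_segment = pd.pivot_table(sales, values="Number", index="Segment",
                            columns="Client", aggfunc="sum")

print(by_client)
print(by_segment)
```

Swapping the two arguments transposes the table; the aggregated values stay the same.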