Saturday, February 11, 2017

Pandas Documentation

The challenges with all of the free Pandas videos and tutorials online are these:

1. Presenter quality varies
2. Sound/video quality varies
3. Teacher/presenter skills vary
4. Organization is often lacking
5. Resource (audio/video file) labels usually don't describe the contents adequately, so much time is spent just figuring out whether a resource is useful
6. Resources are difficult to find again later
7. It is difficult to stay organized - and to remember one's place - when navigating away and returning later

Currently, my go-to resources are:

1. Working through examples from the official documentation found at PyData: http://pandas.pydata.org/

2. Taking the inexpensive, so-far high-quality Python courses at Udemy.

Notes:

I am starting to get the hang of various methods and functions. Tonight, I am studying these items:

1. Object Creation:

import numpy as np
import pandas as pd

s = pd.Series([1, 3, 5, np.nan, 6, 8])

dates = pd.date_range('20170101', periods=6)  ## index for the DataFrame below
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list('ABCD'))

2. Viewing Data:

df.head()  ## first five rows
df.tail(3)  ## last three rows
df.index  ## an attribute, not a method: the row labels
df.columns  ## the column labels
df.values  ## the underlying NumPy array
df.describe()  ## quick summary statistics
df.T  ## transpose
df.sort_index(axis=1, ascending=False)  ## sort by column labels, descending


3. Data Selection:

df['A']  ## selecting a single column, which yields a Series (equivalent to df.A)
df[0:3]  ## slice rows
df.loc[dates[0]]  ## get a cross-section using a label
df.loc[:, ['A', 'B']]  ## select on a multi-axis by label
df.loc[dates[0], 'A']  ## for getting a scalar value


4. Selection by Position

df.iloc[3]  ## via the position of the passed integer
df.iloc[3:5, 0:2]  ## by integer slices
df.iloc[[1, 2, 4], [0, 2]]  ## by lists of integer position locations, similar to NumPy
df.iloc[1:3, :]  ## for slicing rows explicitly
df.iloc[:, 1:3]  ## for slicing columns explicitly
df.iloc[1, 1]  ## for getting a value explicitly
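
One thing that helps me keep .loc and .iloc straight: both can reach the same cell, one by label and one by position. A quick sanity check of my own on the df created above:

df.loc[dates[1], 'B']  ## label-based: second date, column 'B'
df.iloc[1, 1]  ## position-based: row 1, column 1 -- the same cell as above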

Monday, February 6, 2017

Pandas, Pandas, Pandas

Today I'm going to get caught up on the three ConversionXL Institute "Google Tag Manager" videos on which I am behind. I have watched the pre-videos but none of the recordings of the live webinars.

In Python news: I'm increasingly aware of (a) how great Wes McKinney's book is and yet (b) how outdated it is. Pandas has come a long way in five years. I've already pre-ordered his 2nd edition - but there is no promised release date.

I'm going to favor the online pandas documentation and cookbook for the time being. But I will still be reviewing Wes's work.

I've also read parts of "Data Wrangling with Python." Good stuff, but it feels like many of the items covered there can be handled more easily with pandas. I'm looking forward to reading the authors' section on how to choose the appropriate data set for analysis.

Tag Manager Notes:

Benefits:
  1. Use Tag Manager like a video game controller, setting up "if, then" rules for firing events based upon very specific actions.
  2. Does what Google Analytics can't do; GA is incomplete; we want to know more about specific actions (how far did they scroll down? how long were they actively engaged on page? what is real bounce rate?); with Tag Manager, much more useful information is sent into Google Analytics.
  3. Future-proofs you: everything is moving into Tag Manager.
Definitions:

Tags: what you want GTM to do

Triggers: when you want GTM to fire a tag

Variables: additional information you can provide to GTM to help a trigger fire a tag (not always needed with every tag or trigger)

Folders: a way to keep tags, triggers, and variables organized

Notes:

Tags:

  1. Use Universal Analytics (unless the GA account is still on the old, classic GA)
  2. For Google AdWords and Google retargeting, use one of the built-in tags in GTM rather than creating a new one from scratch
  3. Facebook and Google don't play nicely together, so we have to create a custom HTML tag to fire Facebook and other third-party scripts
  4. Use custom image tags that send pixel data back

Saturday, February 4, 2017

Data Cleanup Toolset

Listened to "Talk Python to Me" episode #90 today, "Data Wrangling with Python" featuring guest Katharine Jarmul, co-author of "Data Wrangling with Python."

Here is a list of some amazing data cleanup tools Katharine shared:

Dedupe Python Library: github.com/datamade/dedupe
probablepeople: github.com/datamade/probablepeople
usaddress: github.com/datamade/usaddress
jellyfish: github.com/jamesturk/jellyfish
Fuzzywuzzy: github.com/seatgeek/fuzzywuzzy
scrubadub: github.com/datascopeanalytics/scrubadub
pint: pint.readthedocs.io
arrow: github.com/crsmithdev/arrow
pdftables.six: github.com/vnaydionov/pdftables
Datacleaner: github.com/rhiever/datacleaner
Parserator: github.com/datamade/parserator
Gensim: radimrehurek.com/gensim
Faker: github.com/joke2k/faker
Dask: dask.pydata.org
SpaCy: spacy.io
Airflow: airflow.incubator.apache.org
Luigi: luigi.readthedocs.io
Hypothesis (testing): hypothesis.works
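
To make one of these concrete, here is a minimal taste of fuzzywuzzy for fuzzy string matching (my own sketch, assuming the package is pip-installed; the strings are made up):

from fuzzywuzzy import fuzz

fuzz.ratio("Mind Ecology, Austin TX", "MindEcology - Austin, Texas")  ## similarity score from 0 to 100
fuzz.partial_ratio("Austin", "Austin, TX")  ## 100, because the shorter string matches a substring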

Friday, February 3, 2017

Pandas Pivot Tables

Working hard on learning the pandas pivot table function, pd.pivot_table.

Notably, several of the methods in McKinney's book are outdated, and I've had to search for the online documentation.

This has been a great site:

http://pandas.pydata.org/pandas-docs/stable/

And specifically this page:

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.pivot_table.html

Currently, I am looking to differentiate between the "index" and "columns" arguments.

Here is an example query on some data I am working with:

pd.pivot_table(my_data, values=["Number"], columns=["Accumulating Suburban Families", "Client"], aggfunc=[np.sum])
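
As a note to myself on index vs. columns, a comparison sketch on the same (hypothetical) my_data: index controls the grouping that runs down the rows, while columns spreads groups across the top.

pd.pivot_table(my_data, values=["Number"], index=["Client"], aggfunc=[np.sum])  ## one row per client
pd.pivot_table(my_data, values=["Number"], index=["Client"], columns=["Accumulating Suburban Families"], aggfunc=[np.sum])  ## clients down the rows, segments across the top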

Tuesday, January 31, 2017

Getting a Grip on the Tools

Resources I have been exploring:

The Quandl API, for grabbing copious amounts of clean, good, useful data (free and paid) and bringing it right into Python as a DataFrame.
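
A minimal sketch of what that looks like (assuming the quandl package is installed; "WIKI/AAPL" is one of Quandl's free dataset codes, and the API key is only needed for some datasets):

import quandl

## quandl.ApiConfig.api_key = 'YOUR_KEY'  ## placeholder; required only for certain datasets
df = quandl.get("WIKI/AAPL")  ## returns Apple stock prices as a pandas DataFrame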

Listening to the "Talk Python to Me" podcast series, which is a great resource for hearing how experienced programmers think, how they talk to each other, and how they see the world.

Update: Finally decided to plow through Wes McKinney's "Python for Data Analysis." I admire his intelligence, dedication and entrepreneurship in writing Pandas for Python. I've been avoiding really going through this book for 3 years. Time to take it seriously and dive in.

Key Concepts:

import json
path = 'path1/path2/file.txt'  ## assigns the path to a variable
records = [json.loads(line) for line in open(path, 'rb')]  ## loads the records, line by line, from path ('rb' opens the file in binary read mode)

records[0]  ## reads the first full record, outputs multiple lines
records[0]['some_column_name']  ## displays one column of the first record, including unicode
print records[0]['some_column_name']  ## displays one column of the first record, sans unicode

from pandas import DataFrame, Series  ## imports the DataFrame and Series classes from the pandas package

import pandas as pd

frame = DataFrame(records)  ## converts the list in the variable "records" to a DataFrame named "frame"

frame.info()  ## yields the summary view: the total number of records and the non-null count for each column/attribute

We need to import these in order to properly plot with matplotlib:

import matplotlib
import numpy as np
import matplotlib.pyplot as plt
## the Jupyter "magic" below renders plots inline in the notebook
%matplotlib inline
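
And a quick smoke test that plotting works, using the frame built above (the column name is just a placeholder from my notes):

frame['some_column_name'].value_counts()[:10].plot(kind='barh')  ## horizontal bar chart of the ten most frequent values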

Monday, January 30, 2017

Learning Google Tag Manager

Learning Google Tag Manager. I have purchased the eBook, "Practical Google Analytics & Google Tag Manager for Developers," by Jonathan Weber. It is pretty huge (thick), so I also purchased an online course by ConversionXL, with host Chris Mercer.


Before the course started, I created an account and my first container. The first video shows me the basics of the Tag Manager environment and helps me make sure I get set up properly.

Other notes:

Python: finding that there are several ways to import .CSV and .XLS files. You can reference the path as a variable and then call pd.read_csv(path), or you can pass the path string directly. There are several variants; see the sketch below.
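
A sketch of the variants I mean (the file names are hypothetical):

import pandas as pd

path = 'data/sales.csv'  ## hypothetical file
df1 = pd.read_csv(path)  ## path referenced as a variable
df2 = pd.read_csv('data/sales.csv')  ## path passed directly
df3 = pd.read_excel('data/sales.xls')  ## .XLS files go through read_excel instead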

Also, in Jupyter Notebook (I have it installed with the Anaconda package), I find that specifying paths seems inconsistent. I have run the same code in the same notebook on two different days and watched it fail the second time.
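
One workaround sketch of my own for making paths less fragile: check which directory the notebook is actually running from, and build the path from there (the 'data' folder is hypothetical).

import os
import pandas as pd

os.getcwd()  ## confirm the notebook's current working directory
path = os.path.join(os.getcwd(), 'data', 'file.csv')  ## build an absolute path from it
df = pd.read_csv(path)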

Resources I've been using lately:

Analytics Vidhya, for some free, some paid courses and content

"Python for Data Analysis," by Wes McKinney
"Data Wrangling with Python" by Jacqueline Kazil & Katharine Jarmul


(Re-) Starting the Learning Journey

I am a data science hack. Meaning: I have only formally studied basic statistics, math in school (through calculus), and research design (during my MBA and Ph.D. studies). But I am self-taught in every other data science skill I have picked up along the way over the past 20 years of my life as a businessperson, analyst and marketing expert. My skill set as it stands today, listed roughly in descending order in terms of relative mastery:

Very strong:
Subject Matter Expertise in Marketing & Marketing Analytics (i.e., knowing what questions to ask of the data)
Understanding How Business Works
Quantitative Customer Segmentation
MS Excel - both as a glorified database and as an analytics tool
Descriptive Statistics
Web Analytics
Research Design

Strong:
Neural Networks & Machine Learning
Inferential Statistics
MS Access
Regression Analysis
Data Visualization

Beginner/Hack:
SQL
JSON data structures
Google Tag Manager
Python (particularly Pandas & NumPy)
Cloud Computing/Amazon Web Services (AWS)
Hadoop
R

No Knowledge/Experience (but Seek it) in:
Apache Spark (PySpark)
Deep Learning
Data Pipelines
Flask

Sidebar: there are six types of data scientists; I can best be classified as a Business Data Scientist at this point, with leanings toward Machine Learning, Software Engineering, and Visualization.

As the co-founder and CEO of MindEcology, a data-driven advertising company, I have learned how to do what we do for our customers VERY well, having delivered hundreds of high-quality, innovative, data-oriented projects for clients over the past 10 years. However, I am continually daunted by four challenges as I strive to improve and deepen my skill set:

1. Data science itself is ill-defined, and there are several types or strains of data scientists

2. There is breathtaking breadth (tools, techniques, platforms) and depth (mastery) to data science as a field; in other words, it really is a never-ending journey for any and all of us

3. I run a full-time business and am a married father of three, meaning that doing anything more than "dabble" in this or that new skill requires a concerted effort

4. I am not surrounded on a daily basis by other data scientists with whom I can have a quick chat or get simple questions answered; most of those I connect with only through their podcasts and blogs

This blog, Travel Diary of a Data Scientist, is my effort to document - as would a travel log or diary - how I overcome the four challenges above (i.e., definitional challenges, the sheer amount to learn, time constraints, and data-scientist-access constraints). My goal is to properly:

a. prioritize my limited self-education/self-training time toward the tools and techniques that will confer maximum benefit on me in my daily life as chief data scientist at MindEcology, while at the same time devoting time to learning emerging technologies that might benefit us in five years or that are contextually relevant to what I am working on

b. learn the above-mentioned tools and techniques to the appropriate depth - no more, no less - so that I can be the best data scientist I can be in my world

c. provide a place for me to record my learnings

The writing will be more like a log or set of notes along the way, as opposed to an outward-facing exposition on my journey. More nuts-and-bolts, less philosophical. That said, I am posting this online as a way to "keep me honest" as I chart my progress. Therefore, I welcome readers who want to follow my continuing self-education journey in the wide world of data science.