Travel Diary of a Data Scientist: January 2017

Tuesday, January 31, 2017

Getting a Grip on the Tools

Resources I have been exploring:

The Quandl API, for grabbing copious amounts of clean, useful data (free and paid) and bringing it right into Python as a DataFrame.

Listening to the "Talk Python to Me" podcast series, which is a great resource for hearing how experienced programmers think, how they talk to each other, and how they see the world.

Update: Finally decided to plow through Wes McKinney's "Python for Data Analysis." I admire his intelligence, dedication and entrepreneurship in writing Pandas for Python. I've been avoiding really going through this book for 3 years. Time to take it seriously and dive in.

Key Concepts:

import json
path = 'path1/path2/file.txt'   ## assigns path to a variable
records = [json.loads(line) for line in open(path, 'rb')] ## loads the records, line by line, from path ('rb' opens the file for reading in binary mode)

records[0] ## reads the first full record, outputs multiple lines
records[0]['some_column_name']   ## displays one column of first record, including unicode
print(records[0]['some_column_name'])   ## displays one column of first record, sans unicode
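
A minimal, self-contained sketch of the line-by-line JSON loading above. The sample records and the field names "tz" and "city" are made up for illustration, and the temporary file stands in for the real path:

```python
import json
import os
import tempfile

# Hypothetical sample data: one JSON record per line.
sample_lines = [
    '{"tz": "America/New_York", "city": "Boston"}',
    '{"tz": "Europe/London", "city": "Leeds"}',
]

# Write the sample to a temporary file so the sketch runs anywhere.
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("\n".join(sample_lines))
    path = tmp.name

# 'rb' opens the file for reading in binary mode;
# json.loads accepts bytes on Python 3.6 and later.
with open(path, "rb") as f:
    records = [json.loads(line) for line in f]

os.remove(path)  # clean up the temporary file

first_record = records[0]    # the whole first record, as a dict
first_tz = records[0]["tz"]  # one field of the first record
print(first_record)
print(first_tz)
```

Using "with open(...)" closes the file automatically, which the one-line version above does not do.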

from pandas import DataFrame, Series ## imports the DataFrame and Series classes from the pandas package

import pandas as pd

frame = DataFrame(records) ## converts the records in the variable "records" to a data frame, assigns the new data frame a name called "frame"

frame.info()  ## prints a concise summary, namely the total number of records, plus the non-null count and data type of each column/attribute
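
A small sketch of the records-to-DataFrame step, assuming pandas is installed. The records here are invented for illustration; one of them has a missing "tz" value so the non-null counts differ by column:

```python
import io

import pandas as pd

# Hypothetical records; the third one has a missing "tz" value.
records = [
    {"tz": "America/New_York", "city": "Boston"},
    {"tz": "Europe/London", "city": "Leeds"},
    {"tz": None, "city": "Unknown"},
]

frame = pd.DataFrame(records)

# info() prints to stdout by default; passing buf captures the text instead.
buffer = io.StringIO()
frame.info(buf=buffer)
summary = buffer.getvalue()
print(summary)

# count() returns the non-null count per column as a Series.
non_null_counts = frame.count()
print(non_null_counts)
```

For actual summary statistics (mean, quartiles and so on), frame.describe() is the method to reach for; info() only reports structure.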

We need to import these in order to properly plot with matplotlib:

import matplotlib
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Monday, January 30, 2017

Learning Google Tag Manager

I am learning Google Tag Manager. I have purchased the eBook, "Practical Google Analytics & Google Tag Manager for Developers," by Jonathan Weber. It is pretty huge (thick), so I also purchased an online course by ConversionXL, with host Chris Mercer.

So far, before the course started, I created an account and my first container. The first video shows the basics of the Tag Manager environment and helps me make sure I get set up properly.

Other notes:

Python: finding that there are several ways to import .CSV and .XLS files. You can assign the path to a variable and then call pd.read_csv(path), or you can pass the path string straight into the call (pd.read_excel is the counterpart for .XLS files). There are several variants.
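
A quick sketch of the two CSV variants, assuming pandas is installed. The file name "sample.csv" and its contents are made up, and the file is written to a temporary folder so the example runs on its own:

```python
import os
import tempfile

import pandas as pd

# Create a small hypothetical CSV file in a temporary folder.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "sample.csv")
with open(path, "w") as f:
    f.write("name,visits\nhome,10\nabout,3\n")

# Variant 1: keep the path in a variable and pass the variable.
frame_from_variable = pd.read_csv(path)

# Variant 2: build the path inline, right in the call.
frame_from_inline = pd.read_csv(os.path.join(tmp_dir, "sample.csv"))

# Both variants read the same file, so the frames are identical.
print(frame_from_variable.equals(frame_from_inline))
print(frame_from_variable)
```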

Also, in Jupyter Notebook (I have it installed with the Anaconda package), I find that specifying paths seems inconsistent. I ran the same code in the same notebook on two different days, and it stopped working the second time.
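
One possible cause, and this is only an assumption since the error itself was not recorded: a relative path is resolved against the current working directory, which depends on where the notebook server was launched. A sketch for checking this, using the hypothetical path "data/file.csv":

```python
import os
from pathlib import Path

# A relative path (hypothetical) is resolved against the working directory,
# so it can point somewhere else if Jupyter is started from another folder.
relative = Path("data") / "file.csv"
print("Working directory:", os.getcwd())
resolved = relative.resolve()
print("Relative path resolves to:", resolved)

# Capturing an absolute base directory once and building paths from it
# removes the dependence on later changes to the working directory.
base_dir = Path.cwd().resolve()  # assumption: the project root is the current folder
absolute = base_dir / "data" / "file.csv"
print("Absolute path:", absolute)
```

Printing os.getcwd() at the top of a notebook is a cheap way to see whether the two sessions were started from different folders.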

Resources I've been using lately:

Analytics Vidhya, for some free, some paid courses and content

"Python for Data Analysis," by Wes McKinney

"Data Wrangling with Python" by Jacqueline Kazil & Katharine Jarmul

(Re-) Starting the Learning Journey

I am a data science hack. Meaning: I have only formally studied basic statistics, math in school (through calculus), and research design (during my MBA and Ph.D. studies). But I am self-taught in every other data science skill I have picked up along the way over the past 20 years of my life as a businessperson, analyst and marketing expert. My skill set as it stands today, listed roughly in descending order in terms of relative mastery:

Very strong:
Subject Matter Expertise in Marketing & Marketing Analytics (i.e., knowing what questions to ask of the data)
Understanding How Business Works
Quantitative Customer Segmentation
MS Excel - both as a glorified database and as an analytics tool
Descriptive Statistics
Web Analytics
Research Design

Strong:
Neural Networks & Machine Learning
Inferential Statistics
MS Access
Regression Analysis
Data Visualization

Beginner/Hack:
SQL
JSON data structures
Google Tag Manager
Python (particularly Pandas & NumPy)
Cloud Computing/Amazon Web Services (AWS)
Hadoop
R

No Knowledge/Experience (but Seek it) in:
Apache Spark (PySpark)
Deep Learning
Data Pipelines
Flask

Sidebar: by one classification, there are 6 types of data scientists; I can best be classified as a Business Data Scientist at this point, with leanings toward Machine Learning, Software Engineering, and Visualization.

As the co-founder and CEO of MindEcology, a data-driven advertising company, I have learned how to do what we do for our customers VERY well, having delivered hundreds of high-quality, innovative data-oriented projects for clients over the past 10 years. However, I am continually daunted by four challenges as I strive to improve and deepen my skill set:

1. Data science itself is ill-defined, and there are several types or strains of data scientists

2. There is a breathtaking amount of breadth (tools, techniques, platforms) and depth (mastery) to data science as a field; in other words, it really is a never-ending journey for any and all of us

3. I run a full-time business and am a married father of three; meaning that to do anything more than "dabble" in this or that new skill requires a concerted effort

4. I am not surrounded on a daily basis by other data scientists with whom I can have a quick chat or get simple questions answered; most of them I connect with only through their podcasts and blogs.

This blog, Travel Diary of a Data Scientist, is my effort to document - as would a travel log or diary - my overcoming the four challenges above (i.e., definitional challenges, sheer amount to learn, time constraints and data-scientist-access constraints). My goal is to properly:

a. prioritize my limited self-educational/self-training time toward identifying the tools and techniques that will confer maximum benefit upon me in my daily life as chief data scientist at MindEcology, while at the same time devoting time to learning emerging technologies that might benefit us in five years or that are contextually relevant to what I am working on

b. learn the above-mentioned tools and techniques to the appropriate depth - no more, no less - so that I can be the best data scientist I can be in my world

c. provide a place for me to record my learnings

The writing will be more like a log or set of notes along the way, as opposed to an outward-facing exposition on my journey. More nuts-and-bolts, less philosophical. That said, I am posting this online as a way to "keep me honest" as I chart my progress. Therefore, I welcome readers who want to follow my continuing self-education journey in the wide world of data science.