Saturday, February 4, 2017

Data Cleanup Toolset

Listened to "Talk Python to Me" episode #90 today, "Data Wrangling with Python" featuring guest Katharine Jarmul, co-author of "Data Wrangling with Python."

Here is a list of some amazing data cleanup tools Katharine shared:

Dedupe Python Library: github.com/datamade/dedupe
probablepeople: github.com/datamade/probablepeople
usaddress: github.com/datamade/usaddress
jellyfish: github.com/jamesturk/jellyfish
Fuzzywuzzy: github.com/seatgeek/fuzzywuzzy
scrubadub: github.com/datascopeanalytics/scrubadub
pint: pint.readthedocs.io
arrow: github.com/crsmithdev/arrow
pdftables.six: github.com/vnaydionov/pdftables
Datacleaner: github.com/rhiever/datacleaner
Parserator: github.com/datamade/parserator
Gensim: radimrehurek.com/gensim
Faker: github.com/joke2k/faker
Dask: dask.pydata.org
SpaCy: spacy.io
Airflow: airflow.incubator.apache.org
Luigi: luigi.readthedocs.io
Hypothesis (testing): hypothesis.works

No comments:

Post a Comment