Update, October 10, 2020:
1. Part 2 Completed and Live
Part 2 of the XSEDE CVW tutorial "Data Science With Python: Part 2" is complete and now available here.
Update, October 15, 2019:
1. Part 2 In Development
A new GitHub repository was established to host the pages for Part 2 of the CVW version; it can be reviewed at:
https://github.com/CornellCAC/CVW_PyData2
Target alpha testing date: December 1, 2019.
Update, April 3, 2019:
1. Part 1 Content Near Completion
We have split the tutorial into two parts because its content has gradually expanded. A new GitHub repository was established to host the pages for Part 1 of the CVW version; it can be reviewed at:
https://github.com/CornellCAC/CVW_PyData1
A comprehensive reformatting was applied to conform to the CVW format. Pages are finalized and content is nearing completion.
Target alpha testing date: May 1, 2019.
Update, January 23, 2019:
1. Content Currently Under Development
A GitHub repository is currently being used for Python code development at:
https://github.com/jsale/data_science_with_python
A shared Google Drive folder containing work-in-progress can be found here:
https://drive.google.com/open?id=10LZmdX5jCutlsuII5h8FQP9mUSKwc5Qq
We are taking some of our work offline, specifically the HTML-based content, until we identify a functional development workflow. Once we have one, we will post the HTML content to our GitHub site.
Content for the 3 main lessons listed below is near completion and can be reviewed by clicking the links below:
We are exploring options for stylizing embedded code using Prism.js and image highlighting using Featherlight. You can see how this works in the pages linked above.
2. Table of Contents and Outline Completed
A full outline and table of contents have been completed and updated below.
3. Learning Outcomes Completed
Below is a first draft of the learning outcomes for this tutorial:
Lesson 1. The Many Facets of Data Science
After this lesson you will be able to:
Describe general methods for accessing, importing, cleaning and filtering data
List examples of how one might interrogate or transform data
Identify appropriate models or analytical methods for gaining insight into data
List basic types of visualizations and popular visualization applications
Describe the basic processes involved in performing machine learning
Lesson 2. The Python Ecosystem for Data Science
After this lesson you will be able to:
Explain differences between versions of the Python language as they pertain to data science
List some examples of commonly-used Python libraries and their purpose(s)
Explain the important factors that affect Python installations and distributions on the various XSEDE systems and environments
Lesson 3. Dealing with Data
After this lesson you will be able to:
List different forms of data
Describe the datasets used in these lessons
List some of the commonly-used tools for accessing, importing and manipulating data
Lesson 4. Statistics with Data
After this lesson you will be able to:
Explain the purpose of performing statistical analysis on data
List some of the types of statistical analysis one might perform
Perform basic statistical analysis on various types of data using Pandas and other Python tools for data science such as scipy and statsmodels.
Lesson 5. Visualizing Data
After this lesson you will be able to:
Provide a review of visualization principles and practices and differentiate between them
Apply pandas and matplotlib to visualize tweet frequency distributions
Explain the process of collecting Twitter data
Read and write Twitter data in JSON format
Filter Twitter data
Perform basic data science analytics
Plot time series of Twitter data
Plot graph networks of Twitter data
Perform simple graph analytics on Twitter data
Apply networkx to visualize social networks
Generate a scatterplot matrix of multivariate data
Generate a ‘heat map’ of multivariate data
Create an interactive visualization using Bokeh (optional)
Lesson 6. Machine Learning with Data
After this lesson you will be able to:
Configure machine learning tools and environments
Apply standard machine learning methods to create a model which performs image recognition
Lesson 7. Modeling with Data
After this lesson you will be able to:
Tools: scipy/scikits/networkx/etc.
Lesson 8. Using XSEDE Resources for Data Science
After this lesson you will be able to:
Getting data onto XSEDE resources
Performance considerations
Parallel file systems?
Environments and packages
Training resources
TACC Visualization Portal
Sample Assessments Currently Under Development
Sample assessments for the first lesson have been created (shown below) and are under development for the remaining lessons. Some of these assessments will be used in the XSEDE Beginner Badge for Data Science With Python.
Lesson 1
Outcome: Describe general methods for accessing, importing, cleaning and filtering data.
Sample Question: Sort the following in the appropriate order:
A. Importing
B. Filtering
C. Cleaning
D. Accessing
Correct answer: D, A, C, B (i.e. accessing, importing, cleaning and filtering)
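The four steps in this ordering can be sketched in a few lines of pandas. This is a minimal illustration only, not code from the tutorial: the inline CSV text, the column names, the `load_and_prepare` helper, and the 1.80 threshold are all invented for the example.

```python
import io

import pandas as pd

# Hypothetical raw data standing in for a file or remote source.
RAW_CSV = """name,height
Ann,1.70
Bob,
Cy,1.85
Ann,1.70
"""


def load_and_prepare(raw_text, min_height):
    """Run the four steps in order on a small CSV string."""
    # Accessing: open the data source (an in-memory buffer here).
    source = io.StringIO(raw_text)
    # Importing: parse the text into a DataFrame.
    df = pd.read_csv(source)
    # Cleaning: drop rows with missing values, then exact duplicates.
    df = df.dropna().drop_duplicates()
    # Filtering: keep only the rows of interest.
    return df[df["height"] >= min_height].reset_index(drop=True)


tall = load_and_prepare(RAW_CSV, min_height=1.80)
print(tall)
```

Cleaning before filtering matters here: the row with a missing height is removed before the comparison is applied.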
Outcome: List examples of how one might interrogate or transform data.
Sample Question: What is the command to convert a tweet creation date to a pandas date?
(correct answer in bold)
to_date
to_datetime
to_date_time
date_to_pandas
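For reference, a short sketch of date conversion in pandas using `pd.to_datetime`. The two sample strings and the `created_at` column name are assumptions modeled on the classic Twitter timestamp layout, not data from the tutorial.

```python
import pandas as pd

# Hypothetical tweet records; real tweets carry many more fields.
tweets = pd.DataFrame(
    {
        "created_at": [
            "Wed Oct 10 20:19:24 +0000 2018",
            "Thu Oct 11 08:05:00 +0000 2018",
        ]
    }
)

# Parse the strings into timezone-aware pandas timestamps.
# The explicit format avoids relying on pandas to guess the layout.
tweets["created_at"] = pd.to_datetime(
    tweets["created_at"], format="%a %b %d %H:%M:%S %z %Y"
)

# Once converted, the .dt accessor exposes date components.
print(tweets["created_at"].dt.hour.tolist())
```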
Outcome: Identify appropriate models or analytical methods for gaining insight into data.
Sample Question: Which is the best description for what the following line of code does?
(correct answer in bold)
df.groupby(height).size().sort_values(ascending=False).reset_index()
Sorts height data from small to large
Sorts height data from tall to short
Sorts size data from large to small
None of the above
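To see what each step of that chain does, here is a small runnable sketch. The height values are invented, and the example assumes `height` refers to a column label, so it is written as the string `"height"`.

```python
import pandas as pd

# Hypothetical height data, invented for this example.
df = pd.DataFrame({"height": [170, 180, 170, 165, 170, 180]})

counts = (
    df.groupby("height")             # one group per distinct height
    .size()                          # number of rows in each group
    .sort_values(ascending=False)    # largest group first
    .reset_index()                   # back to a DataFrame with a default index
)
print(counts)
```

The result is ordered by how many rows share each height, not by the height values themselves.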
Outcome: List basic types of visualizations and popular visualization applications.
Sample Question: Which of the following would be most appropriate to visualize correlations between pairs of several different variables at once?
(correct answer in bold)
Graph Network
Time Series
Scatterplot Matrix
Heat Map
Update, November 28, 2018:
Lessons are in development for the introductory sections (Chris) and the visualization section (Jeff, with Chris' guidance).
Datasets to be used
We are tentatively looking at the Lahman Baseball database in row-column format and a Twitter dataset consisting of tweet data in JSON format.
We are developing assessment metrics and examples in Python and using Jupyter Notebooks.
Our emphasis will be on two main themes: 1) how to use Python data science tools, and 2) how to use these tools within a data model designed to perform computational tasks optimally in an XSEDE environment.
Update, October 25, 2018:
Chris and Jeff met to discuss strategies for identifying useful datasets and an overall timeline with approximate target dates. Chris created an initial outline shown below:
Brief Outline (updated January 23, 2019)
Lesson 1. The Many Facets of Data Science
Dealing with Data: accessing, importing, cleaning and filtering
Interrogating and transforming data
Data analysis and modeling
Visualization of data
Machine Learning
Lesson 2. The Python Ecosystem for Data Science
Python language
Python libraries
Installations and distributions
Lesson 3. Dealing with Data
3.1. Different forms of data
Regular Data
Heterogeneous Data
Tabular Data
Relational Data
“Big” Data
Image Data
Textual Data
Remote Data
Streaming Data
3.2. Introducing Datasets
Twitter
Baseball
Wildfire historical data??
Image data??
3.3. Tools for accessing, importing and manipulating data
NumPy
Pandas
h5py
Dask
SQLAlchemy + SQLs...
scikit-image / PIL-Pillow / ?
NLTK / spaCy / TextBlob
PySpark
etc.
Lesson 4. Statistics with Data
Tools: pandas/scipy/statsmodels/etc.
Lesson 5. Visualizing Data
Brief review of visualization principles, practices, and existing resources
Using pandas and matplotlib to visualize tweet frequency distributions
Scatterplot Matrix: Using Seaborn to visualize baseball batting statistics
Using Bokeh to create an interactive viz for Baseball data
Lesson 6. Machine Learning with Data
Tools: scikit-learn/tensorflow/keras/caffe/pytorch/etc.
Lesson 7. Modeling with Data
Tools: scipy/scikits/networkx/etc.
Lesson 8. Using XSEDE Resources for Data Science
Getting data onto XSEDE resources
Performance considerations
Parallel file systems?
Environments and packages
Training resources
TACC Visualization Portal
Appendix. Online data repositories
Examples
TBD
Exercises
TBD
Quiz Questions
TBD
Timeline
Target for first release: March 2019
November 2018:
Identify datasets, examples, assessments
December 2018:
Develop content
January 2019:
Develop content
February 2019:
Alpha and beta testing, iterative design improvement
March 2019:
Release production version
Possible data sets
Stanford Large Network Dataset Collection
http://snap.stanford.edu/data/index.html
UCI Machine Learning Repository
https://archive.ics.uci.edu/ml/index.php
Kaggle
TOS and copyright limit copying. Fair use suggests small subsets might be reasonable.
ImageNet
“Hands” Image Dataset
http://www.robots.ox.ac.uk/~vgg/data/hands/
Global Terrorism Database
https://www.start.umd.edu/gtd/
Lahman Baseball Database
http://www.seanlahman.com/baseball-archive/statistics/
GDELT
https://www.gdeltproject.org/data.html
Twitter Data: Botometer Bot Repository
https://botometer.iuni.iu.edu/bot-repository/
Twitter Data: Suspended state-backed information propagation dataset
https://about.twitter.com/en_us/values/elections-integrity.html#data
HPWREN Weather Data
http://hpwren.ucsd.edu/Sensors/