XSEDE Data Science With Python Tutorial

Update, October 10, 2020: 

1. Part 2 Completed and Live

Part 2 of the XSEDE CVW tutorial "Data Science With Python: Part 2" is complete and now available here.




Update, October 15, 2019: 

1. Part 2 In Development

A new GitHub repository was established to host the pages for Part 2 of the CVW version. It can be reviewed at:

https://github.com/CornellCAC/CVW_PyData2

Target alpha testing date: December 1, 2019. 



Update, April 3, 2019: 

1. Part 1 Content Near Completion

We have split the tutorial into two parts because its content has gradually expanded. A new GitHub repository was established to host the pages for Part 1 of the CVW version. It can be reviewed at:

https://github.com/CornellCAC/CVW_PyData1

The pages were comprehensively reformatted to conform to the CVW format. Pages are finalized and content is nearing completion.

Target alpha testing date: May 1, 2019. 



Update, January 23, 2019:

1. Content Currently Under Development

A GitHub repository is currently being used for Python code development at:

https://github.com/jsale/data_science_with_python

A shared Google Drive folder containing work-in-progress can be found here:

https://drive.google.com/open?id=10LZmdX5jCutlsuII5h8FQP9mUSKwc5Qq

We are taking some of our work offline, specifically the HTML-based content development, until we identify a functional workflow for it. Once we have a good workflow for HTML content development, we will post the content to our GitHub site.

Content for the 3 main lessons listed below is near completion and can be reviewed by clicking the links below:

We are exploring options for stylizing embedded code using Prism.js and image highlighting using Featherlight. You can see how this works in the pages linked above. 

2. Table of Contents and Outline Completed

A full outline and table of contents have been completed and updated below.

3. Learning Outcomes Completed

Below is a first draft of the learning outcomes for this tutorial:

Lesson 1. The Many Facets of Data Science

After this lesson you will be able to:

  • Describe, in general terms, methods for accessing, importing, cleaning and filtering data.

  • List examples of how one might interrogate or transform data

  • Identify appropriate models or analytical methods for gaining insight into data

  • List basic types of visualizations and popular visualization applications

  • Describe the basic processes involved in performing machine learning

Lesson 2. The Python Ecosystem for Data Science

After this lesson you will be able to:

  • Explain differences between versions of the Python language as they pertain to data science

  • List some examples of commonly-used Python libraries and their purpose(s)

  • Explain the important factors that affect Python installations and distributions across the various XSEDE systems and environments

Lesson 3. Dealing with Data

After this lesson you will be able to:

  • List different forms of data

  • Describe the datasets used in these lessons

  • List some of the commonly-used tools for accessing, importing and manipulating data

Lesson 4. Statistics with Data

After this lesson you will be able to:

  • Explain the purpose of performing statistical analysis on data

  • List some of the types of statistical analysis one might perform

  • Perform basic statistical analysis on various types of data using Pandas and other Python tools for data science such as scipy and statsmodels.

Lesson 5. Visualizing Data

After this lesson you will be able to:

  • Provide a review of visualization principles and practices and differentiate between them

  • Apply pandas and matplotlib to visualize tweet frequency distributions

    • Explain the process of collecting Twitter data

    • Read and write Twitter data in JSON format

    • Filter Twitter data

    • Perform basic data science analytics

    • Plot time series of Twitter data

    • Plot graph networks of Twitter data

    • Perform simple graph analytics on Twitter data

  • Apply networkx to visualize social networks

  • Generate a scatterplot matrix of multivariate data

  • Generate a ‘heat map’ of multivariate data

  • Create an interactive visualization using Bokeh (optional)
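
As a rough illustration of the "read and write Twitter data in JSON format" and "filter Twitter data" outcomes above, the following sketch uses only the Python standard library. The records are hand-written placeholders, the one-object-per-line file layout is an assumption, and the helper names (`write_tweets`, `read_tweets`, `filter_by_language`) are hypothetical rather than taken from the tutorial.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical, hand-written records; real tweets carry many more fields.
sample_tweets = [
    {"id": 1, "lang": "en", "text": "Learning pandas today"},
    {"id": 2, "lang": "es", "text": "Hola mundo"},
    {"id": 3, "lang": "en", "text": "Plotting with matplotlib"},
]


def write_tweets(path, tweets):
    """Write one JSON object per line (an assumed layout for collected tweets)."""
    with open(path, "w", encoding="utf-8") as f:
        for tweet in tweets:
            f.write(json.dumps(tweet) + "\n")


def read_tweets(path):
    """Read the file back into a list of dictionaries, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def filter_by_language(tweets, lang):
    """Keep only tweets whose 'lang' field matches the requested code."""
    return [t for t in tweets if t.get("lang") == lang]


# Round-trip the sample through a temporary file, then filter it.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "tweets.jsonl"
    write_tweets(path, sample_tweets)
    loaded = read_tweets(path)

english = filter_by_language(loaded, "en")
print([t["id"] for t in english])
```

The same list of dictionaries could be handed to `pandas.DataFrame` for the plotting outcomes, but that step is left out here to keep the sketch dependency-free.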

Lesson 6. Machine Learning with Data

After this lesson you will be able to:

  • Configure machine learning tools and environments

  • Apply standard machine learning methods to create a model which performs image recognition

Lesson 7. Modeling with Data

After this lesson you will be able to:

  • Tools: scipy/scikits/networkx/etc.

Lesson 8. Using XSEDE Resources for Data Science

After this lesson you will be able to:

  • Getting data onto XSEDE resources

  • Performance considerations

    • Parallel file systems?

  • Environments and packages

  • Training resources

  • TACC Visualization Portal

Sample Assessments Currently Under Development

Sample assessments for the first lesson have been created (shown below) and are under development for the remaining lessons. Some of these assessments will be used in the XSEDE Beginner Badge for Data Science With Python.

Lesson 1

Outcome: Describe, in general terms, methods for accessing, importing, cleaning and filtering data.

Sample Question: Sort the following in the appropriate order:

  1. Importing

  2. Filtering

  3. Cleaning

  4. Accessing

Correct answer: 4, 1, 3, 2 (i.e., accessing, importing, cleaning and filtering)
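
The ordering in this question (accessing, importing, cleaning, filtering) can be sketched as a small pandas pipeline. This is only an illustration under stated assumptions: the CSV text, the column names, and the function `load_clean_filter` are all made up for the example and do not come from the tutorial's datasets.

```python
import io

import pandas as pd

# Made-up data standing in for a file or web resource; not real statistics.
RAW_CSV = """player,year,hits
player_a,1927,192
player_b,1927,218
player_a,1928,
player_c,1911,248
"""


def load_clean_filter(raw_text, min_year):
    # Accessing: open the source (an in-memory buffer instead of a real file).
    source = io.StringIO(raw_text)
    # Importing: parse the CSV into a DataFrame.
    df = pd.read_csv(source)
    # Cleaning: drop rows with a missing hit count, then restore an integer dtype.
    df = df.dropna(subset=["hits"]).astype({"hits": int})
    # Filtering: keep only the seasons of interest.
    return df[df["year"] >= min_year].reset_index(drop=True)


result = load_clean_filter(RAW_CSV, min_year=1927)
print(result)
```

Cleaning before filtering matters here because the row with the missing value would otherwise force the `hits` column to stay floating point.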


Outcome: List examples of how one might interrogate or transform data.

Sample Question: Which pandas function converts a tweet creation date to a pandas datetime?

(correct answer: 2, to_datetime)

  1. to_date

  2. to_datetime

  3. to_date_time

  4. date_to_pandas
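
For reference, a minimal sketch of how `pandas.to_datetime` might be applied to tweet creation dates. The two timestamps are hypothetical sample values written in the `created_at` style found in Twitter's JSON, and the explicit `format` string is an assumption about that layout.

```python
import pandas as pd

# Hypothetical sample values in the "created_at" style used in Twitter's JSON.
tweets = pd.DataFrame({
    "created_at": [
        "Wed Oct 10 20:19:24 +0000 2018",
        "Thu Oct 11 08:05:00 +0000 2018",
    ]
})

# Convert the strings to timezone-aware pandas datetimes.
tweets["created_at"] = pd.to_datetime(
    tweets["created_at"], format="%a %b %d %H:%M:%S %z %Y"
)

# Once converted, the .dt accessor exposes date components.
print(tweets["created_at"].dt.hour.tolist())
```

Passing an explicit `format` avoids ambiguity in how the string is parsed. Without it, pandas would try to infer the layout.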


Outcome: Identify appropriate models or analytical methods for gaining insight into data.

Sample Question: Which is the best description for what the following line of code does?

(correct answer: 3)

df.groupby('height').size().sort_values(ascending=False).reset_index()

  1. Sorts height data from small to large

  2. Sorts height data from tall to short

  3. Sorts size data from large to small

  4. None of the above
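
To see what the chained call in this question produces, here is a small sketch on made-up data. It assumes `height` is a column name in `df`. The values are hypothetical, and `name="count"` is added only to give the counts column a readable label.

```python
import pandas as pd

# Hypothetical heights (in inches) for a handful of players.
df = pd.DataFrame({"height": [72, 74, 72, 70, 72, 74]})

counts = (
    df.groupby("height")             # one group per distinct height
      .size()                        # number of rows in each group
      .sort_values(ascending=False)  # largest group first
      .reset_index(name="count")     # back to a DataFrame with named columns
)
print(counts)
```

The result is ordered by group size, not by height: the most common height comes first regardless of how tall it is.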


Outcome: List basic types of visualizations and popular visualization applications.

Sample Question: Which of the following would be most appropriate to visualize correlations between pairs of several different variables at once?

(correct answer in bold)

  1. Graph Network

  2. Time Series

  3. Scatterplot Matrix

  4. Heat Map


 


Update, November 28, 2018:

Lessons are in development for the introductory sections (Chris) and the visualization section (Jeff, with Chris' guidance). 

Datasets to be used

We are tentatively looking at the Lahman Baseball database in row-column format and a Twitter dataset consisting of tweet data in JSON format. 

We are developing assessment metrics and examples in Python and using Jupyter Notebooks. 

Our emphasis will be on two main themes: 1) how to use Python data science tools, and 2) how to use these tools within a data model designed to perform computational tasks optimally in an XSEDE environment.

 


Update, October 25, 2018:

Chris and Jeff met to discuss strategies for identifying useful datasets and an overall timeline with approximate target dates. Chris created an initial outline shown below:

 


Brief Outline (updated January 23, 2019)

Lesson 1. The Many Facets of Data Science

  • Dealing with Data: accessing, importing, cleaning and filtering

  • Interrogating and transforming data

  • Data analysis and modeling

  • Visualization of data

  • Machine Learning

Lesson 2. The Python Ecosystem for Data Science

  • Python language

  • Python libraries

  • Installations and distributions

Lesson 3. Dealing with Data

3.1. Different forms of data

  • Regular Data

  • Heterogeneous Data

  • Tabular Data

  • Relational Data

  • “Big” Data

  • Image Data

  • Textual Data

  • Remote Data

  • Streaming Data

3.2.  Introducing Datasets

  • Twitter

  • Baseball

  • Wildfire historical data??

  • Image data??

3.3. Tools for accessing, importing and manipulating data

  • Numpy

  • Pandas

  • h5py

  • Dask

  • Sqlalchemy + SQLs...

  • Scikit-image / PIL-Pillow / ?

  • NLTK / Spacy / TextBlob

  • Pyspark

  • etc.

Lesson 4. Statistics with Data

  • Tools: pandas/scipy/statsmodels/etc.

Lesson 5. Visualizing Data

  • Brief review of visualization principles, practices, and existing resources

  • Using pandas and matplotlib to visualize tweet frequency distributions

  • Scatterplot Matrix: Using Seaborn to visualize baseball batting statistics

  • Using Bokeh to create an interactive viz for Baseball data

Lesson 6. Machine Learning with Data

  • Tools: scikit-learn/tensorflow/keras/caffe/pytorch/etc.

Lesson 7. Modeling with Data

  • Tools: scipy/scikits/networkx/etc.

Lesson 8. Using XSEDE Resources for Data Science

  • Getting data onto XSEDE resources

  • Performance considerations

    • Parallel file systems?

  • Environments and packages

  • Training resources

  • TACC Visualization Portal

Appendix.  Online data repositories


Examples

TBD

Exercises

TBD

Quiz Questions

TBD

Timeline

Target for first release: March, 2019

November 2018:

Identify datasets, examples, assessments

December 2018:

Develop content

January 2019:

Develop content

February 2019:

Alpha and beta testing, iterative design improvement

March 2019:

Release production version

Possible data sets

Stanford Large Network Dataset Collection

http://snap.stanford.edu/data/index.html

UCI Machine Learning Repository

https://archive.ics.uci.edu/ml/index.php

Kaggle

https://www.kaggle.com/

TOS and copyright limit copying. Fair use suggests small subsets might be reasonable.

ImageNet

http://www.image-net.org/

“Hands” Image Dataset

http://www.robots.ox.ac.uk/~vgg/data/hands/

Global Terrorism Database

https://www.start.umd.edu/gtd/

Lahman Baseball Database

http://www.seanlahman.com/baseball-archive/statistics/

GDELT

https://www.gdeltproject.org/data.html

Twitter Data: Botometer Bot Repository

https://botometer.iuni.iu.edu/bot-repository/

Twitter Data: Suspended state-backed information propagation dataset

https://about.twitter.com/en_us/values/elections-integrity.html#data

HPWREN Weather Data

http://hpwren.ucsd.edu/Sensors/

MODIS Data

https://modis.gsfc.nasa.gov/data/dataprod/
