Data Science Course Project

CSE 519 — Data Science (Fall 2017)

Prof. Steven Skiena

Semester Projects

 

Proposal Due: Tuesday, October 23, 2017

Progress Report Due: Thursday, November 15, 2017

Poster/Final Report Due: Thursday, December 6, 2017

 

The project will involve concentrated work in one research project related to data science.  Below I list several possible project ideas.

You may also choose your own topic if you can convince me that what you want to do is interesting.  However, I discourage projects derived from your own thesis research, projects from other classes, Kaggle challenges, or reproductions of previously published work.  I want to see original ideas, ideally requiring the assembly of an interesting data set and making the data do something interesting.   Slightly beating the F-score of some published paper on their data set does not count as interesting.

While I anticipate that much of the work will be done as the deadline approaches, it is important to get started early enough to discover insurmountable roadblocks in data acquisition or problem definition before it is too late.  The project proposal and progress reports have been instituted to ensure people get serious well in advance of the final deadline.

Each group is responsible for turning in the 3-5 page project proposal/literature search and progress report by the dates above.  I will award roughly 50% of the grade for each project on the strength of these preliminary reports.  This is to encourage starting early and to make sure that you and I both know what you are to do before it is too late to avoid trouble.  Each group will have to turn in a final written report/WWW site and have a quick meeting with me during finals week, during which I will ask whether all members participated equally.

My hope is that one or more of these projects will lead to published work.  This has been the case most times I have taught such a course.  The projects I believe are best are starred (**), with other good projects marked (*).  I particularly recommend them to students thinking about doing research under me.

    1. Your Own Project! (**) — Propose a modelling/forecasting effort or statistical analysis related to a subject you find particularly interesting.   I am willing to be quite open minded here, but I will need to see (a) that you have access to a large and interesting enough data set to work with, and (b) a clear idea of a formal evaluation criterion, so you can prove to me that you are doing a good job.  It is much better to talk these over with me early rather than have me tell you I don’t like your project when I grade your proposal.  (groups of 2-3 students).
    2. New Quant Shop Challenges — If you enjoyed the Quant Shop videos, perhaps put together a team to do a new one!   Note you will have to do your own video shooting and editing, so you had better have at least one team member who will commit to making that happen.  Possible new challenges would be (a) grade prediction — for a set of students, predict what grades they will receive in a particular set of courses, (b) gay or straight? —  given access to data like Twitter feeds or the Facebook graph, determine the preference among groups of students, (c) bankruptcy prediction — which NYSE or NASDAQ company will be next to go under, (d) consumer confidence index or unemployment rate. (groups of 2-4 students)
    3. Railroad Network Mapping and Analysis (*) — Build a data set of all active railway freight and passenger tracks in the U.S. (and ideally larger parts of the world) including such information as geographic positioning, gauge, capacity, traffic level, etc.  Once we have a good network, there are other projects which can be done to measure traffic flows between points.   Data can be found concerning the track positioning and the frequency of grade crossings — start to play with it by the time of your proposal.  What other data on railroads and traffic can you build?   Can we make a rail data portal? (groups of 2-3 students — perfect for master's students, possible future MS project).
    4. How Hard are People Working? (*) — Can you find data sources that represent proxies for how hard people are working, like Github commits, and use this to measure productivity as a function of time and place?   How did the results of the election affect productivity in red and blue states?   How do natural disasters affect productivity?  (groups of 2-3 students).
    5. Automatically Building Book Indices (**) — Book indices are pointers into important parts of the text, associated with keywords.   It would be good to have an automatic index builder, which takes as input a text and a desired index size, and builds the index automatically.  Resources for this would be long papers on arXiv, which have LaTeX source and indexes; I assume several thousand have indexes.  Scanned books might also work, but the pointers do not go to the precise position (only the page).  The problem splits into several subtasks: (a) which terms are most index-worthy, (b) how do we normalize their forms, and (c) where do we put the pointers (a rough sketch of subtask (a) appears after this list).  A tool to index a book might be well received by authors, and might generalize to meta-indexes for scanning across related documents. (groups of 2-4 students)
    6. Towards a Purely Empirical T-test (*) — My belief is that arbitrary pairs of data values correlate to a much higher extent than should occur by chance.   Any two things that increase with time also correlate with each other, say the average price of cars and life expectancy.  Thus standard measures like permutation tests and other statistical tests are more likely to be impressed with such a correlation than they should be.   Take a large set of time series or other key-value data sets, and compute the distribution of similarity.   Vary the matching key, file order, and divergence in data source to produce a measure of significance for each sample size/correlation range pair; a minimal sketch of this empirical-null idea appears after this list. (groups of 2-3 students)
    7. How Good is that Game/App? (**) — App ratings are usually based on reviews, but that means that the vast majority of apps in the app store are invisible because no one ever reviewed them.  Can we train a program to evaluate either the code of an app or a video of its play and decide how good it is?  Analysis of apps with real ratings provides training data.  The hard part is producing a play video for a game in an automated fashion for analysis.  Google's deep learning papers on training players for old Atari games are a good starting point. (groups of 2-3 students).
    8. Do Popular Songs Endure? (**) — As a fan of 1960’s popular music, I have noticed (I think) that a surprising number of the most enduring songs (as measured by current sales or airplay) were not at the top of the charts when they came out.   By analysis of pop charts and other data, measure to what extent this observation is true.   Can you build a model to predict the long-term popularity of a recording? (groups of 2-3 students)
    9. The Balance of Rivalries (*) — I am curious about the streakiness of head-to-head matchups over a long time period.   Clearly the runs in which one league wins the All-Star game, or the results of Lehigh-Lafayette football, are much longer than expected under a binomial distribution.  What about a Hurst random walk, with some level of memory?  Given a series of an event, can we fit a Hurst parameter to best model the run length distribution (a rough estimation sketch appears after this list)?  And are the runs reverting — in any long enough rivalry, do both sides come to dominate in different periods?  (groups of 1-2 students)
    10. Live Football Bet Pricing (**) — Build a live model to support real-time American football bet pricing.  Given the initial spread and the game situation (as streamed by events online), produce a principled estimate for the current value of the bet based on the starting odds and current game situation.   This should be calibrated on data from years of past games.   The model must be integrated into a live online demo, connected to live game data from games in progress; a toy version of the underlying probability model appears after this list.  (groups of 2-3 master's students, possible MS project).
    11. Why do Good Teams Come from Behind? (*) — Do good teams come from behind because they know how to win, or because class will out over a long enough time?  In particular, can we model the advantage of a team by the probability that it scores on a typical drive, and then randomly simulate how often teams come from behind in games with n drives?  A small simulation sketch appears after this list. (groups of 2-3 students)
    12. Monitoring Ships at Sea (*) — AIS is a system for locating ships.   How do we get this transponder data and what can we do with it?  Set up an AIS receiver and then look at http://www.aishub.net/ and https://marinecadastre.gov/ais/.  Can you predict the next destination of a given ship?  I am interested in live data, trained on history. (groups of 2-3 students).
    13. Google Trend Analytics — The recent Stephens-Davidowitz book “Everybody Lies” makes clever use of Google Trends, Google Correlate, and Google AdWords data.  See sethsd.com for examples.   Propose new interesting questions on a topic of interest to you that can be addressed using this data. (groups of 2-3 students)
    14. Poach-o-matic — Build a system that takes data from various sources (NSF Fastlane, Google Scholar, Google search for homepages, Rate My Professor, salary data from states for state universities) and identifies the most valuable faculty in a given department or research area.   The goal here is a live system which spiders on a weekly or monthly basis, and maintains a database for analysis with publishable reports. (groups of 2-3 students)
    15. The Stony Brook Glad All Over Machine (**) — Build a karaoke-like interface over the web and Facebook to film and record people singing “Glad All Over” by the Dave Clark Five, and integrate them into one recording and video.   This will involve clustering people who look and sound alike, and time-warping/merging the audio from different people so they sing in harmony.   Can we score which singers are best in a meaningful way?  A three minute recording presumably has room for 3*60*20 = 3600 frames, so we can only get to 1 million singers by merging faces/views. (groups of 2-4 students)
    16. Ethnic Analysis of Restaurants and Twitter (**) — My student Junting Ye developed a classifier to identify the nationality and ethnicity of names: www.name-prism.com. Work with him on studies analyzing the names appearing in different communities, including (1) customers eating in different ethnic restaurants, (2) which is more culturally inclusive: movies, music, or sports?  Attention is needed when dealing with different subtypes, e.g. Bollywood vs. Hollywood, and how to normalize them so as to compare across cultural topics.  (3) Can we try to spot social bots (following groups of Russians and Indians) and users followed by many social bots (fake followings)? (groups of 2-3 students)
    17. Dating Documents (*) — Given a short text (say, 500-2000 words), can you predict what year it was written?  We are particularly interested in books and articles written over the past 200 years.  How accurately can you date authorship?  My students and I have a manuscript/system that typically comes within 20 years of the right answer, but can you do better?   Google n-grams and book corpora like Project Gutenberg provide the data to do this.  (groups of 2-3 students)
    18. Multilingual Language Transliteration — Transliteration is spelling a word from one language out in another so it can be read.   My student Yanqing Chen and I built a Polyglot transliteration system that I think was quite good, but we never published it.  Do you have ideas for how to push it out the door? (group of 2-3 Ph.D. students)
    19. Working the refs — A fascinating study in Nate Silver’s 538 shows that complaining works much better than it should (https://fivethirtyeight.com/features/nfl-coaches-yell-at-refs-because-it-freakin-works/).   Can you find other domains where complaining provably gives you a better chance: in law, business, or anything with an appeals process?  (groups of 2-3 students)
    20. Trade value analysis in baseball and other sports — Players are routinely traded between teams, for other players of presumably equal value.  Can you develop a system to estimate the future value of two sides in a trade?    Starting from the assumption that past trades reflect equal value provides a training set.  (groups of 2-3 students)
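
The rough code sketches referenced above follow; each is a minimal illustration under stated assumptions, not a required or recommended implementation.

For project 5, one crude first cut at subtask (a), choosing index-worthy terms, is TF-IDF ranking of a book's terms against a background corpus. The variables chapters_of_target_book and arxiv_paper_texts are hypothetical placeholders for data you would assemble; the n-gram range and index size are arbitrary defaults.

    # Minimal sketch of subtask (a): rank candidate index terms by TF-IDF.
    from sklearn.feature_extraction.text import TfidfVectorizer

    def rank_index_terms(book_chapters, background_docs, index_size=200):
        """Return the index_size highest-scoring candidate terms for the book."""
        # Treat each chapter as a document so chapter-specific jargon stands out.
        vectorizer = TfidfVectorizer(ngram_range=(1, 3), stop_words="english",
                                     sublinear_tf=True)
        tfidf = vectorizer.fit_transform(background_docs + book_chapters)
        terms = vectorizer.get_feature_names_out()
        # Score each term by its best TF-IDF weight within the book's chapters.
        book_rows = tfidf[len(background_docs):]
        best = book_rows.max(axis=0).toarray().ravel()
        ranked = sorted(zip(best, terms), reverse=True)
        return [t for score, t in ranked[:index_size] if score > 0]

    # Example with hypothetical data you would have to assemble:
    # terms = rank_index_terms(chapters_of_target_book, arxiv_paper_texts)

Subtasks (b) and (c), normalizing term forms and placing pointers, are where the real work lies; this only produces a candidate keyword list.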
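For project 6, the sketch below builds an empirical null distribution of correlations between unrelated series of a given length. It uses synthetic random-walk series purely as a stand-in for the real collection you would assemble; the interesting question is how the real null compares to the textbook one.

    # Estimate how large a correlation "should" look between unrelated series
    # of a given length by correlating many randomly chosen pairs from a pool.
    import numpy as np

    rng = np.random.default_rng(0)

    def random_series(n):
        # Random-walk-style series; trending series inflate pairwise correlation.
        return np.cumsum(rng.normal(size=n))

    def empirical_null(series_pool, n_pairs=10000):
        """Distribution of |Pearson r| over random pairs from the pool."""
        rs = []
        for _ in range(n_pairs):
            i, j = rng.choice(len(series_pool), size=2, replace=False)
            rs.append(abs(np.corrcoef(series_pool[i], series_pool[j])[0, 1]))
        return np.array(rs)

    def empirical_p_value(observed_r, null_rs):
        """Fraction of unrelated pairs that correlate at least this strongly."""
        return np.mean(null_rs >= abs(observed_r))

    if __name__ == "__main__":
        pool = [random_series(100) for _ in range(500)]
        null = empirical_null(pool)
        # E.g. how surprising is r = 0.6 between two length-100 series?
        print("empirical p for r=0.6:", empirical_p_value(0.6, null))
        print("95th percentile of |r| under the null:", np.quantile(null, 0.95))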
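For project 9, one standard way to quantify the memory of a streaky series is a rescaled-range (R/S) estimate of the Hurst exponent. The sketch below applies it to a +1/-1 outcome sequence; the window sizes and the coin-flip baseline are illustrative choices, not part of the project statement.

    # Estimate a Hurst exponent H for a head-to-head rivalry encoded as +1/-1
    # outcomes.  H near 0.5 is memoryless, H > 0.5 means streaky (long runs),
    # H < 0.5 means mean-reverting.  Uses simple R/S analysis.
    import numpy as np

    def hurst_rs(outcomes, window_sizes=(8, 16, 32, 64)):
        """Rescaled-range estimate of H from a sequence of +1/-1 outcomes."""
        x = np.asarray(outcomes, dtype=float)
        log_n, log_rs = [], []
        for n in window_sizes:
            if n > len(x):
                continue
            rs_vals = []
            for start in range(0, len(x) - n + 1, n):
                w = x[start:start + n]
                dev = np.cumsum(w - w.mean())
                r = dev.max() - dev.min()
                s = w.std(ddof=1)
                if s > 0:
                    rs_vals.append(r / s)
            if rs_vals:
                log_n.append(np.log(n))
                log_rs.append(np.log(np.mean(rs_vals)))
        slope, _ = np.polyfit(log_n, log_rs, 1)   # slope is the H estimate
        return slope

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        coin_flips = rng.choice([-1, 1], size=512)      # memoryless baseline
        print("H for fair coin flips:", round(hurst_rs(coin_flips), 2))
        # A real rivalry series (e.g. yearly Lehigh-Lafayette results) would be
        # substituted here, and its run-length distribution compared against
        # simulations at the fitted H.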
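For project 10, the pricing model ultimately reduces to estimating a cover probability from the current game state. The sketch below fits a logistic regression on synthetic snapshots generated from a crude drift-plus-noise model of the score margin; real training data would come from archived play-by-play records, and the feature set here is only a guess at what matters.

    # Toy cover-probability model: logistic regression on (spread, current
    # margin, fraction of game remaining).  Training data is synthetic.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(2)

    # Synthetic stand-in for historical snapshots: the favorite's margin drifts
    # toward the spread with Brownian-style noise over the course of the game.
    n = 5000
    spread = rng.uniform(1, 14, n)               # opening spread for the favorite
    frac_left = rng.uniform(0.02, 1.0, n)        # fraction of the game remaining
    played = 1.0 - frac_left
    margin_now = rng.normal(spread * played, 10 * np.sqrt(played) + 0.1)
    margin_rest = rng.normal(spread * frac_left, 10 * np.sqrt(frac_left))
    covered = ((margin_now + margin_rest) > spread).astype(int)

    X = np.column_stack([spread, margin_now, frac_left])
    model = LogisticRegression(max_iter=1000).fit(X, covered)

    def live_cover_probability(spread_pts, margin_pts, fraction_left):
        """Probability the favorite covers, given the current game state."""
        x = np.array([[spread_pts, margin_pts, fraction_left]])
        return model.predict_proba(x)[0, 1]

    # E.g.: a 7-point favorite leading by 3 at halftime.
    print(round(live_cover_probability(7.0, 3.0, 0.5), 3))

The live-demo plumbing (streaming game events, updating the estimate in real time) is the other half of the project and is not shown here.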
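For project 11, the null model is easy to simulate directly: give each team a fixed per-drive scoring probability and count how often the better team's wins involve trailing at some point. The per-drive probabilities and the 12-drive game length below are illustrative values, not fitted ones.

    # How often does the better team come from behind, with no "knowing how
    # to win" built into the model at all?
    import random

    def simulate_game(p_good, p_bad, n_drives=12):
        """Return (good_team_won, good_team_trailed_at_some_point)."""
        good, bad = 0, 0
        trailed = False
        for drive in range(n_drives):
            if drive % 2 == 0:
                good += 7 if random.random() < p_good else 0
            else:
                bad += 7 if random.random() < p_bad else 0
            if good < bad:
                trailed = True
        return good > bad, trailed

    def comeback_rate(p_good=0.45, p_bad=0.30, n_games=100000):
        """Fraction of the better team's wins that are come-from-behind wins."""
        wins = comebacks = 0
        for _ in range(n_games):
            won, trailed = simulate_game(p_good, p_bad)
            if won:
                wins += 1
                comebacks += trailed
        return comebacks / wins if wins else 0.0

    if __name__ == "__main__":
        print(round(comeback_rate(), 3))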

 

 

Resource identification in data science — Prepare multiple “hitchhiker’s guide” entries related to data science topics. I have a format/length in mind, akin to the back of my Algorithm Design Manual.  A list of topics is available below.  (multiple students)

Hitchhiker’s Guide to Data Science

  • Mathematical Preliminaries
    1. Centrality measures
    2. Variability measures
    3. Correlation analysis
    4. Autocorrelation analysis (FFT)
  • Data Preparation
    1. Spidering and scraping
    2. Name unification
    3. Imputing missing values
    4. Character code conversion
    5. Financial time series modeling
    6. Outlier detection
    7. Crowdsourcing
  • Scores and Rankings
    1. Z-scores and normalization
    2. Linear scoring functions (learning to rank)
    3. Elo rankings
    4. Consensus rankings
    5. Digraph-based rankings
    6. PageRank
  • Statistical Analysis
    1. Classical statistical distributions (sampling)
    2. Random sampling from arbitrary distributions
    3. Statistical significance testing (e.g. T-test)
    4. Kolmogorov-Smirnov test
    5. Permutation testing
  • Visualization
    1. Visualizing tables
    2. Dot/line plots
    3. Scatter plots
    4. Projecting higher-dimensional data
    5. Histograms
    6. Data maps
    7. Interactive visualization
  • Mathematical Modeling
    1. Blackbox modeling systems
    2. Evaluating classifiers
    3. Evaluating regression systems
    4. Simulation modeling
    5. Measurements of fit vs. complexity (e.g. AIC)
  • Linear Algebra
    1. Matrix manipulation libraries
    2. Matrix multiplication
    3. Matrix inversion
    4. Matrix factorization
    5. Eigenvalues/eigenvectors
    6. Singular value decomposition
    7. Principal component analysis
  • Regression
    1. Linear regression
    2. Non-linear regression
    3. Gradient descent search
    4. Ridge/LASSO regression
    5. Logistic regression
  • Distance/Network methods
    1. Distance metrics
    2. Nearest neighbor search methods
    3. NN classification
    4. Locality sensitive hashing
    5. Graph libraries
    6. K-means clustering
    7. Agglomerative clustering
    8. Spectral (Cut-based) clustering
  • Machine Learning
    1. Naive Bayes
    2. Decision tree classifiers
    3. Boosting methods
    4. Support vector machines
    5. Deep learning
    6. Word embeddings
  • Big Data
    1. Filtering and sampling
    2. Grid search
    3. Map-Reduce systems
