My Very Own Airflow Cluster

written by Zach Morrissey on 2018-06-25

As someone who writes a lot of one-off batch scripts, the rise of DAG (Directed Acyclic Graph) task engines has made my life a lot better. I've used Luigi at work for about 2 years now. For feeding the backend of this website with data, I decided to set up an Airflow cluster.

Airflow Screenshot

As opposed to this paradigm:

  • Write a script.
  • Set it up in cron.
  • Check the logs whenever something goes wrong.

You get a whole lot more bang for your buck for a similar amount of work. The DAG paradigm is more like:

  • Write a script.
  • Write an airflow DAG, where you call that script.
  • Set a schedule for the script.
  • Check on it from the built-in Web UI.
  • Stop and restart tasks whenever you want to.
  • View the console output from the Web UI.

It helps batch processes scale past the point where cron gets painful, while simplifying how they're managed and deployed. You can set up Airflow to schedule thousands of tasks with relatively little configuration beyond what you'd normally write.

The Good and Bad Reasons to Use Airflow

There are a number of gotchas you should know about before using any of the DAG engines, as they all fall into the same traps.

  • cron is incredibly difficult to beat for reliability. There's a reason it's ubiquitous. DAG engines are usually for tasks that need to scale past what cron is built for. If you're just trying to run a single script on the regular, stick with that instead.
  • DAG engines are meant for batch programs. If you need anything real-time, look elsewhere.
  • Usually there's a non-trivial amount of setup involved at the beginning.
  • Sometimes scaling can be a pain. For Airflow, you'll need to set up a task queue like Celery and a number of different nodes (web server, database, etc.) once you scale past a certain point.
  • There's a relatively mature Docker image that you can reuse if you don't want to configure it yourself.

Diving In

There are plenty of ways to configure Airflow, with varying levels of complexity for varying levels of requirements, determined by the Executor type (i.e. what does what in your setup).

  • SequentialExecutor - Uses SQLite as a backend and executes tasks one at a time; meant for testing/debugging. This is generally what you'd run on your own machine if you're just testing a workflow.
  • LocalExecutor - Executes tasks in parallel on the local machine (via worker processes), using just a database connection as a backend.
  • CeleryExecutor - For distributing tasks out to worker nodes. This uses Celery as a queue. Requires you to set up other nodes as worker nodes.

Here's a diagram of how it works.

Airflow architecture

For my purposes, I went with LocalExecutor, and set up the configuration in the outlined area above. I set up two nodes for this in my Proxmox cluster.

  1. Airflow web server. This was just based on an Ubuntu 16.04 instance.
  2. Postgres backend. For the LocalExecutor option, this is the only additional node required.

The LocalExecutor option (configured here to use Postgres as the backend) probably has the highest payoff-to-effort ratio, especially compared to how involved setting up the CeleryExecutor can be. Once you've created the database, verified the connection works, and run airflow initdb, you're good to go.
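
For reference, here's a minimal sketch of the relevant configuration and commands; the connection string and hostname are placeholders, not my actual settings.

# airflow.cfg ([core] section) - executor choice and metadata DB connection
executor = LocalExecutor
sql_alchemy_conn = postgresql+psycopg2://airflow:airflow@postgres-host:5432/airflow

# then, on the web server node
airflow initdb            # create the metadata tables in Postgres
airflow webserver -p 8080
airflow scheduler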

Now you can start making DAGs.

DAGS, Operators, Upstream, Huh?

There's a lot of lingo to be learned when using Airflow.

  • DAG - This is a single workflow, wherein you can arrange tasks and dependencies.
  • Operator - This is a single unit of work that can be instantiated multiple times to achieve a particular output/goal/etc. There are things like BashOperator for executing shell scripts, or PythonOperator for python files, etc.
  • DAG Run - A single execution of a DAG of tasks. Each operator in the DAG is instantiated and executed as its dependencies are completed.

Example DAG

Fittingly, the DAG is a graph, both directed and acyclic. Task execution flows in only one direction.

An Airflow DAG Instance

  • The arrows indicate execution order.
  • An arrow from task 1 to task 2 implies task 1 is a dependency for task 2. If task 1 fails, task 2 will not execute.
  • Nodes can have multiple dependencies, like node 4 above.
  • Nodes can have multiple downstream tasks, like nodes 1 and 3.
  • Task execution timing is non-deterministic, so there's no guarantee for whether sibling tasks 2 or 3 would execute first.

In Code

Here are some example statements that would build a DAG resembling the one in the graphic above.

from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

# 'owner' is just a label shown in the UI; a start_date is required before tasks can be scheduled
default_args = {'owner': 'airflow', 'start_date': datetime(2018, 6, 1)}
dag = DAG('example', default_args=default_args)
t1 = BashOperator(task_id='t1', bash_command='echo "task 1"', dag=dag)
t2 = BashOperator(task_id='t2', bash_command='echo "task 2"', dag=dag)
t3 = BashOperator(task_id='t3', bash_command='echo "task 3"', dag=dag)
t4 = BashOperator(task_id='t4', bash_command='echo "task 4"', dag=dag)
t5 = BashOperator(task_id='t5', bash_command='echo "task 5"', dag=dag)

# now arrange the tasks in the dag
t2.set_upstream(t1)
t3.set_upstream(t1)
t4.set_upstream(t2)
t4.set_upstream(t3)
t5.set_upstream(t3)

Now that this is defined, you can instantiate a DAG Run from the webserver.
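
If you'd rather kick one off from the shell instead, the Airflow 1.x CLI can do that too:

# queue up a manual run of the 'example' DAG defined above
airflow trigger_dag example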

Notes On Running Airflow

There have been a lot of justifiable complaints about Airflow's deployment story; it can be very frustrating, and scaling Airflow is confusing and fault-prone. For those who don't want to deal with the headache of managing the installation on their own, Google offers a hosted Airflow service called Cloud Composer, or you can look into alternatives like Luigi or Mara. While I don't fully love Airflow, I still haven't found something better.

Headaches aside, it's still worth using a DAG engine because of the great improvements in both the quality and visibility of batch processing jobs.

Features to Avoid

Since Airflow has been in the Apache Incubator for the past year or so, it's been gaining features rapidly. I choose to avoid these:

  • XComs - The biggest downside of Airflow compared to Luigi is that it doesn't force you to write idempotent tasks in the same way, so it's easy to accidentally end up with outputs that don't reproduce. Airflow works best when you treat each task like a function with a single input and a single output.
  • Branching Operators - If you need to choose between multiple sets of logic, it's best just to dynamically set your task dependencies at runtime, as opposed to creating a branching task.
  • Bitshift Operators as Upstream/Downstream - Airflow recently introduced a way of setting upstream/downstream tasks by using bitshift operators like << or >>. This is bad. Explicit is good! set_upstream() is not particularly verbose.

In Conclusion

If you're frustrated with cron or other simple ways of deploying batch processes, give it a shot. Be aware of the shortcomings, though.

The Switch to Webpack 4

written by Zach Morrissey on 2018-06-09

Gulp to Webpack Transition

I've always been interested in using Webpack to generate this site, but the heavily config-oriented earlier versions of Webpack (2.x, 3.x) were a hassle, and the explicit pipelines from Gulp were always significantly easier. After using Nuxt.js for a recent project, which works exceedingly well with Webpack, I decided it was time to jump on the bandwagon.

Reasons for Switching

Gulp is very good at creating flexible tasks to be run on a regular basis. Webpack and Gulp, however, are two separate tools with completely different aims.

The Good

  • Outside of NPM itself, Webpack is the most popular build tool today, and the share of users who have used it and would use it again has grown by roughly 45%.
  • Webpack has a built-in integration with Lektor, which made for a better development story once all the kinks were worked out.
  • Webpack is a module bundler by default. If you organize your assets so that Webpack can consume them, their outputs get bundled nicely into a single file, as you'd expect.
  • Webpack 4 has 'zero config' options. The configuration file is a well-known pain point for Webpack, as it tends to create a lot of headaches very quickly. Being able to replace my ailing, convoluted Webpack 2/3 config files with defaults was appealing to me.
  • The move to 'convention over configuration' is under way: sane defaults, opinionated about common patterns, and tooling built for the majority of use cases.
  • Includes splitting into 'prod' and 'dev' modes by default. It was always possible before, but this is now a major part of the Webpack API (see the CLI example just below).
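
For reference, the mode switch is just a flag on the webpack CLI (or the mode key in a config file, as in the config example further down):

webpack --mode development   # faster builds, unminified output
webpack --mode production    # minification and other production defaults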

The Bad

  • Webpack documentation is confusing, sparse, and littered with examples that either make no sense or have been broken by version upgrades. Finding working examples is very difficult.
  • NPM module configuration for Webpack is a major source of frustration, because libraries are getting replaced & deprecated so quickly that what may have worked today is not guaranteed to work a few months from now.
  • NPM package installation and management is still more difficult than it seems like it should be. Since NPM packages take a more Unix-like 'single function' approach, the number of dependencies per project balloons (this website has 754 dependencies right now).
  • Transitioning from a working Webpack 2/3 config to Webpack 4 (either using config or zero-config options) is difficult.

What does it look like?

Previously - Gulp

Gulp pipelines usually have 1 or more inputs of the same filetype, and then produce explicit outputs using pipes.

// Process main scss for the site.
gulp.task('main-css', function() {
    return gulp.src('./src/index.scss')
        .pipe(rename('main.min.css'))
        .pipe(source_maps.init())
        .pipe(sass({ outputStyle: 'expanded' }))
        .pipe(clean_css())
        .pipe(source_maps.write('.'))
        .pipe(gulp.dest('../assets/css/'));
});

Gulp tasks operate on inputs and outputs, so you don't necessarily need to set things up so that they rely on one another.

Now - Webpack

Webpack configs, by contrast, start from an entry point (normally ./src/index.js by convention) and bundle all of that file's dependencies.

const path = require('path');
const MiniCssExtractPlugin = require('mini-css-extract-plugin');

module.exports = {
  mode: "production",
  output: {
    path: path.dirname(__dirname) + '/assets/static/gen',
    filename: "[name].js"
  },
  plugins: [
    new MiniCssExtractPlugin({
      filename: "main.css",
      chunkFilename: "[name].css"
    })
  ],
  module: {
    rules: [
      {
        test: /\.scss$/,
        use: [
          // css-loader sits between the extract plugin and sass-loader;
          // style-loader isn't used alongside MiniCssExtractPlugin
          MiniCssExtractPlugin.loader,
          'css-loader',
          'sass-loader'
        ]
      }
    ]
  }
};

Prior versions of Webpack required you to make many more things explicit; now you can lean on conventional inputs and outputs instead of specifying everything.

Lessons Learned

The biggest struggle in using Webpack is bridging the gap between the concepts and the implementation.

  • The concepts are well established and internally consistent. Check out the documentation if you're curious about what any of the terms / verbs used in Webpack-speak mean.
  • The implementation is where things are difficult. It's very tough to translate the concepts espoused in how Webpack should operate to how it actually does operate.
    • Things like 'entry' or 'loaders' are simple enough to understand, but are difficult to translate into working functionality.
    • Error messages need improvement; it's tough to turn the information they give you into a fix.

That said, module bundling with Webpack is still worth it despite the difficulties. Simplifying your build process from separate pipelines using Gulp/Grunt/etc. into a single build driven by a dependency graph is a drastic improvement in paradigm.

If you're just trying to process some Sass files, I would stick to Gulp.

Discovering PICO-8

written by Zach Morrissey on 2018-05-03

The PICO-8 is a 'fantasy console': basically a virtual machine for programming videogames with a fixed amount of memory, disk space, screen size, etc. I discovered it after recently playing the excellent Celeste for Nintendo Switch and learning that it started life as a PICO-8 game. This thing looks damn fun to program for, and it comes batteries-included for most tasks, so you can get up and running in no time.

PICO 8 Illustration From Lexaloffle

Highlights

  • Lua scripting.
  • 16-color palette.
  • 128 x 128 pixel screen size.
  • 32k cartridge sizes.
  • 4-channel sound.

On top of all that, the VM that runs the PICO-8 cartridge has been ported to hardware! It's being included in all $9 C.H.I.P. computers by default now, so you can actually get a hardware implementation of the PICO-8 fantasy console for use in your own home.

People make some suuuuuuuuuuper cool stuff with it.

Let's Give It a Shot

Starting out, I checked out this instructional youtube playlist by TeamworkCast. It helped me get an idea of what I wanted to create with PICO-8 (alongside being super entertaining). So now I was in a place where I had to figure out:

  • Game programming patterns. How do people actually structure a game?
  • Designing movement and obstacles.
  • Sounds & sound editing.

Super easy, right? Well, no. Turns out people spend their entire lives getting good at these things. I've been doing a lot of research into how to get better at it though.

Resources Used

The things that helped me learn PICO-8, by subject area:

  • PICO-8 API by Neko250. A neat little reference for helping figure out the calls you can make from PICO-8 itself.
  • /r/pico8 subreddit - An exceedingly helpful community of people who are both PICO-8 game designers and enthusiasts.
  • PICO-8 Cheat Sheet - A quick reference made by the users of /r/pico8.
  • Lua for Programmers Cheat Sheet - Admittedly, grappling with the overwhelming simplicity of Lua was a little frustrating at first, but having a slew of examples helped.
  • Awesome PICO-8 - One of the curated awesome lists that are found on github. This one in particular has a lot of good resources for starting up.

The PICO-8 Interface

The first place you'll land is the command line in PICO-8, which is used to launch all of the other screens. It looks exactly like you'd expect it to.

PICO 8 Command Line

There's also a built-in IDE for developing games! This is where you can write your lua code, edit sprites, create background tiles and more.

PICO 8 Integrated Development Environment

Multiple other screens in the editor exist for creating assets: a sprite editor, a tile editor, an SFX editor, and a music arrangement screen. Combining all of these, you can make pretty convincing graphics and sound pretty quickly.

Experimenting

Since I don't have the depth of experience to be able to make a legitimate video game yet, I'm focusing on doing exceedingly simple stuff for the time being.

There are basically only three functions to know:

  • _update() - Update the game state: change object positions, behavior, etc. each frame.
  • _draw() - Give directions for what to draw on screen.
  • _init() - Initialize game state (only if you need to).

Using these, I ended up trying to do what most people try to do their first time using any 2D graphics program: make a ball bounce across the screen.

PICO-8 code to bounce a ball.
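
The cart's actual code is in the screenshot above; here's a rough sketch of the same idea in PICO-8 Lua (the variable names and constants are mine, not the ones from the cart):

-- a ball with position (x, y), velocity (dx, dy), and gravity g
function _init()
  x=64 y=20 dx=1 dy=0 g=0.2
end

function _update()
  dy+=g            -- gravity pulls the ball down each frame
  x+=dx y+=dy
  if y>123 then    -- bounce off the floor
    y=123 dy=-dy
  end
  if x<4 or x>123 then dx=-dx end  -- bounce off the side walls
end

function _draw()
  cls()
  circfill(x, y, 4, 8)  -- radius-4 ball in color 8 (red)
end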

PICO-8 bouncing ball

It works! But hey - this isn't interactive, so we can't quite call it a game yet. Here's how we'll do that.

  • Add some listening for button presses using btnp(). This reports a button only if it is pressed this frame and was not pressed during the prior frame.
  • Change the velocity based on that button. This time we're adding a random positive velocity to the ball.
  • Display some message or feedback so that I can tell I implemented it correctly.

PICO-8 bouncing ball

Nice. But now, since this is a videogame we want to update a few more things:

  • Run at 60 FPS by using the _update60() function instead of _update(). This makes our game run a lot smoother.
  • Add more natural slowdowns to the bounces. As the ball bounces off the floor, it should lose some of its velocity.
  • Add some more x-axis movement. If the player hits the button, have it send the ball off in some random velocity along the x-axis.

PICO-8 bouncing ball

What's Next? (To be continued in Part 2)

Now that I've got the bones in place, I'm planning on creating a barebones clone of Flappy Bird for Part II of this blog post.

What's the way to get there?

  • Change the sprites to be objects. The ball itself should have an internal state that captures which sprite should be displayed at any point, and the collisions between it and everything else.
  • Functional decomposition. It's easy to write a lot of spaghetti code in pulling together simple demos; the way to scale this is to change the major parts to objects.
  • Create wall-generation logic. Flappy Bird works on a randomly-generated basis, so I'll need to create and manage a list of obstacles for the player to surpass.

Check it out on the next version of this post!

Hosting a Static Site on S3 & Cloudfront

written by Zach Morrissey on 2018-04-25

Considering this site is built with a static site generator, I needed to find somewhere to host it. Amazon's S3 service, as it turns out, actually has an option to host a website out of it, which is ridiculously convenient for those who are comfortable using AWS services. Prior to this, I had purchased my domain through Namecheap.

Requirements

S3 + Cloudfront met the pretty simple set of requirements I wanted to enforce:

  • HTTPS - Site has to be secured with a valid certificate.
  • Custom Domain - For professionalism and consistency.
  • Easy Deployment - Should be able to push files easily, or have a CI/CD tool listen to my git repo to push things up.

Hosting in S3 seemed to be a very straightforward way of achieving these things, so I decided to go with it.

Setting Up S3 for Static Web Hosting

Setting up S3 for hosting a static site is as simple as checking a box. First create a bucket for your site.

Static Bucket Settings

On that menu, select your default landing document (something along the lines of index.html) and error document (i.e. 404.html or something like that).

Website Options

Setting Bucket Permissions

Once you've created your static site bucket and configured it for hosting, you need to set permissions on the bucket to allow GET requests.

Under the permissions tab, edit your bucket policy to allow public s3:GetObject access.

S3 Bucket Policy Editor
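
The policy ends up looking something like this (swap in your own bucket name):

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "PublicReadGetObject",
      "Effect": "Allow",
      "Principal": "*",
      "Action": "s3:GetObject",
      "Resource": "arn:aws:s3:::your-bucket-name/*"
    }
  ]
}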

This will make this bucket public. Do not keep sensitive data of any sort in it once you've made this change.

Don't Need HTTPS? You're Done!

If you're not interested in setting up your site to work with HTTPS, you can call it quits at this point. S3 provides a public URL where you can reach the site now:

http://<your_bucket_name_here>.s3-website-<region>.amazonaws.com/

You can point your domain at this URL and boom, you're done.

Creating a Cloudfront Distribution

If you do want HTTPS on your site (and you should), you'll need to create a Cloudfront distribution, Amazon's CDN, in order to provide secure connections.

Create a new web distribution, with the following options selected:

  • Origin Domain Name - Amazon will automatically populate your S3 bucket in the dropdown, but you should use the full S3 website URL: <your_bucket_name_here>.s3-website-<region>.amazonaws.com
  • Viewer Protocol Policy - Choose what you want to support, but I chose 'Redirect HTTP to HTTPS' here.
  • Alternate Domain Names (CNAMEs) - This is how you'll link your domain to the Cloudfront distribution, so you should put your domain in here.
  • Default Root Object - This should be the same as the landing document you specified in your S3 bucket website option (i.e. index.html).
  • Price Class - If you want to save on hosting costs, one option worth noting is the US/Canada/Europe-only price class.
  • Custom SSL Certificate - By default, the distribution uses the Cloudfront certificate, but you should request your own if you're hosting on a custom domain. Which brings us to our next section...

Request an SSL Certificate

In order to secure your site with HTTPS, you can request a certificate through AWS Certificate Manager for your domain. You'll have to be able to prove you own the domain either through DNS validation or by email validation. I used email validation.

There's a number of email addresses that AWS will send out a confirmation email to, listed here. Your domain provider should have options for forwarding email if you don't have it set up through them.

Once you're done with the certificate business, you can submit that Cloudfront distribution, which will take some time to set up.

Registering Your Cloudfront Distribution with Your Domain Provider

Once your Cloudfront distribution is set up, you'll have a URL like this where you can access your site:

a983qewr90asf.cloudfront.net

Once you've got that URL, all you need to do is set up CNAME records with your domain provider, pointing your domain at the Cloudfront distribution. For my site, I wanted to use the bare domain, i.e. zmsy.co instead of www.zmsy.co.

I set up a CNAME record and a URL redirect in this way:

Type           Host   Value
CNAME          @      a983qewr90asf.cloudfront.net
URL Redirect   @      https://a983qewr90asf.cloudfront.net

That's a Wrap!

You'll be up and running once your domain is redirecting correctly. Now, once you push objects up to S3, they should populate out to the Cloudfront distribution automatically (albeit sometimes at a random schedule, since it's a CDN that operates on its own time). Plenty of toolkits have S3 as a built-in endpoint, so setting up deployment either through a CLI or from a CI/CD server is relatively straightforward.
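
For example, deploying with the AWS CLI is a one-liner (the bucket name and build directory here are placeholders):

# upload the generated site; --delete removes objects that no longer exist locally
aws s3 sync ./public s3://your-bucket-name --delete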

At the time I wrote this, GitHub Pages wasn't offering HTTPS but had just announced support; it's also primarily built around Jekyll. I'll be checking that one out soon too.

Automated Fantasy Baseball Research Using Python, Pandas, and Seaborn

written by Zach Morrissey on 2018-03-15

Winning Fantasy Baseball in Style

The advantage of doing my pre-season due diligence by scouting out the best fantasy baseball players usually evaporates about three rounds into the draft. Since I can't win in the draft, why not automate my in-season scouting efforts?

Getting Data

For a good system, there are three inputs:

  1. Roster data to know who is available.
  2. Projections to see who is slated to do well.
  3. Mid-season player statistics to see who's currently doing well.

1 -- Roster Data

Our team uses ESPN for our league, which happens to allow access to webpages so long as you take the time to make the league public. Getting roster data is fairly straightforward; all you need to do is scrape the rosters page for that league ID. This is an example of that:

import requests
import csv
from bs4 import BeautifulSoup

# rip your league id from your espn url
league_id = 1234567890
LEAGUE_URL = "http://games.espn.com/flb/leaguerosters?leagueId={}"
html = requests.get(LEAGUE_URL.format(league_id)).text

# parse the page to return the rosters table
soup = BeautifulSoup(html, "html.parser")  # specify a parser explicitly to avoid the bs4 warning
rosters = soup.find_all("table", {'class': 'playerTableTable'})

# loop through the rosters table and get all of the associated values
players = []
for roster in rosters:
    team_name = roster.find("a").text
    players_html = roster.find_all("td", {'class': 'playertablePlayerName'})
    for player in players_html:

        # parse player info, remove asterisks for injured players.
        player_name = player.text.split(",")[0]
        player_name = player_name.replace("*", "")

        # add to output list
        players.append([player_name, team_name])

# write out to a csv
with open("rosters.csv", "w", newline='') as out_file:
    writer = csv.writer(out_file)
    writer.writerow(("player", "team"))
    writer.writerows(players)

2 -- Projections

Player projections are a common source of scouting information for our league. Since we don't use the standard scoring rules, the projections from ESPN are mostly worthless to us (not that they're particularly valuable otherwise). For my own purposes, I like Fangraphs' Depth Charts projection system, which combines two other common projection systems (ZiPS and Steamer), weighted by an in-house projection for playing time.

Getting Data with cURL

There were two issues I was trying to solve for.

  1. Not enough players - URL parameters & form fields didn't seem to offer a way to get more than 50 players at once.
  2. Confusing ASP.net forms - These use built-in Microsoft libraries that send back somewhere between 50-60kb of base64-encoded page state data, which is a lot of added noise for my purposes.

Using the network tab, I found that the 'page size' link allowed me to request more players than any of the values in the form dropdown.

Fangraphs Page Size Dropdown

  1. Navigate to the chrome developer tools network tab.
  2. Filter to method:POST so you only get the form post and none of the other resource requests.
  3. Right click on the post and select copy > copy as cURL (bash).

Chrome Network Tab for Fangraphs Page

I ripped that request out into a script, get_fangraphs.sh (which sends the requests), plus the form-data files it references (batters_form_data.txt and pitchers_form_data.txt). Since the form payload is so large, it's easiest to store it in its own file and load it with curl's -d @ flag.

# get batters projections html
curl 'https://www.fangraphs.com/projections.aspx?pos=all&stats=bat&type=rfangraphsdc' \
    -H 'Accept-Encoding: gzip, deflate, br' \
    -H 'Accept-Language: en-US,en;q=0.9' \
    -H 'Upgrade-Insecure-Requests: 1' \
    -H 'Content-Type: application/x-www-form-urlencoded' \
    -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8' \
    -d @batters_form_data.txt --compressed \
    > batters_projections.html

# get pitching projections
curl 'https://www.fangraphs.com/projections.aspx?pos=all&stats=pit&type=rfangraphsdc' \
    -H 'Accept-Encoding: gzip, deflate, br' \
    -H 'Accept-Language: en-US,en;q=0.9' \
    -H 'Upgrade-Insecure-Requests: 1' \
    -H 'Content-Type: application/x-www-form-urlencoded' \
    -H 'Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8' \
    -d @pitchers_form_data.txt --compressed \
    > pitchers_projections.html

In the (batters|pitchers)_form_data.txt files, there's one adjustment to make to get back more than 50 entries: in __EVENTARGUMENT=FireCommand:ProjectionBoard1$dg1$ctl00;PageSize;50, change the trailing 50 to something like 2500.

Voila! You've now got a page with an HTML table containing all of the projection information you could want. You can then use pandas' read_html() function to parse this output for you.

3 -- Actuals

In addition to projections, having in-season results helps improve our judgment and adds context to the projections. There are players who are specifically favored by the projection systems (namely, players with long histories) and players who are treated rather unkindly (small sample sizes).

Getting the ongoing statistics on a per-player basis turned out to be much easier than anything related to the projections. With a little manipulation of URL parameters, you can easily create a page for scraping that contains all of the necessary information in one go.

Creating a short script to parse out player rows and add them to another csv for consumption later on turned out to be short work once I figured this bit out. Here's the script I used (uses requests and beautifulsoup, similar to above):

import requests
import csv
import datetime
from bs4 import BeautifulSoup

# static urls
season = datetime.datetime.now().year
PITCHERS_URL = "https://www.fangraphs.com/leaders.aspx?pos=all&stats=pit&lg=all&qual=0&type=c,36,37,38,40,-1,120,121,217,-1,24,41,42,43,44,-1,117,118,119,-1,6,45,124,-1,62,122,13&season={season}&month=0&season1={season}&ind=0&team=0&rost=0&age=0&filter=&players=0&page=1_100000".format(season=season)
BATTERS_URL = "https://www.fangraphs.com/leaders.aspx?pos=all&stats=bat&lg=all&qual=0&type=8&season={season}&month=0&season1={season}&ind=0&team=0&rost=0&age=0&filter=&players=0&page=1_10000".format(season=season)

# request the data
pitchers_html = requests.get(PITCHERS_URL).text
batters_html = requests.get(BATTERS_URL).text

def parse_array_from_fangraphs_html(input_html, out_file_name):
    """
    Take a HTML stats page from fangraphs and parse it out to a CSV file.
    """
    # parse input
    soup = BeautifulSoup(input_html, "lxml")
    table = soup.find("table", {"class": "rgMasterTable"})

    # get headers
    headers_html = table.find("thead").find_all("th")
    headers = []
    for header in headers_html:
        headers.append(header.text)
    print(headers)

    # get rows
    rows = []
    rows_html = table.find("tbody").find_all("tr")
    for row in rows_html:
        row_data = []
        for cell in row.find_all("td"):
            row_data.append(cell.text)
        rows.append(row_data)

    # write to CSV file
    with open(out_file_name, "w") as out_file:
        writer = csv.writer(out_file)
        writer.writerow(headers)
        writer.writerows(rows)

parse_array_from_fangraphs_html(batters_html, 'batters_actuals.csv')
parse_array_from_fangraphs_html(pitchers_html, 'pitchers_actuals.csv')

Normally, I would use pandas read_html function for parsing HTML tables like this (which uses lxml internally, same as BeautifulSoup), but the encoding of this table caused enough havoc that I decided to write the parser myself. If you're in a similar scenario, try that method first!

Reading the Data Into Pandas

Pandas is very capable of producing these sorts of analyses using tabular data in a straightforward, repeatable manner. Our league uses a customized stat list to base our categories on, so we have to derive many of these after getting actuals/projections.

Batting Stats

  • HR: Home Runs
  • K: Strikeouts
  • RC: Runs Created (AB * OBP * SLG)
  • OBP: On Base Percentage
  • SLG: Slugging Percentage

Pitching Stats

  • HR: Home Runs Against
  • K: Strikeouts
  • ERA: Earned Run Average
  • QS: Quality Starts
  • OBA: On-Base Percentage Against

There are plenty of different scoring methods used in fantasy. If you want to try this on yours, you can edit the code to use your own categories instead.

Pandas

Bringing this all into pandas is relatively easy, because of its insanely useful read_csv() function. Not only does this do your opening/reading for you, it almost always gets the data types correct as well.

import pandas as pd

# read in all outputs
df_rost = pd.read_csv('rosters.csv')
dfb_act = pd.read_csv('batters_actuals.csv')
dfp_act = pd.read_csv('pitchers_actuals.csv')

# create a function to parse out percentage strings to floats
def parse_pctg(value):
    return float(value.split()[0]) / 100

# apply that to all percentage values in the dataframes
dfb_act['BB%'] = dfb_act['BB%'].apply(lambda x: parse_pctg(x))
dfb_act['K%'] = dfb_act['K%'].apply(lambda x: parse_pctg(x))
dfp_act['K%'] = dfp_act['K%'].apply(lambda x: parse_pctg(x))
dfp_act['BB%'] = dfp_act['BB%'].apply(lambda x: parse_pctg(x))
dfp_act['K-BB%'] = dfp_act['K-BB%'].apply(lambda x: parse_pctg(x))
dfp_act['LOB%'] = dfp_act['LOB%'].apply(lambda x: parse_pctg(x))

with open('batters_projections.html', 'r') as bhtml:
    btxt = bhtml.read()
    dfb_proj = pd.read_html(btxt)[-1]  # read_html returns ALL tables, we just want the last one.
    dfb_proj.dropna(axis=1, inplace=True)

with open('pitchers_projections.html', 'r') as phtml:
    ptxt = phtml.read()
    dfp_proj = pd.read_html(ptxt)[-1]  # note: parse the pitchers' html here, not btxt
    dfp_proj.dropna(axis=1, inplace=True)

# list of data sets: df_rost, dfb_act, dfb_proj, dfp_act, dfp_proj

This means we've got 5 inputs - one for rosters, two for projections, and two for actuals. Time to get crankin' on the analysis part!

Joining and Visualizing in Pandas/Seaborn

First we need to join the datasets together into two dataframes that will make things easier to use: batters and pitchers. Roster information goes into both.

# the rosters CSV has 'player'/'team' columns, so rename to match the 'Name' column Fangraphs uses
df_rost = df_rost.rename(columns={'player': 'Name'})

# merge batters into one 'dfb' dataframe
dfb = pd.merge(dfb_proj, df_rost, how='left', on='Name', suffixes=('.p', '.r'))
dfb = pd.merge(dfb, dfb_act, how='left', on='Name', suffixes=('', '.a'))

# merge pitchers into one 'dfp' dataframe
dfp = pd.merge(dfp_proj, df_rost, how='left', on='Name', suffixes=('.p', '.r'))
dfp = pd.merge(dfp, dfp_act, how='left', on='Name', suffixes=('', '.a'))
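
With the merged frames in hand, the derived categories from the stat lists above are one-liners, and Seaborn can compare projections against actuals. Here's a hedged sketch; it assumes the merged batter frame keeps the Fangraphs column names (AB, OBP, SLG, HR) and that actuals came through with the '.a' suffix from the merge.

import seaborn as sns
import matplotlib.pyplot as plt

# derived batting category: runs created, per our league's definition (AB * OBP * SLG)
dfb['RC'] = dfb['AB'] * dfb['OBP'] * dfb['SLG']

# flag who's already rostered ('team' comes from rosters.csv via the merge)
dfb['rostered'] = dfb['team'].notna()

# projected vs. actual home runs, split by roster status
sns.lmplot(x='HR', y='HR.a', hue='rostered', data=dfb, fit_reg=False)
plt.xlabel('projected HR')
plt.ylabel('actual HR')
plt.show()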

Also - If you use Fangraphs data a lot, it's worth considering becoming a member and supporting them a bit!