Mozilla, Data Visualization, Mission Control

Better or worse: by what measure?

Oct 26th, 2017

Ok, after a series of posts extolling the virtues of my current project, it’s time to take a more critical look at some of its current limitations, and what we might do about them. In my introductory post, I talked about how Mission Control can let us know how “crashy” a new release is, within a short interval of it being released. I also alluded to the fact that things appear considerably worse when something first goes out, though I didn’t go into a lot of detail about how and why that happens.

It just so happens that a new point release (56.0.2) just went out, so it’s a perfect opportunity to revisit this issue. Let’s take a look at what the graphs are saying (each of the images is also a link to the dashboard where they were generated):

ZOMG! It looks like 56.0.2 is off the charts relative to the two previous releases (56.0 and 56.0.1). Is it time to sound the alarm? Mission control abort? Well, let’s see what happened the last time we rolled something new out, say 56.0.1:

We see the exact same pattern. Hmm. How about 56.0?

Yep, same pattern here too (actually slightly worse).

What could be going on? Let’s start by reviewing what these time series graphs are based on. Each point on the graph represents the number of crashes reported by telemetry “main” pings corresponding to that channel/version/platform within a five minute interval, divided by the number of usage hours (how long users have had Firefox open) also reported in that interval. A main ping is submitted under a few circumstances:

  • The user shuts down Firefox
  • It’s been about 24 hours since the last time we sent a main ping.
  • The user starts Firefox after Firefox failed to start properly
  • The user changes something about Firefox’s environment (adds an addon, flips a user preference)
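As a rough sketch of the calculation described above: each point is a crash count divided by the usage hours reported in the same five minute window. The function name, the per-1,000-hours scaling, and the sample numbers are all assumptions for illustration, not Mission Control’s actual code.

```python
def crash_rate(crashes, usage_hours, per_hours=1000):
    """Crashes per `per_hours` usage hours; None when no usage was reported."""
    if usage_hours <= 0:
        return None
    # Multiply first so the made-up sample values below divide exactly.
    return crashes * per_hours / usage_hours


# Hypothetical (crashes, usage_hours) totals for a single five minute window.
just_released = (5, 10.0)            # very few usage hours reported so far
well_established = (500, 100000.0)   # lots of usage hours behind it

print(crash_rate(*just_released))
print(crash_rate(*well_established))
```

With these made-up numbers the brand new release comes out a hundred times “crashier” despite reporting a hundredth as many crashes, purely because the denominator is so small.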

A high crash rate either means a larger number of crashes over the same number of usage hours, or a lower number of usage hours over the same number of crashes. There are several likely explanations for why we might see this type of crashy behaviour immediately after a new release:

  • A Firefox update is applied after the user restarts their browser for any reason, including a browser crash. Thus a user whose browser crashes a lot (for any reason) is more prone to update to the latest version sooner than a user who doesn’t crash as much.
  • Inherently, any crash data submitted to telemetry after a new version is released will have a low number of usage hours attached, because the client would not have had a chance to use it much (because it’s so new).

Assuming that we’re reasonably satisfied with the above explanation, there’s a few things we could try to do to correct for this situation when implementing an “alerting” system for mission control (the next item on my todo list for this project):

  • Set “error” thresholds for each crash measure sufficiently high that we don’t consider these high initial values an error (i.e. only alert if there are 500 crashes per 1k hours).
  • Only trigger an error threshold when some kind of minimum quantity of usage hours has been observed (this has the disadvantage of potentially obscuring a serious problem until a large percentage of the user population is affected by it).
  • Come up with some expected range of what we expect a value to be for when a new version of firefox is first released and ratchet that down as time goes on (according to some kind of model of our previous expectations).
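To make the second and third options above a bit more concrete, here is a minimal sketch that combines a minimum-usage-hours gate with an expected ceiling that ratchets down as a release ages. Everything in it is hypothetical: the function names, the exponential decay model, and every constant (500, 10, a 12 hour half-life, a 1,000 hour minimum) are placeholders, not values anyone has settled on.

```python
import math


def expected_ceiling(hours_since_release, initial=500.0, steady=10.0, half_life=12.0):
    """Hypothetical model: the acceptable crash rate (per 1k usage hours)
    decays from `initial` toward `steady` as the release ages."""
    decay = math.pow(0.5, hours_since_release / half_life)
    return steady + (initial - steady) * decay


def should_alert(crashes, usage_hours, hours_since_release, min_usage_hours=1000.0):
    """Alert only once enough usage hours have been observed and the
    observed rate exceeds the ceiling expected at this point in the rollout."""
    if usage_hours < min_usage_hours:
        return False
    rate = crashes * 1000 / usage_hours
    return rate > expected_ceiling(hours_since_release)


# The same observed rate (300 crashes per 1k hours) is tolerated right at
# release but triggers an alert twelve hours in.
print(should_alert(300, 1000.0, hours_since_release=0))
print(should_alert(300, 1000.0, hours_since_release=12))
```

The minimum-hours gate carries the disadvantage already noted above: a serious problem stays hidden until enough of the population has reported in.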

The initial specification for this project called for just using raw thresholds for these measures (discounting usage hours), but I’m becoming increasingly convinced that won’t cut it. I’m not a quality control expert, but 500 crashes for 1k hours of use sounds completely unacceptable if we’re measuring things at all accurately (which I believe we are given a sufficient period of time). At the same time, generating 20–30 “alerts” every time a new release went out wouldn’t be particularly helpful either. Once again, we’re going to have to do this the hard way…

If this sounds interesting and you have some react/d3/data visualization skills (or would like to gain some), learn about contributing to mission control.

Shout out to chutten for reviewing this post and providing feedback and additions.

Mozilla, Data Visualization, Mission Control

Mission Control: Ready for contributions

Oct 20th, 2017

One of the great design decisions that was made for Treeherder was a strict separation of the client and server portions of the codebase. While its backend was moderately complicated to get up and running (especially into a state that looked at all like what we were running in production), you could get its web frontend running (pointed against the production data) just by starting up a simple node.js server. This dramatically lowered the barrier to entry, for Mozilla employees and casual contributors alike.

I knew right from the beginning that I wanted to take the same approach with Mission Control. While the full source of the project is available, unfortunately it isn’t presently possible to bring up the full stack with real data, as that requires privileged access to the athena/parquet error aggregates table. But since the UI is self-contained, it’s quite easy to bring up a development environment that allows you to freely browse the cached data which is stored server-side (essentially: git clone https://github.com/mozilla/missioncontrol.git && yarn install && yarn start).

In my experience, the most interesting problems when it comes to projects like these center around the question of how to present extremely complex data in a way that is intuitive but not misleading. Probably 90% of that work happens in the frontend. In the past, I’ve had pretty good luck finding contributors for my projects (especially Perfherder) by doing call-outs on this blog. So let it be known: If Mission Control sounds like an interesting project and you know React/Redux/D3/MetricsGraphics (or want to learn), let’s work together!

I’ve created some good first bugs to tackle in the github issue tracker. From there, I have a galaxy of other work in mind to improve and enhance the usefulness of this project. Please get in touch with me (wlach) on irc.mozilla.org #missioncontrol if you want to discuss further.

Mozilla, Data Visualization, Mission Control

Mission Control

Oct 6th, 2017

Time for an overdue post on the mission control project that I’ve been working on for the past few quarters, since I transitioned to the data platform team.

One of the gaps in our data story when it comes to Firefox is being able to see how a new release is doing in the immediate hours after release. Tools like crashstats and the telemetry evolution dashboard are great, but it can take many hours (if not days) before you can reliably see that there is an issue in a metric that we care about (number of crashes, say). This is just too long: such delays unnecessarily slow down rolling out a release when it is safe (because there is a paranoia that there might be some lingering problem which we’re waiting to see reported). And if, somehow, despite our abundant caution, a problem did slip through, it would take us some time to recognize it and roll out a fix.

Enter mission control. By hooking up a high-performance spark streaming job directly to our ingestion pipeline, we can now detect within moments whether Firefox is performing unacceptably in the field according to a particular measure.

To make the volume of data manageable, we create a grouped data set with the raw count of the various measures (e.g. main crashes, content crashes, slow script dialog counts) along with each unique combination of dimensions (e.g. platform, channel, release).
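The grouping step can be sketched in plain Python as follows. This is only an illustration of the idea: the real job runs in spark streaming, and the field names and sample pings here are made up.

```python
from collections import Counter

# Hypothetical field names standing in for the real dimensions and measures.
DIMENSIONS = ("platform", "channel", "release")
MEASURES = ("main_crashes", "content_crashes")


def aggregate(pings):
    """Sum each measure per unique combination of dimension values."""
    totals = {}
    for ping in pings:
        key = tuple(ping[d] for d in DIMENSIONS)
        counts = totals.setdefault(key, Counter())
        for measure in MEASURES:
            counts[measure] += ping.get(measure, 0)
    return totals


pings = [
    {"platform": "windows", "channel": "release", "release": "56.0", "main_crashes": 1},
    {"platform": "windows", "channel": "release", "release": "56.0", "content_crashes": 2},
    {"platform": "linux", "channel": "beta", "release": "57.0", "main_crashes": 3},
]
for key, counts in aggregate(pings).items():
    print(key, dict(counts))
```

However many pings arrive, the output has at most one row per combination of dimensions, which is what keeps the volume manageable.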

Of course, all this data is not so useful without a tool to visualize it, which is what I’ve been spending the majority of my time on. The idea is to be able to go from a top level description of what’s going on in a particular channel (release for example) all the way down to a detailed view of how a measure has been performing over a time interval:

This particular screenshot shows the volume of content crashes (sampled every 5 minutes) over the last 48 hours on windows release. You’ll note that the later version (56.0) seems to be much crashier than earlier versions (55.0.3) which would seem to be a problem except that the populations are not directly comparable (since the profile of a user still on an older version of Firefox is rather different from that of one who has already upgraded). This is one of the still unsolved problems of this project: finding a reliable, automatable baseline of what an “acceptable result” for any particular measure might be.

Even so, the tool can be useful for exploring a bunch of data quickly, and it has been progressing rapidly over the last few weeks. And like almost everything Mozilla does, both the source and dashboard are open to the public. I’m planning on flagging some easier bugs for newer contributors to work on in the next couple weeks, but in the meantime if you’re interested in this project and want to get involved, feel free to look us up on irc.mozilla.org #missioncontrol (I’m there as ‘wlach’).

Mozilla

Functional is the future

Aug 28th, 2017

Just spent well over an hour tracking down a silly bug in my code. For the mission control project, I wrote this very simple API method that returns a cached data structure to our front end:

def measure(request):
    channel_name = request.GET.get('channel')
    platform_name = request.GET.get('platform')
    measure_name = request.GET.get('measure')
    interval = request.GET.get('interval')
    if not all([channel_name, platform_name, measure_name]):
        return HttpResponseBadRequest("All of channel, platform, measure required")
    data = cache.get(get_measure_cache_key(platform_name, channel_name, measure_name))
    if not data:
        return HttpResponseNotFound("Data not available for this measure combination")
    if interval:
        try:
            min_time = datetime.datetime.now() - datetime.timedelta(seconds=int(interval))
        except ValueError:
            return HttpResponseBadRequest("Interval must be specified in seconds (as an integer)")

        # Return any build data in the interval
        empty_buildids = set()
        for (build_id, build_data) in data.items():
            build_data['data'] = [d for d in build_data['data'] if d[0] > min_time]
            if not build_data['data']:
                empty_buildids.add(build_id)

        # don't bother returning empty indexed data
        for empty_buildid in empty_buildids:
            del data[empty_buildid]

    return JsonResponse(data={'measure_data': data})

As you can see, it takes 3 required parameters (channel, platform, and measure) and one optional one (interval), picks out the required data structure, filters it a bit, and returns it. This is almost what we wanted for the frontend. Unfortunately, the time zone information isn’t quite right: the strings that are returned don’t tell the frontend that they’re in UTC, so they need a ‘Z’ appended to them.

After a bit of digging, I found out that Django’s json serializer will only add the Z if the tzinfo structure is specified. So I figured out a simple pattern for adding that (using the dateutil library, which we are fortunately already using):

from dateutil.tz import tzutc
datetime.datetime.fromtimestamp(mydatestamp.timestamp(), tz=tzutc())

I tested this quickly on the python console and it seemed to work great. But when I added the code to my function, the unit tests mysteriously failed. Can you see why?

for (build_id, build_data) in data.items():
    # add utc timezone info to each date, so django will serialize a
    # 'Z' to the end of the string (and so javascript's date constructor
    # will know it's utc)
    build_data['data'] = [
        [datetime.datetime.fromtimestamp(d[0].timestamp(), tz=tzutc())] + d[1:] for
        d in build_data['data'] if d[0] > min_time
    ]

Trick question: there’s actually nothing wrong with this code. But if you look at the block in context (see the top of the post), you see that it’s only executed if interval is specified, which it isn’t necessarily. The first case that my unit tests executed didn’t specify interval, so fail they did. It wasn’t immediately obvious to me why this was happening, so I went on a wild-goose chase of trying to figure out how the Django context might have been responsible for the unexpected output, before realizing my basic logic error.

This was fairly easily corrected (my updated code applies the datetime-mapping unconditionally to the set of optionally-filtered results) but perfectly illustrates my issue with idiomatic python: while the language itself has constructs like map and reduce that support the functional programming model, the language strongly steers you towards writing things in an imperative style that makes costly and annoying mistakes like this much easier to make. Yes, list and dictionary comprehensions are nice and compact but they start to break down in the more complex cases.
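For illustration, the shape of that correction might look something like the sketch below. It is not the code that actually landed. The helper names are made up, it uses the standard library’s datetime.timezone.utc with replace() instead of the dateutil pattern above (which assumes the stored naive datetimes are already in UTC), and it drops empty builds in every case rather than only when an interval is given.

```python
import datetime


def add_utc(stamp):
    """Mark a naive datetime (assumed to already be in UTC) as UTC."""
    return stamp.replace(tzinfo=datetime.timezone.utc)


def filter_and_tag(data, min_time=None):
    """Filter by min_time only when one is given, but always tag dates as UTC.

    Builds a new dict instead of modifying `data` in place.
    """
    result = {}
    for build_id, build_data in data.items():
        rows = [
            [add_utc(d[0])] + list(d[1:])
            for d in build_data['data']
            if min_time is None or d[0] > min_time
        ]
        if rows:  # don't bother returning empty builds
            result[build_id] = {**build_data, 'data': rows}
    return result
```

The point is that the tagging no longer sits inside the interval branch, so a request without an interval goes through exactly the same mapping.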

As an experiment, I wrote up what this function might look like in a pure functional style with immutable data structures:

def transform_and_filter_data(build_data):
    new_build_data = copy.copy(build_data)
    new_build_data['data'] = [
        [datetime.datetime.fromtimestamp(d[0].timestamp(), tz=tzutc())] + d[1:] for
        d in build_data['data'] if d[0] > min_time
    ]
    return new_build_data
transformed_build_data = {k: v for k, v in {k: transform_and_filter_data(v) for k, v in data.items()}.items() if len(v['data']) > 0}

A work of art it isn’t — and definitely not “pythonic”. Compare this to a similar piece of code written in Javascript (ES6) with lodash (using a hypothetical tzified function):

let transformedBuildData = _.pickBy(_.mapValues(data, (buildData) => ({
    ...buildData,
    data: buildData.data
      .filter(datum => datum[0] > minTimestamp)
      .map(datum => [tzified(datum[0])].concat(datum.slice(1)))
  })),
  (buildData) => buildData.data.length > 0);

A little bit easier to understand, but more importantly (to me anyway) it comes across as idiomatic and natural in a way that the python version just doesn’t. I’ve been happily programming Python for the last 10 years, but it’s increasingly feeling time to move on to greener pastures.

Mozilla, mozregression

mozregression’s new mascot

Jul 31st, 2017

Spent a few hours this morning on a few housekeeping issues with mozregression. The web site was badly in need of an update (it was full of references to obsolete stuff like B2G and codefirefox.com) and the usual pile of fixes motivated a new release of the actual software. But most importantly, mozregression now has a proper application icon / logo, thanks to Victoria Wang!

One of the nice parts about working at Mozilla is the flexibility it offers to just hack on stuff that’s important, whether or not it’s part of your formal job description. Maintaining mozregression is pretty far outside my current set of responsibilities (or even interests), but I keep it going because it’s a key tool used by developers here and no one else seems willing to take it over. Fortunately, tools like appveyor and pypi keep the time suckage to a mostly-reasonable level.

Mozilla

Taking over an npm package: sanity prevails

Jul 13th, 2017

Sometimes problems are easier to solve than expected.

For the last few months I’ve been working on the front end of a new project called Mission Control, which aims to chart lots of interesting live information in something approximating realtime. Since this is a greenfield project, I thought it would make sense to use the currently in-vogue framework at Mozilla (react) along with our standard visualization library, metricsgraphics.

Metricsgraphics is great, but its jquery-esque api is somewhat at odds with the react way. The obvious solution to this problem is to wrap its functionality in a react component, and a quick google search determined that several people have tried to do exactly that, the most popular one being one called (obviously) react-metrics-graphics. Unfortunately, it hadn’t been updated in quite some time and some pull requests (including ones implementing features I needed for my project) weren’t being responded to.

I expected this to be pretty difficult to resolve: I had no interaction with the author (Carter Feldman) before but based on my past experiences in free software, I was expecting stonewalling, leaving me no choice but to fork the package and give it a new name, a rather unsatisfying end result.

But, hey, let’s keep an open mind on this. What does google say about unmaintained npm packages? Oh what’s this? They actually have a policy?

tl;dr: You email the maintainer (politely) and CC support@npmjs.org about your interest in helping maintain the software. If you’re unable to come up with a resolution on your own, they will intervene.

So I tried that. It turns out that Carter was really happy to hear that Mozilla was interested in taking over maintenance of this project, and not only gave me permission to start publishing newer versions to npm, but even transferred his repository over to Mozilla (so we could preserve issue and PR history). The project’s new location is here:

https://github.com/mozilla/react-metrics-graphics

In hindsight, this is obviously the most reasonable outcome and I’m not sure why I was expecting anything else. Is the node community just friendlier than other areas I’ve worked in? Have community standards improved generally? In any case, thank you Carter for a great piece of software, hopefully it will thrive in its new home. :P

Counting

The vastness

Jul 8th, 2017

Had a good all hands with the rest of Mozilla in San Francisco (at least those able and willing to attend due to the current political situation in the U.S.). I stayed a few extra days to hang out with some of my friends who had moved to S.F. On Sunday we went to Muir Woods, where I took this picture:

It occurred to me at the time that I took that photo that pretty much every sensory receptor in my optic nerve was registering the signal of some kind of life. Thousands of beings (trees, clover, moss, lichens) in turn made up of trillions upon trillions of tiny beings (cells, bacteria) all conscious and interacting with each other in ways that I can barely begin to understand.

Mozilla, Docker

Using Docker to run automated tests

Jun 2nd, 2017

A couple months ago, I joined the Mozilla Data Platform team, to work on our Telemetry and automated data collection services. This has been an interesting transition for me, and a natural jumping off point from my work on Perfherder. Now, instead of manipulating mere 10s of gigabytes worth of fairly regular data, I’m working with 100s of terabytes of noisy data with a much larger number of dimensions. :P It’s been interesting so far.

One of the first things I decided to work on was improving our unit testing story around a few of our primary packages for data analysis/etl: python_moztelemetry (a library we use for running custom spark jobs against Telemetry data) and telemetry-batch-view (a set of scala jobs we run against the main telemetry data store to create a useful set of aggregations that are easily queried with tools like redash).

It turns out that these tools interact with several larger / more involved pieces than I’m used to dealing with (such as hbase and thrift). For continuous integration/automation, we already had a set of travis scripts to install and reproduce the environment needed to test these parts, but there was no straightforward way to do this locally. My third time through creating an Ubuntu virtual machine environment to reproduce this environment locally (long story), I figured it was finally time for me to investigate using something to automate that setup procedure and make it easier for new developers to get into these projects.

I hadn’t used it much before, but Docker seemed like a fairly obvious choice. Small, simple, and Linuxy? Sign me up.

I’m pretty happy with how things turned out, but there were a few caveats. Docker is more of a general purpose tool for building environments for running things, whether that be an apache webserver or a jabber messaging doohickey (whereas e.g. something like travis is basically a domain-specific language for creating and running automated tests). There were a few tricks I needed to employ to make the whole testing process smooth in both cases, which I’ll document here for posterity:

  1. You can ADD a set of files / directories to a docker environment inside your Dockerfile, but if you want your set of tests to pick up any changes made since the environment was created, you really should mount your testing directory inside the container using the -v option.
  2. If you need to download/install a piece of software when building the docker container, use the RUN directive instead of ADD. This will speed up rebuilding the container while you’re iterating on it (because you can take advantage of the Docker layers cache).
  3. You almost certainly want to create a script (example) to streamline all the steps of running the tests: this will make running the tests easier for anyone wanting to contribute to your project and reduce the amount of documentation that you will have to write.
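As a rough illustration of tips 1 and 3, a wrapper script could be as small as the sketch below. The image name, the /app mount point, and the pytest invocation are all placeholders rather than what the actual repositories use; the only docker options relied on are build -t and run --rm, -v and -w.

```python
import os
import shlex


def docker_test_commands(image="myproject-tests", source_dir=".",
                         test_command=("pytest", "tests/")):
    """Return the build and run commands as argument lists (nothing is executed)."""
    source = os.path.abspath(source_dir)
    # Build the image once; layers from RUN steps are cached between builds.
    build = ["docker", "build", "-t", image, source]
    # Mount the working tree with -v so the tests see changes made since the
    # image was built, instead of relying on files ADDed at build time.
    run = ["docker", "run", "--rm", "-v", f"{source}:/app", "-w", "/app",
           image, *test_command]
    return build, run


for command in docker_test_commands():
    print(shlex.join(command))
```

To actually run the tests, each list can be handed to subprocess.run(command, check=True) in order.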

The relevant files and documentation are in the repositories linked above.

Mozilla, Treeherder, Taskcluster

Easier reproduction of intermittent test failures in automation

Apr 5th, 2017

As part of the Stockwell project, I’ve been hacking on ways to make it easier for developers to diagnose failure of our tests in automation. It’s often very difficult to reproduce an intermittent failure we see in Treeherder locally since the environment is so different, but historically it has been a big hassle to get access to the machines we use in automation for various reasons.

One option that rolled out last year was the so-called one-click loaner, which enabled developers to sign out a virtual machine instance identical to the ones used to run unit tests (at least if the tests are running on Taskcluster, which is increasingly often the case), then execute their particular case with whatever extra debugging options they would find useful. This is a big step forward, but it’s still quite a bit of hassle, since it requires a bunch of manual work on the part of the developer to interact with the instance.

What if we could just re-run the particular test an arbitrary number of times with whatever options we wanted, simply by clicking on a few buttons on Treeherder? I’ve been exploring this for the first few months of 2017 and I’ve come up with a prototype which I think is ready for people to start playing with.

The user interface to this is pretty straightforward. Just find a job you want to retrigger in Treeherder:

Then select the ’…’ option in the panel below and press “Custom Action…”:

You should get a small piece of JSON to edit, which corresponds to the configuration for the retriggered job:

The main field to edit is “path”. You should set this to the name of the test you want to try retriggering. For example dom/animation/test/css-transitions/test_animation-ready.html. You can also set custom Firefox preferences and environment variables, to turn on different types of debugging.

Unfortunately as usual with a new feature at Mozilla, there are a bunch of limitations and caveats:

  • This depends on functionality that’s only in Taskcluster, so buildbot jobs are not supported.
  • No support for Android yet. In combination with the above limitation, this implies that this functionality only works on Linux (at least until other platforms are moved to Taskcluster, which hopefully isn’t that far off).
  • Browser chrome tests fail in mysterious ways if run repeatedly (bug 1347654)
  • Only reftest and mochitest are currently supported. XPCShell support is blocked by the lack of support in its harness for running a job repeatedly (bug 1347696). Web Platform Tests need the requisite support in mozharness for just setting up the tests without running them — the same issue that prevents us from debugging such tests with a one-click loaner (bug 1348833).

Aside from fixing the above limitations, the following features would also be really nifty to have:

  • Ability to trigger a custom job as part of a try push (i.e. not needing to retrigger off an existing job)
  • Run these jobs under rr, and provide a way to login and interactively debug when the problem is actually reproduced.

I am actually in the process of moving to another team @ Mozilla (more on that in another post), so I probably won’t have a ton of time to work on the above — but I’d be happy to help anyone who’s interested in developing this idea further.

A special shout out to the Taskcluster team for helping me with the development of this feature: in particular the action task implementation from Jonas Finnemann Jensen that made it possible to develop this feature in the first place.

Mozilla, Treeherder

Cancel all the things

Feb 7th, 2017

I just added a feature to Treeherder which lets you cancel a set of jobs (say, from a botched try push) much more easily. I’m hopeful that this will be helpful in keeping our resource usage on try more under control.

It uses the “pinboard” feature of Treeherder which very few people are familiar with, so I made a very short video tutorial on how to make use of this feature and put it on the Joy of Automation channel:

Happy cancelling!