Mozilla, Treeherder

Training an autoclassifier

Nov 28th, 2016

Here at Mozilla, we’ve accepted that a certain amount of intermittent failure in our automated testing of Firefox is to be expected. That is, for every push, a subset of the tests that we run will fail for reasons that have nothing to do with the quality (or lack thereof) of the push itself.

On the main integration branches that developers commit code to, we have dedicated staff and volunteers called sheriffs who attempt to distinguish these expected failures from intermittents through a manual classification process using Treeherder. On any given push, you can usually find some failed jobs that have stars beside them, this is the work of the sheriffs, indicating that a job’s failure is “nothing to worry about”:

This generally works pretty well, though unfortunately it doesn’t help developers who need to test their changes on Try, which have the same sorts of failures but no sheriffs to watch them or interpret the results. For this reason (and a few others which I won’t go into detail on here), there’s been much interest in having Treeherder autoclassify known failures.

We have a partially implemented version that attempts to do this based on structured (failure line) information, but we’ve had some difficulty creating a reasonable user interface to train it. Sheriffs are used to being able to quickly tag many jobs with the same bug. Having to go through each job’s failure lines and manually annotate each of them is much more time consuming, at least with the approaches that have been tried so far.

It’s quite possible that this is a solvable problem, but I thought it might be an interesting exercise to see how far we could get training an autoclassifier with only the existing per-job classifications as training data. With some recent work I’ve done on refactoring Treeherder’s database, getting a complete set of per-job failure line information is only a small SQL query away:

1
2
3
4
5
select bjm.id, bjm.bug_id, tle.line from bug_job_map as bjm
  left join text_log_step as tls on tls.job_id = bjm.job_id
  left join text_log_error as tle on tle.step_id = tls.id
  where bjm.created > '2016-10-31' and bjm.created < '2016-11-24' and bjm.user_id is not NULL and bjm.bug_id is not NULL
  order by bjm.id, tle.step_id, tle.id;

Just to give some explanation of this query, the “bug_job_map” provides a list of bugs that have been applied to jobs. The “text_log_step” and “text_log_error” tables contain the actual errors that Treeherder has extracted from the textual logs (to explain the failure). From this raw list of mappings and errors, we can construct a data structure incorporating the job, the assigned bug and the textual errors inside it. For example:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
{
"bug_number": 1202623,
"lines": [
  "browser_private_clicktoplay.js Test timed out -",
  "browser_private_clicktoplay.js Found a tab after previous test timed out: http:/<number><number>:<number>/browser/browser/base/content/test/plugins/plugin_test.html -",
  "browser_private_clicktoplay.js Found a browser window after previous test timed out -",
  "browser_private_clicktoplay.js A promise chain failed to handle a rejection:  - at chrome://mochikit/content/browser-test.js:<number> - TypeError: this.SimpleTest.isExpectingUncaughtException is not a function",
  "browser_privatebrowsing_newtab_from_popup.js Test timed out -",
  "browser_privatebrowsing_newtab_from_popup.js Found a browser window after previous test timed out -",
  "browser_privatebrowsing_newtab_from_popup.js Found a browser window after previous test timed out -",
  "browser_privatebrowsing_newtab_from_popup.js Found a browser window
  after previous test timed out -"
  ]
}

Some quick google searching revealed that scikit-learn is a popular tool for experimenting with text classifications. They even had a tutorial on classifying newsgroup posts which seemed tantalizingly close to what we needed to do here. In that example, they wanted to predict which newsgroup a post belonged to based on its content. In our case, we want to predict which existing bug a job failure should belong to based on its error lines.

There are obviously some differences in our domain: test failures are much more regular and structured. There are lots of numbers in them which are mostly irrelevant to the classification (e.g. the “expected 12 pixels different, got 10!” type errors in reftests). Ordering of failures might matter. Still, some of the techniques used on corpora of normal text documents for training a classifier probably map nicely onto what we’re trying to do here: it seems plausible that weighting words which occur more frequently less strongly against ones that are less common would be helpful, for example, and that’s one thing their default transformers does.

In any case, I built up a small little script to download a subset of the downloaded data (from November 1st to November 23rd), used it as training data for a classifier, then tested that against another subset of test failures between November 24th and 28th.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
import os
from sklearn.datasets import load_files
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.linear_model import SGDClassifier


training_set = load_files('training')
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(training_set.data)
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)
clf = SGDClassifier(loss='hinge', penalty='l2',
                    alpha=1e-3, n_iter=5, random_state=42).fit(X_train_tfidf, training_set.target)

num_correct = 0
num_missed = 0

for (subdir, _, fnames) in os.walk('testing/'):
    if fnames:
        bugnum = os.path.basename(subdir)
        print bugnum, fnames
        for fname in fnames:
            doc = open(os.path.join(subdir, fname)).read()
            if not len(doc):
                print "--> (skipping, empty)"
            X_new_counts = count_vect.transform([doc])
            X_new_tfidf = tfidf_transformer.transform(X_new_counts)
            predicted_bugnum = training_set.target_names[clf.predict(X_new_tfidf)[0]]
            if bugnum == predicted_bugnum:
                num_correct += 1
                print "--> correct"
            else:
                num_missed += 1
                print "--> missed (%s)" % predicted_bugnum
print "Correct: %s Missed: %s Ratio: %s" % (num_correct, num_missed, num_correct / float(num_correct + num_missed))

With absolutely no tweaking whatsoever, I got an accuracy rate of 75% on the test data. That is, the algorithm chose the correct classification given the failure text 1312 times out of 1959. Not bad for a first attempt!

After getting that working, I did some initial testing to see if I could get better results by reusing some of the error ETL summary code in Treeherder we use for bug suggestions, but the results were pretty much the same.

So what’s next? This seems like a wide open area to me, but some initial areas that seem worth exploring, if we wanted to take this idea further:

  1. Investigate cases where the autoclassification failed or had a near miss. Is there a pattern here? Is there something simple we could do, either by tweaking the input data or using a better vectorizer/tokenizer?
  2. Have a confidence threshold for using the autoclassifier’s data. It seems likely to me that many of the cases above where we got the wrong were cases where the classifier itself wasn’t that confident in the result (vs. others). We can either present that in the user interface or avoid classifications for these cases altogether (and leave it up to a human being to make a decision on whether this is an intermittent).
  3. Using the structured log data inside the database as input to a classifier. Structured log data here is much more regular and denser than the free text that we’re using. Even if it isn’t explicitly classified, we may well get better results by using it as our input data.

If you’d like to experiment with the data and/or code, I’ve put it up on a github repository.

Mozilla, Treeherder, Performance

Slow Treeherder, Fast Treeherder

Oct 3131, 2016

Just wanted to talk about some recent performance improvements we’ve made recently to Treeherder:

  • Bug 1311511: Changed the repository endpoint so we don’t do 40 redundant database queries (this was generally innocuous, but could delay loading by 400ms if the database was under heavy load).
  • Bug 1310016: Persisted database connections across requests (this can save ~40–50ms per request, of which there can be 5–10 when loading a Treeherder page).
  • Bug 1308782: Don’t download job type and group information from the server to get a “sorting order” for the job lists. This was never necessary, but it’s gotten exponentially more painful as people have added job types to Treeherder (job type information is now around a megabyte of JSON these days). This saves 5–10 seconds on a typical page load.

There’s more to come, but with these changes Treeherder should be faster for everyone to load. It should be particularly noticeable on try pushes, where the last item was by far the largest bottleneck. Here’s a youtube video of the changes:

The original is on the left. The newer, faster Treeherder is on the right. Pay particular attention to how much faster the job information populates.

Moral of the story? Optimization can be helpful, but it’s better if you can avoid doing the work altogether.

Mozilla, Infraherder

Herding Automation Infrastructure

Aug 17th, 2016

For every commit to Firefox, we run a battery of builds and automated tests on the resulting source tree to make sure that the result still works and meets our correctness and performance quality criteria. This is expensive: every new push to our repository implies hundreds of hours of machine time. However, automated quality control is essential to ensure that the product that we’re shipping to users is something that we can be proud of.

But what about evaluating the quality of the product which does the building and testing? Who does that? And by what criteria would we say that our automation system is good or bad? Up to now, our procedures for this have been rather embarassingly adhoc. With some exceptions (such as OrangeFactor), our QA process amounts to motivated engineers doing a one-off analysis of a particular piece of the system, filing a few bugs, then forgetting about it. Occasionally someone will propose turning build and test automation for a specific platform on or off in mozilla.dev.planning.

I’d like to suggest that the time has come to take a more systemic approach to this class of problem. We spend a lot of money on people and machines to maintain this infrastructure, and I think we need a more disciplined approach to make sure that we are getting good value for that investment.

As a starting point, I feel like we need to pay closer attention to the following characteristics of our automation:

  • End-to-end times from push submission to full completion of all build and test jobs: if this gets too long, it makes the lives of all sorts of people painful — tree closures become longer when they happen (because it takes longer to either notice bustage or find out that it’s fixed), developers have to wait longer for try pushes (making them more likely to just push directly to an integration branch, causing the former problem…)
  • Number of machine hours consumed by the different types of test jobs: our resources are large (relatively speaking), but not unlimited. We need proper accounting of where we’re spending money and time. In some cases, resources used to perform a task that we don’t care that much about could be redeployed towards an underresourced task that we do care about. A good example of this was linux32 talos (performance tests) last year: when the question was raised of why we were doing performance testing on this specific platform (in addition to Linux64), no one could come up with a great justification. So we turned the tests off and reconfigured the machines to do Windows performance tests (where we were suffering from a severe lack of capacity).

Over the past week, I’ve been prototyping a project I’ve been calling “Infraherder” which uses the data inside Treeherder’s job database to try to answer these questions (and maybe some others that I haven’t thought of yet). You can see a hacky version of it on my github fork.

Why implement this in Treeherder you might ask? Two reasons. First, Treeherder already stores the job data in a historical archive that’s easy to query (using SQL). Using this directly makes sense over creating a new data store. Second, Treeherder provides a useful set of front-end components with which to build a UI with which to visualize this information. I actually did my initial prototyping inside an ipython notebook, but it quickly became obvious that for my results to be useful to others at Mozilla we needed some kind of real dashboard that people could dig into.

On the Treeherder team at Mozilla, we’ve found the New Relic software to be invaluable for diagnosing and fixing quality and performance problems for Treeherder itself, so I took some inspiration from it (unfortunately the problem space of our automation is not quite the same as that of a web application, so we can’t just use New Relic directly).

There are currently two views in the prototype, a “last finished” view and a “total” view. I’ll describe each of them in turn.

Last finished

This view shows the counts of which scheduled automation jobs were the “last” to finish. The hypothesis is that jobs that are frequently last indicate blockers to developer productivity, as they are the “long pole” in being able to determine if a push is good or bad.

Right away from this view, you can see the mochitest devtools 9 test is often the last to finish on try, with Windows 7 mochitest debug a close second. Assuming that the reasons for this are not resource starvation (they don’t appear to be), we could probably get results into the hands of developers and sheriffs faster if we split these jobs into two seperate ones. I filed bugs 1294489 and 1294706 to address these issues.

Total Time

This view just shows which jobs are taking up the most machine hours.

Probably unsurprisingly, it seems like it’s Android test jobs that are taking up most of the time here: these tests are running on multiple layers of emulation (AWS instances to emulate Linux hardware, then the already slow QEMU-based Android simulator) so are not expected to have fast runtime. I wonder if it might not be worth considering running these tests on faster instances and/or bare metal machines.

Linux32 debug tests seem to be another large consumer of resources. Market conditions make turning these tests off altogether a non-starter (see bug 1255890), but how much value do we really derive from running the debug version of linux32 through automation (given that we’re already doing the same for 64-bit Linux)?

Request for comments

I’ve created an RFC for this project on Google Docs, as a sort of test case for a new process we’re thinking of using in Engineering Productivity for these sorts of projects. If you have any questions or comments, I’d love to hear them! My perspective on this vast problem space is limited, so I’m sure there are things that I’m missing.

Mozilla, Perfherder

Perfherder Quarter of Contribution Summer 2016: Results

Aug 10th, 2016

Following on the footsteps of Mike Ling’s amazing work on Perfherder in 2015 (he’s gone on to do a GSOC project), I got two amazing contributors to continue working on the project for a few weeks this summer as part of our quarter of contribution program: Shruti Jasoria and Roy Chiang.

Shruti started by adding a feature to the treeherder/perfherder backend (ability to enable or disable a new performance framework on a tentative basis), then went on to make all sorts of improvements to the Treeherder / Perfherder frontend, fixing bugs in the performance sheriffing frontend, updating code to use more modern standards (including a gigantic patch to enable a bunch of eslint rules and fix the corresponding problems).

Roy worked all over the codebase, starting with some simple frontend fixes to Treeherder, moving on to fix a large number of nits in Perfherder’s alerts view. My personal favorite is the fact that we now paginate the list of alerts inside this view, which makes navigation waaaaay back into history possible:

alert pagination

You can see a summary of their work at these links:

Thank you Shruti and Roy! You’ve helped to make sure Firefox (and Servo!) performance remains top-notch.

Mozilla

Quarter of Contribution: June / July 2016 edition

May 27th, 2016

Just wanted to announce that, once again, my team (Mozilla Engineering Productivity) is just about to start running another quarter of contribution — a great opportunity for newer community members to dive deep on some of the projects we’re working on, brush up on their programming and problem solving skills, and work with experienced mentors. You can find more information on this program here.

I’ve found this program to be a really great experience on both sides — it’s an opportunity for contributors to really go beyond the “good first bug” style of patches to having a really substantial impact on some of the projects that we’re working on while gaining lots of software development skills that are useful in the real world.

Once again, I’m going to be mentoring one or two people on the Perfherder project, a tool we use to measure and sheriff Firefox performance. If you’re inclined to work on some really interesting data analysis and user interface problems in Python and JavaScript, please have a look at the project page and get in touch. :)

Mozilla, Perfherder

Are We Fast Yet and Perfherder

Mar 30th, 2016

Historically at Mozilla, we’ve had a bunch of different systems running to benchmark Firefox’s performance. The two most broadly-scoped are Talos (which runs as part of our build process, and emphasizes common real-world use cases, like page loading) and Are We Fast Yet (which runs seperately, and emphasizes JavaScript performance and benchmarks).

As many of you know, most of my focus over the last year-and-a-bit has been developing a system called Perfherder, which aims to make monitoring and acting on performance data easier. A great introduction to Perfherder is my project of the month post.

The initial focus of Perfherder has been Talos, which is deeply integrated into our automation and also maintained by Engineering Productivity (my group). However, the intention was always to allow anyone in the Mozilla community to submit performance data for Firefox and sheriff it, much like Treeherder has supported the submission of test result data from third parties (e.g. autophone, Firefox UI tests). There are more commonalities than differences in how we do performance sheriffing with Are We Fast Yet (which currently has its own web interface) and Perfherder, so it made sense to see if we could pool resources.

So, over the last couple of months, Joel Maher and I have been in discussions with Hannes Verschore, current maintainer of Are We Fast Yet (AWFY) to see what could be done. It looks like it is possible for Perfherder to provide most of what AWFY needs, though there are a few exceptions. I thought for the benefit of others, it might be useful to outline what’s done, what’s coming next, and what might not be implemented (at least not any time soon).

What’s done

  • Get AWFY submitting data to Perfherder and allow it to be sheriffed seperately from Talos. This is working on treeherder stage, and you can already examine the alert data.

What’s in progress (or in the near-term pipeline)

  • Allow custom alerting behaviour (bug 1254595). For example, we want to alert on subtests for AWFY while still summarizing the results. This is something we don’t currently support.
  • Allow creating an alert manually (bug 1260791). Sadly, our regression detection algorithm is not perfect. AWFY already supports this, we should too. This is something we also want for Talos.
  • Make regression-filing templates non-talos-specific (bug 1260805). Currently we have a convenience template for filing bugs for performance regressions, but this is currently specific to various things about Talos (job running instructions, links to documentation, etc.). We should make it configurable so other projects like AWFY can take advantage of this functionality.

Under consideration

  • Some kind of support for bisecting a push to figure out which patch caused a regression. AWFY currently supports this, but it’s a fairly difficult thing to add to Perfherder (much of which is built upon Treeherder’s per-push result model). Maybe this is something we should do, but it would be a significant amount of effort.
  • Proprietary benchmarks: AWFY runs one benchmark the results for which we can’t make public. Adding “private” jobs or results to Treeherder is likely a big can of worms, but it might be something we want to do eventually.

Probably won’t fix

  • Supporting comparative measurements between Firefox and other browsers. This is an important task, but doesn’t really fit into the model of Perfherder, which is intimately tied to the revision data associated with Firefox. To do this would require detailed tracking of Chrome on the same basis, and I don’t think that’s really a place where we want to go. We should definitely monitor for general trends, but I think that is best done with a seperate system.

Mozilla, Perfherder

Platform engineering project of the month: Perfherder

Mar 14th, 2016

[ originally posted on mozilla.dev.platform ]

Hello from Platform Engineering Operations! Once a month we highlight one of our projects to help the Mozilla community discover a useful tool or an interesting contribution opportunity.

This month’s project is Perfherder!

What is Perfherder?

Perfherder is a generic system for visualizing and analyzing performance data produced by the many automated tests we run here at Mozilla (such as Talos, “Are we fast yet?” or “Are we slim yet?”). The chief goal of the project is to make sure that performance of Firefox gets better, not worse over time. It does this by:

  • Tracking the performance generated by our automated tests, allowing them to be visualized on a graph.
  • Providing a sheriffing dashboard which allows for incoming alerts of performance regressions to be annotated and triaged - bugs can be filed based on a template and their resolution status can be tracked.

In addition to its own user interface, Perfherder also provides an API on the backend that other people can use to build custom performance visualizations and dashboards. For example, the metrics group has been working on a set of release quality indices for performance based on Perfherder data:

https://metrics.mozilla.com/quality-indices/

How it works

Perfherder is part of Treeherder, building on that project’s existing support for tracking revision and test job information. Like the rest of Treeherder, Perfherder’s backend is written in Python, using the Django web framework. The user interface is written as an AngularJS application.

Learning more

For more information on Perfherder than you ever wanted to know, please see the wiki page:

https://wiki.mozilla.org/EngineeringProductivity/Projects/Perfherder

Can I contribute?

Yes! We have had some fantastic contributions from the community to Perfherder, and are always looking for more. This is a great way to help developers make Firefox faster (or use less memory). The core of Perfherder is relatively small, so this is a great chance to learn either Django or Angular if you have a small amount of Python and/or JavaScript experience.

We have set aside a set of bugs that are suitable for getting started here:

https://bugzilla.mozilla.org/buglist.cgi?list_id=12722722&resolution=---&status_whiteboard_type=allwordssubstr&query_format=advanced&status_whiteboard=perfherder-starter-bug

For more information on contributing to Perfherder, please see the contribution section of the above wiki page:

https://wiki.mozilla.org/EngineeringProductivity/Projects/Perfherder#Contribution

Mozilla, Talos

Talos suites now visible from trychooser

Feb 13th, 2016

It’s a small thing, but I submitted a patch to trychooser last week which adds a tooltip indicating the actual Talos tests that are run as part of the various jobs that you can schedule as part of a try push. It’s in production as of now:

Previously, the only way to do this was to dig into the actual buildbot code, which was more than a little annoying.

If you think your patch might have a good chance of regressing performance, please do run the Talos tests before you check in. It’s much less work for all of us when these things are caught before integration and back outs are no fun for anyone. We really need better documentation for this stuff, but meanwhile if you need help with this, please ask in the #perf channel on irc.mozilla.org

zen

Albert Low

Feb 7th, 2016

I was saddened to find out last week that the person who introduced me to Zen practice three years ago, Albert Low, has passed away. Albert was the teacher of the Montreal Zen Center, which I was a member of for a brief period (6 months) in 2014 before I moved to Toronto and started practicing at the center here.

Albert’s instruction was the gateway to a practice that has had a profound impact on my life. More than anything, he helped me understand Zen as something that one could incorporate directly into daily life. I will remain forever grateful.

BIXI, Nixi

NIXI is moving too

Jan 8th, 2016

As my blog goes to github pages, so do my other side projects. I just moved nixi, my bikestation finder project, to github pages. Its new location:

http://wlach.github.io/nixi

I opted not to move over the domain: it would have cost extra money, time and hassle and I couldn’t justify it for the very, very small number of people that still use this site (yes, there are a few, including myself!). For now, nixi.ca will redirect to the github pages site until I decommision my linode server, probably at the end of January (end of Feburary at the latest).

This transition brings some other changes with it:

  • Now using the citybik.es API directly, instead of proxing through an intermediary server. This was necessitated by the switch to github pages, but I suspect this will be more reliable than what we were doing before. Thanks citybik.es!
  • Removed all analytics and facebook integration. As with the domain, it didn’t seem worth bringing over. Also, it’s nice to give people at least marginally more privacy than they had before where possible.

I still think nixi is worlds more usable than most bikesharing maps, even if it’s not an actively maintained project of mine any more. Here’s hoping it lasts many more years in its new incarnation.