Statistics and dataset diagnostics

As a demonstration, let us first load the REDD dataset (which has already been converted to HDF5 format):

Basic stats for a whole building

proportion_of_energy_submetered reports the proportion of a building's energy that is submetered, where 0 means no energy is submetered and 1 means all energy is submetered:
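The idea behind this statistic can be sketched in plain Python. This is only an illustration of the ratio, not nilmtk's implementation: the helper name submetered_fraction and the energy figures are hypothetical, and it assumes the mains and submeter energies cover the same time window.

```python
def submetered_fraction(mains_kwh, submeter_kwh):
    """Return submetered energy as a fraction of mains energy.

    Hypothetical helper (not part of nilmtk).
    mains_kwh    -- total energy seen by the mains meter(s)
    submeter_kwh -- energy seen by each submeter over the same window
    """
    if mains_kwh <= 0:
        raise ValueError("mains energy must be positive")
    return sum(submeter_kwh) / mains_kwh


# Made-up numbers: 10 kWh at the mains, three submeters covering 7.5 kWh.
fraction = submetered_fraction(10.0, [2.0, 3.0, 2.5])
print(fraction)
```

A value well below 1 means a large share of the building's consumption is not attributed to any submetered circuit.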

Diagnosing problems with data

There are two reasons why data might not be recorded:

  1. The appliance and appliance monitor were unplugged from the mains (hence the appliance is off when the appliance monitor is off).
  2. The appliance monitor is misbehaving (hence we have no reliable information about the state of the appliance).

nilmtk has a number of functions to help find periods where samples for one or more sensors were not recorded.

By default, plot_missing_samples_using_rectangles plots a rectangle wherever there is a gap in the data, where a ‘gap’ is defined by the max_sample_period argument: if two consecutive samples are more than max_sample_period apart, the interval between them is treated as a gap. The default is 4 × sample_period. The plot below shows that the two mains channels are inactive for most of the second half of May 2011:

The advantages of plot_missing_samples_using_rectangles are:

  • It clearly shows large gaps.
  • It shows all the data, so you can zoom in to your heart’s content.

The disadvantages are:

  • The choice of max_sample_period is somewhat subjective.
  • Because it draws many rectangles, it can be slow to plot.
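To see why the choice of max_sample_period matters, here is a minimal sketch of the gap rule in plain Python. It is not nilmtk's code: find_gaps is a hypothetical helper, the timestamps are made up, and the only thing taken from the text is the rule itself (a gap is any pair of consecutive samples more than max_sample_period apart, defaulting to 4 × sample_period).

```python
from datetime import datetime, timedelta


def find_gaps(timestamps, sample_period, max_sample_period=None):
    """Return (gap_start, gap_end) pairs for a sorted list of timestamps.

    Hypothetical helper (not part of nilmtk).
    """
    if max_sample_period is None:
        # Default described in the text: 4 x sample_period.
        max_sample_period = 4 * sample_period
    gaps = []
    for previous, current in zip(timestamps, timestamps[1:]):
        if current - previous > max_sample_period:
            gaps.append((previous, current))
    return gaps


start = datetime(2011, 5, 1)
period = timedelta(seconds=3)
# Made-up samples at 0 s, 3 s and 6 s, then a hole until 66 s and 69 s.
offsets = [0, 3, 6, 66, 69]
timestamps = [start + timedelta(seconds=s) for s in offsets]

gaps = find_gaps(timestamps, period)  # threshold is 12 s
relaxed = find_gaps(timestamps, period, max_sample_period=timedelta(seconds=120))
print(len(gaps), len(relaxed))
```

With the default threshold the 60-second hole is reported as a gap; with a 120-second threshold the same data has no gaps at all, which is the subjectivity mentioned above.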

To overcome both of these disadvantages, we have a sister function:

Here, the darkness of the blue colour indicates the proportion of samples lost: dark blue means all samples are lost, light blue means some samples are lost, and white means no samples are lost. Compared with the plot_missing_samples_using_rectangles plot, the plot_missing_samples_using_bitmap plot shows that the circuits in REDD always lose more than 20% of their samples, but that these dropouts are spread evenly.
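The quantity that the colour encodes can be sketched as a per-bin dropout rate: divide time into equal bins, count the samples that arrived in each bin, and compare with the number expected from the sample period. The sketch below is an assumption about how such a value could be computed, not nilmtk's implementation; dropout_rate_per_bin and the sample times are hypothetical. Setting the bin width to one day gives a per-day dropout rate.

```python
def dropout_rate_per_bin(sample_times, start, end, bin_width, sample_period):
    """Return the fraction of samples lost in each bin of [start, end).

    Hypothetical helper (not part of nilmtk). All arguments are in seconds.
    0.0 means no samples lost, 1.0 means every sample lost.
    """
    expected_per_bin = bin_width / sample_period
    n_bins = int((end - start) // bin_width)
    counts = [0] * n_bins
    for t in sample_times:
        index = int((t - start) // bin_width)
        if 0 <= index < n_bins:
            counts[index] += 1
    # Clamp at 0 in case a bin holds more samples than expected.
    return [max(0.0, 1.0 - count / expected_per_bin) for count in counts]


# Made-up 1 Hz data over three 10-second bins:
# bin 0 is complete, bin 1 is missing 2 of 10 samples, bin 2 is empty.
sample_times = list(range(0, 10)) + list(range(10, 18))
rates = dropout_rate_per_bin(sample_times, start=0, end=30,
                             bin_width=10, sample_period=1)
print(rates)
```

In the bitmap, these three bins would be drawn as white, light blue and dark blue respectively.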

Exploring a single appliance

Let’s get a more precise understanding of the dropout rate of a REDD circuit by getting the dropout rate per day:

And a histogram of power consumption:

So we now know that the oven spends much of its time consuming about 2-50 watts, but it appears to be properly ‘on’ only when it is consuming over 1600 watts. Because 1000 watts sits well clear of both ranges, let’s use it as the on power threshold.
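The reasoning above can be sketched with a small histogram and a threshold test. The readings below are made up to mimic the two ranges described in the text, power_histogram is a hypothetical helper rather than a nilmtk function, and treating a reading equal to the threshold as ‘on’ is an assumption.

```python
from collections import Counter

ON_POWER_THRESHOLD = 1000  # watts, the value chosen in the text


def power_histogram(readings_watts, bin_width):
    """Count non-negative readings per bin, keyed by each bin's lower edge.

    Hypothetical helper (not part of nilmtk).
    """
    counts = Counter((int(p) // bin_width) * bin_width for p in readings_watts)
    return dict(sorted(counts.items()))


# Made-up oven readings: mostly standby-level, a few above 1600 W.
readings = [3, 12, 45, 8, 1650, 1720, 30, 1690, 5, 22]

histogram = power_histogram(readings, bin_width=500)
on_flags = [p >= ON_POWER_THRESHOLD for p in readings]
fraction_on = sum(on_flags) / len(readings)
print(histogram, fraction_on)
```

Because the histogram has nothing between the two clusters, any threshold in that empty region classifies these readings identically; 1000 watts is simply a convenient value inside it.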

And some more stats:

And we can plot some histograms to get an understanding of the behaviour of an appliance. Let’s see the usage of the appliance hour-by-hour over an average day:

Not surprisingly, the oven is used most often around lunch and dinner times.

Or the behaviour day-by-day over an average week:

We can see that not much cooking was done in the middle of the week.
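Both histograms come down to bucketing activations by a calendar field. The sketch below counts made-up activation start times by hour of day and by day of week. Whether nilmtk counts activation starts or time spent on is not stated here, so counting starts is an assumption, and the names are hypothetical.

```python
from collections import Counter
from datetime import datetime

WEEKDAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]

# Made-up times at which the appliance switched on.
activation_starts = [
    datetime(2011, 4, 18, 12, 5),   # Monday lunch
    datetime(2011, 4, 18, 18, 40),  # Monday dinner
    datetime(2011, 4, 19, 12, 30),  # Tuesday lunch
    datetime(2011, 4, 20, 19, 10),  # Wednesday dinner
    datetime(2011, 4, 23, 18, 15),  # Saturday dinner
]

# Hour-by-hour usage over an average day.
by_hour = Counter(t.hour for t in activation_starts)

# Day-by-day usage over an average week (datetime.weekday(): Monday is 0).
by_weekday = Counter(WEEKDAY_NAMES[t.weekday()] for t in activation_starts)

print(sorted(by_hour.items()))
print(dict(by_weekday))
```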

Let’s find out how long the oven tends to stay active across the dataset.
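One way to measure this is to find each unbroken run of samples at or above the on power threshold and record how long it lasts. The sketch below is a simplified illustration under stated assumptions, not nilmtk's method: activation_durations is a hypothetical helper, the samples are made up, and gaps in the data are ignored.

```python
def activation_durations(samples, on_power_threshold):
    """Return the duration in seconds of each run of 'on' samples.

    Hypothetical helper (not part of nilmtk).
    samples -- list of (time_seconds, watts) pairs, sorted by time
    A run ends at the first sample that falls below the threshold.
    """
    durations = []
    run_start = None
    for time_s, watts in samples:
        if watts >= on_power_threshold:
            if run_start is None:
                run_start = time_s  # appliance has just switched on
        elif run_start is not None:
            durations.append(time_s - run_start)
            run_start = None
    if run_start is not None:
        # Still on at the end of the data: close the run at the last sample.
        durations.append(samples[-1][0] - run_start)
    return durations


# Made-up one-minute samples: on from 60 s to 240 s, and again from 360 s to 420 s.
samples = [(0, 5), (60, 1650), (120, 1700), (180, 1680),
           (240, 10), (300, 8), (360, 1640), (420, 12)]
durations = activation_durations(samples, on_power_threshold=1000)
print(durations)
```

A histogram of these durations then shows the typical length of an activation.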