add notes regarding real world gotchas

2016-05-18 14:29:40 +00:00 · 2016-05-18 14:29:40 +00:00 · eb5d3bf8c5
commit eb5d3bf8c5
parent c0a018ed26
2 changed files with 37 additions and 4 deletions
--- a/README.md
+++ b/README.md
@@ -1,14 +1,47 @@
-## k-Shape
+# k-Shape
 Python implementation of [k-Shape](http://www.cs.columbia.edu/~jopa/kshape.html),
 a new, fast, and accurate unsupervised time-series clustering algorithm.
-### Usage
+## Usage
 ```
 from kshape import kshape, zscore
-time_series = [[1,2,3,4], [0,1,2,3], [-1,1,-1,1], [1,2,2,3]]
+time_series = [[1,2,3,4], [0,1,2,3], [0,1,2,3], [1,2,2,3]]
 cluster_num = 2
 clusters = kshape(zscore(time_series), cluster_num)
 #=> [(array([-0.42860026, -1.15025211,  1.38751707, -0.42860026,  0.61993557]), [3]),
 #    (array([-1.56839539, -0.40686255,  0.84042433,  0.67778452,  0.45704908]), [0, 1, 2])]
 ```
 Returns a list of tuples, one per cluster found by kshape. The first value of
 each tuple is the z-score-normalized centroid. The second value is the list of
 indices of the series assigned to that cluster.
 The results can be examined by drawing graphs of the zscore normalized values
 and the corresponding centroid.
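Before drawing such graphs, it can help to regroup the series by cluster. The following is a minimal sketch that assumes only the return shape described above (a list of `(centroid, indices)` tuples). `group_by_cluster` is a hypothetical helper, not part of kshape, and the numbers are hand-written stand-ins rather than real kshape output.

```python
def group_by_cluster(clusters, series):
    """Map each cluster number to its centroid and its member series.

    `clusters` is assumed to be a list of (centroid, indices) tuples and
    `series` the list of (ideally z-score-normalized) input series.
    """
    grouped = {}
    for cluster_id, (centroid, indices) in enumerate(clusters):
        grouped[cluster_id] = {
            "centroid": list(centroid),
            "members": [series[i] for i in indices],
        }
    return grouped


# Hand-written stand-in for a kshape() result (illustrative values only).
example_clusters = [([0.5, -0.5, 0.0], [2]), ([-1.0, 0.0, 1.0], [0, 1])]
example_series = [[1, 2, 3], [0, 1, 2], [5, 1, 3]]

grouped = group_by_cluster(example_clusters, example_series)
print(grouped[1]["members"])  # [[1, 2, 3], [0, 1, 2]]
```

Each entry of `grouped` then holds everything needed for one plot: the centroid and the series assigned to it.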
 ## Gotchas when working with real-world time series
 - If the data comes from different sources with the same frequency but sampled at different points in time, it needs to be aligned.
 - The following assumes a tab-separated file in which each column is a different time series;
  gaps in a column occur when a value was recorded for only some of the series at a given timestamp.
 ```
 import pandas as pd
 # assuming the time series are stored in a tab-separated file, where `time` is
 # the name of the column containing the timestamp
 df = pd.read_csv(filename, sep="\t", index_col='time', parse_dates=True)
 # back-fill gaps with the next available observation
 df = df.bfill()
 # collapse rows that share a timestamp, keeping the first non-null value per column
 df = df.groupby(level=0).first()
 ```
 - kshape also expects every time series to be free of missing ('n/a') values and to have non-constant observations
 ```
 time_series = []
 for f in df.columns:
     if not df[f].isnull().any() and df[f].var() != 0:
         time_series.append(df[f])
 ```
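The same check can be sketched without pandas, which makes the two conditions explicit. A likely reason constant series cause trouble is that z-score normalization divides by the standard deviation, which is zero for a constant series; this is an assumption about the cause, not something the README states. `is_usable` is a hypothetical helper name, not part of kshape.

```python
import math
import statistics


def is_usable(series):
    """Return True if a series has no missing values and is not constant."""
    values = list(series)
    if len(values) < 2:
        # Too short to have a meaningful variance.
        return False
    if any(v is None or math.isnan(v) for v in values):
        # Missing observations ('n/a') are rejected.
        return False
    # A constant series has zero variance and cannot be z-normalized.
    return statistics.pvariance(values) != 0


# Illustrative data: one usable, one constant, one with a missing value.
candidates = {
    "a": [1.0, 2.0, 3.0],
    "b": [2.0, 2.0, 2.0],
    "c": [1.0, float("nan"), 3.0],
}
kept = [name for name, values in candidates.items() if is_usable(values)]
print(kept)  # ['a']
```

Only the series that pass both checks would then be handed to `zscore` and `kshape`.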
--- a/example.py
+++ b/example.py
@@ -1,6 +1,6 @@
 from kshape import kshape, zscore
-time_series = [[1,2,3,4], [0,1,2,3], [-1,1,-1,1], [1,2,2,3]]
+time_series = [[1,2,3,4,5], [0,1,2,3,4], [3,2,1,0,-1], [1,2,2,3,3]]
 cluster_num = 2
 clusters = kshape(zscore(time_series), cluster_num)
 print(clusters)