Sunday, September 27, 2015

GT data mining demonstration - finances

Prediction of the daily US $ up/down change



Edith Ohri, edith@fabhighq.com




The goal


In this demo the goal was to predict the next day's $ direction (UP or DOWN) with 55% accuracy.

To attain this, one is required to establish objective rules and formulas that are independent of the specific input.

  

How it works


a -  Deciding on input (from already existing available sources)
b -  Defining with GT the patterns of behavior (= groups) and cause-effect formulas.
c -   The above constitute an expert system that is then used for early alerts and real-time decisions.
d -  The expert system can improve itself and periodically reviews its rules.

Note: GT's formulas can be integrated in the control of almost any product.


The data set

The data include 760 daily records over a two-and-a-half-year period, and 7 variables: Date, Open price, Close price, High, Low, and an index named RSI (Relative Strength Index). RSI compares the magnitude of recent gains to recent losses in an attempt to determine overbought and oversold conditions. A value above 70 indicates that a stock is overbought, and a value below 30 that it is oversold; in both cases it is vulnerable to a trend reversal.
Rem.: Trade Volume information could not be obtained for this demo.
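
To make the RSI description concrete, here is a minimal sketch of one common way to compute it from closing prices. The post does not say which RSI variant or look-back period the data source used, so the 14-day period, the plain (unsmoothed) averages, the function name `simple_rsi`, and the neutral value of 50 for a completely flat series are all assumptions made for illustration.

```python
def simple_rsi(closes, period=14):
    """RSI over the last `period` price changes, using plain averages.

    Assumption: simple averages of gains and losses; other variants
    smooth these averages and give slightly different values.
    """
    if len(closes) < period + 1:
        raise ValueError("need at least period + 1 closing prices")
    # Day-to-day changes over the look-back window.
    window = closes[-(period + 1):]
    changes = [b - a for a, b in zip(window[:-1], window[1:])]
    gains = sum(c for c in changes if c > 0)
    losses = sum(-c for c in changes if c < 0)
    if losses == 0:
        # No losing days: fully "overbought" (or neutral if nothing moved).
        return 100.0 if gains > 0 else 50.0
    rs = gains / losses  # ratio of sums equals ratio of averages
    return 100.0 - 100.0 / (1.0 + rs)


def rsi_signal(value, high=70, low=30):
    """Label an RSI value using the 70/30 thresholds mentioned in the text."""
    if value > high:
        return "overbought"
    if value < low:
        return "oversold"
    return "neutral"
```

A series of only rising closes gives an RSI of 100, only falling closes gives 0, and equal total gains and losses give 50.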

On top of the 7 basic variables, another 30 or more calculated variables were added, such as Trends, Week Days etc.
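
As an illustration of what such calculated variables could look like, the sketch below derives a weekday, a daily change, a short trend sign, and the next-day UP/DOWN target from dated closing prices. The post does not list the actual 30 or more variables, so the field names (`weekday`, `change`, `trend`, `next_up`), the `add_derived` helper, and the 3-record trend window are hypothetical.

```python
from datetime import date

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def add_derived(records, trend_days=3):
    """Return copies of `records` (oldest first) with extra calculated fields.

    Each record is a dict with a 'date' (datetime.date) and a 'close' price.
    """
    rows = []
    for i, rec in enumerate(records):
        row = dict(rec)
        row["weekday"] = WEEKDAYS[rec["date"].weekday()]
        # Change versus the previous record (unknown for the first one).
        row["change"] = rec["close"] - records[i - 1]["close"] if i > 0 else None
        # Trend: +1, 0 or -1 according to the move over `trend_days` records.
        if i >= trend_days:
            diff = rec["close"] - records[i - trend_days]["close"]
            row["trend"] = (diff > 0) - (diff < 0)
        else:
            row["trend"] = None
        # Target: does the NEXT record close higher? Unknown for the last one.
        if i + 1 < len(records):
            row["next_up"] = records[i + 1]["close"] > rec["close"]
        else:
            row["next_up"] = None
        rows.append(row)
    return rows
```

Only the target `next_up` looks forward in time; every other field uses the current and earlier records, so it would be available at prediction time.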

The Test set includes 122 records from the end of the period.
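
A chronological split like this can be sketched in a few lines. The function name `split_by_time` is hypothetical; the sizes come from the post (760 records in total, the last 122 held out for testing).

```python
def split_by_time(records, test_size=122):
    """Split records (oldest first) into a learning set and a test set.

    The test set is taken from the END of the period, so nothing the
    rules learn from is later in time than what they are tested on.
    """
    if not 0 < test_size < len(records):
        raise ValueError("test_size must be between 1 and len(records) - 1")
    return records[:-test_size], records[-test_size:]


learning, test = split_by_time(list(range(760)))
print(len(learning), len(test))  # 638 122
```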


Figure 1  Input records


The GT Learning results

The first step is to establish a lower hurdle, which is "the best results that one can achieve without the GT algorithm".
Here the lower hurdle was 55.7% right predictions in the Test set, and 56.6% in the Learning set.

Rem.: these good results are credited to the discovery of typical Close price changes for each weekday.
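
The post does not spell out the weekday rule behind the lower hurdle. One simple possibility, shown here purely as an assumed illustration, is to predict for each weekday the direction that occurred most often on that weekday in the Learning set. The names `fit_weekday_rule` and `accuracy` are hypothetical, and the sample data are made up.

```python
from collections import Counter, defaultdict


def fit_weekday_rule(learning):
    """learning: iterable of (weekday, went_up) pairs.

    Returns {weekday: predicted_up}, the majority direction per weekday.
    Ties are resolved as UP (an arbitrary choice).
    """
    counts = defaultdict(Counter)
    for weekday, went_up in learning:
        counts[weekday][went_up] += 1
    return {day: c[True] >= c[False] for day, c in counts.items()}


def accuracy(rule, records, default=True):
    """Share of records whose direction the rule predicts correctly."""
    records = list(records)
    hits = sum(rule.get(day, default) == went_up for day, went_up in records)
    return hits / len(records)


# Made-up sample: Mondays mostly up, Fridays mostly down.
sample = [("Mon", True)] * 3 + [("Mon", False)] + [("Fri", False)] * 2 + [("Fri", True)]
rule = fit_weekday_rule(sample)
print(rule)  # {'Mon': True, 'Fri': False}
```

The rule would be fitted on the Learning set only and then scored separately on both sets, which is how two figures such as 56.6% and 55.7% can arise.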

Reaching beyond the assigned target

The assigned target of 55% prediction success was achieved, but it can be further improved with the GT Patterns-of-Behavior definition.

It is a well-known (and quite intuitive) law that greater precision can always be attained by adjusting the prediction factors to the subgroups of a given dataset. What follows is a short demonstration of this law, employing the special abilities of the GT algorithm.
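
The law can be illustrated with a toy example that has nothing to do with GT itself: predicting the majority direction separately inside each subgroup never scores lower, on the data it is fitted to, than predicting one overall majority direction. The group labels and data below are made up, and the function names are hypothetical. Note that this guarantee holds for the fitted data only; on held-out records the gain has to be verified, which is what the Test set is for.

```python
from collections import Counter, defaultdict


def majority_accuracy(labels):
    """Accuracy of always predicting the single most common label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def grouped_majority_accuracy(pairs):
    """Accuracy of predicting the most common label within each group.

    pairs: list of (group, label) tuples.
    """
    groups = defaultdict(list)
    for group, label in pairs:
        groups[group].append(label)
    hits = sum(max(Counter(labels).values()) for labels in groups.values())
    return hits / len(pairs)


# Made-up data: group A is mostly UP, group B is mostly DOWN.
pairs = [("A", "UP")] * 3 + [("A", "DOWN")] + [("B", "DOWN")] * 3 + [("B", "UP")]
print(majority_accuracy([label for _, label in pairs]))  # 0.5
print(grouped_majority_accuracy(pairs))                  # 0.75
```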


GT Results* 

(* Initial results, for this demonstration)

Count  of true/false predictions:
Right -   59%
Wrong - 41%


    • An improvement of about 3 percentage points in right predictions (59% versus the lower hurdle of roughly 56%) was achieved at just the beginning of the GT process.
    • With full data mining and input that includes detailed transactions, further significant improvement can be expected.

Improvement tips

      
1.    Include non-linear variables if there are any, for example "RSI", a non-linear Relative Strength Index that describes the pressure on prices due to excess Demand or Supply.

2.    Split the data into hierarchical patterns of behavior.
  
3.    Avoid "overfitting" by moving on to new subsets of data once the information in the current ones has been exhausted.
  

 Conclusion of example demo

   
GT proves effective in predicting the daily USD trend.
Finding the patterns (clusters) enables a separate prediction for each segment, and therefore greater precision.



GT success is in its Industrial & Management Engineering roots

  
a.      Its first application was on the job, where the assignment was very practical: to improve the line work-flow, not to invent a theoretical model.

b.      Industrial Engineers are almost never experts in the area of application, therefore the model needed to be strengthened with scientific internal validations.

c.      As is often done in IE, the development was carried out without investors. That enabled a very long incubation period and the accumulation of important personal experience.

d.      The IE practical approach led to focusing on "discovery of hidden patterns", instead of the more academic approach that prioritizes correlations and the speed of execution.

e.      Full cycle product costs of implementation are considered, no hard sell wizardry.

f.       Real work forced an early start on the algorithm, which turned out to help greatly in avoiding conventional misconceptions...

g.      Product development means primarily its work method substantiation, not its market-share.

h.      From an IE perspective it is only natural to offer an option of SaaS.

i.        IE should always adhere to the actual implementation on top of business musts.

j.        High-tech or not "we do business the old way, we earn it".




~~~~

Edith Ohri, Home of GT data mining

Tuesday, September 15, 2015

Digging in financial data

Using any data for in-depth conclusions

Lessons from a GT study* of 1,000 NYSE companies from the year 2000, just before the dot-com bubble crash.

---------------
https://docs.google.com/file/d/0B1tc2-duf3_4YzM2M2M2OWMtZjAwNS00Y2FlLWJhOWUtOTc3ZjM3NTY1YzVm/edit?usp=sharing


Conclusion 1
A pattern of behavior can be as small as a fraction of 1% of the total number of events.
In this study, GT found a tiny subgroup containing only 4 "exception" companies out of 1,000. It consists of 4 banks with the exceptional feature of very high net profit, twice as much as others in the financial sector. An explanation for their unusually high performance was offered 8 years(!) later, during the 2008 credit/derivatives crisis, when the 4 banks' names were mentioned in news headlines.

Conclusion 2
Large data sets require a general view on top of the detailed one. Here the general view fits the common Industries definition almost exactly. There is only one difference, yet a most significant one: some giant corporations are found to behave like financial institutions rather than like their own Industries. This observation strengthens our understanding of the 2008 crisis.

Conclusion 3
Big data is about using AVAILABLE unsupervised data, without the cleaning that is commonly suggested.
This study is based on free data from http://www.ics.uci.edu. The data quality seems insufficient for research: there is no historical "depth", no share value information, and the sample does not reflect the subgroups. Yet GT returned quite good results. It means that data are useful even if partial and unsupervised!