
I have now updated the script with some much-needed changes and additions.


Firstly, the Date column is now the index of the dataframe and dates are sorted from oldest to most recent. If you get an error message like:


ValueError: time data '2017-03-23' does not match format '%d/%m/%Y'


Then it means the date format in your data set does not match what the script expects, and you may need to change the following function:


def date_parser(date_string):
    date = pd.to_datetime(date_string, format="%d/%m/%Y")    # dd/mm/yyyy
    #date = pd.to_datetime(date_string, format="%d-%m-%Y")   # dd-mm-yyyy
    #date = pd.to_datetime(date_string, format="%Y/%m/%d")   # yyyy/mm/dd
    #date = pd.to_datetime(date_string, format="%Y-%m-%d")   # yyyy-mm-dd
    return date


As you can see, I have it set to dd/mm/yyyy to match my test data sets posted above.


Alternatively, in the function call pd.read_csv(...parse_dates=[DATEHEADER], date_parser=date_parser), you can replace the "date_parser=date_parser" entry with the generic parser like so:


pd.read_csv(...parse_dates=[DATEHEADER], infer_datetime_format=True)


and this will attempt to cleverly parse the date format no matter what is thrown at it. 


The generic parser works with my generated "quotes.csv" file from my yahoo download thread, where the format is yyyy-mm-dd. However, when applied to my test data sets posted above, where the format is dd/mm/yyyy, the generic version treats the dates as mm/dd/yyyy, which I believe is a US format. So it always pays to check that the generic parser is working correctly when taking the lazy approach to parsing dates.
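
You can see the ambiguity with a quick check. If you know your data is day-first, the dayfirst flag (accepted by both pd.to_datetime and pd.read_csv) is a middle ground between the lazy approach and writing your own parser:

import pandas as pd

# '03/04/2017' is ambiguous: 3 April (dd/mm) or 4 March (mm/dd)?
print(pd.to_datetime("03/04/2017"))                 # 2017-03-04 <- parsed US-style by default
print(pd.to_datetime("03/04/2017", dayfirst=True))  # 2017-04-03 <- parsed as dd/mm/yyyy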


Secondly, the following common EOD trading TA functions have been added to the script to make things more interesting (and slower):


SMA(df, column_pos, n, sh=0):  #Simple Moving Average in n periods shifted/looking sh periods back

EMA(df, column_pos, n, sh=0):  #Exponential Moving Average in n periods shifted/looking sh periods back

HHV(df, column_pos, n, sh=0):  #Highest High Value in n periods shifted/looking sh periods back (upper band of Donchian channel if sh=1)

LLV(df, column_pos, n, sh=0):  #Lowest Low Value in n periods shifted/looking sh periods back (lower band of Donchian channel if sh=1)

CUpDown(df, column_pos, sh):   #Number of consecutive periods up (>0), down (<0), or 0 if breakeven

KER(df, column_pos, n=20):     #Kaufman's Efficiency Ratio

MOM(df, column_pos, n):        #Momentum

ROC(df, column_pos, n):        #Rate of Change

StdDev(df, column_pos, n):     #Standard Deviation

BBANDS(df, column_pos, n, sd): #Bollinger Bands

MACD(df, n_fast=12, n_slow=26, n_sign=9): #MACD, MACD Signal and MACD Hist

CCI(df, n=20):                 #Commodity Channel Index

STO(df, fsn=14, ssn=3):        #Stochastic Oscillator

RSI(df, n=14):                 #Relative Strength Index

TO(df):                        #Turnover (close x volume)

AvgTO(df, n):                  #Average Turnover (average(close) x average(volume))

OBV(df):                       #On-Balance Volume

AccDist(df):                   #Accumulation Distribution

ChaOsc(df, n1=3, n2=10):       #Chaikin Oscillator (includes Accumulation Distribution)

MFI(df, n=14):                 #Money Flow Index (aka volume-weighted RSI)

ATR(df, n=14):                 #Average True Range

ATRW(df, n=14):                #Average True Range Welles Wilder

ADX(df, n=14):                 #Average Directional Movement Index

PSAR(df, afi=0.01, afm=0.2):   #Parabolic SAR

Aroon(df, n=25):               #Aroon
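
As covered further down, each function takes the dataframe and returns it with the result column(s) appended, so typical calls look like:

df = SMA(df, 3, 5)    # 5 period SMA of the close (column positions explained below)
df = ATRW(df, 14)     # 14 period Welles Wilder ATR
df = MACD(df)         # default 12/26/9 settings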


Note: There are some deprecated versions (prefixed "Depr") of some of the above functions for old Linux installations or old Python/pandas versions. You may need to use these if you get an error message like:


"AttributeError: 'Series' object has no attribute 'rolling'" or "AttributeError: 'Series' object has no attribute 'ewm'"


All these TA functions require a dataframe to be passed in. Calculations are performed on the dataframe's columns, one or more columns containing the results are appended, and the modified dataframe is returned. So a simple moving average of the close will add an additional column like so:


                 Open       High        Low      Close    Volume  SMA(5,0)_3
Date
2012-04-02  23.500000  23.590000  23.049999  23.059999   8970121         NaN
2012-04-03  23.070000  23.209999  22.910000  22.990000   7933211         NaN
2012-04-04  22.799999  23.049999  22.790001  23.010000   7213844         NaN
2012-04-05  22.950001  23.080000  22.770000  23.040001   5952048         NaN
2012-04-10  22.799999  23.059999  22.799999  22.980000   6560878   23.016000
2012-04-11  22.799999  22.850000  22.690001  22.770000   7470594   22.958000
2012-04-12  22.940001  22.980000  22.750000  22.980000   4699164   22.956000
2012-04-13  23.180000  23.200001  22.870001  22.980000   5641305   22.950000


So the last column is a 5 period SMA, shifted 0 periods, applied to column position 3, which is the close (Open=0, High=1, Low=2, Close=3, Volume=4).

It was necessary to name the column like this in case we want to apply two SMAs, and to avoid error messages like:


"ValueError: columns overlap but no suffix specified"


This happens when two or more columns share the same name. It can be very frustrating the first time you see this message.
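
For the curious, here is a minimal sketch of what the SMA helper might look like under that naming scheme and contract. It's my reconstruction, not the script's exact code, and it assumes the newer rolling API and the PRICETYPE global described below:

def SMA(df, column_pos, n, sh=0):
    # n period simple moving average of the column at column_pos,
    # looking sh periods back, appended as e.g. "SMA(5,0)_3"
    name = "SMA(%d,%d)_%d" % (n, sh, column_pos)
    df[name] = df.iloc[:, column_pos].rolling(window=n).mean().shift(sh).astype(PRICETYPE)
    return df

Called as SMA(df, 3, 5), this appends the "SMA(5,0)_3" column shown above, and the unique name means a second SMA with different parameters never collides with the first.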


You will notice that when data is read in from pd.read_csv(), I have set all floating point dataframe columns to type PRICETYPE, where the global variable PRICETYPE is defined at the top of the script as PRICETYPE = np.float64. I have also done this in my TA functions using astype or dtype. So if you want to save on memory, you could change PRICETYPE to something like np.float32. One little problem with setting it to less than np.float64 is that even though the data read in from pd.read_csv() is set to this type, all pandas calculations still seem to be performed in np.float64. This is annoying because converting a 32 bit closing price to a 64 bit closing price introduces a very small floating point error, which can cause problems when lots of calculations are performed. 32 bit floating point is sufficient for this kind of financial analysis; you don't need to waste space and time doing 64 bit calculations. I don't know why pandas does this!!!
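
The conversion artifact is easy to demonstrate:

import numpy as np

price32 = np.float32(22.98)
print(np.float64(price32))   # 22.979999542236328 - widening to 64 bits exposes the error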


I have attached file sptdf20180809.zip which contains the updated script file sptdf.py. When executed, it does the following:


1) Iterates through all securities stored in memory, read in via pd.read_csv() from the test file data in the above posts

2) Performs the SMA, ATRW, ADX and Aroon calculations (may need to put on a cuppa while you wait!)

3) Displays a dataframe of results for ANZ between 2 dates and also stores the results in test.txt and test.html using df.to_csv() and df.to_html() respectively.
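
Step 3 is mostly label slicing on the Date index plus the two writers. Roughly like this (the dictionary name and dates are illustrative only, not the script's actual values):

df = dataframes["ANZ"]                        # hypothetical dictionary of dataframes, keyed by symbol
sliced = df.loc["2012-04-02":"2012-04-13"]    # label slicing on a DatetimeIndex is inclusive of both ends
print(sliced)
sliced.to_csv("test.txt")
sliced.to_html("test.html")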


As you can see, the main code isn't too much. It's iterating through a dataframe that takes a lot of code.


So what I have covered so far is:


1) Read financial securities data in from either a single csv file containing the columns security symbol, date, open, high, low, close and volume, or multiple csv files (where the file name is the security symbol) each containing the columns date, open, high, low, close and volume.

2) Created a dictionary of dataframes storing all the financial data (read in from the file(s)) in memory - see the sketch after this list.

3) Created common EOD technical analysis functions to perform calculations on a security's dataframe.

4) Iterated through dataframes doing simple stuff, which could lead to amazing stuff provided we ignore performance LOL!
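
And for completeness, a rough sketch of item 2 for the multiple-files case, using the date_parser function from earlier. The directory, column header and dictionary name are placeholders, and the real script also handles the single-file layout:

import glob
import os
import pandas as pd

dataframes = {}
for path in glob.glob("data/*.csv"):                        # one csv per security (placeholder path)
    symbol = os.path.splitext(os.path.basename(path))[0]    # file name is the security symbol
    df = pd.read_csv(path, parse_dates=["Date"], date_parser=date_parser)
    dataframes[symbol] = df.set_index("Date").sort_index()  # Date index, oldest to most recent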

