I have now updated the script with some much needed changes/additions.
Firstly, the Date column is now the index of the dataframe, and dates are sorted from oldest to newest. If you get an error message like:
ValueError: time data '2017-03-23' does not match format '%d/%m/%Y'
it means the date format in your data set does not match what the script expects, and you may need to change the following function:
def date_parser(date_string):
    date = pd.to_datetime(date_string, format="%d/%m/%Y") # dd/mm/yyyy
    #date = pd.to_datetime(date_string, format="%d-%m-%Y") # dd-mm-yyyy
    #date = pd.to_datetime(date_string, format="%Y/%m/%d") # yyyy/mm/dd
    #date = pd.to_datetime(date_string, format="%Y-%m-%d") # yyyy-mm-dd
    return date
As you can see, I have it set to dd/mm/yyyy for the purpose of my test data sets posted above.
Alternatively, in the function call pd.read_csv(...parse_dates=[DATEHEADER], date_parser=date_parser), you can replace the "date_parser=date_parser" entry with the generic parser like so:
pd.read_csv(...parse_dates=[DATEHEADER], infer_datetime_format=True)
and this will attempt to infer the date format no matter what is thrown at it.
The generic parser works with my generated "quotes.csv" file from my yahoo download thread, where the format is yyyy-mm-dd. But when applied to my test data sets posted above, where the format is dd/mm/yyyy, the generic version treats it as mm/dd/yyyy, which I believe is a US format. So it always pays to check that the generic parser is working correctly when taking the lazy approach to parsing dates.
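To see the ambiguity for yourself, here is a minimal standalone sketch (not part of sptdf.py) comparing the inferred parse with an explicit dd/mm/yyyy parse on the same string:

import pandas as pd

date_string = "03/04/2017"  # 3 April 2017 in dd/mm/yyyy

# Generic parsing defaults to the US convention (month first)
print(pd.to_datetime(date_string))  # 2017-03-04, i.e. 4 March!

# An explicit format gets it right
print(pd.to_datetime(date_string, format="%d/%m/%Y"))  # 2017-04-03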
Secondly, the following common EOD trading TA functions have been added to the script to make things more interesting (and slower):
SMA(df, column_pos, n, sh=0): #Simple Moving Average in n periods shifted/looking sh periods back
EMA(df, column_pos, n, sh=0): #Exponential Moving Average in n periods shifted/looking sh periods back
HHV(df, column_pos, n, sh=0): #Highest High Value in n periods shifted/looking sh periods back (upper band of Donchian channel if sh=1)
LLV(df, column_pos, n, sh=0): #Lowest Low Value in n periods shifted/looking sh periods back (lower band of Donchian channel if sh=1)
CUpDown(df, column_pos, sh): #Number of Consecutive periods up(>0), down(<0), or 0 if breakeven
KER(df, column_pos, n=20): #Kaufman's Efficiency Ratio
MOM(df, column_pos, n): #Momentum
ROC(df, column_pos, n): #Rate of Change
StdDev(df, column_pos, n): #Standard Deviation
BBANDS(df, column_pos, n, sd): #Bollinger Bands
MACD(df, n_fast=12, n_slow=26, n_sign=9): #MACD, MACD Signal and MACD Hist
CCI(df, n=20): #Commodity Channel Index
STO(df, fsn=14, ssn=3): #Stochastic Oscillator
RSI(df, n=14): #Relative Strength Index
TO(df): #Turnover (close x volume)
AvgTO(df, n): #Average Turnover (average(close) x average(volume))
OBV(df): #On-Balance Volume
AccDist(df): #Accumulation Distribution
ChaOsc(df, n1=3, n2=10): #Chaikin Oscillator (includes Accumulation Distribution)
MFI(df, n=14): #Money Flow Index (aka volume-weighted RSI)
ATR(df, n=14): #Average True Range
ATRW(df, n=14): #Average True Range Welles Wilder
ADX(df, n=14): #Average Directional Movement Index
PSAR(df, afi=0.01, afm=0.2): #Parabolic SAR
Aroon(df, n=25): #Aroon
Note: There are deprecated versions (prefixed "Depr") of some of the above functions for old Linux installations or Python/pandas versions. You may need to use these if you get an error message like:
"AttributeError: 'Series' object has no attribute 'rolling'" or "AttributeError: 'Series' object has no attribute 'ewm'"
All these TA functions take a dataframe. Calculations are performed on the dataframe's columns, one or more columns containing the results are appended, and the modified dataframe is returned. So a simple moving average of the close will add an additional column like so:
                Open      High       Low     Close  Volume SMA(5,0)_3
Date
2012-04-02 23.500000 23.590000 23.049999 23.059999 8970121 NaN
2012-04-03 23.070000 23.209999 22.910000 22.990000 7933211 NaN
2012-04-04 22.799999 23.049999 22.790001 23.010000 7213844 NaN
2012-04-05 22.950001 23.080000 22.770000 23.040001 5952048 NaN
2012-04-10 22.799999 23.059999 22.799999 22.980000 6560878 23.016000
2012-04-11 22.799999 22.850000 22.690001 22.770000 7470594 22.958000
2012-04-12 22.940001 22.980000 22.750000 22.980000 4699164 22.956000
2012-04-13 23.180000 23.200001 22.870001 22.980000 5641305 22.950000
So the last column is a 5-period SMA, shifted 0 periods, applied to column position 3, which is the close (Open=0, High=1, Low=2, Close=3, Volume=4).
It was necessary to name the column like this so that two SMAs can be applied without error messages like:
"ValueError: columns overlap but no suffix specified"
This happens when two or more columns share the same name. It can be very frustrating the first time you see it.
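As an illustration of the convention, here is a minimal sketch of what an SMA function following this pattern might look like. The real implementation in sptdf.py may differ; the column naming and the sh shift parameter are modelled on the output above:

import numpy as np
import pandas as pd

PRICETYPE = np.float64  # global price type, as defined at the top of the script

def SMA(df, column_pos, n, sh=0):
    # Simple Moving Average of column column_pos over n periods,
    # shifted/looking sh periods back
    col = df.iloc[:, column_pos]
    sma = col.rolling(window=n).mean().shift(sh).astype(PRICETYPE)
    # Encode the parameters in the column name, e.g. "SMA(5,0)_3",
    # so two SMAs with different settings never share a name
    df["SMA(%d,%d)_%d" % (n, sh, column_pos)] = sma
    return df

Calling df = SMA(df, 3, 5) on an OHLCV dataframe then reproduces the extra column shown in the printout above.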
You will notice that when data is read in from pd.read_csv(), I have kept all floating point dataframe columns to type PRICETYPE, where the global variable PRICETYPE is defined at the top of the script as PRICETYPE = np.float64. I have also done this in my TA functions using astype or dtype. So if you want to save on memory, you could change PRICETYPE to something like np.float32. One little problem with setting it to less than np.float64 is that even though the data read in from pd.read_csv() is set to this type, all pandas calculations seem to be performed in np.float64. This is annoying because converting a 32 bit closing price to a 64 bit closing price introduces a very small floating point error, which can compound when lots of calculations are performed. 32 bit floating point calculations are sufficient for financial analysis; you don't need to waste space and time doing 64 bit calculations. I don't know why pandas does this!
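Here is a hedged sketch of the idea. The exact read_csv arguments in sptdf.py may differ, and the file name is an assumption:

import numpy as np
import pandas as pd

PRICETYPE = np.float32   # e.g. switch from np.float64 to save memory
DATEHEADER = "Date"      # assumed date column header

def date_parser(date_string):
    return pd.to_datetime(date_string, format="%d/%m/%Y")

df = pd.read_csv("ANZ.csv",             # assumed file name
                 index_col=DATEHEADER,
                 parse_dates=[DATEHEADER],
                 date_parser=date_parser,
                 dtype={"Open": PRICETYPE, "High": PRICETYPE,
                        "Low": PRICETYPE, "Close": PRICETYPE})

# Many pandas operations still upcast to float64 internally,
# so results are cast back explicitly:
sma = df["Close"].rolling(window=5).mean().astype(PRICETYPE)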
I have attached file sptdf20180809.zip which contains the updated script file sptdf.py. When executed, it does the following:
1) iterates through all securities stored in memory, read in via pd.read_csv() using the test file data in the above posts
2) performs the calculations SMA, ATRW, ADX and Aroon (you may need to put on a cuppa while you wait!)
3) displays a dataframe of results for ANZ between two dates and also stores the results in test.txt and test.html using df.to_csv() and df.to_html() respectively.
As you can see, the main code isn't too much; it's iterating through a dataframe that takes a lot of code.
So what I have covered so far is:
1) Read financial securities data in from either a single CSV file containing columns security symbol, date, open, high, low, close and volume, or multiple CSV files where the file name is the security symbol and the columns are date, open, high, low, close and volume.
2) Created a dictionary of dataframes storing all the financial data (read in from the file(s)) in memory.
3) Created common EOD technical analysis functions to perform calculations on a security's dataframe.
4) Iterated through the dataframes doing simple stuff, which could lead to amazing stuff provided we ignore performance LOL!
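To tie it all together, here is a minimal sketch of the overall flow. The securities dictionary and the date range are illustrative assumptions, not the exact identifiers in sptdf.py, but the TA calls use the signatures listed above:

# Assumed: a dict mapping symbol -> OHLCV dataframe, built from the CSV file(s)
securities = {}  # e.g. {"ANZ": df_anz, "BHP": df_bhp, ...}

# Apply some TA functions to every security (this is the slow part!)
for symbol in securities:
    df = securities[symbol]
    df = SMA(df, 3, 5)   # 5-period SMA of the close (column position 3)
    df = ATRW(df, 14)    # Average True Range Welles Wilder
    df = ADX(df, 14)     # Average Directional Movement Index
    df = Aroon(df, 25)   # Aroon
    securities[symbol] = df

# Show and save results for one security between two dates
anz = securities["ANZ"].loc["2012-04-02":"2012-04-13"]
print(anz)
anz.to_csv("test.txt")
anz.to_html("test.html")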