
I realised in my last post that the database load time was so high because I had included the print statements in the timing. After commenting out the print statements and a couple of other useless variables, I only got it down to 0.6 seconds, which is still roughly 3.5 times slower. I now put the remaining difference down to performing the string manipulations (such as create_string_buffer and os.path.join) 2000 times.
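
For reference, the pattern for keeping the print statements out of the timed region is simply this (a sketch of the general idea, not the exact code in sptdf.py):

import time

t0 = time.perf_counter()
# ... database load only; no print statements inside the timed block ...
elapsed = time.perf_counter() - t0
print("load time: %.3f s" % elapsed)   # report outside the timed region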


I have now attached the file "sptdf20170820.zip", which contains the Python-only file "sptdf.py" and should be extracted into the same directory as the csv file(s). If you run the script and get an error like this:


Traceback (most recent call last):

  File "sptdf.py", line 11, in <module>

    import pandas as pd

ImportError: No module named pandas


it means you don't have the pandas package installed and need to install it for the script to work correctly.


On linux, I successfully executed the following:


$ sudo apt-get install python-pandas    (for Python 3, install python3-pandas instead)


On Windows, I did the following:


E:\>pip install --upgrade pip

Requirement already up-to-date: pip in c:\python34\lib\site-packages

E:\>pip install pandas

Collecting pandas

  Downloading pandas-0.20.3-cp34-cp34m-win_amd64.whl (8.1MB)

    100% |################################| 8.1MB 95kB/s

Collecting pytz>=2011k (from pandas)

  Downloading pytz-2017.2-py2.py3-none-any.whl (484kB)

    100% |################################| 491kB 398kB/s

Collecting numpy>=1.7.0 (from pandas)

  Downloading numpy-1.13.1-cp34-none-win_amd64.whl (7.6MB)

    100% |################################| 7.6MB 101kB/s

Collecting python-dateutil>=2 (from pandas)

  Downloading python_dateutil-2.6.1-py2.py3-none-any.whl (194kB)

    100% |################################| 194kB 443kB/s

Collecting six>=1.5 (from python-dateutil>=2->pandas)

  Downloading six-1.10.0-py2.py3-none-any.whl

Installing collected packages: pytz, numpy, six, python-dateutil, pandas

Successfully installed numpy-1.13.1 pandas-0.20.3 python-dateutil-2.6.1 pytz-2017.2 six-1.10.0


Of course if you have Anaconda you hopefully don't need to do any of this!


Now, to avoid creating two scripts, "sptdf.py" checks whether the file "quotes.csv" exists.

If it does, we use the single-file solution; otherwise we use the multiple-files solution.
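
The check itself is just a file-existence test. A minimal sketch of the dispatch (the exact layout in sptdf.py differs, and the "Ticker" column taken from the filename is an assumption about the multi-file csv format):

import glob
import os
import pandas as pd

def load_database():
    # Single-file case: quotes.csv holds every symbol.
    if os.path.isfile("quotes.csv"):
        return pd.read_csv("quotes.csv")
    # Multiple-files case: one csv per symbol in the current directory.
    frames = []
    for name in glob.glob("*.csv"):
        df = pd.read_csv(name)
        df["Ticker"] = os.path.splitext(name)[0]   # assumed: symbol taken from the filename
        frames.append(df)
    return pd.concat(frames, ignore_index=True)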


Here are the results when I run the script "sptdf.py" (pure python) on linux:


[Attached screenshot: timing output from running sptdf.py on linux]


The database load time (3 seconds) is actually pretty good for python. I had previously tried the read-and-split approach and the plain csv package, but they both took almost 2 minutes, so using pandas.read_csv is definitely the way to go for python. On a database with a much larger history it's only about 4.5 times slower, which is relatively even better.
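
For comparison, the two approaches look roughly like this; the plain csv version is the row-by-row Python loop that took almost 2 minutes, while read_csv does the parsing in pandas' optimised C engine (just a sketch, not the exact sptdf.py code):

import csv
import pandas as pd

def load_with_csv_module(path):
    # Row-by-row Python loop: simple, but slow across ~2000 symbols worth of rows.
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def load_with_pandas(path):
    # Parsing happens in compiled code, hence the large speed difference.
    return pd.read_csv(path)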


I simulated fetching symbols with prices by copying each dataframe as it is iterated over, which is a fair thing to do.

Although it's about 10 times slower here, on a database with a much larger history it's only about 2 times slower, which is excellent.
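
Concretely, the simulation is along these lines, assuming the loaded dataframe has a symbol column (the name "Ticker" here is an assumption about the csv layout):

import pandas as pd

def fetch_symbols_with_prices(db, symbol_col="Ticker"):
    # Split the full dataframe into one copied dataframe per symbol,
    # which simulates handing each symbol's price history to the caller.
    return {symbol: group.copy() for symbol, group in db.groupby(symbol_col)}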


Calculating a simple moving average for about 2000 securities takes about 2 seconds, which is about 40 times slower than my shared lib solution. On a database with a much larger history it's about 20 times slower, and that's only with a 5 period simple moving average. This is quite terrible, because I use the same python moving average function in my shared lib solution. I probably also need to test more complicated indicators on a database with more history to get a better view of the performance.
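
For context, the moving average being timed is a plain per-element Python loop, something like this (a sketch of the general shape, not necessarily the exact function used):

def simple_moving_average(values, period=5):
    # Per-element SMA: re-sums the whole window for every output point.
    out = [None] * len(values)
    for i in range(period - 1, len(values)):
        total = 0.0
        for j in range(i - period + 1, i + 1):
            total += values[j]
        out[i] = total / period
    return out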


I guess the challenge now is to first code a more efficient simple moving average that takes far less than 2 seconds on the current small database. Can anyone else think of a better way to do this? The solution must be coded in a way that the same approach can be applied to more complicated indicators such as ADX or Ichimoku. We don't want to use built-in libraries, because we must be able to code whatever custom indicator we want, hence the need to access individual elements rather than whole vectors/arrays. One possibility is sketched below.
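
One idea that stays element-by-element (so the same pattern could carry over to indicators like ADX or Ichimoku) is to maintain a running window sum instead of re-summing the window at every bar; a sketch:

def simple_moving_average_running(values, period=5):
    # Running-sum SMA: O(n) work instead of O(n * period), still per-element.
    out = [None] * len(values)
    window_sum = 0.0
    for i, value in enumerate(values):
        window_sum += value
        if i >= period:
            window_sum -= values[i - period]   # drop the value leaving the window
        if i >= period - 1:
            out[i] = window_sum / period
    return out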


At the moment my answer is that pandas is too slow and needs full vector operations for good script performance, but hey, it's early days.


Cheers,


Andrew.

