I realised after my last post that the database load time was so high because I included the print statements in the timing. Having commented out the print statements and a couple of other useless variables, I only got it down to 0.6 seconds, which was still roughly 3.5 times slower. I am now blaming the string manipulations (such as create_string_buffer and os.path.join) being done 2000 times for the performance difference.
I have now attached the file "sptdf20170820.zip", which contains the Python-only file "sptdf.py" and should be extracted into the same directory as the CSV file(s). If you run the script and get an error like this:
Traceback (most recent call last):
File "sptdf.py", line 11, in <module>
import pandas as pd
ImportError: No module named pandas
it means you don't have the pandas package installed and will need to install it for the script to work correctly.
On Linux, I successfully executed the following:
$ sudo apt-get install python-pandas (for Python 3, install python3-pandas instead)
On Windows, I did the following:
E:\>pip install --upgrade pip
Requirement already up-to-date: pip in c:\python34\lib\site-packages
E:\>pip install pandas
Collecting pandas
Downloading pandas-0.20.3-cp34-cp34m-win_amd64.whl (8.1MB)
100% |################################| 8.1MB 95kB/s
Collecting pytz>=2011k (from pandas)
Downloading pytz-2017.2-py2.py3-none-any.whl (484kB)
100% |################################| 491kB 398kB/s
Collecting numpy>=1.7.0 (from pandas)
Downloading numpy-1.13.1-cp34-none-win_amd64.whl (7.6MB)
100% |################################| 7.6MB 101kB/s
Collecting python-dateutil>=2 (from pandas)
Downloading python_dateutil-2.6.1-py2.py3-none-any.whl (194kB)
100% |################################| 194kB 443kB/s
Collecting six>=1.5 (from python-dateutil>=2->pandas)
Downloading six-1.10.0-py2.py3-none-any.whl
Installing collected packages: pytz, numpy, six, python-dateutil, pandas
Successfully installed numpy-1.13.1 pandas-0.20.3 python-dateutil-2.6.1 pytz-2017.2 six-1.10.0
Of course if you have Anaconda you hopefully don't need to do any of this!
Now, to avoid creating 2 scripts, "sptdf.py" checks whether the file "quotes.csv" exists.
If it does, we use the single-file solution; otherwise we use the multiple-files solution.
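For readers who don't want to open the zip, the check amounts to something like this (a rough sketch only; the variable names are mine, not the actual code in sptdf.py):

import glob
import os
import pandas as pd

# hypothetical sketch of the single-file vs multiple-files check
if os.path.isfile("quotes.csv"):
    # single-file solution: every symbol's quotes live in one csv
    frames = [pd.read_csv("quotes.csv")]
else:
    # multiple-files solution: one csv per symbol in the same directory
    frames = [pd.read_csv(path) for path in glob.glob("*.csv")]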
Here are the results when I run the script "sptdf.py" (pure python) on linux:
[ATTACH=full]72320[/ATTACH]
Database load time (3 seconds) is actually pretty good for Python. I had previously tried the read-and-split approach and the plain csv package, but they both took almost 2 minutes, so using pandas.read_csv is definitely the way to go for Python. On a database with a much larger history it's only about 4.5 times slower (even better).
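To illustrate the difference, here is a minimal sketch of the two approaches (the file name and column layout are assumptions, not necessarily the exact format of the attached csv files):

import pandas as pd

# fast path: pandas parses the whole file in optimised C code
quotes = pd.read_csv("quotes.csv")

# slow path: the pure-Python read-and-split style that took almost 2 minutes
rows = []
with open("quotes.csv") as f:
    header = f.readline().rstrip("\n").split(",")
    for line in f:
        rows.append(line.rstrip("\n").split(","))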
I simulated fetching symbols with prices by copying each dataframe as it is iterated over, which is a fair thing to do.
Although that is 10 times slower here, on a database with a much larger history it's only about 2 times slower, which is excellent.
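Roughly speaking, the simulation does something like this per symbol (a sketch only; the "symbol" column name is an assumption):

import pandas as pd

# group the loaded quotes by symbol and take a copy of each group,
# which stands in for fetching one symbol's price history at a time
quotes = pd.read_csv("quotes.csv")
prices_by_symbol = {}
for symbol, group in quotes.groupby("symbol"):
    prices_by_symbol[symbol] = group.copy()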
Calculating a simple moving average for about 2000 securities takes about 2 seconds, which is about 40 times slower than my shared lib solution. On a database with a much larger history it's about 20 times slower, but we are only using a 5-period simple moving average. This is quite terrible because I use the same Python moving average function for my shared lib solution. I probably need to also test more complicated indicators on a database with more history to get a better view on performance.
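For reference, the style of element-by-element moving average being timed looks roughly like this (a sketch, not the exact function in sptdf.py):

def sma(values, period=5):
    # element-by-element simple moving average, written in the same
    # access-individual-elements style as the shared-lib version
    out = [None] * len(values)
    total = 0.0
    for i, v in enumerate(values):
        total += v
        if i >= period:
            total -= values[i - period]
        if i >= period - 1:
            out[i] = total / period
    return out

print(sma([1, 2, 3, 4, 5, 6, 7]))   # [None, None, None, None, 3.0, 4.0, 5.0]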
I guess now the challenge is to first code a more efficient simple moving average that takes far less than 2 seconds on the current small database. Can anyone else think of a better way to do this? The solution must be coded in a way that the same coding can be applied to more complicated indicators such as ADX or Ichimoku. We don't want to use built-in libraries because we must be able to code whatever custom indicator we want, hence the need to access individual elements and not whole vectors/arrays.
At the moment my answer is that Pandas is too slow and needs full vector operations for good script performance, but hey, it's early days.
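For comparison, the fully vectorised version that Pandas is actually quick at would look something like this (again just a sketch, with assumed column names, and exactly the kind of whole-column operation that doesn't generalise to custom element-by-element indicators):

import pandas as pd

# per-symbol 5-period rolling mean computed on whole columns at once
quotes = pd.read_csv("quotes.csv")
quotes["sma5"] = (quotes.groupby("symbol")["close"]
                        .transform(lambda s: s.rolling(5).mean()))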
Cheers,
Andrew.