Tag Archives: python

Python Class vs Instance variables

Recently I had the pleasure of learning about Python class vs instance variables. Coming from other programming languages, such as Java, this was quite different for me. So what are they?

I was working on my Monero scraper, so I will just use that as the example, since that is where I had the fun as well..

Class variables

Monero is a blockchain. A blockchain consists of linked blocks, which contain transactions. Each transaction further contains various attributes, the most relevant here being the tx_in and tx_out type elements. These simply describe actual Monero coins being moved in / out of a wallet in a transaction.

So I made a Transaction class to contain this information. Like this:

from typing import List

# TxIn and TxOut are the scraper's own classes, defined elsewhere
class Transaction:
    fee = None
    block_height = None
    tx_ins: List[TxIn] = []
    tx_outs: List[TxOut] = []

    def __init__(self, fee, block_height):
        self.fee = fee
        self.block_height = block_height

I figured this should match a traditional Java class like this:

public class Transaction {
    int fee;
    int blockHeight;
    List<TxIn> txIns = new ArrayList<>();
    List<TxOut> txOuts = new ArrayList<>();

    public Transaction(int fee, int blockHeight) {
        this.fee = fee;
        this.blockHeight = blockHeight;
    }
}

Of course, it turned out I was wrong. A class variable in Python is actually more like a static variable in Java. So, in the above Python code, all the variables in the Transaction class are shared by all Transaction objects. Well, actually only the lists are shared in this case. But more on that in a bit.

Here is an example to illustrate the case:

t1 = Transaction(1, 1)
t1.tx_ins.append(TxIn(1, 1, 1, 1))
t2 = Transaction(1, 1)
t2.tx_ins.append(TxIn(1, 1, 1, 1))

print(t1.tx_ins)
print(t2.tx_ins)

I was expecting the above to print out a list with a single item for each transaction, since I only added one to each. But it actually prints two:

[<monero.transaction.TxIn object at 0x109ceee10>, <monero.transaction.TxIn object at 0x11141abd0>]
[<monero.transaction.TxIn object at 0x109ceee10>, <monero.transaction.TxIn object at 0x11141abd0>]

There was something missing for me here, which was understanding the instance variables in Python.

Instance variables

So what makes an instance variable an instance variable?

My understanding is, the difference is assigning it in the constructor (the __init__() method):

class Transaction:
    fee = None
    block_height = None
    tx_ins: List[TxIn] = []
    tx_outs: List[TxOut] = []

    def __init__(self, fee, block_height):
        self.fee = fee
        self.block_height = block_height
        self.tx_ins = []
        self.tx_outs = []

Compared to the previous example, the only difference in the above is that the list values are assigned (again) in the __init__ method. Here is the result of the previous test with this setup:

[<monero.transaction.TxIn object at 0x107447e50>]
[<monero.transaction.TxIn object at 0x108ea2d10>]

So now it works as I intended, each transaction holding its own set of tx_ins and tx_outs, since they became instance variables.

I used the above Transaction structure when scraping the Monero blockchain. Because I originally had the tx_ins and tx_outs initialized as lists at class variable level, adding new values to these lists actually just kept growing the shared (class variable) lists forever. Which was not the intent.

Because I expected each Transaction object to have a new, empty set of lists. Of course, they didn't; the values just accumulated in the shared (class variable) lists. As I inserted the transactions one at a time into a database, the number of tx_ins and tx_outs for later transactions in the blockchain kept growing and growing, as they now also contained all the values of previous transactions. Hundreds of millions of inserted rows later..

After fixing the variables to be instance variables, the results and counts make sense again.

Gotcha

Even with the above fix to use instance variables for the lists, I still ran into an issue. I typoed the variable name in the constructor:

class Transaction:
    fee = None
    block_height = None
    tx_ins: List[TxIn] = []
    tx_outs: List[TxOut] = []

    def __init__(self, fee, block_height):
        self.fee = fee
        self.block_height = block_height
        self.tx_inx = []
        self.tx_outs = []

In the above, I typoed self.tx_inx instead of self.tx_ins. Because the class-level tx_ins is already initialized as an empty list, this gave no errors, but the objects kept accumulating in the shared list as before for the tx_ins part. Brilliant.

So I ended up with the following approach (for now):

class Transaction:
    fee = None
    block_height = None
    tx_ins: List[TxIn] = None
    tx_outs: List[TxOut] = None

    def __init__(self, fee, block_height):
        self.fee = fee
        self.block_height = block_height
        self.tx_inx = []
        self.tx_outs = []

This way, if I typo the instance variable in the __init__ method, the class variable stays uninitialized, and I will get a runtime error when trying to use it (as the class variable value is None).
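For example, continuing the examples above, using the typoed list now fails fast instead of silently sharing state:

t = Transaction(1, 1)
t.tx_ins.append(TxIn(1, 1, 1, 1))
# AttributeError: 'NoneType' object has no attribute 'append'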

The main reason I am doing the variable initialization like this is to get the variables defined for IDE autocompletion, and to be able to add type hints to them for further IDE assistance and checking. There might be other ways to do it, but this is what I have figured out so far..

When I was looking into this, I also found this Stack Overflow post on the topic. It points to other optional ways to specify type hints for instance variables (e.g., typing the parameters to the constructor). There is also a pointer to Python Enhancement Proposal PEP 526, which references other PEPs, but let's not go into all of those..
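For reference, PEP 526 also allows annotating variables at class level without assigning any value. My understanding is this keeps the type hints (and IDE assistance) but creates no class variables at all, so a typoed instance variable again fails with an error on first use. A sketch of that approach, assuming the same List/TxIn/TxOut types as above:

class Transaction:
    fee: int
    block_height: int
    tx_ins: List[TxIn]
    tx_outs: List[TxOut]

    def __init__(self, fee, block_height):
        self.fee = fee
        self.block_height = block_height
        self.tx_ins = []
        self.tx_outs = []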

I cannot say I have 100% assurance of all the possibilities related to these annotations, and instance vs class variables, but I think I got a pretty good idea.. If you have any pointers to what I missed or misinterpreted, please leave a comment 🙂

Remote Execution in PyCharm

Editing and Running Python Code on a Remote Server in PyCharm

Recently I was looking at an option to run some code on a remote server, while editing it locally. This time on AWS, but generally the ability to do so on any remote server would be nice. I found that PyCharm has a nice option to use a Python SSH interpreter. Give it some SSH credentials, point it to the Python interpreter on the remote machine, and you should be ready to go. A nice pic of it:

Overview

Sounds cool, and actually works really well. It even supports debugging. A related issue I ran into for pipenv support also mentions profiling, pip package management, etc. Great. No, I haven't tried all the advanced stuff yet, but at least the basics worked great.

Basic Remote Use

I made this simple program to test this feature:

print("hello world")
with open("bob.txt", "w") as bob:
    bob.write("hello.txt")

print("oops")

The point is to print text to the console and create a file. I am looking to see that running this remotely will show me the prints locally, and create the file remotely. This would confirm to me that the execution happens remotely, while I edit, control execution, and see the results locally.

Running this locally prints "hello world" followed by "oops", and a file named "bob.txt" appears. Great.

To try remotely, I need to set up a remote Python interpreter in PyCharm. This can be done via project preferences:

Add interpreter

Or by clicking the interpreter in the status bar:

Statusbar interpreter

On a local configuration this shows the Python interpreter (or pipenv etc.) on my computer. In a remote configuration it asks for many options, such as the remote server IP and credentials. All the run/debugging traffic between the local and remote machines is then automatically transferred over SSH tunnels by PyCharm. To start, select SSH interpreter as the type when adding a new interpreter:

SSH interpreter

Just enter the remote IP/URL address and username. Click next to enter a password/keyfile as well. PyCharm will try to connect and check that this all works. On the final page of the remote interpreter dialog, it asks for the interpreter path:

Remote Python config

This refers to the Python executable on the remote machine. A simple "which python3" on the remote host does the trick. This works to run the code using the system Python on the remote machine.

To run this remote configuration, I just press the run button as usual in PyCharm. With this, PyCharm uploads my project files to the remote server over SSH, starts the interpreter there with the given configuration, and transports the console output of the execution back to my local host. For me it looks exactly the same as running it locally. This is the output of running the above configuration:

ssh://ec2-user@18.195.211.65:22/usr/bin/python3 -u /tmp/pycharm_project_411/hello_world.py
hello world
oops

The first line shows some useful information. It shows that it is using the SSH interpreter with the given IP and username, with the configured Python path. It also shows the directory where it has uploaded my project files, in this case "/tmp/pycharm_project_411". This is the path defined in the Project Interpreter settings, in the Path Mappings part, as illustrated in the image further above (the one with too many red arrows). The attached image has a different project number due to my playing with different projects, but anyway. To see the files and output:

[ec2-user@ip-172-31-3-125 ~]$ cd /tmp/pycharm_project_411/
[ec2-user@ip-172-31-3-125 pycharm_project_411]$ ls
bob.txt  hello_world.py

This is the file listing from the remote server. PyCharm has uploaded the "hello_world.py" file, since this was the only file I had in my project (under the project root as configured for sync in the path mappings). There is a separate tab in PyCharm to see these uploads:

Remote synch

After syncing the files, PyCharm has executed the configuration on the remote host, which was defined to run the hello_world.py file. And this execution has created the file "bob.txt" as it should (on the remote host). The output files go in this remote target directory, as it is the working directory for the running Python program.

Another direction to synchronize is from the remote host to local. Since PyCharm provides intelligent coding assistance and navigation on the local system, it needs to know about the libraries used by the executed code. For this reason, it installs locally all the packages installed in the remote host Python environment. Something to keep in mind. I suppose it must use some type of local virtual environment for this. I haven't needed to look deeper into that yet.

Using a Remote Pipenv

The above discusses the usage of a standard Python run configuration and interpreter. Something I have found useful for Python environments is pipenv.

So can we also do remote execution with a remote pipenv configuration? The issue I linked earlier contains solutions and discussion on this. Basically, the answer is: yes we can. We just have to find the pipenv files on the remote host and configure the right one as the remote interpreter.

For more complex environments, such as those set up with pipenv, a bit more is required. The issue I linked before had some actual instructions on how to do this:

Remote pipenv config

I made a directory "t" on the remote host, and initialized pipenv there. Installed a few dependencies. So:

  • mkdir t
  • cd t
  • pipenv install pandas

And there we have the basic pipenv setup on the remote host. To find the pipenv dir on the remote host ("t" is the dir where pipenv was initialized above):

[ec2-user@ip-172-31-3-125 t]$ pipenv --venv
/home/ec2-user/.local/share/virtualenvs/t-x5qHNh_c

To see what it contains:

[ec2-user@ip-172-31-3-125 t]$ ls /home/ec2-user/.local/share/virtualenvs/t-x5qHNh_c
bin  include  lib  lib64  src
[ec2-user@ip-172-31-3-125 t]$ ls /home/ec2-user/.local/share/virtualenvs/t-x5qHNh_c/bin
activate       activate.ps1      chardetect        pip     python     python-config
activate.csh   activate_this.py  easy_install      pip3    python3    wheel
activate.fish  activate.xsh      easy_install-3.7  pip3.7  python3.7

To get the Python interpreter path:

[ec2-user@ip-172-31-3-125 t]$ pipenv --py
/home/ec2-user/.local/share/virtualenvs/t-x5qHNh_c/bin/python

This is just a link to python3:

[ec2-user@ip-172-31-3-125 t]$ ls -l /home/ec2-user/.local/share/virtualenvs/t-x5qHNh_c/bin/python
lrwxrwxrwx 1 ec2-user ec2-user 7 Nov  7 20:55 /home/ec2-user/.local/share/virtualenvs/t-x5qHNh_c/bin/python -> python3

Use that path to configure this pipenv as the remote interpreter, as shown above already:

Remote pipenv config

UPDATE:

Besides automated sync, I found the PyCharm IDE has features for manual upload to / download from the remote server. Seems quite useful.

First of all, the root of the remote deployment dir is defined in Deployment Configuration / Root Path. Under Deployment / Options, you can also disable the automated remote sync: just set "Upload changed files automatically to the default server" to "Never". Here I have set the root dir to "/home/ec2-user", which means the temp directory I discussed above is actually created under /home/ec2-user/tmp/pycharm_project_703/…

Deployment config

With the remote configuration defined, you can now view files on the remote server. First, enable View -> Tool Windows -> Remote Host. This opens up the Remote Host view on the right-hand side of the IDE window. The following shows a screenshot of the PyCharm IDE with this window open. The popup menu (as also shown) lets you also download/upload files between the remote host and the localhost:

Deployment view

In a similar way, we can also upload local files to the remote host using the context menu for the files:

Upload to remote

One can also select entire folders for upload / download. The root path on the remote host used for all this is the one I discussed above (e.g., /home/ec2-user as defined higher above).

Conclusions

I haven't used this feature on a large scale yet, but it seems very useful. The issue I keep linking discusses one option: using it to run data processing on a large desktop system from a laptop. I also find it interesting for just running experiments in parallel on a separate machine, or for using cloud infrastructure while developing.

The issue also has some discussion of potential pipenv management from PyCharm coming in the 2020.1 or 2020.2 version. Just speculation, of course. But until then, one can set up the virtualenv using pipenv on the remote host and just use the interpreter path as above to set up the SSH interpreter. This works to run the code inside the pipenv environment.

Some issues I ran into included PyCharm apparently only keeping a single state mapping in memory for remote and local file diffs. PyCharm synchronizes files very well, and identifies changes to upload new files. But if I change the remote host address, it seems to still think it has the same delta. Not a big issue, but something to keep in mind as always.

UPDATE: The manual sync I added a description for above is actually quite a nice way to bypass the issues in automated sync. Of course it is manual, and using it to upload everything all the time in a big project is not useful. But for me and my projects it has been nice so far..

That’s all.

Using Kafka with Python and (unit) testing my stuff in it..

This time I needed to post some data through Apache Kafka from some of my existing Python code. People tell me Python is such a lovely thing and everything is great when you do Python. Well, I do find the core language and environment to be great for fast scripting and prototyping. But the documentation for all the libraries etc.. oh dear

There is a nice Python Kafka client, which I decided to use (what else was I going to do anyway.. 🙂 ?): https://github.com/mumrah/kafka-python.

But those docs. Those evil docs. The usage page is just a list of some code pasted using some consumer type and some producer type.

Maybe it is because I use Python3 and things would be different on Python2. I don't know. But the example code from the documentation page:

producer.send_messages("my-topic", "some message")

Just would not work. I had to explicitly transform the strings to bytes. So

producer.send_messages("my-topic", b"some message")

works.. Overall, the SimpleProducer is, well, quite simple to get working anyway. But I also needed a consumer for my tests, as I wanted to see that something actually gets passed over to Kafka and that the message delivered is what is expected. Meaning, I want to see that my producer is using Kafka correctly, and to test that I need to capture what gets passed through Kafka.

So off to try writing a simple consumer. The kafka-python docs show an example of using KafkaConsumer (as usual, it is a code dump). It also tells me to use a keyword argument called "bootstrap_servers" to define where the consumer connects to the Kafka cluster (or a single server in my test case). Of course, the KafkaConsumer has no keyword argument called "bootstrap_servers". Reading the code, it seems I need to use "metadata_broker_list" instead. OK, that seems to work.
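In other words, something like this (a sketch against the old kafka-python API described above; the broker address is just an example):

from kafka import KafkaConsumer

# the docs say bootstrap_servers, but the code wanted metadata_broker_list
consumer = KafkaConsumer("my-topic", metadata_broker_list=["localhost:9092"])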

But while Googling for various issues I had, I also found there is a class called SimpleConsumer. Whooppee, being a simple guy, I use that then.. My consumer:

# needs: import pkg_resources, plus KafkaClient and SimpleConsumer from the kafka package
def assert_kafka(self, expected_file_name):
    kafka_client = KafkaClient(KAFKA_SERVER)
    consumer = SimpleConsumer(kafka_client, b"my_group", KAFKA_TOPIC.encode("utf8"), iter_timeout=1)
    consumer.seek(1, 0)
    actual = ""
    for msg in consumer:
        actual += msg.message.value.decode('utf8') + "\n"
    expected = pkg_resources.resource_string(__name__, expected_file_name).decode('utf8')
    self.assertEqual(actual, expected)

The KafkaConsumer has a timeout called "consumer_timeout_ms", which is the number of milliseconds after which to throw a timeout exception. In the SimpleConsumer the similar timeout is actually called "iter_timeout", and it represents the number of seconds to wait for new messages when iterating the message queue. By setting it to 1 second, we give the producer a chance to provide all messages for the test, and also a chance to pass through any excess messages it should not be passing.

The actual message data passed from Kafka is in the “msg.message.value” given by the consumer iterator. And it is in bytes.. How did I figure all that out? I have no idea.

The "consumer.seek(1, 0)" sets the position in the message queue to the beginning of the queue (the 0 parameter), and adds 1 to that (the 1 parameter). This allows us to skip the first message sent by the producer. Why would we want to skip that? Kafka requires either explicitly creating the topics that are used, or it can create them automatically when first addressed. Since kafka-python does not seem to have explicit support for topic creation, I need to rely on the topic auto-creation. So, I set the producer to start with a specific message send to ensure the topic exists. This is to avoid having to put the try-except in every place in the code where messages are sent.

This looks something like:

# needs: import time, KafkaClient and SimpleProducer from the kafka package,
# and LeaderNotAvailableError from kafka.common
kafka_client = KafkaClient(KAFKA_SERVER)
self.kafka = SimpleProducer(kafka_client)

try:
    self.kafka.send_messages(KAFKA_TOPIC, b"creating topic")
except LeaderNotAvailableError:
    time.sleep(1)  # this sleep is needed to give zookeeper time to create the topic

The tests then look something like this:

import os
import unittest


class TestKafka(unittest.TestCase):
    topic_index = 1

    @classmethod
    def setUpClass(cls):
        # in my real code these set values in a shared configuration
        # that both the producer code and the test consumer read
        global KAFKA_SERVER
        KAFKA_SERVER = os.environ["KAFKA_SERVER"]

    def setUp(self):
        # a unique topic per test, so tests do not mess with each other
        global KAFKA_TOPIC
        KAFKA_TOPIC = "test_topic_" + str(self.topic_index)
        TestKafka.topic_index += 1

    def test_cpu_sys_kafka(self):
        kafka = KafkaLogger()
        kafka.log("hello")
        kafka.close()

        self.assert_kafka('expected.kafka')

Here, I take the environment variable “KAFKA_SERVER” to allow defining the test server address outside the unit test. This is stored in the configuration variable that both the producer code and the test consumer access. The KAFKA_TOPIC variable is similarly used to set a unique topic for each test to avoid tests messing with each other.

To avoid test runs from messing with each other, I used the Kafka server configuration options to make it delete all messages within 5-10 seconds of receiving them. This was done with the two options below, set in the server.properties file:

log.retention.ms=5000
log.retention.check.interval.ms=10000

The first defines that the log is only kept for 5 seconds. The second defines that checks for logs to delete are done at 10-second intervals. This was actually fun to find, since the documentation mostly talks about using hours. Some mention doing it at the very fine level of minutes.. But it is also possible to do it in milliseconds, as above. Again, I found that somewhere… 🙂

To summarize: I set up a separate test server, set it to delete messages in short order, invoked the producer code under test to send the relevant messages, started a consumer to read the messages, collected these, and compared them to the expected values. Seems to work for now..

As a side note, I had an interesting error to debug. I kept getting a LeaderNotAvailableError when trying to send any messages over Kafka. I thought it was my code, but even the command line tools that come with Kafka failed. So what was the problem? I had set up the development area zookeeper in the test area Kafka server.properties, but used the test area zookeeper when sending the messages. Whooppee. What did we learn there? Probably nothing, but if we did, it would be to always double and triple check the configuration and IP addresses etc. In my defence, the error was not very descriptive..

Which relates to another note. My tests worked fine on OSX and Windows, but setting this up on Ubuntu failed. The "advertised.host.name" property in server.properties needed to be set to an IP address, as otherwise Kafka would query the system for a host name, which was not routable from the client.. I ended up doing this for the zookeeper address as well, which of course did not do much.

And finally, if the ZooKeeper vs Kafka state gets messed up, deleting /tmp/kafka-logs and /tmp/zookeeper directories might help. I think it fixed one of my issues, but with all the issues I kept having I have no idea anymore.. 🙂

Uploading my little project to pypi (python package index)

Wanted to upload my little project to pypi, the place from which the project can be easily installed using pip install..

First of all, you need to create the setup.py file. This is documented in quite a few places. Just a few notes on this:

  • The "name" attribute is what is used to refer to the project when performing the upload, and what is used on the index.
  • The "packages" attribute should contain all packages to upload. Not sure if "empty" middle packages count or not, but I put one in anyway.
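A minimal setup.py along these lines might look something like the following sketch; the project and package names here are made up for illustration:

from setuptools import setup

setup(
    # "name" is how the project is referred to on the index and in uploads
    name="my-little-project",
    version="0.0.1",
    # list all packages to upload, including the "empty" middle ones
    packages=["mypackage", "mypackage.middle", "mypackage.middle.inner"],
)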

Now, this should all be just fine, but the weirdest errors are to be had. The command to perform the upload is "python setup.py sdist upload -r REPONAME". And most of the websites tell me that the process should ask me for my username and password if I have not defined them anywhere. Anywhere being the ".pypirc" file mentioned in various places. But no, it gave me the error "invalid schema definition". Or something similar.

So to fix this most informative error, I tried various tricks, like renaming fields in setup.py etc. None worked. So I tried to delete the project from pypi (don't worry, it's not like it was popular and had user(s)). This finally gave the more reasonable error that I need the ".pypirc" file.

So, I need to add the ".pypirc" file. But where, since Windows has no clearly defined "~" (home) directory? I had to define the HOME environment variable and point it someplace, create the ".pypirc" there, and put my access credentials in it. Finally, register and upload work…
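For reference, a minimal ".pypirc" looks something like this (a sketch; the username and password are placeholders, and a section name should match the REPONAME used in the upload command):

[distutils]
index-servers =
    pypi

[pypi]
username: myuser
password: mypass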

Maybe next time I will remember this..

python lazy importing

I was wondering how to stop my little Python program from crashing if it fails to import the MySQL driver, even when the functionality using the MySQL driver is not invoked. The Python runtime seems to try to load all imported modules as soon as the importing file is loaded, including when you import the module from another module. Generally this always happens, as something will refer to the module (or it is a useless module).

The fix is really simple. Instead of


import mysql.connector

at the top of the file, I should just import it where it is first used, such as


def __init__(self):
    import mysql.connector
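
To make the idea concrete, here is a small sketch (the class and parameters are made up for illustration): the program imports and starts fine without the MySQL driver installed, and only fails if the MySQL-backed part is actually used.

class MySqlStore:
    def __init__(self, host, user, password, database):
        # lazy import: only runs when this class is instantiated,
        # so the rest of the program works without the driver
        import mysql.connector
        self.conn = mysql.connector.connect(
            host=host, user=user, password=password, database=database)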

Seems the term I was looking for in Google was “Python lazy importing” or similar.. whooppee 🙂

python3 mysql connector and pip install procedures

Trying to learn some Python and use some of the nice and shiny libraries available for it.

Problem: Need to use MySQL to store some data. Have to figure out how to connect and use the MySQL DB from Python3.

Several places suggest using a Python package called MySQLdb. Seems fine, but it does not support Python3 at this time. Installation seems a bit of a pain as well, with downloading this and that from different places on different platforms. Also, if there is an "official" version/package I would prefer that. Turns out there is an "official" package called MySQL Connector from Oracle. Trying to use that.

Problem again: The "official" site (http://dev.mysql.com/downloads/connector/python/ at this time) hosts some "installer" files to download. So I download one and run the installer. Nothing seems to happen, as my Python3 install still does not recognize the package. Why not just "pip3 install xyz"? Such a lovely simple thing for me..

So, it appears there is something that can actually be installed with pip. On the internetz I finally found some references to a command such as "pip install mysql-connector-python --allow-external mysql-connector-python". But what does this mean? Well, I change pip to pip3 to get Python3, and the "pip3 install mysql-connector-python" part seems obvious: the install command with a package name matching the official package. But what is "--allow-external" and why is mysql-connector-python repeated twice?

Some of this info I found in the docs, some on Stack Overflow, and some is my guessing. So here goes, hoping for corrections where relevant.

The "pip" or "pip3" commands download and install stuff from the Python package index, a.k.a. "pypi" (located at https://pypi.python.org/pypi/). The "install" command for "pip" then looks for the given package name in the "pypi" repository and installs it if found. However, only some of the actual install files are located in the "pypi" repository. Some files are just references to an external site. The flag "--allow-external" tells the "pip" command to also install a package even if its files are not hosted in "pypi", and the second occurrence of the package name is simply the argument to that flag, naming the package for which external files are allowed. In this case, "pypi" only contains the reference to the package. For security purposes this should still be fine, as the "pypi" entry should still contain the checksum of the external file, so no malicious stuff should be downloaded with any higher probability than when getting the file from "pypi".

And how does finding something on "pypi" work? There is a search functionality for packages on the "pypi" repository pages. If we type "mysql" in there, we get a hit in the top 10 named "mysql-connector-python 2.0.3". Looks interesting for this case. If we click on the link, we get an info page. This has information such as the author (Oracle in this case) that can help figure out if it is what we are looking for. At the bottom of the page there is a "DOAP record" link. This gives us an XML file, which in this case shows the name as "mysql-connector-python". So this is where I guess the match is made for the "install" command. Then there is a version number for upgrades. Finally, there is a download link with an MD5 checksum embedded in the end. So, guessing again, this is where "--allow-external" goes to download the external install files, and the MD5 checksum is what it uses to check that the file is valid.
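With the package finally installed, the connecting part (the original problem) looks something like this. A minimal sketch, with placeholder connection parameters:

import mysql.connector

conn = mysql.connector.connect(
    host="localhost", user="myuser", password="mypass", database="mydb")
cursor = conn.cursor()
cursor.execute("SELECT VERSION()")
print(cursor.fetchone())  # the server version, as a tuple
conn.close()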

Whooppee, did we learn something useful? Hope so..