
Remote Execution in PyCharm

Editing and Running Python Code on a Remote Server in PyCharm

Recently I was looking at an option to run some code on a remote server while editing it locally. This time the server was on AWS, but generally the ability to do this on any remote server would be nice. I found that PyCharm has a nice option for this: the Python SSH interpreter. Give it some SSH credentials, point it to the Python interpreter on the remote machine, and you should be ready to go. A picture of the overall idea:

Overview

Sounds cool, and it actually works really well, even supporting debugging. A related issue I ran into regarding pipenv support also mentions profiling, pip package management, and more. I haven’t tried all the advanced stuff yet, but at least the basics worked great.

Basic Remote Use

I made this simple program to test this feature:

print("hello world")
with open("bob.txt", "w") as bob:
    bob.write("hello.txt")

print("oops")

The point is to print text to the console and create a file. I am looking to see that running this remotely will show me the prints locally, and create the file remotely. This would confirm to me that the execution happens remotely, while I edit, control execution, and see the results locally.

Running this locally prints "hello world" followed by "oops", and a file named "bob.txt" appears (containing the text "hello.txt"). Great.

To try remotely, I need to set up a remote Python interpreter in PyCharm. This can be done via project preferences:

Add interpreter

Or by clicking the interpreter in the status bar:

Statusbar interpreter

On a local configuration this shows the Python interpreter (or pipenv etc.) on my computer. In a remote configuration it asks for various options, such as the remote server IP and credentials. All the run/debug traffic between the local and remote machines is then automatically transferred over SSH tunnels by PyCharm. To start, select SSH interpreter as the type when adding a new interpreter:

SSH interpreter

Just enter the remote IP/URL address and the username. Click next to enter the password or keyfile as well. PyCharm will try to connect to check that this all works. On the final page of the remote interpreter dialog, it asks for the interpreter path:

Remote Python config

This refers to the Python executable on the remote machine. A simple which python3 on the remote host does the trick. This works to run the code using the system Python on the remote machine.
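For example, on the EC2 host used in this post, that looks like the following (a reconstructed transcript; the path matches the run output further below):

[ec2-user@ip-172-31-3-125 ~]$ which python3
/usr/bin/python3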

To run this remote configuration, I just press the run button as usual in PyCharm. With this, PyCharm uploads my project files to the remote server over SSH, starts the interpreter there with the given configuration, and transports the console output of the execution back to my local host. To me it looks exactly the same as running the code locally. This is the output of running the above configuration:

ssh://ec2-user@18.195.211.65:22/usr/bin/python3 -u /tmp/pycharm_project_411/hello_world.py
hello world
oops

The first line shows some useful information: it is using the SSH interpreter with the given IP and username, and the configured Python path. It also shows the directory where it has uploaded my project files, in this case "/tmp/pycharm_project_411". This is the path defined in the Path Mappings part of the Project Interpreter settings, as illustrated in the image further above (the one with too many red arrows). That image shows a different directory number simply because I was playing with different projects. To see the files and output:

[ec2-user@ip-172-31-3-125 ~]$ cd /tmp/pycharm_project_411/
[ec2-user@ip-172-31-3-125 pycharm_project_411]$ ls
bob.txt  hello_world.py

This is the file listing from the remote server. PyCharm has uploaded the "hello_world.py" file, since this was the only file I had in my project (under the project root as configured for sync in the path mappings). There is a separate tab in PyCharm to see these uploads:

Remote synch

After syncing the files, PyCharm has executed the configuration on the remote host, which was defined to run the hello_world.py file. And this execution has created the file "bob.txt", as it should (on the remote host). The output files go into this remote target directory, as it is the working directory for the running Python program.
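A quick way to confirm this is to print the working directory from the remotely executed code (my addition, not part of the original test program):

import os

# when run via the SSH interpreter, this should print the remote upload
# directory, /tmp/pycharm_project_411 in the run above
print(os.getcwd())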

Synchronization also works in the other direction: from the remote host to local. Since PyCharm provides intelligent coding assistance and navigation on the local system, it needs to know about the libraries used by the executed code. For this reason, it installs locally all the packages installed in the remote host Python environment. Something to keep in mind. I suppose it must set up some type of local virtual environment for this; I haven’t needed to look deeper into that yet.

Using a Remote Pipenv

The above discusses using the standard Python run configuration and interpreter. Something I have found useful for Python environments is pipenv.

So can we also do remote execution with a remote pipenv configuration? The issue I linked earlier contains solutions and discussion on this. Basically, the answer is yes, we can. We just have to find the pipenv files on the remote host and configure the right one as the remote interpreter.

For more complex environments, such as those set up with pipenv, a bit more work is required. The issue I linked before has some actual instructions on how to do this:

Remote pipenv config

I made a directory "t" on the remote host, initialized pipenv there, and installed a few dependencies. So:

  • mkdir t
  • cd t
  • pipenv install pandas

And there we have the basic pipenv setup on the remote host. To find the pipenv directory on the remote host (t is the directory where pipenv was initialized above):

[ec2-user@ip-172-31-3-125 t]$ pipenv --venv
/home/ec2-user/.local/share/virtualenvs/t-x5qHNh_c

To see what it contains:

[ec2-user@ip-172-31-3-125 t]$ ls /home/ec2-user/.local/share/virtualenvs/t-x5qHNh_c
bin  include  lib  lib64  src
[ec2-user@ip-172-31-3-125 t]$ ls /home/ec2-user/.local/share/virtualenvs/t-x5qHNh_c/bin
activate       activate.ps1      chardetect        pip     python     python-config
activate.csh   activate_this.py  easy_install      pip3    python3    wheel
activate.fish  activate.xsh      easy_install-3.7  pip3.7  python3.7

To get the Python interpreter path:

[ec2-user@ip-172-31-3-125 t]$ pipenv --py
/home/ec2-user/.local/share/virtualenvs/t-x5qHNh_c/bin/python

This is just a symlink to python3:

[ec2-user@ip-172-31-3-125 t]$ ls -l /home/ec2-user/.local/share/virtualenvs/t-x5qHNh_c/bin/python
lrwxrwxrwx 1 ec2-user ec2-user 7 Nov  7 20:55 /home/ec2-user/.local/share/virtualenvs/t-x5qHNh_c/bin/python -> python3

Use that path to configure this pipenv as the remote interpreter, as shown above already:

Remote pipenv config

Conclusions

I haven’t used this feature on a large scale yet, but it seems very useful. The issue I keep linking discusses one option of using it to run data processing on a large desktop system from a laptop. I also find it interesting for just running experiments in parallel on a separate machine, or for using cloud infrastructure while developing.

The issue also has some discussion about potential pipenv management from PyCharm coming in the 2020.1 or 2020.2 version. Just speculation, of course. But until then, one can set up the virtualenv using pipenv on the remote host and just use the interpreter path, as above, to set up the SSH interpreter. This works to run the code inside the pipenv environment.

One issue I ran into: PyCharm apparently only keeps a single state mapping in memory for the remote and local file diffs. PyCharm synchronizes files very well and identifies changes in order to upload only the new or changed files. But if I change the remote host address, it seems to still think it has the same delta. Not a big issue, but something to keep in mind, as always.

That’s all.

Using Kafka with Python and (unit) testing my stuff in it..

This time I needed to post some data through Apache Kafka from some of my existing Python code. People tell me Python is such a lovely thing and everything is great when you do Python. Well, I do find the core language and environment to be great for fast scripting and prototyping. But the documentation for all the libraries etc… oh dear.

There is a nice Python Kafka client, which I decided to use (what else was I going to do anyway.. 🙂 ?): https://github.com/mumrah/kafka-python.

But those docs. Those evil docs. The usage page is just a list of some code pasted using some consumer type and some producer type.

Maybe it is because I use Python3 and things would be different on Python2. I don’t know. But the example code from the documentation page:

producer.send_messages("my-topic", "some message")

Just would not work. I had to explicitly transform the strings to bytes. So

producer.send_messages("my-topic", b"some message")

works. Overall, the SimpleProducer is, well, quite simple to get working anyway. But I also needed a consumer for my tests, as I wanted to see that something actually gets passed over to Kafka and that the message delivered is what is expected. Meaning, I want to see that my producer is using Kafka correctly, and to test that, I need to capture what gets passed through Kafka.

So off to try writing a simple consumer. The python-kafka docs show an example of using KafkaConsumer (as usual, it is a code dump). It also tells me to use a keyword argument called “bootstrap_servers” to define where the consumer connects to the Kafka cluster (or single server in my test case). Of course, the KafkaConsumer in the version I have has no keyword argument called “bootstrap_servers”. Reading the code, it seems I need to use “metadata_broker_list” instead. OK, that seems to work.
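So something like this seems to be what this library version wants (a hedged sketch based on my reading of the code, with a placeholder broker address; the exact import path and keyword depend on the kafka-python version):

from kafka import KafkaConsumer

# in this older kafka-python, the broker list keyword is
# metadata_broker_list rather than bootstrap_servers
consumer = KafkaConsumer("my-topic",
                         metadata_broker_list=["192.168.1.100:9092"])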

But while Googling for the various issues I had, I also found there is a class called SimpleConsumer. Whooppee, being a simple guy, I use that then… 🙂 My consumer:

# module-level imports needed; assert_kafka is a method of the test class shown below
import pkg_resources
from kafka import KafkaClient, SimpleConsumer

def assert_kafka(self, expected_file_name):
    kafka_client = KafkaClient(KAFKA_SERVER)
    consumer = SimpleConsumer(kafka_client, b"my_group", KAFKA_TOPIC.encode("utf8"), iter_timeout=1)
    consumer.seek(1, 0)
    actual = ""
    for msg in consumer:
        actual += msg.message.value.decode('utf8') + "\n"
    expected = pkg_resources.resource_string(__name__, expected_file_name).decode('utf8')
    self.assertEqual(actual, expected)

The KafkaConsumer has a timeout called “consumer_timeout_ms”, the number of milliseconds to wait before throwing a timeout exception when no new messages arrive. In the SimpleConsumer the similar timeout is actually called “iter_timeout”, and it represents the number of seconds to wait for new messages when iterating the message queue. By setting it to 1 second, we give the producer a chance to provide all messages for the test, and also a chance to pass through any excess messages it should not be passing.

The actual message data passed from Kafka is in the “msg.message.value” given by the consumer iterator. And it is in bytes.. How did I figure all that out? I have no idea.

The “consumer.seek(1, 0)” call sets the position in the message queue to the beginning of the queue (the 0 parameter), and adds 1 to that (the 1 parameter). This allows us to skip the first message sent by the producer. Why would we want to skip that? Kafka requires either explicitly creating the topics that are used, or it can create them automatically when first addressed. Since python-kafka does not seem to have explicit support for topic creation, I need to rely on the topic auto-creation. So, I set the producer to start with a specific message send to ensure the topic exists. This is to avoid having to put a try-except in every place in the code where messages are sent.

This looks something like:

# imports needed, per the (older) kafka-python API used here:
import time
from kafka import KafkaClient, SimpleProducer
from kafka.common import LeaderNotAvailableError

kafka_client = KafkaClient(KAFKA_SERVER)
self.kafka = SimpleProducer(kafka_client)

try:
    # the first send triggers topic auto-creation, which can fail while creation is in progress
    self.kafka.send_messages(KAFKA_TOPIC, b"creating topic")
except LeaderNotAvailableError:
    time.sleep(1)  # this sleep is needed to give zookeeper time to create the topic

The tests then look something like this:

import os
import unittest

class TestKafka(unittest.TestCase):
    topic_index = 1

    @classmethod
    def setUpClass(cls):
        # assumes the config variables live at module level, shared with the producer code
        global KAFKA_SERVER
        KAFKA_SERVER = os.environ["KAFKA_SERVER"]

    def setUp(self):
        # a unique topic per test, to keep tests from messing with each other
        global KAFKA_TOPIC
        KAFKA_TOPIC = "test_topic_" + str(self.topic_index)
        TestKafka.topic_index += 1

    def test_cpu_sys_kafka(self):
        kafka = KafkaLogger()  # the producer code under test
        kafka.log("hello")
        kafka.close()

        self.assert_kafka('expected.kafka')

Here, I take the environment variable “KAFKA_SERVER” to allow defining the test server address outside the unit test. This is stored in the configuration variable that both the producer code and the test consumer access. The KAFKA_TOPIC variable is similarly used to set a unique topic for each test to avoid tests messing with each other.

To stop test runs from messing with each other, I used the Kafka server configuration options to make it delete all messages within 5-10 seconds of receiving them. These are the two options below, set in the server.properties file:

log.retention.ms=5000
log.retention.check.interval.ms=10000

The first defines that the log is only kept for 5 seconds. The second defines that checks for logs to delete are done at 10-second intervals. This was actually fun to find, since the documentation mostly talks about setting these in hours, and some mention doing it at the very fine-grained level of minutes. But it is also possible to do it in milliseconds, as above. Again, I found that somewhere… 🙂

To summarize: I set up a separate test server, set it to delete messages in short order, invoked the producer code under test to send the relevant messages, started a consumer to read the messages, collected these, and compared them to the expected values. Seems to work for now.

As a side note, I had an interesting error to debug. I kept getting a LeaderNotAvailableError when trying to send any messages over Kafka. I thought it was my code, but even the command line tools that come with Kafka failed. So what was the problem? I had configured the development area ZooKeeper in the test area Kafka server.properties, but used the test area ZooKeeper when sending the messages. Whooppee. What did we learn there? Probably nothing, but if we did, it would be to always double and triple check the configuration and IP addresses etc. In my defence, the error was not very descriptive.

Which relates to another note. My tests worked fine on OSX and Windows, but setting this up on Ubuntu failed. The “advertised.host.name” property in server.properties needed to be set to an IP address, as otherwise Kafka would query the system for a host name, which was not routable from the client. I ended up doing this for the ZooKeeper address as well, which of course did not do much.
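For reference, this is a one-line setting in server.properties (the address below is a placeholder, not the one from my setup):

advertised.host.name=192.168.1.100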

And finally, if the ZooKeeper vs. Kafka state gets messed up, deleting the /tmp/kafka-logs and /tmp/zookeeper directories might help. I think it fixed one of my issues, but with all the issues I kept having, I have no idea anymore… 🙂

Uploading my little project to pypi (python package index)

I wanted to upload my little project to pypi, the place from which the project can be easily installed using pip install.

First of all, I needed to create the setup.py file. This is documented in quite a few places. Just a few notes on this (a minimal sketch follows the list):

  • The “name” attribute is what is used to refer to the project when performing the upload, and what is used on the index.
  • The “packages” attribute should contain all packages to upload. Not sure if “empty” middle packages count or not, but I put one in anyway.
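A minimal setup.py along these lines might look like the following (a sketch; the project name, version, and package names are placeholders, not my actual project):

from setuptools import setup

setup(
    name="my-little-project",  # the name used on the index and when uploading
    version="0.1.0",
    # all packages to upload, including the "empty" middle one
    packages=["mypackage", "mypackage.middle", "mypackage.middle.sub"],
)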

Now, this should all be just fine, but the weirdest errors are to be had. The command to perform the upload is “python setup.py sdist upload -r REPONAME”. Most of the websites tell me that the process should ask for my username and password if I have not defined them anywhere, “anywhere” being the “.pypirc” file mentioned in various places. But no, it gives me the error “invalid schema definition”. Or something similar.

To fix this most informative error, I tried various tricks, renaming fields in setup.py, etc. None worked. So I tried to delete the project from pypi (don’t worry, it’s not like it was popular and had user(s)). This finally gives the more reasonable error that I need a “.pypirc” file.

So, I need to add the “.pypirc” file. But where, since Windows has no clearly defined “~” (home) directory? I had to define the HOME environment variable and point it someplace, create the “.pypirc” there, and put my access credentials in it. Finally, register and upload work…
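For reference, the “.pypirc” contents look something like this (a sketch; the section name listed under index-servers is what the -r REPONAME option refers to, and the credentials are placeholders):

[distutils]
index-servers =
    pypi

[pypi]
repository: https://pypi.python.org/pypi
username: your-username
password: your-password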

Maybe next time I will remember this..

python lazy importing

I was wondering how to stop my little Python program from crashing when it fails to import the MySQL driver, even when the functionality using the MySQL driver is not invoked. The Python runtime loads all imported modules as soon as the import statements execute, including when you import the module from another module. With top-level imports this generally happens right away, as something will refer to the module (or it is a useless module).

The fix is really simple. Instead of

import mysql.connector

at the top of the file, I should just import it where it is first used, such as

def __init__(self):
    # deferred import: mysql.connector is only loaded when the class is
    # first instantiated, so a missing driver only fails if actually used
    import mysql.connector

Seems the term I was looking for in Google was “Python lazy importing” or similar.. whooppee 🙂

python3 mysql connector and pip install procedures

Trying to learn some Python and use some of the nice and shiny libraries available for it.

Problem: Need to use MySQL to store some data. Have to figure out how to connect and use the MySQL DB from Python3.

Several places suggest using a Python package called MySQLdb. Seems fine, but it does not support Python3 at this time. Installation seems a bit of a pain as well, with downloading this and that from different places on different platforms. Also, if there is an “official” version/package, I would prefer that. Turns out there is an “official” package called MySQL Connector from Oracle. Trying to use that.

Problem again: The “official” site (http://dev.mysql.com/downloads/connector/python/ at this time) hosts some “installer” files to download. So I download one and run the installer. Nothing seems to happen, as my Python3 install still does not recognize the package. Why not just “pip3 install xyz”? Such a lovely, simple thing for me…

So, it appears there is something that can actually be installed with pip. On the internetz I finally found some references to a command: “pip install mysql-connector-python --allow-external mysql-connector-python”. But what does this mean? Well, I change pip to pip3 to get Python3, and the “pip3 install mysql-connector-python” part seems obvious: the install command with a package name matching the official package. But what is “--allow-external”, and why is mysql-connector-python repeated twice?
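Written out, the Python3 version of the command is:

pip3 install mysql-connector-python --allow-external mysql-connector-python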

Some of this info I found in the docs, some on Stack Overflow, and some is my guessing. So here goes, hoping for corrections where relevant.

The “pip” or “pip3” commands download and install stuff from the Python package index, a.k.a. “pypi” (located at https://pypi.python.org/pypi/). The “install” command for “pip” then looks for the given package name in the “pypi” repository and installs it if found. However, only some of the actual install files are located in the “pypi” repository; some entries are just references to an external site. The flag “--allow-external” tells the “pip” command to also install a package even if its files are not hosted on “pypi”; the package name is repeated because it is the argument to the flag, naming the package that is allowed to come from an external site. In this case, “pypi” only contains the reference to the package. For security purposes this should still be fine, as the “pypi” entry should still contain the checksum of the external file, so no malicious stuff should be downloaded with any higher probability than when getting one from “pypi” itself.

And how does it work to find something on “pypi”? There is a search functionality on the “pypi” repository pages. If we type “mysql” in there, we get a hit in the top 10 named “mysql-connector-python 2.0.3”. Looks interesting for this case. If we click the link, we get an info page. This has information such as the author (Oracle in this case), which can help figure out if it is what we are looking for. At the bottom of the page there is a “DOAP record” link. This gives us an XML file, which in this case shows the name as “mysql-connector-python”. So this is where I guess the match is made for the “install” command. Then there is a version number for upgrades. Finally, there is a download link with an MD5 checksum embedded at the end. So, guessing again, this is where “--allow-external” goes to download the external install files, and the MD5 checksum is what it uses to check that the file is valid.
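With the package finally installed, a quick test script confirms the connector works on Python3 (a minimal sketch; the connection parameters are placeholders for your own database):

import mysql.connector

# connection parameters below are placeholders
conn = mysql.connector.connect(host="localhost", user="test",
                               password="test", database="testdb")
cursor = conn.cursor()
cursor.execute("SELECT VERSION()")
print(cursor.fetchone())  # prints the MySQL server version
cursor.close()
conn.close()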

Whooppee, did we learn something useful? Hope so..