Securing your queries in flight - Data Privacy with Juremy
December 31, 2023

Juremy’s blazing fast EU corpus search is provided as an online service. This means that your search queries need to make their way to Juremy’s servers first, where we process them, and finally return the search results to your browser or CAT tool.

But how do we keep those queries secure while they are in flight over the wild and unruly internet? This article gives you an overview.

Historic overview: securing insecure communication

Sending emails, receiving emails…

Let’s take checking for new emails in your email provider’s web interface as an example. When you refresh the list of emails, your web browser sends a request to the email provider’s servers. One of those servers answers with the subjects of the newly arrived emails, which then show up on the web interface. This usually happens in tenths of a second, and we are hardly aware of the distance travelled by our request and its response.

But for us to communicate with those remote servers, our requests have to travel through a series of machines over the internet (unless you happen to have a long cable plugged directly into your email provider’s data center… not likely, imagine how awkward it would be to haul that cable with you during your commute).

The problem is, we don’t control those machines, so we would never know if they were secretly reading all our email content as it passed through. And that is exactly what the situation was like in the past.

Secure communication invented!

Luckily, secure communication over the internet was invented, and it gained traction by the early 2000s. The means for this was the TLS protocol. It is often indicated by the padlock icon in your browser (though that is falling out of fashion recently), and you might have heard it mentioned under the names HTTPS (HTTP carried over TLS) or SSL (the name of its predecessor). This protocol armors your requests and responses with a secure envelope, such that only you and the final recipient (the destination server) can open the envelope and read the true message. It makes secure communication, banking and e-commerce, and, most importantly for our purpose, secure Juremy searches possible.
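For the technically curious, here is roughly what "putting a request into a secure envelope" looks like from a program's point of view. This is a minimal sketch using Python's standard ssl module, not Juremy's actual client code; the function names make_client_context and negotiated_tls_version are made up for illustration.

```python
import socket
import ssl


def make_client_context() -> ssl.SSLContext:
    """Build a TLS client context with certificate and hostname checks enabled."""
    # create_default_context() loads the system's trusted certificate
    # authorities and enables certificate verification and hostname checking.
    context = ssl.create_default_context()
    # Refuse protocol versions older than TLS 1.2.
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


def negotiated_tls_version(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Open a TLS connection to host and return the negotiated protocol version."""
    context = make_client_context()
    with socket.create_connection((host, port), timeout=timeout) as raw_sock:
        # wrap_socket performs the TLS handshake. From here on, everything
        # written to tls_sock is encrypted before it leaves this machine.
        with context.wrap_socket(raw_sock, server_hostname=host) as tls_sock:
            return tls_sock.version() or "unknown"
```

Calling negotiated_tls_version with a host name needs network access and returns the protocol version string reported by the ssl module. If the server's certificate cannot be verified, the handshake fails with an error instead of silently continuing.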

The future brings complications

Maintaining a healthy work-load balance

Using TLS, your browser could establish a direct, secure connection with Juremy’s servers, and communicate without the need to trust any third-party middlemen. That would work mostly fine in practice too, but for reasons of operational robustness and scalability, as well as increased network security, it is advisable to insert an extra layer of servers between you and the final ones. This middle layer is called the load balancer, and as its name suggests, it is responsible for coping with a large number of incoming requests, and distributing them to the backing servers in a smart way, for example on a least-loaded basis.
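To make "least-loaded" a bit more concrete, here is a toy sketch of the idea in Python. It is an illustration only and says nothing about how Cloudflare or any real load balancer is implemented; the Backend and LeastLoadedBalancer names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    """A backing server and the number of requests it is currently handling."""
    name: str
    active_requests: int = 0


class LeastLoadedBalancer:
    """Hands each new request to the backend with the fewest active requests."""

    def __init__(self, backends):
        self.backends = list(backends)
        if not self.backends:
            raise ValueError("at least one backend is required")

    def acquire(self) -> Backend:
        # min() returns the first backend with the smallest count,
        # so ties go to the backend listed earlier.
        backend = min(self.backends, key=lambda b: b.active_requests)
        backend.active_requests += 1
        return backend

    def release(self, backend: Backend) -> None:
        # Called when the backend has finished handling a request.
        backend.active_requests -= 1
```

With two idle backends, the first call to acquire picks the first one and the second call picks the other, since by then the first is busier. Real load balancers also take health checks, geography and connection reuse into account, which this sketch leaves out.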

We could be running our own load balancer infrastructure, if we were particularly keen to. But our primary focus is delivering you search features, therefore we outsource this task to Cloudflare.

The curious case of the two padlocks

The following diagram illustrates how your requests reach Juremy over the Internet:

The padlocks indicate the presence of a TLS-secured connection. You can see that there is one such secure connection between you and the Cloudflare load balancers, and another secure connection between Cloudflare and our Juremy servers (hosted within Hetzner’s data centers).

The benefit for us is that Cloudflare handles incoming requests, potentially filtering out abusive traffic. The tradeoff is that in order to do that, Cloudflare needs to be able to remove the secure envelope and access the original message temporarily. Cloudflare then secures the messages again by putting them into another secure envelope before sending them to the Juremy servers.

As Cloudflare is a reputable company, serving, among other things, about 20% of the world’s websites, this is a low-risk and reasonable tradeoff to make. As they outline in their transparency report, Cloudflare never stores the pass-through content of requests or responses. In our case this means that Cloudflare would never store the Juremy search queries or the resulting search hits.

To recap, your queries (or results) are never stored persistently while in flight between your browser and Juremy’s servers.

What about request metadata?

Great question! Request metadata are small log entries recording the fact that a request was made. Very similar to the call history on your phone, the logs don’t store content, but rather the timestamp, duration and recipient of each request.

If one is not careful with how requests to servers are initiated, the metadata log entries could also leak query content. We will cover how Juremy prevents this from happening in the next part of the series!
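Until then, here is a small Python sketch of the kind of leak meant here. It assumes a typical access log that records the timestamp, method, URL and duration of each request. The /search endpoint, the q parameter and the function names are hypothetical and do not describe Juremy's actual API or its chosen solution.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from urllib.parse import urlencode


@dataclass
class Request:
    method: str
    url_path: str   # path plus query string, as it appears in the request line
    body: str = ""  # request body; the logger below never looks at it


def access_log_line(request: Request, when: datetime, duration_ms: int) -> str:
    """Format a metadata-only log entry: timestamp, method, URL and duration."""
    return f"{when.isoformat()} {request.method} {request.url_path} {duration_ms}ms"


def search_via_url(query: str) -> Request:
    # The query is part of the URL, so it ends up in the log line as well.
    return Request("GET", "/search?" + urlencode({"q": query}))


def search_via_body(query: str) -> Request:
    # The query travels in the request body, which the logger does not record.
    return Request("POST", "/search", body=urlencode({"q": query}))


when = datetime(2023, 12, 31, tzinfo=timezone.utc)
print(access_log_line(search_via_url("state aid"), when, 12))
print(access_log_line(search_via_body("state aid"), when, 12))
```

The first log line contains the search terms, while the second only shows that some search was made. In both cases the request itself is still protected by TLS while in flight; the difference is only in what the metadata log remembers afterwards.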