Snowplow Analytics and log streaming into BigQuery for real-time dashboards?
Hi everyone,
I was very excited when I learned about Fastly’s new log streaming into Google BigQuery feature:

Big Data Meets Edge Computing With Fastly’s Real-Time Google BigQuery...
SAN FRANCISCO, August 17, 2017 – Fastly, the edge cloud platform, has announced a new integration with Google BigQuery that combines the benefits of edge computing with BigQuery’s powerful big data analytics platform. The first in a number of planned...
I am currently working with a large publisher. They don’t use Fastly yet (unfortunately) but are currently rolling out Snowplow Analytics. Like most tracking solutions, Snowplow collects data by requesting a little tracking pixel which’s URL contains a list of key-value parameters, e.g. like this:
/i?stm=1504563596539&e=pv&url=...&page=...&tv=js-2.7.2&tna=...&aid=...&p=web&tz=America%2FNew_York&lang=en-US&cs=UTF-8&...
In general, any analytics solution needs a server-side collector component that receives GET or POST requests and passes the data along to the subsequent processes, e.g. Snowplow’s enrichment component and eventually a data sink.
Based on Fastly’s new feature, my idea is to point the tracking requests to Fastly and stream the logs into BigQuery to build real-time dashboards. However, the tutorial doesn’t mention a way to extract the URL’s payload and strean not only the default parameters to BigQuery but also the custom URL parameters that Snowplow works with.
What do you think about this idea? Does anyone know if it is possible, maybe using VCL?
Thanks,
Ian
-
Hi Ian
Yes, it'll be possible using VCL. From what you've said, all that's needed is to extract the value each of the parameters that you need to store and add them to the log format. All that would be necessary is to know the parameter names beforehand; looking at the protocol this looks like it is the case.
You can use VCL like this in order to achieve it (either via the UI, as a VCL snippet or using Custom VCL):
```
set a header with the value of the 'tv' parameter
set req.http.snowplow-tv = subfield(req.url.qs, "tv", "&"); ```
You can then pick this up in the log by adding it to the format that produces your JSON output. It will use a different format depending on if the logging endpoint is configured to use logging version 1 or 2.
-
Hi Justin,
Thank you very much for your quick response, that's awesome news!
Snowplow requests can contain a large number of URL parameters. Is there a limit regarding number of headers?
As Snowplow's URL parameter values can contain JSON objects, is there a character limit for a header's value?
Or is there an overall limit for the total headers payload, especially in conjunction with log streaming requests?
It wouldn't be a problem if there were certain limits because the idea is to use log streaming only for some few key metrics, e.g.
pageviews
,sessions
andvisitors
. In addition to the log streaming, we would set it up so that the entire URL payload is also retrieved on our upstream server, the so-called Snowplow Collector.Best regards, Ian
-
Hi Ian
Yes, there is a limit on the number of headers. We've listed them here. Max total headers is 96 (including internal ones), so practically you have 80+ to work with. I don't believe there is a limit on a single header but there is a 64K limit on total size of all headers, which is the same as the log line limit.
-
Hi Justin,
Basic log streaming to BigQuery is working great, now I’d like to set up the headers. Because we will have a quite large number of URL parameters that we want to set, I'd like to use a VCL snippet.
When we take a request URL like
https://...global.ssl.fastly.net/i?...&duid=d0f6acca-c6cb-41c5-a768-8f296bd69d8b&...
we can use[\?|\&]duid\=([^&]+)
to get the value of theduid
parameter.I added your code example from above (
set req.http.tv = subfield(req.url.qs, "tv", "&");
) to a VCL snippet that runs with thehit
subroutine (vcl_hit
) but unfortunately the value in BigQuery is(null)
which might be a-1
from VCL?I also tried this:
set req.http.tv = regsub(req.url, "[\?|\&]tv\=([^&]+)", "\1");
Thanks in advance, Ian
-
Hi Ian
Apologies for the delay in getting back to you here. I just tested the solution end-to-end and it's working for me.
One difference is your snippet is in vclhit (which won't run on every request) and mine is in vclrecv.
If changing it to the above doesn't fix it for you, I'd recommend opening a support ticket so we can look into the specifics of your case, and if there's a wider issue that we need to address I'll swing back here and update this post.
Please sign in to leave a comment.
Comments
9 comments