I’m (slowly) putting together a PhD thesis at the moment, as well as occasionally having to give presentations. One thing that has always infuriated me about research is the clutter that arises from multiple programming languages, scripts, graphics outputs, versions of files, presentations and papers. You ask yourself where that nice figure of [insert great discovery here] went, or you wonder which version of a script actually led to a particular figure. Of course, decent file organisation is important, but I think part of the problem also comes from figures inevitably getting duplicated depending on their use. A publisher might need the figure in high-res PNG or PDF or EPS, and once a figure is in Powerpoint you often lose track of where it came from. And then there is the increasing push to put stuff online.
D3 is great at generating figures, independent of the data source, and it’s a natural fit for displaying data on the web. But programmable SVG files seem too good to use only on the web. I’ve also come round to the idea that HTML displayed in Chrome in full-screen mode is a better fit for me than Powerpoint. That’s partly due to some of the above problems, but also for the same reasons that people use LaTeX instead of Word. In the end it’s a personal choice, and I’m happy to admit that LaTeX and HTML have their disadvantages compared to all-in-one tools like Word and Powerpoint. I was using my own cobbled-together solution until I discovered I was only reinventing the wheel. There are several solutions out there, but I like reveal.js a lot, and was able to adapt it to look suitably similar (without removing all of its innovative concepts!) to our department’s Powerpoint template pretty easily.
Anyway, now that most of my workflow (presentations and text documents) can use named files for displaying graphics, I decided I wanted a way to have my scripts output data for plotting (so I’m programming language neutral), plot it with D3, and, as automatically as possible, generate identical versions in SVG, PDF and PNG. SVG is for the web or posters (i.e. Illustrator), PDF is for LaTeX and PNG is for, well, everything else, for example Powerpoint, which (no man being an island) we can never escape.
This is how it works:
The data sits in a CSV file.
An appropriate HTML file contains the D3 plotting code. So far I use one HTML file per plot; it seems every figure is different enough to warrant that, but of course some bits could be put into a common .js file to avoid duplication.
I run a single Python script with three arguments:
python outputfiles.py [source html file] [html element name] [output prefix]
I then end up with three files: [output prefix].svg, [output prefix].png and [output prefix].pdf.
So, what’s going on in this Python script?
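from subprocess import call
import sys
import cairosvg  # assumption: the conversion calls are missing from this listing; CairoSVG is the converter named later in the post
source = sys.argv[1]   # source html file
element = sys.argv[2]  # html element name
target = sys.argv[3]   # output prefix
# Render the page in PhantomJS and write the extracted SVG element to a file (">" so that reruns overwrite rather than append)
command = "../bin/phantomjs ../lib/extract.js \"" + source + "\" " + element + " > \"" + target + ".svg\""
call(command, shell=True)
# Convert the SVG to PNG and PDF; the output files are opened in binary mode
svg = open(target + ".svg", "rb").read()
fout = open(target + ".png", "wb")
cairosvg.svg2png(bytestring=svg, write_to=fout)
fout.close()
fout = open(target + ".pdf", "wb")
cairosvg.svg2pdf(bytestring=svg, write_to=fout)
fout.close()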
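For what it’s worth, here is a slightly more defensive sketch of the same idea. It is not the script from the download, just one way it could be written: the command is passed as an argument list so file names with spaces need no manual quoting, and the SVG is taken straight from PhantomJS’s standard output. The function names (build_extract_command, output_paths, export_figure) are made up for illustration, and I’m assuming extract.js prints the SVG markup to stdout, as the redirect in the original command suggests.

```python
import subprocess
import sys
from pathlib import Path


def build_extract_command(source, element,
                          phantomjs="../bin/phantomjs",
                          extract_js="../lib/extract.js"):
    """Return the PhantomJS invocation as an argument list (no shell quoting needed)."""
    return [phantomjs, extract_js, source, element]


def output_paths(prefix):
    """Map each output format to its file path, e.g. 'fig1' -> fig1.svg, fig1.png, fig1.pdf."""
    return {ext: Path(prefix + "." + ext) for ext in ("svg", "png", "pdf")}


def export_figure(source, element, prefix):
    """Extract one SVG element from a page and save it as SVG, PNG and PDF."""
    import cairosvg  # third-party; imported here so the helpers above work without it

    paths = output_paths(prefix)
    # Assumption: extract.js writes the SVG markup to stdout.
    result = subprocess.run(build_extract_command(source, element),
                            capture_output=True, check=True)
    paths["svg"].write_bytes(result.stdout)
    cairosvg.svg2png(bytestring=result.stdout, write_to=str(paths["png"]))
    cairosvg.svg2pdf(bytestring=result.stdout, write_to=str(paths["pdf"]))
    return paths


# Only run the export when called with the three expected arguments.
if __name__ == "__main__" and len(sys.argv) == 4:
    export_figure(sys.argv[1], sys.argv[2], sys.argv[3])
```

Because check=True raises an error if PhantomJS fails, a broken page stops the script instead of silently producing an empty SVG and two useless conversions.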
One very important point: do your SVG styling using .style("attribute", "value") function calls and not using style sheets. If you use style sheets, your SVG is styled by the page and the styles are not integrated into the SVG element, which becomes your file, so the file will end up looking very different from what you see on the webpage.
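A quick way to catch this before it bites is to scan the extracted SVG for elements that still lean on a class name. The following is only a rough heuristic of my own, not part of the original script: it flags every element that has a class attribute but no inline style, so it will also flag elements that are styled perfectly well through plain attributes such as fill. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def elements_relying_on_stylesheets(svg_text):
    """Return the tag names of elements that carry a class but no inline style."""
    root = ET.fromstring(svg_text)
    flagged = []
    for el in root.iter():
        if el.get("class") and not el.get("style"):
            # Strip the XML namespace: '{http://www.w3.org/2000/svg}path' -> 'path'
            flagged.append(el.tag.rsplit("}", 1)[-1])
    return flagged


if __name__ == "__main__":
    sample = (
        '<svg xmlns="http://www.w3.org/2000/svg">'
        '<path class="line" d="M0 0"/>'
        '<rect class="bar" style="fill: red"/>'
        '</svg>'
    )
    print(elements_relying_on_stylesheets(sample))
```

An empty list does not prove the figure will look right, but a long one is a good hint that the page’s style sheet is doing work that the exported file will not see.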
Finally, I have everything set up in a slightly pedantic folder structure (I’ve lost count of how many new miracle folder structures I’ve come up with, but this is perhaps the most all-encompassing yet) which looks like:
- Executable code, i.e. phantomjs
- Data, regardless of whether it’s raw or processed from some script (there’s a big gray area in between those two seemingly distinguishable things)
- One folder per figure, usually containing an HTML file for plotting and the files output from the script above. But I also put ready-made figures and videos (see blog post on .webm format***) in here.
- Presentations. In the case of Impress.js (see below), this is a simple HTML file which can reference the figures using simple relative paths
- Scripts for doing stuff – e.g. the Python script for outputting the images and a script for generating screenshots from SUMO. I suppose the distinction between this and lib is that things in lib should be referenced whereas things in here should be run.
- Editable text – LaTeX files live in here
So now, in theory, I can very quickly establish what data the figures in my thesis and presentations are based on, and always have the latest figure for a particular topic to hand. One advantage of keeping as many things as possible in text is that you can use the same editor for all of them, whether it be a text editor, an IDE, or a mobile app (I use PlainText on the iPad).
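To make the “what data is this figure based on” question answerable without opening each file, something like the following sketch could be run over the figure folders. It is an illustration rather than part of the download: it simply looks for quoted paths ending in .csv inside each HTML file, which assumes the plotting code names its data file as a string literal (for example in a d3.csv call). The function names are hypothetical, and the folder to scan is passed in as an argument.

```python
import re
from pathlib import Path

# Any single- or double-quoted string that ends in .csv
CSV_REFERENCE = re.compile(r"""["']([^"']+\.csv)["']""")


def csv_references(html_text):
    """Return the quoted .csv paths mentioned in a page, in order, without duplicates."""
    seen = []
    for path in CSV_REFERENCE.findall(html_text):
        if path not in seen:
            seen.append(path)
    return seen


def figure_data_sources(figures_dir):
    """Map each HTML file under figures_dir to the CSV files it mentions."""
    return {
        str(page): csv_references(page.read_text(encoding="utf-8"))
        for page in sorted(Path(figures_dir).rglob("*.html"))
    }
```

Calling figure_data_sources on the folder that holds the per-figure subfolders gives a dictionary from plot file to data files, which is easy to print or dump into a text file next to the thesis.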
In case this system resonates with you in any way, you can download a sample archive containing all the necessary bits and pieces (apart from the additions to your Python environment) here. Please note that the PhantomJS/CairoSVG part is only tested on Mac and the PhantomJS binary under /bin is the Mac version. In any case it’s probably worthwhile updating PhantomJS, reveal.js and D3.js to the latest versions after you download.