Document Liberation… And justice for all

Ever been in a situation where no maintained software reads your old files? During Libre Graphics Meeting 2014, The Document Foundation announced a new project called Document Liberation.

This project unites developers who help users to access data in file formats that are locked to proprietary and even abandoned software.

Essentially it’s a new face of the existing joint team from LibreOffice and re-lab that is already “responsible” for libraries to read and convert Corel DRAW, Microsoft Visio and Publisher, Apple Keynote and Pages files. Implementations in end-user software include (but are not limited to) LibreOffice, Inkscape, Scribus, and Calligra Suite.

The team currently consists of Fridrich Štrba, Valёk frob Filippov, David Tardon, Laurent Alonso, and a few past GSoC students. Fridrich, Valёk, and David kindly agreed to answer our questions.

How did the three of you get in touch first?

Fridrich: A Visio import filter was among the ideas proposed for Google Summer of Code 2011, as it had been for several years before that. But in 2011, we actually thought that we had some documentation on the file format, because we misinterpreted information on Microsoft’s website. It was during the student application period that a strange individual with the nick “frob” appeared on LibreOffice channels.

From discussions with frob we realized we had made a mistake, but fortunately he turned out to be quite knowledgeable about the binary Visio file format. So we started collaborating, and as the project was advancing pretty successfully, we fell in love with working together.

As for David, he was part of the LibreOffice community and somehow he found the work on filters interesting. Being a nice chap, he just fit well in the team.

David: Yes, I already knew Fridrich from the LibreOffice community. As he added new import libraries, I became Fedora maintainer for every new import library he was working on (because no one else wanted to do it). Obviously I wanted to look at the code a bit, to know how they work.

At some point I started to send Fridrich simple patches, mostly containing fixes for problems found by Coverity. Then, about a year ago, I was looking for something to spend some of my free time on and I decided to write a simple import filter based on libwpd. Initially it was meant to be just a toy, but when Fridrich discovered that I was doing it, he encouraged me to actually publish the code. It is now called libe-book.

For some of the formats I was implementing in libe-book, I wanted a way to see the structure of the file. Since I saw mentions of OLEToy during the GSoC-related discussions on #libreoffice-dev, I looked at it and found it quite suitable for my purposes. At some later point, being urged by Fridrich (see? Nothing would have happened without him :-), I got in touch with Valek, offered him my patches, and he gave me access to the code.

At which point did you realize that launching an official branded entity for the joint LibreOffice/re-lab effort had become a necessity?

Fridrich: The more file formats we support, the more libraries become available. Eventually it became clear that we were hitting the glass ceiling of what is humanly possible to achieve. This endeavour needed to break out of its current limits in order to be able to grow.

There is a hard limit of 24 man hours per day of work for an individual. Hence the growth must be achieved by increasing the number of involved individuals. We thought that a relatively independent project would be more capable of attracting new developers, especially from other communities that already make use of these libraries. Federating all libraries using the same framework under an umbrella of one project seems like a good idea.

On two occasions work on libraries (libvisio and libmspub) was sponsored within the Google Summer of Code program. Are you seeking public or private sponsoring apart from that, or are you satisfied with the current state of affairs?

Fridrich: Saying ‘no’ to a potential sponsor wouldn’t be a good idea. There are two possibilities here: either this stays a spare time hobby of a few people, and we can only hope that the crowd will be interested in contributing, or there will be some sort of financial support to assure that some of us can work on this project as a full-time job. So either of the two possibilities goes, provided that a sufficient number of people join.

MS Publisher file opened in Scribus

Thanks to libmspub Scribus can open files from Microsoft Publisher - a still somewhat popular DTP choice for SOHO.

Even in GSoC, a lot of time is spent on mentoring: helping the student to understand the file format and guiding the student towards an implementation of a quality that is acceptable to the projects that are potential users of a given library.

The reaction of free software developers, when asked to add support for some proprietary file format, typically falls under one of these three types: 1) it’s unfair to support a file format with a closed spec, when the actual app doesn’t support an open one to ensure two-way compatibility, 2) reverse-engineering is nearly impossible, 3) meh, too boring to do. What’s your own experience?

Fridrich: It is always like this with FOSS. Unless someone pays you for doing something, you are the decision maker. We find it fun to try to understand file formats and to extract information from proprietary files. What others think is their business :)

If a developer of an application does not find it time efficient to try to support a file format, so be it. But then, our libraries and their framework make it possible to support proprietary file formats with minimum effort.

Since we have classes to generate documents in several relevant open file-formats, it is enough to support one of them to be able to profit from our work. For instance, Inkscape uses the SVG output that our libraries generate. This made it possible even for a dummy like me to write an Inkscape import filter for Visio files.

Valek: I don’t think open source is in a position to choose to ignore proprietary formats. The majority of the potential users do not care about specifications, they want to use their data. You cannot make up for a lack of interoperability with magic new features. And it’s really nice to see the first results of the conversion that prove you solved the puzzle right.

David: Personally, I could only ever use the third option - “Meh”. “Fairness” has nothing to do with this, you are just doing something nice for your users. Especially if/when the original app is not easily accessible or even no longer available. Besides, in my (arguably quite short) experience, reverse-engineering is very much possible, unless the format is specifically designed to make it hard. And it is an interesting pastime if one happens to like puzzles.

So “Meh” is a valid argument :-) In the FOSS world, one is typically free to pick what one wants to work on. Personally, I have found working on import filters a welcome distraction from my usual work on LibreOffice.

While LibreOffice is first to get all the new cool toys as provided by libcdr, libvisio, libmspub etc., other projects, such as Inkscape and Scribus, use them too. How much/often do you typically interact with developers of 3rd party software?

David: Till now, not at all.

Valek: It depends. I used to bother people from different projects about issues with WMF/EMF support in their code. From time to time I help LibreOffice, Gnumeric, and Calligra developers with binary formats. I’m not a programmer, so my patch-providing capacity is very limited.

In some cases such as with Inkscape, you provided an initial patch to read Visio files, in others you merely consulted. Where do you actually draw the line between working on a library and actual end-user software that relies on it?

Fridrich: The situation was slightly different. The import filter in Inkscape actually uses libcdr’s API to generate an array of SVG pages. What I did personally for Inkscape was to take the existing dialogue for choosing which page of a multi-page PDF to import, and adapt it so that it can preview and choose one of several pages in a multi-page Visio file.

Old Corel DRAW file opened in Inkscape

Inkscape opens Corel DRAW files from v1 to v16. Not even Corel DRAW itself is capable of that.

It was not so much about doing work for Inkscape, more about myself being curious how to code some UI element. You will not believe it, but it was my first ever attempt to make something displayable on a screen. Obviously, that is if you don’t count a couple of MessageBox debugging calls in some Windows code.

The line between working on a library and actual end-user software is perfectly clear. The library is what takes a stream of data and produces callbacks corresponding to the document model class. Everything else is work on the end-user software. Our goal is to do the former and leave the latter to those that know their software much better. For instance, Inkscape can now use Visio stencils, and I would never be able even to figure out how to make it happen.
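To make that split concrete, here is a minimal sketch in Python. It is not code from any of the libraries, which are written in C++ and define their own interfaces. The binary layout and the names `DrawingSink`, `SvgSink`, `parse_toy_stream`, and `toy_to_svg` are all invented for illustration.

```python
import io
import struct
from typing import BinaryIO, List


class DrawingSink:
    """Hypothetical document-model interface: the parser only calls these."""

    def start_page(self, width: float, height: float) -> None:
        raise NotImplementedError

    def draw_rect(self, x: float, y: float, w: float, h: float) -> None:
        raise NotImplementedError

    def end_page(self) -> None:
        raise NotImplementedError


class SvgSink(DrawingSink):
    """One possible consumer: collects the callbacks into an SVG string."""

    def __init__(self) -> None:
        self.parts: List[str] = []

    def start_page(self, width: float, height: float) -> None:
        self.parts.append(
            '<svg xmlns="http://www.w3.org/2000/svg" '
            f'width="{width:g}" height="{height:g}">'
        )

    def draw_rect(self, x: float, y: float, w: float, h: float) -> None:
        self.parts.append(
            f'<rect x="{x:g}" y="{y:g}" width="{w:g}" height="{h:g}"/>'
        )

    def end_page(self) -> None:
        self.parts.append("</svg>")

    def result(self) -> str:
        return "".join(self.parts)


def parse_toy_stream(stream: BinaryIO, sink: DrawingSink) -> None:
    """Library side: read an invented binary format and emit callbacks.

    Invented layout (little-endian): page width and height as two float32,
    a uint16 rectangle count, then four float32 (x, y, w, h) per rectangle.
    """
    width, height = struct.unpack("<2f", stream.read(8))
    (count,) = struct.unpack("<H", stream.read(2))
    sink.start_page(width, height)
    for _ in range(count):
        x, y, w, h = struct.unpack("<4f", stream.read(16))
        sink.draw_rect(x, y, w, h)
    sink.end_page()


def toy_to_svg(data: bytes) -> str:
    """Application side: pick a sink and hand it to the parser."""
    sink = SvgSink()
    parse_toy_stream(io.BytesIO(data), sink)
    return sink.result()
```

Swapping `SvgSink` for another class with the same three methods would change the output without touching the parser, which is the division of labour Fridrich describes.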

Many of the “newer” libraries such as libvisio and libcdr do not provide saving/exporting to proprietary file formats. What are the main reasons for that?

Fridrich: First of all, it is one thing to understand a file format enough to be able to extract the relevant information about its content, positions, formatting, … And it’s a completely different story to understand a file format enough to generate a file that the original application would load and understand. From a technical point of view, that kind of work would be much more complicated, by orders of magnitude. Here the cost in time and effort outweighs all benefits, even assuming that we would actually be able to generate a compliant file.

Another reason is that we believe that real document freedom goes with the use of truly free standard file formats. Our goal is to allow people to load their documents, not help them to perpetuate their vendor lock-in.

Valek: In places where it makes sense, I’m trying to implement a limited ability to save modified versions of the files from OLEToy [an application written by the re-lab part of the project to analyze binary files], because it can help with the format research. However, such features are made only for research purposes, do not have a friendly UI, and are never part of the libraries, for the reasons mentioned by Fridrich.

There are complicated cases for reverse-engineering such as DWG, where significant parts of data are encrypted, or FreeHand, where data structure is extremely unorthodox, to put it mildly. How do you make a decision if working on a particular file format sounds like a sensible idea to you?

Fridrich: The decision is based purely on our understanding of the overall need mixed with our evaluation of doability. In the end, it also depends on what we feel like doing. For instance, FreeHand is kind of a pet peeve for Valek, so we started to implement the library just to make him happy. It is a normal part of human experience to do things like that.

It is still possible to bend our interests. If we see that there is a real community around a file format, where people are not necessarily programmers, but just persons who are ready to dive into generation of test files and want to actually learn something while helping us, we are more inclined to spend time on their file formats. There’s also a possibility that one of them would end up contributing code, and it’s a strong motivation for us.

Valek: Open the file and look inside. If you can recognize something in its structure, document it as a module in OLEToy. Depending on the complexity, it could be nearly solved in one run or take multiple attempts over months or years.

If things go well AND there is enough spare time to play with something, then we extend coverage to versions which are clearly past end of life (e.g. the first versions of Visio or Corel DRAW). Sometimes it helps us understand modern versions better, yet it always works towards the project’s goal: to give users access to their own data imprisoned in proprietary cells.
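A first look inside a file can be as simple as checking its leading bytes and printing a hex dump. The sketch below is not taken from OLEToy; the function names `sniff_container` and `hexdump` are hypothetical, and only the three container signatures (OLE2, RIFF, ZIP) are well-known constants.

```python
from typing import List

# Well-known container signatures; the labels are only for illustration.
SIGNATURES = [
    (bytes.fromhex("D0CF11E0A1B11AE1"), "OLE2 compound file"),
    (b"RIFF", "RIFF container"),
    (b"PK\x03\x04", "ZIP archive"),
]


def sniff_container(data: bytes) -> str:
    """Guess the outer container from the first bytes of a file."""
    for magic, label in SIGNATURES:
        if data.startswith(magic):
            return label
    return "unknown"


def hexdump(data: bytes, width: int = 16) -> List[str]:
    """Return classic offset / hex / printable-ASCII lines for the data."""
    pad = width * 3 - 1  # width of the hex column when a row is full
    lines = []
    for offset in range(0, len(data), width):
        chunk = data[offset:offset + width]
        hex_part = " ".join(f"{b:02x}" for b in chunk)
        text = "".join(chr(b) if 32 <= b < 127 else "." for b in chunk)
        lines.append(f"{offset:08x}  {hex_part:<{pad}}  {text}")
    return lines
```

Recognizing the container is only the starting point; the records inside it are where the months or years Valek mentions tend to go.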

David: The main deciding factor is “do I have any files in that format that I cannot currently open on my Fedora?” For example, I started reverse-engineering the file formats of Software602 602Text and later Zoner Draw, because I used both of them in the (remote) past.

Currently you are mostly involved with liberating vector graphics, bitmaps, and rich formatted text, with few exceptions on the OLEToy side (ReCycle loops, NI Kontakt samples, Yamaha’s YEP files) that never made it into libraries. Why is that?

Fridrich: In order to implement a library that reads a file format, you need two components: the file format, obviously, but also a document model that can receive the information that you extract from the file. Currently we have APIs/document models for text documents, spreadsheets, vector graphics and presentations. That is the main reason for this choice.

It could be possible to handle other document formats if we had an idea about how to model the corresponding APIs. But, on the other hand, there is still a host of file formats that need to be opened and don’t require new APIs.

As for OLEToy, it is an introspection tool that we use to understand the structure of different file-formats as well as to look for the relevant information. It is a tool that we use instead of a written documentation for file format elements that we understood. That is why sometimes there is some file format information in it that did not make it into any library… yet.

Valek: Yamaha YEP files are actually supported by an application made by Václav Müller. Neither I, nor David, nor Fridrich has access to Yamaha PSR devices, so we are naturally unable to do anything useful with YEP files.

What do you think are the most interesting aspects of the Document Liberation project that would make more developers willing to join you?

Fridrich: We have passion for what we are doing and we are ready to go several extra miles to get someone started. We treat developers who work with us as rock stars. What else would one want in life? :)

When the question you asked is brought up at conferences, I usually say: “Happy users will reward you. You will be the hero of the people who can now read their documents… and they will get on your nerves listing features that are not converted.” There is always this dialectic relationship, but the intellectual satisfaction that comes from working on something people actually use trumps it all.

David: We are not trying to get people to do our work for us; we are encouraging them to work on filters they want. With the benefit that if they do use our librevenge framework and our preferred license, their work will be usable, with very little integration effort, by all projects that already use another library of the same kind.

What are your immediate plans? Working on a particular library? Any possible upcoming GSoC activity?

Valek: One of the ongoing projects is libpagemaker which Brennan Vincent has been working on for a while. Also, Corel released X7 a few days ago. We are going to spend some time to expand libcdr to support it.

David: We might have a GSoC student extending Apple Pages support in libetonyek (which is practically nonexistent at the moment :-) I would like to add support for Keynote 6 to it. I would also like to finish ePub and LRF support in libe-book in time for LibreOffice 4.3.

3 Comments

  1. I’m surprised not to see any InDesign import on your project list.

    Scribus reads some .idml (InDesign Markup Language) but import needs care
    + is there any project to open the more popular indd (documents) and indb (books) files ?

  2. 2JLuc:
    We started to work on introspection of Indd. This is a work in progress with no particular schedule defined and, at the moment, no estimate of the effort required to reach the state where work on an import implementation will make sense.

  3. An update for the answer to the last question: CorelDraw x7 support landed into the libcdr-0.0.15 release that was tagged on Fri Apr 4 15:38:48 2014 +0200
