Feed aggregator

Didier Roche: Welcome To The (Ubuntu) Bionic Age: Nautilus, a LTS and desktop icons

Planet Ubuntu - Tue, 23/01/2018 - 10:36am
Nautilus, Ubuntu 18.04 LTS and desktop icons: upstream and downstream views.

If you follow the news on various tech websites, you know that one of the latest hot topics in the community has been Nautilus removing its desktop icons. Let's try to clarify some points so that the various discussions around the change have enough background information and don't react on emotion alone, as has been seen lately. You will get both the downstream (mine) and upstream (Carlos's) perspectives here.

Why upstream Nautilus developers are removing the desktop icons

First, I personally wasn't really surprised by the announcement. Let's be clear: GNOME, since its 3.0 release, hasn't shown any icons on the desktop by default. There was an option in Tweaks to turn them back on, but let's be honest: this was never really supported.

The proof is that this code went essentially unmaintained for 7 years and never transitioned to the newer view technologies Nautilus is migrating to. Having patched this code myself many years ago for Unity (moving desktop icons to the right depending on icon size when the launcher is in intellihide mode and thus doesn't reserve a work-area STRUT), I can testify that it was getting old. It became old and rotten for something not even used in the default upstream GNOME experience! It would be some irony to keep it that way.

I'm reading a lot of comments saying "just keep it as an option, the answer is easy". Let me disagree with this. As I already stated in my artful blog post series, and for the same reason that we keep Ubuntu Dock to a very small set of supported options, any added option has a cost:

  • It's another code path to test (manually, most of the time, unfortunately), and the exploding combinations of options that can interact badly with each other just produce an unfinished project, where you have to be careful not to enable this and that option together, or it crashes or causes side effects… People who have played enough with Compiz Config Settings Manager should know what I'm talking about.
  • Not only that, but more code means more bugs, and if you have to transition to a newer technology, you have to modify that code as well. Working on it is detrimental to other bug fixes, features, tests or documentation that could benefit the project. So this piece of code that you keep but don't use has a very negative impact on your whole project. Worse, it indirectly impacts even users who stick to the defaults, as they don't benefit from planned enhancements in other parts of the project, due to the maintainer's time constraints.

So, yeah, there is never “just an option”.

In addition to that argument, which I used to defend upstream's position (even in front of the French Ubuntu community), I also want to highlight that the plan to remove desktop icons was really well executed in terms of communication. However, the feedback the upstream developers got when following this communication plan, which takes time, doesn't motivate doing it again, when in my opinion it should be the standard for any important (or considered-as-important) decision:

  • Carlos blogged about it on Planet GNOME at the end of December. He not only explained the context, why this change, who is impacted by it, and the possible solutions; he also presented some proposals. So the what/why/who/how is complete!
  • In addition to this, there is a very good, more technically oriented abstract in an upstream bug report.
  • A GNOME Shell proof-of-concept extension was even built to show that the long-term solution for users who want icons on the desktop is feasible. It wasn't just a throwaway experiment, as clear goals and targets were defined for what needs to be done to move from a PoC to a working extension. And yes, it was built by the exact same group who are removing desktop icons from Nautilus; I guess that means a lot!

Consequently, users have the detailed information they need to understand why this change is happening and what its long-term consequences will be. That is the foundation for good comments and exchanges on the various striking news blog posts. Very well done, Carlos! I hope that more and more such decisions, on any free software project, will be presented and explained as well as this one was. ;)

A word from the Nautilus upstream maintainer

Now that I've said what I wanted to say about my view on the upstream changes, and before detailing what we are going to do for Ubuntu 18.04 LTS, let me introduce you to the Nautilus upstream maintainer already mentioned many times: Carlos Soriano.

Hello Ubuntu and GNOME community,

Thanks, Didier, for the detailed explanation and for giving me a place on your blog!

I'm writing here because I wanted to clarify some details of the interaction with downstreams, in this case Ubuntu and Canonical developers. When I wrote the blog post with all the details, I explained only the part that purely refers to upstream Nautilus. And that was actually quite well received. However, two weeks after my blog post, some websites explained the change in a not very good way (the clickbait magic).

Usually that is not very important; those who want factual information know that we are on IRC all the time and that we usually write blog posts about upcoming changes in our blog aggregation at Planet GNOME or, for Ubuntu, here on Didier's blog. This time, though, some people in our closer communities (both GNOME and Ubuntu) got splashed with misconceptions, and I wanted to address that. And what better way to do it than with Didier :)

One misconception was that Ubuntu and Canonical were 'yet again' using an older version of software just because. Well, maybe you will be surprised now: my recommendation for Ubuntu and Canonical was actually to stay on Nautilus 3.26. With an LTS version coming, that is by far the most reasonable option. While for a regular user the upstream recommendation is to try out nemo-desktop (by the way, this is another misconception: we said nemo-desktop, not Nemo the app; for a user those are in practice two different things), for a distribution that needs to support and maintain all kinds of requests and stability promises for years, staying with a single codebase they have already worked with is the best option.

Another misconception I saw these days is that it seems we take decisions in a rush. In short: I became Nautilus maintainer 3 years and 4 months ago. Exactly 3 years and one month ago, I realized that we needed to remove that part from Nautilus. It has been quite hard to reconcile within myself during these 3 years that an option upstream did not consider part of the experience we wanted to provide was holding back most of the major work on Nautilus, including driving away contributions from new contributors, given the poor state of the desktop part, which unfortunately impacted the whole code of the application itself. In all this time, downstreams like Ubuntu were a major reason for me to hold on to this code. Discussions about this decision happened with the other Nautilus developers throughout.

And the last misconception was that it looks like GNOME devs and Ubuntu devs live in completely separate niches where no one communicates with the other. While we are usually focused on our personal tasks, when a change is going to happen, we do communicate. In this case, I reached out to the desktop team at Canonical before taking the final decision, provided a draft of the blog post so they could assess the impact, and gave possible options for the LTS release of Ubuntu and beyond.

In summary, the takeaway here is that while we might have slightly different visions, at the end of the day we just want to provide the best experience to users, and for that, believe me, we do the best we can.

In case you have any questions, you can always reach Nautilus upstream in the #nautilus IRC channel at irc.gnome.org or on our mailing list.

Hope you enjoyed this read, and hopefully the benefits of this work will show soon. Thanks again Didier!

What does this mean for Ubuntu?

We thought about this as the Ubuntu Desktop team. Our next release is an LTS (Long-Term Support) version, meaning that Ubuntu 18.04 (currently codenamed Bionic Beaver during its development) will have 5 years of support in terms of bug fixes and security updates.

It also means that most of our user audience will upgrade from our last LTS, Ubuntu 16.04, to Ubuntu 18.04 (or even 14.04 -> 16.04 -> 18.04!). The changes over those last 2 years are quite large in terms of software updates and new features. On top of this, those users will experience the Unity -> GNOME Shell transition for the first time, and we want to give them a feeling of comfort and familiar landmarks in our default experience, despite the huge changes underneath.

On the Ubuntu desktop, we ship a Dock, visible by default. Consequently, the desktop view itself, without any application on top, is more important in our user experience than it is in the upstream GNOME Shell default. We think that shipping icons on the desktop is still relevant for our user base.

Where does this leave us regarding those changes? Thinking about the problem, we came to approximately the same conclusions as the upstream Nautilus developers:

  • Staying on Nautilus 3.26 for the LTS release: the pro is that it's battle-tested code we already know we can support (it shipped on 17.10). This matches the fact that the LTS is a very important and strong commitment for us. The cons are that it won't be the latest and greatest upstream Nautilus release by release date, and some integrations with other GNOME 3.28 code might require more downstream work from us.
  • Using an alternative file manager for the desktop, like Nemo. Shipping entirely new code on an LTS, having to support 2 file managers (Nautilus for normal file browsing and Nemo for the desktop), and ensuring the integration between those two and all other applications works well quickly ruled out that solution.
  • Upgrading to Nautilus 3.28 and shipping the PoC GNOME Shell extension, contributing to it as much as possible before release. The issue (despite this being the long-term solution) is that, as in the previous option, we would ship entirely new code, and the extension needs new APIs from Nautilus which aren't fully shaped yet (and maybe won't be ready even for GNOME 3.28). Also, we planned the features and work needed for the next release a long time in advance (end of September for 18.04 LTS), and we still have a lot to do for Ubuntu 18.04 LTS, some of it being GNOME upstream code. Consequently, rushing into coding this extension, with our Feature Freeze deadline approaching on March 1st, would mean that we either drop some initially planned features, or fix fewer bugs and polish less, which would be detrimental to our overall release quality.

As in every release, we decide on a component-by-component basis what to do (upgrade to the latest version or not), weighing the pros and cons and trying to take the best decision for our end-user audience. We think that most GNOME components will be upgraded to 3.28. However, in this particular instance, we decided to keep Nautilus at version 3.26 in Ubuntu 18.04 LTS. You can read the discussion that took place during our weekly Ubuntu Desktop meeting on IRC, which led to that decision.

Another pro of that decision is that it gives flavors shipping Nautilus by default, like Ubuntu Budgie and Edubuntu, a little more time to find a solution matching their needs, as they don't run GNOME Shell and so can't use that extension.

The experience will thus be: desktop icons (with Nautilus 3.26) in the default ubuntu session. The vanilla GNOME session I have talked about many times will also run Nautilus 3.26 (as we can only have one version of a piece of software in the archive and installed on a user's machine with traditional deb packages), but with icons on the desktop disabled, as we did on Ubuntu Artful. I expect some motivated users will build Nautilus 3.28 in a PPA, but it won't receive official security and bug-fix support, of course.

Meanwhile, we will start contributing to a more long-term plan for this new GNOME Shell extension with the Nautilus developers: shaping a proper API, getting good drag-and-drop support, and so on, progressively… This will give better long-term code, and we hope that the following Ubuntu releases will be able to move to it once it reaches the minimal set of features we want from it (and consequently, update to the latest Nautilus version!).

I hope that sheds some light on both the GNOME upstream and Ubuntu decisions, showing the two perspectives, why those actions were taken, and the long-term plan. Hopefully, these posts explaining a little of the context will lead to informed and constructive comments as well!

Benjamin Mako Hill: Introducing Computational Methods to Social Media Scientists

Planet Ubuntu - Tue, 23/01/2018 - 1:38am

The ubiquity of large-scale data and improvements in computational hardware and algorithms have enabled researchers to apply computational approaches to the study of human behavior. One of the richest contexts for this kind of work is social media datasets like Facebook, Twitter, and Reddit.

We were invited by Jean Burgess, Alice Marwick, and Thomas Poell to write a chapter about computational methods for the Sage Handbook of Social Media. Rather than simply listing what sorts of computational research have been done with social media data, we decided to use the chapter both to introduce a few computational methods and to use those methods to analyze the field of social media research itself.

[Figure from the chapter: a “hairball” diagram illustrating how research on social media clusters into distinct citation network neighborhoods.]

Explanations and Examples

In the chapter, we start by describing the process of obtaining data from web APIs, using as a case study our process for obtaining bibliographic data about social media publications from Elsevier's Scopus API. We follow this same strategy in discussing social network analysis, topic modeling, and prediction. For each, we discuss some of the benefits and drawbacks of the approach and then provide an example analysis using the bibliographic data.
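
As a hedged illustration of that first step, a request to the Scopus Search API might look like the sketch below; the endpoint and parameters here are from memory and should be checked against Elsevier's current API documentation, and YOUR_API_KEY is a placeholder:

# Query the Scopus Search API for publications matching a phrase
# (endpoint and parameter names are assumptions, not verified here):
curl -s -H "Accept: application/json" \
    "https://api.elsevier.com/content/search/scopus?query=ALL(%22social%20media%22)&apiKey=YOUR_API_KEY"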

We think that our analyses provide some interesting insight into the emerging field of social media research. For example, we found that social network analysis and computer science drove much of the early research, while recently consumer analysis and health research have become more prominent.

More importantly though, we hope that the chapter provides an accessible introduction to computational social science and encourages more social scientists to incorporate computational methods in their work, either by gaining computational skills themselves or by partnering with more technical colleagues. While there are dangers and downsides (some of which we discuss in the chapter), we see the use of computational tools as one of the most important and exciting developments in the social sciences.

Steal this paper!

One of the great benefits of computational methods is their transparency and their reproducibility. The entire process—from data collection to data processing to data analysis—can often be made accessible to others. This has both scientific benefits and pedagogical benefits.

To aid in the training of new computational social scientists, and as an example of the benefits of transparency, we worked to make our chapter pedagogically reproducible. We have created a permanent website for the chapter at https://communitydata.cc/social-media-chapter/ and uploaded all the code, data, and material we used to produce the paper itself to an archive in the Harvard Dataverse.

Through our website, you can download all of the raw data that we used to create the paper, together with code and instructions for how to obtain, clean, process, and analyze the data. Our website walks through what we have found to be an efficient and useful workflow for doing computational research on large datasets. This workflow even includes the paper itself, which is written using LaTeX + knitr. These tools let changes to data or code propagate through the entire workflow and be reflected automatically in the paper itself.

If you use our chapter for teaching about computational methods, or if you find bugs or errors in our work, please let us know! We want this chapter to be a useful resource, will happily consider any changes, and have even created a git repository to help with managing these changes!

The book chapter and this blog post were written with Jeremy Foote and Aaron Shaw. You can read the book chapter here. This blog post was originally published on the Community Data Science Collective blog.

Alberto Ruiz: GUADEC 2017: GNOME’s Renaissance

Planet GNOME - Tue, 23/01/2018 - 1:35am

NOTE: This is a blog post I kept as a draft right after GUADEC to reflect on it and the GNOME project, but failed to finish and publish until now. Forgive any outdated information, though I think the post is still mostly relevant.

I'm on my train back to London from Manchester, where I just spent 7 amazing days with my fellow GNOME community members. Props to the local team for an amazing organization: everything went smoothly, people seemed extremely pleased with the setup as far as I can tell, and the venues seemed to have worked extremely well. I mostly want to reflect on a feeling I have that GNOME is experiencing a renaissance, both in the energy and focus of the community and in the broader interest from other players.

[Photo source: kitty-kat @ flickr, CC BY-SA 4.0]

Peak attendance and sponsorship

Our attendance numbers have gone up considerably from recent years, to approximately 260 registrations, minus a bunch of people who could not make it in the end. That is an increase of ~50-60 attendees, which is really encouraging.

On top of that, this year's sponsorships went up, both in the number of companies sponsoring and in their support for the event. This is really encouraging, as it shows increased interest in the project and an acknowledgement of GUADEC's value.

Comebacks

There were two very encouraging comebacks: first, Canonical and Ubuntu community members are back, and it was really great to see them participating. Additionally, members of the Solaris Desktop team showed up too. Having Andrew Walton from VMWare around was encouraging as well.

Balance was brought back to the force

Historically, Red Hat has had a prominent presence at GUADEC and in GNOME in general, so it was really amazing to see that Endless is now a Foundation AdBoard member, and also that the size of the Endless crew at GUADEC matched that of Red Hat.

Contrary to what people may think, Red Hat does not generally enjoy being the only player in any given free software community; however, since Nokia and Sun/Oracle withdrew their heavy investment in GNOME, Red Hat has probably been the most prominent player in the community until now. While Red Hat is still investing as heavily as ever in GNOME, we're not the only major player anymore, and that is something to celebrate and to look forward to seeing expanded to other organizations.

It was particularly encouraging to see the design sessions packed with people from Endless, Canonical and Red Hat as well as many other interested individuals.

Flathub

It feels to me that Flatpak, and especially the current efforts around Flathub, have helped focus the community towards a common vision for the developer and user story, and you can feel the shared excitement that we're onto something, with implications across the stack and our infrastructure.

Obviously, not everybody shares this enthusiasm around Flatpak itself, but the broad consensus is that the model around it is worth pursuing and that it has the potential to considerably raise the viability of the Free Software desktop and personal devices; not to mention, it gives a route towards monetization of free software and Linux apps.

GitLab, Meson and BuildStream

Another batch of modernization in our stack and infrastructure. First and foremost, the GitLab migration, which we believe will not only improve the experience of newcomers and early testers but also improve our Continuous Integration pipeline.

Consensus around Meson and leaving autotools behind is another big step, and many other relevant free software projects are jumping on the bandwagon. And last but not least, Tristan and Codethink are leading an effort to consolidate GNOME Continuous and jhbuild into BuildStream, a better way to build a collection of software from multiple source repositories.

Looking ahead

I think the vibe at GUADEC and the current state of GNOME are really exciting, and there is a lot to look forward to as well. My main takeaway is that the project is in an incredibly healthy state, with a rich ecosystem of people betting entire companies on products based on what the GNOME community writes, and a commitment to solving some of the very hard remaining problems left to make the Free Desktop a viable contender to what the industry offers right now.

Promising times ahead.

Mentors and co-mentors for Debian's Google Summer of Code 2018

Planet Debian - Tue, 23/01/2018 - 12:50am

Debian is applying as a mentoring organization for the Google Summer of Code 2018, an internship program open to university students aged 18 and up.

Debian already has a wide range of projects listed but it is not too late to add more or to improve the existing proposals. Google will start reviewing the ideas page over the next two weeks and students will start looking at it in mid-February.

Please join us and help extend Debian! Consider listing a potential project for interns or adding your name as a possible co-mentor for one of the existing projects on Debian's Google Summer of Code wiki page.

At this stage, mentors are not obliged to commit to accepting an intern, but it is important for potential mentors to be listed to get the process started. You will have the opportunity to review student applications in March and April, and to give the administrators a definite decision in early April if you wish to proceed.

Mentors, co-mentors and other volunteers can follow an intern through the entire process or simply volunteer for one phase of the program, such as helping recruit students at a local university or helping test the work a student has completed at the end of the summer.

Participating in GSoC has many benefits for Debian and the wider free software community. If you have questions, please come and ask us in #debian-outreach on IRC or on the debian-outreachy mailing list.

Daniel Pocock and Laura Arjona Reina https://bits.debian.org/ Bits from Debian

Ick: a continuous integration system

Planet Debian - Mon, 22/01/2018 - 7:30pm

TL;DR: Ick is a continuous integration (CI) system. See http://ick.liw.fi/ for more information.

More verbose version follows.

First public version released

The world may not need yet another continuous integration (CI) system, but I do. I've been unsatisfied with the ones I've tried or looked at. More importantly, I am interested in a few things that are more powerful than anything I've ever even heard of. So I've started writing my own.

My new personal hobby project is called ick. It is a CI system, which means it can run automated steps for building and testing software. The home page is at http://ick.liw.fi/, and the download page has links to the source code, .deb packages, and an Ansible playbook for installing it.

I have now made the first publicly advertised release, dubbed ALPHA-1, version number 0.23. It is of alpha quality, and that means it doesn't have all the intended features, and if any of the features it does have work, you should consider yourself lucky.

Invitation to contribute

Ick has so far been my personal project. I am hoping to make it more than that, and invite contributions. See the governance page for the constitution, the getting started page for tips on how to start contributing, and the contact page for how to get in touch.

Architecture

Ick has an architecture consisting of several components that communicate over HTTPS using RESTful APIs and JSON for structured data. See the architecture page for details.
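
To give a feel for what "RESTful APIs and JSON over HTTPS" means in practice, here is a purely hypothetical sketch; the endpoints, host name, and authentication header below are invented for illustration and are not ick's documented API (see the architecture page for that):

# List the projects a CI controller knows about (hypothetical endpoint):
curl -s -H "Authorization: Bearer $TOKEN" \
    https://ci.example.com/projects
# Trigger a build by POSTing a JSON request (again, hypothetical):
curl -s -X POST \
    -H "Authorization: Bearer $TOKEN" \
    -H "Content-Type: application/json" \
    -d '{"project": "hello", "action": "build"}' \
    https://ci.example.com/builds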

Manifesto

Continuous integration (CI) is a powerful tool for software development. It should not be tedious, fragile, or annoying. It should be quick and simple to set up, and work quietly in the background unless there's a problem in the code being built and tested.

A CI system should be simple, easy, clear, clean, scalable, fast, comprehensible, transparent, and reliable, and it should boost your productivity. It should not take a lot of effort to set up, should not require a lot of hardware just for the CI, and should not need frequent attention to keep working; developers should never have to wonder why something isn't working.

A CI system should be flexible to suit your build and test needs. It should support multiple types of workers, as far as CPU architecture and operating system version are concerned.

Also, like all software, CI should be fully and completely free software and your instance should be under your control.

(Ick is little of this yet, but it will try to become all of it. In the best possible taste.)

Dreams of the future

In the long run, I would like ick to have features like the ones described below. It may take a while to get them all implemented.

  • A build may be triggered by a variety of events. Time is an obvious event, as is the project's source code repository changing. More powerfully, any build dependency changing, regardless of whether the dependency comes from another project built by ick or from a package from, say, Debian: ick should keep track of all the packages that get installed into a project's build environment, and if any of their versions change, it should trigger the project's build and tests again.

  • Ick should support building in (or against) any reasonable target, including any Linux distribution, any free operating system, and any non-free operating system that isn't brain-dead.

  • Ick should manage the build environment itself, and be able to do builds that are isolated from the build host or the network. This partially works: one can ask ick to build a container and run a build in the container. The container is implemented using systemd-nspawn. This can be improved upon, however; a rough sketch of this kind of isolated build follows after this list. (If you think Docker is the only way to go, please contribute support for that.)

  • Ick should support any workers that it can control over ssh or a serial port or other such neutral communication channel, without having to install an agent of any kind on them. Ick won't assume that it can have, say, a full Java runtime, so that the worker can be, say, a microcontroller.

  • Ick should be able to effortlessly handle very large numbers of projects. I'm thinking here that it should be able to keep up with building everything in Debian, whenever a new Debian source package is uploaded. (Obviously whether that is feasible depends on whether there are enough resources to actually build things, but ick itself should not be the bottleneck.)

  • Ick should optionally provision workers as needed. If all workers of a certain type are busy, and ick's been configured to allow using more resources, it should do so. This seems like it would be easy to do with virtual machines, containers, cloud providers, etc.

  • Ick should be flexible in how it can notify interested parties, particularly about failures. It should allow an interested party to ask to be notified over IRC, Matrix, Mastodon, Twitter, email, SMS, or even by a phone call and speech synthesiser. "Hello, interested party. It is 04:00 and you wanted to be told when the hello package has been built for RISC-V."
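
To make the container point above concrete, here is a minimal sketch of an isolated build with systemd-nspawn, as mentioned in the list. The paths and build commands are illustrative assumptions, not what ick actually runs:

# Create a minimal Debian build environment (target path is an assumption):
sudo debootstrap stretch /var/lib/machines/buildenv
# Run the build isolated from the host and the network: the source tree
# is bind-mounted into the container, and --private-network cuts it off
# from the outside world.
sudo systemd-nspawn -D /var/lib/machines/buildenv \
    --bind="$PWD":/srv/src \
    --private-network \
    /bin/sh -c 'cd /srv/src && ./configure && make && make check'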

Please give feedback

If you try ick, or even if you've just read this far, please share your thoughts on it. See the contact page for where to send it. Public feedback is preferred over private, but if you prefer private, that's OK too.

Lars Wirzenius' blog http://blog.liw.fi/englishfeed/ englishfeed

Improving communication

Planet Debian - Mon, 22/01/2018 - 3:49pm

After my last post, a lot of things happened, but what I'm going to talk about now is the thing that I believe had the most impact on improving my experience with the Outreachy internship: the changes that were made in communication, especially between me and my mentors.

When I struggled with the tasks and with moving forward, I somewhat wished to change the way I communicate with my mentors. (Alright, Renata, so why didn't you start by just doing that? Well, I wasn't sure where to begin.)

I didn't know how to propose something like that to my mentors. I mean... maybe that was how Outreachy was supposed to be, and I had just set different expectations? I took the first step to figure this out by reaching out to Anna, an Outreachy intern with Wikimedia who I'd been talking to since the interns announcement had been made.

I asked her how she interacted with her mentors and how often, so I knew what I could ask for. She told me about her weekly meetings with her mentors and how she could chat directly with them when she ran into issues. And, indeed, that was the kind of thing I wanted to happen.

Before I could reach out and discuss this with my mentors, though, Daniel himself read last week's post and brought up the idea of us speaking on the phone for the first time. That was indeed a good experience, and I told him I would like to repeat it, or to establish some sort of schedule for communicating with each other.

Yes, a schedule would be the best improvement, I think. It's not just about the means (phone call or IRC, for instance) we use to communicate, but about knowing that at some point, either weekly or bi-weekly, there would be someone to talk to at a set time so I could untie any knots that formed during my internship (if that makes sense). I know I could just send an email to my mentors at any time (and sometimes I do) and they would reply, but that's not quite the point.

So, to make this short: I started to talk to one of my mentors daily, and it's been really helpful. We are working on a schedule for bi-weekly calls. And we always have e-mail. I'm glad to say that now I talk not just with mentors, but also with fellow Brazilian Outreachers and former participants, and everyone is willing to help out.

For all the ways to reach me, you can look up my Debian wiki profile.

Renata https://rsip22.github.io/blog/ Renata's blog

FAI.me build service now supports backports

Planet Debian - Mon, 22/01/2018 - 2:00pm

The FAI.me build service now supports packages from the backports repository. When selecting the stable distribution, you can also enable backports packages. The customized installation image will then use the kernel from backports (currently 4.14), and you can add additional packages by appending /stretch-backports to the package name, e.g. notmuch/stretch-backports.
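
For comparison, this package/suite notation mirrors plain apt's own syntax for pulling a single package from backports on an installed stretch system; this is standard apt behaviour, independent of the FAI.me service:

# Assuming stretch-backports is enabled in /etc/apt/sources.list:
sudo apt-get update
sudo apt-get install notmuch/stretch-backports
# Equivalent explicit form:
sudo apt-get install -t stretch-backports notmuch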

Currently, the FAIme service offers images built with Debian stable, stable with backports, and Debian testing.

If you have any ideas for extensions or any feedback, send an email to FAI.me =at= fai-project.org

FAI.me

Thomas Lange http://blog.fai-project.org/ FAI (Fully Automatic Installation) / Plan your Installation and FAI installs your Plan

Rblpapi 0.3.8: Strictly maintenance

Planet Debian - Mon, 22/01/2018 - 1:47pm

Another Rblpapi release, now at version 0.3.8, arrived on CRAN yesterday. Rblpapi provides a direct interface between R and the Bloomberg Terminal via the C++ API provided by Bloomberg Labs (note that a valid Bloomberg license and installation is required).

This is the eighth release since the package first appeared on CRAN in 2016. This release wraps up a few smaller documentation and setup changes, but also includes an improvement to the (less frequently used) subscription mode which Whit cooked up over the weekend. Details below:

Changes in Rblpapi version 0.3.8 (2018-01-20)
  • The 140 day limit for intra-day data histories is now mentioned in the getTicks help (Dirk in #226 addressing #215 and #225).

  • The Travis CI script was updated to use run.sh (Dirk in #226).

  • The install_name_tool invocation under macOS was corrected (@spennihana in #232).

  • The blpAuthenticate help page has additional examples (@randomee in #252).

  • The blpAuthenticate code was updated and improved (Whit in #258 addressing #257).

  • The jump in version number was an oversight; this should have been 0.3.7.

And only while typing up these notes do I realize that I fat-fingered the version number. This should have been 0.3.7. Oh well.

Courtesy of CRANberries, there is also a diffstat report for this release. As always, more detailed information is on the Rblpapi page. Questions, comments etc. should go to the issue tickets system at the GitHub repo.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

Dirk Eddelbuettel http://dirk.eddelbuettel.com/blog Thinking inside the box

Keeping an Irish home warm and free in winter

Planet Debian - Mon, 22/01/2018 - 10:20am

The Irish Government's Better Energy Homes Scheme gives people grants from public funds to replace their boiler and install a zoned heating control system.

Having grown up in Australia, I think it is always cold in Ireland and would be satisfied with a simple control switch with a key to make sure nobody ever turns it off, but that isn't what they had in mind for these energy efficiency grants.

Having recently stripped everything out of the house, right down to the brickwork and floorboards in some places, I'm cautious about letting any technologies back in without checking whether they are free and trustworthy.

This issue would also appear to fall under the scope of FSFE's Public Money Public Code campaign.

Looking at the last set of heating controls in the house, they had been there for decades. Therefore, I can't help wondering: if I buy some proprietary black box today, will the company behind it still be around when it needs a software upgrade in the future? How many of these black boxes have wireless transceivers inside them that will be compromised by security flaws within the next 5-10 years, making another replacement essential?

With free and open technologies, anybody who uses them can potentially make improvements whenever they want. Every time a better algorithm is developed, if all the homes in the country start using it immediately, we will always be at the cutting edge of energy efficiency.

Are you aware of free and open solutions that qualify for this grant funding? Can a solution built with devices like Raspberry Pi and Arduino qualify for the grant?

Please come and share any feedback you have on the FSFE discussion list (join, reply to the thread).

Daniel.Pocock https://danielpocock.com/tags/debian DanielPocock.com - debian

Continuous integration testing of TeX Live sources

Planet Debian - Mon, 22/01/2018 - 10:15am

The TeX Live sources consist in total of around 15000 files and 8.7M lines (see the git stats). They integrate several upstream projects, including big libraries like FreeType, Cairo, and Poppler. Changes come in from a variety of sources: external libraries, TeX-specific projects (LuaTeX, pdfTeX etc.), as well as our own adaptations and changes/patches to upstream sources. For quite some time I have wanted to have continuous integration (CI) testing, but since our main repository is based on Subversion, the usual (easy, or at least the one I know) route via Github and one of the CI testing providers didn't come to my mind – until last week.

Over the weekend I set up CI testing for our TeX Live sources using the following ingredients: git-svn for checkout, Github for hosting, Travis-CI for testing, and a cron job that ties them together. To be more specific:

  • git-svn: I use git-svn to check out only the source part of the (otherwise far too big) Subversion repository onto my server. This is similar to the git-svn checkout of the whole of TeX Live I reported on here, but contains only the source part (see the sketch after this list).
  • Github: The git-svn checkout is pushed to the project TeX-Live/texlive-source on Github.
  • Travis-CI: The CI testing is done in the TeX-Live/texlive-source project on Travis-CI (who offer free services for open source projects, thanks!).
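
As an illustration of that first step, a source-only git-svn checkout might look like the sketch below; the repository URL and the Build/source path are my assumptions from memory, so double-check them against the actual TeX Live repository layout:

# Clone only the source tree of the TeX Live subversion repository
# (URL and path are assumptions, not verified here):
git svn clone svn://tug.org/texlive/trunk/Build/source texlive-source.git
cd texlive-source.git
# Later updates fetch new svn revisions and rebase on top of them:
git svn rebase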

Although this sounds easy, there are a few stumbling blocks. First of all, the .travis.yml file is not contained in the main Subversion repository. So adding it to the master tree that is managed via git-svn does not work, because the history is rewritten (git svn rebase). My solution was to create a separate branch travis-ci which adds only the .travis.yml file, and to merge master into it.

Travis-CI by default tests all branches, and skips those not containing a .travis.yml, but to be sure I added an except clause stating that the master branch should not be tested. This way other developers can try different branches, too. The full .travis.yml can be checked on Github; here is the current status:

# .travis.yml for texlive-source CI building
# Norbert Preining
# Public Domain
language: c
branches:
  except:
    - master
before_script:
  - find . -name \*.info -exec touch '{}' \;
before_install:
  - sudo apt-get -qq update
  - sudo apt-get install -y libfontconfig-dev libx11-dev libxmu-dev libxaw7-dev
script: ./Build

What remains is stitching these things together with a cron job that regularly runs git svn rebase on the master branch, merges the master branch into the travis-ci branch, and pushes everything to Github. The current cron job is here:

#!/bin/bash
# cron job for updating texlive-source and pushing it to github for ci
set -e
TLSOURCE=/home/norbert/texlive-source.git
GIT="git --no-pager"
quiet_git() {
    stdout=$(tempfile)
    stderr=$(tempfile)
    if ! $GIT "$@" >$stdout 2>$stderr; then
        echo "STDOUT of git command:"
        cat $stdout
        echo "************"
        cat $stderr >&2
        rm -f $stdout $stderr
        exit 1
    fi
    rm -f $stdout $stderr
}
cd $TLSOURCE
quiet_git checkout master
quiet_git svn rebase
quiet_git checkout travis-ci
# don't use [skip ci] here because we only built the
# last commit, which would stop building
quiet_git merge master -m "merging master"
quiet_git push --all

With this setup we get CI testing of our changes to the TeX Live sources, and in the future maybe some developers will use separate branches to get testing there, too.

Enjoy.

Norbert Preining https://www.preining.info/blog There and back again

Dustin Kirkland: Dell XPS 13 with Ubuntu -- The Ultimate Developer Laptop of 2018!

Planet Ubuntu - Mon, 22/01/2018 - 9:51am

I'm the proud owner of a new Dell XPS 13 Developer Edition (9360) laptop, pre-loaded from the Dell factory with Ubuntu 16.04 LTS Desktop.

Kudos to the Dell and Canonical teams that have engineered a truly remarkable developer desktop experience.  You should also check out the post from Dell's senior architect behind the XPS 13, Barton George.
As it happens, I'm also the proud owner of a long-loved, heavily used, 1st-generation Dell XPS 13 Developer Edition laptop :-)  See this post from May 7, 2012.  You'll be happy to know that machine is still going strong.  It's now my wife's daily driver.  And I use it almost every day, for any and all hacking that I do from the couch, after hours, after I leave the office ;-)

Now, this latest XPS edition is a real dream of a machine!

From a hardware perspective, this newer XPS 13 sports an Intel i7-7660U 2.5GHz processor and 16GB of memory.  While that's mildly exciting to me (as I've long used i7s and 16GB), here's what I am excited about...

The 500GB NVME storage and a whopping 1239 MB/sec I/O throughput!

kirkland@xps13:~$ sudo hdparm -tT /dev/nvme0n1
/dev/nvme0n1:
Timing cached reads: 25230 MB in 2.00 seconds = 12627.16 MB/sec
Timing buffered disk reads: 3718 MB in 3.00 seconds = 1239.08 MB/sec

And on top of that, this is my first QHD+ touch screen laptop display, sporting a magnificent 3200x1800 resolution.  The graphics are nothing short of spectacular.  Here's nearly 4K of Hollywood hard "at work" :-)


The keyboard is super comfortable.  I like it a bit better than the 1st generation.  Unlike your Apple friends, we still have our F-keys, which is important to me as a Byobu user :-)  The placement of the PgUp, PgDn, Home, and End keys (as Fn + Up/Down/Left/Right) takes a while to get used to.


The speakers are decent for a laptop, and the microphone is excellent.  The webcam is placed in an odd location (lower left of the screen), but it has quite nice resolution and focus quality.


And Bluetooth and WiFi, well, they "just work".  I got 98.2 Mbits/sec of throughput over WiFi.

kirkland@xps:~$ iperf -c 10.0.0.45
------------------------------------------------------------
Client connecting to 10.0.0.45, TCP port 5001
TCP window size: 85.0 KByte (default)
------------------------------------------------------------
[ 3] local 10.0.0.149 port 40568 connected with 10.0.0.45 port 5001
[ ID] Interval Transfer Bandwidth
[ 3] 0.0-10.1 sec 118 MBytes 98.2 Mbits/sec

There's no external display port, so you'll need something like this USB-C-to-HDMI adapter to project to a TV or monitor.


There's 1x USB-C port, 2x USB-3 ports, and an SD-Card reader.


One of the USB-3 ports can be used to charge your phone or other devices, even while your laptop is suspended.  I use this all the time, to keep my phone topped up while I'm aboard planes, trains, and cars.  To do so, you'll need to enable "USB PowerShare" in the BIOS.  Here's an article from Dell's KnowledgeBase explaining how.


Honestly, I have only one complaint...  And that's that there is no Trackstick mouse (which is available on some Dell models).  I'm not a huge fan of the Touchpad.  It's too sensitive, and my palms are always touching it inadvertently.  So I need to use an external mouse to be effective.  I'll continue to provide this feedback to the Dell team, in the hopes that one day I'll have my perfect developer laptop!  Otherwise, this machine is a beauty.  I'm sure you'll love it too.

Cheers,
Dustin

PrimeZ270-p, Intel i7400 review and Debian – 1

Planet Debian - Mon, 22/01/2018 - 6:23am

This is going to be a biggish one as well.

This is a continuation of my last blog post.

Before diving into the installation, I had been reading Matthew Garrett's work for quite a while. Thankfully most of his blog posts get mirrored on planet.debian.org, hence it is easy to get some idea of what needs to be done, although I have told him (I think I even shared it here) that he should make his site more easily navigable. Trying to find posts on either 'GPT' or 'UEFI' and to have those posts sorted date-wise, ascending or descending, is not possible; at least, I couldn't find a way to do it.

The closest I could come to it is using '$keyword site:https://mjg59.dreamwidth.org/' in a search engine and going through the entries shared therein. This doesn't mean I don't value his contribution; it is in fact the opposite. AFAIK he was one of the first people who drew the community's attention when UEFI came in and only Microsoft Windows could be booted on such machines, nothing else.

I may be wrong, but AFAIK he was the first one to talk about having a shim, and he was part of getting people involved in the shim process.

While I'm sure Matthew's understanding has evolved significantly from what he shared before, there were two specific blog posts that I had to re-read before trying to install MS-Windows and then a Debian GNU/Linux system on the machine.

I went to a friend's house who had Windows 7 running on his machine, and there I used diskpart to do the change to GPT, following the Windows TechNet article. The session goes roughly like the sketch below.
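
For reference, and as an illustration only: the diskpart conversion described in that TechNet article looks roughly like the following. Note that clean destroys all data on the selected disk, and the disk number here is just an example, so double-check it with list disk first:

C:\> diskpart
DISKPART> list disk
DISKPART> select disk 1
DISKPART> clean
DISKPART> convert gpt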

I had to go the GPT way, as I understood that MS-Windows takes all four primary partitions of an MBR disk for itself, leaving nothing for any other operating system to use.

I did the conversion to GPT and set out to install MS-Windows 10, as my current motherboard (and all future motherboards from Intel Gen7/Gen8 onwards) does not support anything less than Windows 10. I did see an unofficial patch floating around on github somewhere, but have now lost the reference to it. I had read some of the bug reports of the repo, which seemed to suggest it was still a work in progress.

Now this is where it starts becoming a bit… let’s say interesting.

Now a friend/client of mine offered me a job to review MS-Windows 10, with his product keys of course. I was a bit hesitant, as it had been a long time since I had worked with MS-Windows and I didn't know if I could do it or not; I also suspected that I might like it too much. While I did review it, I found –

a. It is one heck of a piece of bloatware – I had thought MS-Windows would have learned by now, but no, they still have to learn that adware and bloatware aren't solutions. I still can't get my head wrapped around how a 4.1 GB MS-Windows ISO gets extracted to 20 GB and you still have to install shit-loads of third-party tools to actually get anything done. Just amazed (and not in a good way).

Just to share an example, I still had to get something like Revo Uninstaller, as MS-Windows even to date hasn't learned to uninstall programs cleanly and needs a tool like that to clean the registry and other places of the titbits left along the way.

Edit/Update – It still doesn't have the Fall Creators Update, which is supposed to be another 4 GB+ iso, and god only knows how much space that will take.

b. It's still not gold – With all the hoopla around MS-Windows 10 that I had been hearing, and the ads I'd been seeing, I was under the impression that MS-Windows 10 had gone gold, i.e. that it had had a stable release, the way Debian will have 'buster' sometime next year, probably around or after the 2019 DebConf. As I understand it, the 'gold' Windows 10 release from Microsoft is due around July 2018, so it's still a few months off.

c. I read an insightful article a few years ago by a junior Microsoft employee sharing/emphasizing why MS cannot do GNU/Linux-style volunteer/bazaar development. To put it in not so many words, it came down to the cultural differences in the way the two communities operate. While in GNU/Linux one more patch or one more pull request is encouraged, and it may be integrated in this point release or, if it can't be, in the next one (unless it changes something much more core/fundamental which needs more in-depth review), MS-Windows on the other hand actively discourages that sort of behavior, as it means more time for integration and testing; and from the sound of it, MS still doesn't do Continuous Integration (CI), regression testing etc. as has become more and more common in GNU/Linux projects.

I wish I could share the article, but I don't have the link anymore. @Lazyweb, if you would be so kind as to help find that article. The developer had shared some sort of ssh credentials or something to prove who he was, which he later removed, probably because the consequences to him of sharing that insight were not worth it, although the writing seemed valid.

There were many more quibbles, but I've shared the ones above. For example, copying files from the hdd to usb disks doesn't tell you how much time it will take, while in Debian I've come to see the time estimate for any operation as a given.

Before starting on the main issue, some info beforehand, although I don't know how relevant it might be –

Prime Z270-P uses EFI 2.60 by American Megatrends –

/home/shirish> sudo dmesg | grep -i efi
[sudo] password for shirish:
[ 0.000000] efi: EFI v2.60 by American Megatrends

I can share more info. if needed later.

Now, as I understood/interpreted the info found on the web and from experience, Microsoft makes quite a few more partitions than necessary to get MS-Windows installed.

This is how it stacks up/shows up –

> sudo fdisk -l
Disk /dev/sda: 3.7 TiB, 4000787030016 bytes, 7814037168 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: xxxxxxxxxxxxxxxxxxxxxxxxxxx

Device Start End Sectors Size Type
/dev/sda1 34 262177 262144 128M Microsoft reserved
/dev/sda2 264192 1185791 921600 450M Windows recovery environment
/dev/sda3 1185792 1390591 204800 100M EFI System
/dev/sda4 1390592 3718037503 3716646912 1.7T Microsoft basic data
/dev/sda5 3718037504 3718232063 194560 95M Linux filesystem
/dev/sda6 3718232064 5280731135 1562499072 745.1G Linux filesystem
/dev/sda7 5280731136 7761199103 2480467968 1.2T Linux filesystem
/dev/sda8 7761199104 7814035455 52836352 25.2G Linux swap

I had set aside 2 GB for /boot in the MS-Windows installer, as I had thought it would take only some space and leave the rest for Debian GNU/Linux's /boot to put its kernel entries, memory-checking tools and whatever else I wanted to have on /boot/debian, but for some reason I have not yet understood, that didn't work out as I expected.

Device Start End Sectors Size Type
/dev/sda1 34 262177 262144 128M Microsoft reserved
/dev/sda2 264192 1185791 921600 450M Windows recovery environment
/dev/sda3 1185792 1390591 204800 100M EFI System
/dev/sda4 1390592 3718037503 3716646912 1.7T Microsoft basic data

As seen above, the first four partitions are taken by MS-Windows itself. I just wish I had understood how to use GPT disklabels properly so I could figure things out better, but it seems (for reasons not fully understood) the EFI partition is a lowly 100 MB, which I suspect is where /boot ended up when I asked for it to be 2 GB. Is that UEFI's doing, Microsoft's, or some default? Dunno. Having a small EFI partition hampers the way I want to do things, as will become clear shortly.

After I installed MS-Windows, I installed Debian GNU/Linux using the net install method.

The following is what I had put down on a piece of paper as the partitions for GNU/Linux –

/boot – 512 MB (should be enough to accommodate a couple of kernel versions, memory-checking tools and any other tools I might need in the future).

/ – 700 GB – well, admittedly that looks a bit insane, but I do like to play with new programs/binaries as and when possible and don't want to run out of space when I forget to clean up.

[off-topic, wishlist] One tool I would like to have (and I dunno if it exists) is the ability to know when I installed a package, how many times I have used it, and how frequently, plus the ability to add small notes or a description to the package. Many a time I have seen that the package description is either too vague or doesn't focus on the practical usefulness of the package to me.

An easy example to share what I mean would be the apt package –

aptitude show apt
Package: apt
Version: 1.6~alpha6
Essential: yes
State: installed
Automatically installed: no
Priority: required
Section: admin
Maintainer: APT Development Team
Architecture: amd64
Uncompressed Size: 3,840 k
Depends: adduser, gpgv | gpgv2 | gpgv1, debian-archive-keyring, libapt-pkg5.0 (>= 1.6~alpha6), libc6 (>= 2.15), libgcc1 (>= 1:3.0), libgnutls30 (>= 3.5.6), libseccomp2 (>=1.0.1), libstdc++6 (>= 5.2)
Recommends: ca-certificates
Suggests: apt-doc, aptitude | synaptic | wajig, dpkg-dev (>= 1.17.2), gnupg | gnupg2 | gnupg1, powermgmt-base, python-apt
Breaks: apt-transport-https (< 1.5~alpha4~), apt-utils (< 1.3~exp2~), aptitude (< 0.8.10)
Replaces: apt-transport-https (< 1.5~alpha4~), apt-utils (< 1.3~exp2~)
Provides: apt-transport-https (= 1.6~alpha6)
Description: commandline package manager
This package provides commandline tools for searching and managing as well as querying information about packages as a low-level access to all features of the libapt-pkg library.

These include:
* apt-get for retrieval of packages and information about them from authenticated sources and for installation, upgrade and removal of packages together with their dependencies
* apt-cache for querying available information about installed as well as installable packages
* apt-cdrom to use removable media as a source for packages
* apt-config as an interface to the configuration settings
* apt-key as an interface to manage authentication keys

Now while I love all the various tools that the apt package has, I do have a special fondness for $ apt-cache rdepends $package,

as it gives another overview of a package, library, or shared library that I may be interested in, and shows which other packages are in its orbit. An example follows below.
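
For instance, querying the reverse dependencies of apt itself looks like the following; the output format is what apt-cache prints, but the package names shown are illustrative and abbreviated, and the actual list on any given system will differ:

$ apt-cache rdepends apt
apt
Reverse Depends:
  apt-transport-https
  apt-utils
  ...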

Over a period of time it becomes easier to forget packages that you don't use day-to-day, hence a tool like that, where you could put personal notes about packages, would be a godsend. Another use could be reminders of tickets posted upstream or something along those lines. I don't know of any tool/package which does something along those lines. [/off-topic, wishlist]

/home – 1.2 TB

swap – 25.2 GB

I admit I went a bit overboard on swap space, but as and when I get more memory I should at least have 1:1 swap, right? I am not sure whether the old rules still apply or not.

Then I used the Debian buster alpha 2 netinstall ISO

https://cdimage.debian.org/cdimage/buster_di_alpha2/amd64/iso-cd/debian-buster-DI-alpha2-amd64-netinst.iso and put it on a USB stick. I did use sha1sum to verify that the netinstall ISO matched the published checksums at https://cdimage.debian.org/cdimage/buster_di_alpha2/amd64/iso-cd/SHA1SUMS
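
With a reasonably recent GNU coreutils, and with both the ISO and the SHA1SUMS file in the current directory, that check can be done like this:

$ sha1sum -c --ignore-missing SHA1SUMS
debian-buster-DI-alpha2-amd64-netinst.iso: OK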

After that, simply doing a dd with the right if= and of= arguments was enough to copy the netinstall image to the USB stick.
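
Something along these lines, where /dev/sdX is a placeholder for whatever device node the USB stick actually gets (check with lsblk first, since dd will happily overwrite the wrong disk):

$ lsblk                                   # identify the USB stick's device node first
$ sudo dd if=debian-buster-DI-alpha2-amd64-netinst.iso of=/dev/sdX bs=4M status=progress
$ sync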

I did have some issues with the installation, which I'll share in the next post, but the most critical issue was /boot again: even though I made /boot a separate partition and gave it 1 GB during the partitioning step, I got only 100 MB and I have no idea why.

/dev/sda5 3718037504 3718232063 194560 95M Linux filesystem

> df -h /boot
Filesystem Size Used Avail Use% Mounted on
/dev/sda5 88M 68M 14M 84% /boot

home/shirish> ls -lh /boot
total 55M
-rw-r--r-- 1 root root 193K Dec 22 19:42 config-4.14.0-2-amd64
-rw-r--r-- 1 root root 193K Jan 15 01:15 config-4.14.0-3-amd64
drwx------ 3 root root 1.0K Jan 1 1970 efi
drwxr-xr-x 5 root root 1.0K Jan 20 10:40 grub
-rw-r--r-- 1 root root 19M Jan 17 10:40 initrd.img-4.14.0-2-amd64
-rw-r--r-- 1 root root 21M Jan 20 10:40 initrd.img-4.14.0-3-amd64
drwx------ 2 root root 12K Jan 1 17:49 lost+found
-rw-r--r-- 1 root root 2.9M Dec 22 19:42 System.map-4.14.0-2-amd64
-rw-r--r-- 1 root root 2.9M Jan 15 01:15 System.map-4.14.0-3-amd64
-rw-r--r-- 1 root root 4.4M Dec 22 19:42 vmlinuz-4.14.0-2-amd64
-rw-r--r-- 1 root root 4.7M Jan 15 01:15 vmlinuz-4.14.0-3-amd64

root@debian:/boot/efi/EFI# ls -lh
total 3.0K
drwx------ 2 root root 1.0K Dec 31 21:38 Boot
drwx------ 2 root root 1.0K Dec 31 19:23 debian
drwx------ 4 root root 1.0K Dec 31 21:32 Microsoft

I would be the first to say I don't really understand this EFI business.

The only thing I do understand is that it's good that, even without an OS, it becomes easier to see which components would or would not work if you change or add something. With the old BIOS, getting info on components was iffy at best.

There have been other issues with EFI which I may take up in another blog post, but for now I would be happy if somebody could share –

how to get a big /boot so Debian doesn't end up on such a small partition. I don't see any value in having a bigger /boot for MS-Windows unless there is also a way to get a grub2 pointer/header added to the MS-Windows bootloader. I will share the reasons for that in the next blog post.

I am open to reinstalling both MS-Windows and Debian from scratch, although that would happen when debian-buster-alpha3 arrives. Any answers to the above would give me something to try, and I will share whether I get the desired result.

Looking forward to answers.

shirishag75 https://flossexperiences.wordpress.com #planet-debian – Experiences in the community

French Gender-Neutral Translation for Roundcube

Planet Debian - Hën, 22/01/2018 - 6:00pd

Here's a quick blog post to tell the world I'm now doing a French gender-neutral translation for Roundcube.

A while ago, someone wrote on the Riseup translation list to complain about the current fr_FR translation. French is indeed a very gendered language and it is commonplace in radical spaces to use gender-neutral terminology.

So yeah, here it is: https://github.com/baldurmen/roundcube_fr_FEM

I haven't tested the UI integration yet, but I'll do that once the Riseup folks integrate it to their Roundcube instance.

Louis-Philippe Véronneau https://veronneau.org/ Louis-Philippe Véronneau

#15: Tidyverse and data.table, sitting side by side ... (Part 1)

Planet Debian - Dje, 21/01/2018 - 11:40md

Welcome to the fifteenth post in the rarely rational R rambling series, or R4 for short. There are two posts I have been meaning to get out for a bit, and hope to get to shortly---but in the meantime we are going to start something else.

Another longer-running idea I had was to present some simple application cases with (one or more) side-by-side code comparisons. Why? Well at times it feels like R, and the R community, are being split. You're either with one (increasingly "religious" in their defense of their deemed-superior approach) side, or the other. And that is of course utter nonsense. It's all R after all.

Programming, just like other fields using engineering methods and thinking, is about making choices, and trading off between certain aspects. A simple example is the fairly well-known trade-off between memory use and speed: think e.g. of a hash map allowing for faster lookup at the cost of some more memory. Generally speaking, solutions are rarely limited to just one way, or just one approach. So it pays off to know your tools, and to choose wisely among all available options. Having choices is having options, and those tend to have non-negative premiums to take advantage of. Locking yourself into one and only one paradigm can never be better.

In that spirit, I want to (eventually) show a few simple comparisons of code being done two distinct ways.

One obvious first candidate for this is the gunsales repository with some R code which backs an earlier NY Times article. I got involved for a similar reason, and updated the code from its initial form. Then again, this project also helped motivate what we did later with the x13binary package which permits automated installation of the X13-ARIMA-SEATS binary to support Christoph's excellent seasonal CRAN package (and website) for which we now have a forthcoming JSS paper. But the actual code example is not that interesting / a bit further off the mainstream because of the more specialised seasonal ARIMA modeling.

But then this week I found a much simpler and shorter example, and quickly converted its code. The code comes from the inaugural datascience 1 lesson at the Crosstab, a fabulous site by G. Elliot Morris (who may be the highest-energy undergrad I have come across lately) focussed on political polling, forecasts, and election outcomes. Lesson 1 is a simple introduction, and averages some polls of the 2016 US Presidential Election.

Complete Code using Approach "TV"

Elliot does a fine job walking the reader through his code so I will be brief and simply quote it in one piece:

## Getting the polls
library(readr)
polls_2016 <- read_tsv(url("http://elections.huffingtonpost.com/pollster/api/v2/questions/16-US-Pres-GE%20TrumpvClinton/poll-responses-clean.tsv"))

## Wrangling the polls
library(dplyr)
polls_2016 <- polls_2016 %>%
    filter(sample_subpopulation %in% c("Adults","Likely Voters","Registered Voters"))
library(lubridate)
polls_2016 <- polls_2016 %>%
    mutate(end_date = ymd(end_date))
polls_2016 <- polls_2016 %>%
    right_join(data.frame(end_date = seq.Date(min(polls_2016$end_date),
                                              max(polls_2016$end_date), by="days")))

## Average the polls
polls_2016 <- polls_2016 %>%
    group_by(end_date) %>%
    summarise(Clinton = mean(Clinton),
              Trump = mean(Trump))

library(zoo)
rolling_average <- polls_2016 %>%
    mutate(Clinton.Margin = Clinton-Trump,
           Clinton.Avg = rollapply(Clinton.Margin, width=14,
                                   FUN=function(x){mean(x, na.rm=TRUE)},
                                   by=1, partial=TRUE, fill=NA, align="right"))

library(ggplot2)
ggplot(rolling_average) +
    geom_line(aes(x=end_date, y=Clinton.Avg), col="blue") +
    geom_point(aes(x=end_date, y=Clinton.Margin))

It uses five packages to i) read some data off them interwebs, ii) filter / subset / modify it, iii) do a right (outer) join of it with itself, iv) average the per-day polls and then create rolling averages over 14 days, before v) plotting. Several standard verbs are used: filter(), mutate(), right_join(), group_by(), and summarise(). One non-tidyverse function is rollapply(), which comes from zoo, a popular package for time-series data.

Complete Code using Approach "DT"

As I will show below, we can do the same with fewer packages as data.table covers the reading, slicing/dicing and time conversion. We still need zoo for its rollapply() and of course the same plotting code:

## Getting the polls
library(data.table)
pollsDT <- fread("http://elections.huffingtonpost.com/pollster/api/v2/questions/16-US-Pres-GE%20TrumpvClinton/poll-responses-clean.tsv")

## Wrangling the polls
pollsDT <- pollsDT[sample_subpopulation %in% c("Adults","Likely Voters","Registered Voters"), ]
pollsDT[, end_date := as.IDate(end_date)]
pollsDT <- pollsDT[ data.table(end_date = seq(min(pollsDT[,end_date]),
                                              max(pollsDT[,end_date]), by="days")), on="end_date"]

## Average the polls
library(zoo)
pollsDT <- pollsDT[, .(Clinton=mean(Clinton), Trump=mean(Trump)), by=end_date]
pollsDT[, Clinton.Margin := Clinton-Trump]
pollsDT[, Clinton.Avg := rollapply(Clinton.Margin, width=14,
                                   FUN=function(x){mean(x, na.rm=TRUE)},
                                   by=1, partial=TRUE, fill=NA, align="right")]

library(ggplot2)
ggplot(pollsDT) +
    geom_line(aes(x=end_date, y=Clinton.Avg), col="blue") +
    geom_point(aes(x=end_date, y=Clinton.Margin))

This uses several of the components of data.table which are often called [i, j, by=...]. Rows are selected (i), columns are either modified (via := assignment) or summarised (via =), and grouping is undertaken via by=.... The outer join is done by having a data.table object indexed by another, and is pretty standard too. That allows us to do all transformations in three lines. We then create the per-day average by grouping by day, compute the margin and construct its rolling average as before. The resulting chart is, unsurprisingly, the same.
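
As a toy illustration of that [i, j, by=...] form (made-up data, not part of the original example):

library(data.table)
DT <- data.table(grp = c("a", "a", "b"), val = 1:3)
DT[val > 1]                               # i: select rows
DT[, .(mval = mean(val)), by = grp]       # j: summarise, grouped via by=
DT[, dbl := val * 2]                      # j: add a column by reference via :=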

Benchmark Reading

We can look at how the two approaches do on getting the data read into our session. For simplicity, we will read a local file to keep the (fixed) download aspect out of it:

R> url <- "http://elections.huffingtonpost.com/pollster/api/v2/questions/16-US-Pres-GE%20TrumpvClinton/poll-responses-clean.tsv"
R> file <- "/tmp/poll-responses-clean.tsv"
R> download.file(url, destfile=file, quiet=TRUE)
R> library(microbenchmark)
R> res <- microbenchmark(tidy=suppressMessages(readr::read_tsv(file)),
+                        dt=data.table::fread(file, showProgress=FALSE))
R> res
Unit: milliseconds
 expr     min      lq    mean  median      uq      max neval
 tidy 6.67777 6.83458 7.13434 6.98484 7.25831  9.27452   100
   dt 1.98890 2.04457 2.37916 2.08261 2.14040 28.86885   100
R>

That is a clear relative difference, though the absolute amount of time is not that relevant for such a small (demo) dataset.

Benchmark Processing

We can also look at the processing part:

R> rdin <- suppressMessages(readr::read_tsv(file))
R> dtin <- data.table::fread(file, showProgress=FALSE)
R>
R> library(dplyr)
R> library(lubridate)
R> library(zoo)
R> library(data.table)
R>
R> transformTV <- function(polls_2016=rdin) {
+     polls_2016 <- polls_2016 %>%
+         filter(sample_subpopulation %in% c("Adults","Likely Voters","Registered Voters"))
+     polls_2016 <- polls_2016 %>%
+         mutate(end_date = ymd(end_date))
+     polls_2016 <- polls_2016 %>%
+         right_join(data.frame(end_date = seq.Date(min(polls_2016$end_date),
+                                                   max(polls_2016$end_date), by="days")))
+     polls_2016 <- polls_2016 %>%
+         group_by(end_date) %>%
+         summarise(Clinton = mean(Clinton),
+                   Trump = mean(Trump))
+
+     rolling_average <- polls_2016 %>%
+         mutate(Clinton.Margin = Clinton-Trump,
+                Clinton.Avg = rollapply(Clinton.Margin, width=14,
+                                        FUN=function(x){mean(x, na.rm=TRUE)},
+                                        by=1, partial=TRUE, fill=NA, align="right"))
+ }
R>
R> transformDT <- function(dtin) {
+     pollsDT <- copy(dtin) ## extra work to protect from reference semantics for benchmark
+     pollsDT <- pollsDT[sample_subpopulation %in% c("Adults","Likely Voters","Registered Voters"), ]
+     pollsDT[, end_date := as.IDate(end_date)]
+     pollsDT <- pollsDT[ data.table(end_date = seq(min(pollsDT[,end_date]),
+                                                   max(pollsDT[,end_date]), by="days")), on="end_date"]
+     pollsDT <- pollsDT[, .(Clinton=mean(Clinton), Trump=mean(Trump)),
+                        by=end_date][, Clinton.Margin := Clinton-Trump]
+     pollsDT[, Clinton.Avg := rollapply(Clinton.Margin, width=14,
+                                        FUN=function(x){mean(x, na.rm=TRUE)},
+                                        by=1, partial=TRUE, fill=NA, align="right")]
+ }
R>
R> res <- microbenchmark(tidy=suppressMessages(transformTV(rdin)),
+                        dt=transformDT(dtin))
R> res
Unit: milliseconds
 expr      min       lq     mean   median       uq      max neval
 tidy 12.54723 13.18643 15.29676 13.73418 14.71008 104.5754   100
   dt  7.66842  8.02404  8.60915  8.29984  8.72071  17.7818   100
R>

Not quite a factor of two on the small data set, but again a clear advantage. data.table has a reputation for doing really well for large datasets; here we see that it is also faster for small datasets.

Side-by-side

Stripping out the reading as well as the plotting, both of which are about the same, we can compare the essential data operations side by side.

Summary

We found a simple task solved using code and packages from an increasingly popular sub-culture within R, and contrasted it with a second approach. We find the second approach to i) have fewer dependencies, ii) use less code, and iii) run faster.

Now, undoubtedly the former approach will have its staunch defenders (and that is all good and well; after all, choice is good and even thirty years later some still debate vi versus emacs endlessly) but I thought it instructive to at least be able to make an informed comparison.

Acknowledgements

My thanks to G. Elliot Morris for a fine example, and of course a fine blog and (if somewhat hyperactive) Twitter account.

This post by Dirk Eddelbuettel originated on his Thinking inside the box blog. Please report excessive re-aggregation in third-party for-profit settings.

Dirk Eddelbuettel http://dirk.eddelbuettel.com/blog Thinking inside the box

Sebastian Dröge: Speeding up RGB to grayscale conversion in Rust by a factor of 2.2 – and various other multimedia related processing loops

Planet GNOME - Dje, 21/01/2018 - 2:54md

In the previous blog post I wrote about how to write an RGB to grayscale conversion filter for GStreamer in Rust. In this blog post I’m going to write about how to optimize the processing loop of that filter without resorting to unsafe code or SIMD instructions, staying instead with plain, safe Rust code.

I also tried to implement the processing loop with faster, a Rust crate for writing safe SIMD code. It looks very promising, but unless I missed something in the documentation it currently is missing some features to be able to express this specific algorithm in a meaningful way. Once it works on stable Rust (waiting for SIMD to be stabilized) and includes runtime CPU feature detection, this could very well be a good replacement for the ORC library used for the same purpose in GStreamer in various places. ORC works by JIT-compiling a minimal “array operation language” to SIMD assembly for your specific CPU (and has support for x86 MMX/SSE, PPC Altivec, ARM NEON, etc.).

If someone wants to prove me wrong and implement this with faster, feel free to do so and I’ll link to your solution and include it in the benchmark results below.

All code below can be found in this GIT repository.

Table of Contents
  1. Baseline Implementation
  2. First Optimization – Assertions
  3. First Optimization – Assertions Try 2
  4. Second Optimization – Iterate a bit more
  5. Third Optimization – Getting rid of the bounds check finally
  6. Summary
  7. Addendum: slice::split_at
Baseline Implementation

This is what the baseline implementation looks like.

pub fn bgrx_to_gray_chunks_no_asserts(
    in_data: &[u8],
    out_data: &mut [u8],
    in_stride: usize,
    out_stride: usize,
    width: usize,
) {
    let in_line_bytes = width * 4;
    let out_line_bytes = width * 4;

    for (in_line, out_line) in in_data
        .chunks(in_stride)
        .zip(out_data.chunks_mut(out_stride))
    {
        for (in_p, out_p) in in_line[..in_line_bytes]
            .chunks(4)
            .zip(out_line[..out_line_bytes].chunks_mut(4))
        {
            let b = u32::from(in_p[0]);
            let g = u32::from(in_p[1]);
            let r = u32::from(in_p[2]);
            let x = u32::from(in_p[3]);

            let grey = ((r * RGB_Y[0]) + (g * RGB_Y[1]) + (b * RGB_Y[2])
                + (x * RGB_Y[3])) / 65536;
            let grey = grey as u8;
            out_p[0] = grey;
            out_p[1] = grey;
            out_p[2] = grey;
            out_p[3] = grey;
        }
    }
}

This basically iterates over each line of the input and output frame (outer loop), and then for each BGRx chunk of 4 bytes in each line it converts the values to u32, multiplies with a constant array, converts back to u8 and stores the same value in the whole output BGRx chunk.

Note: This is only doing the actual conversion from linear RGB to grayscale (and in BT.601 colorspace). To do this conversion correctly you need to know your colorspaces and use the correct coefficients for conversion, and also do gamma correction. See this about why it is important.
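The RGB_Y constant itself is not shown in this excerpt. Judging from the multipliers that appear in the assembly later (19595, 38470 and 7471 are the BT.601 luma weights 0.299, 0.587 and 0.114 scaled by 65536, and the x coefficient is 0), it presumably looks something like this:

// Presumed definition (not shown in the post): BT.601 luma weights for
// R, G, B scaled by 65536, with 0 for the unused x byte.
const RGB_Y: [u32; 4] = [19595, 38470, 7471, 0];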

So what can be improved on this? For starters, let’s write a small benchmark for this so that we know whether any of our changes actually improve something. This is using the (unfortunately still) unstable benchmark feature of Cargo.

#![feature(test)]
#![feature(exact_chunks)]

extern crate test;

pub fn bgrx_to_gray_chunks_no_asserts(...) {
    [...]
}

#[cfg(test)]
mod tests {
    use super::*;
    use test::Bencher;
    use std::iter;

    fn create_vec(w: usize, h: usize) -> Vec<u8> {
        iter::repeat(0).take(w * h * 4).collect::<_>()
    }

    #[bench]
    fn bench_chunks_1920x1080_no_asserts(b: &mut Bencher) {
        let i = test::black_box(create_vec(1920, 1080));
        let mut o = test::black_box(create_vec(1920, 1080));

        b.iter(|| bgrx_to_gray_chunks_no_asserts(&i, &mut o, 1920 * 4, 1920 * 4, 1920));
    }
}

This can be run with cargo bench and then prints the number of nanoseconds each iteration of the closure was taking. To really measure only the processing itself, allocations and initializations of the input/output frames happen outside of the closure; we’re not interested in times for that.

First Optimization – Assertions

To actually start optimizing this function, let’s take a look at the assembly that the compiler is outputting. The easiest way of doing that is via the Godbolt Compiler Explorer website. Select “rustc nightly” and use “-C opt-level=3” for the compiler flags, and then copy & paste your code in there. Once it compiles, to find the assembly that corresponds to a line, simply right-click on the line and “Scroll to assembly”.

Alternatively you can use cargo rustc --release -- -C opt-level=3 --emit asm and check the assembly file that is output in the target/release/deps directory.

What we see then for our inner loop is something like the following

.LBB4_19:
    cmp r15, r11
    mov r13, r11
    cmova r13, r15
    mov rdx, r8
    sub rdx, r13
    je .LBB4_34
    cmp rdx, 3
    jb .LBB4_35
    inc r9
    movzx edx, byte ptr [rbx - 1]
    movzx ecx, byte ptr [rbx - 2]
    movzx esi, byte ptr [rbx]
    imul esi, esi, 19595
    imul edx, edx, 38470
    imul ecx, ecx, 7471
    add ecx, edx
    add ecx, esi
    shr ecx, 16
    mov byte ptr [r10 - 3], cl
    mov byte ptr [r10 - 2], cl
    mov byte ptr [r10 - 1], cl
    mov byte ptr [r10], cl
    add r10, 4
    add r8, -4
    add r15, -4
    add rbx, 4
    cmp r9, r14
    jb .LBB4_19

This is already quite optimized. For each loop iteration the first few instructions are doing some bounds checking and if they fail jump to the .LBB4_34 or .LBB4_35 labels. How to understand that this is bounds checking? Scroll down in the assembly to where these labels are defined and you’ll see something like the following

.LBB4_34:
    lea rdi, [rip + .Lpanic_bounds_check_loc.D]
    xor esi, esi
    xor edx, edx
    call core::panicking::panic_bounds_check@PLT
    ud2
.LBB4_35:
    cmp r15, r11
    cmova r11, r15
    sub r8, r11
    lea rdi, [rip + .Lpanic_bounds_check_loc.F]
    mov esi, 2
    mov rdx, r8
    call core::panicking::panic_bounds_check@PLT
    ud2

Also if you check (with the colors, or the “scroll to source” feature) which Rust code these correspond to, you’ll see that it’s the first and third access to the 4-byte slice that contains our BGRx values.

Afterwards in the assembly, the following steps are happening: 0) incrementing of the “loop counter” representing the number of iterations we’re going to do (r9), 1) actual reading of the B, G and R value and conversion to u32 (the 3 movzx, note that the reading of the x value is optimized away as the compiler sees that it is always multiplied by 0 later), 2) the multiplications with the array elements (the 3 imul), 3) combining of the results and division (i.e. shift) (the 2 add and the shr), 4) storing of the result in the output (the 4 mov). Afterwards the slice pointers are increased by 4 (rbx and r10) and the lengths (used for bounds checking) are decreased by 4 (r8 and r15). Finally there’s a check (cmp) to see if r9 (our loop counter) is at the end of the slice, and if not we jump back to the beginning and operate on the next BGRx chunk.

Generally what we want to do for optimizations is to get rid of unnecessary checks (bounds checking), memory accesses, conditions (cmp, cmov) and jumps (the instructions starting with j). These are all things that are slowing down our code.

So the first thing that seems useful to optimize here is the bounds checking at the beginning. It definitely seems not useful to do two checks instead of one for the two slices (the checks are for both slices at once, but Godbolt does not detect that and believes it’s only the input slice). And ideally we could teach the compiler that no bounds checking is needed at all.

As I wrote in the previous blog post, often this knowledge can be given to the compiler by inserting assertions.

To go from two checks down to a single check, you can insert an assert_eq!(in_p.len(), 4) at the beginning of the inner loop, and the same for the output slice. Now we only have a single bounds check left per iteration.
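
A sketch of the inner loop with those assertions in place (the conversion body itself is unchanged):

for (in_p, out_p) in in_line[..in_line_bytes]
    .chunks(4)
    .zip(out_line[..out_line_bytes].chunks_mut(4))
{
    // Tell the optimizer that both slices are exactly 4 bytes long
    assert_eq!(in_p.len(), 4);
    assert_eq!(out_p.len(), 4);

    // ... same conversion as before ...
}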

As a next step we might want to try to move this knowledge outside the inner loop so that there is no bounds checking at all in there anymore. We might want to add assertions like the following outside the outer loop then to give all knowledge we have to the compiler

assert_eq!(in_data.len() % 4, 0);
assert_eq!(out_data.len() % 4, 0);
assert_eq!(out_data.len() / out_stride, in_data.len() / in_stride);

assert!(in_line_bytes <= in_stride);
assert!(out_line_bytes <= out_stride);

Unfortunately adding those has no effect at all on the inner loop, but having them outside the outer loop for good measure is not the worst idea so let’s just keep them. At least it can be used as some kind of documentation of the invariants of this code for future readers.

So let’s benchmark these two implementations now. The results on my machine are the following

test tests::bench_chunks_1920x1080_no_asserts ... bench:   4,420,145 ns/iter (+/- 139,051)
test tests::bench_chunks_1920x1080_asserts    ... bench:   4,897,046 ns/iter (+/- 166,555)

This is surprising: our version without the assertions is actually faster by a factor of ~1.1, even though the version with assertions has fewer conditions. So let’s take a closer look at the assembly at the top of the loop again, where the bounds checking happens, in the version with assertions

.LBB4_19:
    cmp rbx, r11
    mov r9, r11
    cmova r9, rbx
    mov r14, r12
    sub r14, r9
    lea rax, [r14 - 1]
    mov qword ptr [rbp - 120], rax
    mov qword ptr [rbp - 128], r13
    mov qword ptr [rbp - 136], r10
    cmp r14, 5
    jne .LBB4_33
    inc rcx
    [...]

While this indeed has only one jump as expected for the bounds checking, the number of comparisons is the same and even worse: 3 memory writes to the stack are happening right before the jump. If we follow to the .LBB4_33 label we will see that the assert_eq! macro is going to do something with core::fmt::Debug. This is setting up the information needed for printing the assertion failure, the “expected X equals to Y” output. This is certainly not good and the reason why everything is slower now.

First Optimization – Assertions Try 2

All the additional instructions and memory writes were happening because the assert_eq! macro is outputting something user friendly that actually contains the values of both sides. Let’s try again with the assert! macro instead
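
That is, the two length checks become the boolean form, which panics without having to format the two values:

// Same invariant, but without the Debug formatting machinery of assert_eq!
assert!(in_p.len() == 4);
assert!(out_p.len() == 4);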

test tests::bench_chunks_1920x1080_no_asserts ... bench:   4,420,145 ns/iter (+/- 139,051)
test tests::bench_chunks_1920x1080_asserts    ... bench:   4,897,046 ns/iter (+/- 166,555)
test tests::bench_chunks_1920x1080_asserts_2  ... bench:   3,968,976 ns/iter (+/- 97,084)

This already looks more promising. Compared to our baseline version this gives us a speedup of a factor of 1.12, and compared to the version with assert_eq! 1.23. If we look at the assembly for the bounds checks (everything else stays the same), it also looks more like what we would’ve expected

.LBB4_19:
    cmp rbx, r12
    mov r13, r12
    cmova r13, rbx
    add r13, r14
    jne .LBB4_33
    inc r9
    [...]

One cmp less, only one jump left. And no memory writes anymore!

So keep in mind that assert_eq! is more user-friendly but quite a bit more expensive even in the “good case” compared to assert!.

Second Optimization – Iterate a bit more

This is still not very satisfying though. No bounds checking should be needed at all as each chunk is going to be exactly 4 bytes. We’re just not able to convince the compiler that this is the case. While it may be possible (let me know if you find a way!), let’s try something different. The zip iterator is done when the shorter of the two iterators is done, and there are optimizations implemented specifically for zipped slice iterators. Let’s try that and replace the grayscale value calculation with

let grey = in_p.iter()
    .zip(RGB_Y.iter())
    .map(|(i, c)| u32::from(*i) * c)
    .sum::<u32>() / 65536;

If we run that through our benchmark after removing the assert!(in_p.len() == 4) (and the same for the output slice), these are the results

test tests::bench_chunks_1920x1080_asserts_2 ... bench:   3,968,976 ns/iter (+/- 97,084)
test tests::bench_chunks_1920x1080_iter_sum  ... bench:  11,393,600 ns/iter (+/- 347,958)

We’re actually 2.9 times slower! Even when adding back the assert!(in_p.len() == 4) assertion (and the same for the output slice) we’re still slower

test tests::bench_chunks_1920x1080_asserts_2 ... bench:   3,968,976 ns/iter (+/- 97,084)
test tests::bench_chunks_1920x1080_iter_sum  ... bench:  11,393,600 ns/iter (+/- 347,958)
test tests::bench_chunks_1920x1080_iter_sum_2 ... bench:  10,420,442 ns/iter (+/- 242,379)

If we look at the assembly of the assertion-less variant, it’s a complete mess now

.LBB0_19:
    cmp rbx, r13
    mov rcx, r13
    cmova rcx, rbx
    mov rdx, r8
    sub rdx, rcx
    cmp rdx, 4
    mov r11d, 4
    cmovb r11, rdx
    test r11, r11
    je .LBB0_20
    movzx ecx, byte ptr [r15 - 2]
    imul ecx, ecx, 19595
    cmp r11, 1
    jbe .LBB0_22
    movzx esi, byte ptr [r15 - 1]
    imul esi, esi, 38470
    add esi, ecx
    movzx ecx, byte ptr [r15]
    imul ecx, ecx, 7471
    add ecx, esi
    test rdx, rdx
    jne .LBB0_23
    jmp .LBB0_35
.LBB0_20:
    xor ecx, ecx
.LBB0_22:
    test rdx, rdx
    je .LBB0_35
.LBB0_23:
    shr ecx, 16
    mov byte ptr [r10 - 3], cl
    mov byte ptr [r10 - 2], cl
    cmp rdx, 3
    jb .LBB0_36
    inc r9
    mov byte ptr [r10 - 1], cl
    mov byte ptr [r10], cl
    add r10, 4
    add r8, -4
    add rbx, -4
    add r15, 4
    cmp r9, r14
    jb .LBB0_19

In short, there are now various new conditions and jumps for short-circuiting the zip iterator in the various cases. And because of all the noise added, the compiler was not even able to optimize the bounds check for the output slice away anymore (.LBB0_35 cases). While it was able to unroll the iterator (note that the 3 imul multiplications are not interleaved with jumps and are actually 3 multiplications instead of yet another loop), which is quite impressive, it couldn’t do anything meaningful with that information it somehow got (it must’ve understood that each chunk has 4 bytes!). This looks like something going wrong somewhere in the optimizer to me.

If we take a look at the variant with the assertions, things look much better

.LBB3_19:
    cmp r11, r12
    mov r13, r12
    cmova r13, r11
    add r13, r14
    jne .LBB3_33
    inc r9
    movzx ecx, byte ptr [rdx - 2]
    imul r13d, ecx, 19595
    movzx ecx, byte ptr [rdx - 1]
    imul ecx, ecx, 38470
    add ecx, r13d
    movzx ebx, byte ptr [rdx]
    imul ebx, ebx, 7471
    add ebx, ecx
    shr ebx, 16
    mov byte ptr [r10 - 3], bl
    mov byte ptr [r10 - 2], bl
    mov byte ptr [r10 - 1], bl
    mov byte ptr [r10], bl
    add r10, 4
    add r11, -4
    add r14, 4
    add rdx, 4
    cmp r9, r15
    jb .LBB3_19

This is literally the same as the assertion version we had before, except that the reading of the input slice, the multiplications and the additions are happening in iterator order instead of being batched all together. It’s quite impressive that the compiler was able to completely optimize away the zip iterator here, but unfortunately it’s still many times slower than the original version. The reason must be the instruction-reordering. The previous version had all memory reads batched and then the operations batched, which is apparently much better for the internal pipelining of the CPU (it is going to perform the next instructions without dependencies on the previous ones already while waiting for the pending instructions to finish).

It’s also not clear to me why the LLVM optimizer is not able to schedule the instructions the same way here. It apparently has all information it needs for that if no iterator is involved, and both versions are leading to exactly the same assembly except for the order of instructions. This also seems like something fishy.

Nonetheless, we still have our manual bounds check (the assertion) left here and we should really try to get rid of that. No progress so far.

Third Optimization – Getting rid of the bounds check finally

Let’s tackle this from a different angle now. Our problem is apparently that the compiler is not able to understand that each chunk is exactly 4 bytes.

So why don’t we write a new chunks iterator that always returns exactly the requested number of items, instead of potentially fewer for the very last iteration? And instead of panicking if there are leftover elements, it seems useful to just ignore them. That way we have an API that is functionally different from the existing chunks iterator and provides behaviour that is useful in various cases. It’s basically the slice equivalent of the exact_chunks iterator of the ndarray crate.

By having it functionally different from the existing one, and not just an optimization, I also submitted it for inclusion in Rust’s standard library, and it’s nowadays available as an unstable feature in nightly, like all newly added API. Nonetheless, the same can also be implemented inside your own code with basically the same effect; there are no dependencies on standard library internals.
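
For reference, here is a minimal sketch of how such an iterator could be implemented in your own code. This is illustrative only and not the actual standard library implementation (which also provides a mutable variant and more trait implementations), but it shows the core idea of repeatedly splitting off fixed-size chunks and ignoring any leftover tail:

struct ExactChunks<'a, T: 'a> {
    rem: &'a [T],
    size: usize,
}

impl<'a, T> Iterator for ExactChunks<'a, T> {
    type Item = &'a [T];

    fn next(&mut self) -> Option<&'a [T]> {
        if self.rem.len() < self.size {
            // Leftover elements that do not fill a whole chunk are ignored
            None
        } else {
            let (head, tail) = self.rem.split_at(self.size);
            self.rem = tail;
            Some(head)
        }
    }
}

fn exact_chunks<'a, T>(slice: &'a [T], size: usize) -> ExactChunks<'a, T> {
    assert!(size != 0);
    ExactChunks { rem: slice, size }
}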

So, let’s use our new exact_chunks iterator that is guaranteed (by API) to always give us exactly 4 bytes. In our case this is exactly equivalent to the normal chunks as by construction our slices always have a length that is a multiple of 4, but the compiler can’t infer that information. The resulting code looks as follows

pub fn bgrx_to_gray_exact_chunks(
    in_data: &[u8],
    out_data: &mut [u8],
    in_stride: usize,
    out_stride: usize,
    width: usize,
) {
    assert_eq!(in_data.len() % 4, 0);
    assert_eq!(out_data.len() % 4, 0);
    assert_eq!(out_data.len() / out_stride, in_data.len() / in_stride);

    let in_line_bytes = width * 4;
    let out_line_bytes = width * 4;

    assert!(in_line_bytes <= in_stride);
    assert!(out_line_bytes <= out_stride);

    for (in_line, out_line) in in_data
        .exact_chunks(in_stride)
        .zip(out_data.exact_chunks_mut(out_stride))
    {
        for (in_p, out_p) in in_line[..in_line_bytes]
            .exact_chunks(4)
            .zip(out_line[..out_line_bytes].exact_chunks_mut(4))
        {
            assert!(in_p.len() == 4);
            assert!(out_p.len() == 4);

            let b = u32::from(in_p[0]);
            let g = u32::from(in_p[1]);
            let r = u32::from(in_p[2]);
            let x = u32::from(in_p[3]);

            let grey = ((r * RGB_Y[0]) + (g * RGB_Y[1]) + (b * RGB_Y[2])
                + (x * RGB_Y[3])) / 65536;
            let grey = grey as u8;
            out_p[0] = grey;
            out_p[1] = grey;
            out_p[2] = grey;
            out_p[3] = grey;
        }
    }
}

It’s exactly the same as the previous version with assertions, except for using exact_chunks instead of chunks, and the same for the mutable iterator. The resulting benchmark of all our variants now looks as follows

test tests::bench_chunks_1920x1080_no_asserts ... bench:   4,420,145 ns/iter (+/- 139,051)
test tests::bench_chunks_1920x1080_asserts    ... bench:   4,897,046 ns/iter (+/- 166,555)
test tests::bench_chunks_1920x1080_asserts_2  ... bench:   3,968,976 ns/iter (+/- 97,084)
test tests::bench_chunks_1920x1080_iter_sum   ... bench:  11,393,600 ns/iter (+/- 347,958)
test tests::bench_chunks_1920x1080_iter_sum_2 ... bench:  10,420,442 ns/iter (+/- 242,379)
test tests::bench_exact_chunks_1920x1080      ... bench:   2,007,459 ns/iter (+/- 112,287)

Compared to our initial version this is a speedup of a factor of 2.2, compared to our version with assertions a factor of 1.98. This seems like a worthwhile improvement, and if we look at the resulting assembly there are no bounds checks at all anymore

.LBB0_10:
    movzx edx, byte ptr [rsi - 2]
    movzx r15d, byte ptr [rsi - 1]
    movzx r12d, byte ptr [rsi]
    imul r13d, edx, 19595
    imul edx, r15d, 38470
    add edx, r13d
    imul ebx, r12d, 7471
    add ebx, edx
    shr ebx, 16
    mov byte ptr [rcx - 3], bl
    mov byte ptr [rcx - 2], bl
    mov byte ptr [rcx - 1], bl
    mov byte ptr [rcx], bl
    add rcx, 4
    add rsi, 4
    dec r10
    jne .LBB0_10

Also due to this the compiler is able to apply some more optimizations and we only have one loop counter for the number of iterations r10 and the two pointers rcx and rsi that are increased/decreased in each iteration. There is no tracking of the remaining slice lengths anymore, as in the assembly of the original version (and the versions with assertions).

Summary

So overall we got a speedup of a factor of 2.2 while still writing very high-level Rust code with iterators and not falling back to unsafe code or using SIMD. The optimizations the Rust compiler is applying are quite impressive and the Rust marketing line of zero-cost abstractions is really visible in reality here.

The same approach should also work for many similar algorithms, and thus many similar multimedia related algorithms where you iterate over slices and operate on fixed-size chunks.

Also the above shows that as a first step it’s better to write clean and understandable high-level Rust code without worrying too much about performance (assume the compiler can optimize well), and only afterwards take a look at the generated assembly and check which instructions should really go away (like bounds checking). In many cases this can be achieved by adding assertions in strategic places, or like in this case by switching to a slightly different abstraction that is closer to the actual requirements (however I believe the compiler should be able to produce the same code with the help of assertions with the normal chunks iterator, but making that possible requires improvements to the LLVM optimizer probably).

And if all does not help, there’s still the escape hatch of unsafe (for using functions like slice::get_unchecked() or going down to raw pointers) and the possibility of using SIMD instructions (by using faster or stdsimd directly). But in the end this should be a last resort for those little parts of your code where optimizations are needed and the compiler can’t be easily convinced to do it for you.

Addendum: slice::split_at

User newpavlov suggested on Reddit to use repeated slice::split_at in a while loop for similar performance.

This would for example look like

pub fn bgrx_to_gray_split_at(
    in_data: &[u8],
    out_data: &mut [u8],
    in_stride: usize,
    out_stride: usize,
    width: usize,
) {
    assert_eq!(in_data.len() % 4, 0);
    assert_eq!(out_data.len() % 4, 0);
    assert_eq!(out_data.len() / out_stride, in_data.len() / in_stride);

    let in_line_bytes = width * 4;
    let out_line_bytes = width * 4;

    assert!(in_line_bytes <= in_stride);
    assert!(out_line_bytes <= out_stride);

    for (in_line, out_line) in in_data
        .exact_chunks(in_stride)
        .zip(out_data.exact_chunks_mut(out_stride))
    {
        let mut in_pp: &[u8] = in_line[..in_line_bytes].as_ref();
        let mut out_pp: &mut [u8] = out_line[..out_line_bytes].as_mut();
        assert!(in_pp.len() == out_pp.len());

        while in_pp.len() >= 4 {
            let (in_p, in_tmp) = in_pp.split_at(4);
            let (out_p, out_tmp) = { out_pp }.split_at_mut(4);
            in_pp = in_tmp;
            out_pp = out_tmp;

            let b = u32::from(in_p[0]);
            let g = u32::from(in_p[1]);
            let r = u32::from(in_p[2]);
            let x = u32::from(in_p[3]);

            let grey = ((r * RGB_Y[0]) + (g * RGB_Y[1]) + (b * RGB_Y[2])
                + (x * RGB_Y[3])) / 65536;
            let grey = grey as u8;
            out_p[0] = grey;
            out_p[1] = grey;
            out_p[2] = grey;
            out_p[3] = grey;
        }
    }
}

Performance-wise this brings us very close to the exact_chunks version

test tests::bench_exact_chunks_1920x1080 ... bench:   1,965,631 ns/iter (+/- 58,832)
test tests::bench_split_at_1920x1080     ... bench:   2,046,834 ns/iter (+/- 35,990)

and the assembly is also very similar

.LBB0_10:
    add rbx, -4
    movzx r15d, byte ptr [rsi]
    movzx r12d, byte ptr [rsi + 1]
    movzx edx, byte ptr [rsi + 2]
    imul r13d, edx, 19595
    imul r12d, r12d, 38470
    imul edx, r15d, 7471
    add edx, r12d
    add edx, r13d
    shr edx, 16
    movzx edx, dl
    imul edx, edx, 16843009
    mov dword ptr [rcx], edx
    lea rcx, [rcx + 4]
    add rsi, 4
    cmp rbx, 3
    ja .LBB0_10

Here the compiler even optimizes the storing of the value into a single write operation of 4 bytes, at the cost of an additional multiplication and zero-extend register move (16843009 is 0x01010101, so multiplying the single grayscale byte by it replicates that byte into all four bytes of the 32-bit word that gets stored).

Overall this code performs very well too, but in my opinion it looks rather ugly compared to the versions using the different chunks iterators. Also this is basically what the exact_chunks iterator does internally: repeatedly calling slice::split_at. In theory both versions could lead to the very same assembly, but the LLVM optimizer currently handles the two slightly differently.

Sebastian Dröge: Speeding up RGB to grayscale conversion in Rust by a factor of 2.2 – and various other multimedia related processing loops

Planet Ubuntu - Dje, 21/01/2018 - 2:48md

In the previous blog post I wrote about how to write a RGB to grayscale conversion filter for GStreamer in Rust. In this blog post I’m going to write about how to optimize the processing loop of that filter, without resorting to unsafe code or SIMD instructions by staying with plain, safe Rust code.

I also tried to implement the processing loop with faster, a Rust crate for writing safe SIMD code. It looks very promising, but unless I missed something in the documentation it currently is missing some features to be able to express this specific algorithm in a meaningful way. Once it works on stable Rust (waiting for SIMD to be stabilized) and includes runtime CPU feature detection, this could very well be a good replacement for the ORC library used for the same purpose in GStreamer in various places. ORC works by JIT-compiling a minimal “array operation language” to SIMD assembly for your specific CPU (and has support for x86 MMX/SSE, PPC Altivec, ARM NEON, etc.).

If someone wants to prove me wrong and implement this with faster, feel free to do so and I’ll link to your solution and include it in the benchmark results below.

All code below can be found in this GIT repository.

Table of Contents
  1. Baseline Implementation
  2. First Optimization – Assertions
  3. First Optimization – Assertions Try 2
  4. Second Optimization – Iterate a bit more
  5. Third Optimization – Getting rid of the bounds check finally
  6. Summary
  7. Addendum: slice::split_at
Baseline Implementation

This is how the baseline implementation looks like.

pub fn bgrx_to_gray_chunks_no_asserts( in_data: &[u8], out_data: &mut [u8], in_stride: usize, out_stride: usize, width: usize, ) { let in_line_bytes = width * 4; let out_line_bytes = width * 4; for (in_line, out_line) in in_data .chunks(in_stride) .zip(out_data.chunks_mut(out_stride)) { for (in_p, out_p) in in_line[..in_line_bytes] .chunks(4) .zip(out_line[..out_line_bytes].chunks_mut(4)) { let b = u32::from(in_p[0]); let g = u32::from(in_p[1]); let r = u32::from(in_p[2]); let x = u32::from(in_p[3]); let grey = ((r * RGB_Y[0]) + (g * RGB_Y[1]) + (b * RGB_Y[2]) + (x * RGB_Y[3])) / 65536; let grey = grey as u8; out_p[0] = grey; out_p[1] = grey; out_p[2] = grey; out_p[3] = grey; } } }

This basically iterates over each line of the input and output frame (outer loop), and then for each BGRx chunk of 4 bytes in each line it converts the values to u32, multiplies with a constant array, converts back to u8 and stores the same value in the whole output BGRx chunk.

Note: This is only doing the actual conversion from linear RGB to grayscale (and in BT.601 colorspace). To do this conversion correctly you need to know your colorspaces and use the correct coefficients for conversion, and also do gamma correction. See this about why it is important.

So what can be improved on this? For starters, let’s write a small benchmark for this so that we know whether any of our changes actually improve something. This is using the (unfortunately still) unstable benchmark feature of Cargo.

#![feature(test)] #![feature(exact_chunks)] extern crate test; pub fn bgrx_to_gray_chunks_no_asserts(...) [...] } #[cfg(test)] mod tests { use super::*; use test::Bencher; use std::iter; fn create_vec(w: usize, h: usize) -> Vec<u8> { iter::repeat(0).take(w * h * 4).collect::<_>() } #[bench] fn bench_chunks_1920x1080_no_asserts(b: &mut Bencher) { let i = test::black_box(create_vec(1920, 1080)); let mut o = test::black_box(create_vec(1920, 1080)); b.iter(|| bgrx_to_gray_chunks_no_asserts(&i, &mut o, 1920 * 4, 1920 * 4, 1920)); } }

This can be run with cargo bench and then prints the amount of nanoseconds each iterator of the closure was taking. To only really measure the processing itself, allocations and initializations of the input/output frame are happening outside of the closure. We’re not interested in times for that.

First Optimization – Assertions

To actually start optimizing this function, let’s take a look at the assembly that the compiler is outputting. The easiest way of doing that is via the Godbolt Compiler Explorer website. Select “rustc nightly” and use “-C opt-level=3” for the compiler flags, and then copy & paste your code in there. Once it compiles, to find the assembly that corresponds to a line, simply right-click on the line and “Scroll to assembly”.

Alternatively you can use cargo rustc –release — -C opt-level=3 –emit asm and check the assembly file that is output in the target/release/deps directory.

What we see then for our inner loop is something like the following

.LBB4_19: cmp r15, r11 mov r13, r11 cmova r13, r15 mov rdx, r8 sub rdx, r13 je .LBB4_34 cmp rdx, 3 jb .LBB4_35 inc r9 movzx edx, byte ptr [rbx - 1] movzx ecx, byte ptr [rbx - 2] movzx esi, byte ptr [rbx] imul esi, esi, 19595 imul edx, edx, 38470 imul ecx, ecx, 7471 add ecx, edx add ecx, esi shr ecx, 16 mov byte ptr [r10 - 3], cl mov byte ptr [r10 - 2], cl mov byte ptr [r10 - 1], cl mov byte ptr [r10], cl add r10, 4 add r8, -4 add r15, -4 add rbx, 4 cmp r9, r14 jb .LBB4_19

This is already quite optimized. For each loop iteration the first few instructions are doing some bounds checking and if they fail jump to the .LBB4_34 or .LBB4_35 labels. How to understand that this is bounds checking? Scroll down in the assembly to where these labels are defined and you’ll see something like the following

.LBB4_34: lea rdi, [rip + .Lpanic_bounds_check_loc.D] xor esi, esi xor edx, edx call core::panicking::panic_bounds_check@PLT ud2 .LBB4_35: cmp r15, r11 cmova r11, r15 sub r8, r11 lea rdi, [rip + .Lpanic_bounds_check_loc.F] mov esi, 2 mov rdx, r8 call core::panicking::panic_bounds_check@PLT ud2

Also if you check (with the colors, or the “scroll to source” feature) which Rust code these correspond to, you’ll see that it’s the first and third access to the 4-byte slice that contains our BGRx values.

Afterwards in the assembly, the following steps are happening: 0) incrementing of the “loop counter” representing the number of iterations we’re going to do (r9), 1) actual reading of the B, G and R value and conversion to u32 (the 3 movzx, note that the reading of the x value is optimized away as the compiler sees that it is always multiplied by 0 later), 2) the multiplications with the array elements (the 3 imul), 3) combining of the results and division (i.e. shift) (the 2 add and the shr), 4) storing of the result in the output (the 4 mov). Afterwards the slice pointers are increased by 4 (rbx and r10) and the lengths (used for bounds checking) are decreased by 4 (r8 and r15). Finally there’s a check (cmp) to see if r9 (our loop) counter is at the end of the slice, and if not we jump back to the beginning and operate on the next BGRx chunk.

Generally what we want to do for optimizations is to get rid of unnecessary checks (bounds checking), memory accesses, conditions (cmp, cmov) and jumps (the instructions starting with j). These are all things that are slowing down our code.

So the first thing that seems useful to optimize here is the bounds checking at the beginning. It definitely seems not useful to do two checks instead of one for the two slices (the checks are for the both slices at once but Godbolt does not detect that and believes it’s only the input slice). And ideally we could teach the compiler that no bounds checking is needed at all.

As I wrote in the previous blog post, often this knowledge can be given to the compiler by inserting assertions.

To prevent two checks and just have a single check, you can insert a assert_eq!(in_p.len(), 4) at the beginning of the inner loop and the same for the output slice. Now we only have a single bounds check left per iteration.

As a next step we might want to try to move this knowledge outside the inner loop so that there is no bounds checking at all in there anymore. We might want to add assertions like the following outside the outer loop then to give all knowledge we have to the compiler

assert_eq!(in_data.len() % 4, 0); assert_eq!(out_data.len() % 4, 0); assert_eq!(out_data.len() / out_stride, in_data.len() / in_stride); assert!(in_line_bytes <= in_stride); assert!(out_line_bytes <= out_stride);

Unfortunately adding those has no effect at all on the inner loop, but having them outside the outer loop for good measure is not the worst idea so let’s just keep them. At least it can be used as some kind of documentation of the invariants of this code for future readers.

So let’s benchmark these two implementations now. The results on my machine are the following

test tests::bench_chunks_1920x1080_no_asserts ... bench: 4,420,145 ns/iter (+/- 139,051) test tests::bench_chunks_1920x1080_asserts ... bench: 4,897,046 ns/iter (+/- 166,555)

This is surprising, our version without the assertions is actually faster by a factor of ~1.1 although it had fewer conditions. So let’s take a closer look at the assembly at the top of the loop again, where the bounds checking happens, in the version with assertions

.LBB4_19: cmp rbx, r11 mov r9, r11 cmova r9, rbx mov r14, r12 sub r14, r9 lea rax, [r14 - 1] mov qword ptr [rbp - 120], rax mov qword ptr [rbp - 128], r13 mov qword ptr [rbp - 136], r10 cmp r14, 5 jne .LBB4_33 inc rcx [...]

While this indeed has only one jump as expected for the bounds checking, the number of comparisons is the same and even worse: 3 memory writes to the stack are happening right before the jump. If we follow to the .LBB4_33 label we will see that the assert_eq! macro is going to do something with core::fmt::Debug. This is setting up the information needed for printing the assertion failure, the “expected X equals to Y” output. This is certainly not good and the reason why everything is slower now.

First Optimization – Assertions Try 2

All the additional instructions and memory writes were happening because the assert_eq! macro is outputting something user friendly that actually contains the values of both sides. Let’s try again with the assert! macro instead

test tests::bench_chunks_1920x1080_no_asserts ... bench: 4,420,145 ns/iter (+/- 139,051) test tests::bench_chunks_1920x1080_asserts ... bench: 4,897,046 ns/iter (+/- 166,555) test tests::bench_chunks_1920x1080_asserts_2 ... bench: 3,968,976 ns/iter (+/- 97,084)

This already looks more promising. Compared to our baseline version this gives us a speedup of a factor of 1.12, and compared to the version with assert_eq! 1.23. If we look at the assembly for the bounds checks (everything else stays the same), it also looks more like what we would’ve expected

.LBB4_19: cmp rbx, r12 mov r13, r12 cmova r13, rbx add r13, r14 jne .LBB4_33 inc r9 [...]

One cmp less, only one jump left. And no memory writes anymore!

So keep in mind that assert_eq! is more user-friendly but quite a bit more expensive even in the “good case” compared to assert!.

Second Optimization – Iterate a bit more

This is still not very satisfying though. No bounds checking should be needed at all as each chunk is going to be exactly 4 bytes. We’re just not able to convince the compiler that this is the case. While it may be possible (let me know if you find a way!), let’s try something different. The zip iterator is done when the shortest iterator of both is done, and there are optimizations specifically for zipped slice iterators implemented. Let’s try that and replace the grayscale value calculation with

let grey = in_p.iter() .zip(RGB_Y.iter()) .map(|(i, c)| u32::from(*i) * c) .sum::<u32>() / 65536;

If we run that through our benchmark after removing the assert!(in_p.len() == 4) (and the same for the output slice), these are the results

test tests::bench_chunks_1920x1080_asserts_2 ... bench: 3,968,976 ns/iter (+/- 97,084) test tests::bench_chunks_1920x1080_iter_sum ... bench: 11,393,600 ns/iter (+/- 347,958)

We’re actually 2.9 times slower! Even when adding back the assert!(in_p.len() == 4) assertion (and the same for the output slice) we’re still slower

test tests::bench_chunks_1920x1080_asserts_2 ... bench: 3,968,976 ns/iter (+/- 97,084) test tests::bench_chunks_1920x1080_iter_sum ... bench: 11,393,600 ns/iter (+/- 347,958) test tests::bench_chunks_1920x1080_iter_sum_2 ... bench: 10,420,442 ns/iter (+/- 242,379)

If we look at the assembly of the assertion-less variant, it’s a complete mess now

.LBB0_19: cmp rbx, r13 mov rcx, r13 cmova rcx, rbx mov rdx, r8 sub rdx, rcx cmp rdx, 4 mov r11d, 4 cmovb r11, rdx test r11, r11 je .LBB0_20 movzx ecx, byte ptr [r15 - 2] imul ecx, ecx, 19595 cmp r11, 1 jbe .LBB0_22 movzx esi, byte ptr [r15 - 1] imul esi, esi, 38470 add esi, ecx movzx ecx, byte ptr [r15] imul ecx, ecx, 7471 add ecx, esi test rdx, rdx jne .LBB0_23 jmp .LBB0_35 .LBB0_20: xor ecx, ecx .LBB0_22: test rdx, rdx je .LBB0_35 .LBB0_23: shr ecx, 16 mov byte ptr [r10 - 3], cl mov byte ptr [r10 - 2], cl cmp rdx, 3 jb .LBB0_36 inc r9 mov byte ptr [r10 - 1], cl mov byte ptr [r10], cl add r10, 4 add r8, -4 add rbx, -4 add r15, 4 cmp r9, r14 jb .LBB0_19

In short, there are now various new conditions and jumps for short-circuiting the zip iterator in the various cases. And because of all the noise added, the compiler was not even able to optimize the bounds check for the output slice away anymore (.LBB0_35 cases). While it was able to unroll the iterator (note that the 3 imul multiplications are not interleaved with jumps and are actually 3 multiplications instead of yet another loop), which is quite impressive, it couldn’t do anything meaningful with that information it somehow got (it must’ve understood that each chunk has 4 bytes!). This looks like something going wrong somewhere in the optimizer to me.

If we take a look at the variant with the assertions, things look much better

.LBB3_19: cmp r11, r12 mov r13, r12 cmova r13, r11 add r13, r14 jne .LBB3_33 inc r9 movzx ecx, byte ptr [rdx - 2] imul r13d, ecx, 19595 movzx ecx, byte ptr [rdx - 1] imul ecx, ecx, 38470 add ecx, r13d movzx ebx, byte ptr [rdx] imul ebx, ebx, 7471 add ebx, ecx shr ebx, 16 mov byte ptr [r10 - 3], bl mov byte ptr [r10 - 2], bl mov byte ptr [r10 - 1], bl mov byte ptr [r10], bl add r10, 4 add r11, -4 add r14, 4 add rdx, 4 cmp r9, r15 jb .LBB3_19

This is literally the same as the assertion version we had before, except that the reading of the input slice, the multiplications and the additions are happening in iterator order instead of being batched all together. It’s quite impressive that the compiler was able to completely optimize away the zip iterator here, but unfortunately it’s still many times slower than the original version. The reason must be the instruction-reordering. The previous version had all memory reads batched and then the operations batched, which is apparently much better for the internal pipelining of the CPU (it is going to perform the next instructions without dependencies on the previous ones already while waiting for the pending instructions to finish).

It’s also not clear to me why the LLVM optimizer is not able to schedule the instructions the same way here. It apparently has all information it needs for that if no iterator is involved, and both versions are leading to exactly the same assembly except for the order of instructions. This also seems like something fishy.

Nonetheless, we still have our manual bounds check (the assertion) left here and we should really try to get rid of that. No progress so far.

Third Optimization – Getting rid of the bounds check finally

Let’s tackle this from a different angle now. Our problem is apparently that the compiler is not able to understand that each chunk is exactly 4 bytes.

So why don’t we write a new chunks iterator that has always exactly the requested amount of items, instead of potentially less for the very last iteration. And instead of panicking if there are leftover elements, it seems useful to just ignore them. That way we have API that is functionally different from the existing chunks iterator and provides behaviour that is useful in various cases. It’s basically the slice equivalent of the exact_chunks iterator of the ndarray crate.

By having it functionally different from the existing one, and not just an optimization, I also submitted it for inclusion in Rust’s standard library and it’s nowadays available as an unstable feature in nightly. Like all newly added API. Nonetheless, the same can also be implemented inside your code with basically the same effect, there are no dependencies on standard library internals.

So, let’s use our new exact_chunks iterator that is guaranteed (by API) to always give us exactly 4 bytes. In our case this is exactly equivalent to the normal chunks as by construction our slices always have a length that is a multiple of 4, but the compiler can’t infer that information. The resulting code looks as follows

pub fn bgrx_to_gray_exact_chunks( in_data: &[u8], out_data: &mut [u8], in_stride: usize, out_stride: usize, width: usize, ) { assert_eq!(in_data.len() % 4, 0); assert_eq!(out_data.len() % 4, 0); assert_eq!(out_data.len() / out_stride, in_data.len() / in_stride); let in_line_bytes = width * 4; let out_line_bytes = width * 4; assert!(in_line_bytes <= in_stride); assert!(out_line_bytes <= out_stride); for (in_line, out_line) in in_data .exact_chunks(in_stride) .zip(out_data.exact_chunks_mut(out_stride)) { for (in_p, out_p) in in_line[..in_line_bytes] .exact_chunks(4) .zip(out_line[..out_line_bytes].exact_chunks_mut(4)) { assert!(in_p.len() == 4); assert!(out_p.len() == 4); let b = u32::from(in_p[0]); let g = u32::from(in_p[1]); let r = u32::from(in_p[2]); let x = u32::from(in_p[3]); let grey = ((r * RGB_Y[0]) + (g * RGB_Y[1]) + (b * RGB_Y[2]) + (x * RGB_Y[3])) / 65536; let grey = grey as u8; out_p[0] = grey; out_p[1] = grey; out_p[2] = grey; out_p[3] = grey; } } }

It’s exactly the same as the previous version with assertions, except for using exact_chunks instead of chunks and the same for the mutable iterator. The resulting benchmark of all our variants now looks as follow

test tests::bench_chunks_1920x1080_no_asserts ... bench: 4,420,145 ns/iter (+/- 139,051) test tests::bench_chunks_1920x1080_asserts ... bench: 4,897,046 ns/iter (+/- 166,555) test tests::bench_chunks_1920x1080_asserts_2 ... bench: 3,968,976 ns/iter (+/- 97,084) test tests::bench_chunks_1920x1080_iter_sum ... bench: 11,393,600 ns/iter (+/- 347,958) test tests::bench_chunks_1920x1080_iter_sum_2 ... bench: 10,420,442 ns/iter (+/- 242,379) test tests::bench_exact_chunks_1920x1080 ... bench: 2,007,459 ns/iter (+/- 112,287)

Compared to our initial version this is a speedup of a factor of 2.2, compared to our version with assertions a factor of 1.98. This seems like a worthwhile improvement, and if we look at the resulting assembly there are no bounds checks at all anymore

.LBB0_10: movzx edx, byte ptr [rsi - 2] movzx r15d, byte ptr [rsi - 1] movzx r12d, byte ptr [rsi] imul r13d, edx, 19595 imul edx, r15d, 38470 add edx, r13d imul ebx, r12d, 7471 add ebx, edx shr ebx, 16 mov byte ptr [rcx - 3], bl mov byte ptr [rcx - 2], bl mov byte ptr [rcx - 1], bl mov byte ptr [rcx], bl add rcx, 4 add rsi, 4 dec r10 jne .LBB0_10

Due to this the compiler is also able to apply some more optimizations, and we are left with only one loop counter for the number of iterations (r10) and the two pointers (rcx and rsi) that are advanced in each iteration. There is no tracking of the remaining slice lengths anymore, as there was in the assembly of the original version (and the versions with assertions).

Summary

So overall we got a speedup of a factor of 2.2 while still writing very high-level Rust code with iterators, without falling back to unsafe code or using SIMD. The optimizations the Rust compiler applies are quite impressive, and the Rust marketing line about zero-cost abstractions really proves itself here.

The same approach should also work for many similar algorithms, in particular many multimedia-related algorithms where you iterate over slices and operate on fixed-size chunks; a sketch of one such application follows below.
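As a hypothetical illustration of that claim, here is the same pattern applied to another fixed-chunk task: mixing interleaved stereo samples down to mono. This assumes the exact_chunks iterator from above is available; the function itself is made up for the example:

fn stereo_to_mono(in_data: &[i16], out_data: &mut [i16]) {
    // Each frame is two interleaved samples: left, right
    assert_eq!(in_data.len() % 2, 0);
    assert_eq!(out_data.len(), in_data.len() / 2);

    for (frame, out) in in_data.exact_chunks(2).zip(out_data.iter_mut()) {
        // Average the two channels; widen to i32 to avoid overflow
        *out = ((i32::from(frame[0]) + i32::from(frame[1])) / 2) as i16;
    }
}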

The above also shows that, as a first step, it’s better to write clean and understandable high-level Rust code without worrying too much about performance (assume the compiler can optimize well), and only afterwards take a look at the generated assembly and check which instructions should really go away (like bounds checking). In many cases this can be achieved by adding assertions in strategic places, or, as in this case, by switching to a slightly different abstraction that is closer to the actual requirements. (I believe the compiler should be able to produce the same code from the normal chunks iterator with the help of assertions, but making that possible probably requires improvements to the LLVM optimizer.)
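A tiny, self-contained example of what “assertions in strategic places” means; the function is hypothetical, but the principle is the one used throughout this post. The single length check up front lets the compiler prove that all four indexing operations are in bounds, so no per-access bounds checks remain:

fn sum_first_four(v: &[u32]) -> u32 {
    // One check here replaces four checks below
    assert!(v.len() >= 4);
    v[0] + v[1] + v[2] + v[3]
}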

And if all of that does not help, there’s still the escape hatch of unsafe code (using functions like slice::get_unchecked() or going down to raw pointers) and the possibility of using SIMD instructions (via faster or stdsimd directly). But in the end this should be a last resort, for those little parts of your code where optimizations are needed and the compiler can’t easily be convinced to do them for you.
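For completeness, a minimal sketch of that escape hatch (a hypothetical helper, not code from this post). With get_unchecked the bounds check is skipped entirely, so the assertion is what keeps this sound:

fn first_pixel_byte(line: &[u8]) -> u8 {
    assert!(!line.is_empty());
    // SAFETY: the assertion above guarantees that index 0 is in bounds
    unsafe { *line.get_unchecked(0) }
}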

Addendum: slice::split_at

User newpavlov suggested on Reddit using repeated slice::split_at calls in a while loop to get similar performance.

This would, for example, look like the following:

pub fn bgrx_to_gray_split_at(
    in_data: &[u8],
    out_data: &mut [u8],
    in_stride: usize,
    out_stride: usize,
    width: usize,
) {
    assert_eq!(in_data.len() % 4, 0);
    assert_eq!(out_data.len() % 4, 0);
    assert_eq!(out_data.len() / out_stride, in_data.len() / in_stride);

    let in_line_bytes = width * 4;
    let out_line_bytes = width * 4;

    assert!(in_line_bytes <= in_stride);
    assert!(out_line_bytes <= out_stride);

    for (in_line, out_line) in in_data
        .exact_chunks(in_stride)
        .zip(out_data.exact_chunks_mut(out_stride))
    {
        let mut in_pp: &[u8] = in_line[..in_line_bytes].as_ref();
        let mut out_pp: &mut [u8] = out_line[..out_line_bytes].as_mut();
        assert!(in_pp.len() == out_pp.len());

        while in_pp.len() >= 4 {
            let (in_p, in_tmp) = in_pp.split_at(4);
            let (out_p, out_tmp) = { out_pp }.split_at_mut(4);
            in_pp = in_tmp;
            out_pp = out_tmp;

            let b = u32::from(in_p[0]);
            let g = u32::from(in_p[1]);
            let r = u32::from(in_p[2]);
            let x = u32::from(in_p[3]);

            let grey = ((r * RGB_Y[0]) + (g * RGB_Y[1]) + (b * RGB_Y[2]) + (x * RGB_Y[3])) / 65536;
            let grey = grey as u8;
            out_p[0] = grey;
            out_p[1] = grey;
            out_p[2] = grey;
            out_p[3] = grey;
        }
    }
}

Performance-wise this brings us very close to the exact_chunks version

test tests::bench_exact_chunks_1920x1080 ... bench: 1,965,631 ns/iter (+/- 58,832)
test tests::bench_split_at_1920x1080     ... bench: 2,046,834 ns/iter (+/- 35,990)

and the assembly is also very similar

.LBB0_10:
    add rbx, -4
    movzx r15d, byte ptr [rsi]
    movzx r12d, byte ptr [rsi + 1]
    movzx edx, byte ptr [rsi + 2]
    imul r13d, edx, 19595
    imul r12d, r12d, 38470
    imul edx, r15d, 7471
    add edx, r12d
    add edx, r13d
    shr edx, 16
    movzx edx, dl
    imul edx, edx, 16843009
    mov dword ptr [rcx], edx
    lea rcx, [rcx + 4]
    add rsi, 4
    cmp rbx, 3
    ja .LBB0_10

Here the compiler even optimizes the storing of the value into a single 4-byte write operation, at the cost of an additional multiplication and a zero-extending register move.
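The multiplication trick deserves a short explanation: 16843009 is 0x01010101, so multiplying the single grey byte by it replicates that byte into all four byte positions of a 32-bit register, which can then be written with one store. A hypothetical stand-alone illustration:

fn broadcast_byte(b: u8) -> u32 {
    // 0x01010101 * 0xAB == 0xABABABAB
    u32::from(b) * 0x0101_0101
}

fn main() {
    assert_eq!(broadcast_byte(0xAB), 0xABABABAB);
}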

Overall this code performs very well too, but in my opinion it looks rather ugly compared to the versions using the different chunks iterators. Also, this is basically what the exact_chunks iterator does internally: repeatedly calling slice::split_at. In theory both versions could lead to the very same assembly, but the LLVM optimizer currently handles the two slightly differently.

New year haul

Planet Debian - Dje, 21/01/2018 - 12:08pd

Some newly acquired books. This is a pretty wide variety of impulse purchases, filled with the optimism of a new year with more reading time.

Libba Bray — Beauty Queens (sff)
Sarah Gailey — River of Teeth (sff)
Seanan McGuire — Down Among the Sticks and Bones (sff)
Alexandra Pierce & Mimi Mondal (ed.) — Luminescent Threads (nonfiction anthology)
Karen Marie Moning — Darkfever (sff)
Nnedi Okorafor — Binti (sff)
Malka Older — Infomocracy (sff)
Brett Slatkin — Effective Python (nonfiction)
Zeynep Tufekci — Twitter and Tear Gas (nonfiction)
Martha Wells — All Systems Red (sff)
Helen S. Wright — A Matter of Oaths (sff)
J.Y. Yang — Waiting on a Bright Moon (sff)

Several of these are novellas that were on sale over the holidays; the rest came from a combination of reviews and random on-line book discussions.

The year hasn't been great for reading time so far, but I do have a couple of things ready to review and a third that I'm nearly done with, which is not a horrible start.

Russ Allbery https://www.eyrie.org/~eagle/ Eagle's Path

PC desktop build, Intel, spectre issues etc.

Planet Debian - Sht, 20/01/2018 - 11:05md

This is going to be a longish one.

I have been using desktop computers for around a couple of decades now. My first two systems were an Intel Pentium III and then a Pentium Dual-core, the first one on a Kobian/Mercury motherboard. The motherboards were actually called Mercury, a brand which was later sold to Kobian, which kept the brand name. The motherboards and the CPUs/processors used to be cheap. One could set up a decentish low-end system with display for around INR 40k/-, which seemed reasonable given that as a country we had just come out of the non-alignment movement and had also chosen to shed our isolationist tendencies (technological and otherwise). Most middle-class families got their first taste of computers after Y2K. There were quite a few Y2K incomes, which prompted the Government to lower duties further.

One of the highlights shown by CNN (probably CNN International) when satellite TV came in 1991 was the coming down of the Berlin Wall. Many of us were completely ignorant of world politics and of what was happening in other parts of the world.

Computer systems at that time were considered a luxury item, and duties were sky-high (between 1992 and 2001). The launch of Mars Pathfinder and its subsequent successful landing on the Martian surface also captured people’s imagination about PCs and micro-processors.

I can still recall the excitement among young people of my age on first seeing the liftoff from Cape Canaveral and later the processed images from Spirit’s cameras showing a desolate, desert-like landscape. We also witnessed the beginnings of the ‘International Space Station‘ (ISS).

A few of my friends and I had drunk a lot of the Carl Sagan and other sci-fi kool-aid/stories. Star Trek, the movies, and the universal values held/shared by them were a major influence on all our lives.

People came to know about citizen science and distributed science projects, the Y2K fear turned out to be unfounded, and all these factors and probably a few more prompted the Government of India to reduce duties on motherboards, processors and components, as well as to take computers off the restricted list, which led to competition and finally to the common man being able to dream of owning a system sooner rather than later. Y2K also kick-started the Indian software industry, which is the bread and butter of many a middle-class man and woman in the service industry using technology directly or indirectly.

In 2002 I bought my first system: an Intel Pentium III on an i810 chipset (integrated graphics) with 256 MB of SDRAM, which was supposed to be sufficient for the tasks it was being used for (some light gaming, some web-mail, watching movies, etc.), running on a Mercury board. I don’t remember the code-name, partly because the code-names are/were really weird and partly because it is just too long ago. I remember using Windows ’98 and trying to install one of the early GNU/Linux variants on that machine. If memory serves right, you had to flick a jumper (like a switch) to use the extended memory.

I do not know/remember what happened, but I think within a year or two of that time-frame Mercury India filed for bankruptcy and the name and manufacturing were sold to Kobian. After Kobian took over, it said it would neither honor the 3/5-year warranty nor even do repairs on the motherboards Mercury had sold. This created a lot of ill will against the company and relegated it to the bottom of the pile for both experienced and new system-builders. Also, Mercury motherboards weren’t reputed/known to have a long life, although the one I had gave me quite a decent run.

The next machine I purchased was a Pentium Dual-core (around 2009/2010), an LGA Willamette which had out-of-order execution; the Meltdown bug which is making news nowadays has history reaching this far back. I think I bought it in 45nm, which was a huge jump from the previous version, although still in the mATX package. Again the board was from Mercury. (Intel 845 chipset; DDR2 2 GB RAM and SATA came to stay.)

So Meltdown has been in existence for some 10-12 years and is present in everything which uses either Intel or ARM processors.

As you can probably make out, most systems arrived here 2-3 years later than when they were launched in American and/or European markets. Also, business and tourism travel was neither as easy, smooth nor transparent as it is today. All of this added to the delay in getting new products into India.

Sadly, the Indian market is similar to that of other countries, with Intel used in more than 90% of machines. I know of a few institutions (though pretty rare) which insisted on and got AMD solutions.

That was the time when Gigabyte came onto the scene, which formed the basis of my Wolfdale-3M 45nm system; it was in the same price range as the earlier models and offered a teeny-tiny bit of additional graphics performance. To the best of my knowledge, it was perhaps the first budget motherboard to offer solid-state capacitors. The mobo-processor bundle used to be in the range of INR 7-8k, excluding RAM, cabinet, etc. I had a Philips 17″ CRT display which ran for a good decade or so, so I just had to get the new cabinet, motherboard, CPU and RAM and was good to go.

A few months later, at a hardware exhibition held in the city, I was invited to a party by Asus, which was then just getting a toe-hold in the Indian market. I went to the do and enjoyed myself. They held a small competition where they asked some questions and asked if people had queries. To my surprise, most of the people there were hardware vendors, and for one reason or another they chose to remain silent. Hence I got an AMD Asus board. Separately, I also won a Gigabyte motherboard in another competition in the same time-frame. Both were mid-range (ATX) motherboards.

As I had just bought a Gigabyte (mATX) motherboard and completed the build, I gave both motherboards away, one to a friend and one to my uncle, and both were pleased with the AMD-based mobos, which they paired with AMD processors. At that time AMD had one-upped Intel in both graphics and even bare computing, especially at the middle level, and they were striving to push into new markets.

Apart from that initial system, most of my systems, when changed, have been in the INR 20-25k/- budget, including any and all accessories bought later.

The only really expensive parts I have purchased were an external HDD (1 TB WD Passport) and a ViewSonic 17″ LCD, which together set me back around INR 10k/-, but both have given me adequate performance (both have outlived their warranty years), with the monitor being used almost 24×7 over 6 years or so, of course under GNU/Linux, specifically Debian. Both have been extremely good value for the money.

Having been exposed to both brands of motherboard, I kept following them and others as well. What has been interesting to observe is what Asus did later: it focused more on the high-end gaming market, while Gigabyte continued to dilute its energy across both mid-range and high-end motherboards.

Cut to 2017, and I had seen quite a few reports –

http://www.pcstats.com/NewsView.cfm?NewsID=131618

http://www.digitimes.com/news/a20170904PD207.html

http://www.guru3d.com/news-story/asus-has-the-largest-high-end-intel-motherboard-share.html

All of which point to the fact that Asus had cornered a large percentage of the market, and specifically the gaming market. There are no formal numbers, though, as both Asus and Gigabyte choose to release only APAC numbers rather than a country-wide split, which would have made for some interesting reading.

Just so that people do not presume anything: there are about 4-5 motherboard vendors in the Indian market. There is Asus at the top (I believe), followed by Gigabyte, with Intel at a distant 3rd place (because it’s too expensive). There are also pockets of ASRock and MSI, and I know of people who follow them religiously, although their mobos are supposed to be somewhat more expensive than the two above. Asus and Gigabyte do try to fight it out with each other, but each has its core competency, I believe, with Asus being used by heavy gamers and overclockers more than Gigabyte.

Anyway, come October 2017 my main desktop died, and I was left, as they say, up the creek without a paddle. I didn’t even have Net access for about 3 weeks due to BSNL’s or PMC’s foolishness and then small riots breaking out due to the Koregaon Bhima conflict.

This led to a situation where I had to buy/build a system with oldish/half knowledge. I was open to having an AMD system, but both Datacare and Rashi Peripherals, Pune, both of whom used to deal in AMD systems, shared that they had stopped dealing in AMD stuff some time back. While Datacare had AMD mobos, getting processors was an issue. Both vendors are near my home, so if I buy from them, getting support becomes a non-issue. I could have gone out of my way to get an AMD processor, but then support could have been an issue, as I would have had to travel and I do not know the other vendors well enough. Hence I fell back to the Intel platform.

I asked around quite a few PC retailers and distributors and found that the Asus Prime Z270-P was the only mid-range motherboard available at that time. I did come to know a bit later of other motherboards in the Z270 series, but most vendors didn’t/don’t stock them, as there are capital, interest and stock-holding costs.

History – Historically, there has been a huge time lag between worldwide announcements of motherboards, processors, etc., the announcements of sale in India, and actually getting hands on the newest motherboards and processors, as seen above. This has led to quite a bit of frustration for many users. I have known of many a soul visiting Lamington Road, Mumbai to get the latest motherboard or processor. Even to-date this system flourishes, as Mumbai has an international airport and there is always a demand and people willing to pay a premium for the newest processor/motherboard even before any reviews are in.

I was highly surprised to learn recently that Prime Z370-P motherboards are already selling (just 3 months late) with the Intel 8th-generation processors, although these are still trickling in as samples rather than the torrent some of the other motherboard combos might be.

In the end I bought an Intel I7400 chip and an Asus Prime Z270-P motherboard with 2400 MHz Corsair 8 GB RAM, a 4 TB WD Green (5400 rpm) HDD and a Circle 545 cabinet (with its almost criminal 400-Watt SMPS). I later came to know that it’s not really even 400 Watts, but around 20-25% less. The whole package cost me north of INR 50k/-, and I still need to spend on a better SMPS (probably a Corsair or Cooler Master 80+ 600/650W unit) along with a few accessories to complete the system.

I will be changing the PSU most probably next week.

Disclosure – The neatness you see is not my doing. I was unsure whether I would be able to put the heatsink on the CPU properly, as that is the most sensitive part while building a system; a bent pin on the CPU could play havoc as well as void the warranty on the CPU, the motherboard, or both. The knobs that can be seen on the heatsink fan were something I hadn’t seen before. The vendor fixed the processor onto the mobo for me as well as tied up the remaining power cables without being asked, for which I am and was grateful, and I would definitely give him more business as and when I need components.

Future – While it’s OK for now, I’m still using a pretty old 2-speaker setup, which I hope to upgrade to a 2.1/3.1 speaker setup, along with a full 64 GB of 2400 MHz Kingston Razor/G.Skill/Corsair memory and an M.2 512 GB SSD.

If I do get the Taiwan DebConf bursary, I hope to buy some or all of the above, plus a Samsung or some other Android/Replicant/Librem smartphone. I have also been looking for a vastly simplified smartphone for my mum, with big letters and everything, but I have failed to find one in the Indian market. Of course this all depends on whether I get the bursary, and even then on whether global warranty and currency exchange work out in my favor vis-à-vis what I would have to pay in India.

Apart from the above, Taiwan is supposed to be a pretty good source of graphic novels, manga comics and lots of RPG games at very cheap prices, with covers, hand-drawn material, etc. All of this is based upon a few friends’ anecdotal experiences, so I don’t know if all of that would still hold true if I manage to be there.

There are also quite a few chip foundries there, and maybe during DebConf a visit to one of them could be arranged. It would be rewarding if the visit were to a 45nm-or-lower chip foundry, as India is still stuck at the 65nm range to date.

I will share my experience with the board and the CPU, the expectations I had of the Intel chip, and the somewhat disappointing experience of using Debian on the new board in the next post. It is not necessarily Debian’s fault, but rather the free software ecosystem that is at fault here.

Feel free to point out any mistakes you find, grammatical or otherwise. This blog post has been in the works for over a couple of weeks, so it’s possible mistakes have crept in.

shirishag75 https://flossexperiences.wordpress.com #planet-debian – Experiences in the community

TLCockpit v0.8

Planet Debian - Sht, 20/01/2018 - 3:32md

Today I released v0.8 of TLCockpit, the GUI front-end for the TeX Live Manager tlmgr. I spent the winter holidays updating and polishing it, and also debugging problems that users have reported. Hopefully the new version works better for everyone.

If you are looking for a general introduction to TLCockpit, please see the blog post introducing it. Here I only want to describe the changes made since the last release:

  • add debug facility: It is now possible to pass -d to tlcockpit to activate debugging. There is also -dd for more verbose debugging; see the usage example after this list.
  • select mirror facility: The edit screen for the repository setting now allows selecting from the current list of mirrors, see the following screenshot:
  • initial loading speedup: Until now we parsed the JSON output of tlmgr, which includes everything the whole database contains. We now load only the initial minimal information via info --data and fetch additional data on demand when details for a package are shown. This should make a particular difference on systems without a compiled JSON Perl library available.
  • fixed self update: In the previous version, updating the TeX Live Manager itself did not work properly – it was updated, but the application itself became unresponsive afterwards. This is hopefully fixed (although it is really tricky).
  • status indicator: The status indicator has moved from the menu bar (where it was somewhat of a stranger) to below the package listing, and now also shows the currently running command; see the screenshot after the next item.
  • nice spinner: Pure eye-candy, but I added a rotating spinner that is shown while loading the database, updates, or backups, or while running postactions. See the attached screenshot, which also shows the new location of the status indicator and the additional information provided.
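As a quick usage note, the debug flags described above are passed directly on the command line, for example:

tlcockpit -d    # debugging output
tlcockpit -dd   # more verbose debugging output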

I hope that this version is more reliable, stable, and easier to use. As usual, please use the issue page of the GitHub project to report problems.

TeX Live should contain the new version starting from tomorrow.

Enjoy.

Norbert Preining https://www.preining.info/blog There and back again
