Sunday, January 20, 2013

Processing HTML with Scala as if it were XML

The controversial XML API for Scala is still useful for simple use cases. However, it falls short when dealing with HTML as found on public websites, since such HTML is often not well-formed XML.

import java.net.URL
import scala.xml.XML
val site = new URL("http://michel-daviot.blogspot.fr/")
XML.load(site)
//Exception in thread "main" org.xml.sax.SAXParseException; lineNumber: 7; columnNumber: 265;
//The entity name must immediately follow the '&' in the entity reference.
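To see the problem without hitting the network, here is a minimal sketch that feeds markup to the JDK's own SAX parser and reports the first well-formedness error. The object name `WellFormedCheck` and the sample strings are mine, purely for illustration; only standard `javax.xml` and `org.xml.sax` classes are used.

```scala
import java.io.StringReader
import javax.xml.parsers.SAXParserFactory
import org.xml.sax.{InputSource, SAXParseException}
import org.xml.sax.helpers.DefaultHandler

// Hypothetical helper, not part of the original post.
object WellFormedCheck {
  // Returns None if the markup is well-formed XML,
  // otherwise the message of the first fatal parse error.
  def parseError(markup: String): Option[String] = {
    val parser = SAXParserFactory.newInstance().newSAXParser()
    try {
      // DefaultHandler ignores all events and rethrows fatal errors
      parser.parse(new InputSource(new StringReader(markup)), new DefaultHandler)
      None
    } catch {
      case e: SAXParseException => Some(e.getMessage)
    }
  }
}

object WellFormedDemo extends App {
  // Typical "real world" HTML: a bare '&' in a URL and an unclosed <br>
  println(WellFormedCheck.parseError("""<a href="page?x=1&y=2">link</a>"""))
  println(WellFormedCheck.parseError("<p>one<br>two</p>"))
  // The same content written as well-formed XML
  println(WellFormedCheck.parseError("<p>one<br/>two &amp; three</p>"))
}
```

The first two calls should report an error, the last one should not. The exact error messages depend on the parser bundled with your JDK.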



This limitation is quite easy to overcome, however, with the TagSoup library, which "fixes" the HTML markup so that it can be parsed as XML.

Add this dependency to your pom.xml:

<dependency>
  <groupId>org.ccil.cowan.tagsoup</groupId>
  <artifactId>tagsoup</artifactId>
  <version>1.2.1</version>
</dependency>


This simple object can then be used to load a URL containing HTML markup as if it were XML.

import java.net.URL
import java.net.HttpURLConnection
import org.xml.sax.InputSource
import scala.xml.Node
import scala.xml.parsing.NoBindingFactoryAdapter
import org.ccil.cowan.tagsoup.jaxp.SAXFactoryImpl

object HTML {
  // Builds scala.xml nodes from the SAX events emitted by the parser
  lazy val adapter = new NoBindingFactoryAdapter
  // TagSoup's SAX parser, which tolerates malformed HTML
  lazy val parser = (new SAXFactoryImpl).newSAXParser

  def load(url: URL, headers: Map[String, String] = Map.empty): Node = {
    // The cast assumes an http or https URL
    val conn = url.openConnection().asInstanceOf[HttpURLConnection]
    for ((k, v) <- headers)
      conn.setRequestProperty(k, v)
    val in = conn.getInputStream
    // Close the stream once the document has been parsed
    try adapter.loadXML(new InputSource(in), parser)
    finally in.close()
  }
}


Note that it can also set HTTP headers on the request, for instance to send a cookie or a session ID so that the page is loaded as a logged-in user.
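As an illustration of the headers parameter, here is a small sketch that builds a `Cookie` request header from name/value pairs. The `Cookies` object is a hypothetical helper of mine, and the cookie names and values are placeholders; the real ones depend on the site you are logged in to.

```scala
// Hypothetical helper, not part of the original post.
object Cookies {
  // Joins name/value pairs into a single Cookie request header,
  // in the form "name1=value1; name2=value2".
  def header(cookies: Seq[(String, String)]): Map[String, String] =
    if (cookies.isEmpty) Map.empty
    else Map("Cookie" -> cookies.map { case (n, v) => s"$n=$v" }.mkString("; "))
}

object CookiesDemo extends App {
  // Placeholder session cookie; prints the header map that would be sent
  println(Cookies.header(Seq("JSESSIONID" -> "abc123", "lang" -> "fr")))
}
```

The resulting map can be passed as the second argument of the loader, for example `HTML.load(site, Cookies.header(Seq("JSESSIONID" -> "abc123")))`.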

Example calling code, which lists all the HTML links from the loaded page:

import java.net.URL
val site = new URL("http://michel-daviot.blogspot.fr/")
val content = HTML.load(site)
for (
  a <- content \\ "a";
  href = a.attribute("href");
  if href.isDefined
) println(href.get)
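The href values collected this way are printed as they appear in the page, so some of them may be relative. If you need absolute URLs, one possible approach (a sketch of mine, not something the loader does for you) is to resolve each value against the page address with `java.net.URI`. The `Links` object and the example.org addresses are illustrative assumptions.

```scala
import java.net.URI
import scala.util.Try

// Hypothetical helper, not part of the original post.
object Links {
  // Resolves an href against the page address.
  // Returns None when the href is not a syntactically valid URI.
  def absolutize(base: URI, href: String): Option[URI] =
    Try(base.resolve(href.trim)).toOption
}

object LinksDemo extends App {
  val base = new URI("http://example.org/blog/2013/index.html")
  // A relative link, a root-relative link, an absolute link and a broken one
  for (href <- Seq("post.html", "/about", "http://other.org/x", "bad link"))
    println(href + " -> " + Links.absolutize(base, href))
}
```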


And as a bonus, here is an object that forces proxy settings from the code (of course, you could also set them from the command line).

object SetProxy {
  // Sets the JVM-wide proxy properties for both http and https
  def apply(proxyConfig: (String, Int)): Unit = {
    val (host, port) = proxyConfig
    for (protocol <- Seq("http", "https")) {
      System.setProperty(s"$protocol.proxyPort", port.toString)
      System.setProperty(s"$protocol.proxyHost", host)
    }
  }
}
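Since these are JVM-wide system properties, setting them affects every later connection. If you only want the proxy for one block of code, a possible variant (a sketch under my own assumptions; `ProxyScope`, `withProxy` and the proxy address are hypothetical) saves the previous values and restores them afterwards.

```scala
// Hypothetical helper, not part of the original post.
object ProxyScope {
  private val protocols = Seq("http", "https")

  // Runs body with the proxy properties set, then restores the old values.
  def withProxy[A](host: String, port: Int)(body: => A): A = {
    val keys = protocols.flatMap(p => Seq(s"$p.proxyHost", s"$p.proxyPort"))
    // Remember what was set before (None means the property was absent)
    val saved = keys.map(k => k -> Option(System.getProperty(k)))
    for (p <- protocols) {
      System.setProperty(s"$p.proxyHost", host)
      System.setProperty(s"$p.proxyPort", port.toString)
    }
    try body
    finally saved.foreach {
      case (k, Some(v)) => System.setProperty(k, v)
      case (k, None)    => System.clearProperty(k)
    }
  }
}

object ProxyScopeDemo extends App {
  // Placeholder proxy address; prints the value seen inside the block
  val seen = ProxyScope.withProxy("proxy.example.com", 8080) {
    System.getProperty("http.proxyHost")
  }
  println(seen)
}
```

The command-line equivalent uses the same property names, passed to the JVM as `-Dhttp.proxyHost=...`, `-Dhttp.proxyPort=...` and their `https` counterparts.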

Tuesday, January 15, 2013

Easy Scala with Maven & Eclipse (scala 2.10)


It took me some time to figure this out... I am not sold on sbt yet, and I am used to Eclipse 3.7.

So here is the simple procedure:



  1. Install the Scala plugin for Eclipse from scala-ide.org. I use this update-site for Scala 2.10
  2. Install this Eclipse plugin for Maven / m2e: http://alchim31.free.fr/m2e-scala/update-site
  3. You can then easily import a Maven project.
  4. Right-click on the project, then Configure > Add Scala Nature


Sample Maven file:

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
  <modelVersion>4.0.0</modelVersion>
  <groupId>info.daviot</groupId>
  <version>0.1-SNAPSHOT</version>
  <artifactId>template</artifactId>
  <packaging>jar</packaging>
  <properties>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <scala.version>2.11.5</scala.version>
    <java.version>1.7</java.version>
  </properties>
  <dependencies>
    <dependency>
      <artifactId>scala-library</artifactId>
      <groupId>org.scala-lang</groupId>
      <version>${scala.version}</version>
    </dependency>
    <!-- optional dependencies -->
    <dependency>
      <groupId>com.softwaremill.macwire</groupId>
      <artifactId>macros_2.11</artifactId>
      <version>0.8.0</version>
    </dependency>
    <dependency>
      <groupId>com.typesafe.akka</groupId>
      <artifactId>akka-actor_2.11</artifactId>
      <version>2.3.8</version>
    </dependency>
    <dependency>
      <groupId>com.github.nscala-time</groupId>
      <artifactId>nscala-time_2.11</artifactId>
      <version>1.4.0</version>
    </dependency>
    <dependency>
      <groupId>com.propensive</groupId>
      <artifactId>rapture-json-jawn_2.11</artifactId>
      <version>1.1.0</version>
    </dependency>
    <!-- logs -->
    <dependency>
      <groupId>org.clapper</groupId>
      <artifactId>grizzled-slf4j_2.11</artifactId>
      <version>1.0.2</version>
    </dependency>
    <dependency>
      <groupId>ch.qos.logback</groupId>
      <artifactId>logback-classic</artifactId>
      <version>1.1.2</version>
    </dependency>
    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-api</artifactId>
      <version>1.7.6</version>
    </dependency>
    <!-- tests -->
    <dependency>
      <groupId>org.scalatest</groupId>
      <artifactId>scalatest_2.11</artifactId>
      <version>2.2.2</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <artifactId>junit</artifactId>
      <groupId>junit</groupId>
      <version>4.10</version>
      <scope>test</scope>
    </dependency>
    <dependency>
      <groupId>org.powermock</groupId>
      <artifactId>powermock-api-mockito</artifactId>
      <version>1.5.5</version>
      <scope>test</scope>
    </dependency>
  </dependencies>
  <build>
    <sourceDirectory>src/main/scala</sourceDirectory>
    <testSourceDirectory>src/test/scala</testSourceDirectory>
    <plugins>
      <plugin>
        <groupId>net.alchim31.maven</groupId>
        <artifactId>scala-maven-plugin</artifactId>
        <version>3.1.6</version>
      </plugin>
      <plugin>
        <groupId>org.scalariform</groupId>
        <artifactId>scalariform-maven-plugin</artifactId>
        <version>0.1.4</version>
        <executions>
          <execution>
            <phase>process-sources</phase>
            <goals>
              <goal>format</goal>
            </goals>
            <configuration>
              <rewriteArrowSymbols>true</rewriteArrowSymbols>
            </configuration>
          </execution>
        </executions>
      </plugin>
      <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-compiler-plugin</artifactId>
        <version>3.5</version>
        <configuration>
          <source>${java.version}</source>
          <target>${java.version}</target>
        </configuration>
        <executions>
          <execution>
            <phase>compile</phase>
            <goals>
              <goal>compile</goal>
            </goals>
          </execution>
        </executions>
      </plugin>
    </plugins>
  </build>
</project>
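To check that the setup works once the project is imported, a minimal source file placed under `src/main/scala` is enough. The file and object names below are hypothetical; the sketch only uses the Scala standard library, so it does not depend on any of the optional dependencies listed in the POM.

```scala
// Hypothetical file: src/main/scala/Hello.scala
object Hello {
  // Reports the version of the Scala library found on the classpath
  def banner: String =
    s"Hello from Scala ${scala.util.Properties.versionNumberString}"

  def main(args: Array[String]): Unit = println(banner)
}
```

If Eclipse compiles and runs it as a Scala application, the version shown should match the `scala.version` property of the POM.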